Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pages 2685–2702, November 16–20, 2020. © 2020 Association for Computational Linguistics


COMET: A Neural Framework for MT Evaluation

Ricardo Rei, Craig Stewart, Ana C Farinha, Alon Lavie
Unbabel AI

{ricardo.rei, craig.stewart, catarina.farinha, alon.lavie}@unbabel.com

Abstract

We present COMET, a neural framework for training multilingual machine translation evaluation models which obtains new state-of-the-art levels of correlation with human judgements. Our framework leverages recent breakthroughs in cross-lingual pretrained language modeling resulting in highly multilingual and adaptable MT evaluation models that exploit information from both the source input and a target-language reference translation in order to more accurately predict MT quality. To showcase our framework, we train three models with different types of human judgements: Direct Assessments, Human-mediated Translation Edit Rate and Multidimensional Quality Metrics. Our models achieve new state-of-the-art performance on the WMT 2019 Metrics shared task and demonstrate robustness to high-performing systems.

1 Introduction

Historically, metrics for evaluating the quality of machine translation (MT) have relied on assessing the similarity between an MT-generated hypothesis and a human-generated reference translation in the target language. Traditional metrics have focused on basic, lexical-level features such as counting the number of matching n-grams between the MT hypothesis and the reference translation. Metrics such as BLEU (Papineni et al., 2002) and METEOR (Lavie and Denkowski, 2009) remain popular as a means of evaluating MT systems due to their light-weight and fast computation.

Modern neural approaches to MT result in much higher quality of translation that often deviates from monotonic lexical transfer between languages. For this reason, it has become increasingly evident that we can no longer rely on metrics such as BLEU to provide an accurate estimate of the quality of MT (Barrault et al., 2019).

While an increased research interest in neural methods for training MT models and systems has resulted in a recent, dramatic improvement in MT quality, MT evaluation has fallen behind. The MT research community still relies largely on outdated metrics and no new, widely-adopted standard has emerged. In 2019, the WMT News Translation Shared Task received a total of 153 MT system submissions (Barrault et al., 2019). The Metrics Shared Task of the same year saw only 24 submissions, almost half of which were entrants to the Quality Estimation Shared Task, adapted as metrics (Ma et al., 2019).

The findings of the above-mentioned task highlight two major challenges to MT evaluation which we seek to address herein (Ma et al., 2019). Namely, that current metrics struggle to accurately correlate with human judgement at segment level and fail to adequately differentiate the highest performing MT systems.

In this paper, we present COMET1, a PyTorch-based framework for training highly multilingual and adaptable MT evaluation models that can function as metrics. Our framework takes advantage of recent breakthroughs in cross-lingual language modeling (Artetxe and Schwenk, 2019; Devlin et al., 2019; Conneau and Lample, 2019; Conneau et al., 2019) to generate prediction estimates of human judgments such as Direct Assessments (DA) (Graham et al., 2013), Human-mediated Translation Edit Rate (HTER) (Snover et al., 2006) and metrics compliant with the Multidimensional Quality Metric framework (Lommel et al., 2014).

Inspired by recent work on Quality Estimation (QE) that demonstrated that it is possible to achieve high levels of correlation with human judgements even without a reference translation (Fonseca et al., 2019), we propose a novel approach for

1 Crosslingual Optimized Metric for Evaluation of Translation.


incorporating the source-language input into our MT evaluation models. Traditionally, only QE models have made use of the source input, whereas MT evaluation metrics rely instead on the reference translation. As in Takahashi et al. (2020), we show that using a multilingual embedding space allows us to leverage information from all three inputs and demonstrate the value added by the source as input to our MT evaluation models.

To illustrate the effectiveness and flexibility of the COMET framework, we train three models that estimate different types of human judgements and show promising progress towards both better correlation at segment level and robustness to high-quality MT.

We will release both the COMET framework and the trained MT evaluation models described in this paper to the research community upon publication.

2 Model Architectures

Human judgements of MT quality usually come in the form of segment-level scores such as DA, MQM and HTER. For DA, it is common practice to convert scores into relative rankings (DARR) when the number of annotations per segment is limited (Bojar et al., 2017b; Ma et al., 2018, 2019). This means that, for two MT hypotheses h_i and h_j of the same source s, if the DA score assigned to h_i is higher than the score assigned to h_j, h_i is regarded as a "better" hypothesis.2 To encompass these differences, our framework supports two distinct architectures: the Estimator model and the Translation Ranking model. The fundamental difference between them is the training objective: while the Estimator is trained to regress directly on a quality score, the Translation Ranking model is trained to minimize the distance between a "better" hypothesis and both its corresponding reference and its original source. Both models are composed of a cross-lingual encoder and a pooling layer.
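For illustration, the following sketch (a hypothetical helper of ours, not the official WMT conversion script) shows how DA scores for hypotheses of the same source segment could be turned into DARR pairs, using the 25-point threshold mentioned in footnote 2:

```python
from itertools import combinations

def build_darr_pairs(hypotheses, min_gap=25.0):
    """hypotheses: list of (hypothesis, da_score) tuples for one source segment.
    Returns (better, worse) pairs; pairs whose DA gap does not exceed 25 points
    are discarded (see footnote 2)."""
    pairs = []
    for (h_a, da_a), (h_b, da_b) in combinations(hypotheses, 2):
        if abs(da_a - da_b) > min_gap:
            better, worse = (h_a, h_b) if da_a > da_b else (h_b, h_a)
            pairs.append((better, worse))
    return pairs

# Example with three hypotheses for the same source:
pairs = build_darr_pairs([("hyp A", 78.0), ("hyp B", 45.0), ("hyp C", 75.0)])
# -> [("hyp A", "hyp B"), ("hyp C", "hyp B")]; the A-vs-C pair is dropped (gap < 25).
```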

2.1 Cross-lingual Encoder

The primary building block of all the models in our framework is a pretrained, cross-lingual model such as multilingual BERT (Devlin et al., 2019), XLM (Conneau and Lample, 2019) or XLM-RoBERTa (Conneau et al., 2019). These models contain several transformer encoder layers that are

2 In the WMT Metrics Shared Task, if the difference between the DA scores is not higher than 25 points, those segments are excluded from the DARR data.

trained to reconstruct masked tokens by uncovering the relationship between those tokens and the surrounding ones. When trained with data from multiple languages, this pretrained objective has been found to be highly effective in cross-lingual tasks such as document classification and natural language inference (Conneau et al., 2019), generalizing well to unseen languages and scripts (Pires et al., 2019). For the experiments in this paper, we rely on XLM-RoBERTa (base) as our encoder model.

Given an input sequence x = [x_0, x_1, ..., x_n], the encoder produces an embedding e_j^(ℓ) for each token x_j and each layer ℓ ∈ {0, 1, ..., k}. In our framework, we apply this process to the source, MT hypothesis and reference in order to map them into a shared feature space.
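As an illustration, per-layer token embeddings can be obtained from the pretrained encoder roughly as follows (a sketch using the Hugging Face transformers API; the actual encoder wrapper used in COMET may differ):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
encoder = AutoModel.from_pretrained("xlm-roberta-base", output_hidden_states=True)

def embed(segment: str) -> torch.Tensor:
    """Returns the per-layer token embeddings e_j^(l), shape [k + 1, seq_len, hidden]."""
    batch = tokenizer(segment, return_tensors="pt")
    with torch.no_grad():
        outputs = encoder(**batch)
    # hidden_states holds the embedding-layer output plus one tensor per encoder layer.
    return torch.stack(outputs.hidden_states, dim=0).squeeze(1)

# Source, MT hypothesis and reference are all mapped into the same feature space:
src, hyp, ref = embed("Ein kleiner Test."), embed("A small test."), embed("A small test.")
```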

2.2 Pooling Layer

The embeddings generated by the last layer of the pretrained encoders are usually used for fine-tuning models to new tasks. However, Tenney et al. (2019) showed that different layers within the network can capture linguistic information that is relevant for different downstream tasks. In the case of MT evaluation, Zhang et al. (2020) showed that different layers can achieve different levels of correlation and that utilizing only the last layer often results in inferior performance. In this work, we used the approach described in Peters et al. (2018) and pool information from the most important encoder layers into a single embedding for each token, e_j, by using a layer-wise attention mechanism. This embedding is then computed as:

e_{x_j} = μ E_{x_j}^⊤ α (1)

where μ is a trainable weight coefficient, E_{x_j} = [e_j^(0), e_j^(1), ..., e_j^(k)] corresponds to the vector of layer embeddings for token x_j, and α = softmax([α^(1), α^(2), ..., α^(k)]) is a vector corresponding to the layer-wise trainable weights. In order to avoid overfitting to the information contained in any single layer, we used layer dropout (Kondratyuk and Straka, 2019), in which with a probability p the weight α^(i) is set to −∞.

Finally, as in Reimers and Gurevych (2019), we apply average pooling to the resulting word embeddings to derive a sentence embedding for each segment.
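A minimal PyTorch sketch of this pooling scheme, written for illustration rather than taken from the COMET code-base, could look like this:

```python
import torch
import torch.nn as nn

class LayerwisePooling(nn.Module):
    """Layer-wise attention over encoder layers (Eq. 1) followed by average
    pooling over tokens to produce one sentence embedding per segment."""

    def __init__(self, num_layers: int, layer_dropout: float = 0.1):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(num_layers))  # layer-wise weights, initialised to zero
        self.mu = nn.Parameter(torch.ones(1))                # trainable scalar coefficient
        self.layer_dropout = layer_dropout

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: [num_layers, batch, seq_len, hidden]; attention_mask: [batch, seq_len]
        alpha = self.alpha
        if self.training and self.layer_dropout > 0:
            # Layer dropout: with probability p a layer weight is set to -inf ...
            drop = torch.rand_like(alpha) < self.layer_dropout
            if drop.all():          # ... but keep at least one layer active
                drop[0] = False
            alpha = alpha.masked_fill(drop, float("-inf"))
        weights = torch.softmax(alpha, dim=0).view(-1, 1, 1, 1)
        token_emb = self.mu * (weights * hidden_states).sum(dim=0)  # [batch, seq_len, hidden]
        # Average pooling over non-padded tokens -> sentence embedding [batch, hidden].
        mask = attention_mask.unsqueeze(-1).type_as(token_emb)
        return (token_emb * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
```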


Figure 1: Estimator model architecture. The source, hypothesis and reference are independently encoded using a pretrained cross-lingual encoder. The resulting word embeddings are then passed through a pooling layer to create a sentence embedding for each segment. Finally, the resulting sentence embeddings are combined and concatenated into one single vector that is passed to a feed-forward regressor. The entire model is trained by minimizing the Mean Squared Error (MSE).

Figure 2: Translation Ranking model architecture. This architecture receives 4 segments: the source, the reference, a "better" hypothesis and a "worse" one. These segments are independently encoded using a pretrained cross-lingual encoder and a pooling layer on top. Finally, using the triplet margin loss (Schroff et al., 2015), we optimize the resulting embedding space to minimize the distance between the "better" hypothesis and the "anchors" (source and reference).

2.3 Estimator Model

Given a d-dimensional sentence embedding for the source, the hypothesis and the reference, we adopt the approach proposed in RUSE (Shimanaka et al., 2018) and extract the following combined features:

• Element-wise source product: h ⊙ s
• Element-wise reference product: h ⊙ r
• Absolute element-wise source difference: |h − s|
• Absolute element-wise reference difference: |h − r|

These combined features are then concatenated to the reference embedding r and hypothesis embedding h into a single vector x = [h; r; h ⊙ s; h ⊙ r; |h − s|; |h − r|] that serves as input to a feed-forward regressor. The strength of these features is in highlighting the differences between embeddings in the semantic feature space.

The model is then trained to minimize the mean squared error between the predicted scores and quality assessments (DA, HTER or MQM). Figure 1 illustrates the proposed architecture.
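The feature combination and regression head can be sketched as follows (an illustrative re-implementation; the hidden sizes follow Table 5 in the Appendices, other details are assumptions):

```python
import torch
import torch.nn as nn

def combine_features(h: torch.Tensor, s: torch.Tensor, r: torch.Tensor) -> torch.Tensor:
    """RUSE-style combined features from the hypothesis (h), source (s) and
    reference (r) sentence embeddings: x = [h; r; h*s; h*r; |h - s|; |h - r|]."""
    return torch.cat([h, r, h * s, h * r, (h - s).abs(), (h - r).abs()], dim=-1)

embedding_dim = 768  # XLM-RoBERTa (base) hidden size
regressor = nn.Sequential(                      # feed-forward regressor (sizes from Table 5)
    nn.Linear(6 * embedding_dim, 2304), nn.Tanh(), nn.Dropout(0.1),
    nn.Linear(2304, 1152), nn.Tanh(), nn.Dropout(0.1),
    nn.Linear(1152, 1),
)
loss_fn = nn.MSELoss()  # regress directly on the DA / HTER / MQM quality score
```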

Note that we chose not to include the raw source embedding (s) in our concatenated input. Early experimentation revealed that the value added by the source embedding as extra input features to our regressor was negligible at best. A variation on our HTER estimator model trained with the vector x = [h; s; r; h ⊙ s; h ⊙ r; |h − s|; |h − r|] as input to the feed-forward only succeeded in boosting segment-level performance in 8 of the 18 language pairs outlined in section 5 below, and the average improvement in Kendall's Tau in those settings was +0.0009. As noted in Zhao et al. (2020), while cross-lingual pretrained models are adaptive to multiple languages, the feature space between languages is poorly aligned. On this basis we decided in favor of excluding the source embedding, on the intuition that the most important information comes from the reference embedding and reducing the feature space would allow the model to focus more on relevant information. This does not, however, negate the general value of the source to our model: where we include combination features such as h ⊙ s and |h − s|, we do note gains in correlation, as explored further in section 5.5 below.


2.4 Translation Ranking Model

Our Translation Ranking model (Figure 2) receives as input a tuple χ = (s, h+, h−, r), where h+ denotes a hypothesis that was ranked higher than another hypothesis h−. We then pass χ through our cross-lingual encoder and pooling layer to obtain a sentence embedding for each segment in χ. Finally, using the embeddings {s, h+, h−, r}, we compute the triplet margin loss (Schroff et al., 2015) in relation to the source and reference:

L(χ) = L(s, h+, h−) + L(r, h+, h−) (2)

where

L(s, h+, h−) = max{0, d(s, h+) − d(s, h−) + ε} (3)

L(r, h+, h−) = max{0, d(r, h+) − d(r, h−) + ε} (4)

d(u, v) denotes the Euclidean distance between u and v, and ε is a margin. Thus, during training, the model optimizes the embedding space so the distance between the anchors (s and r) and the "worse" hypothesis h− is greater by at least ε than the distance between the anchors and the "better" hypothesis h+.

During inference, the described model receives a triplet (s, h, r) with only one hypothesis. The quality score assigned to h is the harmonic mean between the distance to the source, d(s, h), and the distance to the reference, d(r, h):

f(s, h, r) = (2 × d(r, h) × d(s, h)) / (d(r, h) + d(s, h)) (5)

Finally, we convert the resulting distance into a similarity score bounded between 0 and 1 as follows:

f̂(s, h, r) = 1 / (1 + f(s, h, r)) (6)
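A compact sketch of the training loss (Eqs. 2-4) and the inference-time scoring (Eqs. 5-6), assuming batched sentence embeddings, is shown below; it is illustrative rather than the exact COMET implementation:

```python
import torch.nn.functional as F

def ranking_loss(s, h_pos, h_neg, r, eps: float = 1.0):
    """Eqs. (2)-(4): triplet margin loss with the source (s) and the reference (r)
    acting as anchors; eps is the margin (1.0 in Table 5)."""
    loss_src = F.relu(F.pairwise_distance(s, h_pos) - F.pairwise_distance(s, h_neg) + eps)
    loss_ref = F.relu(F.pairwise_distance(r, h_pos) - F.pairwise_distance(r, h_neg) + eps)
    return (loss_src + loss_ref).mean()

def predict_score(s, h, r):
    """Eqs. (5)-(6): harmonic mean of the distances to source and reference,
    converted into a similarity bounded between 0 and 1."""
    d_src = F.pairwise_distance(s, h)
    d_ref = F.pairwise_distance(r, h)
    harmonic = 2 * d_ref * d_src / (d_ref + d_src)
    return 1.0 / (1.0 + harmonic)
```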

3 Corpora

To demonstrate the effectiveness of our described model architectures (section 2), we train three MT evaluation models, where each model targets a different type of human judgment. To train these models, we use data from three different corpora: the QT21 corpus, the DARR from the WMT Metrics shared task (2017 to 2019) and a proprietary MQM-annotated corpus.

3.1 The QT21 corpus

The QT21 corpus is a publicly available3 dataset containing industry-generated sentences from either an information technology or a life sciences domain (Specia et al., 2017). This corpus contains a total of 173K tuples with source sentence, respective human-generated reference, MT hypothesis (either from a phrase-based statistical MT or from a neural MT system) and post-edited MT (PE). The language pairs represented in this corpus are: English to German (en-de), Latvian (en-lv) and Czech (en-cs), and German to English (de-en).

The HTER score is obtained by computing the translation edit rate (TER) (Snover et al., 2006) between the MT hypothesis and the corresponding PE. Finally, after computing the HTER for each MT, we built a training dataset D = {s_i, h_i, r_i, y_i}_{i=1}^N, where s_i denotes the source text, h_i denotes the MT hypothesis, r_i the reference translation and y_i the HTER score for the hypothesis h_i. In this manner we seek to learn a regression f(s, h, r) → y that predicts the human effort required to correct the hypothesis by looking at the source, hypothesis and reference (but not the post-edited hypothesis).
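For illustration, one way to assemble such a training example is sketched below; we use sacrebleu's TER implementation here purely as a stand-in for the TER tooling of Snover et al. (2006):

```python
# Illustration only: sacrebleu's TER is used as a stand-in for the original TER tooling.
from sacrebleu.metrics import TER

ter = TER()

def make_hter_example(source: str, hypothesis: str, reference: str, post_edit: str) -> dict:
    # HTER: edit rate between the MT hypothesis and its human post-edit,
    # rescaled here from sacrebleu's percentage to a rate in [0, 1].
    hter = ter.sentence_score(hypothesis, [post_edit]).score / 100.0
    return {"src": source, "mt": hypothesis, "ref": reference, "score": hter}
```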

3.2 The WMT DARR corpus

Since 2017, the organizers of the WMT News Translation Shared Task (Barrault et al., 2019) have collected human judgements in the form of adequacy DAs (Graham et al., 2013, 2014, 2017). These DAs are then mapped into relative rankings (DARR) (Ma et al., 2019). The resulting data for each year (2017-19) form a dataset D = {s_i, h_i^+, h_i^−, r_i}_{i=1}^N, where h_i^+ denotes a "better" hypothesis and h_i^− denotes a "worse" one. Here we seek to learn a function r(s, h, r) such that the score assigned to h_i^+ is strictly higher than the score assigned to h_i^− (r(s_i, h_i^+, r_i) > r(s_i, h_i^−, r_i)). This data4 contains a total of 24 high- and low-resource language pairs such as Chinese to English (zh-en) and English to Gujarati (en-gu).

3.3 The MQM corpus

The MQM corpus is a proprietary, internal database of MT-generated translations of customer support

3 QT21 data: https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-2390

4 The raw data for each year of the WMT Metrics shared task is publicly available on the results page (2019 example: http://www.statmt.org/wmt19/results.html). Note, however, that in the README files it is highlighted that this data is not well documented and the scripts occasionally require custom utilities that are not available.


chat messages that were annotated according to the guidelines set out in Burchardt and Lommel (2014). This data contains a total of 12K tuples, covering 12 language pairs from English to: German (en-de), Spanish (en-es), Latin-American Spanish (en-es-latam), French (en-fr), Italian (en-it), Japanese (en-ja), Dutch (en-nl), Portuguese (en-pt), Brazilian Portuguese (en-pt-br), Russian (en-ru), Swedish (en-sv) and Turkish (en-tr). Note that in this corpus English is always seen as the source language, but never as the target language. Each tuple consists of a source sentence, a human-generated reference, a MT hypothesis and its MQM score, derived from error annotations by one (or more) trained annotators. The MQM metric referred to throughout this paper is an internal metric defined in accordance with the MQM framework (Lommel et al., 2014). Errors are annotated under an internal typology defined under three main error types: 'Style', 'Fluency' and 'Accuracy'. Our MQM scores range from −∞ to 100 and are defined as:

MQM = 100 − (I_Minor + 5 × I_Major + 10 × I_Crit) / (Sentence Length) × 100 (7)

where I_Minor denotes the number of minor errors, I_Major the number of major errors and I_Crit the number of critical errors.

Our MQM metric takes into account the severity of the errors identified in the MT hypothesis, leading to a more fine-grained metric than HTER or DA. When used in our experiments, these values were divided by 100 and truncated at 0. As in section 3.1, we constructed a training dataset D = {s_i, h_i, r_i, y_i}_{i=1}^N, where s_i denotes the source text, h_i denotes the MT hypothesis, r_i the reference translation and y_i the MQM score for the hypothesis h_i.
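A small sketch of Eq. (7) together with the normalization used for training (values divided by 100 and truncated at 0); the function name is ours:

```python
def mqm_score(n_minor: int, n_major: int, n_critical: int, sentence_length: int) -> float:
    """Eq. (7), followed by the normalization used for training:
    the raw MQM value is truncated at 0 and divided by 100."""
    penalty = n_minor + 5 * n_major + 10 * n_critical
    raw = 100.0 - (penalty / sentence_length) * 100.0
    return max(raw, 0.0) / 100.0

# Example: one minor and one major error in a 20-token hypothesis gives (100 - 30) / 100 = 0.7
score = mqm_score(n_minor=1, n_major=1, n_critical=0, sentence_length=20)
```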

4 Experiments

We train two versions of the Estimator model described in section 2.3: one that regresses on HTER (COMET-HTER), trained with the QT21 corpus, and another that regresses on our proprietary implementation of MQM (COMET-MQM), trained with our internal MQM corpus. For the Translation Ranking model described in section 2.4, we train with the WMT DARR corpus from 2017 and 2018 (COMET-RANK). In this section we introduce the training setup for these models and the corresponding evaluation setup.

4.1 Training Setup

The two versions of the Estimators (COMET-HTER/MQM) share the same training setup and hyper-parameters (details are included in the Appendices). For training, we load the pretrained encoder and initialize both the pooling layer and the feed-forward regressor. Whereas the layer-wise scalars α from the pooling layer are initially set to zero, the weights from the feed-forward are initialized randomly. During training, we divide the model parameters into two groups: the encoder parameters, which include the encoder model and the scalars from α; and the regressor parameters, which include the parameters from the top feed-forward network. We apply gradual unfreezing and discriminative learning rates (Howard and Ruder, 2018), meaning that the encoder model is frozen for one epoch while the feed-forward is optimized with a learning rate of 3e−5. After the first epoch, the entire model is fine-tuned, but the learning rate for the encoder parameters is set to 1e−5 in order to avoid catastrophic forgetting.

In contrast with the two Estimators, for the COMET-RANK model we fine-tune from the outset. Furthermore, since this model does not add any new parameters on top of XLM-RoBERTa (base) other than the layer scalars α, we use one single learning rate of 1e−5 for the entire model.
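The optimization scheme for the Estimators can be sketched as follows (illustrative code; the actual framework builds on PyTorch Lightning, see section 6):

```python
import torch

def configure_optimizer(encoder, alpha, regressor):
    """Discriminative learning rates: 1e-5 for the encoder parameters (encoder
    weights plus the layer-wise scalars alpha), 3e-5 for the regressor."""
    encoder_params = list(encoder.parameters()) + [alpha]
    optimizer = torch.optim.Adam([
        {"params": encoder_params, "lr": 1e-5},
        {"params": regressor.parameters(), "lr": 3e-5},
    ])
    return optimizer, encoder_params

def set_encoder_frozen(encoder_params, frozen: bool):
    """Gradual unfreezing: the encoder group is frozen during the first epoch only."""
    for p in encoder_params:
        p.requires_grad = not frozen

# epoch 0: set_encoder_frozen(encoder_params, True)
# epoch 1 onwards: set_encoder_frozen(encoder_params, False)
```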

4.2 Evaluation Setup

We use the test data and setup of the WMT 2019 Metrics Shared Task (Ma et al., 2019) in order to compare the COMET models with the top performing submissions of the shared task and other recent state-of-the-art metrics such as BERTSCORE and BLEURT.5 The evaluation method used is the official Kendall's Tau-like formulation, τ, from the WMT 2019 Metrics Shared Task (Ma et al., 2019), defined as:

τ = (Concordant − Discordant) / (Concordant + Discordant) (8)

where Concordant is the number of times a metric assigns a higher score to the "better" hypothesis h+ and Discordant is the number of times a metric assigns a higher score to the "worse" hypothesis

5 To ease future research, we will also provide, within our framework, detailed instructions and scripts to run other metrics such as CHRF, BLEU, BERTSCORE and BLEURT.


Table 1: Kendall's Tau (τ) correlations on language pairs with English as source for the WMT19 Metrics DARR corpus. For BERTSCORE, we report results with the default encoder model for a complete comparison, but also with XLM-RoBERTa (base) for fairness with our models. The values reported for YiSi-1 are taken directly from the shared task paper (Ma et al., 2019).

Metric                  en-cs   en-de   en-fi   en-gu   en-kk   en-lt   en-ru   en-zh
BLEU                    0.364   0.248   0.395   0.463   0.363   0.333   0.469   0.235
CHRF                    0.444   0.321   0.518   0.548   0.510   0.438   0.548   0.241
YISI-1                  0.475   0.351   0.537   0.551   0.546   0.470   0.585   0.355
BERTSCORE (default)     0.500   0.363   0.527   0.568   0.540   0.464   0.585   0.356
BERTSCORE (xlmr-base)   0.503   0.369   0.553   0.584   0.536   0.514   0.599   0.317
COMET-HTER              0.524   0.383   0.560   0.552   0.508   0.577   0.539   0.380
COMET-MQM               0.537   0.398   0.567   0.564   0.534   0.574   0.615   0.378
COMET-RANK              0.603   0.427   0.664   0.611   0.693   0.665   0.580   0.449

h−, or the scores assigned to both hypotheses are the same.
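For concreteness, the Kendall's Tau-like statistic of Eq. (8) can be computed as in the following sketch (a hypothetical helper, with ties counted as discordant as described above):

```python
def kendall_tau_like(metric_scores: dict, darr_pairs: list) -> float:
    """Eq. (8). `darr_pairs` holds (better_id, worse_id) human judgements and
    `metric_scores` maps each hypothesis id to the score assigned by the metric."""
    concordant = discordant = 0
    for better, worse in darr_pairs:
        if metric_scores[better] > metric_scores[worse]:
            concordant += 1
        else:  # ties count against the metric, as in the WMT19 formulation
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

# Example: ranking 3 out of 4 pairs correctly gives tau = (3 - 1) / (3 + 1) = 0.5
```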

As mentioned in the findings of Ma et al. (2019), segment-level correlations of all submitted metrics were frustratingly low. Furthermore, all submitted metrics exhibited a dramatic lack of ability to correctly rank strong MT systems. To evaluate whether our new MT evaluation models better address this issue, we followed the evaluation setup used in the analysis presented in Ma et al. (2019), where correlation levels are examined for portions of the DARR data that include only the top 10, 8, 6 and 4 MT systems.

5 Results

5.1 From English into X

Table 1 shows results for all eight language pairs with English as source. We contrast our three COMET models against baseline metrics such as BLEU and CHRF, the 2019 task winning metric YISI-1, as well as the more recent BERTSCORE. We observe that across the board our three models trained with the COMET framework outperform, often by significant margins, all other metrics. Our DARR Ranker model outperforms the two Estimators in seven out of eight language pairs. Also, even though the MQM Estimator is trained on only 12K annotated segments, it performs roughly on par with the HTER Estimator for most language pairs and outperforms all the other metrics in en-ru.

5.2 From X into English

Table 2 shows results for the seven to-English language pairs. Again, we contrast our three COMET models against baseline metrics such as BLEU and CHRF, the 2019 task winning metric YISI-1, as well as the recently published metrics BERTSCORE and BLEURT. As in Table 1, the DARR model shows strong correlations with human judgements, outperforming the recently proposed English-specific BLEURT metric in five out of seven language pairs. Again, the MQM Estimator shows surprisingly strong results despite the fact that this model was trained with data that did not include English as a target. Although the encoder used in our trained models is highly multilingual, we hypothesise that this powerful "zero-shot" result is due to the inclusion of the source in our models.

5.3 Language pairs not involving English

All three of our COMET models were trained on data involving English (either as a source or as a target). Nevertheless, to demonstrate that our metrics generalize well, we test them on the three WMT 2019 language pairs that do not include English in either source or target. As can be seen in Table 3, our results are consistent with observations in Tables 1 and 2.

5.4 Robustness to High-Quality MT

For analysis, we use the DARR corpus from the 2019 Shared Task and evaluate on the subset of the data from the top performing MT systems for each language pair. We included language pairs for which we could retrieve data for at least ten different MT systems (i.e. all but kk-en and gu-en). We contrast against the strong, recently proposed BERTSCORE and BLEURT, with BLEU as a baseline. Results are presented in Figure 3. For language pairs where English is the target, our three models are either better or competitive with all others; where English is the source, we note that in general our metrics exceed the performance of


Table 2: Kendall's Tau (τ) correlations on language pairs with English as a target for the WMT19 Metrics DARR corpus. As for BERTSCORE, for BLEURT we report results for two models: the base model, which is comparable in size with the encoder we used, and the large model, which is twice the size.

Metric                  de-en   fi-en   gu-en   kk-en   lt-en   ru-en   zh-en
BLEU                    0.053   0.236   0.194   0.276   0.249   0.177   0.321
CHRF                    0.123   0.292   0.240   0.323   0.304   0.115   0.371
YISI-1                  0.164   0.347   0.312   0.440   0.376   0.217   0.426
BERTSCORE (default)     0.190   0.354   0.292   0.351   0.381   0.221   0.432
BERTSCORE (xlmr-base)   0.171   0.335   0.295   0.354   0.356   0.202   0.412
BLEURT (base-128)       0.171   0.372   0.302   0.383   0.387   0.218   0.417
BLEURT (large-512)      0.174   0.374   0.313   0.372   0.388   0.220   0.436
COMET-HTER              0.185   0.333   0.274   0.297   0.364   0.163   0.391
COMET-MQM               0.207   0.343   0.282   0.339   0.368   0.187   0.422
COMET-RANK              0.202   0.399   0.341   0.358   0.407   0.180   0.445

Table 3: Kendall's Tau (τ) correlations on language pairs not involving English for the WMT19 Metrics DARR corpus.

Metric                  de-cs   de-fr   fr-de
BLEU                    0.222   0.226   0.173
CHRF                    0.341   0.287   0.274
YISI-1                  0.376   0.349   0.310
BERTSCORE (default)     0.358   0.329   0.300
BERTSCORE (xlmr-base)   0.386   0.336   0.309
COMET-HTER              0.358   0.397   0.315
COMET-MQM               0.386   0.367   0.296
COMET-RANK              0.389   0.444   0.331

others. Even the MQM Estimator, trained with only 12K segments, is competitive, which highlights the power of our proposed framework.

5.5 The Importance of the Source

To shed some light on the actual value and contribution of the source language input in our models' ability to learn accurate predictions, we trained two versions of our DARR Ranker model: one that uses only the reference, and another that uses both reference and source. Both models were trained using the WMT 2017 corpus, which only includes language pairs from English (en-de, en-cs, en-fi, en-tr). In other words, while English was never observed as a target language during training for both variants of the model, the training of the second variant includes English source embeddings. We then tested these two model variants on the WMT 2018 corpus for these language pairs and for the reversed directions (with the exception of en-cs, because cs-en does not exist for WMT 2018). The results in Table 4

Figure 3: Metrics performance over all and the top (10, 8, 6 and 4) MT systems. Two panels plot Kendall Tau (τ) for the top models from X to English and from English to X, comparing COMET-RANK, COMET-MQM, COMET-HTER, BLEU, BERTSCORE and BLEURT.

clearly show that, for the translation ranking architecture, including the source improves the overall correlation with human judgments. Furthermore, the inclusion of the source exposed the second variant of the model to English embeddings, which is


Table 4: Comparison between COMET-RANK (section 2.4) and a reference-only version thereof on WMT18 data. Both models were trained with WMT17, which means that the reference-only model is never exposed to English during training.

Metric                   en-cs   en-de   en-fi   en-tr   cs-en   de-en   fi-en   tr-en
COMET-RANK (ref. only)   0.660   0.764   0.630   0.539   0.249   0.390   0.159   0.128
COMET-RANK               0.711   0.799   0.671   0.563   0.356   0.542   0.278   0.260
Δτ                       0.051   0.035   0.041   0.024   0.107   0.155   0.119   0.132

reflected in a higher Δτ for the language pairs with English as a target.

6 Reproducibility

We will release both the code-base of the COMET framework and the trained MT evaluation models described in this paper to the research community upon publication, along with the detailed scripts required in order to run all reported baselines.6 All the models reported in this paper were trained on a single Tesla T4 (16GB) GPU. Moreover, our framework builds on top of PyTorch Lightning (Falcon, 2019), a lightweight PyTorch wrapper that was created for maximal flexibility and reproducibility.

7 Related Work

Classic MT evaluation metrics are commonly characterized as n-gram matching metrics because, using hand-crafted features, they estimate MT quality by counting the number and fraction of n-grams that appear simultaneously in a candidate translation hypothesis and one or more human references. Metrics such as BLEU (Papineni et al., 2002), METEOR (Lavie and Denkowski, 2009) and CHRF (Popović, 2015) have been widely studied and improved (Koehn et al., 2007; Popović, 2017; Denkowski and Lavie, 2011; Guo and Hu, 2019), but by design they usually fail to recognize and capture semantic similarity beyond the lexical level.

In recent years, word embeddings (Mikolov et al., 2013; Pennington et al., 2014; Peters et al., 2018; Devlin et al., 2019) have emerged as a commonly used alternative to n-gram matching for capturing word semantic similarity. Embedding-based metrics like METEOR-VECTOR (Servan et al., 2016), BLEU2VEC (Tättar and Fishel, 2017), YISI-1 (Lo, 2019), MOVERSCORE (Zhao et al., 2019) and BERTSCORE (Zhang et al., 2020) create soft-alignments between reference and hypothesis

6 These will be hosted at: https://github.com/Unbabel/COMET

in an embedding space and then compute a score that reflects the semantic similarity between those segments. However, human judgements such as DA and MQM capture much more than just semantic similarity, resulting in a correlation upper-bound between human judgements and the scores produced by such metrics.

Learnable metrics (Shimanaka et al., 2018; Mathur et al., 2019; Shimanaka et al., 2019) attempt to directly optimize the correlation with human judgments and have recently shown promising results. BLEURT (Sellam et al., 2020), a learnable metric based on BERT (Devlin et al., 2019), claims state-of-the-art performance for the last 3 years of the WMT Metrics Shared task. Because BLEURT builds on top of English-BERT (Devlin et al., 2019), it can only be used when English is the target language, which limits its applicability. Also, to the best of our knowledge, all the previously proposed learnable metrics have focused on optimizing DA which, due to a scarcity of annotators, can prove inherently noisy (Ma et al., 2019).

Reference-less MT evaluation, also known as Quality Estimation (QE), has historically often regressed on HTER for segment-level evaluation (Bojar et al., 2013, 2014, 2015, 2016, 2017a). More recently, MQM has been used for document-level evaluation (Specia et al., 2018; Fonseca et al., 2019). By leveraging highly multilingual pretrained encoders such as multilingual BERT (Devlin et al., 2019) and XLM (Conneau and Lample, 2019), QE systems have been showing auspicious correlations with human judgements (Kepler et al., 2019a). Concurrently, the OpenKiwi framework (Kepler et al., 2019b) has made it easier for researchers to push the field forward and build stronger QE models.

8 Conclusions and Future Work

In this paper we present COMET, a novel neural framework for training MT evaluation models that can serve as automatic metrics and easily be


adapted and optimized to different types of human judgements of MT quality.

To showcase the effectiveness of our framework, we sought to address the challenges reported in the 2019 WMT Metrics Shared Task (Ma et al., 2019). We trained three distinct models which achieve new state-of-the-art results for segment-level correlation with human judgments and show promising ability to better differentiate high-performing systems.

One of the challenges of leveraging the power of pretrained models is the burdensome weight of parameters and inference time. A primary avenue for future work on COMET will look at the impact of more compact solutions such as DistilBERT (Sanh et al., 2019).

Additionally, whilst we outline the potential importance of the source text above, we note that our COMET-RANK model weighs source and reference differently during inference but equally in its training loss function. Future work will investigate the optimality of this formulation and further examine the interdependence of the different inputs.

Acknowledgments

We are grateful to André Martins, Austin Matthews, Fabio Kepler, Daan Van Stigt, Miguel Vera, and the reviewers for their valuable feedback and discussions. This work was supported in part by the P2020 Program through projects MAIA and Unbabel4EU, supervised by ANI under contract numbers 045909 and 042671, respectively.

References

Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597–610.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics.

Ondřej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2013. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 1–44, Sofia, Bulgaria. Association for Computational Linguistics.

Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Aleš Tamchyna. 2014. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 12–58, Baltimore, Maryland, USA. Association for Computational Linguistics.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Raphael Rubino, Lucia Specia, and Marco Turchi. 2017a. Findings of the 2017 conference on machine translation (WMT17). In Proceedings of the Second Conference on Machine Translation, pages 169–214, Copenhagen, Denmark. Association for Computational Linguistics.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016. Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 131–198, Berlin, Germany. Association for Computational Linguistics.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. 2015. Findings of the 2015 workshop on statistical machine translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 1–46, Lisbon, Portugal. Association for Computational Linguistics.

Ondřej Bojar, Yvette Graham, and Amir Kamran. 2017b. Results of the WMT17 metrics shared task. In Proceedings of the Second Conference on Machine Translation, pages 489–513, Copenhagen, Denmark. Association for Computational Linguistics.

Aljoscha Burchardt and Arle Lommel. 2014. Practical Guidelines for the Use of MQM in Scientific Research on Translation Quality. (access date: 2020-05-26).

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.

Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 7059–7069. Curran Associates, Inc.

Michael Denkowski and Alon Lavie. 2011. Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 85–91, Edinburgh, Scotland. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

W.A. Falcon. 2019. PyTorch Lightning: The lightweight PyTorch wrapper for high-performance AI research. GitHub.

Erick Fonseca, Lisa Yankovskaya, André F. T. Martins, Mark Fishel, and Christian Federmann. 2019. Findings of the WMT 2019 shared tasks on quality estimation. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), pages 1–10, Florence, Italy. Association for Computational Linguistics.

Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2013. Continuous measurement scales in human evaluation of machine translation. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 33–41, Sofia, Bulgaria. Association for Computational Linguistics.

Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2014. Is machine translation getting better over time? In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 443–451, Gothenburg, Sweden. Association for Computational Linguistics.

Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2017. Can machine translation systems be evaluated by the crowd alone? Natural Language Engineering, 23(1):3–30.

Yinuo Guo and Junfeng Hu. 2019. Meteor++ 2.0: Adopt syntactic level paraphrase knowledge into machine translation evaluation. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 501–506, Florence, Italy. Association for Computational Linguistics.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.

Fábio Kepler, Jonay Trénous, Marcos Treviso, Miguel Vera, António Góis, M. Amin Farajian, António V. Lopes, and André F. T. Martins. 2019a. Unbabel's participation in the WMT19 translation quality estimation shared task. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), pages 78–84, Florence, Italy. Association for Computational Linguistics.

Fábio Kepler, Jonay Trénous, Marcos Treviso, Miguel Vera, and André F. T. Martins. 2019b. OpenKiwi: An open source framework for quality estimation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 117–122, Florence, Italy. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics.

Dan Kondratyuk and Milan Straka. 2019. 75 languages, 1 model: Parsing universal dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2779–2795, Hong Kong, China. Association for Computational Linguistics.

Alon Lavie and Michael Denkowski. 2009. The METEOR metric for automatic evaluation of machine translation. Machine Translation, 23:105–115.

Chi-kiu Lo. 2019. YiSi - a unified semantic MT quality evaluation and estimation metric for languages with different levels of available resources. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 507–513, Florence, Italy. Association for Computational Linguistics.

Arle Lommel, Aljoscha Burchardt, and Hans Uszkoreit. 2014. Multidimensional quality metrics (MQM): A framework for declaring and describing translation quality metrics. Tradumàtica: tecnologies de la traducció, 0:455–463.

Qingsong Ma, Ondřej Bojar, and Yvette Graham. 2018. Results of the WMT18 metrics shared task: Both characters and embeddings achieve good performance. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 671–688, Belgium, Brussels. Association for Computational Linguistics.

Qingsong Ma, Johnny Wei, Ondřej Bojar, and Yvette Graham. 2019. Results of the WMT19 metrics shared task: Segment-level and strong MT systems pose big challenges. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 62–90, Florence, Italy. Association for Computational Linguistics.

Nitika Mathur, Timothy Baldwin, and Trevor Cohn. 2019. Putting evaluation in context: Contextual embeddings improve machine translation evaluation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2799–2808, Florence, Italy. Association for Computational Linguistics.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. Association for Computational Linguistics.

Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics.

Maja Popović. 2017. chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618, Copenhagen, Denmark. Association for Computational Linguistics.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

F. Schroff, D. Kalenichenko, and J. Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 815–823.

Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online. Association for Computational Linguistics.

Christophe Servan, Alexandre Berard, Zied Elloumi, Hervé Blanchon, and Laurent Besacier. 2016. Word2Vec vs DBnary: Augmenting METEOR using vector representations or lexical resources. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1159–1168, Osaka, Japan. The COLING 2016 Organizing Committee.

Hiroki Shimanaka, Tomoyuki Kajiwara, and Mamoru Komachi. 2018. RUSE: Regressor using sentence embeddings for automatic machine translation evaluation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 751–758, Belgium, Brussels. Association for Computational Linguistics.

Hiroki Shimanaka, Tomoyuki Kajiwara, and Mamoru Komachi. 2019. Machine Translation Evaluation with BERT Regressor. arXiv preprint arXiv:1907.12679.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of Association for Machine Translation in the Americas, pages 223–231.

Lucia Specia, Frédéric Blain, Varvara Logacheva, Ramón Astudillo, and André F. T. Martins. 2018. Findings of the WMT 2018 shared task on quality estimation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 689–709, Belgium, Brussels. Association for Computational Linguistics.

Lucia Specia, Kim Harris, Frédéric Blain, Aljoscha Burchardt, Viviven Macketanz, Inguna Skadina, Matteo Negri, and Marco Turchi. 2017. Translation quality and productivity: A study on rich morphology languages. In Machine Translation Summit XVI, pages 55–71, Nagoya, Japan.

Kosuke Takahashi, Katsuhito Sudoh, and Satoshi Nakamura. 2020. Automatic machine translation evaluation using source language inputs and cross-lingual language model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3553–3558, Online. Association for Computational Linguistics.

Andre Tättar and Mark Fishel. 2017. bleu2vec: the painfully familiar metric on continuous vector space steroids. In Proceedings of the Second Conference on Machine Translation, pages 619–622, Copenhagen, Denmark. Association for Computational Linguistics.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.

Wei Zhao, Goran Glavaš, Maxime Peyrard, Yang Gao, Robert West, and Steffen Eger. 2020. On the limitations of cross-lingual encoders as exposed by reference-free machine translation evaluation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1656–1671, Online. Association for Computational Linguistics.

Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, and Steffen Eger. 2019. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 563–578, Hong Kong, China. Association for Computational Linguistics.


A Appendices

In Table 5 we list the hyper-parameters used to train our models. Before initializing these models, a random seed was set to 3 in all libraries that perform "random" operations (torch, numpy, random and cuda).
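A sketch of the corresponding seeding code (assuming the standard PyTorch/NumPy APIs):

```python
# Seeding as described above (seed = 3 in torch, numpy, random and cuda).
import random
import numpy as np
import torch

SEED = 3
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
```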


Table 5: Hyper-parameters used in our COMET framework to train the presented models.

Hyper-parameter             COMET (Est-HTER/MQM)        COMET-RANK
Encoder Model               XLM-RoBERTa (base)          XLM-RoBERTa (base)
Optimizer                   Adam (default parameters)   Adam (default parameters)
No. frozen epochs           1                           0
Learning rate               3e-05 and 1e-05             1e-05
Batch size                  16                          16
Loss function               MSE                         Triplet Margin (ε = 1.0)
Layer-wise dropout          0.1                         0.1
FP precision                32                          32
Feed-Forward hidden units   2304, 1152                  –
Feed-Forward activations    Tanh                        –
Feed-Forward dropout        0.1                         –

Table 6: Statistics for the QT21 corpus.

                          en-de   en-cs   en-lv   de-en
Total tuples              54000   42000   35474   41998
Avg. tokens (reference)   17.80   15.56   16.42   17.71
Avg. tokens (source)      16.70   17.37   18.39   17.18
Avg. tokens (MT)          17.65   15.64   16.42   17.78

Table 7: Statistics for the WMT 2017 DARR corpus.

                          en-cs   en-de   en-fi   en-lv   en-tr
Total tuples              32810   6454    3270    3456    247
Avg. tokens (reference)   19.70   22.15   15.59   21.42   17.57
Avg. tokens (source)      22.37   23.41   21.73   26.08   22.51
Avg. tokens (MT)          19.45   22.58   16.06   22.18   17.25


Table 8: Statistics for the WMT 2019 DARR into-English language pairs.

                          de-en   fi-en   gu-en   kk-en   lt-en   ru-en   zh-en
Total tuples              85365   32179   20110   9728    21862   39852   31070
Avg. tokens (reference)   20.29   18.55   17.64   20.36   26.55   21.74   42.89
Avg. tokens (source)      18.44   12.49   21.92   16.32   20.32   18.00   7.57
Avg. tokens (MT)          20.22   17.76   17.02   19.68   25.25   21.80   39.70

Table 9: Statistics for the WMT 2019 DARR from-English and no-English language pairs.

                          en-cs   en-de   en-fi   en-gu   en-kk   en-lt   en-ru   en-zh   fr-de   de-cs   de-fr
Total tuples              27178   99840   31820   11355   18172   17401   24334   18658   1369    23194   4862
Avg. tokens (reference)   22.92   25.65   20.12   33.32   18.89   21.00   24.79   9.25    22.68   22.27   27.32
Avg. tokens (source)      24.98   24.97   25.23   24.32   23.78   24.46   24.45   24.39   28.60   25.22   21.36
Avg. tokens (MT)          22.60   24.98   19.69   32.97   19.92   20.97   23.37   6.83    23.36   21.89   25.68


Table 10: MQM corpus (section 3.3) statistics: total tuples and average token counts (reference, source and MT) for the twelve language pairs en-nl, en-sv, en-ja, en-de, en-ru, en-es, en-fr, en-it, en-pt-br, en-tr, en-pt and en-es-latam.

Table 11: Statistics for the WMT 2018 DARR language pairs: total tuples and average token counts (reference, source and MT) for zh-en, en-zh, cs-en, fi-en, ru-en, tr-en, de-en, en-cs, en-de, en-et, en-fi, en-ru, en-tr and et-en.


Table 12: Metrics performance over all and the top (10, 8, 6 and 4) MT systems for all from-English language pairs (en-cs, en-de, en-fi, en-gu, en-kk, en-lt, en-ru, en-zh), plotted as Kendall Tau scores per language pair. The color scheme is as follows: COMET-RANK, COMET-HTER, COMET-MQM, BLEU, BERTSCORE.


Table 13: Metrics performance over all and the top (10, 8, 6 and 4) MT systems for all into-English language pairs (de-en, fi-en, lt-en, ru-en, zh-en), plotted as Kendall Tau scores per language pair. The color scheme is as follows: COMET-RANK, COMET-HTER, COMET-MQM, BLEU, BERTSCORE, BLEURT.

Page 2: COMET: A Neural Framework for MT Evaluation · 2020. 11. 11. · (Lavie and Denkowski,2009) remain popular as a means of evaluating MT systems due to their light-weight and fast computation

2686

ing the source-language input into our MT evalu-ation models Traditionally only QE models havemade use of the source input whereas MT evalu-ation metrics rely instead on the reference transla-tion As in (Takahashi et al 2020) we show thatusing a multilingual embedding space allows usto leverage information from all three inputs anddemonstrate the value added by the source as inputto our MT evaluation models

To illustrate the effectiveness and flexibility ofthe COMET framework we train three models thatestimate different types of human judgements andshow promising progress towards both better cor-relation at segment level and robustness to high-quality MT

We will release both the COMET framework and the trained MT evaluation models described in this paper to the research community upon publication.

2 Model Architectures

Human judgements of MT quality usually come in the form of segment-level scores such as DA, MQM and HTER. For DA, it is common practice to convert scores into relative rankings (DARR) when the number of annotations per segment is limited (Bojar et al., 2017b; Ma et al., 2018, 2019). This means that, for two MT hypotheses h_i and h_j of the same source s, if the DA score assigned to h_i is higher than the score assigned to h_j, h_i is regarded as a "better" hypothesis.² To encompass these differences, our framework supports two distinct architectures: the Estimator model and the Translation Ranking model. The fundamental difference between them is the training objective: while the Estimator is trained to regress directly on a quality score, the Translation Ranking model is trained to minimize the distance between a "better" hypothesis and both its corresponding reference and its original source. Both models are composed of a cross-lingual encoder and a pooling layer.
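For concreteness, the sketch below shows one way this DA-to-DARR conversion can be implemented. The input layout (a mapping from segment id to per-system scored hypotheses) is an assumption about the data preparation, not the official WMT scripts; the 25-point exclusion threshold follows footnote 2.

```python
from itertools import combinations

def da_to_darr(annotations, min_gap=25.0):
    """Convert segment-level DA scores into relative-ranking (DARR) pairs.

    `annotations` maps a segment id to a list of (hypothesis, da_score)
    tuples coming from different MT systems for the same source segment.
    Pairs whose DA difference is not above `min_gap` are discarded, as in
    the WMT Metrics DARR construction.
    """
    pairs = []
    for seg_id, scored in annotations.items():
        for (hyp_a, da_a), (hyp_b, da_b) in combinations(scored, 2):
            if abs(da_a - da_b) <= min_gap:
                continue  # too close to call: excluded from the DARR data
            better, worse = (hyp_a, hyp_b) if da_a > da_b else (hyp_b, hyp_a)
            pairs.append((seg_id, better, worse))
    return pairs
```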

2.1 Cross-lingual Encoder

The primary building block of all the models in our framework is a pretrained cross-lingual model such as multilingual BERT (Devlin et al., 2019), XLM (Conneau and Lample, 2019) or XLM-RoBERTa (Conneau et al., 2019). These models contain several transformer encoder layers that are trained to reconstruct masked tokens by uncovering the relationship between those tokens and the surrounding ones.

² In the WMT Metrics Shared Task, if the difference between the DA scores is not higher than 25 points, those segments are excluded from the DARR data.

When trained with data from multiple languages, this pretraining objective has been found to be highly effective in cross-lingual tasks such as document classification and natural language inference (Conneau et al., 2019), generalizing well to unseen languages and scripts (Pires et al., 2019). For the experiments in this paper, we rely on XLM-RoBERTa (base) as our encoder model.

Given an input sequence x = [x_0, x_1, ..., x_n], the encoder produces an embedding e_j^{(ℓ)} for each token x_j and each layer ℓ ∈ {0, 1, ..., k}. In our framework, we apply this process to the source, MT hypothesis and reference in order to map them into a shared feature space.

2.2 Pooling Layer

The embeddings generated by the last layer of the pretrained encoders are usually used for fine-tuning models to new tasks. However, Tenney et al. (2019) showed that different layers within the network can capture linguistic information that is relevant for different downstream tasks. In the case of MT evaluation, Zhang et al. (2020) showed that different layers can achieve different levels of correlation and that utilizing only the last layer often results in inferior performance. In this work, we used the approach described in Peters et al. (2018) and pool information from the most important encoder layers into a single embedding for each token, e_{x_j}, by using a layer-wise attention mechanism. This embedding is then computed as:

e_{x_j} = \mu E_{x_j}^{\top} \alpha    (1)

where \mu is a trainable weight coefficient, E_{x_j} = [e_j^{(0)}, e_j^{(1)}, \dots, e_j^{(k)}] corresponds to the vector of layer embeddings for token x_j, and \alpha = \mathrm{softmax}([\alpha^{(1)}, \alpha^{(2)}, \dots, \alpha^{(k)}]) is a vector corresponding to the layer-wise trainable weights. In order to avoid overfitting to the information contained in any single layer, we used layer dropout (Kondratyuk and Straka, 2019), in which, with a probability p, the weight \alpha^{(i)} is set to -\infty.

Finally, as in Reimers and Gurevych (2019), we apply average pooling to the resulting word embeddings to derive a sentence embedding for each segment.
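A minimal PyTorch sketch of this pooling scheme is given below. The module and attribute names are illustrative, and the exact handling of layer dropout is an assumption rather than the released implementation.

```python
import torch
import torch.nn as nn

class LayerwiseAttentionPooling(nn.Module):
    """Layer-wise attention over encoder layers (Eq. 1) with layer dropout."""

    def __init__(self, num_layers: int, layer_dropout: float = 0.1):
        super().__init__()
        self.alpha = nn.Parameter(torch.zeros(num_layers))  # layer weights, initialized at zero
        self.mu = nn.Parameter(torch.ones(1))                # trainable scaling coefficient
        self.layer_dropout = layer_dropout

    def forward(self, layer_states):
        # layer_states: list of (batch, seq_len, dim) tensors, one per encoder layer
        weights = self.alpha
        if self.training and self.layer_dropout > 0:
            # layer dropout: with probability p, a layer's weight is set to -inf before the softmax
            drop = torch.rand_like(weights) < self.layer_dropout
            weights = weights.masked_fill(drop, float("-inf"))
        probs = torch.softmax(weights, dim=0)
        stacked = torch.stack(layer_states, dim=0)              # (layers, batch, seq, dim)
        mixed = (probs.view(-1, 1, 1, 1) * stacked).sum(dim=0)  # weighted sum over layers
        return self.mu * mixed                                   # token embeddings e_{x_j}

def average_pool(token_embeddings, attention_mask):
    """Mask-aware average pooling into a single sentence embedding per segment."""
    mask = attention_mask.unsqueeze(-1).float()
    return (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
```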


Figure 1: Estimator model architecture. The source, hypothesis and reference are independently encoded using a pretrained cross-lingual encoder. The resulting word embeddings are then passed through a pooling layer to create a sentence embedding for each segment. Finally, the resulting sentence embeddings are combined and concatenated into one single vector that is passed to a feed-forward regressor. The entire model is trained by minimizing the Mean Squared Error (MSE).

Figure 2: Translation Ranking model architecture. This architecture receives 4 segments: the source, the reference, a "better" hypothesis and a "worse" one. These segments are independently encoded using a pretrained cross-lingual encoder with a pooling layer on top. Finally, using the triplet margin loss (Schroff et al., 2015), we optimize the resulting embedding space to minimize the distance between the "better" hypothesis and the "anchors" (source and reference).

2.3 Estimator Model

Given a d-dimensional sentence embedding for the source, the hypothesis and the reference, we adopt the approach proposed in RUSE (Shimanaka et al., 2018) and extract the following combined features:

• Element-wise source product: h ⊙ s

• Element-wise reference product: h ⊙ r

• Absolute element-wise source difference: |h − s|

• Absolute element-wise reference difference: |h − r|

These combined features are then concatenated to the reference embedding r and hypothesis embedding h into a single vector x = [h; r; h ⊙ s; h ⊙ r; |h − s|; |h − r|] that serves as input to a feed-forward regressor. The strength of these features is in highlighting the differences between embeddings in the semantic feature space.

The model is then trained to minimize the mean squared error between the predicted scores and the quality assessments (DA, HTER or MQM). Figure 1 illustrates the proposed architecture.
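The following sketch illustrates this feature combination and the feed-forward regressor. The hidden sizes, activation and dropout follow Table 5 in the appendix, but the module itself is an illustration under those assumptions, not the released code.

```python
import torch
import torch.nn as nn

class EstimatorHead(nn.Module):
    """RUSE-style feature combination followed by a feed-forward regressor."""

    def __init__(self, dim: int):
        super().__init__()
        self.regressor = nn.Sequential(
            nn.Linear(6 * dim, 2304), nn.Tanh(), nn.Dropout(0.1),
            nn.Linear(2304, 1152), nn.Tanh(), nn.Dropout(0.1),
            nn.Linear(1152, 1),
        )

    def forward(self, h, s, r):
        # h, s, r: sentence embeddings of hypothesis, source and reference
        x = torch.cat([h, r, h * s, h * r, (h - s).abs(), (h - r).abs()], dim=-1)
        return self.regressor(x).squeeze(-1)

# Training minimizes the mean squared error against DA, HTER or MQM scores:
# loss = torch.nn.functional.mse_loss(model(h, s, r), gold_scores)
```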

Note that we chose not to include the raw source embedding (s) in our concatenated input. Early experimentation revealed that the value added by the source embedding as extra input features to our regressor was negligible at best. A variation of our HTER estimator model trained with the vector x = [h; s; r; h ⊙ s; h ⊙ r; |h − s|; |h − r|] as input to the feed-forward only succeeded in boosting segment-level performance in 8 of the 18 language pairs outlined in section 5 below, and the average improvement in Kendall's Tau in those settings was +0.0009. As noted in Zhao et al. (2020), while cross-lingual pretrained models are adaptive to multiple languages, the feature space between languages is poorly aligned. On this basis, we decided in favor of excluding the source embedding, on the intuition that the most important information comes from the reference embedding, and that reducing the feature space would allow the model to focus more on relevant information. This does not, however, negate the general value of the source to our model: where we include combination features such as h ⊙ s and |h − s|, we do note gains in correlation, as explored further in section 5.5 below.


2.4 Translation Ranking Model

Our Translation Ranking model (Figure 2) receives as input a tuple χ = (s, h+, h−, r), where h+ denotes a hypothesis that was ranked higher than another hypothesis h−. We then pass χ through our cross-lingual encoder and pooling layer to obtain a sentence embedding for each segment in χ. Finally, using the embeddings {s, h+, h−, r}, we compute the triplet margin loss (Schroff et al., 2015) in relation to the source and reference:

\mathcal{L}(\chi) = \mathcal{L}(s, h^{+}, h^{-}) + \mathcal{L}(r, h^{+}, h^{-})    (2)

where

\mathcal{L}(s, h^{+}, h^{-}) = \max\{0,\; d(s, h^{+}) - d(s, h^{-}) + \epsilon\}    (3)

\mathcal{L}(r, h^{+}, h^{-}) = \max\{0,\; d(r, h^{+}) - d(r, h^{-}) + \epsilon\}    (4)

d(u, v) denotes the Euclidean distance between u and v, and ε is a margin. Thus, during training, the model optimizes the embedding space so that the distance between the anchors (s and r) and the "worse" hypothesis h− is greater, by at least ε, than the distance between the anchors and the "better" hypothesis h+.
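A hedged PyTorch sketch of this dual triplet objective, built on the library's triplet margin loss, might look as follows (the default margin of 1.0 matches the value reported in Table 5 of the appendix):

```python
import torch.nn.functional as F

def dual_triplet_loss(src, ref, better, worse, margin: float = 1.0):
    """Eqs. (2)-(4): triplet margin losses against both anchors, with Euclidean distance."""
    loss_src = F.triplet_margin_loss(src, better, worse, margin=margin, p=2)
    loss_ref = F.triplet_margin_loss(ref, better, worse, margin=margin, p=2)
    return loss_src + loss_ref
```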

During inference, the described model receives a triplet (s, h, r) with only one hypothesis. The quality score assigned to h is the harmonic mean between the distance to the source, d(s, h), and the distance to the reference, d(r, h):

f(s, h, r) = \frac{2 \times d(r, h) \times d(s, h)}{d(r, h) + d(s, h)}    (5)

Finally, we convert the resulting distance into a similarity score bounded between 0 and 1 as follows:

\hat{f}(s, h, r) = \frac{1}{1 + f(s, h, r)}    (6)
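The scoring rule in Eqs. (5)-(6) can be written compactly as below; tensor shapes and the small clamp for numerical stability are assumptions of this sketch.

```python
import torch

def rank_model_score(src_emb, ref_emb, hyp_emb):
    """Harmonic mean of distances to source and reference, mapped to a similarity in (0, 1]."""
    d_src = torch.norm(src_emb - hyp_emb, p=2, dim=-1)
    d_ref = torch.norm(ref_emb - hyp_emb, p=2, dim=-1)
    harmonic = 2 * d_ref * d_src / (d_ref + d_src).clamp(min=1e-9)
    return 1.0 / (1.0 + harmonic)
```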

3 Corpora

To demonstrate the effectiveness of our described model architectures (section 2), we train three MT evaluation models, where each model targets a different type of human judgment. To train these models, we use data from three different corpora: the QT21 corpus, the DARR data from the WMT Metrics shared task (2017 to 2019) and a proprietary MQM-annotated corpus.

3.1 The QT21 corpus

The QT21 corpus is a publicly available³ dataset containing industry-generated sentences from either an information technology or a life sciences domain (Specia et al., 2017). This corpus contains a total of 173K tuples with source sentence, respective human-generated reference, MT hypothesis (either from a phrase-based statistical MT or from a neural MT) and post-edited MT (PE). The language pairs represented in this corpus are English to German (en-de), Latvian (en-lv) and Czech (en-cs), and German to English (de-en).

The HTER score is obtained by computing the translation edit rate (TER) (Snover et al., 2006) between the MT hypothesis and the corresponding PE. Finally, after computing the HTER for each MT, we built a training dataset D = {s_i, h_i, r_i, y_i}_{n=1}^{N}, where s_i denotes the source text, h_i denotes the MT hypothesis, r_i the reference translation, and y_i the HTER score for the hypothesis h_i. In this manner, we seek to learn a regression f(s, h, r) → y that predicts the human effort required to correct the hypothesis by looking at the source, hypothesis and reference (but not the post-edited hypothesis).
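A sketch of this preprocessing step is shown below. It assumes sacrebleu's TER implementation and a simple dict layout ('src', 'mt', 'ref', 'pe') for the raw QT21 rows; neither the library choice nor the field names are prescribed by the paper, and dividing the TER percentage by 100 is a normalization choice of this sketch.

```python
from sacrebleu.metrics import TER  # assumption: sacrebleu's TER implementation

ter = TER()

def build_hter_dataset(rows):
    """Build (source, hypothesis, reference, HTER) tuples from QT21-style rows."""
    dataset = []
    for row in rows:
        # HTER = TER between the MT hypothesis and its human post-edit
        hter = ter.sentence_score(row["mt"], [row["pe"]]).score / 100.0
        dataset.append({"src": row["src"], "mt": row["mt"],
                        "ref": row["ref"], "score": hter})
    return dataset
```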

3.2 The WMT DARR corpus

Since 2017, the organizers of the WMT News Translation Shared Task (Barrault et al., 2019) have collected human judgements in the form of adequacy DAs (Graham et al., 2013, 2014, 2017). These DAs are then mapped into relative rankings (DARR) (Ma et al., 2019). The resulting data for each year (2017-19) form a dataset D = {s_i, h_i^+, h_i^-, r_i}_{n=1}^{N}, where h_i^+ denotes a "better" hypothesis and h_i^- denotes a "worse" one. Here we seek to learn a function r(s, h, r) such that the score assigned to h_i^+ is strictly higher than the score assigned to h_i^- (r(s_i, h_i^+, r_i) > r(s_i, h_i^-, r_i)). This data⁴ contains a total of 24 high- and low-resource language pairs, such as Chinese to English (zh-en) and English to Gujarati (en-gu).

3.3 The MQM corpus

The MQM corpus is a proprietary internal database of MT-generated translations of customer support chat messages that were annotated according to the guidelines set out in Burchardt and Lommel (2014).

³ QT21 data: https://lindat.mff.cuni.cz/repository/xmlui/handle/11372/LRT-2390

⁴ The raw data for each year of the WMT Metrics shared task is publicly available on the results page (2019 example: http://www.statmt.org/wmt19/results.html). Note, however, that in the README files it is highlighted that this data is not well documented and that the scripts occasionally require custom utilities that are not available.


This data contains a total of 12K tuples covering 12 language pairs from English to German (en-de), Spanish (en-es), Latin-American Spanish (en-es-latam), French (en-fr), Italian (en-it), Japanese (en-ja), Dutch (en-nl), Portuguese (en-pt), Brazilian Portuguese (en-pt-br), Russian (en-ru), Swedish (en-sv) and Turkish (en-tr). Note that in this corpus English is always seen as the source language, but never as the target language. Each tuple consists of a source sentence, a human-generated reference, an MT hypothesis and its MQM score, derived from error annotations by one (or more) trained annotators. The MQM metric referred to throughout this paper is an internal metric defined in accordance with the MQM framework (Lommel et al., 2014). Errors are annotated under an internal typology defined under three main error types: 'Style', 'Fluency' and 'Accuracy'. Our MQM scores range from −∞ to 100 and are defined as:

\text{MQM} = 100 - \frac{I_{\text{Minor}} + 5 \times I_{\text{Major}} + 10 \times I_{\text{Crit}}}{\text{Sentence Length}} \times 100    (7)

where I_Minor denotes the number of minor errors, I_Major the number of major errors, and I_Crit the number of critical errors.

Our MQM metric takes into account the severity of the errors identified in the MT hypothesis, leading to a more fine-grained metric than HTER or DA. When used in our experiments, these values were divided by 100 and truncated at 0. As in section 3.1, we constructed a training dataset D = {s_i, h_i, r_i, y_i}_{n=1}^{N}, where s_i denotes the source text, h_i denotes the MT hypothesis, r_i the reference translation, and y_i the MQM score for the hypothesis h_i.
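The computation in Eq. (7), together with the normalization used in the experiments, can be written as the small helper below; treating sentence length as a token count is an assumption of this sketch.

```python
def mqm_score(n_minor: int, n_major: int, n_critical: int, sentence_length: int) -> float:
    """Eq. (7) followed by the normalization used in training: divide by 100, truncate at 0."""
    penalty = (n_minor + 5 * n_major + 10 * n_critical) / sentence_length * 100
    raw = 100 - penalty
    return max(0.0, raw / 100.0)
```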

4 Experiments

We train two versions of the Estimator model described in section 2.3: one that regresses on HTER (COMET-HTER), trained with the QT21 corpus, and another that regresses on our proprietary implementation of MQM (COMET-MQM), trained with our internal MQM corpus. For the Translation Ranking model described in section 2.4, we train with the WMT DARR corpus from 2017 and 2018 (COMET-RANK). In this section we introduce the training setup for these models and the corresponding evaluation setup.

4.1 Training Setup

The two versions of the Estimators (COMET-HTER/MQM) share the same training setup and hyper-parameters (details are included in the Appendices). For training, we load the pretrained encoder and initialize both the pooling layer and the feed-forward regressor. Whereas the layer-wise scalars α from the pooling layer are initially set to zero, the weights from the feed-forward are initialized randomly. During training, we divide the model parameters into two groups: the encoder parameters, which include the encoder model and the scalars from α, and the regressor parameters, which include the parameters from the top feed-forward network. We apply gradual unfreezing and discriminative learning rates (Howard and Ruder, 2018), meaning that the encoder model is frozen for one epoch while the feed-forward is optimized with a learning rate of 3e−5. After the first epoch, the entire model is fine-tuned, but the learning rate for the encoder parameters is set to 1e−5 in order to avoid catastrophic forgetting.

In contrast with the two Estimators, for the COMET-RANK model we fine-tune from the outset. Furthermore, since this model does not add any new parameters on top of XLM-RoBERTa (base) other than the layer scalars α, we use one single learning rate of 1e−5 for the entire model.
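A sketch of this two-group schedule for the Estimators is shown below. The attribute names (`model.encoder`, `model.scalars`, `model.regressor`) and keeping the 3e−5 rate for the regressor after unfreezing are assumptions of the sketch, not details taken from the released code.

```python
import torch

def configure_optimizer(model, epoch: int) -> torch.optim.Optimizer:
    """Gradual unfreezing with discriminative learning rates (Howard and Ruder, 2018)."""
    encoder_params = list(model.encoder.parameters()) + [model.scalars]
    regressor_params = list(model.regressor.parameters())

    # Encoder (and layer scalars) frozen during the first epoch only
    for p in encoder_params:
        p.requires_grad = epoch > 0

    return torch.optim.Adam([
        {"params": encoder_params, "lr": 1e-5},    # fine-tuned after the first epoch
        {"params": regressor_params, "lr": 3e-5},  # feed-forward regressor
    ])
```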

4.2 Evaluation Setup

We use the test data and setup of the WMT 2019 Metrics Shared Task (Ma et al., 2019) in order to compare the COMET models with the top performing submissions of the shared task and other recent state-of-the-art metrics such as BERTSCORE and BLEURT.⁵ The evaluation method used is the official Kendall's Tau-like formulation, τ, from the WMT 2019 Metrics Shared Task (Ma et al., 2019), defined as

\tau = \frac{\text{Concordant} - \text{Discordant}}{\text{Concordant} + \text{Discordant}}    (8)

where Concordant is the number of times a metric assigns a higher score to the "better" hypothesis h+ and Discordant is the number of times a metric assigns a higher score to the "worse" hypothesis

⁵ To ease future research, we will also provide within our framework detailed instructions and scripts to run other metrics such as CHRF, BLEU, BERTSCORE and BLEURT.


Table 1: Kendall's Tau (τ) correlations on language pairs with English as source for the WMT19 Metrics DARR corpus. For BERTSCORE we report results with the default encoder model for a complete comparison, but also with XLM-RoBERTa (base) for fairness with our models. The values reported for YiSi-1 are taken directly from the shared task paper (Ma et al., 2019).

Metric                  en-cs  en-de  en-fi  en-gu  en-kk  en-lt  en-ru  en-zh
BLEU                    0.364  0.248  0.395  0.463  0.363  0.333  0.469  0.235
CHRF                    0.444  0.321  0.518  0.548  0.510  0.438  0.548  0.241
YISI-1                  0.475  0.351  0.537  0.551  0.546  0.470  0.585  0.355
BERTSCORE (default)     0.500  0.363  0.527  0.568  0.540  0.464  0.585  0.356
BERTSCORE (xlmr-base)   0.503  0.369  0.553  0.584  0.536  0.514  0.599  0.317
COMET-HTER              0.524  0.383  0.560  0.552  0.508  0.577  0.539  0.380
COMET-MQM               0.537  0.398  0.567  0.564  0.534  0.574  0.615  0.378
COMET-RANK              0.603  0.427  0.664  0.611  0.693  0.665  0.580  0.449

h−, or the scores assigned to both hypotheses are the same.
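The tau-like statistic of Eq. (8) can be computed over DARR pairs as in the sketch below; the data layout (hypothesis ids mapped to metric scores) is an assumption.

```python
def wmt_kendall_tau(metric_scores, darr_pairs) -> float:
    """Kendall's Tau-like formulation of Eq. (8); ties count as discordant,
    matching the shared-task definition described above."""
    concordant = discordant = 0
    for better, worse in darr_pairs:
        if metric_scores[better] > metric_scores[worse]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)
```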

As mentioned in the findings of Ma et al. (2019), segment-level correlations of all submitted metrics were frustratingly low. Furthermore, all submitted metrics exhibited a dramatic lack of ability to correctly rank strong MT systems. To evaluate whether our new MT evaluation models better address this issue, we followed the evaluation setup used in the analysis presented in Ma et al. (2019), where correlation levels are examined for portions of the DARR data that include only the top 10, 8, 6 and 4 MT systems.

5 Results

5.1 From English into X

Table 1 shows results for all eight language pairs with English as source. We contrast our three COMET models against baseline metrics such as BLEU and CHRF, the 2019 task winning metric YISI-1, as well as the more recent BERTSCORE. We observe that, across the board, our three models trained with the COMET framework outperform, often by significant margins, all other metrics. Our DARR Ranker model outperforms the two Estimators in seven out of eight language pairs. Also, even though the MQM Estimator is trained on only 12K annotated segments, it performs roughly on par with the HTER Estimator for most language pairs and outperforms all the other metrics in en-ru.

5.2 From X into English

Table 2 shows results for the seven to-English language pairs. Again, we contrast our three COMET models against baseline metrics such as BLEU and CHRF, the 2019 task winning metric YISI-1, as well as the recently published metrics BERTSCORE and BLEURT. As in Table 1, the DARR model shows strong correlations with human judgements, outperforming the recently proposed English-specific BLEURT metric in five out of seven language pairs. Again, the MQM Estimator shows surprisingly strong results despite the fact that this model was trained with data that did not include English as a target. Although the encoder used in our trained models is highly multilingual, we hypothesise that this powerful "zero-shot" result is due to the inclusion of the source in our models.

5.3 Language pairs not involving English

All three of our COMET models were trained on data involving English (either as a source or as a target). Nevertheless, to demonstrate that our metrics generalize well, we test them on the three WMT 2019 language pairs that do not include English in either source or target. As can be seen in Table 3, our results are consistent with the observations in Tables 1 and 2.

5.4 Robustness to High-Quality MT

For this analysis, we use the DARR corpus from the 2019 Shared Task and evaluate on the subset of the data from the top performing MT systems for each language pair. We included language pairs for which we could retrieve data for at least ten different MT systems (i.e. all but kk-en and gu-en). We contrast against the strong, recently proposed BERTSCORE and BLEURT, with BLEU as a baseline. Results are presented in Figure 3. For language pairs where English is the target, our three models are either better than or competitive with all others; where English is the source, we note that in general our metrics exceed the performance of


Table 2: Kendall's Tau (τ) correlations on language pairs with English as a target for the WMT19 Metrics DARR corpus. As for BERTSCORE, for BLEURT we report results for two models: the base model, which is comparable in size with the encoder we used, and the large model, which is twice the size.

Metric                  de-en  fi-en  gu-en  kk-en  lt-en  ru-en  zh-en
BLEU                    0.053  0.236  0.194  0.276  0.249  0.177  0.321
CHRF                    0.123  0.292  0.240  0.323  0.304  0.115  0.371
YISI-1                  0.164  0.347  0.312  0.440  0.376  0.217  0.426
BERTSCORE (default)     0.190  0.354  0.292  0.351  0.381  0.221  0.432
BERTSCORE (xlmr-base)   0.171  0.335  0.295  0.354  0.356  0.202  0.412
BLEURT (base-128)       0.171  0.372  0.302  0.383  0.387  0.218  0.417
BLEURT (large-512)      0.174  0.374  0.313  0.372  0.388  0.220  0.436
COMET-HTER              0.185  0.333  0.274  0.297  0.364  0.163  0.391
COMET-MQM               0.207  0.343  0.282  0.339  0.368  0.187  0.422
COMET-RANK              0.202  0.399  0.341  0.358  0.407  0.180  0.445

Table 3: Kendall's Tau (τ) correlations on language pairs not involving English for the WMT19 Metrics DARR corpus.

Metric                  de-cs  de-fr  fr-de
BLEU                    0.222  0.226  0.173
CHRF                    0.341  0.287  0.274
YISI-1                  0.376  0.349  0.310
BERTSCORE (default)     0.358  0.329  0.300
BERTSCORE (xlmr-base)   0.386  0.336  0.309
COMET-HTER              0.358  0.397  0.315
COMET-MQM               0.386  0.367  0.296
COMET-RANK              0.389  0.444  0.331

others. Even the MQM Estimator, trained with only 12K segments, is competitive, which highlights the power of our proposed framework.
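One way to build these top-N subsets is sketched below; the field names for the DARR rows and the system ranking scores are assumptions about the data layout rather than the shared-task scripts.

```python
def top_system_subset(darr_data, system_scores, n: int):
    """Keep only DARR judgements where both hypotheses come from the n best systems."""
    top = set(sorted(system_scores, key=system_scores.get, reverse=True)[:n])
    return [row for row in darr_data
            if row["sys_better"] in top and row["sys_worse"] in top]
```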

5.5 The Importance of the Source

To shed some light on the actual value and contribution of the source-language input to our models' ability to learn accurate predictions, we trained two versions of our DARR Ranker model: one that uses only the reference and another that uses both reference and source. Both models were trained using the WMT 2017 corpus, which only includes language pairs from English (en-de, en-cs, en-fi, en-tr). In other words, while English was never observed as a target language during training for either variant of the model, the training of the second variant includes English source embeddings. We then tested these two model variants on the WMT 2018 corpus for these language pairs and for the reversed directions (with the exception of en-cs, because cs-en does not exist for WMT 2018).

Figure 3: Metrics performance over all and the top (10, 8, 6 and 4) MT systems, with one panel for language pairs from X to English and one from English to X (Kendall Tau (τ) against the number of top systems considered; metrics shown: COMET-RANK, COMET-MQM, COMET-HTER, BLEU, BERTSCORE, BLEURT).

The results in Table 4 clearly show that, for the translation ranking architecture, including the source improves the overall correlation with human judgments. Furthermore, the inclusion of the source exposed the second variant of the model to English embeddings, which is reflected in a higher Δτ for the language pairs with English as a target.

Table 4: Comparison between COMET-RANK (section 2.4) and a reference-only version thereof on WMT18 data. Both models were trained with WMT17, which means that the reference-only model is never exposed to English during training.

Metric                  en-cs  en-de  en-fi  en-tr  cs-en  de-en  fi-en  tr-en
COMET-RANK (ref. only)  0.660  0.764  0.630  0.539  0.249  0.390  0.159  0.128
COMET-RANK              0.711  0.799  0.671  0.563  0.356  0.542  0.278  0.260
Δτ                      0.051  0.035  0.041  0.024  0.107  0.155  0.119  0.132

6 Reproducibility

We will release both the code-base of the COMET framework and the trained MT evaluation models described in this paper to the research community upon publication, along with the detailed scripts required in order to run all reported baselines.⁶ All the models reported in this paper were trained on a single Tesla T4 (16GB) GPU. Moreover, our framework builds on top of PyTorch Lightning (Falcon, 2019), a lightweight PyTorch wrapper that was created for maximal flexibility and reproducibility.

7 Related Work

Classic MT evaluation metrics are commonly characterized as n-gram matching metrics because, using hand-crafted features, they estimate MT quality by counting the number and fraction of n-grams that appear simultaneously in a candidate translation hypothesis and one or more human references. Metrics such as BLEU (Papineni et al., 2002), METEOR (Lavie and Denkowski, 2009) and CHRF (Popović, 2015) have been widely studied and improved (Koehn et al., 2007; Popović, 2017; Denkowski and Lavie, 2011; Guo and Hu, 2019), but by design they usually fail to recognize and capture semantic similarity beyond the lexical level.

In recent years, word embeddings (Mikolov et al., 2013; Pennington et al., 2014; Peters et al., 2018; Devlin et al., 2019) have emerged as a commonly used alternative to n-gram matching for capturing word semantic similarity. Embedding-based metrics like METEOR-VECTOR (Servan et al., 2016), BLEU2VEC (Tättar and Fishel, 2017), YISI-1 (Lo, 2019), MOVERSCORE (Zhao et al., 2019) and BERTSCORE (Zhang et al., 2020) create soft alignments between reference and hypothesis

⁶ These will be hosted at: https://github.com/Unbabel/COMET

in an embedding space and then compute a score that reflects the semantic similarity between those segments. However, human judgements such as DA and MQM capture much more than just semantic similarity, resulting in a correlation upper-bound between human judgements and the scores produced by such metrics.

Learnable metrics (Shimanaka et al., 2018; Mathur et al., 2019; Shimanaka et al., 2019) attempt to directly optimize the correlation with human judgments and have recently shown promising results. BLEURT (Sellam et al., 2020), a learnable metric based on BERT (Devlin et al., 2019), claims state-of-the-art performance for the last 3 years of the WMT Metrics Shared Task. Because BLEURT builds on top of English-BERT (Devlin et al., 2019), it can only be used when English is the target language, which limits its applicability. Also, to the best of our knowledge, all the previously proposed learnable metrics have focused on optimizing DA, which, due to a scarcity of annotators, can prove inherently noisy (Ma et al., 2019).

Reference-less MT evaluation, also known as Quality Estimation (QE), has historically often regressed on HTER for segment-level evaluation (Bojar et al., 2013, 2014, 2015, 2016, 2017a). More recently, MQM has been used for document-level evaluation (Specia et al., 2018; Fonseca et al., 2019). By leveraging highly multilingual pretrained encoders such as multilingual BERT (Devlin et al., 2019) and XLM (Conneau and Lample, 2019), QE systems have been showing auspicious correlations with human judgements (Kepler et al., 2019a). Concurrently, the OpenKiwi framework (Kepler et al., 2019b) has made it easier for researchers to push the field forward and build stronger QE models.

8 Conclusions and Future Work

In this paper we present COMET, a novel neural framework for training MT evaluation models that can serve as automatic metrics and easily be adapted and optimized to different types of human judgements of MT quality.

To showcase the effectiveness of our framework, we sought to address the challenges reported in the 2019 WMT Metrics Shared Task (Ma et al., 2019). We trained three distinct models which achieve new state-of-the-art results for segment-level correlation with human judgments and show promising ability to better differentiate high-performing systems.

One of the challenges of leveraging the power of pretrained models is the burdensome weight of parameters and inference time. A primary avenue for future work on COMET will look at the impact of more compact solutions such as DistilBERT (Sanh et al., 2019).

Additionally, whilst we outline the potential importance of the source text above, we note that our COMET-RANK model weighs source and reference differently during inference, but equally in its training loss function. Future work will investigate the optimality of this formulation and further examine the interdependence of the different inputs.

Acknowledgments

We are grateful to André Martins, Austin Matthews, Fábio Kepler, Daan van Stigt, Miguel Vera, and the reviewers for their valuable feedback and discussions. This work was supported in part by the P2020 Program through projects MAIA and Unbabel4EU, supervised by ANI under contract numbers 045909 and 042671, respectively.

References

Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597–610.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics.

Ondřej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2013. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 1–44, Sofia, Bulgaria. Association for Computational Linguistics.

Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Aleš Tamchyna. 2014. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 12–58, Baltimore, Maryland, USA. Association for Computational Linguistics.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Raphael Rubino, Lucia Specia, and Marco Turchi. 2017a. Findings of the 2017 conference on machine translation (WMT17). In Proceedings of the Second Conference on Machine Translation, pages 169–214, Copenhagen, Denmark. Association for Computational Linguistics.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016. Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 131–198, Berlin, Germany. Association for Computational Linguistics.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. 2015. Findings of the 2015 workshop on statistical machine translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 1–46, Lisbon, Portugal. Association for Computational Linguistics.

Ondřej Bojar, Yvette Graham, and Amir Kamran. 2017b. Results of the WMT17 metrics shared task. In Proceedings of the Second Conference on Machine Translation, pages 489–513, Copenhagen, Denmark. Association for Computational Linguistics.

Aljoscha Burchardt and Arle Lommel. 2014. Practical Guidelines for the Use of MQM in Scientific Research on Translation Quality. (Access date: 2020-05-26).

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.

Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 7059–7069. Curran Associates, Inc.

Michael Denkowski and Alon Lavie. 2011. Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 85–91, Edinburgh, Scotland. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

W. A. Falcon. 2019. PyTorch Lightning: The lightweight PyTorch wrapper for high-performance AI research. GitHub.

Erick Fonseca, Lisa Yankovskaya, André F. T. Martins, Mark Fishel, and Christian Federmann. 2019. Findings of the WMT 2019 shared tasks on quality estimation. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), pages 1–10, Florence, Italy. Association for Computational Linguistics.

Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2013. Continuous measurement scales in human evaluation of machine translation. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 33–41, Sofia, Bulgaria. Association for Computational Linguistics.

Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2014. Is machine translation getting better over time? In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 443–451, Gothenburg, Sweden. Association for Computational Linguistics.

Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2017. Can machine translation systems be evaluated by the crowd alone? Natural Language Engineering, 23(1):3–30.

Yinuo Guo and Junfeng Hu. 2019. Meteor++ 2.0: Adopt syntactic level paraphrase knowledge into machine translation evaluation. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 501–506, Florence, Italy. Association for Computational Linguistics.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.

Fábio Kepler, Jonay Trenous, Marcos Treviso, Miguel Vera, António Góis, M. Amin Farajian, António V. Lopes, and André F. T. Martins. 2019a. Unbabel's participation in the WMT19 translation quality estimation shared task. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), pages 78–84, Florence, Italy. Association for Computational Linguistics.

Fábio Kepler, Jonay Trenous, Marcos Treviso, Miguel Vera, and André F. T. Martins. 2019b. OpenKiwi: An open source framework for quality estimation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 117–122, Florence, Italy. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics.

Dan Kondratyuk and Milan Straka. 2019. 75 languages, 1 model: Parsing universal dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2779–2795, Hong Kong, China. Association for Computational Linguistics.

Alon Lavie and Michael Denkowski. 2009. The METEOR metric for automatic evaluation of machine translation. Machine Translation, 23:105–115.

Chi-kiu Lo. 2019. YiSi - a unified semantic MT quality evaluation and estimation metric for languages with different levels of available resources. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 507–513, Florence, Italy. Association for Computational Linguistics.

Arle Lommel, Aljoscha Burchardt, and Hans Uszkoreit. 2014. Multidimensional quality metrics (MQM): A framework for declaring and describing translation quality metrics. Tradumàtica: tecnologies de la traducció, 0:455–463.

Qingsong Ma, Ondřej Bojar, and Yvette Graham. 2018. Results of the WMT18 metrics shared task: Both characters and embeddings achieve good performance. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 671–688, Belgium, Brussels. Association for Computational Linguistics.

Qingsong Ma, Johnny Wei, Ondřej Bojar, and Yvette Graham. 2019. Results of the WMT19 metrics shared task: Segment-level and strong MT systems pose big challenges. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 62–90, Florence, Italy. Association for Computational Linguistics.

Nitika Mathur, Timothy Baldwin, and Trevor Cohn. 2019. Putting evaluation in context: Contextual embeddings improve machine translation evaluation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2799–2808, Florence, Italy. Association for Computational Linguistics.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. Association for Computational Linguistics.

Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics.

Maja Popović. 2017. chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618, Copenhagen, Denmark. Association for Computational Linguistics.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

F. Schroff, D. Kalenichenko, and J. Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 815–823.

Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online. Association for Computational Linguistics.

Christophe Servan, Alexandre Berard, Zied Elloumi, Hervé Blanchon, and Laurent Besacier. 2016. Word2Vec vs DBnary: Augmenting METEOR using vector representations or lexical resources. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1159–1168, Osaka, Japan. The COLING 2016 Organizing Committee.

Hiroki Shimanaka, Tomoyuki Kajiwara, and Mamoru Komachi. 2018. RUSE: Regressor using sentence embeddings for automatic machine translation evaluation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 751–758, Belgium, Brussels. Association for Computational Linguistics.

Hiroki Shimanaka, Tomoyuki Kajiwara, and Mamoru Komachi. 2019. Machine Translation Evaluation with BERT Regressor. arXiv preprint arXiv:1907.12679.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, pages 223–231.

Lucia Specia, Frédéric Blain, Varvara Logacheva, Ramón Astudillo, and André F. T. Martins. 2018. Findings of the WMT 2018 shared task on quality estimation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 689–709, Belgium, Brussels. Association for Computational Linguistics.

Lucia Specia, Kim Harris, Frédéric Blain, Aljoscha Burchardt, Vivien Macketanz, Inguna Skadiņa, Matteo Negri, and Marco Turchi. 2017. Translation quality and productivity: A study on rich morphology languages. In Machine Translation Summit XVI, pages 55–71, Nagoya, Japan.

Kosuke Takahashi, Katsuhito Sudoh, and Satoshi Nakamura. 2020. Automatic machine translation evaluation using source language inputs and cross-lingual language model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3553–3558, Online. Association for Computational Linguistics.

Andre Tättar and Mark Fishel. 2017. bleu2vec: the painfully familiar metric on continuous vector space steroids. In Proceedings of the Second Conference on Machine Translation, pages 619–622, Copenhagen, Denmark. Association for Computational Linguistics.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.

Wei Zhao, Goran Glavaš, Maxime Peyrard, Yang Gao, Robert West, and Steffen Eger. 2020. On the limitations of cross-lingual encoders as exposed by reference-free machine translation evaluation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1656–1671, Online. Association for Computational Linguistics.

Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, and Steffen Eger. 2019. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 563–578, Hong Kong, China. Association for Computational Linguistics.


A Appendices

In Table 5 we list the hyper-parameters used to train our models. Before initializing these models, a random seed was set to 3 in all libraries that perform "random" operations (torch, numpy, random and cuda).
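A small helper reflecting this seeding setup might look as follows:

```python
import random
import numpy as np
import torch

def set_seed(seed: int = 3) -> None:
    """Seed every library that performs random operations, as described above."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
```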


Table 5: Hyper-parameters used in our COMET framework to train the presented models.

Hyper-parameter           | COMET (Est-HTER/MQM)       | COMET-RANK
Encoder Model             | XLM-RoBERTa (base)         | XLM-RoBERTa (base)
Optimizer                 | Adam (default parameters)  | Adam (default parameters)
No. frozen epochs         | 1                          | 0
Learning rate             | 3e-05 and 1e-05            | 1e-05
Batch size                | 16                         | 16
Loss function             | MSE                        | Triplet Margin (ε = 1.0)
Layer-wise dropout        | 0.1                        | 0.1
FP precision              | 32                         | 32
Feed-Forward hidden units | 2304, 1152                 | –
Feed-Forward activations  | Tanh                       | –
Feed-Forward dropout      | 0.1                        | –

Table 6: Statistics for the QT21 corpus.

                         en-de  en-cs  en-lv  de-en
Total tuples             54000  42000  35474  41998
Avg. tokens (reference)  17.80  15.56  16.42  17.71
Avg. tokens (source)     16.70  17.37  18.39  17.18
Avg. tokens (MT)         17.65  15.64  16.42  17.78

Table 7: Statistics for the WMT 2017 DARR corpus.

                         en-cs  en-de  en-fi  en-lv  en-tr
Total tuples             32810   6454   3270   3456    247
Avg. tokens (reference)  19.70  22.15  15.59  21.42  17.57
Avg. tokens (source)     22.37  23.41  21.73  26.08  22.51
Avg. tokens (MT)         19.45  22.58  16.06  22.18  17.25


Table 8: Statistics for the WMT 2019 DARR into-English language pairs (de-en, fi-en, gu-en, kk-en, lt-en, ru-en, zh-en), reporting total tuples and average token counts for reference, source and MT.

Table 9: Statistics for the WMT 2019 DARR from-English and no-English language pairs (en-cs, en-de, en-fi, en-gu, en-kk, en-lt, en-ru, en-zh, fr-de, de-cs, de-fr), reporting total tuples and average token counts for reference, source and MT.

Table 10: MQM corpus (section 3.3) statistics for en-nl, en-sv, en-ja, en-de, en-ru, en-es, en-fr, en-it, en-pt-br, en-tr, en-pt and en-es-latam, reporting total tuples and average token counts for reference, source and MT.

Table 11: Statistics for the WMT 2018 DARR language pairs (zh-en, en-zh, cs-en, fi-en, ru-en, tr-en, de-en, en-cs, en-de, en-et, en-fi, en-ru, en-tr, et-en), reporting total tuples and average token counts for reference, source and MT.

Table 12: Metrics performance over all and the top (10, 8, 6 and 4) MT systems for all from-English language pairs (en-cs, en-de, en-fi, en-gu, en-kk, en-lt, en-ru, en-zh), plotting the Kendall Tau score of COMET-RANK, COMET-HTER, COMET-MQM, BLEU and BERTSCORE.

Table 13: Metrics performance over all and the top (10, 8, 6 and 4) MT systems for all into-English language pairs (de-en, fi-en, lt-en, ru-en, zh-en), plotting the Kendall Tau score of COMET-RANK, COMET-HTER, COMET-MQM, BLEU, BERTSCORE and BLEURT.

Page 3: COMET: A Neural Framework for MT Evaluation · 2020. 11. 11. · (Lavie and Denkowski,2009) remain popular as a means of evaluating MT systems due to their light-weight and fast computation

2687

Figure 1 Estimator model architecture The sourcehypothesis and reference are independently encoded us-ing a pretrained cross-lingual encoder The resultingword embeddings are then passed through a poolinglayer to create a sentence embedding for each segmentFinally the resulting sentence embeddings are com-bined and concatenated into one single vector that ispassed to a feed-forward regressor The entire model istrained by minimizing the Mean Squared Error (MSE)

Figure 2 Translation Ranking model architectureThis architecture receives 4 segments the source thereference a ldquobetterrdquo hypothesis and a ldquoworserdquo oneThese segments are independently encoded using a pre-trained cross-lingual encoder and a pooling layer ontop Finally using the triplet margin loss (Schroff et al2015) we optimize the resulting embedding space tominimize the distance between the ldquobetterrdquo hypothesisand the ldquoanchorsrdquo (source and reference)

23 Estimator ModelGiven a d-dimensional sentence embedding for thesource the hypothesis and the reference we adoptthe approach proposed in RUSE (Shimanaka et al2018) and extract the following combined features

bull Element-wise source product h s

bull Element-wise reference product h r

bull Absolute element-wise source difference|hminus s|

bull Absolute element-wise reference difference|hminus r|

These combined features are then concatenatedto the reference embedding r and hypothesis em-bedding h into a single vector x = [h rh sh r |h minus s| |h minus r|] that serves as input toa feed-forward regressor The strength of thesefeatures is in highlighting the differences betweenembeddings in the semantic feature space

The model is then trained to minimize the meansquared error between the predicted scores andquality assessments (DA HTER or MQM) Fig-ure 1 illustrates the proposed architecture

Note that we chose not to include the raw sourceembedding (s) in our concatenated input Earlyexperimentation revealed that the value added bythe source embedding as extra input features to ourregressor was negligible at best A variation onour HTER estimator model trained with the vectorx = [h s rh sh r |h minus s| |h minus r|] asinput to the feed-forward only succeed in boost-ing segment-level performance in 8 of the 18 lan-guage pairs outlined in section 5 below and theaverage improvement in Kendallrsquos Tau in those set-tings was +00009 As noted in Zhao et al (2020)while cross-lingual pretrained models are adaptiveto multiple languages the feature space betweenlanguages is poorly aligned On this basis we de-cided in favor of excluding the source embeddingon the intuition that the most important informationcomes from the reference embedding and reduc-ing the feature space would allow the model tofocus more on relevant information This does nothowever negate the general value of the source toour model where we include combination featuressuch as h s and |h minus s| we do note gains incorrelation as explored further in section 55 below

2688

24 Translation Ranking ModelOur Translation Ranking model (Figure 2) receivesas input a tuple χ = (s h+ hminus r) where h+ de-notes an hypothesis that was ranked higher thananother hypothesis hminus We then pass χ throughour cross-lingual encoder and pooling layer to ob-tain a sentence embedding for each segment in theχ Finally using the embeddings sh+hminus rwe compute the triplet margin loss (Schroff et al2015) in relation to the source and reference

L(χ) = L(sh+hminus) + L(rh+hminus) (2)

where

L(sh+hminus) =

max0 d(sh+) minus d(shminus) + ε(3)

L(rh+hminus) =

max0 d(rh+) minus d(rhminus) + ε(4)

d(uv) denotes the euclidean distance between uand v and ε is a margin Thus during training themodel optimizes the embedding space so the dis-tance between the anchors (s and r) and the ldquoworserdquohypothesis hminus is greater by at least ε than the dis-tance between the anchors and ldquobetterrdquo hypothesish+

During inference the described model receivesa triplet (s h r) with only one hypothesis Thequality score assigned to h is the harmonic meanbetween the distance to the source d(s h) and thedistance to the reference d(r h)

f(s h r) =2times d(r h)times d(s h)

d(r h) + d(s h)(5)

Finally we convert the resulting distance into asimilarity score bounded between 0 and 1 as fol-lows

f(s h r) =1

1 + f(s h r)(6)

3 Corpora

To demonstrate the effectiveness of our describedmodel architectures (section 2) we train three MTevaluation models where each model targets a dif-ferent type of human judgment To train thesemodels we use data from three different corporathe QT21 corpus the DARR from the WMT Met-rics shared task (2017 to 2019) and a proprietaryMQM annotated corpus

31 The QT21 corpusThe QT21 corpus is a publicly available3 datasetcontaining industry generated sentences from eitheran information technology or life sciences domains(Specia et al 2017) This corpus contains a totalof 173K tuples with source sentence respectivehuman-generated reference MT hypothesis (eitherfrom a phrase-based statistical MT or from a neu-ral MT) and post-edited MT (PE) The languagepairs represented in this corpus are English to Ger-man (en-de) Latvian (en-lt) and Czech (en-cs) andGerman to English (de-en)

The HTER score is obtained by computing thetranslation edit rate (TER) (Snover et al 2006) be-tween the MT hypothesis and the corresponding PEFinally after computing the HTER for each MTwe built a training dataset D = si hi ri yiNn=1where si denotes the source text hi denotes the MThypothesis ri the reference translation and yi theHTER score for the hypothesis hi In this mannerwe seek to learn a regression f(s h r) rarr y thatpredicts the human-effort required to correct thehypothesis by looking at the source hypothesisand reference (but not the post-edited hypothesis)

32 The WMT DARR corpusSince 2017 the organizers of the WMT NewsTranslation Shared Task (Barrault et al 2019) havecollected human judgements in the form of ad-equacy DAs (Graham et al 2013 2014 2017)These DAs are then mapped into relative rank-ings (DARR) (Ma et al 2019) The resultingdata for each year (2017-19) form a dataset D =si h+i h

minusi riNn=1 where h+i denotes a ldquobetterrdquo

hypothesis and hminusi denotes a ldquoworserdquo one Herewe seek to learn a function r(s h r) such that thescore assigned to h+i is strictly higher than the scoreassigned to hminusi (r(si h+i ri) gt r(si h

minusi ri))

This data4 contains a total of 24 high and low-resource language pairs such as Chinese to English(zh-en) and English to Gujarati (en-gu)

33 The MQM corpusThe MQM corpus is a proprietary internal databaseof MT-generated translations of customer support

3QT21 data httpslindatmffcuniczrepositoryxmluihandle11372LRT-2390

4The raw data for each year of the WMT Metrics sharedtask is publicly available in the results page (2019 ex-ample httpwwwstatmtorgwmt19resultshtml) Note however that in the README files it is high-lighted that this data is not well documented and the scriptsoccasionally require custom utilities that are not available

2689

chat messages that were annotated according to theguidelines set out in Burchardt and Lommel (2014)This data contains a total of 12K tuples cover-ing 12 language pairs from English to German(en-de) Spanish (en-es) Latin-American Span-ish (en-es-latam) French (en-fr) Italian (en-it)Japanese (en-ja) Dutch (en-nl) Portuguese (en-pt)Brazilian Portuguese (en-pt-br) Russian (en-ru)Swedish (en-sv) and Turkish (en-tr) Note that inthis corpus English is always seen as the source lan-guage but never as the target language Each tupleconsists of a source sentence a human-generatedreference a MT hypothesis and its MQM scorederived from error annotations by one (or more)trained annotators The MQM metric referred tothroughout this paper is an internal metric definedin accordance with the MQM framework (Lommelet al 2014) (MQM) Errors are annotated underan internal typology defined under three main er-ror types lsquoStylersquo lsquoFluencyrsquo and lsquoAccuracyrsquo OurMQM scores range from minusinfin to 100 and are de-fined as

MQM = 100minus IMinor + 5times IMajor + 10times ICrit

Sentence Lengthtimes 100(7)

where IMinor denotes the number of minor errorsIMajor the number of major errors and ICrit the num-ber of critical errors

Our MQM metric takes into account the sever-ity of the errors identified in the MT hypothesisleading to a more fine-grained metric than HTERor DA When used in our experiments these val-ues were divided by 100 and truncated at 0 Asin section 31 we constructed a training datasetD = si hi ri yiNn=1 where si denotes thesource text hi denotes the MT hypothesis ri thereference translation and yi the MQM score forthe hypothesis hi

4 Experiments

We train two versions of the Estimator model de-scribed in section 23 one that regresses on HTER(COMET-HTER) trained with the QT21 corpus andanother that regresses on our proprietary implemen-tation of MQM (COMET-MQM) trained with ourinternal MQM corpus For the Translation Rankingmodel described in section 24 we train with theWMT DARR corpus from 2017 and 2018 (COMET-RANK) In this section we introduce the training

setup for these models and corresponding evalua-tion setup

41 Training SetupThe two versions of the Estimators (COMET-HTERMQM) share the same training setup andhyper-parameters (details are included in the Ap-pendices) For training we load the pretrainedencoder and initialize both the pooling layer andthe feed-forward regressor Whereas the layer-wisescalars α from the pooling layer are initially setto zero the weights from the feed-forward are ini-tialized randomly During training we divide themodel parameters into two groups the encoder pa-rameters that include the encoder model and thescalars from α and the regressor parameters thatinclude the parameters from the top feed-forwardnetwork We apply gradual unfreezing and discrim-inative learning rates (Howard and Ruder 2018)meaning that the encoder model is frozen for oneepoch while the feed-forward is optimized with alearning rate of 3eminus5 After the first epoch theentire model is fine-tuned but the learning rate forthe encoder parameters is set to 1eminus5 in order toavoid catastrophic forgetting

In contrast with the two Estimators, for the COMET-RANK model we fine-tune from the outset. Furthermore, since this model does not add any new parameters on top of XLM-RoBERTa (base) other than the layer scalars α, we use one single learning rate of 1e−5 for the entire model.
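As a rough illustration of this optimisation scheme, the sketch below assumes plain PyTorch; the module and variable names are ours and not the framework's API:

```python
# Sketch of discriminative learning rates and gradual unfreezing as described
# above: encoder parameters (plus the layer scalars alpha) get 1e-5, the
# feed-forward regressor gets 3e-5, and the encoder group is frozen for epoch 0.
import torch

def build_optimizer(encoder, layer_alphas, regressor):
    encoder_params = list(encoder.parameters()) + [layer_alphas]
    optimizer = torch.optim.Adam([
        {"params": encoder_params, "lr": 1e-5},           # used once unfrozen
        {"params": regressor.parameters(), "lr": 3e-5},   # optimised from the start
    ])
    return optimizer, encoder_params

def set_encoder_frozen(encoder_params, frozen):
    # Gradual unfreezing: frozen=True during the first epoch, False afterwards.
    for p in encoder_params:
        p.requires_grad = not frozen
```

For COMET-RANK the same idea reduces to a single parameter group with a learning rate of 1e−5 and no frozen epoch.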

4.2 Evaluation Setup

We use the test data and setup of the WMT 2019 Metrics Shared Task (Ma et al., 2019) in order to compare the COMET models with the top performing submissions of the shared task and other recent state-of-the-art metrics such as BERTSCORE and BLEURT.⁵ The evaluation method used is the official Kendall's Tau-like formulation, τ, from the WMT 2019 Metrics Shared Task (Ma et al., 2019), defined as:

$$\tau = \frac{\mathrm{Concordant} - \mathrm{Discordant}}{\mathrm{Concordant} + \mathrm{Discordant}} \qquad (8)$$

where Concordant is the number of times a metric assigns a higher score to the "better" hypothesis h+ and Discordant is the number of times a metric assigns a higher score to the "worse" hypothesis

⁵ To ease future research, we will also provide within our framework detailed instructions and scripts to run other metrics such as CHRF, BLEU, BERTSCORE, and BLEURT.


Table 1: Kendall's Tau (τ) correlations on language pairs with English as source for the WMT19 Metrics DARR corpus. For BERTSCORE we report results with the default encoder model for a complete comparison, but also with XLM-RoBERTa (base) for fairness with our models. The values reported for YiSi-1 are taken directly from the shared task paper (Ma et al., 2019).

Metric                   en-cs   en-de   en-fi   en-gu   en-kk   en-lt   en-ru   en-zh
BLEU                     0.364   0.248   0.395   0.463   0.363   0.333   0.469   0.235
CHRF                     0.444   0.321   0.518   0.548   0.510   0.438   0.548   0.241
YISI-1                   0.475   0.351   0.537   0.551   0.546   0.470   0.585   0.355
BERTSCORE (default)      0.500   0.363   0.527   0.568   0.540   0.464   0.585   0.356
BERTSCORE (xlmr-base)    0.503   0.369   0.553   0.584   0.536   0.514   0.599   0.317
COMET-HTER               0.524   0.383   0.560   0.552   0.508   0.577   0.539   0.380
COMET-MQM                0.537   0.398   0.567   0.564   0.534   0.574   0.615   0.378
COMET-RANK               0.603   0.427   0.664   0.611   0.693   0.665   0.580   0.449

h−, or the scores assigned to both hypotheses are the same.
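A minimal sketch (ours) of this computation, taking for each DARR example the metric scores assigned to the "better" and "worse" hypotheses:

```python
# Sketch of the WMT 2019 Kendall's Tau-like statistic in equation (8).

def kendall_tau_like(score_pairs):
    """score_pairs: iterable of (score_better, score_worse) metric scores,
    one pair per DARR relative-ranking example."""
    pairs = list(score_pairs)
    concordant = sum(1 for better, worse in pairs if better > worse)
    # Ties count as Discordant, following the definition above.
    discordant = sum(1 for better, worse in pairs if better <= worse)
    return (concordant - discordant) / (concordant + discordant)

# Example: two correctly ranked pairs, one mis-ranked pair and one tie.
print(kendall_tau_like([(0.8, 0.5), (0.9, 0.2), (0.3, 0.6), (0.4, 0.4)]))  # (2 - 2) / 4 = 0.0
```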

As mentioned in the findings of Ma et al. (2019), segment-level correlations of all submitted metrics were frustratingly low. Furthermore, all submitted metrics exhibited a dramatic lack of ability to correctly rank strong MT systems. To evaluate whether our new MT evaluation models better address this issue, we followed the evaluation setup used in the analysis presented in Ma et al. (2019), where correlation levels are examined for portions of the DARR data that include only the top 10, 8, 6, and 4 MT systems.
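This top-N analysis can be illustrated with the following sketch (ours; the field names and the source of the system ranking are assumptions, not shared-task tooling): keep only comparisons whose two systems are both among the N best, then recompute τ on that subset.

```python
# Sketch: restrict DARR comparisons to the top-N MT systems of a language pair.

def top_n_subset(darr_examples, system_ranking, n):
    """darr_examples: dicts with 'sys_better'/'sys_worse' system ids and the
    metric scores for both hypotheses; system_ranking: system ids, best first."""
    top = set(system_ranking[:n])
    return [ex for ex in darr_examples
            if ex["sys_better"] in top and ex["sys_worse"] in top]

# tau = kendall_tau_like((ex["score_better"], ex["score_worse"])
#                        for ex in top_n_subset(examples, ranking, 6))
```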

5 Results

5.1 From English into X

Table 1 shows results for all eight language pairs with English as source. We contrast our three COMET models against baseline metrics such as BLEU and CHRF, the 2019 task winning metric YISI-1, as well as the more recent BERTSCORE. We observe that, across the board, our three models trained with the COMET framework outperform, often by significant margins, all other metrics. Our DARR Ranker model outperforms the two Estimators in seven out of eight language pairs. Also, even though the MQM Estimator is trained on only 12K annotated segments, it performs roughly on par with the HTER Estimator for most language pairs and outperforms all the other metrics in en-ru.

5.2 From X into English

Table 2 shows results for the seven to-English language pairs. Again, we contrast our three COMET models against baseline metrics such as BLEU and CHRF, the 2019 task winning metric YISI-1, as well as the recently published metrics BERTSCORE and BLEURT. As in Table 1, the DARR model shows strong correlations with human judgements, outperforming the recently proposed English-specific BLEURT metric in five out of seven language pairs. Again, the MQM Estimator shows surprisingly strong results despite the fact that this model was trained with data that did not include English as a target. Although the encoder used in our trained models is highly multilingual, we hypothesise that this powerful "zero-shot" result is due to the inclusion of the source in our models.

5.3 Language pairs not involving English

All three of our COMET models were trained on data involving English (either as a source or as a target). Nevertheless, to demonstrate that our metrics generalize well, we test them on the three WMT 2019 language pairs that do not include English in either source or target. As can be seen in Table 3, our results are consistent with observations in Tables 1 and 2.

5.4 Robustness to High-Quality MT

For this analysis, we use the DARR corpus from the 2019 Shared Task and evaluate on the subset of the data from the top performing MT systems for each language pair. We included language pairs for which we could retrieve data for at least ten different MT systems (i.e. all but kk-en and gu-en). We contrast against the strong, recently proposed BERTSCORE and BLEURT, with BLEU as a baseline. Results are presented in Figure 3. For language pairs where English is the target, our three models are either better than or competitive with all others; where English is the source, we note that in general our metrics exceed the performance of others.


Table 2: Kendall's Tau (τ) correlations on language pairs with English as a target for the WMT19 Metrics DARR corpus. As for BERTSCORE, for BLEURT we report results for two models: the base model, which is comparable in size with the encoder we used, and the large model, which is twice the size.

Metric                   de-en   fi-en   gu-en   kk-en   lt-en   ru-en   zh-en
BLEU                     0.053   0.236   0.194   0.276   0.249   0.177   0.321
CHRF                     0.123   0.292   0.240   0.323   0.304   0.115   0.371
YISI-1                   0.164   0.347   0.312   0.440   0.376   0.217   0.426
BERTSCORE (default)      0.190   0.354   0.292   0.351   0.381   0.221   0.432
BERTSCORE (xlmr-base)    0.171   0.335   0.295   0.354   0.356   0.202   0.412
BLEURT (base-128)        0.171   0.372   0.302   0.383   0.387   0.218   0.417
BLEURT (large-512)       0.174   0.374   0.313   0.372   0.388   0.220   0.436
COMET-HTER               0.185   0.333   0.274   0.297   0.364   0.163   0.391
COMET-MQM                0.207   0.343   0.282   0.339   0.368   0.187   0.422
COMET-RANK               0.202   0.399   0.341   0.358   0.407   0.180   0.445

Table 3: Kendall's Tau (τ) correlations on language pairs not involving English for the WMT19 Metrics DARR corpus.

Metric                   de-cs   de-fr   fr-de
BLEU                     0.222   0.226   0.173
CHRF                     0.341   0.287   0.274
YISI-1                   0.376   0.349   0.310
BERTSCORE (default)      0.358   0.329   0.300
BERTSCORE (xlmr-base)    0.386   0.336   0.309
COMET-HTER               0.358   0.397   0.315
COMET-MQM                0.386   0.367   0.296
COMET-RANK               0.389   0.444   0.331

Even the MQM Estimator, trained with only 12K segments, is competitive, which highlights the power of our proposed framework.

5.5 The Importance of the Source

To shed some light on the actual value and contribution of the source language input in our models' ability to learn accurate predictions, we trained two versions of our DARR Ranker model: one that uses only the reference, and another that uses both reference and source. Both models were trained using the WMT 2017 corpus, which only includes language pairs from English (en-de, en-cs, en-fi, en-tr). In other words, while English was never observed as a target language during training for either variant of the model, the training of the second variant includes English source embeddings. We then tested these two model variants on the WMT 2018 corpus for these language pairs and for the reversed directions (with the exception of en-cs, because cs-en does not exist for WMT 2018).

[Figure 3: Metrics performance over all and the top (10, 8, 6 and 4) MT systems. Two panels plot the Kendall Tau (τ) of each metric against the number of top systems considered, one for language pairs from X to English and one from English to X, comparing COMET-RANK, COMET-MQM, COMET-HTER, BLEU, BERTSCORE and BLEURT.]

The results in Table 4 clearly show that, for the translation ranking architecture, including the source improves the overall correlation with human judgments. Furthermore, the inclusion of the source exposed the second variant of the model to English embeddings, which is


Table 4: Comparison between COMET-RANK (section 2.4) and a reference-only version thereof on WMT18 data. Both models were trained with WMT17, which means that the reference-only model is never exposed to English during training.

Metric                    en-cs   en-de   en-fi   en-tr   cs-en   de-en   fi-en   tr-en
COMET-RANK (ref. only)    0.660   0.764   0.630   0.539   0.249   0.390   0.159   0.128
COMET-RANK                0.711   0.799   0.671   0.563   0.356   0.542   0.278   0.260
∆τ                        0.051   0.035   0.041   0.024   0.107   0.155   0.119   0.132

reflected in a higher ∆τ for the language pairs with English as a target.

6 Reproducibility

We will release both the code-base of the COMET framework and the trained MT evaluation models described in this paper to the research community upon publication, along with the detailed scripts required in order to run all reported baselines.⁶ All the models reported in this paper were trained on a single Tesla T4 (16GB) GPU. Moreover, our framework builds on top of PyTorch Lightning (Falcon, 2019), a lightweight PyTorch wrapper that was created for maximal flexibility and reproducibility.

7 Related Work

Classic MT evaluation metrics are commonly characterized as n-gram matching metrics because, using hand-crafted features, they estimate MT quality by counting the number and fraction of n-grams that appear simultaneously in a candidate translation hypothesis and one or more human references. Metrics such as BLEU (Papineni et al., 2002), METEOR (Lavie and Denkowski, 2009), and CHRF (Popovic, 2015) have been widely studied and improved (Koehn et al., 2007; Popovic, 2017; Denkowski and Lavie, 2011; Guo and Hu, 2019), but by design they usually fail to recognize and capture semantic similarity beyond the lexical level.

In recent years, word embeddings (Mikolov et al., 2013; Pennington et al., 2014; Peters et al., 2018; Devlin et al., 2019) have emerged as a commonly used alternative to n-gram matching for capturing word semantic similarity. Embedding-based metrics like METEOR-VECTOR (Servan et al., 2016), BLEU2VEC (Tattar and Fishel, 2017), YISI-1 (Lo, 2019), MOVERSCORE (Zhao et al., 2019), and BERTSCORE (Zhang et al., 2020) create soft-alignments between reference and hypothesis

⁶ These will be hosted at: https://github.com/Unbabel/COMET

in an embedding space and then compute a score that reflects the semantic similarity between those segments. However, human judgements such as DA and MQM capture much more than just semantic similarity, resulting in a correlation upper-bound between human judgements and the scores produced by such metrics.

Learnable metrics (Shimanaka et al., 2018; Mathur et al., 2019; Shimanaka et al., 2019) attempt to directly optimize the correlation with human judgments and have recently shown promising results. BLEURT (Sellam et al., 2020), a learnable metric based on BERT (Devlin et al., 2019), claims state-of-the-art performance for the last 3 years of the WMT Metrics Shared Task. Because BLEURT builds on top of English-BERT (Devlin et al., 2019), it can only be used when English is the target language, which limits its applicability. Also, to the best of our knowledge, all the previously proposed learnable metrics have focused on optimizing DA which, due to a scarcity of annotators, can prove inherently noisy (Ma et al., 2019).

Reference-less MT evaluation, also known as Quality Estimation (QE), has historically often regressed on HTER for segment-level evaluation (Bojar et al., 2013, 2014, 2015, 2016, 2017a). More recently, MQM has been used for document-level evaluation (Specia et al., 2018; Fonseca et al., 2019). By leveraging highly multilingual pretrained encoders such as multilingual BERT (Devlin et al., 2019) and XLM (Conneau and Lample, 2019), QE systems have been showing auspicious correlations with human judgements (Kepler et al., 2019a). Concurrently, the OpenKiwi framework (Kepler et al., 2019b) has made it easier for researchers to push the field forward and build stronger QE models.

8 Conclusions and Future Work

In this paper we present COMET, a novel neural framework for training MT evaluation models that can serve as automatic metrics and easily be


adapted and optimized to different types of human judgements of MT quality.

To showcase the effectiveness of our framework, we sought to address the challenges reported in the 2019 WMT Metrics Shared Task (Ma et al., 2019). We trained three distinct models which achieve new state-of-the-art results for segment-level correlation with human judgments and show promising ability to better differentiate high-performing systems.

One of the challenges of leveraging the power of pretrained models is the burdensome weight of parameters and inference time. A primary avenue for future work on COMET will look at the impact of more compact solutions such as DistilBERT (Sanh et al., 2019).

Additionally, whilst we outline the potential importance of the source text above, we note that our COMET-RANK model weighs source and reference differently during inference but equally in its training loss function. Future work will investigate the optimality of this formulation and further examine the interdependence of the different inputs.

Acknowledgments

We are grateful to Andre Martins, Austin Matthews, Fabio Kepler, Daan Van Stigt, Miguel Vera, and the reviewers for their valuable feedback and discussions. This work was supported in part by the P2020 Program through projects MAIA and Unbabel4EU, supervised by ANI under contract numbers 045909 and 042671, respectively.

References

Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597–610.

Loïc Barrault, Ondrej Bojar, Marta R. Costa-jussa, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Muller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics.

Ondrej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2013. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 1–44, Sofia, Bulgaria. Association for Computational Linguistics.

Ondrej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Ales Tamchyna. 2014. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 12–58, Baltimore, Maryland, USA. Association for Computational Linguistics.

Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Raphael Rubino, Lucia Specia, and Marco Turchi. 2017a. Findings of the 2017 conference on machine translation (WMT17). In Proceedings of the Second Conference on Machine Translation, pages 169–214, Copenhagen, Denmark. Association for Computational Linguistics.

Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurelie Neveol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016. Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 131–198, Berlin, Germany. Association for Computational Linguistics.

Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. 2015. Findings of the 2015 workshop on statistical machine translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 1–46, Lisbon, Portugal. Association for Computational Linguistics.

Ondrej Bojar, Yvette Graham, and Amir Kamran. 2017b. Results of the WMT17 metrics shared task. In Proceedings of the Second Conference on Machine Translation, pages 489–513, Copenhagen, Denmark. Association for Computational Linguistics.

Aljoscha Burchardt and Arle Lommel. 2014. Practical Guidelines for the Use of MQM in Scientific Research on Translation Quality. (access date: 2020-05-26).

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzman, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.

Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 7059–7069. Curran Associates, Inc.

Michael Denkowski and Alon Lavie. 2011. Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 85–91, Edinburgh, Scotland. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

W.A. Falcon. 2019. PyTorch Lightning: The lightweight PyTorch wrapper for high-performance AI research. GitHub.

Erick Fonseca, Lisa Yankovskaya, Andre F. T. Martins, Mark Fishel, and Christian Federmann. 2019. Findings of the WMT 2019 shared tasks on quality estimation. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), pages 1–10, Florence, Italy. Association for Computational Linguistics.

Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2013. Continuous measurement scales in human evaluation of machine translation. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 33–41, Sofia, Bulgaria. Association for Computational Linguistics.

Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2014. Is machine translation getting better over time? In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 443–451, Gothenburg, Sweden. Association for Computational Linguistics.

Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2017. Can machine translation systems be evaluated by the crowd alone? Natural Language Engineering, 23(1):3–30.

Yinuo Guo and Junfeng Hu. 2019. Meteor++ 2.0: Adopt syntactic level paraphrase knowledge into machine translation evaluation. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 501–506, Florence, Italy. Association for Computational Linguistics.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.

Fabio Kepler, Jonay Trenous, Marcos Treviso, Miguel Vera, Antonio Gois, M. Amin Farajian, Antonio V. Lopes, and Andre F. T. Martins. 2019a. Unbabel's participation in the WMT19 translation quality estimation shared task. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), pages 78–84, Florence, Italy. Association for Computational Linguistics.

Fabio Kepler, Jonay Trenous, Marcos Treviso, Miguel Vera, and Andre F. T. Martins. 2019b. OpenKiwi: An open source framework for quality estimation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 117–122, Florence, Italy. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics.

Dan Kondratyuk and Milan Straka. 2019. 75 languages, 1 model: Parsing universal dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2779–2795, Hong Kong, China. Association for Computational Linguistics.

Alon Lavie and Michael Denkowski. 2009. The METEOR metric for automatic evaluation of machine translation. Machine Translation, 23:105–115.

Chi-kiu Lo. 2019. YiSi - a unified semantic MT quality evaluation and estimation metric for languages with different levels of available resources. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 507–513, Florence, Italy. Association for Computational Linguistics.

Arle Lommel, Aljoscha Burchardt, and Hans Uszkoreit. 2014. Multidimensional quality metrics (MQM): A framework for declaring and describing translation quality metrics. Tradumàtica: tecnologies de la traducció, 0:455–463.

Qingsong Ma, Ondrej Bojar, and Yvette Graham. 2018. Results of the WMT18 metrics shared task: Both characters and embeddings achieve good performance. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 671–688, Belgium, Brussels. Association for Computational Linguistics.

Qingsong Ma, Johnny Wei, Ondrej Bojar, and Yvette Graham. 2019. Results of the WMT19 metrics shared task: Segment-level and strong MT systems pose big challenges. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 62–90, Florence, Italy. Association for Computational Linguistics.

Nitika Mathur, Timothy Baldwin, and Trevor Cohn. 2019. Putting evaluation in context: Contextual embeddings improve machine translation evaluation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2799–2808, Florence, Italy. Association for Computational Linguistics.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. Association for Computational Linguistics.

Maja Popovic. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics.

Maja Popovic. 2017. chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618, Copenhagen, Denmark. Association for Computational Linguistics.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

F. Schroff, D. Kalenichenko, and J. Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 815–823.

Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online. Association for Computational Linguistics.

Christophe Servan, Alexandre Berard, Zied Elloumi, Herve Blanchon, and Laurent Besacier. 2016. Word2Vec vs DBnary: Augmenting METEOR using vector representations or lexical resources. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1159–1168, Osaka, Japan. The COLING 2016 Organizing Committee.

Hiroki Shimanaka, Tomoyuki Kajiwara, and Mamoru Komachi. 2018. RUSE: Regressor using sentence embeddings for automatic machine translation evaluation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 751–758, Belgium, Brussels. Association for Computational Linguistics.

Hiroki Shimanaka, Tomoyuki Kajiwara, and Mamoru Komachi. 2019. Machine Translation Evaluation with BERT Regressor. arXiv preprint arXiv:1907.12679.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of Association for Machine Translation in the Americas, pages 223–231.

Lucia Specia, Frederic Blain, Varvara Logacheva, Ramon Astudillo, and Andre F. T. Martins. 2018. Findings of the WMT 2018 shared task on quality estimation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 689–709, Belgium, Brussels. Association for Computational Linguistics.

Lucia Specia, Kim Harris, Frederic Blain, Aljoscha Burchardt, Viviven Macketanz, Inguna Skadina, Matteo Negri, and Marco Turchi. 2017. Translation quality and productivity: A study on rich morphology languages. In Machine Translation Summit XVI, pages 55–71, Nagoya, Japan.

Kosuke Takahashi, Katsuhito Sudoh, and Satoshi Nakamura. 2020. Automatic machine translation evaluation using source language inputs and cross-lingual language model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3553–3558, Online. Association for Computational Linguistics.

Andre Tattar and Mark Fishel. 2017. bleu2vec: the painfully familiar metric on continuous vector space steroids. In Proceedings of the Second Conference on Machine Translation, pages 619–622, Copenhagen, Denmark. Association for Computational Linguistics.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.

Wei Zhao, Goran Glavas, Maxime Peyrard, Yang Gao, Robert West, and Steffen Eger. 2020. On the limitations of cross-lingual encoders as exposed by reference-free machine translation evaluation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1656–1671, Online. Association for Computational Linguistics.

Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, and Steffen Eger. 2019. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 563–578, Hong Kong, China. Association for Computational Linguistics.


A Appendices

In Table 5 we list the hyper-parameters used to train our models. Before initializing these models, a random seed was set to 3 in all libraries that perform "random" operations (torch, numpy, random and cuda).


Table 5 Hyper-parameters used in our COMET framework to train the presented models

Hyper-parameter              COMET (Est-HTER/MQM)        COMET-RANK
Encoder Model                XLM-RoBERTa (base)          XLM-RoBERTa (base)
Optimizer                    Adam (default parameters)   Adam (default parameters)
No. frozen epochs            1                           0
Learning rate                3e-05 and 1e-05             1e-05
Batch size                   16                          16
Loss function                MSE                         Triplet Margin (ε = 1.0)
Layer-wise dropout           0.1                         0.1
FP precision                 32                          32
Feed-Forward hidden units    2304, 1152                  –
Feed-Forward activations     Tanh                        –
Feed-Forward dropout         0.1                         –

Table 6 Statistics for the QT21 corpus

                          en-de    en-cs    en-lv    de-en
Total tuples              54000    42000    35474    41998
Avg. tokens (reference)   17.80    15.56    16.42    17.71
Avg. tokens (source)      16.70    17.37    18.39    17.18
Avg. tokens (MT)          17.65    15.64    16.42    17.78

Table 7 Statistics for the WMT 2017 DARR corpus

                          en-cs    en-de    en-fi    en-lv    en-tr
Total tuples              32810    6454     3270     3456     247
Avg. tokens (reference)   19.70    22.15    15.59    21.42    17.57
Avg. tokens (source)      22.37    23.41    21.73    26.08    22.51
Avg. tokens (MT)          19.45    22.58    16.06    22.18    17.25


Table 8: Statistics for the WMT 2019 DARR into-English language pairs.

                          de-en    fi-en    gu-en    kk-en    lt-en    ru-en    zh-en
Total tuples              85365    32179    20110    9728     21862    39852    31070
Avg. tokens (reference)   20.29    18.55    17.64    20.36    26.55    21.74    42.89
Avg. tokens (source)      18.44    12.49    21.92    16.32    20.32    18.00    7.57
Avg. tokens (MT)          20.22    17.76    17.02    19.68    25.25    21.80    39.70

Table 9: Statistics for the WMT 2019 DARR from-English and no-English language pairs.

                          en-cs    en-de    en-fi    en-gu    en-kk    en-lt    en-ru    en-zh    fr-de    de-cs    de-fr
Total tuples              27178    99840    31820    11355    18172    17401    24334    18658    1369     23194    4862
Avg. tokens (reference)   22.92    25.65    20.12    33.32    18.89    21.00    24.79    9.25     22.68    22.27    27.32
Avg. tokens (source)      24.98    24.97    25.23    24.32    23.78    24.46    24.45    24.39    28.60    25.22    21.36
Avg. tokens (MT)          22.60    24.98    19.69    32.97    19.92    20.97    23.37    6.83     23.36    21.89    25.68


Table 10: MQM corpus (section 3.3) statistics.

                          en-nl    en-sv    en-ja    en-de    en-ru    en-es    en-fr    en-it    en-pt-br    en-tr    en-pt    en-es-latam
Total tuples              [per-pair counts not reliably recoverable from the extraction]
Avg. tokens (reference)   14.10    14.24    20.32    13.78    13.37    10.90    13.75    13.61    12.48       7.95     12.18    10.33
Avg. tokens (source)      14.23    15.31    13.69    13.76    13.94    11.23    12.85    14.22    12.46       10.36    13.45    12.33
Avg. tokens (MT)          13.66    13.91    17.84    13.41    13.19    10.88    13.59    13.02    12.19       7.99     12.21    10.17

Table 11: Statistics for the WMT 2018 DARR language pairs.

                          zh-en    en-zh    cs-en    fi-en    ru-en    tr-en    de-en    en-cs    en-de    en-et    en-fi    en-ru    en-tr    et-en
Total tuples              33357    28602    5110     15648    10404    8525     77811    5413     19711    32202    9809     22181    1358     56721
Avg. tokens (reference)   28.86    24.04    21.98    21.13    24.97    23.25    23.29    19.50    23.54    18.21    16.32    21.81    20.15    23.40
Avg. tokens (source)      23.86    28.27    18.67    15.03    21.37    18.80    21.95    22.67    24.82    23.47    22.82    25.24    24.37    18.15
Avg. tokens (MT)          27.45    14.94    21.79    20.46    25.25    22.80    22.64    19.73    23.74    18.37    17.15    21.86    19.61    23.52


[Table 12: Metrics performance over all and the top (10, 8, 6 and 4) MT systems for all from-English language pairs (en-cs, en-de, en-fi, en-gu, en-kk, en-lt, en-ru, en-zh). Each panel plots the Kendall Tau score against the number of top systems considered; the metrics compared are COMET-RANK, COMET-HTER, COMET-MQM, BLEU and BERTSCORE.]


[Table 13: Metrics performance over all and the top (10, 8, 6 and 4) MT systems for all into-English language pairs (de-en, fi-en, lt-en, ru-en, zh-en). Each panel plots the Kendall Tau score against the number of top systems considered; the metrics compared are COMET-RANK, COMET-HTER, COMET-MQM, BLEU, BERTSCORE and BLEURT.]

Page 4: COMET: A Neural Framework for MT Evaluation · 2020. 11. 11. · (Lavie and Denkowski,2009) remain popular as a means of evaluating MT systems due to their light-weight and fast computation

2688

24 Translation Ranking ModelOur Translation Ranking model (Figure 2) receivesas input a tuple χ = (s h+ hminus r) where h+ de-notes an hypothesis that was ranked higher thananother hypothesis hminus We then pass χ throughour cross-lingual encoder and pooling layer to ob-tain a sentence embedding for each segment in theχ Finally using the embeddings sh+hminus rwe compute the triplet margin loss (Schroff et al2015) in relation to the source and reference

L(χ) = L(sh+hminus) + L(rh+hminus) (2)

where

L(sh+hminus) =

max0 d(sh+) minus d(shminus) + ε(3)

L(rh+hminus) =

max0 d(rh+) minus d(rhminus) + ε(4)

d(uv) denotes the euclidean distance between uand v and ε is a margin Thus during training themodel optimizes the embedding space so the dis-tance between the anchors (s and r) and the ldquoworserdquohypothesis hminus is greater by at least ε than the dis-tance between the anchors and ldquobetterrdquo hypothesish+

During inference the described model receivesa triplet (s h r) with only one hypothesis Thequality score assigned to h is the harmonic meanbetween the distance to the source d(s h) and thedistance to the reference d(r h)

f(s h r) =2times d(r h)times d(s h)

d(r h) + d(s h)(5)

Finally we convert the resulting distance into asimilarity score bounded between 0 and 1 as fol-lows

f(s h r) =1

1 + f(s h r)(6)

3 Corpora

To demonstrate the effectiveness of our describedmodel architectures (section 2) we train three MTevaluation models where each model targets a dif-ferent type of human judgment To train thesemodels we use data from three different corporathe QT21 corpus the DARR from the WMT Met-rics shared task (2017 to 2019) and a proprietaryMQM annotated corpus

31 The QT21 corpusThe QT21 corpus is a publicly available3 datasetcontaining industry generated sentences from eitheran information technology or life sciences domains(Specia et al 2017) This corpus contains a totalof 173K tuples with source sentence respectivehuman-generated reference MT hypothesis (eitherfrom a phrase-based statistical MT or from a neu-ral MT) and post-edited MT (PE) The languagepairs represented in this corpus are English to Ger-man (en-de) Latvian (en-lt) and Czech (en-cs) andGerman to English (de-en)

The HTER score is obtained by computing thetranslation edit rate (TER) (Snover et al 2006) be-tween the MT hypothesis and the corresponding PEFinally after computing the HTER for each MTwe built a training dataset D = si hi ri yiNn=1where si denotes the source text hi denotes the MThypothesis ri the reference translation and yi theHTER score for the hypothesis hi In this mannerwe seek to learn a regression f(s h r) rarr y thatpredicts the human-effort required to correct thehypothesis by looking at the source hypothesisand reference (but not the post-edited hypothesis)

32 The WMT DARR corpusSince 2017 the organizers of the WMT NewsTranslation Shared Task (Barrault et al 2019) havecollected human judgements in the form of ad-equacy DAs (Graham et al 2013 2014 2017)These DAs are then mapped into relative rank-ings (DARR) (Ma et al 2019) The resultingdata for each year (2017-19) form a dataset D =si h+i h

minusi riNn=1 where h+i denotes a ldquobetterrdquo

hypothesis and hminusi denotes a ldquoworserdquo one Herewe seek to learn a function r(s h r) such that thescore assigned to h+i is strictly higher than the scoreassigned to hminusi (r(si h+i ri) gt r(si h

minusi ri))

This data4 contains a total of 24 high and low-resource language pairs such as Chinese to English(zh-en) and English to Gujarati (en-gu)

33 The MQM corpusThe MQM corpus is a proprietary internal databaseof MT-generated translations of customer support

3QT21 data httpslindatmffcuniczrepositoryxmluihandle11372LRT-2390

4The raw data for each year of the WMT Metrics sharedtask is publicly available in the results page (2019 ex-ample httpwwwstatmtorgwmt19resultshtml) Note however that in the README files it is high-lighted that this data is not well documented and the scriptsoccasionally require custom utilities that are not available

2689

chat messages that were annotated according to theguidelines set out in Burchardt and Lommel (2014)This data contains a total of 12K tuples cover-ing 12 language pairs from English to German(en-de) Spanish (en-es) Latin-American Span-ish (en-es-latam) French (en-fr) Italian (en-it)Japanese (en-ja) Dutch (en-nl) Portuguese (en-pt)Brazilian Portuguese (en-pt-br) Russian (en-ru)Swedish (en-sv) and Turkish (en-tr) Note that inthis corpus English is always seen as the source lan-guage but never as the target language Each tupleconsists of a source sentence a human-generatedreference a MT hypothesis and its MQM scorederived from error annotations by one (or more)trained annotators The MQM metric referred tothroughout this paper is an internal metric definedin accordance with the MQM framework (Lommelet al 2014) (MQM) Errors are annotated underan internal typology defined under three main er-ror types lsquoStylersquo lsquoFluencyrsquo and lsquoAccuracyrsquo OurMQM scores range from minusinfin to 100 and are de-fined as

MQM = 100minus IMinor + 5times IMajor + 10times ICrit

Sentence Lengthtimes 100(7)

where IMinor denotes the number of minor errorsIMajor the number of major errors and ICrit the num-ber of critical errors

Our MQM metric takes into account the sever-ity of the errors identified in the MT hypothesisleading to a more fine-grained metric than HTERor DA When used in our experiments these val-ues were divided by 100 and truncated at 0 Asin section 31 we constructed a training datasetD = si hi ri yiNn=1 where si denotes thesource text hi denotes the MT hypothesis ri thereference translation and yi the MQM score forthe hypothesis hi

4 Experiments

We train two versions of the Estimator model de-scribed in section 23 one that regresses on HTER(COMET-HTER) trained with the QT21 corpus andanother that regresses on our proprietary implemen-tation of MQM (COMET-MQM) trained with ourinternal MQM corpus For the Translation Rankingmodel described in section 24 we train with theWMT DARR corpus from 2017 and 2018 (COMET-RANK) In this section we introduce the training

setup for these models and corresponding evalua-tion setup

41 Training SetupThe two versions of the Estimators (COMET-HTERMQM) share the same training setup andhyper-parameters (details are included in the Ap-pendices) For training we load the pretrainedencoder and initialize both the pooling layer andthe feed-forward regressor Whereas the layer-wisescalars α from the pooling layer are initially setto zero the weights from the feed-forward are ini-tialized randomly During training we divide themodel parameters into two groups the encoder pa-rameters that include the encoder model and thescalars from α and the regressor parameters thatinclude the parameters from the top feed-forwardnetwork We apply gradual unfreezing and discrim-inative learning rates (Howard and Ruder 2018)meaning that the encoder model is frozen for oneepoch while the feed-forward is optimized with alearning rate of 3eminus5 After the first epoch theentire model is fine-tuned but the learning rate forthe encoder parameters is set to 1eminus5 in order toavoid catastrophic forgetting

In contrast with the two Estimators for theCOMET-RANK model we fine-tune from the outsetFurthermore since this model does not add anynew parameters on top of XLM-RoBERTa (base)other than the layer scalars α we use one singlelearning rate of 1eminus5 for the entire model

42 Evaluation SetupWe use the test data and setup of the WMT 2019Metrics Shared Task (Ma et al 2019) in order tocompare the COMET models with the top perform-ing submissions of the shared task and other recentstate-of-the-art metrics such as BERTSCORE andBLEURT5 The evaluation method used is the of-ficial Kendallrsquos Tau-like formulation τ from theWMT 2019 Metrics Shared Task (Ma et al 2019)defined as

τ =Concordantminus DiscordantConcordant + Discordant

(8)

where Concordant is the number of times a metricassigns a higher score to the ldquobetterrdquo hypothesish+ and Discordant is the number of times a metricassigns a higher score to the ldquoworserdquo hypothesis

5To ease future research we will also provide within ourframework detailed instructions and scripts to run other met-rics such as CHRF BLEU BERTSCORE and BLEURT

2690

Table 1 Kendallrsquos Tau (τ ) correlations on language pairs with English as source for the WMT19 Metrics DARRcorpus For BERTSCORE we report results with the default encoder model for a complete comparison but alsowith XLM-RoBERTa (base) for fairness with our models The values reported for YiSi-1 are taken directly fromthe shared task paper (Ma et al 2019)

Metric en-cs en-de en-fi en-gu en-kk en-lt en-ru en-zhBLEU 0364 0248 0395 0463 0363 0333 0469 0235CHRF 0444 0321 0518 0548 0510 0438 0548 0241YISI-1 0475 0351 0537 0551 0546 0470 0585 0355BERTSCORE (default) 0500 0363 0527 0568 0540 0464 0585 0356BERTSCORE (xlmr-base) 0503 0369 0553 0584 0536 0514 0599 0317COMET-HTER 0524 0383 0560 0552 0508 0577 0539 0380COMET-MQM 0537 0398 0567 0564 0534 0574 0615 0378COMET-RANK 0603 0427 0664 0611 0693 0665 0580 0449

hminus or the scores assigned to both hypotheses is thesame

As mentioned in the findings of (Ma et al 2019)segment-level correlations of all submitted metricswere frustratingly low Furthermore all submit-ted metrics exhibited a dramatic lack of ability tocorrectly rank strong MT systems To evaluatewhether our new MT evaluation models better ad-dress this issue we followed the described evalu-ation setup used in the analysis presented in (Maet al 2019) where correlation levels are examinedfor portions of the DARR data that include only thetop 10 8 6 and 4 MT systems

5 Results

51 From English into X

Table 1 shows results for all eight language pairswith English as source We contrast our threeCOMET models against baseline metrics such asBLEU and CHRF the 2019 task winning metricYISI-1 as well as the more recent BERTSCOREWe observe that across the board our three modelstrained with the COMET framework outperformoften by significant margins all other metrics OurDARR Ranker model outperforms the two Estima-tors in seven out of eight language pairs Also eventhough the MQM Estimator is trained on only 12Kannotated segments it performs roughly on parwith the HTER Estimator for most language-pairsand outperforms all the other metrics in en-ru

52 From X into English

Table 2 shows results for the seven to-English lan-guage pairs Again we contrast our three COMET

models against baseline metrics such as BLEU andCHRF the 2019 task winning metric YISI-1 as

well as the recently published metrics BERTSCORE

and BLEURT As in Table 1 the DARR model showsstrong correlations with human judgements out-performing the recently proposed English-specificBLEURT metric in five out of seven language pairsAgain the MQM Estimator shows surprising strongresults despite the fact that this model was trainedwith data that did not include English as a targetAlthough the encoder used in our trained models ishighly multilingual we hypothesise that this pow-erful ldquozero-shotrdquo result is due to the inclusion ofthe source in our models

53 Language pairs not involving EnglishAll three of our COMET models were trained ondata involving English (either as a source or as atarget) Nevertheless to demonstrate that our met-rics generalize well we test them on the three WMT2019 language pairs that do not include English ineither source or target As can be seen in Table3 our results are consistent with observations inTables 1 and 2

54 Robustness to High-Quality MTFor analysis we use the DARR corpus from the2019 Shared Task and evaluate on the subset ofthe data from the top performing MT systems foreach language pair We included language pairsfor which we could retrieve data for at least tendifferent MT systems (ie all but kk-en and gu-en)We contrast against the strong recently proposedBERTSCORE and BLEURT with BLEU as a base-line Results are presented in Figure 3 For lan-guage pairs where English is the target our threemodels are either better or competitive with all oth-ers where English is the source we note that ingeneral our metrics exceed the performance of oth-

2691

Table 2 Kendallrsquos Tau (τ ) correlations on language pairs with English as a target for the WMT19 Metrics DARRcorpus As for BERTSCORE for BLEURT we report results for two models the base model which is comparablein size with the encoder we used and the large model that is twice the size

Metric de-en fi-en gu-en kk-en lt-en ru-en zh-enBLEU 0053 0236 0194 0276 0249 0177 0321CHRF 0123 0292 0240 0323 0304 0115 0371YISI-1 0164 0347 0312 0440 0376 0217 0426BERTSCORE (default) 0190 0354 0292 0351 0381 0221 0432BERTSCORE (xlmr-base) 0171 0335 0295 0354 0356 0202 0412BLEURT (base-128) 0171 0372 0302 0383 0387 0218 0417BLEURT (large-512) 0174 0374 0313 0372 0388 0220 0436COMET-HTER 0185 0333 0274 0297 0364 0163 0391COMET-MQM 0207 0343 0282 0339 0368 0187 0422COMET-RANK 0202 0399 0341 0358 0407 0180 0445

Table 3 Kendallrsquos Tau (τ ) correlations on languagepairs not involving English for the WMT19 MetricsDARR corpus

Metric de-cs de-fr fr-deBLEU 0222 0226 0173CHRF 0341 0287 0274YISI-1 0376 0349 0310BERTSCORE (default) 0358 0329 0300BERTSCORE (xlmr-base) 0386 0336 0309COMET-HTER 0358 0397 0315COMET-MQM 0386 0367 0296COMET-RANK 0389 0444 0331

ers Even the MQM Estimator trained with only12K segments is competitive which highlights thepower of our proposed framework

55 The Importance of the Source

To shed some light on the actual value and contri-bution of the source language input in our modelsrsquoability to learn accurate predictions we trained twoversions of our DARR Ranker model one that usesonly the reference and another that uses both refer-ence and source Both models were trained usingthe WMT 2017 corpus that only includes languagepairs from English (en-de en-cs en-fi en-tr) Inother words while English was never observed asa target language during training for both variantsof the model the training of the second variant in-cludes English source embeddings We then testedthese two model variants on the WMT 2018 corpusfor these language pairs and for the reversed di-rections (with the exception of en-cs because cs-endoes not exist for WMT 2018) The results in Table

All 10 8 6 40

01

02

03

Top models from X to English

Ken

dall

Tau

(τ)

COMET-RANK BLEUCOMET-MQM BERTSCORECOMET-HTER BLEURT

All 10 8 6 40

02

04

06

Top models from English to X

Ken

dall

Tau

(τ)

Figure 3 Metrics performance over all and the top (108 6 and 4) MT systems

4 clearly show that for the translation ranking archi-tecture including the source improves the overallcorrelation with human judgments Furthermorethe inclusion of the source exposed the second vari-ant of the model to English embeddings which is

2692

Table 4 Comparison between COMET-RANK (section 24) and a reference-only version thereof on WMT18 dataBoth models were trained with WMT17 which means that the reference-only model is never exposed to Englishduring training

Metric en-cs en-de en-fi en-tr cs-en de-en fi-en tr-enCOMET-RANK (ref only) 0660 0764 0630 0539 0249 0390 0159 0128COMET-RANK 0711 0799 0671 0563 0356 0542 0278 0260∆τ 0051 0035 0041 0024 0107 0155 0119 0132

reflected in a higher ∆τ for the language pairs withEnglish as a target

6 Reproducibility

We will release both the code-base of the COMET

framework and the trained MT evaluation modelsdescribed in this paper to the research communityupon publication along with the detailed scriptsrequired in order to run all reported baselines6 Allthe models reported in this paper were trained on asingle Tesla T4 (16GB) GPU Moreover our frame-work builds on top of PyTorch Lightning (Falcon2019) a lightweight PyTorch wrapper that wascreated for maximal flexibility and reproducibility

7 Related Work

Classic MT evaluation metrics are commonly char-acterized as n-gram matching metrics becauseusing hand-crafted features they estimate MT qual-ity by counting the number and fraction of n-grams that appear simultaneous in a candidatetranslation hypothesis and one or more human-references Metrics such as BLEU (Papineni et al2002) METEOR (Lavie and Denkowski 2009)and CHRF (Popovic 2015) have been widely stud-ied and improved (Koehn et al 2007 Popovic2017 Denkowski and Lavie 2011 Guo and Hu2019) but by design they usually fail to recognizeand capture semantic similarity beyond the lexicallevel

In recent years word embeddings (Mikolovet al 2013 Pennington et al 2014 Peters et al2018 Devlin et al 2019) have emerged as a com-monly used alternative to n-gram matching forcapturing word semantics similarity Embedding-based metrics like METEOR-VECTOR (Servanet al 2016) BLEU2VEC (Tattar and Fishel 2017)YISI-1 (Lo 2019) MOVERSCORE (Zhao et al2019) and BERTSCORE (Zhang et al 2020) createsoft-alignments between reference and hypothesis

6These will be hosted at httpsgithubcomUnbabelCOMET

in an embedding space and then compute a scorethat reflects the semantic similarity between thosesegments However human judgements such asDA and MQM capture much more than just se-mantic similarity resulting in a correlation upper-bound between human judgements and the scoresproduced by such metrics

Learnable metrics (Shimanaka et al 2018Mathur et al 2019 Shimanaka et al 2019) at-tempt to directly optimize the correlation with hu-man judgments and have recently shown promis-ing results BLEURT (Sellam et al 2020) a learn-able metric based on BERT (Devlin et al 2019)claims state-of-the-art performance for the last 3years of the WMT Metrics Shared task BecauseBLEURT builds on top of English-BERT (Devlinet al 2019) it can only be used when English is thetarget language which limits its applicability Alsoto the best of our knowledge all the previouslyproposed learnable metrics have focused on opti-mizing DA which due to a scarcity of annotatorscan prove inherently noisy (Ma et al 2019)

Reference-less MT evaluation also known asQuality Estimation (QE) has historically often re-gressed on HTER for segment-level evaluation (Bo-jar et al 2013 2014 2015 2016 2017a) Morerecently MQM has been used for document-levelevaluation (Specia et al 2018 Fonseca et al2019) By leveraging highly multilingual pre-trained encoders such as multilingual BERT (De-vlin et al 2019) and XLM (Conneau and Lam-ple 2019) QE systems have been showing aus-picious correlations with human judgements (Ke-pler et al 2019a) Concurrently the OpenKiwiframework (Kepler et al 2019b) has made it easierfor researchers to push the field forward and buildstronger QE models

8 Conclusions and Future Work

In this paper we present COMET a novel neu-ral framework for training MT evaluation modelsthat can serve as automatic metrics and easily be

2693

adapted and optimized to different types of humanjudgements of MT quality

To showcase the effectiveness of our frameworkwe sought to address the challenges reported in the2019 WMT Metrics Shared Task (Ma et al 2019)We trained three distinct models which achieve newstate-of-the-art results for segment-level correlationwith human judgments and show promising abilityto better differentiate high-performing systems

One of the challenges of leveraging the power ofpretrained models is the burdensome weight of pa-rameters and inference time A primary avenue forfuture work on COMET will look at the impact ofmore compact solutions such as DistilBERT (Sanhet al 2019)

Additionally whilst we outline the potential im-portance of the source text above we note that ourCOMET-RANK model weighs source and referencedifferently during inference but equally in its train-ing loss function Future work will investigate theoptimality of this formulation and further examinethe interdependence of the different inputs

Acknowledgments

We are grateful to Andre Martins Austin MatthewsFabio Kepler Daan Van Stigt Miguel Vera andthe reviewers for their valuable feedback and dis-cussions This work was supported in part by theP2020 Program through projects MAIA and Unba-bel4EU supervised by ANI under contract num-bers 045909 and 042671 respectively

References

Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597–610.

Loïc Barrault, Ondrej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics.

Ondrej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2013. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 1–44, Sofia, Bulgaria. Association for Computational Linguistics.

Ondrej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Ales Tamchyna. 2014. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 12–58, Baltimore, Maryland, USA. Association for Computational Linguistics.

Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Raphael Rubino, Lucia Specia, and Marco Turchi. 2017a. Findings of the 2017 conference on machine translation (WMT17). In Proceedings of the Second Conference on Machine Translation, pages 169–214, Copenhagen, Denmark. Association for Computational Linguistics.

Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurelie Neveol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016. Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 131–198, Berlin, Germany. Association for Computational Linguistics.

Ondrej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. 2015. Findings of the 2015 workshop on statistical machine translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 1–46, Lisbon, Portugal. Association for Computational Linguistics.

Ondrej Bojar, Yvette Graham, and Amir Kamran. 2017b. Results of the WMT17 metrics shared task. In Proceedings of the Second Conference on Machine Translation, pages 489–513, Copenhagen, Denmark. Association for Computational Linguistics.

Aljoscha Burchardt and Arle Lommel. 2014. Practical Guidelines for the Use of MQM in Scientific Research on Translation Quality. (access date: 2020-05-26).

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.

Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 7059–7069. Curran Associates, Inc.

Michael Denkowski and Alon Lavie. 2011. Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 85–91, Edinburgh, Scotland. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

W. A. Falcon. 2019. PyTorch Lightning: The lightweight PyTorch wrapper for high-performance AI research. GitHub.

Erick Fonseca, Lisa Yankovskaya, Andre F. T. Martins, Mark Fishel, and Christian Federmann. 2019. Findings of the WMT 2019 shared tasks on quality estimation. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), pages 1–10, Florence, Italy. Association for Computational Linguistics.

Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2013. Continuous measurement scales in human evaluation of machine translation. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 33–41, Sofia, Bulgaria. Association for Computational Linguistics.

Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2014. Is machine translation getting better over time? In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 443–451, Gothenburg, Sweden. Association for Computational Linguistics.

Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2017. Can machine translation systems be evaluated by the crowd alone? Natural Language Engineering, 23(1):3–30.

Yinuo Guo and Junfeng Hu. 2019. Meteor++ 2.0: Adopt syntactic level paraphrase knowledge into machine translation evaluation. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 501–506, Florence, Italy. Association for Computational Linguistics.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.

Fabio Kepler, Jonay Trenous, Marcos Treviso, Miguel Vera, Antonio Gois, M. Amin Farajian, Antonio V. Lopes, and Andre F. T. Martins. 2019a. Unbabel's participation in the WMT19 translation quality estimation shared task. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), pages 78–84, Florence, Italy. Association for Computational Linguistics.

Fabio Kepler, Jonay Trenous, Marcos Treviso, Miguel Vera, and Andre F. T. Martins. 2019b. OpenKiwi: An open source framework for quality estimation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 117–122, Florence, Italy. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics.

Dan Kondratyuk and Milan Straka. 2019. 75 languages, 1 model: Parsing universal dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2779–2795, Hong Kong, China. Association for Computational Linguistics.

Alon Lavie and Michael Denkowski. 2009. The METEOR metric for automatic evaluation of machine translation. Machine Translation, 23:105–115.

Chi-kiu Lo. 2019. YiSi - a unified semantic MT quality evaluation and estimation metric for languages with different levels of available resources. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 507–513, Florence, Italy. Association for Computational Linguistics.

Arle Lommel, Aljoscha Burchardt, and Hans Uszkoreit. 2014. Multidimensional quality metrics (MQM): A framework for declaring and describing translation quality metrics. Tradumàtica: tecnologies de la traducció, 0:455–463.

Qingsong Ma, Ondrej Bojar, and Yvette Graham. 2018. Results of the WMT18 metrics shared task: Both characters and embeddings achieve good performance. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 671–688, Belgium, Brussels. Association for Computational Linguistics.

Qingsong Ma, Johnny Wei, Ondrej Bojar, and Yvette Graham. 2019. Results of the WMT19 metrics shared task: Segment-level and strong MT systems pose big challenges. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 62–90, Florence, Italy. Association for Computational Linguistics.

Nitika Mathur, Timothy Baldwin, and Trevor Cohn. 2019. Putting evaluation in context: Contextual embeddings improve machine translation evaluation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2799–2808, Florence, Italy. Association for Computational Linguistics.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. Association for Computational Linguistics.

Maja Popovic. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics.

Maja Popovic. 2017. chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618, Copenhagen, Denmark. Association for Computational Linguistics.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

F. Schroff, D. Kalenichenko, and J. Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 815–823.

Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online. Association for Computational Linguistics.

Christophe Servan, Alexandre Berard, Zied Elloumi, Herve Blanchon, and Laurent Besacier. 2016. Word2Vec vs DBnary: Augmenting METEOR using vector representations or lexical resources. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1159–1168, Osaka, Japan. The COLING 2016 Organizing Committee.

Hiroki Shimanaka, Tomoyuki Kajiwara, and Mamoru Komachi. 2018. RUSE: Regressor using sentence embeddings for automatic machine translation evaluation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 751–758, Belgium, Brussels. Association for Computational Linguistics.

Hiroki Shimanaka, Tomoyuki Kajiwara, and Mamoru Komachi. 2019. Machine Translation Evaluation with BERT Regressor. arXiv preprint arXiv:1907.12679.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of Association for Machine Translation in the Americas, pages 223–231.

Lucia Specia, Frederic Blain, Varvara Logacheva, Ramon Astudillo, and Andre F. T. Martins. 2018. Findings of the WMT 2018 shared task on quality estimation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 689–709, Belgium, Brussels. Association for Computational Linguistics.

Lucia Specia, Kim Harris, Frederic Blain, Aljoscha Burchardt, Vivien Macketanz, Inguna Skadina, Matteo Negri, and Marco Turchi. 2017. Translation quality and productivity: A study on rich morphology languages. In Machine Translation Summit XVI, pages 55–71, Nagoya, Japan.

Kosuke Takahashi, Katsuhito Sudoh, and Satoshi Nakamura. 2020. Automatic machine translation evaluation using source language inputs and cross-lingual language model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3553–3558, Online. Association for Computational Linguistics.

Andre Tattar and Mark Fishel. 2017. bleu2vec: the painfully familiar metric on continuous vector space steroids. In Proceedings of the Second Conference on Machine Translation, pages 619–622, Copenhagen, Denmark. Association for Computational Linguistics.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.

Wei Zhao, Goran Glavas, Maxime Peyrard, Yang Gao, Robert West, and Steffen Eger. 2020. On the limitations of cross-lingual encoders as exposed by reference-free machine translation evaluation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1656–1671, Online. Association for Computational Linguistics.

Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, and Steffen Eger. 2019. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 563–578, Hong Kong, China. Association for Computational Linguistics.

A Appendices

In Table 5 we list the hyper-parameters used to train our models. Before initializing these models, a random seed was set to 3 in all libraries that perform "random" operations (torch, numpy, random and cuda).
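For reference, the snippet below is a minimal sketch (not the released code) of how such a seeding step can be reproduced in Python; the helper name is ours.

    import random

    import numpy as np
    import torch

    def set_random_seed(seed: int = 3) -> None:
        # Seed every library that performs "random" operations.
        random.seed(seed)                  # Python's built-in RNG
        np.random.seed(seed)               # NumPy
        torch.manual_seed(seed)            # PyTorch (CPU)
        torch.cuda.manual_seed_all(seed)   # PyTorch (all CUDA devices)

    set_random_seed(3)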


Table 5: Hyper-parameters used in our COMET framework to train the presented models.

Hyper-parameter              COMET (Est-HTER/MQM)        COMET-RANK
Encoder Model                XLM-RoBERTa (base)          XLM-RoBERTa (base)
Optimizer                    Adam (default parameters)   Adam (default parameters)
No. frozen epochs            1                           0
Learning rate                3e-05 and 1e-05             1e-05
Batch size                   16                          16
Loss function                MSE                         Triplet Margin (ε = 1.0)
Layer-wise dropout           0.1                         0.1
FP precision                 32                          32
Feed-Forward hidden units    2304, 1152                  –
Feed-Forward activations     Tanh                        –
Feed-Forward dropout         0.1                         –
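As an illustration of the ranking loss named above, the sketch below applies PyTorch's triplet margin loss with ε = 1.0 to randomly generated sentence embeddings; the tensor shapes and the anchor/positive/negative naming are assumptions made for this example, not the actual COMET implementation.

    import torch
    import torch.nn as nn

    batch_size, dim = 16, 768
    # Hypothetical embeddings: anchor (source or reference), positive (the
    # "better" hypothesis) and negative (the "worse" hypothesis).
    anchor = torch.randn(batch_size, dim, requires_grad=True)
    positive = torch.randn(batch_size, dim, requires_grad=True)
    negative = torch.randn(batch_size, dim, requires_grad=True)

    # Triplet margin loss with margin (epsilon) = 1.0, as listed in Table 5.
    triplet_loss = nn.TripletMarginLoss(margin=1.0)
    loss = triplet_loss(anchor, positive, negative)
    loss.backward()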

Table 6: Statistics for the QT21 corpus.

                        en-de    en-cs    en-lv    de-en
Total tuples            54000    42000    35474    41998
Avg. tokens (reference) 17.80    15.56    16.42    17.71
Avg. tokens (source)    16.70    17.37    18.39    17.18
Avg. tokens (MT)        17.65    15.64    16.42    17.78

Table 7: Statistics for the WMT 2017 DARR corpus.

                        en-cs    en-de    en-fi    en-lv    en-tr
Total tuples            32810    6454     3270     3456     247
Avg. tokens (reference) 19.70    22.15    15.59    21.42    17.57
Avg. tokens (source)    22.37    23.41    21.73    26.08    22.51
Avg. tokens (MT)        19.45    22.58    16.06    22.18    17.25


Table 8: Statistics for the WMT 2019 DARR into-English language pairs.

                        de-en   fi-en   gu-en   kk-en   lt-en   ru-en   zh-en
Total tuples            85365   32179   20110   9728    21862   39852   31070
Avg. tokens (reference) 20.29   18.55   17.64   20.36   26.55   21.74   42.89
Avg. tokens (source)    18.44   12.49   21.92   16.32   20.32   18.00   75.7
Avg. tokens (MT)        20.22   17.76   17.02   19.68   25.25   21.80   39.70

Table 9: Statistics for the WMT 2019 DARR from-English and no-English language pairs.

                        en-cs   en-de   en-fi   en-gu   en-kk   en-lt   en-ru   en-zh   fr-de   de-cs   de-fr
Total tuples            27178   99840   31820   11355   18172   17401   24334   18658   1369    23194   4862
Avg. tokens (reference) 22.92   25.65   20.12   33.32   18.89   21.00   24.79   92.5    22.68   22.27   27.32
Avg. tokens (source)    24.98   24.97   25.23   24.32   23.78   24.46   24.45   24.39   28.60   25.22   21.36
Avg. tokens (MT)        22.60   24.98   19.69   32.97   19.92   20.97   23.37   68.3    23.36   21.89   25.68

Table 10: MQM corpus (section 3.3) statistics.

                        en-nl  en-sv  en-ja  en-de  en-ru  en-es  en-fr  en-it  en-pt-br  en-tr  en-pt  en-es-latam
Total tuples            2447   970    1590   2756   1043   259    1474   812    504       370    91     6
Avg. tokens (reference) 14.10  14.24  20.32  13.78  13.37  10.90  13.75  13.61  12.48     7.95   12.18  10.33
Avg. tokens (source)    14.23  15.31  13.69  13.76  13.94  11.23  12.85  14.22  12.46     10.36  13.45  12.33
Avg. tokens (MT)        13.66  13.91  17.84  13.41  13.19  10.88  13.59  13.02  12.19     7.99   12.21  10.17

Table 11: Statistics for the WMT 2018 DARR language pairs.

                        zh-en  en-zh  cs-en  fi-en  ru-en  tr-en  de-en  en-cs  en-de  en-et  en-fi  en-ru  en-tr  et-en
Total tuples            33357  28602  5110   15648  10404  8525   77811  5413   19711  32202  9809   22181  13585  6721
Avg. tokens (reference) 28.86  24.04  21.98  21.13  24.97  23.25  23.29  19.50  23.54  18.21  16.32  21.81  20.15  23.40
Avg. tokens (source)    23.86  28.27  18.67  15.03  21.37  18.80  21.95  22.67  24.82  23.47  22.82  25.24  24.37  18.15
Avg. tokens (MT)        27.45  14.94  21.79  20.46  25.25  22.80  22.64  19.73  23.74  18.37  17.15  21.86  19.61  23.52

[Per-language-pair plots of Kendall's Tau scores over all and the top (10, 8, 6 and 4) MT systems for en-cs, en-de, en-fi, en-gu, en-kk, en-lt, en-ru and en-zh.]

Table 12: Metrics performance over all and the top (10, 8, 6 and 4) MT systems for all from-English language pairs. The color scheme is as follows: COMET-RANK, COMET-HTER, COMET-MQM, BLEU, BERTSCORE.

[Per-language-pair plots of Kendall's Tau scores over all and the top (10, 8, 6 and 4) MT systems for de-en, fi-en, lt-en, ru-en and zh-en.]

Table 13: Metrics performance over all and the top (10, 8, 6 and 4) MT systems for all into-English language pairs. The color scheme is as follows: COMET-RANK, COMET-HTER, COMET-MQM, BLEU, BERTSCORE, BLEURT.


chat messages that were annotated according to the guidelines set out in Burchardt and Lommel (2014). This data contains a total of 12K tuples covering 12 language pairs from English to: German (en-de), Spanish (en-es), Latin-American Spanish (en-es-latam), French (en-fr), Italian (en-it), Japanese (en-ja), Dutch (en-nl), Portuguese (en-pt), Brazilian Portuguese (en-pt-br), Russian (en-ru), Swedish (en-sv), and Turkish (en-tr). Note that in this corpus English is always seen as the source language, but never as the target language. Each tuple consists of a source sentence, a human-generated reference, a MT hypothesis, and its MQM score derived from error annotations by one (or more) trained annotators. The MQM metric referred to throughout this paper is an internal metric defined in accordance with the MQM framework (Lommel et al., 2014) (MQM). Errors are annotated under an internal typology defined under three main error types: 'Style', 'Fluency' and 'Accuracy'. Our MQM scores range from −∞ to 100 and are defined as:

\mathrm{MQM} = 100 - \frac{I_{\text{Minor}} + 5 \times I_{\text{Major}} + 10 \times I_{\text{Crit}}}{\text{Sentence Length}} \times 100 \quad (7)

where I_{\text{Minor}} denotes the number of minor errors, I_{\text{Major}} the number of major errors, and I_{\text{Crit}} the number of critical errors.

Our MQM metric takes into account the severity of the errors identified in the MT hypothesis, leading to a more fine-grained metric than HTER or DA. When used in our experiments, these values were divided by 100 and truncated at 0. As in section 3.1, we constructed a training dataset D = \{s_i, h_i, r_i, y_i\}_{n=1}^{N}, where s_i denotes the source text, h_i denotes the MT hypothesis, r_i the reference translation, and y_i the MQM score for the hypothesis h_i.
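As a concrete illustration (our own sketch, not the internal Unbabel implementation), the following Python computes the MQM score of Equation 7 from the error counts and then applies the scaling used for training (divide by 100 and truncate at 0):

    def mqm_score(n_minor: int, n_major: int, n_critical: int, sentence_length: int) -> float:
        # Equation 7: minor/major/critical errors are weighted 1, 5 and 10.
        penalty = n_minor + 5 * n_major + 10 * n_critical
        return 100 - (penalty / sentence_length) * 100

    def mqm_training_target(score: float) -> float:
        # Scaling used in our experiments: divide by 100 and truncate at 0.
        return max(score / 100.0, 0.0)

    # Example: a 20-token hypothesis with 2 minor errors and 1 major error.
    raw = mqm_score(n_minor=2, n_major=1, n_critical=0, sentence_length=20)  # 65.0
    target = mqm_training_target(raw)                                        # 0.65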

4 Experiments

We train two versions of the Estimator model described in section 2.3: one that regresses on HTER (COMET-HTER), trained with the QT21 corpus, and another that regresses on our proprietary implementation of MQM (COMET-MQM), trained with our internal MQM corpus. For the Translation Ranking model described in section 2.4, we train with the WMT DARR corpus from 2017 and 2018 (COMET-RANK). In this section we introduce the training setup for these models and the corresponding evaluation setup.

4.1 Training Setup

The two versions of the Estimators (COMET-HTER/MQM) share the same training setup and hyper-parameters (details are included in the Appendices). For training, we load the pretrained encoder and initialize both the pooling layer and the feed-forward regressor. Whereas the layer-wise scalars α from the pooling layer are initially set to zero, the weights from the feed-forward are initialized randomly. During training, we divide the model parameters into two groups: the encoder parameters, that include the encoder model and the scalars from α; and the regressor parameters, that include the parameters from the top feed-forward network. We apply gradual unfreezing and discriminative learning rates (Howard and Ruder, 2018), meaning that the encoder model is frozen for one epoch while the feed-forward is optimized with a learning rate of 3e−5. After the first epoch, the entire model is fine-tuned but the learning rate for the encoder parameters is set to 1e−5 in order to avoid catastrophic forgetting.

In contrast with the two Estimators, for the COMET-RANK model we fine-tune from the outset. Furthermore, since this model does not add any new parameters on top of XLM-RoBERTa (base) other than the layer scalars α, we use one single learning rate of 1e−5 for the entire model.
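The schedule described above can be sketched in plain PyTorch as follows; the module structure (an encoder attribute standing in for XLM-R plus layer scalars, and a regressor head) is a placeholder for illustration and does not reflect the exact COMET code:

    import torch
    import torch.nn as nn

    class DummyEstimator(nn.Module):
        # Stand-in for the real model: `encoder` mimics the pretrained encoder
        # (plus layer scalars), `regressor` the feed-forward head.
        def __init__(self) -> None:
            super().__init__()
            self.encoder = nn.Linear(768, 768)
            self.regressor = nn.Sequential(nn.Linear(768, 256), nn.Tanh(), nn.Linear(256, 1))

    model = DummyEstimator()

    # Epoch 0: freeze the encoder and optimize only the regressor (lr = 3e-5).
    for p in model.encoder.parameters():
        p.requires_grad = False
    optimizer = torch.optim.Adam(model.regressor.parameters(), lr=3e-5)

    # ... train for one epoch ...

    # From epoch 1: unfreeze and use discriminative learning rates
    # (1e-5 for the encoder, 3e-5 for the regressor).
    for p in model.encoder.parameters():
        p.requires_grad = True
    optimizer = torch.optim.Adam([
        {"params": model.encoder.parameters(), "lr": 1e-5},
        {"params": model.regressor.parameters(), "lr": 3e-5},
    ])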

4.2 Evaluation Setup

We use the test data and setup of the WMT 2019 Metrics Shared Task (Ma et al., 2019) in order to compare the COMET models with the top performing submissions of the shared task and other recent state-of-the-art metrics such as BERTSCORE and BLEURT.⁵ The evaluation method used is the official Kendall's Tau-like formulation, τ, from the WMT 2019 Metrics Shared Task (Ma et al., 2019), defined as:

\tau = \frac{\text{Concordant} - \text{Discordant}}{\text{Concordant} + \text{Discordant}} \quad (8)

where Concordant is the number of times a metric assigns a higher score to the "better" hypothesis h+ and Discordant is the number of times a metric assigns a higher score to the "worse" hypothesis h−, or the scores assigned to both hypotheses are the same.

⁵ To ease future research, we will also provide, within our framework, detailed instructions and scripts to run other metrics such as CHRF, BLEU, BERTSCORE, and BLEURT.

Table 1: Kendall's Tau (τ) correlations on language pairs with English as source for the WMT19 Metrics DARR corpus. For BERTSCORE we report results with the default encoder model for a complete comparison, but also with XLM-RoBERTa (base) for fairness with our models. The values reported for YiSi-1 are taken directly from the shared task paper (Ma et al., 2019).

Metric                    en-cs   en-de   en-fi   en-gu   en-kk   en-lt   en-ru   en-zh
BLEU                      0.364   0.248   0.395   0.463   0.363   0.333   0.469   0.235
CHRF                      0.444   0.321   0.518   0.548   0.510   0.438   0.548   0.241
YISI-1                    0.475   0.351   0.537   0.551   0.546   0.470   0.585   0.355
BERTSCORE (default)       0.500   0.363   0.527   0.568   0.540   0.464   0.585   0.356
BERTSCORE (xlmr-base)     0.503   0.369   0.553   0.584   0.536   0.514   0.599   0.317
COMET-HTER                0.524   0.383   0.560   0.552   0.508   0.577   0.539   0.380
COMET-MQM                 0.537   0.398   0.567   0.564   0.534   0.574   0.615   0.378
COMET-RANK                0.603   0.427   0.664   0.611   0.693   0.665   0.580   0.449

As mentioned in the findings of (Ma et al., 2019), segment-level correlations of all submitted metrics were frustratingly low. Furthermore, all submitted metrics exhibited a dramatic lack of ability to correctly rank strong MT systems. To evaluate whether our new MT evaluation models better address this issue, we followed the described evaluation setup used in the analysis presented in (Ma et al., 2019), where correlation levels are examined for portions of the DARR data that include only the top 10, 8, 6, and 4 MT systems.
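The sketch below shows how this Kendall's-Tau-like statistic can be computed from a list of relative-ranking judgments; the (score of the better hypothesis, score of the worse hypothesis) tuple format is our own assumption for illustration:

    from typing import Iterable, Tuple

    def darr_kendall_tau(judgments: Iterable[Tuple[float, float]]) -> float:
        # Each judgment pairs the metric score of the "better" hypothesis h+
        # with the score of the "worse" hypothesis h-. Ties count as discordant,
        # following the WMT 2019 formulation in Equation 8.
        concordant = discordant = 0
        for score_better, score_worse in judgments:
            if score_better > score_worse:
                concordant += 1
            else:
                discordant += 1
        return (concordant - discordant) / (concordant + discordant)

    # Toy example: the metric ranks the better hypothesis higher in 3 of 4 pairs.
    print(darr_kendall_tau([(0.8, 0.5), (0.7, 0.6), (0.4, 0.9), (0.55, 0.20)]))  # 0.5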

5 Results

5.1 From English into X

Table 1 shows results for all eight language pairs with English as source. We contrast our three COMET models against baseline metrics such as BLEU and CHRF, the 2019 task winning metric YISI-1, as well as the more recent BERTSCORE. We observe that, across the board, our three models trained with the COMET framework outperform, often by significant margins, all other metrics. Our DARR Ranker model outperforms the two Estimators in seven out of eight language pairs. Also, even though the MQM Estimator is trained on only 12K annotated segments, it performs roughly on par with the HTER Estimator for most language-pairs, and outperforms all the other metrics in en-ru.

5.2 From X into English

Table 2 shows results for the seven to-English language pairs. Again, we contrast our three COMET models against baseline metrics such as BLEU and CHRF, the 2019 task winning metric YISI-1, as well as the recently published metrics BERTSCORE and BLEURT. As in Table 1, the DARR model shows strong correlations with human judgements, outperforming the recently proposed English-specific BLEURT metric in five out of seven language pairs. Again, the MQM Estimator shows surprisingly strong results despite the fact that this model was trained with data that did not include English as a target. Although the encoder used in our trained models is highly multilingual, we hypothesise that this powerful "zero-shot" result is due to the inclusion of the source in our models.

5.3 Language pairs not involving English

All three of our COMET models were trained on data involving English (either as a source or as a target). Nevertheless, to demonstrate that our metrics generalize well, we test them on the three WMT 2019 language pairs that do not include English in either source or target. As can be seen in Table 3, our results are consistent with observations in Tables 1 and 2.

5.4 Robustness to High-Quality MT

For analysis, we use the DARR corpus from the 2019 Shared Task and evaluate on the subset of the data from the top performing MT systems for each language pair. We included language pairs for which we could retrieve data for at least ten different MT systems (i.e. all but kk-en and gu-en). We contrast against the strong recently proposed BERTSCORE and BLEURT, with BLEU as a baseline. Results are presented in Figure 3. For language pairs where English is the target, our three models are either better or competitive with all others; where English is the source, we note that in general our metrics exceed the performance of others. Even the MQM Estimator, trained with only 12K segments, is competitive, which highlights the power of our proposed framework.

Table 2: Kendall's Tau (τ) correlations on language pairs with English as a target for the WMT19 Metrics DARR corpus. As for BERTSCORE, for BLEURT we report results for two models: the base model, which is comparable in size with the encoder we used, and the large model that is twice the size.

Metric                    de-en   fi-en   gu-en   kk-en   lt-en   ru-en   zh-en
BLEU                      0.053   0.236   0.194   0.276   0.249   0.177   0.321
CHRF                      0.123   0.292   0.240   0.323   0.304   0.115   0.371
YISI-1                    0.164   0.347   0.312   0.440   0.376   0.217   0.426
BERTSCORE (default)       0.190   0.354   0.292   0.351   0.381   0.221   0.432
BERTSCORE (xlmr-base)     0.171   0.335   0.295   0.354   0.356   0.202   0.412
BLEURT (base-128)         0.171   0.372   0.302   0.383   0.387   0.218   0.417
BLEURT (large-512)        0.174   0.374   0.313   0.372   0.388   0.220   0.436
COMET-HTER                0.185   0.333   0.274   0.297   0.364   0.163   0.391
COMET-MQM                 0.207   0.343   0.282   0.339   0.368   0.187   0.422
COMET-RANK                0.202   0.399   0.341   0.358   0.407   0.180   0.445

Table 3: Kendall's Tau (τ) correlations on language pairs not involving English for the WMT19 Metrics DARR corpus.

Metric                    de-cs   de-fr   fr-de
BLEU                      0.222   0.226   0.173
CHRF                      0.341   0.287   0.274
YISI-1                    0.376   0.349   0.310
BERTSCORE (default)       0.358   0.329   0.300
BERTSCORE (xlmr-base)     0.386   0.336   0.309
COMET-HTER                0.358   0.397   0.315
COMET-MQM                 0.386   0.367   0.296
COMET-RANK                0.389   0.444   0.331

5.5 The Importance of the Source

To shed some light on the actual value and contribution of the source language input in our models' ability to learn accurate predictions, we trained two versions of our DARR Ranker model: one that uses only the reference, and another that uses both reference and source. Both models were trained using the WMT 2017 corpus that only includes language pairs from English (en-de, en-cs, en-fi, en-tr). In other words, while English was never observed as a target language during training for both variants of the model, the training of the second variant includes English source embeddings. We then tested these two model variants on the WMT 2018 corpus for these language pairs and for the reversed directions (with the exception of en-cs because cs-en does not exist for WMT 2018).

[Figure 3: two panels of Kendall Tau (τ) scores over all and the top (10, 8, 6 and 4) MT systems, one for language pairs from X to English and one from English to X, comparing COMET-RANK, COMET-MQM, COMET-HTER, BLEU, BERTSCORE and BLEURT.]

Figure 3: Metrics performance over all and the top (10, 8, 6 and 4) MT systems.

The results in Table 4 clearly show that, for the translation ranking architecture, including the source improves the overall correlation with human judgments. Furthermore, the inclusion of the source exposed the second variant of the model to English embeddings, which is reflected in a higher Δτ for the language pairs with English as a target.

Table 4: Comparison between COMET-RANK (section 2.4) and a reference-only version thereof on WMT18 data. Both models were trained with WMT17, which means that the reference-only model is never exposed to English during training.

Metric                   en-cs   en-de   en-fi   en-tr   cs-en   de-en   fi-en   tr-en
COMET-RANK (ref. only)   0.660   0.764   0.630   0.539   0.249   0.390   0.159   0.128
COMET-RANK               0.711   0.799   0.671   0.563   0.356   0.542   0.278   0.260
Δτ                       0.051   0.035   0.041   0.024   0.107   0.155   0.119   0.132

6 Reproducibility

We will release both the code-base of the COMET framework and the trained MT evaluation models described in this paper to the research community upon publication, along with the detailed scripts required in order to run all reported baselines.⁶ All the models reported in this paper were trained on a single Tesla T4 (16GB) GPU. Moreover, our framework builds on top of PyTorch Lightning (Falcon, 2019), a lightweight PyTorch wrapper that was created for maximal flexibility and reproducibility.
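As a rough illustration of the kind of training loop this enables, the sketch below defines a minimal LightningModule for a regression-style estimator; the module, its dimensions and the data pipeline are assumptions made for this example and are not the released COMET code:

    import pytorch_lightning as pl
    import torch
    import torch.nn as nn

    class ToyEstimator(pl.LightningModule):
        # Minimal regression estimator: pooled sentence features -> MSE against
        # a human quality score.
        def __init__(self) -> None:
            super().__init__()
            self.regressor = nn.Sequential(nn.Linear(768, 256), nn.Tanh(), nn.Linear(256, 1))
            self.loss_fn = nn.MSELoss()

        def training_step(self, batch, batch_idx):
            features, target = batch
            prediction = self.regressor(features).squeeze(-1)
            return self.loss_fn(prediction, target)

        def configure_optimizers(self):
            return torch.optim.Adam(self.parameters(), lr=3e-5)

    # A Trainer then handles checkpointing, logging and device placement, e.g.:
    # trainer = pl.Trainer(max_epochs=2)
    # trainer.fit(ToyEstimator(), train_dataloader)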

7 Related Work

Classic MT evaluation metrics are commonly characterized as n-gram matching metrics because, using hand-crafted features, they estimate MT quality by counting the number and fraction of n-grams that appear simultaneously in a candidate translation hypothesis and one or more human-references. Metrics such as BLEU (Papineni et al., 2002), METEOR (Lavie and Denkowski, 2009), and CHRF (Popovic, 2015) have been widely studied and improved (Koehn et al., 2007; Popovic, 2017; Denkowski and Lavie, 2011; Guo and Hu, 2019), but by design they usually fail to recognize and capture semantic similarity beyond the lexical level.

In recent years, word embeddings (Mikolov et al., 2013; Pennington et al., 2014; Peters et al., 2018; Devlin et al., 2019) have emerged as a commonly used alternative to n-gram matching for capturing word semantic similarity. Embedding-based metrics like METEOR-VECTOR (Servan et al., 2016), BLEU2VEC (Tattar and Fishel, 2017), YISI-1 (Lo, 2019), MOVERSCORE (Zhao et al., 2019), and BERTSCORE (Zhang et al., 2020) create soft-alignments between reference and hypothesis in an embedding space and then compute a score that reflects the semantic similarity between those segments. However, human judgements such as DA and MQM capture much more than just semantic similarity, resulting in a correlation upper-bound between human judgements and the scores produced by such metrics.

⁶ These will be hosted at: https://github.com/Unbabel/COMET
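To make the soft-alignment idea concrete, the sketch below greedily matches token embeddings by cosine similarity and combines precision and recall into an F-score, in the spirit of BERTSCORE; it is a simplified illustration, not the official implementation of any of the metrics above.

    import torch

    def soft_alignment_f1(hyp_emb: torch.Tensor, ref_emb: torch.Tensor) -> float:
        # hyp_emb: (m, d) hypothesis token embeddings; ref_emb: (n, d) reference
        # token embeddings. Greedy soft-alignment via cosine similarity.
        hyp = torch.nn.functional.normalize(hyp_emb, dim=-1)
        ref = torch.nn.functional.normalize(ref_emb, dim=-1)
        sim = hyp @ ref.T                          # (m, n) cosine similarities
        precision = sim.max(dim=1).values.mean()   # best match for each hypothesis token
        recall = sim.max(dim=0).values.mean()      # best match for each reference token
        return (2 * precision * recall / (precision + recall)).item()

    # Toy usage with random "token embeddings".
    print(soft_alignment_f1(torch.randn(7, 768), torch.randn(9, 768)))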

Learnable metrics (Shimanaka et al., 2018; Mathur et al., 2019; Shimanaka et al., 2019) attempt to directly optimize the correlation with human judgments, and have recently shown promising results. BLEURT (Sellam et al., 2020), a learnable metric based on BERT (Devlin et al., 2019), claims state-of-the-art performance for the last 3 years of the WMT Metrics Shared task. Because BLEURT builds on top of English-BERT (Devlin et al., 2019), it can only be used when English is the target language, which limits its applicability. Also, to the best of our knowledge, all the previously proposed learnable metrics have focused on optimizing DA which, due to a scarcity of annotators, can prove inherently noisy (Ma et al., 2019).

Reference-less MT evaluation, also known as Quality Estimation (QE), has historically often regressed on HTER for segment-level evaluation (Bojar et al., 2013, 2014, 2015, 2016, 2017a). More recently, MQM has been used for document-level evaluation (Specia et al., 2018; Fonseca et al., 2019). By leveraging highly multilingual pretrained encoders such as multilingual BERT (Devlin et al., 2019) and XLM (Conneau and Lample, 2019), QE systems have been showing auspicious correlations with human judgements (Kepler et al., 2019a). Concurrently, the OpenKiwi framework (Kepler et al., 2019b) has made it easier for researchers to push the field forward and build stronger QE models.

8 Conclusions and Future Work

In this paper we present COMET, a novel neural framework for training MT evaluation models that can serve as automatic metrics and easily be adapted and optimized to different types of human judgements of MT quality.

To showcase the effectiveness of our framework, we sought to address the challenges reported in the 2019 WMT Metrics Shared Task (Ma et al., 2019). We trained three distinct models which achieve new state-of-the-art results for segment-level correlation with human judgments and show promising ability to better differentiate high-performing systems.

One of the challenges of leveraging the power of pretrained models is the burdensome weight of parameters and inference time. A primary avenue for future work on COMET will look at the impact of more compact solutions such as DistilBERT (Sanh et al., 2019).

Additionally, whilst we outline the potential importance of the source text above, we note that our COMET-RANK model weighs source and reference differently during inference but equally in its training loss function. Future work will investigate the optimality of this formulation and further examine the interdependence of the different inputs.

Acknowledgments

We are grateful to Andre Martins Austin MatthewsFabio Kepler Daan Van Stigt Miguel Vera andthe reviewers for their valuable feedback and dis-cussions This work was supported in part by theP2020 Program through projects MAIA and Unba-bel4EU supervised by ANI under contract num-bers 045909 and 042671 respectively

References

Mikel Artetxe and Holger Schwenk 2019 Mas-sively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond Transac-tions of the Association for Computational Linguis-tics 7597ndash610

Loıc Barrault Ondrej Bojar Marta R Costa-jussaChristian Federmann Mark Fishel Yvette Gra-ham Barry Haddow Matthias Huck Philipp KoehnShervin Malmasi Christof Monz Mathias MullerSantanu Pal Matt Post and Marcos Zampieri 2019Findings of the 2019 conference on machine transla-tion (WMT19) In Proceedings of the Fourth Con-ference on Machine Translation (Volume 2 SharedTask Papers Day 1) pages 1ndash61 Florence Italy As-sociation for Computational Linguistics

Ondrej Bojar Christian Buck Chris Callison-BurchChristian Federmann Barry Haddow PhilippKoehn Christof Monz Matt Post Radu Soricut and

Lucia Specia 2013 Findings of the 2013 Work-shop on Statistical Machine Translation In Proceed-ings of the Eighth Workshop on Statistical MachineTranslation pages 1ndash44 Sofia Bulgaria Associa-tion for Computational Linguistics

Ondrej Bojar Christian Buck Christian FedermannBarry Haddow Philipp Koehn Johannes LevelingChristof Monz Pavel Pecina Matt Post HerveSaint-Amand Radu Soricut Lucia Specia and AlesTamchyna 2014 Findings of the 2014 workshop onstatistical machine translation In Proceedings of theNinth Workshop on Statistical Machine Translationpages 12ndash58 Baltimore Maryland USA Associa-tion for Computational Linguistics

Ondrej Bojar Rajen Chatterjee Christian FedermannYvette Graham Barry Haddow Shujian HuangMatthias Huck Philipp Koehn Qun Liu Varvara Lo-gacheva Christof Monz Matteo Negri Matt PostRaphael Rubino Lucia Specia and Marco Turchi2017a Findings of the 2017 conference on machinetranslation (WMT17) In Proceedings of the Sec-ond Conference on Machine Translation pages 169ndash214 Copenhagen Denmark Association for Com-putational Linguistics

Ondrej Bojar Rajen Chatterjee Christian FedermannYvette Graham Barry Haddow Matthias Huck An-tonio Jimeno Yepes Philipp Koehn Varvara Lo-gacheva Christof Monz Matteo Negri AurelieNeveol Mariana Neves Martin Popel Matt PostRaphael Rubino Carolina Scarton Lucia Spe-cia Marco Turchi Karin Verspoor and MarcosZampieri 2016 Findings of the 2016 conferenceon machine translation In Proceedings of theFirst Conference on Machine Translation Volume2 Shared Task Papers pages 131ndash198 Berlin Ger-many Association for Computational Linguistics

Ondrej Bojar Rajen Chatterjee Christian FedermannBarry Haddow Matthias Huck Chris HokampPhilipp Koehn Varvara Logacheva Christof MonzMatteo Negri Matt Post Carolina Scarton LuciaSpecia and Marco Turchi 2015 Findings of the2015 workshop on statistical machine translation InProceedings of the Tenth Workshop on StatisticalMachine Translation pages 1ndash46 Lisbon PortugalAssociation for Computational Linguistics

Ondrej Bojar Yvette Graham and Amir Kamran2017b Results of the WMT17 metrics sharedtask In Proceedings of the Second Conference onMachine Translation pages 489ndash513 CopenhagenDenmark Association for Computational Linguis-tics

Aljoscha Burchardt and Arle Lommel 2014 Practi-cal Guidelines for the Use of MQM in Scientific Re-search on Translation quality (access date 2020-05-26)

Alexis Conneau Kartikay Khandelwal Naman GoyalVishrav Chaudhary Guillaume Wenzek Francisco

2694

Guzman Edouard Grave Myle Ott Luke Zettle-moyer and Veselin Stoyanov 2019 Unsupervisedcross-lingual representation learning at scale arXivpreprint arXiv191102116

Alexis Conneau and Guillaume Lample 2019 Cross-lingual language model pretraining In H Wal-lach H Larochelle A Beygelzimer F dlsquoAlche BucE Fox and R Garnett editors Advances in Neu-ral Information Processing Systems 32 pages 7059ndash7069 Curran Associates Inc

Michael Denkowski and Alon Lavie 2011 Meteor 13Automatic metric for reliable optimization and eval-uation of machine translation systems In Proceed-ings of the Sixth Workshop on Statistical MachineTranslation pages 85ndash91 Edinburgh Scotland As-sociation for Computational Linguistics

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4171ndash4186 Minneapolis Minnesota Associ-ation for Computational Linguistics

WA Falcon 2019 PyTorch Lightning The lightweightPyTorch wrapper for high-performance AI researchGitHub

Erick Fonseca Lisa Yankovskaya Andre F T MartinsMark Fishel and Christian Federmann 2019 Find-ings of the WMT 2019 shared tasks on quality esti-mation In Proceedings of the Fourth Conference onMachine Translation (Volume 3 Shared Task PapersDay 2) pages 1ndash10 Florence Italy Association forComputational Linguistics

Yvette Graham Timothy Baldwin Alistair Moffat andJustin Zobel 2013 Continuous measurement scalesin human evaluation of machine translation In Pro-ceedings of the 7th Linguistic Annotation Workshopand Interoperability with Discourse pages 33ndash41Sofia Bulgaria Association for Computational Lin-guistics

Yvette Graham Timothy Baldwin Alistair Moffat andJustin Zobel 2014 Is machine translation gettingbetter over time In Proceedings of the 14th Confer-ence of the European Chapter of the Association forComputational Linguistics pages 443ndash451 Gothen-burg Sweden Association for Computational Lin-guistics

Yvette Graham Timothy Baldwin Alistair Moffat andJustin Zobel 2017 Can machine translation sys-tems be evaluated by the crowd alone Natural Lan-guage Engineering 23(1)330

Yinuo Guo and Junfeng Hu 2019 Meteor++ 20Adopt syntactic level paraphrase knowledge into ma-chine translation evaluation In Proceedings of the

Fourth Conference on Machine Translation (Volume2 Shared Task Papers Day 1) pages 501ndash506 Flo-rence Italy Association for Computational Linguis-tics

Jeremy Howard and Sebastian Ruder 2018 Universallanguage model fine-tuning for text classification InProceedings of the 56th Annual Meeting of the As-sociation for Computational Linguistics (Volume 1Long Papers) pages 328ndash339 Melbourne AustraliaAssociation for Computational Linguistics

Fabio Kepler Jonay Trenous Marcos Treviso MiguelVera Antonio Gois M Amin Farajian Antonio VLopes and Andre F T Martins 2019a Unba-belrsquos participation in the WMT19 translation qual-ity estimation shared task In Proceedings of theFourth Conference on Machine Translation (Volume3 Shared Task Papers Day 2) pages 78ndash84 Flo-rence Italy Association for Computational Linguis-tics

Fabio Kepler Jonay Trenous Marcos Treviso MiguelVera and Andre F T Martins 2019b OpenKiwiAn open source framework for quality estimationIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics SystemDemonstrations pages 117ndash122 Florence Italy As-sociation for Computational Linguistics

Philipp Koehn Hieu Hoang Alexandra Birch ChrisCallison-Burch Marcello Federico Nicola BertoldiBrooke Cowan Wade Shen Christine MoranRichard Zens Chris Dyer Ondrej Bojar AlexandraConstantin and Evan Herbst 2007 Moses Opensource toolkit for statistical machine translation InProceedings of the 45th Annual Meeting of the As-sociation for Computational Linguistics CompanionVolume Proceedings of the Demo and Poster Ses-sions pages 177ndash180 Prague Czech Republic As-sociation for Computational Linguistics

Dan Kondratyuk and Milan Straka 2019 75 lan-guages 1 model Parsing universal dependenciesuniversally In Proceedings of the 2019 Confer-ence on Empirical Methods in Natural LanguageProcessing and the 9th International Joint Confer-ence on Natural Language Processing (EMNLP-IJCNLP) pages 2779ndash2795 Hong Kong China As-sociation for Computational Linguistics

Alon Lavie and Michael Denkowski 2009 The meteormetric for automatic evaluation of machine transla-tion Machine Translation 23105ndash115

Chi-kiu Lo 2019 YiSi - a unified semantic MT qualityevaluation and estimation metric for languages withdifferent levels of available resources In Proceed-ings of the Fourth Conference on Machine Transla-tion (Volume 2 Shared Task Papers Day 1) pages507ndash513 Florence Italy Association for Computa-tional Linguistics

Arle Lommel Aljoscha Burchardt and Hans Uszkoreit2014 Multidimensional quality metrics (MQM) A

2695

framework for declaring and describing translationquality metrics Tradumtica tecnologies de la tra-ducci 0455ndash463

Qingsong Ma Ondrej Bojar and Yvette Graham 2018Results of the WMT18 metrics shared task Bothcharacters and embeddings achieve good perfor-mance In Proceedings of the Third Conference onMachine Translation Shared Task Papers pages671ndash688 Belgium Brussels Association for Com-putational Linguistics

Qingsong Ma Johnny Wei Ondrej Bojar and YvetteGraham 2019 Results of the WMT19 metricsshared task Segment-level and strong MT sys-tems pose big challenges In Proceedings of theFourth Conference on Machine Translation (Volume2 Shared Task Papers Day 1) pages 62ndash90 Flo-rence Italy Association for Computational Linguis-tics

Nitika Mathur Timothy Baldwin and Trevor Cohn2019 Putting evaluation in context Contextualembeddings improve machine translation evaluationIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages2799ndash2808 Florence Italy Association for Compu-tational Linguistics

Tomas Mikolov Ilya Sutskever Kai Chen Greg S Cor-rado and Jeff Dean 2013 Distributed representa-tions of words and phrases and their compositional-ity In Advances in Neural Information ProcessingSystems 26 pages 3111ndash3119 Curran AssociatesInc

Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu 2002 Bleu a method for automatic eval-uation of machine translation In Proceedings ofthe 40th Annual Meeting of the Association for Com-putational Linguistics pages 311ndash318 PhiladelphiaPennsylvania USA Association for ComputationalLinguistics

Jeffrey Pennington Richard Socher and ChristopherManning 2014 Glove Global vectors for word rep-resentation In Proceedings of the 2014 Conferenceon Empirical Methods in Natural Language Process-ing (EMNLP) pages 1532ndash1543 Doha Qatar Asso-ciation for Computational Linguistics

Matthew Peters Mark Neumann Mohit Iyyer MattGardner Christopher Clark Kenton Lee and LukeZettlemoyer 2018 Deep contextualized word rep-resentations In Proceedings of the 2018 Confer-ence of the North American Chapter of the Associ-ation for Computational Linguistics Human Lan-guage Technologies Volume 1 (Long Papers) pages2227ndash2237 New Orleans Louisiana Associationfor Computational Linguistics

Telmo Pires Eva Schlinger and Dan Garrette 2019How multilingual is multilingual BERT In Pro-ceedings of the 57th Annual Meeting of the Asso-ciation for Computational Linguistics pages 4996ndash

5001 Florence Italy Association for Computa-tional Linguistics

Maja Popovic 2015 chrF character n-gram f-scorefor automatic MT evaluation In Proceedings of theTenth Workshop on Statistical Machine Translationpages 392ndash395 Lisbon Portugal Association forComputational Linguistics

Maja Popovic 2017 chrF++ words helping charac-ter n-grams In Proceedings of the Second Con-ference on Machine Translation pages 612ndash618Copenhagen Denmark Association for Computa-tional Linguistics

Nils Reimers and Iryna Gurevych 2019 Sentence-BERT Sentence embeddings using Siamese BERT-networks In Proceedings of the 2019 Conference onEmpirical Methods in Natural Language Processingand the 9th International Joint Conference on Natu-ral Language Processing (EMNLP-IJCNLP) pages3982ndash3992 Hong Kong China Association forComputational Linguistics

Victor Sanh Lysandre Debut Julien Chaumond andThomas Wolf 2019 Distilbert a distilled versionof BERT smaller faster cheaper and lighter arXivpreprint arXiv191001108

F Schroff D Kalenichenko and J Philbin 2015Facenet A unified embedding for face recognitionand clustering In 2015 IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR) pages815ndash823

Thibault Sellam Dipanjan Das and Ankur Parikh2020 BLEURT Learning robust metrics for textgeneration In Proceedings of the 58th Annual Meet-ing of the Association for Computational Linguisticspages 7881ndash7892 Online Association for Computa-tional Linguistics

Christophe Servan Alexandre Berard Zied ElloumiHerve Blanchon and Laurent Besacier 2016Word2Vec vs DBnary Augmenting METEOR us-ing vector representations or lexical resources InProceedings of COLING 2016 the 26th Interna-tional Conference on Computational LinguisticsTechnical Papers pages 1159ndash1168 Osaka JapanThe COLING 2016 Organizing Committee

Hiroki Shimanaka Tomoyuki Kajiwara and MamoruKomachi 2018 RUSE Regressor using sentenceembeddings for automatic machine translation eval-uation In Proceedings of the Third Conference onMachine Translation Shared Task Papers pages751ndash758 Belgium Brussels Association for Com-putational Linguistics

Hiroki Shimanaka Tomoyuki Kajiwara and MamoruKomachi 2019 Machine Translation Evalu-ation with BERT Regressor arXiv preprintarXiv190712679

2696

Matthew Snover Bonnie Dorr Richard Schwartz Lin-nea Micciulla and John Makhoul 2006 A studyof translation edit rate with targeted human annota-tion In In Proceedings of Association for MachineTranslation in the Americas pages 223ndash231

Lucia Specia Frederic Blain Varvara LogachevaRamon Astudillo and Andre F T Martins 2018Findings of the WMT 2018 shared task on qualityestimation In Proceedings of the Third Conferenceon Machine Translation Shared Task Papers pages689ndash709 Belgium Brussels Association for Com-putational Linguistics

Lucia Specia Kim Harris Frederic Blain AljoschaBurchardt Viviven Macketanz Inguna SkadinaMatteo Negri and Marco Turchi 2017 Transla-tion quality and productivity A study on rich mor-phology languages In Machine Translation SummitXVI pages 55ndash71 Nagoya Japan

Kosuke Takahashi Katsuhito Sudoh and Satoshi Naka-mura 2020 Automatic machine translation evalua-tion using source language inputs and cross-linguallanguage model In Proceedings of the 58th AnnualMeeting of the Association for Computational Lin-guistics pages 3553ndash3558 Online Association forComputational Linguistics

Andre Tattar and Mark Fishel 2017 bleu2vec thepainfully familiar metric on continuous vector spacesteroids In Proceedings of the Second Conferenceon Machine Translation pages 619ndash622 Copen-hagen Denmark Association for ComputationalLinguistics

Ian Tenney Dipanjan Das and Ellie Pavlick 2019BERT rediscovers the classical NLP pipeline InProceedings of the 57th Annual Meeting of the Asso-ciation for Computational Linguistics pages 4593ndash4601 Florence Italy Association for ComputationalLinguistics

Tianyi Zhang Varsha Kishore Felix Wu Kilian QWeinberger and Yoav Artzi 2020 Bertscore Eval-uating text generation with bert In InternationalConference on Learning Representations

Wei Zhao Goran Glavas Maxime Peyrard Yang GaoRobert West and Steffen Eger 2020 On the lim-itations of cross-lingual encoders as exposed byreference-free machine translation evaluation InProceedings of the 58th Annual Meeting of the Asso-ciation for Computational Linguistics pages 1656ndash1671 Online Association for Computational Lin-guistics

Wei Zhao Maxime Peyrard Fei Liu Yang Gao Chris-tian M Meyer and Steffen Eger 2019 MoverScoreText generation evaluating with contextualized em-beddings and earth mover distance In Proceedingsof the 2019 Conference on Empirical Methods inNatural Language Processing and the 9th Interna-tional Joint Conference on Natural Language Pro-cessing (EMNLP-IJCNLP) pages 563ndash578 Hong

Kong China Association for Computational Lin-guistics

2697

A Appendices

In Table 5 we list the hyper-parameters used to trainour models Before initializing these models a ran-dom seed was set to 3 in all libraries that performldquorandomrdquo operations (torch numpy randomand cuda)

2698

Table 5 Hyper-parameters used in our COMET framework to train the presented models

Hyper-parameter COMET(Est-HTERMQM) COMET-RANK

Encoder Model XLM-RoBERTa (base) XLM-RoBERTa (base)Optimizer Adam (default parameters) Adam (default parameters)n frozen epochs 1 0Learning rate 3e-05 and 1e-05 1e-05Batch size 16 16Loss function MSE Triplet Margin (ε = 10)Layer-wise dropout 01 01FP precision 32 32Feed-Forward hidden units 23041152 ndashFeed-Forward activations Tanh ndashFeed-Forward dropout 01 ndash

Table 6 Statistics for the QT21 corpus

en-de en-cs en-lv de-enTotal tuples 54000 42000 35474 41998Avg tokens (reference) 1780 1556 1642 1771Avg tokens (source) 1670 1737 1839 1718Avg tokens (MT) 1765 1564 1642 1778

Table 7 Statistics for the WMT 2017 DARR corpus

en-cs en-de en-fi en-lv en-trTotal tuples 32810 6454 3270 3456 247Avg tokens (reference) 1970 2215 1559 2142 1757Avg tokens (source) 2237 2341 2173 2608 2251Avg tokens (MT) 1945 2258 1606 2218 1725

2699

Tabl

e8

Stat

istic

sfo

rthe

WM

T20

19D

AR

Rin

to-E

nglis

hla

ngua

gepa

irs

de-e

nfi-

engu

-en

kk-e

nlt-

enru

-en

zh-e

nTo

talt

uple

s85

365

3217

920

110

9728

2186

239

852

3107

0Av

gto

kens

(ref

eren

ce)

202

918

55

176

420

36

265

521

74

428

9Av

gto

kens

(sou

rce)

184

412

49

219

216

32

203

218

00

757

Avg

toke

ns(M

T)

202

217

76

170

219

68

252

521

80

397

0

Tabl

e9

Stat

istic

sfo

rthe

WM

T20

19D

AR

Rfr

om-E

nglis

han

dno

-Eng

lish

lang

uage

pair

s

en-c

sen

-de

en-fi

en-g

uen

-kk

en-lt

en-r

uen

-zh

fr-d

ede

-cs

de-f

rTo

talt

uple

s27

178

9984

031

820

1135

518

172

1740

124

334

1865

813

6923

194

4862

Avg

toke

ns(r

efer

ence

)22

92

256

520

12

333

218

89

210

024

79

925

226

822

27

273

2Av

gto

kens

(sou

rce)

249

824

97

252

324

32

237

824

46

244

524

39

286

025

22

213

6Av

gto

kens

(MT

)22

60

249

819

69

329

719

92

209

723

37

683

233

621

89

256

8

2700

Tabl

e10

MQ

Mco

rpus

(sec

tion

33)

stat

istic

s

en-n

len

-sv

en-j

aen

-de

en-r

uen

-es

en-f

ren

-iten

-pt-

bren

-tr

en-p

ten

-es-

lata

mTo

talt

uple

s24

4797

015

9027

5610

4325

914

7481

250

437

091

6Av

gto

kens

(ref

eren

ce)

141

014

24

203

213

78

133

710

90

137

513

61

124

87

9512

18

103

3Av

gto

kens

(sou

rce)

142

315

31

136

913

76

139

411

23

128

514

22

124

610

36

134

512

33

Avg

toke

ns(M

T)

136

613

91

178

413

41

131

910

88

135

913

02

121

97

9912

21

101

7

Tabl

e11

Sta

tistic

sfo

rthe

WM

T20

18D

AR

Rla

ngua

gepa

irs

zh-e

nen

-zh

cs-e

nfi-

enru

-en

tr-e

nde

-en

en-c

sen

-de

en-e

ten

-fien

-ru

en-t

ret

-en

Tota

ltup

les

3335

728

602

5110

1564

810

404

8525

7781

154

1319

711

3220

298

0922

181

1358

5672

1Av

gto

kens

(ref

eren

ce)

288

624

04

219

821

13

249

723

25

232

919

50

235

418

21

163

221

81

201

523

40

Avg

toke

ns(s

ourc

e)23

86

282

718

67

150

321

37

188

021

95

226

724

82

234

722

82

252

424

37

181

5Av

gto

kens

(MT

)27

45

149

421

79

204

625

25

228

022

64

197

323

74

183

717

15

218

619

61

235

2

2701

All 10 8 6 4

02

04

06

Top models en-cs

Ken

dall

Tau

scor

e

All 10 8 6 4minus02

0

02

04

Top models en-de

All 10 8 6 4

02

04

06

Top models en-fi

Ken

dall

Tau

scor

e

All 10 8 6 4

03

04

05

06

Top models en-gu

All 10 8 6 4

0

02

04

06

Top models en-kk

Ken

dall

Tau

scor

e

All 10 8 6 40

02

04

06

Top models en-lt

All 10 8 6 4

02

04

06

Top models en-ru

Ken

dall

Tau

scor

e

01

02

03

04

Top models en-zh

Table 12 Metrics performance over all and the top (108 6 and 4) MT systems for all from-English languagepairs The color scheme is as follows COMET-RANK COMET-HTER COMET-MQM BLEUBERTSCORE

2702

All 10 8 6 4minus01

0

01

02

Top models de-en

Ken

dall

Tau

scor

e

All 10 8 6 4

01

02

03

04

Top models fi-en

All 10 8 6 4

01

02

03

04

Top models lt-en

Ken

dall

Tau

scor

e

All 10 8 6 40

5 middot 10minus2

01

015

02

Top models ru-en

All 10 8 6 4

0

02

04

Top models zh-en

Ken

dall

Tau

scor

e

Table 13 Metrics performance over all and the top (108 6 and 4) MT systems for all into-English languagepairs The color scheme is as follows COMET-RANK COMET-HTER COMET-MQM BLEUBERTSCORE BLEURT

Page 6: COMET: A Neural Framework for MT Evaluation · 2020. 11. 11. · (Lavie and Denkowski,2009) remain popular as a means of evaluating MT systems due to their light-weight and fast computation

2690

Table 1 Kendallrsquos Tau (τ ) correlations on language pairs with English as source for the WMT19 Metrics DARRcorpus For BERTSCORE we report results with the default encoder model for a complete comparison but alsowith XLM-RoBERTa (base) for fairness with our models The values reported for YiSi-1 are taken directly fromthe shared task paper (Ma et al 2019)

Metric en-cs en-de en-fi en-gu en-kk en-lt en-ru en-zhBLEU 0364 0248 0395 0463 0363 0333 0469 0235CHRF 0444 0321 0518 0548 0510 0438 0548 0241YISI-1 0475 0351 0537 0551 0546 0470 0585 0355BERTSCORE (default) 0500 0363 0527 0568 0540 0464 0585 0356BERTSCORE (xlmr-base) 0503 0369 0553 0584 0536 0514 0599 0317COMET-HTER 0524 0383 0560 0552 0508 0577 0539 0380COMET-MQM 0537 0398 0567 0564 0534 0574 0615 0378COMET-RANK 0603 0427 0664 0611 0693 0665 0580 0449

hminus or the scores assigned to both hypotheses is thesame

As mentioned in the findings of (Ma et al 2019)segment-level correlations of all submitted metricswere frustratingly low Furthermore all submit-ted metrics exhibited a dramatic lack of ability tocorrectly rank strong MT systems To evaluatewhether our new MT evaluation models better ad-dress this issue we followed the described evalu-ation setup used in the analysis presented in (Maet al 2019) where correlation levels are examinedfor portions of the DARR data that include only thetop 10 8 6 and 4 MT systems

5 Results

51 From English into X

Table 1 shows results for all eight language pairswith English as source We contrast our threeCOMET models against baseline metrics such asBLEU and CHRF the 2019 task winning metricYISI-1 as well as the more recent BERTSCOREWe observe that across the board our three modelstrained with the COMET framework outperformoften by significant margins all other metrics OurDARR Ranker model outperforms the two Estima-tors in seven out of eight language pairs Also eventhough the MQM Estimator is trained on only 12Kannotated segments it performs roughly on parwith the HTER Estimator for most language-pairsand outperforms all the other metrics in en-ru

52 From X into English

Table 2 shows results for the seven to-English lan-guage pairs Again we contrast our three COMET

models against baseline metrics such as BLEU andCHRF the 2019 task winning metric YISI-1 as

well as the recently published metrics BERTSCORE

and BLEURT As in Table 1 the DARR model showsstrong correlations with human judgements out-performing the recently proposed English-specificBLEURT metric in five out of seven language pairsAgain the MQM Estimator shows surprising strongresults despite the fact that this model was trainedwith data that did not include English as a targetAlthough the encoder used in our trained models ishighly multilingual we hypothesise that this pow-erful ldquozero-shotrdquo result is due to the inclusion ofthe source in our models

5.3 Language pairs not involving English

All three of our COMET models were trained on data involving English (either as a source or as a target). Nevertheless, to demonstrate that our metrics generalize well, we test them on the three WMT 2019 language pairs that do not include English in either source or target. As can be seen in Table 3, our results are consistent with the observations in Tables 1 and 2.

5.4 Robustness to High-Quality MT

For this analysis, we use the DARR corpus from the 2019 Shared Task and evaluate on the subset of the data from the top performing MT systems for each language pair. We included language pairs for which we could retrieve data for at least ten different MT systems (i.e. all but kk-en and gu-en). We contrast against the strong, recently proposed BERTSCORE and BLEURT, with BLEU as a baseline. Results are presented in Figure 3. For language pairs where English is the target, our three models are either better than or competitive with all others; where English is the source, we note that, in general, our metrics exceed the performance of others.


Table 2: Kendall's Tau (τ) correlations on language pairs with English as a target for the WMT19 Metrics DARR corpus. As for BERTSCORE, for BLEURT we report results for two models: the base model, which is comparable in size with the encoder we used, and the large model, which is twice the size.

Metric                   de-en   fi-en   gu-en   kk-en   lt-en   ru-en   zh-en
BLEU                     0.053   0.236   0.194   0.276   0.249   0.177   0.321
CHRF                     0.123   0.292   0.240   0.323   0.304   0.115   0.371
YISI-1                   0.164   0.347   0.312   0.440   0.376   0.217   0.426
BERTSCORE (default)      0.190   0.354   0.292   0.351   0.381   0.221   0.432
BERTSCORE (xlmr-base)    0.171   0.335   0.295   0.354   0.356   0.202   0.412
BLEURT (base-128)        0.171   0.372   0.302   0.383   0.387   0.218   0.417
BLEURT (large-512)       0.174   0.374   0.313   0.372   0.388   0.220   0.436
COMET-HTER               0.185   0.333   0.274   0.297   0.364   0.163   0.391
COMET-MQM                0.207   0.343   0.282   0.339   0.368   0.187   0.422
COMET-RANK               0.202   0.399   0.341   0.358   0.407   0.180   0.445

Table 3: Kendall's Tau (τ) correlations on language pairs not involving English for the WMT19 Metrics DARR corpus.

Metric                   de-cs   de-fr   fr-de
BLEU                     0.222   0.226   0.173
CHRF                     0.341   0.287   0.274
YISI-1                   0.376   0.349   0.310
BERTSCORE (default)      0.358   0.329   0.300
BERTSCORE (xlmr-base)    0.386   0.336   0.309
COMET-HTER               0.358   0.397   0.315
COMET-MQM                0.386   0.367   0.296
COMET-RANK               0.389   0.444   0.331

Even the MQM Estimator, trained with only 12K segments, is competitive, which highlights the power of our proposed framework.

5.5 The Importance of the Source

To shed some light on the actual value and contribution of the source language input in our models' ability to learn accurate predictions, we trained two versions of our DARR Ranker model: one that uses only the reference, and another that uses both reference and source. Both models were trained using the WMT 2017 corpus, which only includes language pairs from English (en-de, en-cs, en-fi, en-tr). In other words, while English was never observed as a target language during training for either variant of the model, the training of the second variant includes English source embeddings. We then tested these two model variants on the WMT 2018 corpus for these language pairs and for the reversed directions (with the exception of en-cs, because cs-en does not exist for WMT 2018).

Figure 3: Metrics performance over all and the top (10, 8, 6, and 4) MT systems, shown separately for language pairs from X into English and from English into X. The vertical axis is Kendall's Tau (τ); the plotted metrics are COMET-RANK, COMET-MQM, COMET-HTER, BLEU, BERTSCORE, and BLEURT.

The results in Table 4 clearly show that, for the translation ranking architecture, including the source improves the overall correlation with human judgments. Furthermore, the inclusion of the source exposed the second variant of the model to English embeddings, which is reflected in a higher ∆τ for the language pairs with English as a target.

Table 4: Comparison between COMET-RANK (section 2.4) and a reference-only version thereof on WMT18 data. Both models were trained with WMT17, which means that the reference-only model is never exposed to English during training.

Metric                   en-cs   en-de   en-fi   en-tr   cs-en   de-en   fi-en   tr-en
COMET-RANK (ref. only)   0.660   0.764   0.630   0.539   0.249   0.390   0.159   0.128
COMET-RANK               0.711   0.799   0.671   0.563   0.356   0.542   0.278   0.260
∆τ                       0.051   0.035   0.041   0.024   0.107   0.155   0.119   0.132

6 Reproducibility

We will release both the code-base of the COMET framework and the trained MT evaluation models described in this paper to the research community upon publication, along with the detailed scripts required in order to run all reported baselines.6 All the models reported in this paper were trained on a single Tesla T4 (16GB) GPU. Moreover, our framework builds on top of PyTorch Lightning (Falcon, 2019), a lightweight PyTorch wrapper that was created for maximal flexibility and reproducibility.

6 These will be hosted at https://github.com/Unbabel/COMET
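As an illustration of what building on PyTorch Lightning looks like, the toy example below follows the usual LightningModule pattern. The module, layer sizes, and data layout are our own assumptions for illustration and are not taken from the released COMET code.

```python
# Minimal PyTorch Lightning sketch of an estimator-style regression model;
# names and dimensions are illustrative, not COMET's actual implementation.
import torch
import torch.nn as nn
import pytorch_lightning as pl

class ToyEstimator(pl.LightningModule):
    def __init__(self):
        super().__init__()
        # A small feed-forward head over precomputed sentence features.
        self.regressor = nn.Sequential(nn.Linear(768, 256), nn.Tanh(), nn.Linear(256, 1))

    def forward(self, features):
        return self.regressor(features)

    def training_step(self, batch, batch_idx):
        features, human_score = batch
        # Regress the predicted quality score against the human judgement.
        return nn.functional.mse_loss(self(features), human_score)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-5)

# trainer = pl.Trainer(max_epochs=2)
# trainer.fit(ToyEstimator(), train_dataloader)
```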

7 Related Work

Classic MT evaluation metrics are commonly characterized as n-gram matching metrics because, using hand-crafted features, they estimate MT quality by counting the number and fraction of n-grams that appear simultaneously in a candidate translation hypothesis and in one or more human references. Metrics such as BLEU (Papineni et al., 2002), METEOR (Lavie and Denkowski, 2009), and CHRF (Popović, 2015) have been widely studied and improved (Koehn et al., 2007; Popović, 2017; Denkowski and Lavie, 2011; Guo and Hu, 2019), but by design they usually fail to recognize and capture semantic similarity beyond the lexical level.
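To make the n-gram matching idea concrete, the sketch below computes a clipped n-gram precision between a hypothesis and a single reference. It is a simplified illustration of the building block shared by such metrics (real implementations such as BLEU add a brevity penalty, smoothing, and corpus-level aggregation); the function name and whitespace tokenization are our own.

```python
# Illustrative sketch of n-gram matching; not the implementation of any specific toolkit.
from collections import Counter

def ngram_precision(hypothesis, reference, n=2):
    """Fraction of hypothesis n-grams that also occur in the reference (clipped counts)."""
    hyp_tokens, ref_tokens = hypothesis.split(), reference.split()
    hyp_ngrams = Counter(tuple(hyp_tokens[i:i + n]) for i in range(len(hyp_tokens) - n + 1))
    ref_ngrams = Counter(tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1))
    overlap = sum(min(count, ref_ngrams[ng]) for ng, count in hyp_ngrams.items())
    total = sum(hyp_ngrams.values())
    return overlap / total if total else 0.0

# A paraphrase with no lexical overlap scores 0 even if it is an adequate
# translation, which is exactly the limitation discussed above.
print(ngram_precision("the cat sat on the mat", "the cat sat on the mat", n=2))      # 1.0
print(ngram_precision("a feline rested upon the rug", "the cat sat on the mat", n=2))  # 0.0
```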

In recent years, word embeddings (Mikolov et al., 2013; Pennington et al., 2014; Peters et al., 2018; Devlin et al., 2019) have emerged as a commonly used alternative to n-gram matching for capturing semantic similarity at the word level. Embedding-based metrics like METEOR-VECTOR (Servan et al., 2016), BLEU2VEC (Tättar and Fishel, 2017), YISI-1 (Lo, 2019), MOVERSCORE (Zhao et al., 2019), and BERTSCORE (Zhang et al., 2020) create soft alignments between reference and hypothesis in an embedding space and then compute a score that reflects the semantic similarity between those segments. However, human judgements such as DA and MQM capture much more than just semantic similarity, resulting in a correlation upper-bound between human judgements and the scores produced by such metrics.
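A minimal sketch of the soft-alignment idea is shown below: each token embedding on one side is greedily matched to its most similar counterpart on the other side and the similarities are combined, in the style of BERTSCORE. How the contextual token embeddings are produced is left abstract, and the function is our own illustration rather than the API of any of the cited metrics.

```python
# Greedy soft-alignment in embedding space; an illustration, not a library API.
import numpy as np

def greedy_embedding_similarity(hyp_embeddings, ref_embeddings):
    """hyp_embeddings: (m, d) array; ref_embeddings: (n, d) array; returns an F1-style score."""
    hyp = hyp_embeddings / np.linalg.norm(hyp_embeddings, axis=1, keepdims=True)
    ref = ref_embeddings / np.linalg.norm(ref_embeddings, axis=1, keepdims=True)
    sim = hyp @ ref.T                   # (m, n) cosine similarity matrix
    precision = sim.max(axis=1).mean()  # each hypothesis token matched to its best reference token
    recall = sim.max(axis=0).mean()     # each reference token matched to its best hypothesis token
    return 2 * precision * recall / (precision + recall)
```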

Learnable metrics (Shimanaka et al., 2018; Mathur et al., 2019; Shimanaka et al., 2019) attempt to directly optimize the correlation with human judgments and have recently shown promising results. BLEURT (Sellam et al., 2020), a learnable metric based on BERT (Devlin et al., 2019), claims state-of-the-art performance for the last three years of the WMT Metrics Shared Task. Because BLEURT builds on top of English-BERT (Devlin et al., 2019), it can only be used when English is the target language, which limits its applicability. Also, to the best of our knowledge, all the previously proposed learnable metrics have focused on optimizing DA which, due to a scarcity of annotators, can prove inherently noisy (Ma et al., 2019).

Reference-less MT evaluation, also known as Quality Estimation (QE), has historically often regressed on HTER for segment-level evaluation (Bojar et al., 2013, 2014, 2015, 2016, 2017a). More recently, MQM has been used for document-level evaluation (Specia et al., 2018; Fonseca et al., 2019). By leveraging highly multilingual pretrained encoders such as multilingual BERT (Devlin et al., 2019) and XLM (Conneau and Lample, 2019), QE systems have been showing auspicious correlations with human judgements (Kepler et al., 2019a). Concurrently, the OpenKiwi framework (Kepler et al., 2019b) has made it easier for researchers to push the field forward and build stronger QE models.

8 Conclusions and Future Work

In this paper we present COMET, a novel neural framework for training MT evaluation models that can serve as automatic metrics and easily be adapted and optimized to different types of human judgements of MT quality.

To showcase the effectiveness of our framework, we sought to address the challenges reported in the 2019 WMT Metrics Shared Task (Ma et al., 2019). We trained three distinct models which achieve new state-of-the-art results for segment-level correlation with human judgments and show promising ability to better differentiate high-performing systems.

One of the challenges of leveraging the power of pretrained models is the burdensome weight of parameters and inference time. A primary avenue for future work on COMET will look at the impact of more compact solutions such as DistilBERT (Sanh et al., 2019).

Additionally, whilst we outline the potential importance of the source text above, we note that our COMET-RANK model weighs source and reference differently during inference but equally in its training loss function. Future work will investigate the optimality of this formulation and further examine the interdependence of the different inputs.

Acknowledgments

We are grateful to André Martins, Austin Matthews, Fabio Kepler, Daan van Stigt, Miguel Vera, and the reviewers for their valuable feedback and discussions. This work was supported in part by the P2020 Program through projects MAIA and Unbabel4EU, supervised by ANI under contract numbers 045909 and 042671, respectively.

References

Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. Transactions of the Association for Computational Linguistics, 7:597–610.

Loïc Barrault, Ondřej Bojar, Marta R. Costa-jussà, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, Shervin Malmasi, Christof Monz, Mathias Müller, Santanu Pal, Matt Post, and Marcos Zampieri. 2019. Findings of the 2019 conference on machine translation (WMT19). In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 1–61, Florence, Italy. Association for Computational Linguistics.

Ondřej Bojar, Christian Buck, Chris Callison-Burch, Christian Federmann, Barry Haddow, Philipp Koehn, Christof Monz, Matt Post, Radu Soricut, and Lucia Specia. 2013. Findings of the 2013 Workshop on Statistical Machine Translation. In Proceedings of the Eighth Workshop on Statistical Machine Translation, pages 1–44, Sofia, Bulgaria. Association for Computational Linguistics.

Ondřej Bojar, Christian Buck, Christian Federmann, Barry Haddow, Philipp Koehn, Johannes Leveling, Christof Monz, Pavel Pecina, Matt Post, Herve Saint-Amand, Radu Soricut, Lucia Specia, and Aleš Tamchyna. 2014. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, pages 12–58, Baltimore, Maryland, USA. Association for Computational Linguistics.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Raphael Rubino, Lucia Specia, and Marco Turchi. 2017a. Findings of the 2017 conference on machine translation (WMT17). In Proceedings of the Second Conference on Machine Translation, pages 169–214, Copenhagen, Denmark. Association for Computational Linguistics.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Matthias Huck, Antonio Jimeno Yepes, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Aurélie Névéol, Mariana Neves, Martin Popel, Matt Post, Raphael Rubino, Carolina Scarton, Lucia Specia, Marco Turchi, Karin Verspoor, and Marcos Zampieri. 2016. Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, pages 131–198, Berlin, Germany. Association for Computational Linguistics.

Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Barry Haddow, Matthias Huck, Chris Hokamp, Philipp Koehn, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Carolina Scarton, Lucia Specia, and Marco Turchi. 2015. Findings of the 2015 workshop on statistical machine translation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 1–46, Lisbon, Portugal. Association for Computational Linguistics.

Ondřej Bojar, Yvette Graham, and Amir Kamran. 2017b. Results of the WMT17 metrics shared task. In Proceedings of the Second Conference on Machine Translation, pages 489–513, Copenhagen, Denmark. Association for Computational Linguistics.

Aljoscha Burchardt and Arle Lommel. 2014. Practical Guidelines for the Use of MQM in Scientific Research on Translation Quality. (access date: 2020-05-26).

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.

Alexis Conneau and Guillaume Lample. 2019. Cross-lingual language model pretraining. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 7059–7069. Curran Associates, Inc.

Michael Denkowski and Alon Lavie. 2011. Meteor 1.3: Automatic metric for reliable optimization and evaluation of machine translation systems. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 85–91, Edinburgh, Scotland. Association for Computational Linguistics.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

W.A. Falcon. 2019. PyTorch Lightning: The lightweight PyTorch wrapper for high-performance AI research. GitHub.

Erick Fonseca, Lisa Yankovskaya, André F. T. Martins, Mark Fishel, and Christian Federmann. 2019. Findings of the WMT 2019 shared tasks on quality estimation. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), pages 1–10, Florence, Italy. Association for Computational Linguistics.

Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2013. Continuous measurement scales in human evaluation of machine translation. In Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, pages 33–41, Sofia, Bulgaria. Association for Computational Linguistics.

Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2014. Is machine translation getting better over time? In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 443–451, Gothenburg, Sweden. Association for Computational Linguistics.

Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2017. Can machine translation systems be evaluated by the crowd alone? Natural Language Engineering, 23(1):3–30.

Yinuo Guo and Junfeng Hu. 2019. Meteor++ 2.0: Adopt syntactic level paraphrase knowledge into machine translation evaluation. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 501–506, Florence, Italy. Association for Computational Linguistics.

Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Association for Computational Linguistics.

Fabio Kepler, Jonay Trénous, Marcos Treviso, Miguel Vera, António Góis, M. Amin Farajian, António V. Lopes, and André F. T. Martins. 2019a. Unbabel's participation in the WMT19 translation quality estimation shared task. In Proceedings of the Fourth Conference on Machine Translation (Volume 3: Shared Task Papers, Day 2), pages 78–84, Florence, Italy. Association for Computational Linguistics.

Fabio Kepler, Jonay Trénous, Marcos Treviso, Miguel Vera, and André F. T. Martins. 2019b. OpenKiwi: An open source framework for quality estimation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 117–122, Florence, Italy. Association for Computational Linguistics.

Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondřej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, pages 177–180, Prague, Czech Republic. Association for Computational Linguistics.

Dan Kondratyuk and Milan Straka. 2019. 75 languages, 1 model: Parsing universal dependencies universally. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2779–2795, Hong Kong, China. Association for Computational Linguistics.

Alon Lavie and Michael Denkowski. 2009. The METEOR metric for automatic evaluation of machine translation. Machine Translation, 23:105–115.

Chi-kiu Lo. 2019. YiSi - a unified semantic MT quality evaluation and estimation metric for languages with different levels of available resources. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 507–513, Florence, Italy. Association for Computational Linguistics.

Arle Lommel, Aljoscha Burchardt, and Hans Uszkoreit. 2014. Multidimensional quality metrics (MQM): A framework for declaring and describing translation quality metrics. Tradumàtica: tecnologies de la traducció, 0:455–463.

Qingsong Ma, Ondřej Bojar, and Yvette Graham. 2018. Results of the WMT18 metrics shared task: Both characters and embeddings achieve good performance. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 671–688, Belgium, Brussels. Association for Computational Linguistics.

Qingsong Ma, Johnny Wei, Ondřej Bojar, and Yvette Graham. 2019. Results of the WMT19 metrics shared task: Segment-level and strong MT systems pose big challenges. In Proceedings of the Fourth Conference on Machine Translation (Volume 2: Shared Task Papers, Day 1), pages 62–90, Florence, Italy. Association for Computational Linguistics.

Nitika Mathur, Timothy Baldwin, and Trevor Cohn. 2019. Putting evaluation in context: Contextual embeddings improve machine translation evaluation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2799–2808, Florence, Italy. Association for Computational Linguistics.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227–2237, New Orleans, Louisiana. Association for Computational Linguistics.

Telmo Pires, Eva Schlinger, and Dan Garrette. 2019. How multilingual is multilingual BERT? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4996–5001, Florence, Italy. Association for Computational Linguistics.

Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics.

Maja Popović. 2017. chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618, Copenhagen, Denmark. Association for Computational Linguistics.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

F. Schroff, D. Kalenichenko, and J. Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 815–823.

Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online. Association for Computational Linguistics.

Christophe Servan, Alexandre Berard, Zied Elloumi, Hervé Blanchon, and Laurent Besacier. 2016. Word2Vec vs DBnary: Augmenting METEOR using vector representations or lexical resources. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1159–1168, Osaka, Japan. The COLING 2016 Organizing Committee.

Hiroki Shimanaka, Tomoyuki Kajiwara, and Mamoru Komachi. 2018. RUSE: Regressor using sentence embeddings for automatic machine translation evaluation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 751–758, Belgium, Brussels. Association for Computational Linguistics.

Hiroki Shimanaka, Tomoyuki Kajiwara, and Mamoru Komachi. 2019. Machine translation evaluation with BERT regressor. arXiv preprint arXiv:1907.12679.

Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of Association for Machine Translation in the Americas, pages 223–231.

Lucia Specia, Frédéric Blain, Varvara Logacheva, Ramón Astudillo, and André F. T. Martins. 2018. Findings of the WMT 2018 shared task on quality estimation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 689–709, Belgium, Brussels. Association for Computational Linguistics.

Lucia Specia, Kim Harris, Frédéric Blain, Aljoscha Burchardt, Vivien Macketanz, Inguna Skadiņa, Matteo Negri, and Marco Turchi. 2017. Translation quality and productivity: A study on rich morphology languages. In Machine Translation Summit XVI, pages 55–71, Nagoya, Japan.

Kosuke Takahashi, Katsuhito Sudoh, and Satoshi Nakamura. 2020. Automatic machine translation evaluation using source language inputs and cross-lingual language model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3553–3558, Online. Association for Computational Linguistics.

Andre Tättar and Mark Fishel. 2017. bleu2vec: the painfully familiar metric on continuous vector space steroids. In Proceedings of the Second Conference on Machine Translation, pages 619–622, Copenhagen, Denmark. Association for Computational Linguistics.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.

Wei Zhao, Goran Glavaš, Maxime Peyrard, Yang Gao, Robert West, and Steffen Eger. 2020. On the limitations of cross-lingual encoders as exposed by reference-free machine translation evaluation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1656–1671, Online. Association for Computational Linguistics.

Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, and Steffen Eger. 2019. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 563–578, Hong Kong, China. Association for Computational Linguistics.


A Appendices

In Table 5 we list the hyper-parameters used to train our models. Before initializing these models, a random seed was set to 3 in all libraries that perform "random" operations (torch, numpy, random and cuda).
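Concretely, the seeding step described above could be written as follows; whether the released code uses exactly this snippet is an assumption on our part.

```python
# Fix the random seed (seed = 3) in all libraries that perform "random" operations.
import random
import numpy as np
import torch

SEED = 3
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)
```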

2698

Table 5: Hyper-parameters used in our COMET framework to train the presented models.

Hyper-parameter              COMET (Est-HTER/MQM)        COMET-RANK
Encoder Model                XLM-RoBERTa (base)          XLM-RoBERTa (base)
Optimizer                    Adam (default parameters)   Adam (default parameters)
n frozen epochs              1                           0
Learning rate                3e-05 and 1e-05             1e-05
Batch size                   16                          16
Loss function                MSE                         Triplet Margin (ε = 1.0)
Layer-wise dropout           0.1                         0.1
FP precision                 32                          32
Feed-Forward hidden units    2304, 1152                  –
Feed-Forward activations     Tanh                        –
Feed-Forward dropout         0.1                         –
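For illustration, the two training objectives listed in Table 5 can be instantiated in PyTorch as shown below. The tensor shapes and the use of source/reference embeddings as triplet anchors are illustrative assumptions; this is a sketch, not the framework's actual training code.

```python
# Sketch of the Estimator (MSE) and Ranker (Triplet Margin, margin = 1.0) objectives.
import torch
import torch.nn as nn

mse_loss = nn.MSELoss()                          # Estimator models (HTER / MQM)
triplet_loss = nn.TripletMarginLoss(margin=1.0)  # Ranker model (ε = 1.0 in Table 5)

# Estimator: regress a predicted quality score against the human score.
predicted, human = torch.randn(16, 1), torch.randn(16, 1)   # batch size 16, as in Table 5
estimator_loss = mse_loss(predicted, human)

# Ranker: pull the better hypothesis towards the anchor (e.g. a source or
# reference embedding) and push the worse hypothesis away.
anchor = torch.randn(16, 768)      # 768 = XLM-RoBERTa (base) hidden size
better_hyp = torch.randn(16, 768)
worse_hyp = torch.randn(16, 768)
ranker_loss = triplet_loss(anchor, better_hyp, worse_hyp)
```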

Table 6: Statistics for the QT21 corpus.

                         en-de    en-cs    en-lv    de-en
Total tuples             54000    42000    35474    41998
Avg. tokens (reference)  17.80    15.56    16.42    17.71
Avg. tokens (source)     16.70    17.37    18.39    17.18
Avg. tokens (MT)         17.65    15.64    16.42    17.78

Table 7: Statistics for the WMT 2017 DARR corpus.

                         en-cs    en-de    en-fi    en-lv    en-tr
Total tuples             32810    6454     3270     3456     247
Avg. tokens (reference)  19.70    22.15    15.59    21.42    17.57
Avg. tokens (source)     22.37    23.41    21.73    26.08    22.51
Avg. tokens (MT)         19.45    22.58    16.06    22.18    17.25


Table 8: Statistics for the WMT 2019 DARR into-English language pairs.

                         de-en   fi-en   gu-en   kk-en   lt-en   ru-en   zh-en
Total tuples             85365   32179   20110   9728    21862   39852   31070
Avg. tokens (reference)  20.29   18.55   17.64   20.36   26.55   21.74   42.89
Avg. tokens (source)     18.44   12.49   21.92   16.32   20.32   18.00   7.57
Avg. tokens (MT)         20.22   17.76   17.02   19.68   25.25   21.80   39.70

Table 9: Statistics for the WMT 2019 DARR from-English and no-English language pairs.

                         en-cs   en-de   en-fi   en-gu   en-kk   en-lt   en-ru   en-zh   fr-de   de-cs   de-fr
Total tuples             27178   99840   31820   11355   18172   17401   24334   18658   1369    23194   4862
Avg. tokens (reference)  22.92   25.65   20.12   33.32   18.89   21.00   24.79   9.25    22.68   22.27   27.32
Avg. tokens (source)     24.98   24.97   25.23   24.32   23.78   24.46   24.45   24.39   28.60   25.22   21.36
Avg. tokens (MT)         22.60   24.98   19.69   32.97   19.92   20.97   23.37   6.83    23.36   21.89   25.68


Table 10: MQM corpus (section 3.3) statistics.

                         en-nl   en-sv   en-ja   en-de   en-ru   en-es   en-fr   en-it   en-pt-br   en-tr   en-pt   en-es-latam
Total tuples             2447    970     1590    2756    1043    259     1474    812     504        370     91      6
Avg. tokens (reference)  14.10   14.24   20.32   13.78   13.37   10.90   13.75   13.61   12.48      7.95    12.18   10.33
Avg. tokens (source)     14.23   15.31   13.69   13.76   13.94   11.23   12.85   14.22   12.46      10.36   13.45   12.33
Avg. tokens (MT)         13.66   13.91   17.84   13.41   13.19   10.88   13.59   13.02   12.19      7.99    12.21   10.17

Table 11: Statistics for the WMT 2018 DARR language pairs.

                         zh-en   en-zh   cs-en   fi-en   ru-en   tr-en   de-en   en-cs   en-de   en-et   en-fi   en-ru   en-tr   et-en
Total tuples             33357   28602   5110    15648   10404   8525    77811   5413    19711   32202   9809    22181   1358    56721
Avg. tokens (reference)  28.86   24.04   21.98   21.13   24.97   23.25   23.29   19.50   23.54   18.21   16.32   21.81   20.15   23.40
Avg. tokens (source)     23.86   28.27   18.67   15.03   21.37   18.80   21.95   22.67   24.82   23.47   22.82   25.24   24.37   18.15
Avg. tokens (MT)         27.45   14.94   21.79   20.46   25.25   22.80   22.64   19.73   23.74   18.37   17.15   21.86   19.61   23.52


Table 12: Metrics performance over all and the top (10, 8, 6, and 4) MT systems for all from-English language pairs (en-cs, en-de, en-fi, en-gu, en-kk, en-lt, en-ru, and en-zh); the vertical axis of each panel is the Kendall Tau score. The color scheme is as follows: COMET-RANK, COMET-HTER, COMET-MQM, BLEU, BERTSCORE.


Table 13: Metrics performance over all and the top (10, 8, 6, and 4) MT systems for all into-English language pairs (de-en, fi-en, lt-en, ru-en, and zh-en); the vertical axis of each panel is the Kendall Tau score. The color scheme is as follows: COMET-RANK, COMET-HTER, COMET-MQM, BLEU, BERTSCORE, BLEURT.

Page 7: COMET: A Neural Framework for MT Evaluation · 2020. 11. 11. · (Lavie and Denkowski,2009) remain popular as a means of evaluating MT systems due to their light-weight and fast computation

2691

Table 2 Kendallrsquos Tau (τ ) correlations on language pairs with English as a target for the WMT19 Metrics DARRcorpus As for BERTSCORE for BLEURT we report results for two models the base model which is comparablein size with the encoder we used and the large model that is twice the size

Metric de-en fi-en gu-en kk-en lt-en ru-en zh-enBLEU 0053 0236 0194 0276 0249 0177 0321CHRF 0123 0292 0240 0323 0304 0115 0371YISI-1 0164 0347 0312 0440 0376 0217 0426BERTSCORE (default) 0190 0354 0292 0351 0381 0221 0432BERTSCORE (xlmr-base) 0171 0335 0295 0354 0356 0202 0412BLEURT (base-128) 0171 0372 0302 0383 0387 0218 0417BLEURT (large-512) 0174 0374 0313 0372 0388 0220 0436COMET-HTER 0185 0333 0274 0297 0364 0163 0391COMET-MQM 0207 0343 0282 0339 0368 0187 0422COMET-RANK 0202 0399 0341 0358 0407 0180 0445

Table 3 Kendallrsquos Tau (τ ) correlations on languagepairs not involving English for the WMT19 MetricsDARR corpus

Metric de-cs de-fr fr-deBLEU 0222 0226 0173CHRF 0341 0287 0274YISI-1 0376 0349 0310BERTSCORE (default) 0358 0329 0300BERTSCORE (xlmr-base) 0386 0336 0309COMET-HTER 0358 0397 0315COMET-MQM 0386 0367 0296COMET-RANK 0389 0444 0331

ers Even the MQM Estimator trained with only12K segments is competitive which highlights thepower of our proposed framework

55 The Importance of the Source

To shed some light on the actual value and contri-bution of the source language input in our modelsrsquoability to learn accurate predictions we trained twoversions of our DARR Ranker model one that usesonly the reference and another that uses both refer-ence and source Both models were trained usingthe WMT 2017 corpus that only includes languagepairs from English (en-de en-cs en-fi en-tr) Inother words while English was never observed asa target language during training for both variantsof the model the training of the second variant in-cludes English source embeddings We then testedthese two model variants on the WMT 2018 corpusfor these language pairs and for the reversed di-rections (with the exception of en-cs because cs-endoes not exist for WMT 2018) The results in Table

All 10 8 6 40

01

02

03

Top models from X to English

Ken

dall

Tau

(τ)

COMET-RANK BLEUCOMET-MQM BERTSCORECOMET-HTER BLEURT

All 10 8 6 40

02

04

06

Top models from English to X

Ken

dall

Tau

(τ)

Figure 3 Metrics performance over all and the top (108 6 and 4) MT systems

4 clearly show that for the translation ranking archi-tecture including the source improves the overallcorrelation with human judgments Furthermorethe inclusion of the source exposed the second vari-ant of the model to English embeddings which is

2692

Table 4 Comparison between COMET-RANK (section 24) and a reference-only version thereof on WMT18 dataBoth models were trained with WMT17 which means that the reference-only model is never exposed to Englishduring training

Metric en-cs en-de en-fi en-tr cs-en de-en fi-en tr-enCOMET-RANK (ref only) 0660 0764 0630 0539 0249 0390 0159 0128COMET-RANK 0711 0799 0671 0563 0356 0542 0278 0260∆τ 0051 0035 0041 0024 0107 0155 0119 0132

reflected in a higher ∆τ for the language pairs withEnglish as a target

6 Reproducibility

We will release both the code-base of the COMET

framework and the trained MT evaluation modelsdescribed in this paper to the research communityupon publication along with the detailed scriptsrequired in order to run all reported baselines6 Allthe models reported in this paper were trained on asingle Tesla T4 (16GB) GPU Moreover our frame-work builds on top of PyTorch Lightning (Falcon2019) a lightweight PyTorch wrapper that wascreated for maximal flexibility and reproducibility

7 Related Work

Classic MT evaluation metrics are commonly char-acterized as n-gram matching metrics becauseusing hand-crafted features they estimate MT qual-ity by counting the number and fraction of n-grams that appear simultaneous in a candidatetranslation hypothesis and one or more human-references Metrics such as BLEU (Papineni et al2002) METEOR (Lavie and Denkowski 2009)and CHRF (Popovic 2015) have been widely stud-ied and improved (Koehn et al 2007 Popovic2017 Denkowski and Lavie 2011 Guo and Hu2019) but by design they usually fail to recognizeand capture semantic similarity beyond the lexicallevel

In recent years word embeddings (Mikolovet al 2013 Pennington et al 2014 Peters et al2018 Devlin et al 2019) have emerged as a com-monly used alternative to n-gram matching forcapturing word semantics similarity Embedding-based metrics like METEOR-VECTOR (Servanet al 2016) BLEU2VEC (Tattar and Fishel 2017)YISI-1 (Lo 2019) MOVERSCORE (Zhao et al2019) and BERTSCORE (Zhang et al 2020) createsoft-alignments between reference and hypothesis

6These will be hosted at httpsgithubcomUnbabelCOMET

in an embedding space and then compute a scorethat reflects the semantic similarity between thosesegments However human judgements such asDA and MQM capture much more than just se-mantic similarity resulting in a correlation upper-bound between human judgements and the scoresproduced by such metrics

Learnable metrics (Shimanaka et al 2018Mathur et al 2019 Shimanaka et al 2019) at-tempt to directly optimize the correlation with hu-man judgments and have recently shown promis-ing results BLEURT (Sellam et al 2020) a learn-able metric based on BERT (Devlin et al 2019)claims state-of-the-art performance for the last 3years of the WMT Metrics Shared task BecauseBLEURT builds on top of English-BERT (Devlinet al 2019) it can only be used when English is thetarget language which limits its applicability Alsoto the best of our knowledge all the previouslyproposed learnable metrics have focused on opti-mizing DA which due to a scarcity of annotatorscan prove inherently noisy (Ma et al 2019)

Reference-less MT evaluation also known asQuality Estimation (QE) has historically often re-gressed on HTER for segment-level evaluation (Bo-jar et al 2013 2014 2015 2016 2017a) Morerecently MQM has been used for document-levelevaluation (Specia et al 2018 Fonseca et al2019) By leveraging highly multilingual pre-trained encoders such as multilingual BERT (De-vlin et al 2019) and XLM (Conneau and Lam-ple 2019) QE systems have been showing aus-picious correlations with human judgements (Ke-pler et al 2019a) Concurrently the OpenKiwiframework (Kepler et al 2019b) has made it easierfor researchers to push the field forward and buildstronger QE models

8 Conclusions and Future Work

In this paper we present COMET a novel neu-ral framework for training MT evaluation modelsthat can serve as automatic metrics and easily be

2693

adapted and optimized to different types of humanjudgements of MT quality

To showcase the effectiveness of our frameworkwe sought to address the challenges reported in the2019 WMT Metrics Shared Task (Ma et al 2019)We trained three distinct models which achieve newstate-of-the-art results for segment-level correlationwith human judgments and show promising abilityto better differentiate high-performing systems

One of the challenges of leveraging the power ofpretrained models is the burdensome weight of pa-rameters and inference time A primary avenue forfuture work on COMET will look at the impact ofmore compact solutions such as DistilBERT (Sanhet al 2019)

Additionally whilst we outline the potential im-portance of the source text above we note that ourCOMET-RANK model weighs source and referencedifferently during inference but equally in its train-ing loss function Future work will investigate theoptimality of this formulation and further examinethe interdependence of the different inputs

Acknowledgments

We are grateful to Andre Martins Austin MatthewsFabio Kepler Daan Van Stigt Miguel Vera andthe reviewers for their valuable feedback and dis-cussions This work was supported in part by theP2020 Program through projects MAIA and Unba-bel4EU supervised by ANI under contract num-bers 045909 and 042671 respectively

References

Mikel Artetxe and Holger Schwenk 2019 Mas-sively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond Transac-tions of the Association for Computational Linguis-tics 7597ndash610

Loıc Barrault Ondrej Bojar Marta R Costa-jussaChristian Federmann Mark Fishel Yvette Gra-ham Barry Haddow Matthias Huck Philipp KoehnShervin Malmasi Christof Monz Mathias MullerSantanu Pal Matt Post and Marcos Zampieri 2019Findings of the 2019 conference on machine transla-tion (WMT19) In Proceedings of the Fourth Con-ference on Machine Translation (Volume 2 SharedTask Papers Day 1) pages 1ndash61 Florence Italy As-sociation for Computational Linguistics

Ondrej Bojar Christian Buck Chris Callison-BurchChristian Federmann Barry Haddow PhilippKoehn Christof Monz Matt Post Radu Soricut and

Lucia Specia 2013 Findings of the 2013 Work-shop on Statistical Machine Translation In Proceed-ings of the Eighth Workshop on Statistical MachineTranslation pages 1ndash44 Sofia Bulgaria Associa-tion for Computational Linguistics

Ondrej Bojar Christian Buck Christian FedermannBarry Haddow Philipp Koehn Johannes LevelingChristof Monz Pavel Pecina Matt Post HerveSaint-Amand Radu Soricut Lucia Specia and AlesTamchyna 2014 Findings of the 2014 workshop onstatistical machine translation In Proceedings of theNinth Workshop on Statistical Machine Translationpages 12ndash58 Baltimore Maryland USA Associa-tion for Computational Linguistics

Ondrej Bojar Rajen Chatterjee Christian FedermannYvette Graham Barry Haddow Shujian HuangMatthias Huck Philipp Koehn Qun Liu Varvara Lo-gacheva Christof Monz Matteo Negri Matt PostRaphael Rubino Lucia Specia and Marco Turchi2017a Findings of the 2017 conference on machinetranslation (WMT17) In Proceedings of the Sec-ond Conference on Machine Translation pages 169ndash214 Copenhagen Denmark Association for Com-putational Linguistics

Ondrej Bojar Rajen Chatterjee Christian FedermannYvette Graham Barry Haddow Matthias Huck An-tonio Jimeno Yepes Philipp Koehn Varvara Lo-gacheva Christof Monz Matteo Negri AurelieNeveol Mariana Neves Martin Popel Matt PostRaphael Rubino Carolina Scarton Lucia Spe-cia Marco Turchi Karin Verspoor and MarcosZampieri 2016 Findings of the 2016 conferenceon machine translation In Proceedings of theFirst Conference on Machine Translation Volume2 Shared Task Papers pages 131ndash198 Berlin Ger-many Association for Computational Linguistics

Ondrej Bojar Rajen Chatterjee Christian FedermannBarry Haddow Matthias Huck Chris HokampPhilipp Koehn Varvara Logacheva Christof MonzMatteo Negri Matt Post Carolina Scarton LuciaSpecia and Marco Turchi 2015 Findings of the2015 workshop on statistical machine translation InProceedings of the Tenth Workshop on StatisticalMachine Translation pages 1ndash46 Lisbon PortugalAssociation for Computational Linguistics

Ondrej Bojar Yvette Graham and Amir Kamran2017b Results of the WMT17 metrics sharedtask In Proceedings of the Second Conference onMachine Translation pages 489ndash513 CopenhagenDenmark Association for Computational Linguis-tics

Aljoscha Burchardt and Arle Lommel 2014 Practi-cal Guidelines for the Use of MQM in Scientific Re-search on Translation quality (access date 2020-05-26)

Alexis Conneau Kartikay Khandelwal Naman GoyalVishrav Chaudhary Guillaume Wenzek Francisco

2694

Guzman Edouard Grave Myle Ott Luke Zettle-moyer and Veselin Stoyanov 2019 Unsupervisedcross-lingual representation learning at scale arXivpreprint arXiv191102116

Alexis Conneau and Guillaume Lample 2019 Cross-lingual language model pretraining In H Wal-lach H Larochelle A Beygelzimer F dlsquoAlche BucE Fox and R Garnett editors Advances in Neu-ral Information Processing Systems 32 pages 7059ndash7069 Curran Associates Inc

Michael Denkowski and Alon Lavie 2011 Meteor 13Automatic metric for reliable optimization and eval-uation of machine translation systems In Proceed-ings of the Sixth Workshop on Statistical MachineTranslation pages 85ndash91 Edinburgh Scotland As-sociation for Computational Linguistics

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4171ndash4186 Minneapolis Minnesota Associ-ation for Computational Linguistics

WA Falcon 2019 PyTorch Lightning The lightweightPyTorch wrapper for high-performance AI researchGitHub

Erick Fonseca Lisa Yankovskaya Andre F T MartinsMark Fishel and Christian Federmann 2019 Find-ings of the WMT 2019 shared tasks on quality esti-mation In Proceedings of the Fourth Conference onMachine Translation (Volume 3 Shared Task PapersDay 2) pages 1ndash10 Florence Italy Association forComputational Linguistics

Yvette Graham Timothy Baldwin Alistair Moffat andJustin Zobel 2013 Continuous measurement scalesin human evaluation of machine translation In Pro-ceedings of the 7th Linguistic Annotation Workshopand Interoperability with Discourse pages 33ndash41Sofia Bulgaria Association for Computational Lin-guistics

Yvette Graham Timothy Baldwin Alistair Moffat andJustin Zobel 2014 Is machine translation gettingbetter over time In Proceedings of the 14th Confer-ence of the European Chapter of the Association forComputational Linguistics pages 443ndash451 Gothen-burg Sweden Association for Computational Lin-guistics

Yvette Graham Timothy Baldwin Alistair Moffat andJustin Zobel 2017 Can machine translation sys-tems be evaluated by the crowd alone Natural Lan-guage Engineering 23(1)330

Yinuo Guo and Junfeng Hu 2019 Meteor++ 20Adopt syntactic level paraphrase knowledge into ma-chine translation evaluation In Proceedings of the

Fourth Conference on Machine Translation (Volume2 Shared Task Papers Day 1) pages 501ndash506 Flo-rence Italy Association for Computational Linguis-tics

Jeremy Howard and Sebastian Ruder 2018 Universallanguage model fine-tuning for text classification InProceedings of the 56th Annual Meeting of the As-sociation for Computational Linguistics (Volume 1Long Papers) pages 328ndash339 Melbourne AustraliaAssociation for Computational Linguistics

Fabio Kepler Jonay Trenous Marcos Treviso MiguelVera Antonio Gois M Amin Farajian Antonio VLopes and Andre F T Martins 2019a Unba-belrsquos participation in the WMT19 translation qual-ity estimation shared task In Proceedings of theFourth Conference on Machine Translation (Volume3 Shared Task Papers Day 2) pages 78ndash84 Flo-rence Italy Association for Computational Linguis-tics

Fabio Kepler Jonay Trenous Marcos Treviso MiguelVera and Andre F T Martins 2019b OpenKiwiAn open source framework for quality estimationIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics SystemDemonstrations pages 117ndash122 Florence Italy As-sociation for Computational Linguistics

Philipp Koehn Hieu Hoang Alexandra Birch ChrisCallison-Burch Marcello Federico Nicola BertoldiBrooke Cowan Wade Shen Christine MoranRichard Zens Chris Dyer Ondrej Bojar AlexandraConstantin and Evan Herbst 2007 Moses Opensource toolkit for statistical machine translation InProceedings of the 45th Annual Meeting of the As-sociation for Computational Linguistics CompanionVolume Proceedings of the Demo and Poster Ses-sions pages 177ndash180 Prague Czech Republic As-sociation for Computational Linguistics

Dan Kondratyuk and Milan Straka 2019 75 lan-guages 1 model Parsing universal dependenciesuniversally In Proceedings of the 2019 Confer-ence on Empirical Methods in Natural LanguageProcessing and the 9th International Joint Confer-ence on Natural Language Processing (EMNLP-IJCNLP) pages 2779ndash2795 Hong Kong China As-sociation for Computational Linguistics

Alon Lavie and Michael Denkowski 2009 The meteormetric for automatic evaluation of machine transla-tion Machine Translation 23105ndash115

Chi-kiu Lo 2019 YiSi - a unified semantic MT qualityevaluation and estimation metric for languages withdifferent levels of available resources In Proceed-ings of the Fourth Conference on Machine Transla-tion (Volume 2 Shared Task Papers Day 1) pages507ndash513 Florence Italy Association for Computa-tional Linguistics

Arle Lommel Aljoscha Burchardt and Hans Uszkoreit2014 Multidimensional quality metrics (MQM) A

2695

framework for declaring and describing translationquality metrics Tradumtica tecnologies de la tra-ducci 0455ndash463

Qingsong Ma Ondrej Bojar and Yvette Graham 2018Results of the WMT18 metrics shared task Bothcharacters and embeddings achieve good perfor-mance In Proceedings of the Third Conference onMachine Translation Shared Task Papers pages671ndash688 Belgium Brussels Association for Com-putational Linguistics

Qingsong Ma Johnny Wei Ondrej Bojar and YvetteGraham 2019 Results of the WMT19 metricsshared task Segment-level and strong MT sys-tems pose big challenges In Proceedings of theFourth Conference on Machine Translation (Volume2 Shared Task Papers Day 1) pages 62ndash90 Flo-rence Italy Association for Computational Linguis-tics

Nitika Mathur Timothy Baldwin and Trevor Cohn2019 Putting evaluation in context Contextualembeddings improve machine translation evaluationIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages2799ndash2808 Florence Italy Association for Compu-tational Linguistics

Tomas Mikolov Ilya Sutskever Kai Chen Greg S Cor-rado and Jeff Dean 2013 Distributed representa-tions of words and phrases and their compositional-ity In Advances in Neural Information ProcessingSystems 26 pages 3111ndash3119 Curran AssociatesInc

Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu 2002 Bleu a method for automatic eval-uation of machine translation In Proceedings ofthe 40th Annual Meeting of the Association for Com-putational Linguistics pages 311ndash318 PhiladelphiaPennsylvania USA Association for ComputationalLinguistics

Jeffrey Pennington Richard Socher and ChristopherManning 2014 Glove Global vectors for word rep-resentation In Proceedings of the 2014 Conferenceon Empirical Methods in Natural Language Process-ing (EMNLP) pages 1532ndash1543 Doha Qatar Asso-ciation for Computational Linguistics

Matthew Peters Mark Neumann Mohit Iyyer MattGardner Christopher Clark Kenton Lee and LukeZettlemoyer 2018 Deep contextualized word rep-resentations In Proceedings of the 2018 Confer-ence of the North American Chapter of the Associ-ation for Computational Linguistics Human Lan-guage Technologies Volume 1 (Long Papers) pages2227ndash2237 New Orleans Louisiana Associationfor Computational Linguistics

Telmo Pires Eva Schlinger and Dan Garrette 2019How multilingual is multilingual BERT In Pro-ceedings of the 57th Annual Meeting of the Asso-ciation for Computational Linguistics pages 4996ndash

5001 Florence Italy Association for Computa-tional Linguistics

Maja Popovic 2015 chrF character n-gram f-scorefor automatic MT evaluation In Proceedings of theTenth Workshop on Statistical Machine Translationpages 392ndash395 Lisbon Portugal Association forComputational Linguistics

Maja Popovic 2017 chrF++ words helping charac-ter n-grams In Proceedings of the Second Con-ference on Machine Translation pages 612ndash618Copenhagen Denmark Association for Computa-tional Linguistics

Nils Reimers and Iryna Gurevych 2019 Sentence-BERT Sentence embeddings using Siamese BERT-networks In Proceedings of the 2019 Conference onEmpirical Methods in Natural Language Processingand the 9th International Joint Conference on Natu-ral Language Processing (EMNLP-IJCNLP) pages3982ndash3992 Hong Kong China Association forComputational Linguistics

Victor Sanh Lysandre Debut Julien Chaumond andThomas Wolf 2019 Distilbert a distilled versionof BERT smaller faster cheaper and lighter arXivpreprint arXiv191001108

F Schroff D Kalenichenko and J Philbin 2015Facenet A unified embedding for face recognitionand clustering In 2015 IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR) pages815ndash823

Thibault Sellam Dipanjan Das and Ankur Parikh2020 BLEURT Learning robust metrics for textgeneration In Proceedings of the 58th Annual Meet-ing of the Association for Computational Linguisticspages 7881ndash7892 Online Association for Computa-tional Linguistics

Christophe Servan Alexandre Berard Zied ElloumiHerve Blanchon and Laurent Besacier 2016Word2Vec vs DBnary Augmenting METEOR us-ing vector representations or lexical resources InProceedings of COLING 2016 the 26th Interna-tional Conference on Computational LinguisticsTechnical Papers pages 1159ndash1168 Osaka JapanThe COLING 2016 Organizing Committee

Hiroki Shimanaka Tomoyuki Kajiwara and MamoruKomachi 2018 RUSE Regressor using sentenceembeddings for automatic machine translation eval-uation In Proceedings of the Third Conference onMachine Translation Shared Task Papers pages751ndash758 Belgium Brussels Association for Com-putational Linguistics

Hiroki Shimanaka Tomoyuki Kajiwara and MamoruKomachi 2019 Machine Translation Evalu-ation with BERT Regressor arXiv preprintarXiv190712679

2696

Matthew Snover Bonnie Dorr Richard Schwartz Lin-nea Micciulla and John Makhoul 2006 A studyof translation edit rate with targeted human annota-tion In In Proceedings of Association for MachineTranslation in the Americas pages 223ndash231

Lucia Specia Frederic Blain Varvara LogachevaRamon Astudillo and Andre F T Martins 2018Findings of the WMT 2018 shared task on qualityestimation In Proceedings of the Third Conferenceon Machine Translation Shared Task Papers pages689ndash709 Belgium Brussels Association for Com-putational Linguistics

Lucia Specia Kim Harris Frederic Blain AljoschaBurchardt Viviven Macketanz Inguna SkadinaMatteo Negri and Marco Turchi 2017 Transla-tion quality and productivity A study on rich mor-phology languages In Machine Translation SummitXVI pages 55ndash71 Nagoya Japan

Kosuke Takahashi Katsuhito Sudoh and Satoshi Naka-mura 2020 Automatic machine translation evalua-tion using source language inputs and cross-linguallanguage model In Proceedings of the 58th AnnualMeeting of the Association for Computational Lin-guistics pages 3553ndash3558 Online Association forComputational Linguistics

Andre Tattar and Mark Fishel 2017 bleu2vec thepainfully familiar metric on continuous vector spacesteroids In Proceedings of the Second Conferenceon Machine Translation pages 619ndash622 Copen-hagen Denmark Association for ComputationalLinguistics

Ian Tenney Dipanjan Das and Ellie Pavlick 2019BERT rediscovers the classical NLP pipeline InProceedings of the 57th Annual Meeting of the Asso-ciation for Computational Linguistics pages 4593ndash4601 Florence Italy Association for ComputationalLinguistics

Tianyi Zhang Varsha Kishore Felix Wu Kilian QWeinberger and Yoav Artzi 2020 Bertscore Eval-uating text generation with bert In InternationalConference on Learning Representations

Wei Zhao Goran Glavas Maxime Peyrard Yang GaoRobert West and Steffen Eger 2020 On the lim-itations of cross-lingual encoders as exposed byreference-free machine translation evaluation InProceedings of the 58th Annual Meeting of the Asso-ciation for Computational Linguistics pages 1656ndash1671 Online Association for Computational Lin-guistics

Wei Zhao Maxime Peyrard Fei Liu Yang Gao Chris-tian M Meyer and Steffen Eger 2019 MoverScoreText generation evaluating with contextualized em-beddings and earth mover distance In Proceedingsof the 2019 Conference on Empirical Methods inNatural Language Processing and the 9th Interna-tional Joint Conference on Natural Language Pro-cessing (EMNLP-IJCNLP) pages 563ndash578 Hong

Kong China Association for Computational Lin-guistics

2697

A Appendices

In Table 5 we list the hyper-parameters used to trainour models Before initializing these models a ran-dom seed was set to 3 in all libraries that performldquorandomrdquo operations (torch numpy randomand cuda)

2698

Table 5 Hyper-parameters used in our COMET framework to train the presented models

Hyper-parameter COMET(Est-HTERMQM) COMET-RANK

Encoder Model XLM-RoBERTa (base) XLM-RoBERTa (base)Optimizer Adam (default parameters) Adam (default parameters)n frozen epochs 1 0Learning rate 3e-05 and 1e-05 1e-05Batch size 16 16Loss function MSE Triplet Margin (ε = 10)Layer-wise dropout 01 01FP precision 32 32Feed-Forward hidden units 23041152 ndashFeed-Forward activations Tanh ndashFeed-Forward dropout 01 ndash

Table 6 Statistics for the QT21 corpus

en-de en-cs en-lv de-enTotal tuples 54000 42000 35474 41998Avg tokens (reference) 1780 1556 1642 1771Avg tokens (source) 1670 1737 1839 1718Avg tokens (MT) 1765 1564 1642 1778

Table 7 Statistics for the WMT 2017 DARR corpus

en-cs en-de en-fi en-lv en-trTotal tuples 32810 6454 3270 3456 247Avg tokens (reference) 1970 2215 1559 2142 1757Avg tokens (source) 2237 2341 2173 2608 2251Avg tokens (MT) 1945 2258 1606 2218 1725

2699

Tabl

e8

Stat

istic

sfo

rthe

WM

T20

19D

AR

Rin

to-E

nglis

hla

ngua

gepa

irs

de-e

nfi-

engu

-en

kk-e

nlt-

enru

-en

zh-e

nTo

talt

uple

s85

365

3217

920

110

9728

2186

239

852

3107

0Av

gto

kens

(ref

eren

ce)

202

918

55

176

420

36

265

521

74

428

9Av

gto

kens

(sou

rce)

184

412

49

219

216

32

203

218

00

757

Avg

toke

ns(M

T)

202

217

76

170

219

68

252

521

80

397

0

Tabl

e9

Stat

istic

sfo

rthe

WM

T20

19D

AR

Rfr

om-E

nglis

han

dno

-Eng

lish

lang

uage

pair

s

en-c

sen

-de

en-fi

en-g

uen

-kk

en-lt

en-r

uen

-zh

fr-d

ede

-cs

de-f

rTo

talt

uple

s27

178

9984

031

820

1135

518

172

1740

124

334

1865

813

6923

194

4862

Avg

toke

ns(r

efer

ence

)22

92

256

520

12

333

218

89

210

024

79

925

226

822

27

273

2Av

gto

kens

(sou

rce)

249

824

97

252

324

32

237

824

46

244

524

39

286

025

22

213

6Av

gto

kens

(MT

)22

60

249

819

69

329

719

92

209

723

37

683

233

621

89

256

8

2700

Tabl

e10

MQ

Mco

rpus

(sec

tion

33)

stat

istic

s

en-n

len

-sv

en-j

aen

-de

en-r

uen

-es

en-f

ren

-iten

-pt-

bren

-tr

en-p

ten

-es-

lata

mTo

talt

uple

s24

4797

015

9027

5610

4325

914

7481

250

437

091

6Av

gto

kens

(ref

eren

ce)

141

014

24

203

213

78

133

710

90

137

513

61

124

87

9512

18

103

3Av

gto

kens

(sou

rce)

142

315

31

136

913

76

139

411

23

128

514

22

124

610

36

134

512

33

Avg

toke

ns(M

T)

136

613

91

178

413

41

131

910

88

135

913

02

121

97

9912

21

101

7

Tabl

e11

Sta

tistic

sfo

rthe

WM

T20

18D

AR

Rla

ngua

gepa

irs

zh-e

nen

-zh

cs-e

nfi-

enru

-en

tr-e

nde

-en

en-c

sen

-de

en-e

ten

-fien

-ru

en-t

ret

-en

Tota

ltup

les

3335

728

602

5110

1564

810

404

8525

7781

154

1319

711

3220

298

0922

181

1358

5672

1Av

gto

kens

(ref

eren

ce)

288

624

04

219

821

13

249

723

25

232

919

50

235

418

21

163

221

81

201

523

40

Avg

toke

ns(s

ourc

e)23

86

282

718

67

150

321

37

188

021

95

226

724

82

234

722

82

252

424

37

181

5Av

gto

kens

(MT

)27

45

149

421

79

204

625

25

228

022

64

197

323

74

183

717

15

218

619

61

235

2

2701

[Figure: Kendall Tau score (y-axis) plotted against the set of evaluated systems (All, top 10, 8, 6 and 4 on the x-axis), with one panel per from-English language pair: en-cs, en-de, en-fi, en-gu, en-kk, en-lt, en-ru and en-zh.]

Table 12: Metrics performance over all and the top (10, 8, 6 and 4) MT systems for all from-English language pairs. The color scheme is as follows: COMET-RANK, COMET-HTER, COMET-MQM, BLEU, BERTSCORE.


[Figure: Kendall Tau score (y-axis) plotted against the set of evaluated systems (All, top 10, 8, 6 and 4 on the x-axis), with one panel per into-English language pair: de-en, fi-en, lt-en, ru-en and zh-en.]

Table 13: Metrics performance over all and the top (10, 8, 6 and 4) MT systems for all into-English language pairs. The color scheme is as follows: COMET-RANK, COMET-HTER, COMET-MQM, BLEU, BERTSCORE, BLEURT.
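Both figures report the segment-level Kendall's Tau formulation used for DARR judgements; a minimal sketch, assuming the standard concordant/discordant definition over relative-ranking pairs (function and variable names are illustrative):

from typing import Sequence

def darr_kendall_tau(scores_better: Sequence[float],
                     scores_worse: Sequence[float]) -> float:
    # A pair is concordant when the metric scores the human-preferred ("better")
    # hypothesis above the dispreferred ("worse") one, and discordant otherwise
    # (ties count as discordant in this sketch).
    concordant = sum(b > w for b, w in zip(scores_better, scores_worse))
    discordant = len(scores_better) - concordant
    return (concordant - discordant) / (concordant + discordant)

# Toy usage with metric scores for four relative-ranking pairs: tau = 0.5
print(darr_kendall_tau([0.9, 0.7, 0.4, 0.8], [0.5, 0.8, 0.1, 0.6]))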


A Appendices

In Table 5 we list the hyper-parameters used to trainour models Before initializing these models a ran-dom seed was set to 3 in all libraries that performldquorandomrdquo operations (torch numpy randomand cuda)

2698

Table 5 Hyper-parameters used in our COMET framework to train the presented models

Hyper-parameter COMET(Est-HTERMQM) COMET-RANK

Encoder Model XLM-RoBERTa (base) XLM-RoBERTa (base)Optimizer Adam (default parameters) Adam (default parameters)n frozen epochs 1 0Learning rate 3e-05 and 1e-05 1e-05Batch size 16 16Loss function MSE Triplet Margin (ε = 10)Layer-wise dropout 01 01FP precision 32 32Feed-Forward hidden units 23041152 ndashFeed-Forward activations Tanh ndashFeed-Forward dropout 01 ndash

Table 6 Statistics for the QT21 corpus

en-de en-cs en-lv de-enTotal tuples 54000 42000 35474 41998Avg tokens (reference) 1780 1556 1642 1771Avg tokens (source) 1670 1737 1839 1718Avg tokens (MT) 1765 1564 1642 1778

Table 7 Statistics for the WMT 2017 DARR corpus

en-cs en-de en-fi en-lv en-trTotal tuples 32810 6454 3270 3456 247Avg tokens (reference) 1970 2215 1559 2142 1757Avg tokens (source) 2237 2341 2173 2608 2251Avg tokens (MT) 1945 2258 1606 2218 1725

2699

Tabl

e8

Stat

istic

sfo

rthe

WM

T20

19D

AR

Rin

to-E

nglis

hla

ngua

gepa

irs

de-e

nfi-

engu

-en

kk-e

nlt-

enru

-en

zh-e

nTo

talt

uple

s85

365

3217

920

110

9728

2186

239

852

3107

0Av

gto

kens

(ref

eren

ce)

202

918

55

176

420

36

265

521

74

428

9Av

gto

kens

(sou

rce)

184

412

49

219

216

32

203

218

00

757

Avg

toke

ns(M

T)

202

217

76

170

219

68

252

521

80

397

0

Tabl

e9

Stat

istic

sfo

rthe

WM

T20

19D

AR

Rfr

om-E

nglis

han

dno

-Eng

lish

lang

uage

pair

s

en-c

sen

-de

en-fi

en-g

uen

-kk

en-lt

en-r

uen

-zh

fr-d

ede

-cs

de-f

rTo

talt

uple

s27

178

9984

031

820

1135

518

172

1740

124

334

1865

813

6923

194

4862

Avg

toke

ns(r

efer

ence

)22

92

256

520

12

333

218

89

210

024

79

925

226

822

27

273

2Av

gto

kens

(sou

rce)

249

824

97

252

324

32

237

824

46

244

524

39

286

025

22

213

6Av

gto

kens

(MT

)22

60

249

819

69

329

719

92

209

723

37

683

233

621

89

256

8

2700

Tabl

e10

MQ

Mco

rpus

(sec

tion

33)

stat

istic

s

en-n

len

-sv

en-j

aen

-de

en-r

uen

-es

en-f

ren

-iten

-pt-

bren

-tr

en-p

ten

-es-

lata

mTo

talt

uple

s24

4797

015

9027

5610

4325

914

7481

250

437

091

6Av

gto

kens

(ref

eren

ce)

141

014

24

203

213

78

133

710

90

137

513

61

124

87

9512

18

103

3Av

gto

kens

(sou

rce)

142

315

31

136

913

76

139

411

23

128

514

22

124

610

36

134

512

33

Avg

toke

ns(M

T)

136

613

91

178

413

41

131

910

88

135

913

02

121

97

9912

21

101

7

Tabl

e11

Sta

tistic

sfo

rthe

WM

T20

18D

AR

Rla

ngua

gepa

irs

zh-e

nen

-zh

cs-e

nfi-

enru

-en

tr-e

nde

-en

en-c

sen

-de

en-e

ten

-fien

-ru

en-t

ret

-en

Tota

ltup

les

3335

728

602

5110

1564

810

404

8525

7781

154

1319

711

3220

298

0922

181

1358

5672

1Av

gto

kens

(ref

eren

ce)

288

624

04

219

821

13

249

723

25

232

919

50

235

418

21

163

221

81

201

523

40

Avg

toke

ns(s

ourc

e)23

86

282

718

67

150

321

37

188

021

95

226

724

82

234

722

82

252

424

37

181

5Av

gto

kens

(MT

)27

45

149

421

79

204

625

25

228

022

64

197

323

74

183

717

15

218

619

61

235

2

2701

All 10 8 6 4

02

04

06

Top models en-cs

Ken

dall

Tau

scor

e

All 10 8 6 4minus02

0

02

04

Top models en-de

All 10 8 6 4

02

04

06

Top models en-fi

Ken

dall

Tau

scor

e

All 10 8 6 4

03

04

05

06

Top models en-gu

All 10 8 6 4

0

02

04

06

Top models en-kk

Ken

dall

Tau

scor

e

All 10 8 6 40

02

04

06

Top models en-lt

All 10 8 6 4

02

04

06

Top models en-ru

Ken

dall

Tau

scor

e

01

02

03

04

Top models en-zh

Table 12 Metrics performance over all and the top (108 6 and 4) MT systems for all from-English languagepairs The color scheme is as follows COMET-RANK COMET-HTER COMET-MQM BLEUBERTSCORE

2702

All 10 8 6 4minus01

0

01

02

Top models de-en

Ken

dall

Tau

scor

e

All 10 8 6 4

01

02

03

04

Top models fi-en

All 10 8 6 4

01

02

03

04

Top models lt-en

Ken

dall

Tau

scor

e

All 10 8 6 40

5 middot 10minus2

01

015

02

Top models ru-en

All 10 8 6 4

0

02

04

Top models zh-en

Ken

dall

Tau

scor

e

Table 13 Metrics performance over all and the top (108 6 and 4) MT systems for all into-English languagepairs The color scheme is as follows COMET-RANK COMET-HTER COMET-MQM BLEUBERTSCORE BLEURT

Page 9: COMET: A Neural Framework for MT Evaluation · 2020. 11. 11. · (Lavie and Denkowski,2009) remain popular as a means of evaluating MT systems due to their light-weight and fast computation

2693

adapted and optimized to different types of humanjudgements of MT quality

To showcase the effectiveness of our frameworkwe sought to address the challenges reported in the2019 WMT Metrics Shared Task (Ma et al 2019)We trained three distinct models which achieve newstate-of-the-art results for segment-level correlationwith human judgments and show promising abilityto better differentiate high-performing systems

One of the challenges of leveraging the power ofpretrained models is the burdensome weight of pa-rameters and inference time A primary avenue forfuture work on COMET will look at the impact ofmore compact solutions such as DistilBERT (Sanhet al 2019)

Additionally whilst we outline the potential im-portance of the source text above we note that ourCOMET-RANK model weighs source and referencedifferently during inference but equally in its train-ing loss function Future work will investigate theoptimality of this formulation and further examinethe interdependence of the different inputs

Acknowledgments

We are grateful to Andre Martins Austin MatthewsFabio Kepler Daan Van Stigt Miguel Vera andthe reviewers for their valuable feedback and dis-cussions This work was supported in part by theP2020 Program through projects MAIA and Unba-bel4EU supervised by ANI under contract num-bers 045909 and 042671 respectively

References

Mikel Artetxe and Holger Schwenk 2019 Mas-sively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond Transac-tions of the Association for Computational Linguis-tics 7597ndash610

Loıc Barrault Ondrej Bojar Marta R Costa-jussaChristian Federmann Mark Fishel Yvette Gra-ham Barry Haddow Matthias Huck Philipp KoehnShervin Malmasi Christof Monz Mathias MullerSantanu Pal Matt Post and Marcos Zampieri 2019Findings of the 2019 conference on machine transla-tion (WMT19) In Proceedings of the Fourth Con-ference on Machine Translation (Volume 2 SharedTask Papers Day 1) pages 1ndash61 Florence Italy As-sociation for Computational Linguistics

Ondrej Bojar Christian Buck Chris Callison-BurchChristian Federmann Barry Haddow PhilippKoehn Christof Monz Matt Post Radu Soricut and

Lucia Specia 2013 Findings of the 2013 Work-shop on Statistical Machine Translation In Proceed-ings of the Eighth Workshop on Statistical MachineTranslation pages 1ndash44 Sofia Bulgaria Associa-tion for Computational Linguistics

Ondrej Bojar Christian Buck Christian FedermannBarry Haddow Philipp Koehn Johannes LevelingChristof Monz Pavel Pecina Matt Post HerveSaint-Amand Radu Soricut Lucia Specia and AlesTamchyna 2014 Findings of the 2014 workshop onstatistical machine translation In Proceedings of theNinth Workshop on Statistical Machine Translationpages 12ndash58 Baltimore Maryland USA Associa-tion for Computational Linguistics

Ondrej Bojar Rajen Chatterjee Christian FedermannYvette Graham Barry Haddow Shujian HuangMatthias Huck Philipp Koehn Qun Liu Varvara Lo-gacheva Christof Monz Matteo Negri Matt PostRaphael Rubino Lucia Specia and Marco Turchi2017a Findings of the 2017 conference on machinetranslation (WMT17) In Proceedings of the Sec-ond Conference on Machine Translation pages 169ndash214 Copenhagen Denmark Association for Com-putational Linguistics

Ondrej Bojar Rajen Chatterjee Christian FedermannYvette Graham Barry Haddow Matthias Huck An-tonio Jimeno Yepes Philipp Koehn Varvara Lo-gacheva Christof Monz Matteo Negri AurelieNeveol Mariana Neves Martin Popel Matt PostRaphael Rubino Carolina Scarton Lucia Spe-cia Marco Turchi Karin Verspoor and MarcosZampieri 2016 Findings of the 2016 conferenceon machine translation In Proceedings of theFirst Conference on Machine Translation Volume2 Shared Task Papers pages 131ndash198 Berlin Ger-many Association for Computational Linguistics

Ondrej Bojar Rajen Chatterjee Christian FedermannBarry Haddow Matthias Huck Chris HokampPhilipp Koehn Varvara Logacheva Christof MonzMatteo Negri Matt Post Carolina Scarton LuciaSpecia and Marco Turchi 2015 Findings of the2015 workshop on statistical machine translation InProceedings of the Tenth Workshop on StatisticalMachine Translation pages 1ndash46 Lisbon PortugalAssociation for Computational Linguistics

Ondrej Bojar Yvette Graham and Amir Kamran2017b Results of the WMT17 metrics sharedtask In Proceedings of the Second Conference onMachine Translation pages 489ndash513 CopenhagenDenmark Association for Computational Linguis-tics

Aljoscha Burchardt and Arle Lommel 2014 Practi-cal Guidelines for the Use of MQM in Scientific Re-search on Translation quality (access date 2020-05-26)

Alexis Conneau Kartikay Khandelwal Naman GoyalVishrav Chaudhary Guillaume Wenzek Francisco

2694

Guzman Edouard Grave Myle Ott Luke Zettle-moyer and Veselin Stoyanov 2019 Unsupervisedcross-lingual representation learning at scale arXivpreprint arXiv191102116

Alexis Conneau and Guillaume Lample 2019 Cross-lingual language model pretraining In H Wal-lach H Larochelle A Beygelzimer F dlsquoAlche BucE Fox and R Garnett editors Advances in Neu-ral Information Processing Systems 32 pages 7059ndash7069 Curran Associates Inc

Michael Denkowski and Alon Lavie 2011 Meteor 13Automatic metric for reliable optimization and eval-uation of machine translation systems In Proceed-ings of the Sixth Workshop on Statistical MachineTranslation pages 85ndash91 Edinburgh Scotland As-sociation for Computational Linguistics

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4171ndash4186 Minneapolis Minnesota Associ-ation for Computational Linguistics

WA Falcon 2019 PyTorch Lightning The lightweightPyTorch wrapper for high-performance AI researchGitHub

Erick Fonseca Lisa Yankovskaya Andre F T MartinsMark Fishel and Christian Federmann 2019 Find-ings of the WMT 2019 shared tasks on quality esti-mation In Proceedings of the Fourth Conference onMachine Translation (Volume 3 Shared Task PapersDay 2) pages 1ndash10 Florence Italy Association forComputational Linguistics

Yvette Graham Timothy Baldwin Alistair Moffat andJustin Zobel 2013 Continuous measurement scalesin human evaluation of machine translation In Pro-ceedings of the 7th Linguistic Annotation Workshopand Interoperability with Discourse pages 33ndash41Sofia Bulgaria Association for Computational Lin-guistics

Yvette Graham Timothy Baldwin Alistair Moffat andJustin Zobel 2014 Is machine translation gettingbetter over time In Proceedings of the 14th Confer-ence of the European Chapter of the Association forComputational Linguistics pages 443ndash451 Gothen-burg Sweden Association for Computational Lin-guistics

Yvette Graham Timothy Baldwin Alistair Moffat andJustin Zobel 2017 Can machine translation sys-tems be evaluated by the crowd alone Natural Lan-guage Engineering 23(1)330

Yinuo Guo and Junfeng Hu 2019 Meteor++ 20Adopt syntactic level paraphrase knowledge into ma-chine translation evaluation In Proceedings of the

Fourth Conference on Machine Translation (Volume2 Shared Task Papers Day 1) pages 501ndash506 Flo-rence Italy Association for Computational Linguis-tics

Jeremy Howard and Sebastian Ruder 2018 Universallanguage model fine-tuning for text classification InProceedings of the 56th Annual Meeting of the As-sociation for Computational Linguistics (Volume 1Long Papers) pages 328ndash339 Melbourne AustraliaAssociation for Computational Linguistics

Fabio Kepler Jonay Trenous Marcos Treviso MiguelVera Antonio Gois M Amin Farajian Antonio VLopes and Andre F T Martins 2019a Unba-belrsquos participation in the WMT19 translation qual-ity estimation shared task In Proceedings of theFourth Conference on Machine Translation (Volume3 Shared Task Papers Day 2) pages 78ndash84 Flo-rence Italy Association for Computational Linguis-tics

Fabio Kepler Jonay Trenous Marcos Treviso MiguelVera and Andre F T Martins 2019b OpenKiwiAn open source framework for quality estimationIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics SystemDemonstrations pages 117ndash122 Florence Italy As-sociation for Computational Linguistics

Philipp Koehn Hieu Hoang Alexandra Birch ChrisCallison-Burch Marcello Federico Nicola BertoldiBrooke Cowan Wade Shen Christine MoranRichard Zens Chris Dyer Ondrej Bojar AlexandraConstantin and Evan Herbst 2007 Moses Opensource toolkit for statistical machine translation InProceedings of the 45th Annual Meeting of the As-sociation for Computational Linguistics CompanionVolume Proceedings of the Demo and Poster Ses-sions pages 177ndash180 Prague Czech Republic As-sociation for Computational Linguistics

Dan Kondratyuk and Milan Straka 2019 75 lan-guages 1 model Parsing universal dependenciesuniversally In Proceedings of the 2019 Confer-ence on Empirical Methods in Natural LanguageProcessing and the 9th International Joint Confer-ence on Natural Language Processing (EMNLP-IJCNLP) pages 2779ndash2795 Hong Kong China As-sociation for Computational Linguistics

Alon Lavie and Michael Denkowski 2009 The meteormetric for automatic evaluation of machine transla-tion Machine Translation 23105ndash115

Chi-kiu Lo 2019 YiSi - a unified semantic MT qualityevaluation and estimation metric for languages withdifferent levels of available resources In Proceed-ings of the Fourth Conference on Machine Transla-tion (Volume 2 Shared Task Papers Day 1) pages507ndash513 Florence Italy Association for Computa-tional Linguistics

Arle Lommel Aljoscha Burchardt and Hans Uszkoreit2014 Multidimensional quality metrics (MQM) A

2695

framework for declaring and describing translationquality metrics Tradumtica tecnologies de la tra-ducci 0455ndash463

Qingsong Ma Ondrej Bojar and Yvette Graham 2018Results of the WMT18 metrics shared task Bothcharacters and embeddings achieve good perfor-mance In Proceedings of the Third Conference onMachine Translation Shared Task Papers pages671ndash688 Belgium Brussels Association for Com-putational Linguistics

Qingsong Ma Johnny Wei Ondrej Bojar and YvetteGraham 2019 Results of the WMT19 metricsshared task Segment-level and strong MT sys-tems pose big challenges In Proceedings of theFourth Conference on Machine Translation (Volume2 Shared Task Papers Day 1) pages 62ndash90 Flo-rence Italy Association for Computational Linguis-tics

Nitika Mathur Timothy Baldwin and Trevor Cohn2019 Putting evaluation in context Contextualembeddings improve machine translation evaluationIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages2799ndash2808 Florence Italy Association for Compu-tational Linguistics

Tomas Mikolov Ilya Sutskever Kai Chen Greg S Cor-rado and Jeff Dean 2013 Distributed representa-tions of words and phrases and their compositional-ity In Advances in Neural Information ProcessingSystems 26 pages 3111ndash3119 Curran AssociatesInc

Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu 2002 Bleu a method for automatic eval-uation of machine translation In Proceedings ofthe 40th Annual Meeting of the Association for Com-putational Linguistics pages 311ndash318 PhiladelphiaPennsylvania USA Association for ComputationalLinguistics

Jeffrey Pennington Richard Socher and ChristopherManning 2014 Glove Global vectors for word rep-resentation In Proceedings of the 2014 Conferenceon Empirical Methods in Natural Language Process-ing (EMNLP) pages 1532ndash1543 Doha Qatar Asso-ciation for Computational Linguistics

Matthew Peters Mark Neumann Mohit Iyyer MattGardner Christopher Clark Kenton Lee and LukeZettlemoyer 2018 Deep contextualized word rep-resentations In Proceedings of the 2018 Confer-ence of the North American Chapter of the Associ-ation for Computational Linguistics Human Lan-guage Technologies Volume 1 (Long Papers) pages2227ndash2237 New Orleans Louisiana Associationfor Computational Linguistics

Telmo Pires Eva Schlinger and Dan Garrette 2019How multilingual is multilingual BERT In Pro-ceedings of the 57th Annual Meeting of the Asso-ciation for Computational Linguistics pages 4996ndash

5001 Florence Italy Association for Computa-tional Linguistics

Maja Popovic 2015 chrF character n-gram f-scorefor automatic MT evaluation In Proceedings of theTenth Workshop on Statistical Machine Translationpages 392ndash395 Lisbon Portugal Association forComputational Linguistics

Maja Popovic 2017 chrF++ words helping charac-ter n-grams In Proceedings of the Second Con-ference on Machine Translation pages 612ndash618Copenhagen Denmark Association for Computa-tional Linguistics

Nils Reimers and Iryna Gurevych 2019 Sentence-BERT Sentence embeddings using Siamese BERT-networks In Proceedings of the 2019 Conference onEmpirical Methods in Natural Language Processingand the 9th International Joint Conference on Natu-ral Language Processing (EMNLP-IJCNLP) pages3982ndash3992 Hong Kong China Association forComputational Linguistics

Victor Sanh Lysandre Debut Julien Chaumond andThomas Wolf 2019 Distilbert a distilled versionof BERT smaller faster cheaper and lighter arXivpreprint arXiv191001108

F Schroff D Kalenichenko and J Philbin 2015Facenet A unified embedding for face recognitionand clustering In 2015 IEEE Conference on Com-puter Vision and Pattern Recognition (CVPR) pages815ndash823

Thibault Sellam Dipanjan Das and Ankur Parikh2020 BLEURT Learning robust metrics for textgeneration In Proceedings of the 58th Annual Meet-ing of the Association for Computational Linguisticspages 7881ndash7892 Online Association for Computa-tional Linguistics

Christophe Servan Alexandre Berard Zied ElloumiHerve Blanchon and Laurent Besacier 2016Word2Vec vs DBnary Augmenting METEOR us-ing vector representations or lexical resources InProceedings of COLING 2016 the 26th Interna-tional Conference on Computational LinguisticsTechnical Papers pages 1159ndash1168 Osaka JapanThe COLING 2016 Organizing Committee

Hiroki Shimanaka Tomoyuki Kajiwara and MamoruKomachi 2018 RUSE Regressor using sentenceembeddings for automatic machine translation eval-uation In Proceedings of the Third Conference onMachine Translation Shared Task Papers pages751ndash758 Belgium Brussels Association for Com-putational Linguistics

Hiroki Shimanaka Tomoyuki Kajiwara and MamoruKomachi 2019 Machine Translation Evalu-ation with BERT Regressor arXiv preprintarXiv190712679

2696

Matthew Snover Bonnie Dorr Richard Schwartz Lin-nea Micciulla and John Makhoul 2006 A studyof translation edit rate with targeted human annota-tion In In Proceedings of Association for MachineTranslation in the Americas pages 223ndash231

Lucia Specia Frederic Blain Varvara LogachevaRamon Astudillo and Andre F T Martins 2018Findings of the WMT 2018 shared task on qualityestimation In Proceedings of the Third Conferenceon Machine Translation Shared Task Papers pages689ndash709 Belgium Brussels Association for Com-putational Linguistics

Lucia Specia Kim Harris Frederic Blain AljoschaBurchardt Viviven Macketanz Inguna SkadinaMatteo Negri and Marco Turchi 2017 Transla-tion quality and productivity A study on rich mor-phology languages In Machine Translation SummitXVI pages 55ndash71 Nagoya Japan

Kosuke Takahashi Katsuhito Sudoh and Satoshi Naka-mura 2020 Automatic machine translation evalua-tion using source language inputs and cross-linguallanguage model In Proceedings of the 58th AnnualMeeting of the Association for Computational Lin-guistics pages 3553ndash3558 Online Association forComputational Linguistics

Andre Tattar and Mark Fishel 2017 bleu2vec thepainfully familiar metric on continuous vector spacesteroids In Proceedings of the Second Conferenceon Machine Translation pages 619ndash622 Copen-hagen Denmark Association for ComputationalLinguistics

Ian Tenney Dipanjan Das and Ellie Pavlick 2019BERT rediscovers the classical NLP pipeline InProceedings of the 57th Annual Meeting of the Asso-ciation for Computational Linguistics pages 4593ndash4601 Florence Italy Association for ComputationalLinguistics

Tianyi Zhang Varsha Kishore Felix Wu Kilian QWeinberger and Yoav Artzi 2020 Bertscore Eval-uating text generation with bert In InternationalConference on Learning Representations

Wei Zhao Goran Glavas Maxime Peyrard Yang GaoRobert West and Steffen Eger 2020 On the lim-itations of cross-lingual encoders as exposed byreference-free machine translation evaluation InProceedings of the 58th Annual Meeting of the Asso-ciation for Computational Linguistics pages 1656ndash1671 Online Association for Computational Lin-guistics

Wei Zhao Maxime Peyrard Fei Liu Yang Gao Chris-tian M Meyer and Steffen Eger 2019 MoverScoreText generation evaluating with contextualized em-beddings and earth mover distance In Proceedingsof the 2019 Conference on Empirical Methods inNatural Language Processing and the 9th Interna-tional Joint Conference on Natural Language Pro-cessing (EMNLP-IJCNLP) pages 563ndash578 Hong

Kong China Association for Computational Lin-guistics

2697

A Appendices

In Table 5 we list the hyper-parameters used to trainour models Before initializing these models a ran-dom seed was set to 3 in all libraries that performldquorandomrdquo operations (torch numpy randomand cuda)

2698

Table 5 Hyper-parameters used in our COMET framework to train the presented models

Hyper-parameter COMET(Est-HTERMQM) COMET-RANK

Encoder Model XLM-RoBERTa (base) XLM-RoBERTa (base)Optimizer Adam (default parameters) Adam (default parameters)n frozen epochs 1 0Learning rate 3e-05 and 1e-05 1e-05Batch size 16 16Loss function MSE Triplet Margin (ε = 10)Layer-wise dropout 01 01FP precision 32 32Feed-Forward hidden units 23041152 ndashFeed-Forward activations Tanh ndashFeed-Forward dropout 01 ndash

Table 6 Statistics for the QT21 corpus

en-de en-cs en-lv de-enTotal tuples 54000 42000 35474 41998Avg tokens (reference) 1780 1556 1642 1771Avg tokens (source) 1670 1737 1839 1718Avg tokens (MT) 1765 1564 1642 1778

Table 7 Statistics for the WMT 2017 DARR corpus

en-cs en-de en-fi en-lv en-trTotal tuples 32810 6454 3270 3456 247Avg tokens (reference) 1970 2215 1559 2142 1757Avg tokens (source) 2237 2341 2173 2608 2251Avg tokens (MT) 1945 2258 1606 2218 1725

2699

Tabl

e8

Stat

istic

sfo

rthe

WM

T20

19D

AR

Rin

to-E

nglis

hla

ngua

gepa

irs

de-e

nfi-

engu

-en

kk-e

nlt-

enru

-en

zh-e

nTo

talt

uple

s85

365

3217

920

110

9728

2186

239

852

3107

0Av

gto

kens

(ref

eren

ce)

202

918

55

176

420

36

265

521

74

428

9Av

gto

kens

(sou

rce)

184

412

49

219

216

32

203

218

00

757

Avg

toke

ns(M

T)

202

217

76

170

219

68

252

521

80

397

0

Tabl

e9

Stat

istic

sfo

rthe

WM

T20

19D

AR

Rfr

om-E

nglis

han

dno

-Eng

lish

lang

uage

pair

s

en-c

sen

-de

en-fi

en-g

uen

-kk

en-lt

en-r

uen

-zh

fr-d

ede

-cs

de-f

rTo

talt

uple

s27

178

9984

031

820

1135

518

172

1740

124

334

1865

813

6923

194

4862

Avg

toke

ns(r

efer

ence

)22

92

256

520

12

333

218

89

210

024

79

925

226

822

27

273

2Av

gto

kens

(sou

rce)

249

824

97

252

324

32

237

824

46

244

524

39

286

025

22

213

6Av

gto

kens

(MT

)22

60

249

819

69

329

719

92

209

723

37

683

233

621

89

256

8

2700

Tabl

e10

MQ

Mco

rpus

(sec

tion

33)

stat

istic

s

en-n

len

-sv

en-j

aen

-de

en-r

uen

-es

en-f

ren

-iten

-pt-

bren

-tr

en-p

ten

-es-

lata

mTo

talt

uple

s24

4797

015

9027

5610

4325

914

7481

250

437

091

6Av

gto

kens

(ref

eren

ce)

141

014

24

203

213

78

133

710

90

137

513

61

124

87

9512

18

103

3Av

gto

kens

(sou

rce)

142

315

31

136

913

76

139

411

23

128

514

22

124

610

36

134

512

33

Avg

toke

ns(M

T)

136

613

91

178

413

41

131

910

88

135

913

02

121

97

9912

21

101

7

Tabl

e11

Sta

tistic

sfo

rthe

WM

T20

18D

AR

Rla

ngua

gepa

irs

zh-e

nen

-zh

cs-e

nfi-

enru

-en

tr-e

nde

-en

en-c

sen

-de

en-e

ten

-fien

-ru

en-t

ret

-en

Tota

ltup

les

3335

728

602

5110

1564

810

404

8525

7781

154

1319

711

3220

298

0922

181

1358

5672

1Av

gto

kens

(ref

eren

ce)

288

624

04

219

821

13

249

723

25

232

919

50

235

418

21

163

221

81

201

523

40

Avg

toke

ns(s

ourc

e)23

86

282

718

67

150

321

37

188

021

95

226

724

82

234

722

82

252

424

37

181

5Av

gto

kens

(MT

)27

45

149

421

79

204

625

25

228

022

64

197

323

74

183

717

15

218

619

61

235

2

2701

All 10 8 6 4

02

04

06

Top models en-cs

Ken

dall

Tau

scor

e

All 10 8 6 4minus02

0

02

04

Top models en-de

All 10 8 6 4

02

04

06

Top models en-fi

Ken

dall

Tau

scor

e

All 10 8 6 4

03

04

05

06

Top models en-gu

All 10 8 6 4

0

02

04

06

Top models en-kk

Ken

dall

Tau

scor

e

All 10 8 6 40

02

04

06

Top models en-lt

All 10 8 6 4

02

04

06

Top models en-ru

Ken

dall

Tau

scor

e

01

02

03

04

Top models en-zh

Table 12 Metrics performance over all and the top (108 6 and 4) MT systems for all from-English languagepairs The color scheme is as follows COMET-RANK COMET-HTER COMET-MQM BLEUBERTSCORE

2702

All 10 8 6 4minus01

0

01

02

Top models de-en

Ken

dall

Tau

scor

e

All 10 8 6 4

01

02

03

04

Top models fi-en

All 10 8 6 4

01

02

03

04

Top models lt-en

Ken

dall

Tau

scor

e

All 10 8 6 40

5 middot 10minus2

01

015

02

Top models ru-en

All 10 8 6 4

0

02

04

Top models zh-en

Ken

dall

Tau

scor

e

Table 13 Metrics performance over all and the top (108 6 and 4) MT systems for all into-English languagepairs The color scheme is as follows COMET-RANK COMET-HTER COMET-MQM BLEUBERTSCORE BLEURT

Page 10: COMET: A Neural Framework for MT Evaluation · 2020. 11. 11. · (Lavie and Denkowski,2009) remain popular as a means of evaluating MT systems due to their light-weight and fast computation

2694

Guzman Edouard Grave Myle Ott Luke Zettle-moyer and Veselin Stoyanov 2019 Unsupervisedcross-lingual representation learning at scale arXivpreprint arXiv191102116

Alexis Conneau and Guillaume Lample 2019 Cross-lingual language model pretraining In H Wal-lach H Larochelle A Beygelzimer F dlsquoAlche BucE Fox and R Garnett editors Advances in Neu-ral Information Processing Systems 32 pages 7059ndash7069 Curran Associates Inc

Michael Denkowski and Alon Lavie 2011 Meteor 13Automatic metric for reliable optimization and eval-uation of machine translation systems In Proceed-ings of the Sixth Workshop on Statistical MachineTranslation pages 85ndash91 Edinburgh Scotland As-sociation for Computational Linguistics

Jacob Devlin Ming-Wei Chang Kenton Lee andKristina Toutanova 2019 BERT Pre-training ofdeep bidirectional transformers for language under-standing In Proceedings of the 2019 Conferenceof the North American Chapter of the Associationfor Computational Linguistics Human LanguageTechnologies Volume 1 (Long and Short Papers)pages 4171ndash4186 Minneapolis Minnesota Associ-ation for Computational Linguistics

WA Falcon 2019 PyTorch Lightning The lightweightPyTorch wrapper for high-performance AI researchGitHub

Erick Fonseca Lisa Yankovskaya Andre F T MartinsMark Fishel and Christian Federmann 2019 Find-ings of the WMT 2019 shared tasks on quality esti-mation In Proceedings of the Fourth Conference onMachine Translation (Volume 3 Shared Task PapersDay 2) pages 1ndash10 Florence Italy Association forComputational Linguistics

Yvette Graham Timothy Baldwin Alistair Moffat andJustin Zobel 2013 Continuous measurement scalesin human evaluation of machine translation In Pro-ceedings of the 7th Linguistic Annotation Workshopand Interoperability with Discourse pages 33ndash41Sofia Bulgaria Association for Computational Lin-guistics

Yvette Graham Timothy Baldwin Alistair Moffat andJustin Zobel 2014 Is machine translation gettingbetter over time In Proceedings of the 14th Confer-ence of the European Chapter of the Association forComputational Linguistics pages 443ndash451 Gothen-burg Sweden Association for Computational Lin-guistics

Yvette Graham Timothy Baldwin Alistair Moffat andJustin Zobel 2017 Can machine translation sys-tems be evaluated by the crowd alone Natural Lan-guage Engineering 23(1)330

Yinuo Guo and Junfeng Hu 2019 Meteor++ 20Adopt syntactic level paraphrase knowledge into ma-chine translation evaluation In Proceedings of the

Fourth Conference on Machine Translation (Volume2 Shared Task Papers Day 1) pages 501ndash506 Flo-rence Italy Association for Computational Linguis-tics

Jeremy Howard and Sebastian Ruder 2018 Universallanguage model fine-tuning for text classification InProceedings of the 56th Annual Meeting of the As-sociation for Computational Linguistics (Volume 1Long Papers) pages 328ndash339 Melbourne AustraliaAssociation for Computational Linguistics

Fabio Kepler Jonay Trenous Marcos Treviso MiguelVera Antonio Gois M Amin Farajian Antonio VLopes and Andre F T Martins 2019a Unba-belrsquos participation in the WMT19 translation qual-ity estimation shared task In Proceedings of theFourth Conference on Machine Translation (Volume3 Shared Task Papers Day 2) pages 78ndash84 Flo-rence Italy Association for Computational Linguis-tics

Fabio Kepler Jonay Trenous Marcos Treviso MiguelVera and Andre F T Martins 2019b OpenKiwiAn open source framework for quality estimationIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics SystemDemonstrations pages 117ndash122 Florence Italy As-sociation for Computational Linguistics

Philipp Koehn Hieu Hoang Alexandra Birch ChrisCallison-Burch Marcello Federico Nicola BertoldiBrooke Cowan Wade Shen Christine MoranRichard Zens Chris Dyer Ondrej Bojar AlexandraConstantin and Evan Herbst 2007 Moses Opensource toolkit for statistical machine translation InProceedings of the 45th Annual Meeting of the As-sociation for Computational Linguistics CompanionVolume Proceedings of the Demo and Poster Ses-sions pages 177ndash180 Prague Czech Republic As-sociation for Computational Linguistics

Dan Kondratyuk and Milan Straka 2019 75 lan-guages 1 model Parsing universal dependenciesuniversally In Proceedings of the 2019 Confer-ence on Empirical Methods in Natural LanguageProcessing and the 9th International Joint Confer-ence on Natural Language Processing (EMNLP-IJCNLP) pages 2779ndash2795 Hong Kong China As-sociation for Computational Linguistics

Alon Lavie and Michael Denkowski 2009 The meteormetric for automatic evaluation of machine transla-tion Machine Translation 23105ndash115

Chi-kiu Lo 2019 YiSi - a unified semantic MT qualityevaluation and estimation metric for languages withdifferent levels of available resources In Proceed-ings of the Fourth Conference on Machine Transla-tion (Volume 2 Shared Task Papers Day 1) pages507ndash513 Florence Italy Association for Computa-tional Linguistics

Arle Lommel Aljoscha Burchardt and Hans Uszkoreit2014 Multidimensional quality metrics (MQM) A

2695

framework for declaring and describing translationquality metrics Tradumtica tecnologies de la tra-ducci 0455ndash463

Qingsong Ma Ondrej Bojar and Yvette Graham 2018Results of the WMT18 metrics shared task Bothcharacters and embeddings achieve good perfor-mance In Proceedings of the Third Conference onMachine Translation Shared Task Papers pages671ndash688 Belgium Brussels Association for Com-putational Linguistics

Qingsong Ma Johnny Wei Ondrej Bojar and YvetteGraham 2019 Results of the WMT19 metricsshared task Segment-level and strong MT sys-tems pose big challenges In Proceedings of theFourth Conference on Machine Translation (Volume2 Shared Task Papers Day 1) pages 62ndash90 Flo-rence Italy Association for Computational Linguis-tics

Nitika Mathur Timothy Baldwin and Trevor Cohn2019 Putting evaluation in context Contextualembeddings improve machine translation evaluationIn Proceedings of the 57th Annual Meeting of theAssociation for Computational Linguistics pages2799ndash2808 Florence Italy Association for Compu-tational Linguistics

Tomas Mikolov Ilya Sutskever Kai Chen Greg S Cor-rado and Jeff Dean 2013 Distributed representa-tions of words and phrases and their compositional-ity In Advances in Neural Information ProcessingSystems 26 pages 3111ndash3119 Curran AssociatesInc

Kishore Papineni Salim Roukos Todd Ward and Wei-Jing Zhu 2002 Bleu a method for automatic eval-uation of machine translation In Proceedings ofthe 40th Annual Meeting of the Association for Com-putational Linguistics pages 311ndash318 PhiladelphiaPennsylvania USA Association for ComputationalLinguistics

Jeffrey Pennington Richard Socher and ChristopherManning 2014 Glove Global vectors for word rep-resentation In Proceedings of the 2014 Conferenceon Empirical Methods in Natural Language Process-ing (EMNLP) pages 1532ndash1543 Doha Qatar Asso-ciation for Computational Linguistics

Matthew Peters Mark Neumann Mohit Iyyer MattGardner Christopher Clark Kenton Lee and LukeZettlemoyer 2018 Deep contextualized word rep-resentations In Proceedings of the 2018 Confer-ence of the North American Chapter of the Associ-ation for Computational Linguistics Human Lan-guage Technologies Volume 1 (Long Papers) pages2227ndash2237 New Orleans Louisiana Associationfor Computational Linguistics

Telmo Pires Eva Schlinger and Dan Garrette 2019How multilingual is multilingual BERT In Pro-ceedings of the 57th Annual Meeting of the Asso-ciation for Computational Linguistics pages 4996ndash

5001 Florence Italy Association for Computa-tional Linguistics

Maja Popovic 2015 chrF character n-gram f-scorefor automatic MT evaluation In Proceedings of theTenth Workshop on Statistical Machine Translationpages 392ndash395 Lisbon Portugal Association forComputational Linguistics

Maja Popović. 2017. chrF++: words helping character n-grams. In Proceedings of the Second Conference on Machine Translation, pages 612–618, Copenhagen, Denmark. Association for Computational Linguistics.

Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.

Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

F. Schroff, D. Kalenichenko, and J. Philbin. 2015. FaceNet: A unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 815–823.

Thibault Sellam, Dipanjan Das, and Ankur Parikh. 2020. BLEURT: Learning robust metrics for text generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7881–7892, Online. Association for Computational Linguistics.

Christophe Servan, Alexandre Berard, Zied Elloumi, Hervé Blanchon, and Laurent Besacier. 2016. Word2Vec vs DBnary: Augmenting METEOR using vector representations or lexical resources. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 1159–1168, Osaka, Japan. The COLING 2016 Organizing Committee.

Hiroki Shimanaka, Tomoyuki Kajiwara, and Mamoru Komachi. 2018. RUSE: Regressor using sentence embeddings for automatic machine translation evaluation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 751–758, Belgium, Brussels. Association for Computational Linguistics.

Hiroki Shimanaka, Tomoyuki Kajiwara, and Mamoru Komachi. 2019. Machine Translation Evaluation with BERT Regressor. arXiv preprint arXiv:1907.12679.


Matthew Snover, Bonnie Dorr, Richard Schwartz, Linnea Micciulla, and John Makhoul. 2006. A study of translation edit rate with targeted human annotation. In Proceedings of the Association for Machine Translation in the Americas, pages 223–231.

Lucia Specia, Frédéric Blain, Varvara Logacheva, Ramón Astudillo, and André F. T. Martins. 2018. Findings of the WMT 2018 shared task on quality estimation. In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, pages 689–709, Belgium, Brussels. Association for Computational Linguistics.

Lucia Specia, Kim Harris, Frédéric Blain, Aljoscha Burchardt, Vivien Macketanz, Inguna Skadina, Matteo Negri, and Marco Turchi. 2017. Translation quality and productivity: A study on rich morphology languages. In Machine Translation Summit XVI, pages 55–71, Nagoya, Japan.

Kosuke Takahashi, Katsuhito Sudoh, and Satoshi Nakamura. 2020. Automatic machine translation evaluation using source language inputs and cross-lingual language model. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3553–3558, Online. Association for Computational Linguistics.

Andre Tättar and Mark Fishel. 2017. bleu2vec: the painfully familiar metric on continuous vector space steroids. In Proceedings of the Second Conference on Machine Translation, pages 619–622, Copenhagen, Denmark. Association for Computational Linguistics.

Ian Tenney, Dipanjan Das, and Ellie Pavlick. 2019. BERT rediscovers the classical NLP pipeline. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4593–4601, Florence, Italy. Association for Computational Linguistics.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representations.

Wei Zhao, Goran Glavaš, Maxime Peyrard, Yang Gao, Robert West, and Steffen Eger. 2020. On the limitations of cross-lingual encoders as exposed by reference-free machine translation evaluation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1656–1671, Online. Association for Computational Linguistics.

Wei Zhao, Maxime Peyrard, Fei Liu, Yang Gao, Christian M. Meyer, and Steffen Eger. 2019. MoverScore: Text generation evaluating with contextualized embeddings and earth mover distance. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 563–578, Hong Kong, China. Association for Computational Linguistics.


A Appendices

In Table 5 we list the hyper-parameters used to train our models. Before initializing these models, a random seed was set to 3 in all libraries that perform “random” operations (torch, numpy, random and cuda).
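A minimal sketch of how such global seeding can be done in a PyTorch-based setup; the helper name set_seed is ours and is not part of the released code:

    import random

    import numpy as np
    import torch

    def set_seed(seed: int = 3) -> None:
        # Seed every library that performs "random" operations.
        random.seed(seed)                 # Python's built-in RNG
        np.random.seed(seed)              # numpy RNG
        torch.manual_seed(seed)           # torch CPU RNG
        torch.cuda.manual_seed_all(seed)  # torch CUDA RNGs on all devices

    set_seed(3)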


Table 5: Hyper-parameters used in our COMET framework to train the presented models.

Hyper-parameter              COMET (Est-HTER/MQM)         COMET-RANK
Encoder Model                XLM-RoBERTa (base)           XLM-RoBERTa (base)
Optimizer                    Adam (default parameters)    Adam (default parameters)
Nº frozen epochs             1                            0
Learning rate                3e-05 and 1e-05              1e-05
Batch size                   16                           16
Loss function                MSE                          Triplet Margin (ε = 1.0)
Layer-wise dropout           0.1                          0.1
FP precision                 32                           32
Feed-Forward hidden units    2304, 1152                   –
Feed-Forward activations     Tanh                         –
Feed-Forward dropout         0.1                          –
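To illustrate how the Table 5 settings fit together, the sketch below wires up a feed-forward estimator head with 2304 and 1152 hidden units, Tanh activations and dropout 0.1, together with the Adam optimizer and the two loss functions listed above. This is our own illustrative code, not the released COMET implementation; in particular, the input dimensionality (four pooled 768-dimensional feature vectors) is an assumption.

    import torch
    from torch import nn

    class EstimatorHead(nn.Module):
        # Feed-forward regressor applied to pooled sentence features
        # (hidden sizes, activation and dropout follow Table 5).
        def __init__(self, in_features: int):
            super().__init__()
            self.ff = nn.Sequential(
                nn.Linear(in_features, 2304),
                nn.Tanh(),
                nn.Dropout(0.1),
                nn.Linear(2304, 1152),
                nn.Tanh(),
                nn.Dropout(0.1),
                nn.Linear(1152, 1),
            )

        def forward(self, features: torch.Tensor) -> torch.Tensor:
            return self.ff(features)

    head = EstimatorHead(in_features=4 * 768)                   # assumed feature size for a base (768-dim) encoder
    optimizer = torch.optim.Adam(head.parameters(), lr=3e-05)   # Adam with default betas
    estimator_loss = nn.MSELoss()                               # COMET (Est-HTER/MQM)
    ranking_loss = nn.TripletMarginLoss(margin=1.0)             # COMET-RANK, ε = 1.0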

Table 6: Statistics for the QT21 corpus.

                          en-de    en-cs    en-lv    de-en
Total tuples              54000    42000    35474    41998
Avg. tokens (reference)   17.80    15.56    16.42    17.71
Avg. tokens (source)      16.70    17.37    18.39    17.18
Avg. tokens (MT)          17.65    15.64    16.42    17.78

Table 7: Statistics for the WMT 2017 DARR corpus.

                          en-cs    en-de    en-fi    en-lv    en-tr
Total tuples              32810    6454     3270     3456     247
Avg. tokens (reference)   19.70    22.15    15.59    21.42    17.57
Avg. tokens (source)      22.37    23.41    21.73    26.08    22.51
Avg. tokens (MT)          19.45    22.58    16.06    22.18    17.25


Table 8: Statistics for the WMT 2019 DARR into-English language pairs.

                          de-en    fi-en    gu-en    kk-en    lt-en    ru-en    zh-en
Total tuples              85365    32179    20110    9728     21862    39852    31070
Avg. tokens (reference)   20.29    18.55    17.64    20.36    26.55    21.74    42.89
Avg. tokens (source)      18.44    12.49    21.92    16.32    20.32    18.00    7.57
Avg. tokens (MT)          20.22    17.76    17.02    19.68    25.25    21.80    39.70

Table 9: Statistics for the WMT 2019 DARR from-English and no-English language pairs.

                          en-cs   en-de   en-fi   en-gu   en-kk   en-lt   en-ru   en-zh   fr-de   de-cs   de-fr
Total tuples              27178   99840   31820   11355   18172   17401   24334   18658   1369    23194   4862
Avg. tokens (reference)   22.92   25.65   20.12   33.32   18.89   21.00   24.79   9.25    22.68   22.27   27.32
Avg. tokens (source)      24.98   24.97   25.23   24.32   23.78   24.46   24.45   24.39   28.60   25.22   21.36
Avg. tokens (MT)          22.60   24.98   19.69   32.97   19.92   20.97   23.37   6.83    23.36   21.89   25.68


Table 10: MQM corpus (Section 3.3) statistics.

                          en-nl   en-sv   en-ja   en-de   en-ru   en-es   en-fr   en-it   en-pt-br   en-tr   en-pt   en-es-latam
Total tuples              2447    970     1590    2756    1043    259     1474    812     504        370     91      6
Avg. tokens (reference)   14.10   14.24   20.32   13.78   13.37   10.90   13.75   13.61   12.48      7.95    12.18   10.33
Avg. tokens (source)      14.23   15.31   13.69   13.76   13.94   11.23   12.85   14.22   12.46      10.36   13.45   12.33
Avg. tokens (MT)          13.66   13.91   17.84   13.41   13.19   10.88   13.59   13.02   12.19      7.99    12.21   10.17

Table 11: Statistics for the WMT 2018 DARR language pairs.

                          zh-en   en-zh   cs-en   fi-en   ru-en   tr-en   de-en   en-cs   en-de   en-et   en-fi   en-ru   en-tr   et-en
Total tuples              33357   28602   5110    15648   10404   8525    77811   5413    19711   32202   9809    22181   1358    56721
Avg. tokens (reference)   28.86   24.04   21.98   21.13   24.97   23.25   23.29   19.50   23.54   18.21   16.32   21.81   20.15   23.40
Avg. tokens (source)      23.86   28.27   18.67   15.03   21.37   18.80   21.95   22.67   24.82   23.47   22.82   25.24   24.37   18.15
Avg. tokens (MT)          27.45   14.94   21.79   20.46   25.25   22.80   22.64   19.73   23.74   18.37   17.15   21.86   19.61   23.52


[Plots of Kendall's Tau score against the number of top MT systems considered (All, 10, 8, 6 and 4) for the panels en-cs, en-de, en-fi, en-gu, en-kk, en-lt, en-ru and en-zh.]

Table 12: Metrics performance over all and the top (10, 8, 6 and 4) MT systems for all from-English language pairs. The color scheme is as follows: COMET-RANK, COMET-HTER, COMET-MQM, BLEU, BERTSCORE.
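The Kendall's Tau scores in these plots follow the WMT relative-ranking formulation, tau = (Concordant - Discordant) / (Concordant + Discordant), computed over DARR pairs. A small sketch of that computation (our own; tie handling varies across WMT editions and is simply skipped here):

    def darr_kendall_tau(metric_scores, darr_pairs):
        # metric_scores: dict mapping segment id -> {system name: metric score}
        # darr_pairs: iterable of (segment_id, better_system, worse_system)
        #             tuples derived from the DARR human judgements.
        concordant = discordant = 0
        for seg_id, better, worse in darr_pairs:
            scores = metric_scores[seg_id]
            if scores[better] > scores[worse]:
                concordant += 1
            elif scores[better] < scores[worse]:
                discordant += 1
            # ties between the two systems are ignored in this sketch
        return (concordant - discordant) / (concordant + discordant)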


[Plots of Kendall's Tau score against the number of top MT systems considered (All, 10, 8, 6 and 4) for the panels de-en, fi-en, lt-en, ru-en and zh-en.]

Table 13: Metrics performance over all and the top (10, 8, 6 and 4) MT systems for all into-English language pairs. The color scheme is as follows: COMET-RANK, COMET-HTER, COMET-MQM, BLEU, BERTSCORE, BLEURT.
