
Page 1

Adaptable Automatic Evaluation Metrics for Machine Translation

Lucian Vlad Lita, joint work with Alon Lavie and Monica Rogati

Page 2

Outline

- BLEU and ROUGE metric families
- BLANC: a family of adaptable metrics
  - All common skip n-grams
  - Local n-gram model
  - Overall model
- Experiments and results
- Conclusions
- Future work
- References

Page 3

Automatic Evaluation Metrics

A progression over time, each estimating translation quality (candidate | reference):

- Manual human judgments
- Edit distance (WER)
- Word overlap (PER)
- Metrics based on n-grams:
  - n-gram precision (BLEU)
  - weighted n-grams (NIST)
  - longest common subsequence (Rouge-L)
  - skip 2-grams, i.e. pairs of ordered words (Rouge-S)
- Metrics that integrate additional knowledge, e.g. synonyms and stemming (METEOR)

Page 4

Automatic Evaluation Metrics

Machine translation (MT) evaluation metrics so far:

- Manually created estimators of quality
- Improvements often shown on the same data
- Rigid notion of quality
- Based on existing judgment guidelines

Goal: a trainable evaluation metric.

Page 5

Goal: Trainable MT Metric

- Build on the features used by established metrics (BLEU, ROUGE)
- Extendable with additional features/processing
- Correlate well with human judgments
- Trainable models for different notions of "translation quality", e.g. computer consumption vs. human consumption
- Different features will be more important for different languages and domains

Page 6

The WER Metric

R: the students asked the professor

C: the students talk professor

Word Error Rate = (# of word insertions, deletions, and substitutions) / (# of words in R)

WER transforms the reference (human) translation R into the candidate (machine) translation C using the Levenshtein (edit) distance. In the example above, "asked" is substituted and one "the" is deleted, so WER = 2/5 = 0.4.
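A minimal sketch of WER as word-level Levenshtein distance, reproducing the example above (function and variable names are mine, not from the talk):

```python
def wer(reference: str, candidate: str) -> float:
    """Word Error Rate: edit distance over words, normalized by |R|."""
    r, c = reference.split(), candidate.split()
    # d[i][j] = edits to turn the first i words of r into the first j of c
    d = [[0] * (len(c) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # delete all i words
    for j in range(len(c) + 1):
        d[0][j] = j  # insert all j words
    for i in range(1, len(r) + 1):
        for j in range(1, len(c) + 1):
            substitution = 0 if r[i - 1] == c[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,                 # deletion
                          d[i][j - 1] + 1,                 # insertion
                          d[i - 1][j - 1] + substitution)  # substitution/match
    return d[len(r)][len(c)] / len(r)

print(wer("the students asked the professor",
          "the students talk professor"))  # 2 edits / 5 words = 0.4
```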

Page 7

The PER Metric

Word overlap between the candidate (machine) translation C and the reference (human) translation R, treated as bags of words.

Position-Independent Error Rate:

PER = Σ_{w in C} | count(w in R) − count(w in C) | / (# words in R)

R: the students asked the professor
C: the students talk professor
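A short sketch of PER exactly as written above, as bag-of-words count differences (naming is illustrative):

```python
from collections import Counter

def per(reference: str, candidate: str) -> float:
    """Position-Independent Error Rate over bags of words."""
    r, c = Counter(reference.split()), Counter(candidate.split())
    # sum over word types in C of |count in R - count in C|
    errors = sum(abs(r[w] - c[w]) for w in c)
    return errors / sum(r.values())

print(per("the students asked the professor",
          "the students talk professor"))
# "the": |2-1| = 1, "talk": |0-1| = 1  ->  2 / 5 = 0.4
```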

Page 8

The BLEU Metric

Modified n-gram precisions:
- 1-gram precision = 3/4
- 2-gram precision = 1/3
- …

Contiguous n-gram overlap between reference (human) translation R and candidate (machine) translation C

R: the students asked the professor

C: the students talk professor

BLEU = ( Π_{i=1..n} P_{i-gram} )^{1/n} × (brevity penalty)
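A sketch of the modified (clipped) n-gram precision that reproduces the 3/4 and 1/3 figures above; this is only the per-size precision, not the full BLEU score:

```python
from collections import Counter

def modified_precision(reference: str, candidate: str, n: int) -> float:
    def ngrams(text):
        words = text.split()
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    ref, cand = ngrams(reference), ngrams(candidate)
    # clip each candidate n-gram count by its count in the reference
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())

R = "the students asked the professor"
C = "the students talk professor"
print(modified_precision(R, C, 1))  # 3/4
print(modified_precision(R, C, 2))  # 1/3
```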

Page 9

The BLEU Metric

BLEU is the most established evaluation metric in MT

- Basic feature: contiguous n-grams of all sizes
- Computes modified precision
- Uses a simple formula to combine all precision scores
- Bigram precision is "as important" as unigram precision
- Brevity penalty acts as quasi recall

Page 10

The Rouge-L Metric

R: the students asked the professor

C: the students talk professor

Longest common subsequence (LCS) of the candidate (machine) translation C and the reference (human) translation R. Here LCS = 3: "the students … professor".

Precision = LCS(C, R) / (# words in C)
Recall = LCS(C, R) / (# words in R)

Rouge-L = harmonic mean(Precision, Recall) = 2PR / (P + R)
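A sketch of Rouge-L built on a standard dynamic-programming LCS; on the example above it gives P = 3/4 and R = 3/5:

```python
def lcs_length(a: list, b: list) -> int:
    # d[i][j] = LCS length of a[:i] and b[:j]
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                d[i][j] = d[i - 1][j - 1] + 1
            else:
                d[i][j] = max(d[i - 1][j], d[i][j - 1])
    return d[len(a)][len(b)]

def rouge_l(reference: str, candidate: str) -> float:
    r, c = reference.split(), candidate.split()
    lcs = lcs_length(r, c)
    precision, recall = lcs / len(c), lcs / len(r)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(rouge_l("the students asked the professor",
              "the students talk professor"))  # 2*(3/4)*(3/5) / (3/4 + 3/5) = 2/3
```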

Page 11

The Rouge-S Metric

R: the students asked the professor

C: the students talk professor

Skip 2-gram overlap between the candidate (machine) translation C and the reference (human) translation R.

Skip2(C) = 6: { "the students", "the talk", "the professor", "students talk", "students professor", "talk professor" }

Skip2(C, R) = 3: { "the students", "the professor", "students professor" }

Page 12

The Rouge-S Metric

R: the students asked the professor

C: the students talk professor

Skip 2-gram overlap between the candidate (machine) translation C and the reference (human) translation R.

Precision = Skip2(C, R) / (|C| choose 2)
Recall = Skip2(C, R) / (|R| choose 2)

Rouge-S = harmonic mean (Precision, Recall)
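A sketch of Rouge-S that reproduces Skip2(C, R) = 3 from the example (naming is mine):

```python
from collections import Counter
from itertools import combinations

def skip_bigrams(words: list) -> Counter:
    # all ordered word pairs, regardless of gap size
    return Counter(combinations(words, 2))

def rouge_s(reference: str, candidate: str) -> float:
    r, c = reference.split(), candidate.split()
    ref_sb, cand_sb = skip_bigrams(r), skip_bigrams(c)
    common = sum((ref_sb & cand_sb).values())   # multiset intersection
    precision = common / sum(cand_sb.values())  # common / (|C| choose 2)
    recall = common / sum(ref_sb.values())      # common / (|R| choose 2)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(rouge_s("the students asked the professor",
              "the students talk professor"))  # common = 3, P = 3/6, R = 3/10
```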

Page 13

The ROUGE Metrics

Rouge-L
- Basic feature: longest common subsequence (LCS)
- Size of the longest common skip n-gram
- Weighted LCS

Rouge-S
- Basic feature: skip bigrams
- Skip bigram gap size is irrelevant
- Limited to n-grams of size 2

Both use the harmonic mean (F1-measure) to combine precision and recall.

Page 14

Is BLEU Trainable?

Can we assign/learn relative importance between P2 and P3?

Simplest model: regression
- Train/test on past MT output [C, R]
- Inputs: P1, P2, P3, … and the brevity penalty

(P1, P2, P3, …, bp) → HJ fluency score

BLEU = ( Π_{i=1..n} P_{i-gram} )^{1/n} × (brevity penalty)
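A minimal sketch of that regression with numpy least squares; the feature rows and fluency scores below are placeholders, not data from the talk:

```python
import numpy as np

# rows: past MT outputs; columns: P1, P2, P3, P4, brevity penalty (placeholders)
X = np.array([
    [0.75, 0.33, 0.20, 0.10, 0.95],
    [0.80, 0.45, 0.30, 0.18, 1.00],
    [0.60, 0.25, 0.10, 0.05, 0.85],
    [0.70, 0.30, 0.15, 0.08, 0.90],
])
y = np.array([3.2, 4.1, 2.5, 3.0])  # human judgment (HJ) fluency scores

# fit weights plus a bias column; the learned weights make the relative
# importance of P2 vs. P3 explicit instead of BLEU's fixed uniform weighting
A = np.c_[X, np.ones(len(X))]
weights, *_ = np.linalg.lstsq(A, y, rcond=None)
print(weights)
```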

Page 15

Is Rouge Trainable?

Simple regression on:
- Size of the longest common skip n-gram
- Number of common skip 2-grams

Second-order parameters (dependencies) make the model non-linear in its inputs:
- Window size (for computational reasons)
- F-measure parameter F (replacing the brevity penalty)

Potential models: iterative methods, hill climbing?

Non-linear model (Bp, |LCS|, Skip2, F, ws) → HJ fluency score

Page 16

The BLANC Metric Family

- Generalization of established evaluation metrics
  - N-gram features used by BLEU and ROUGE
- Trainable parameters
  - Skip n-gram contiguity in C
  - Relative importance of n (i.e. bigrams vs. trigrams)
  - Precision-recall balance
- Adaptable to different translation quality criteria, languages, and domains
- Allows additional processing/features (e.g. METEOR matching)

Page 17

All Common Skip N-grams

C: the one pure student brought the necessary condiments
R: the new student brought the food

Common words matched by position (position in R, position in C): the(0,0), student(2,3), brought(3,4), the(4,5); "the" also matches across positions as the(0,5) and the(4,0).

Common skip n-grams formed from the aligned chain the(0,0), student(2,3), brought(3,4), the(4,5):
# 1-grams: 4   # 2-grams: 6   # 3-grams: 4   # 4-grams: 1

Page 18

All Common Skip N-grams

C: the one pure student brought the necessary condiments
R: the new student brought the food

Instead of counting each common skip n-gram as 1, assign it a score, e.g. score(the(0,0), student(2,3)) for a pair match, and accumulate score(1-grams), score(2-grams), score(3-grams), score(4-grams).

Page 19

All Common Skip N-grams

- Algorithms literature: all common subsequences
  - Listing vs. counting subsequences; here we are interested in counting
  - # common subsequences of size 1, 2, 3, …
- Replace counting with a score over all n-grams of the same size, decomposing as
  Score(w1 … wi, wi+1 … wn) = Score(w1 … wi) · Score(wi+1 … wn)
- BLANC_i(C, R) = f(common i-grams of C and R), as sketched below
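A sketch of the counting step using the standard all-common-subsequences recurrence; note it counts every order-preserving pairing of matched positions, so on the example above it also picks up the two crossed "the" matches at size 1:

```python
from functools import lru_cache

def count_common_skip_ngrams(c: tuple, r: tuple, max_n: int) -> dict:
    @lru_cache(maxsize=None)
    def f(i: int, j: int, k: int) -> int:
        # number of common skip k-grams within c[:i] and r[:j]
        if k == 0:
            return 1
        if i == 0 or j == 0:
            return 0
        # inclusion-exclusion over dropping the last word of c or of r,
        # plus matchings that end exactly at the pair (c[i-1], r[j-1])
        count = f(i - 1, j, k) + f(i, j - 1, k) - f(i - 1, j - 1, k)
        if c[i - 1] == r[j - 1]:
            count += f(i - 1, j - 1, k - 1)
        return count

    return {k: f(len(c), len(r), k) for k in range(1, max_n + 1)}

C = tuple("the one pure student brought the necessary condiments".split())
R = tuple("the new student brought the food".split())
print(count_common_skip_ngrams(C, R, 4))  # {1: 6, 2: 6, 3: 4, 4: 1}
```

In BLANC each counted matching would be weighted by its gap-size scores rather than counted as 1, per the local model below.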

Page 20

Modeling Gap Size Importance

skip 3-grams

… the ____ ____ ____ ____ student ____ ____ has …

… the ____ student has …

… the student has …

Page 21

Modeling Gap Size Importance

Model the importance of skip n-gram gap size as an exponential function with one parameter, here written α.

Special cases:
- Gap size does not matter (Rouge-S): α = 0
- No gaps are allowed (BLEU): α = a large number

C: … the __ __ __ __ student __ __ has …
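A sketch of this weight as exp(−α · total gap size), with α standing in for the parameter symbol dropped from this transcript:

```python
import math

def gap_score(gap_sizes: list, alpha: float) -> float:
    """exp(-alpha * total gap size) for one skip n-gram."""
    return math.exp(-alpha * sum(gap_sizes))

# "the ____ ____ ____ ____ student ____ ____ has": gaps of 4 and 2 words
print(gap_score([4, 2], alpha=0.0))    # 1.0: gap size does not matter (Rouge-S)
print(gap_score([4, 2], alpha=0.5))    # ~0.05: long gaps heavily penalized
print(gap_score([4, 2], alpha=100.0))  # ~0.0: effectively contiguous only (BLEU)
```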

Page 22

Modeling Candidate-Reference Gap Difference

skip 3-gram match

C1: … the ____ ____ ____ ____ student ____ ____ has …

R: … the ____ student has …

C2: … the student has …

Page 23

Modeling Candidate-Reference Gap Difference

Model the importance of the gap size difference between the candidate and reference translations as an exponential function with one parameter, here written β.

Special cases:
- Gap size differences do not matter: β = 0
- Skip 2-gram overlap (Rouge-S): α = 0, β = 0, n = 2
- Largest skip n-gram (Rouge-L): α = 0, β = 0, n = LCS

C: … the __ __ __ __ student __ __ has …
R: … the __ student has …
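The same exponential trick for the candidate-reference gap difference, with β as the stand-in symbol; the gap lists follow the C and R examples above:

```python
import math

def gap_difference_score(cand_gaps: list, ref_gaps: list, beta: float) -> float:
    """exp(-beta * sum of |candidate gap - reference gap|), position by position."""
    diff = sum(abs(cg - rg) for cg, rg in zip(cand_gaps, ref_gaps))
    return math.exp(-beta * diff)

# C: the [4-word gap] student [2-word gap] has
# R: the [1-word gap] student [no gap]     has
print(gap_difference_score([4, 2], [1, 0], beta=0.0))  # 1.0: differences ignored
print(gap_difference_score([4, 2], [1, 0], beta=0.3))  # ~0.22
```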

Page 24

Skip N-gram Model

Incorporate simple scores into an exponential model:
- Skip n-gram gap size
- Candidate-reference gap size difference

Possible to incorporate higher-level features:
- Partial skip n-gram matching (e.g. synonyms, stemming): "the __ students" vs. "the __ pupils", "the __ students" vs. "the __ student"
- From word classing to syntax: e.g. should score("students __ __ professor") differ from score("the __ __ of")?

A sketch of the combined local model follows.
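A minimal sketch of the local model as an exponential combination of per-skip-n-gram features, mirroring the exp(−Σ_i λ_i f_i(sn)) form on the next slide; the feature names and weights here are illustrative:

```python
import math

def skip_ngram_score(features: dict, weights: dict) -> float:
    """exp(-sum_i lambda_i * f_i) over the features of one skip n-gram."""
    return math.exp(-sum(weights[name] * value
                         for name, value in features.items()))

features = {"gap_size": 6.0, "gap_difference": 5.0}  # from the two models above
weights = {"gap_size": 0.5, "gap_difference": 0.3}   # trainable parameters
print(skip_ngram_score(features, weights))  # exp(-(3.0 + 1.5)) ~ 0.011
```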

Page 25

BLANC Overview

Pipeline: Candidates + References → Find All Common Skip N-grams → Compute Skip N-gram Pair Features exp(−Σ_i λ_i f_i(sn)) → Combine All Common Skip N-gram Scores → Trained Metric

- Global parameters: precision/recall balance, f(skip n-gram size)
- Compute correlation coefficient: Pearson, Spearman
- Criterion: adequacy, fluency, f(adequacy, fluency), other

Page 26

Incorporating Global Features

Compute BLANC precision and recall for each n-gram size i.

Global exponential model based on:
- N-gram size i: BLANC_i(C, R), i = 1..n
- F-measure parameter F for each size i
- Average reference segment size
- Other scores (e.g. BLEU, ROUGE-L, ROUGE-S) …

Train for average human judgment vs. train for best overall correlation (as the error function); a sketch of the combination follows.
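One plausible sketch of the global step, assuming per-size F-measures mixed with trainable size weights; the exact functional form in BLANC may differ:

```python
def f_measure(precision: float, recall: float, f: float) -> float:
    """Weighted harmonic mean; f trades off precision against recall."""
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return precision * recall / (f * recall + (1.0 - f) * precision)

def blanc_combine(precisions: list, recalls: list,
                  size_weights: list, f: float) -> float:
    """Mix the per-n-gram-size BLANC_i scores with trainable weights."""
    scores = [f_measure(p, r, f) for p, r in zip(precisions, recalls)]
    return sum(w * s for w, s in zip(size_weights, scores)) / sum(size_weights)

# illustrative values for n-gram sizes 1..4
print(blanc_combine(precisions=[0.75, 0.33, 0.20, 0.10],
                    recalls=[0.60, 0.25, 0.15, 0.08],
                    size_weights=[1.0, 0.8, 0.5, 0.3],
                    f=0.5))
```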

Page 27

Experiment Setup

- TIDES evaluation data: Arabic to English, 2003 and 2004
- Training and test sentences separated by year
- Optimized:
  - n-gram contiguity
  - difference in gap size (C vs. R)
  - balance between precision and recall
- Correlation computed with the Pearson correlation coefficient (see the sketch below)
- Compared BLANC to BLEU and ROUGE
- Trained BLANC for:
  - fluency vs. adequacy
  - system level vs. sentence level
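For the correlation step, scipy's pearsonr is enough; the score lists below are placeholders, not results from the talk:

```python
from scipy.stats import pearsonr

metric_scores = [0.31, 0.42, 0.27, 0.55]  # e.g. BLANC per system or sentence
human_scores = [2.9, 3.6, 2.5, 4.2]       # adequacy or fluency judgments
r, p_value = pearsonr(metric_scores, human_scores)
print(r)  # Pearson correlation in [-1, 1]
```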

Page 28

Tides 2003 Arabic Evaluation

           System Level        Sentence Level
Method     Adequacy  Fluency   Adequacy  Fluency
BLEU       0.950     0.934     0.382     0.286
NIST       0.962     0.939     0.439     0.304
Rouge-L    0.974     0.926     0.440     0.328
Rouge-S    0.949     0.935     0.360     0.328
BLANC      0.988     0.979     0.492     0.391

Pearson [-1,1] correlation with human judgments at system level and sentence level

Page 29

Tides 2004 Arabic Evaluation

           System Level        Sentence Level
Method     Adequacy  Fluency   Adequacy  Fluency
BLEU       0.978     0.994     0.446     0.337
NIST       0.987     0.952     0.529     0.358
Rouge-L    0.981     0.985     0.538     0.412
Rouge-S    0.937     0.980     0.367     0.408
BLANC      0.982     0.994     0.565     0.438

Pearson [-1,1] correlation with human judgments at system level and sentence level

Page 30

Advantages of BLANC

- Consistently good performance
- Candidate evaluation is fast
- Adaptable to:
  - fluency and adequacy
  - languages and domains
- Helps train MT systems for specific tasks, e.g. information extraction, information retrieval
- Model complexity can be optimized for specific MT system performance levels

Page 31

Disadvantages of BLANC

- Training data vs. number of parameters
- Model complexity
- Guarantees of the training process

Page 32

Conclusions

- A move towards learning evaluation metrics, adaptable across:
  - quality criteria, e.g. fluency, adequacy
  - correlation coefficients, e.g. Pearson, Spearman
  - languages, e.g. English, Arabic, Chinese
- BLANC: a family of trainable evaluation metrics that consistently performs well on evaluating machine translation output

Page 33

Future Work

- Recently obtained a two-year NSF grant
- Try different models and improve the training mechanism for BLANC
  - Is a local exponential model the best choice?
  - Is a global exponential model the best choice?
  - Explore different training methods
- Integrate additional features
- Apply BLANC to other tasks (e.g. summarization)

Page 34

References

Leusch, Ueffing, Vilar and Ney, “Preprocessing and Normalization for Automatic Evaluation of Machine Translation.” IEEMTS Workshop, ACL 2005

Lin and Och, “Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics”, ACL 2004

Lita, Lavie and Rogati, “BLANC: Learning Evaluation Metrics for MT”, HLT-EMNLP 2005

Papineni, Roukos, Ward and Zhu, “BLEU: A Method for Automatic Evaluation of Machine Translation”, IBM Report 2002

Akiba, Imamura and Sumita, “Using Multiple Edit Distances to Automatically Rank Machine Translation Output”, MT Summit VIII 2001

Su, Wu and Chang, “A New Quantitative Quality Measure for a Machine Translation System”, COLING 1992

Page 35

Thank you

Page 36

Acronyms, acronyms …

Official: Broad Learning Adaptation for Numeric Criteria

Inspiration: white light contains light of all frequencies

Fun: Building on Legacy Acronym Naming Conventions (Bleu, Rouge, Orange, Pourpre … Blanc?)