
Page 1

Adaptable Automatic Evaluation Metrics for Machine Translation

Lucian Vlad Lita, joint work with Alon Lavie and Monica Rogati

Page 2

Outline

- BLEU and ROUGE metric families
- BLANC: a family of adaptable metrics
  - All common skip n-grams
  - Local n-gram model
  - Overall model
- Experiments and results
- Conclusions
- Future work
- References

Page 3

Automatic Evaluation Metrics

A progression over time, each estimating translation quality (candidate | reference):

- Manual human judgments
- Edit distance (WER)
- Word overlap (PER)
- Metrics based on n-grams:
  - n-gram precision (BLEU)
  - weighted n-grams (NIST)
  - longest common subsequence (Rouge-L)
  - skip 2-grams, i.e. pairs of ordered words (Rouge-S)
- Metrics that integrate additional knowledge, e.g. synonyms and stemming (METEOR)

Page 4

Automatic Evaluation Metrics

Machine translation (MT) evaluation metrics so far:

- Manually created estimators of quality
- Improvements often shown on the same data
- Rigid notion of quality
- Based on existing judgment guidelines

Goal: a trainable evaluation metric.

Page 5

Goal: Trainable MT Metric

- Build on the features used by established metrics (BLEU, ROUGE)
- Extendable with additional features/processing
- Correlate well with human judgments
- Trainable models for different notions of "translation quality", e.g. computer consumption vs. human consumption
- Different features will be more important for different languages and domains

Page 6

The WER Metric

R: the students asked the professor

C: the students talk professor

Word Error Rate = (# of word insertions, deletions, and substitutions) / (# of words in R)

WER transforms the reference (human) translation R into the candidate (machine) translation C using the Levenshtein (edit) distance. In the example above, "asked" is substituted and one "the" is deleted, so WER = 2/5 = 0.4.
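A minimal sketch of WER as word-level Levenshtein distance, reproducing the example above (function and variable names are mine, not from the talk):

```python
def wer(reference: str, candidate: str) -> float:
    """Word Error Rate: edit distance over words, normalized by |R|."""
    r, c = reference.split(), candidate.split()
    # d[i][j] = edits to turn the first i words of r into the first j of c
    d = [[0] * (len(c) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # delete all i words
    for j in range(len(c) + 1):
        d[0][j] = j  # insert all j words
    for i in range(1, len(r) + 1):
        for j in range(1, len(c) + 1):
            substitution = 0 if r[i - 1] == c[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,                 # deletion
                          d[i][j - 1] + 1,                 # insertion
                          d[i - 1][j - 1] + substitution)  # substitution/match
    return d[len(r)][len(c)] / len(r)

print(wer("the students asked the professor",
          "the students talk professor"))  # 2 edits / 5 words = 0.4
```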

Page 7

The PER Metric

Word overlap between the candidate (machine) translation C and the reference (human) translation R, treated as bags of words.

Position-Independent Error Rate:

PER = Σ_{w in C} | count(w in R) − count(w in C) | / (# words in R)

R: the students asked the professor
C: the students talk professor
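A short sketch of PER exactly as written above, as bag-of-words count differences (naming is illustrative):

```python
from collections import Counter

def per(reference: str, candidate: str) -> float:
    """Position-Independent Error Rate over bags of words."""
    r, c = Counter(reference.split()), Counter(candidate.split())
    # sum over word types in C of |count in R - count in C|
    errors = sum(abs(r[w] - c[w]) for w in c)
    return errors / sum(r.values())

print(per("the students asked the professor",
          "the students talk professor"))
# "the": |2-1| = 1, "talk": |0-1| = 1  ->  2 / 5 = 0.4
```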

Page 8

The BLEU Metric

Modified n-gram precisions:
- 1-gram precision = 3/4
- 2-gram precision = 1/3
- …

Contiguous n-gram overlap between reference (human) translation R and candidate (machine) translation C

R: the students asked the professor

C: the students talk professor

BLEU = ( Π_{i=1..n} P_{i-gram} )^{1/n} × (brevity penalty)
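A sketch of the modified (clipped) n-gram precision that reproduces the 3/4 and 1/3 figures above; this is only the per-size precision, not the full BLEU score:

```python
from collections import Counter

def modified_precision(reference: str, candidate: str, n: int) -> float:
    def ngrams(text):
        words = text.split()
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    ref, cand = ngrams(reference), ngrams(candidate)
    # clip each candidate n-gram count by its count in the reference
    clipped = sum(min(count, ref[gram]) for gram, count in cand.items())
    return clipped / sum(cand.values())

R = "the students asked the professor"
C = "the students talk professor"
print(modified_precision(R, C, 1))  # 3/4
print(modified_precision(R, C, 2))  # 1/3
```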

Page 9

The BLEU Metric

BLEU is the most established evaluation metric in MT

- Basic feature: contiguous n-grams of all sizes
- Computes modified precision
- Uses a simple formula to combine all precision scores
- Bigram precision is "as important" as unigram precision
- Brevity penalty acts as quasi recall

Page 10

The Rouge-L Metric

R: the students asked the professor

C: the students talk professor

Longest common subsequence (LCS) of the candidate (machine) translation C and the reference (human) translation R. Here LCS = 3: "the students … professor".

Precision = LCS(C, R) / (# words in C)
Recall = LCS(C, R) / (# words in R)

Rouge-L = harmonic mean(Precision, Recall) = 2PR / (P + R)
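A sketch of Rouge-L built on a standard dynamic-programming LCS; on the example above it gives P = 3/4 and R = 3/5:

```python
def lcs_length(a: list, b: list) -> int:
    # d[i][j] = LCS length of a[:i] and b[:j]
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                d[i][j] = d[i - 1][j - 1] + 1
            else:
                d[i][j] = max(d[i - 1][j], d[i][j - 1])
    return d[len(a)][len(b)]

def rouge_l(reference: str, candidate: str) -> float:
    r, c = reference.split(), candidate.split()
    lcs = lcs_length(r, c)
    precision, recall = lcs / len(c), lcs / len(r)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(rouge_l("the students asked the professor",
              "the students talk professor"))  # 2*(3/4)*(3/5) / (3/4 + 3/5) = 2/3
```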

Page 11

The Rouge-S Metric

R: the students asked the professor

C: the students talk professor

Skip 2-gram overlap between the candidate (machine) translation C and the reference (human) translation R.

Skip2(C) = 6: { "the students", "the talk", "the professor", "students talk", "students professor", "talk professor" }

Skip2(C, R) = 3: { "the students", "the professor", "students professor" }

Page 12

The Rouge-S Metric

R: the students asked the professor

C: the students talk professor

Skip 2-gram overlap between the candidate (machine) translation C and the reference (human) translation R.

Precision = Skip2(C, R) / (|C| choose 2)
Recall = Skip2(C, R) / (|R| choose 2)

Rouge-S = harmonic mean (Precision, Recall)
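A sketch of Rouge-S that reproduces Skip2(C, R) = 3 from the example (naming is mine):

```python
from collections import Counter
from itertools import combinations

def skip_bigrams(words: list) -> Counter:
    # all ordered word pairs, regardless of gap size
    return Counter(combinations(words, 2))

def rouge_s(reference: str, candidate: str) -> float:
    r, c = reference.split(), candidate.split()
    ref_sb, cand_sb = skip_bigrams(r), skip_bigrams(c)
    common = sum((ref_sb & cand_sb).values())   # multiset intersection
    precision = common / sum(cand_sb.values())  # common / (|C| choose 2)
    recall = common / sum(ref_sb.values())      # common / (|R| choose 2)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(rouge_s("the students asked the professor",
              "the students talk professor"))  # common = 3, P = 3/6, R = 3/10
```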

Page 13

The ROUGE Metrics

Rouge-L
- Basic feature: longest common subsequence (LCS)
- Size of the longest common skip n-gram
- Weighted LCS

Rouge-S
- Basic feature: skip bigrams
- Skip bigram gap size is irrelevant
- Limited to n-grams of size 2

Both use the harmonic mean (F1-measure) to combine precision and recall.

Page 14

Is BLEU Trainable?

Can we assign/learn relative importance between P2 and P3?

Simplest model: regression
- Train/test on past MT output [C, R]
- Inputs: P1, P2, P3, … and the brevity penalty

(P1, P2, P3, …, bp) → HJ fluency score

BLEU = ( Π_{i=1..n} P_{i-gram} )^{1/n} × (brevity penalty)
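A minimal sketch of that regression with numpy least squares; the feature rows and fluency scores below are placeholders, not data from the talk:

```python
import numpy as np

# rows: past MT outputs; columns: P1, P2, P3, P4, brevity penalty (placeholders)
X = np.array([
    [0.75, 0.33, 0.20, 0.10, 0.95],
    [0.80, 0.45, 0.30, 0.18, 1.00],
    [0.60, 0.25, 0.10, 0.05, 0.85],
    [0.70, 0.30, 0.15, 0.08, 0.90],
])
y = np.array([3.2, 4.1, 2.5, 3.0])  # human judgment (HJ) fluency scores

# fit weights plus a bias column; the learned weights make the relative
# importance of P2 vs. P3 explicit instead of BLEU's fixed uniform weighting
A = np.c_[X, np.ones(len(X))]
weights, *_ = np.linalg.lstsq(A, y, rcond=None)
print(weights)
```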

Page 15

Is Rouge Trainable?

Simple regression on:
- Size of the longest common skip n-gram
- Number of common skip 2-grams

Second-order parameters (dependencies) make the model non-linear in its inputs:
- Window size (for computational reasons)
- F-measure parameter F (replacing the brevity penalty)

Potential models: iterative methods, hill climbing?

Non-linear model (Bp, |LCS|, Skip2, F, ws) → HJ fluency score

Page 16

The BLANC Metric Family

- Generalization of established evaluation metrics
  - N-gram features used by BLEU and ROUGE
- Trainable parameters
  - Skip n-gram contiguity in C
  - Relative importance of n (i.e. bigrams vs. trigrams)
  - Precision-recall balance
- Adaptable to different translation quality criteria, languages, and domains
- Allows additional processing/features (e.g. METEOR matching)

Page 17

All Common Skip N-grams

C: the one pure student brought the necessary condiments
R: the new student brought the food

Common words matched by position (position in R, position in C): the(0,0), student(2,3), brought(3,4), the(4,5); "the" also matches across positions as the(0,5) and the(4,0).

Common skip n-grams formed from the aligned chain the(0,0), student(2,3), brought(3,4), the(4,5):
# 1-grams: 4   # 2-grams: 6   # 3-grams: 4   # 4-grams: 1

Page 18

All Common Skip N-grams

C: the one pure student brought the necessary condiments
R: the new student brought the food

Instead of counting each common skip n-gram as 1, assign it a score, e.g. score(the(0,0), student(2,3)) for a pair match, and accumulate score(1-grams), score(2-grams), score(3-grams), score(4-grams).

Page 19

All Common Skip N-grams

- Algorithms literature: all common subsequences
  - Listing vs. counting subsequences; here we are interested in counting
  - # common subsequences of size 1, 2, 3, …
- Replace counting with a score over all n-grams of the same size, decomposing as
  Score(w1 … wi, wi+1 … wn) = Score(w1 … wi) · Score(wi+1 … wn)
- BLANC_i(C, R) = f(common i-grams of C and R), as sketched below
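A sketch of the counting step using the standard all-common-subsequences recurrence; note it counts every order-preserving pairing of matched positions, so on the example above it also picks up the two crossed "the" matches at size 1:

```python
from functools import lru_cache

def count_common_skip_ngrams(c: tuple, r: tuple, max_n: int) -> dict:
    @lru_cache(maxsize=None)
    def f(i: int, j: int, k: int) -> int:
        # number of common skip k-grams within c[:i] and r[:j]
        if k == 0:
            return 1
        if i == 0 or j == 0:
            return 0
        # inclusion-exclusion over dropping the last word of c or of r,
        # plus matchings that end exactly at the pair (c[i-1], r[j-1])
        count = f(i - 1, j, k) + f(i, j - 1, k) - f(i - 1, j - 1, k)
        if c[i - 1] == r[j - 1]:
            count += f(i - 1, j - 1, k - 1)
        return count

    return {k: f(len(c), len(r), k) for k in range(1, max_n + 1)}

C = tuple("the one pure student brought the necessary condiments".split())
R = tuple("the new student brought the food".split())
print(count_common_skip_ngrams(C, R, 4))  # {1: 6, 2: 6, 3: 4, 4: 1}
```

In BLANC each counted matching would be weighted by its gap-size scores rather than counted as 1, per the local model below.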

Page 20

Modeling Gap Size Importance

skip 3-grams

… the ____ ____ ____ ____ student ____ ____ has …

… the ____ student has …

… the student has …

Page 21

Modeling Gap Size Importance

Model the importance of skip n-gram gap size as an exponential function with one parameter, here written α.

Special cases:
- Gap size does not matter (Rouge-S): α = 0
- No gaps are allowed (BLEU): α = a large number

C: … the __ __ __ __ student __ __ has …
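A sketch of this weight as exp(−α · total gap size), with α standing in for the parameter symbol dropped from this transcript:

```python
import math

def gap_score(gap_sizes: list, alpha: float) -> float:
    """exp(-alpha * total gap size) for one skip n-gram."""
    return math.exp(-alpha * sum(gap_sizes))

# "the ____ ____ ____ ____ student ____ ____ has": gaps of 4 and 2 words
print(gap_score([4, 2], alpha=0.0))    # 1.0: gap size does not matter (Rouge-S)
print(gap_score([4, 2], alpha=0.5))    # ~0.05: long gaps heavily penalized
print(gap_score([4, 2], alpha=100.0))  # ~0.0: effectively contiguous only (BLEU)
```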

Page 22

Modeling Candidate-Reference Gap Difference

skip 3-gram match

C1: … the ____ ____ ____ ____ student ____ ____ has …

R: … the ____ student has …

C2: … the student has …

Page 23

Modeling Candidate-Reference Gap Difference

Model the importance of the gap size difference between the candidate and reference translations as an exponential function with one parameter, here written β.

Special cases:
- Gap size differences do not matter: β = 0
- Skip 2-gram overlap (Rouge-S): α = 0, β = 0, n = 2
- Largest skip n-gram (Rouge-L): α = 0, β = 0, n = LCS

C: … the __ __ __ __ student __ __ has …
R: … the __ student has …
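The same exponential trick for the candidate-reference gap difference, with β as the stand-in symbol; the gap lists follow the C and R examples above:

```python
import math

def gap_difference_score(cand_gaps: list, ref_gaps: list, beta: float) -> float:
    """exp(-beta * sum of |candidate gap - reference gap|), position by position."""
    diff = sum(abs(cg - rg) for cg, rg in zip(cand_gaps, ref_gaps))
    return math.exp(-beta * diff)

# C: the [4-word gap] student [2-word gap] has
# R: the [1-word gap] student [no gap]     has
print(gap_difference_score([4, 2], [1, 0], beta=0.0))  # 1.0: differences ignored
print(gap_difference_score([4, 2], [1, 0], beta=0.3))  # ~0.22
```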

Page 24

Skip N-gram Model

Incorporate simple scores into an exponential model:
- Skip n-gram gap size
- Candidate-reference gap size difference

Possible to incorporate higher-level features:
- Partial skip n-gram matching (e.g. synonyms, stemming): "the __ students" vs. "the __ pupils", "the __ students" vs. "the __ student"
- From word classing to syntax: e.g. should score("students __ __ professor") differ from score("the __ __ of")?

A sketch of the combined local model follows.
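A minimal sketch of the local model as an exponential combination of per-skip-n-gram features, mirroring the exp(−Σ_i λ_i f_i(sn)) form on the next slide; the feature names and weights here are illustrative:

```python
import math

def skip_ngram_score(features: dict, weights: dict) -> float:
    """exp(-sum_i lambda_i * f_i) over the features of one skip n-gram."""
    return math.exp(-sum(weights[name] * value
                         for name, value in features.items()))

features = {"gap_size": 6.0, "gap_difference": 5.0}  # from the two models above
weights = {"gap_size": 0.5, "gap_difference": 0.3}   # trainable parameters
print(skip_ngram_score(features, weights))  # exp(-(3.0 + 1.5)) ~ 0.011
```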

Page 25

BLANC Overview

Pipeline: Candidates + References → Find All Common Skip N-grams → Compute Skip N-gram Pair Features exp(−Σ_i λ_i f_i(sn)) → Combine All Common Skip N-gram Scores → Trained Metric

- Global parameters: precision/recall balance, f(skip n-gram size)
- Compute correlation coefficient: Pearson, Spearman
- Criterion: adequacy, fluency, f(adequacy, fluency), other

Page 26

Incorporating Global Features

Compute BLANC precision and recall for each n-gram size i.

Global exponential model based on:
- N-gram size i: BLANC_i(C, R), i = 1..n
- F-measure parameter F for each size i
- Average reference segment size
- Other scores (e.g. BLEU, ROUGE-L, ROUGE-S) …

Train for average human judgment vs. train for best overall correlation (as the error function); a sketch of the combination follows.
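One plausible sketch of the global step, assuming per-size F-measures mixed with trainable size weights; the exact functional form in BLANC may differ:

```python
def f_measure(precision: float, recall: float, f: float) -> float:
    """Weighted harmonic mean; f trades off precision against recall."""
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return precision * recall / (f * recall + (1.0 - f) * precision)

def blanc_combine(precisions: list, recalls: list,
                  size_weights: list, f: float) -> float:
    """Mix the per-n-gram-size BLANC_i scores with trainable weights."""
    scores = [f_measure(p, r, f) for p, r in zip(precisions, recalls)]
    return sum(w * s for w, s in zip(size_weights, scores)) / sum(size_weights)

# illustrative values for n-gram sizes 1..4
print(blanc_combine(precisions=[0.75, 0.33, 0.20, 0.10],
                    recalls=[0.60, 0.25, 0.15, 0.08],
                    size_weights=[1.0, 0.8, 0.5, 0.3],
                    f=0.5))
```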

Page 27

Experiment Setup

- TIDES evaluation data: Arabic to English, 2003 and 2004
- Training and test sentences separated by year
- Optimized:
  - n-gram contiguity
  - difference in gap size (C vs. R)
  - balance between precision and recall
- Correlation computed with the Pearson correlation coefficient (see the sketch below)
- Compared BLANC to BLEU and ROUGE
- Trained BLANC for:
  - fluency vs. adequacy
  - system level vs. sentence level
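For the correlation step, scipy's pearsonr is enough; the score lists below are placeholders, not results from the talk:

```python
from scipy.stats import pearsonr

metric_scores = [0.31, 0.42, 0.27, 0.55]  # e.g. BLANC per system or sentence
human_scores = [2.9, 3.6, 2.5, 4.2]       # adequacy or fluency judgments
r, p_value = pearsonr(metric_scores, human_scores)
print(r)  # Pearson correlation in [-1, 1]
```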

Page 28

Tides 2003 Arabic Evaluation

           System Level        Sentence Level
Method     Adequacy  Fluency   Adequacy  Fluency
BLEU       0.950     0.934     0.382     0.286
NIST       0.962     0.939     0.439     0.304
Rouge-L    0.974     0.926     0.440     0.328
Rouge-S    0.949     0.935     0.360     0.328
BLANC      0.988     0.979     0.492     0.391

Pearson [-1,1] correlation with human judgments at system level and sentence level

Page 29

Tides 2004 Arabic Evaluation

           System Level        Sentence Level
Method     Adequacy  Fluency   Adequacy  Fluency
BLEU       0.978     0.994     0.446     0.337
NIST       0.987     0.952     0.529     0.358
Rouge-L    0.981     0.985     0.538     0.412
Rouge-S    0.937     0.980     0.367     0.408
BLANC      0.982     0.994     0.565     0.438

Pearson [-1,1] correlation with human judgments at system level and sentence level

Page 30

Advantages of BLANC

- Consistently good performance
- Candidate evaluation is fast
- Adaptable to:
  - fluency and adequacy
  - languages and domains
- Helps train MT systems for specific tasks, e.g. information extraction, information retrieval
- Model complexity can be optimized for specific MT system performance levels

Page 31

Disadvantages of BLANC

- Training data vs. number of parameters
- Model complexity
- Guarantees of the training process

Page 32

Conclusions

- A move towards learning evaluation metrics, adaptable across:
  - quality criteria, e.g. fluency, adequacy
  - correlation coefficients, e.g. Pearson, Spearman
  - languages, e.g. English, Arabic, Chinese
- BLANC: a family of trainable evaluation metrics that consistently performs well on evaluating machine translation output

Page 33

Future Work

- Recently obtained a two-year NSF grant
- Try different models and improve the training mechanism for BLANC
  - Is a local exponential model the best choice?
  - Is a global exponential model the best choice?
  - Explore different training methods
- Integrate additional features
- Apply BLANC to other tasks (e.g. summarization)

Page 34

References

Leusch, Ueffing, Vilar and Ney, “Preprocessing and Normalization for Automatic Evaluation of Machine Translation.” IEEMTS Workshop, ACL 2005

Lin and Och, “Automatic Evaluation of Machine Translation Quality Using Longest Common Subsequence and Skip-Bigram Statistics”, ACL 2004

Lita, Lavie and Rogati, “BLANC: Learning Evaluation Metrics for MT”, HLT-EMNLP 2005

Papineni, Roukos, Ward and Zhu, “BLEU: A Method for Automatic Evaluation of Machine Translation”, IBM Report 2002

Akiba, Imamura and Sumita, “Using Multiple Edit Distances to Automatically Rank Machine Translation Output”, MT Summit VIII 2001

Su, Wu and Chang, “A New Quantitative Quality Measure for a Machine Translation System”, COLING 1992

Page 35

Thank you

Page 36

Acronyms, acronyms …

Official: Broad Learning Adaptation for Numeric Criteria

Inspiration: white light contains light of all frequencies

Fun: Building on Legacy Acronym Naming Conventions (Bleu, Rouge, Orange, Pourpre … Blanc?)