
Meteor & M-BLEU : Evaluation Metrics for High Correlation with Human Rankings of MT Output


Abhaya Agarwal & Alon Lavie
Language Technologies Institute, Carnegie Mellon University

Parameter Tuning
• The earlier version of the metric was tuned to get good correlations with adequacy- and fluency-style human judgments
• It has been re-tuned to optimize correlation with the human ranking data from last year's WMT shared task

Introduction to METEOR
• Computes a one-to-one word alignment between reference and hypothesis using a matching module
• Uses unigram precision and unigram recall, along with a fragmentation penalty, as score components
• In case of multiple references, the best-scoring reference-hypothesis pair is chosen

Matching Module
• The matcher uses several word-mapping modules to identify all possible word matches:
  • exact: match words with the same surface form
  • porter_stem: match words with the same stem
  • wn_synonymy: match words that share a WordNet synset
• Among the possible alignments, the one with the fewest crossing edges is chosen (a matcher sketch follows below)
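A minimal sketch of the three matching stages, assuming NLTK with the WordNet corpus installed; the function names and the simple collection of candidate pairs are illustrative, not the actual matcher:

```python
from nltk.stem import PorterStemmer
from nltk.corpus import wordnet as wn

stemmer = PorterStemmer()

def exact_match(w1, w2):
    # Same surface form (case-insensitive here for simplicity)
    return w1.lower() == w2.lower()

def stem_match(w1, w2):
    # Same Porter stem
    return stemmer.stem(w1.lower()) == stemmer.stem(w2.lower())

def synonym_match(w1, w2):
    # Words sharing at least one WordNet synset
    return bool(set(wn.synsets(w1.lower())) & set(wn.synsets(w2.lower())))

def candidate_matches(hyp_tokens, ref_tokens):
    """Collect all (hyp_index, ref_index) pairs matched by any module."""
    pairs = []
    for i, h in enumerate(hyp_tokens):
        for j, r in enumerate(ref_tokens):
            if exact_match(h, r) or stem_match(h, r) or synonym_match(h, r):
                pairs.append((i, j))
    return pairs
```

Selecting, among these candidates, the one-to-one alignment with the fewest crossing edges is a separate search step not shown here.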


Matcher Example
The Sri Lanka prime minister criticizes the leader of the country
President of Sri Lanka criticized by prime minister of the country
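Using the illustrative matcher above, the candidate matches for this sentence pair can be collected as follows (which words actually match depends on the stemmer and WordNet):

```python
hyp = "The Sri Lanka prime minister criticizes the leader of the country".split()
ref = "President of Sri Lanka criticized by prime minister of the country".split()

# All candidate (hyp_index, ref_index) word matches found by the three modules,
# e.g. exact matches for "Sri", "Lanka", "prime", "minister", and a stem match
# for "criticizes"/"criticized".
pairs = candidate_matches(hyp, ref)
```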

Score Computation
• Based on the alignment thus produced, unigram precision (P) and unigram recall (R) are computed
• P and R are combined into a parametrized harmonic mean:

  F_mean = (P · R) / (α · P + (1 − α) · R)

• To account for differences in the order of the matched unigrams, a fragmentation ratio (frag) is computed as the number of consecutive matched segments (chunks) divided by the total number of matched unigrams
• The fragmentation penalty and the final score are then computed as follows (see the sketch below):

  Penalty = γ · frag^β
  Score = (1 − Penalty) · F_mean
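A minimal sketch of this score combination in Python; the parameter values shown are illustrative defaults, not the tuned values reported in the paper:

```python
def meteor_style_score(p, r, chunks, matches, alpha=0.9, beta=3.0, gamma=0.5):
    """Combine unigram precision/recall and fragmentation into a METEOR-style score.

    p, r    : unigram precision and recall of the alignment
    chunks  : number of consecutive matched segments
    matches : total number of matched unigrams
    alpha, beta, gamma : free parameters (illustrative values, assumed)
    """
    if p == 0.0 or r == 0.0:
        return 0.0
    f_mean = (p * r) / (alpha * p + (1.0 - alpha) * r)
    frag = chunks / matches            # fragmentation ratio
    penalty = gamma * (frag ** beta)   # fragmentation penalty
    return (1.0 - penalty) * f_mean
```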

Objective
• MT evaluation has become integral to the development of SMT systems, which can be tuned directly towards the evaluation metrics
• To be tuned against, a metric should be fast and easy to compute
• In this work, we seek to improve already established metrics using simple techniques that are not expensive to compute

Computing Ranking Correlation
• Convert binary judgements into full rankings:
  • Build a directed graph with nodes representing individual hypotheses and edges representing binary judgements
  • Topologically sort the graph
• For one source sentence with N hypotheses, compute the Spearman correlation between the metric ranking and the human ranking (a sketch follows below)
• Average across all source sentences
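A minimal sketch of this procedure, not the authors' code: the pairwise judgements are linearized with a topological sort and then correlated with the metric's ranking via scipy (assumed available):

```python
from graphlib import TopologicalSorter   # Python 3.9+
from scipy.stats import spearmanr

def human_ranking(hypotheses, better_than):
    """better_than: iterable of (winner, loser) binary judgements."""
    preds = {h: set() for h in hypotheses}
    for winner, loser in better_than:
        preds[loser].add(winner)          # the winner must come before the loser
    order = TopologicalSorter(preds).static_order()
    return {h: rank for rank, h in enumerate(order)}

def sentence_correlation(hypotheses, better_than, metric_scores):
    """Spearman correlation for one source sentence; averaged over all sentences."""
    ranks = human_ranking(hypotheses, better_than)
    human = [ranks[h] for h in hypotheses]
    metric = [-metric_scores[h] for h in hypotheses]   # higher score -> better rank
    return spearmanr(human, metric).correlation
```

Note that with sparse pairwise judgements the topological order is only one linearization consistent with the judgements.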

Results
3-fold cross-validation results on the WMT 2007 data (average Spearman correlation):

              Original   Re-Tuned
  English      0.3813     0.4020
  German       0.2166     0.2838
  French       0.2992     0.3640
  Spanish      0.2021     0.2186

Results on the WMT 2008 data (fraction of correctly predicted binary judgements):

              Original   Re-Tuned
  English      0.5120     0.5120
  German       0.4250     0.2600
  French       0.6250     0.7120
  Spanish      0.2950     0.2700

Flexible Matching for BLEU
• The flexible matching in METEOR can be used to extend any metric that needs word-to-word matching
• Compute the alignment between reference and hypothesis using the Meteor matcher
• Re-write the reference by replacing matched words with the words from the hypothesis
• Compute BLEU with the new references (a sketch follows below)
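A minimal M-BLEU sketch, not the authors' implementation: it reuses the illustrative candidate_matches() from above, takes a simple greedy one-to-one alignment instead of the fewest-crossing-edges search, and scores with NLTK's sentence-level BLEU:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def m_bleu(hyp_tokens, ref_tokens):
    # Greedy one-to-one alignment over the candidate matches (a simplification).
    alignment, used_refs = {}, set()
    for i, j in candidate_matches(hyp_tokens, ref_tokens):
        if i not in alignment and j not in used_refs:
            alignment[i] = j
            used_refs.add(j)
    # Re-write the reference: matched reference words take the hypothesis surface form.
    new_ref = list(ref_tokens)
    for i, j in alignment.items():
        new_ref[j] = hyp_tokens[i]
    # Standard BLEU against the re-written reference (smoothing for short segments).
    smooth = SmoothingFunction().method1
    return sentence_bleu([new_ref], hyp_tokens, smoothing_function=smooth)
```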

Results
Average BLEU and M-BLEU scores on WMT 2008 data:

              BLEU      M-BLEU
  English      0.2111    0.2860
  German       0.1383    0.1900
  French       0.2004    0.2548
  Spanish      0.2288    0.2792

Parameter Tuning
• The 3 free parameters in the metric (α, β, γ) are tuned to obtain maximum correlation with human judgements
• Since the ranges of the parameters are bounded, an exhaustive (grid) search is used (a sketch follows below)
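An illustrative exhaustive grid search over bounded parameter ranges; the ranges, step sizes, and the correlation_fn interface are assumptions for the sketch, not the paper's setup:

```python
import itertools

def grid(lo, hi, step):
    """Inclusive list of grid points between lo and hi."""
    n = int(round((hi - lo) / step))
    return [round(lo + i * step, 3) for i in range(n + 1)]

def tune(segments, correlation_fn):
    """Exhaustive search; correlation_fn(segments, alpha, beta, gamma) returns
    the segment-level correlation with human judgements for one parameter setting."""
    best_corr, best_params = float("-inf"), None
    for a, b, g in itertools.product(grid(0.0, 1.0, 0.05),   # alpha
                                     grid(0.0, 5.0, 0.25),   # beta (assumed bound)
                                     grid(0.0, 1.0, 0.05)):  # gamma
        corr = correlation_fn(segments, a, b, g)
        if corr > best_corr:
            best_corr, best_params = corr, (a, b, g)
    return best_params, best_corr
```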

• No consistent gains across languages were seen in segment-level correlations on the WMT 2007 data
• Similar mixed patterns are seen in the WMT 2008 data as well (as reported in [Callison-Burch et al., 2008])