Meteor & M-BLEU: Evaluation Metrics for High Correlation with Human Rankings of MT Output

Abhaya Agarwal & Alon Lavie
Language Technologies Institute, Carnegie Mellon University
Parameter Tuning
• The earlier metric was tuned to get good correlations with adequacy- and fluency-style human judgments
• Re-tuned to optimize correlation with human ranking data from last year's WMT shared task
Introduction to METEOR
• Computes a one-to-one word alignment between reference and hypothesis using a matching module
• Uses unigram precision and unigram recall along with a fragmentation penalty as score components
• In case of multiple references, the best-scoring reference-hypothesis pair is chosen
Matching Module
• The matcher uses several word-mapping modules to identify all possible word matches:
  • exact: match words with the same surface form
  • porter_stem: match words with the same stem
  • wn_synonymy: match words belonging to the same WordNet synset
• Among the possible alignments, compute the one with the fewest crossing edges
Matcher Example
• The Sri Lanka prime minister criticizes the leader of the country
• President of Sri Lanka criticized by prime minister of the country
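As a rough illustration, here is a Python sketch of the candidate-matching stage, using NLTK's Porter stemmer and WordNet interface as stand-ins for the matcher's modules (an assumption; it requires nltk with the wordnet corpus downloaded). It enumerates all possible word matches and shows how crossing edges can be counted; the full search for the minimum-crossing alignment is omitted.

```python
# Illustrative sketch only: NLTK's stemmer and WordNet stand in for
# the matcher's exact / porter_stem / wn_synonymy modules.
from itertools import combinations, product
from nltk.corpus import wordnet
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def synonyms(word):
    """All lemma names sharing a WordNet synset with `word`."""
    return {l.name().lower() for s in wordnet.synsets(word) for l in s.lemmas()}

def candidate_matches(hypothesis, reference):
    """All possible word matches as (hyp_index, ref_index, module)."""
    out = []
    for (i, h), (j, r) in product(enumerate(hypothesis), enumerate(reference)):
        if h == r:
            out.append((i, j, "exact"))
        elif stemmer.stem(h) == stemmer.stem(r):
            out.append((i, j, "porter_stem"))
        elif synonyms(h) & synonyms(r):
            out.append((i, j, "wn_synonymy"))
    return out

def crossings(alignment):
    """Number of crossing edge pairs in an alignment of (i, j) pairs;
    the matcher prefers the alignment minimizing this count."""
    return sum(1 for (i1, j1), (i2, j2) in combinations(alignment, 2)
               if (i1 - i2) * (j1 - j2) < 0)
```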
Score Computation
• Based on the alignment thus produced, unigram precision (P) and unigram recall (R) are computed
• P and R are combined into a parametrized harmonic mean:

  F_mean = (P · R) / (α · P + (1 − α) · R)

• To account for differences in the order of the matched unigrams, a fragmentation penalty is computed from the ratio of the number of consecutive matched segments (chunks) to the total number of matched unigrams
• The fragmentation penalty and the final score are computed as:

  Penalty = γ · frag^β
  Score = (1 − Penalty) · F_mean
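The score combination is simple enough to sketch directly. The following is a minimal Python rendering of the formulas above, assuming the matcher has already produced the alignment statistics; the default parameter values are illustrative placeholders, not the tuned values from the paper.

```python
def meteor_score(matches, hyp_len, ref_len, chunks,
                 alpha=0.9, beta=3.0, gamma=0.5):
    """matches: number of aligned unigrams; chunks: number of
    consecutive matched segments in the alignment.
    alpha/beta/gamma defaults are placeholders; the paper tunes
    them per language."""
    if matches == 0:
        return 0.0
    precision = matches / hyp_len
    recall = matches / ref_len
    # Parametrized harmonic mean of P and R
    fmean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    # Fragmentation: fewer, longer chunks -> smaller penalty
    frag = chunks / matches
    penalty = gamma * frag ** beta
    return (1 - penalty) * fmean
```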
Objective
• MT evaluation has become integral to the development of SMT systems, which can be tuned directly towards the evaluation metrics
• To be tuned against, a metric should be fast and easy to compute
• In this work, we seek to improve already established metrics using simple techniques that are not expensive to compute
Computing Ranking Correlation
• Convert binary judgements into full rankings:
  • Build a directed graph with nodes representing individual hypotheses and edges representing binary judgements
  • Topologically sort the graph
• For one source sentence with N hypotheses, compute the Spearman correlation between the metric's ranking and the human ranking
• Average across all source sentences
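A minimal sketch of the ranking-construction step, assuming the binary judgements are consistent (acyclic) so a topological order exists; graphlib is in the Python standard library (3.9+).

```python
from graphlib import TopologicalSorter

def full_ranking(binary_judgements):
    """binary_judgements: iterable of (better, worse) hypothesis ids.
    Returns hypothesis ids ordered best-to-worst, assuming the
    judgements form an acyclic graph."""
    graph = {}  # node -> set of nodes judged better than it
    for better, worse in binary_judgements:
        graph.setdefault(worse, set()).add(better)
        graph.setdefault(better, set())
    return list(TopologicalSorter(graph).static_order())

# Example: A beats B, B beats C  ->  ['A', 'B', 'C']
print(full_ranking([("A", "B"), ("B", "C")]))
```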
Results
3-fold cross-validation results on the WMT 2007 data (average Spearman correlation):

  Language   Original   Re-Tuned
  English    0.3813     0.4020
  German     0.2166     0.2838
  French     0.2992     0.3640
  Spanish    0.2021     0.2186

Results on the WMT 2008 data (proportion of correct binary judgements):

  Language   Original   Re-Tuned
  English    0.5120     0.5120
  German     0.4250     0.2600
  French     0.6250     0.7120
  Spanish    0.2950     0.2700
Flexible Matching for BLEU
• The flexible matching in METEOR can be used to extend any metric that needs word-to-word matching:
  • Compute the alignment between reference and hypothesis using the Meteor matcher
  • Re-write the reference by replacing matched words with the words from the hypothesis
  • Compute BLEU with the new references
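A sketch of the reference-rewriting idea, assuming the alignment is available as (hypothesis index, reference index) pairs from a Meteor-style matcher, and using NLTK's sentence_bleu as a stand-in BLEU implementation (the paper's exact BLEU setup may differ).

```python
from nltk.translate.bleu_score import sentence_bleu

def rewrite_reference(reference, hypothesis, alignment):
    """Replace each matched reference word with its hypothesis
    counterpart, so BLEU's exact n-gram matching can credit stem
    and synonym matches found by the matcher."""
    new_ref = list(reference)
    for hyp_i, ref_j in alignment:
        new_ref[ref_j] = hypothesis[hyp_i]
    return new_ref

def m_bleu(reference, hypothesis, alignment):
    """BLEU computed against the rewritten reference."""
    return sentence_bleu([rewrite_reference(reference, hypothesis, alignment)],
                         hypothesis)
```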
Results
Average BLEU and M-BLEU scores on the WMT 2008 data:

  Language   BLEU     M-BLEU
  English    0.2111   0.2860
  German     0.1383   0.1900
  French     0.2004   0.2548
  Spanish    0.2288   0.2792
Parameter Tuning
• The 3 free parameters in the metric (α, β, γ) are tuned to obtain maximum correlation with human judgements
• Since the ranges of the parameters are bounded, exhaustive search is used
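A sketch of the exhaustive search, reusing meteor_score from the earlier sketch; the grid ranges and step sizes here are illustrative assumptions, and `objective` stands in for whatever correlation with human judgements is being maximized.

```python
import itertools

def tune(segments, objective, step=0.05):
    """segments: list of (matches, hyp_len, ref_len, chunks) tuples.
    objective: callable mapping a list of segment scores to a number
    (e.g. correlation with human rankings)."""
    # Illustrative bounded grids; the paper's actual ranges may differ.
    alphas = [i * step for i in range(int(1 / step) + 1)]   # 0.0 .. 1.0
    betas = [i * 0.25 for i in range(17)]                   # 0.0 .. 4.0
    gammas = [i * step for i in range(int(1 / step) + 1)]   # 0.0 .. 1.0
    best_corr, best_params = float("-inf"), None
    for a, b, g in itertools.product(alphas, betas, gammas):
        scores = [meteor_score(m, hl, rl, c, a, b, g)
                  for m, hl, rl, c in segments]
        corr = objective(scores)
        if corr > best_corr:
            best_corr, best_params = corr, (a, b, g)
    return best_params, best_corr
```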
• No consistent gains across languages were seen in segment-level correlations on the WMT 2007 data
• Similar mixed patterns were seen in the WMT 2008 data as well (as reported in [Callison-Burch et al., 2008])