Machine Translation
Minimum Error Rate Training

Stephan Vogel
Spring Semester 2011


Overview

Optimization approaches
  Simplex
  MER
Avoiding local minima
Additional considerations
  Tuning towards different metrics
  Tuning on different development sets


Tuning the SMT System

We use different models in an SMT system
  Models have simplifications
  Trained on different amounts of data
=> Models have different levels of reliability, and scores have different ranges
=> Give different weights to different models:
  Q = c_1 Q_1 + c_2 Q_2 + ... + c_n Q_n

Find optimal scaling factors (feature weights) c_1 ... c_n

Optimal means: highest score for the chosen evaluation metric M,
i.e. find (c_1, ..., c_n) such that M(argmin_e {Q(e, f)}) is high

Metric M is our objective function


Problems

The surface of the objective function is not nice
  Not convex -> local minima (actually, many local minima)
  Not differentiable -> gradient descent methods not readily applicable

There may be dangerous areas (‘boundary cliffs’)
  Small change -> big effect
  Example:
    Tune on a dev set with short reference translations
    Optimization leads towards short translations
    New test set has long reference translations
    Translations are now too short -> length penalty


Brute Force Approach – Manual Tuning

Decode with different scaling factors
  Get a feeling for the range of good values
  Get a feeling for the importance of models
    LM is typically most important
    Sentence length (word count feature) to balance the shortening effect of the LM
    Word reordering is more or less effective depending on the language
Narrow down the range in which scaling factors are tested
Essentially multi-linear optimization
Works well for a small number of models
Time consuming (CPU-wise) if decoding takes a long time


Automatic Tuning

Many algorithms to find (near) optimal solutions are available
  Simplex
  Powell (line search)
  MIRA (Margin Infused Relaxed Algorithm)
  Specially designed minimum error training (Och 2003)
  Genetic algorithm

Note: models are not improved, only their combination

Note: some parameters change the performance of the decoder, but are not in Q
  Number of alternative translations
  Beam size
  Word reordering restrictions


Automatic Tuning on N-best List

Optimization algorithms need many iterations: too expensive to run full translations each time

=> Use n-best lists
  E.g. 1000 translations for each of 500 source sentences
  Changing the scaling factors re-ranks the n-best lists
  Evaluate the new 1-best translations
  Apply any of the standard optimization techniques

Advantage: much faster
  Can pre-calculate the counts (e.g. n-gram matches) for each translation to speed up evaluation
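
The re-ranking step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: scores are treated as costs (lower is better, matching the argmin above), each hypothesis carries one pre-computed error count instead of real n-gram statistics, and the names `rerank_1best` and `corpus_error` as well as the toy numbers are made up for this example.

```python
from typing import List, Sequence, Tuple

# One n-best entry: (feature scores Q_1..Q_n as costs, pre-computed error count).
Hyp = Tuple[Sequence[float], float]

def rerank_1best(nbest: List[Hyp], weights: Sequence[float]) -> Hyp:
    """Return the hypothesis with the lowest weighted cost Q = sum_k c_k * Q_k."""
    return min(nbest, key=lambda h: sum(c * q for c, q in zip(weights, h[0])))

def corpus_error(nbest_lists: List[List[Hyp]], weights: Sequence[float]) -> float:
    """Objective function: summed error of the 1-best hypotheses after re-ranking."""
    return sum(rerank_1best(nbest, weights)[1] for nbest in nbest_lists)

# Toy data (invented): two source sentences, three hypotheses each, two features.
nbest_lists = [
    [((2.0, 5.0), 8.0), ((3.0, 3.0), 5.0), ((5.0, 1.0), 4.0)],
    [((1.0, 4.0), 2.0), ((4.0, 4.5), 0.0), ((3.0, 1.0), 5.0)],
]

print(corpus_error(nbest_lists, (1.0, 0.2)))  # 10.0
print(corpus_error(nbest_lists, (0.2, 1.0)))  # 9.0
```

No decoding happens here: a new weight vector only changes which stored hypothesis is ranked first, which is why the optimizer can afford many evaluations.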


Simplex (Nelder-Mead)

Start with n+1 random configurations
Get the 1-best translation for each configuration -> objective function
Sort points x_k according to the objective function:
  f(x_1) <= f(x_2) <= ... <= f(x_{n+1})
Calculate x_0 as the center of gravity (centroid) of x_1 ... x_n
Replace the worst point x_{n+1} with a point reflected through the centroid:
  x_r = x_0 + r * (x_0 - x_{n+1})


Demo

Obviously, we need to change the size of the simplex to enforce convergence

Also, want to adjust the step size
  If the new point is the best point: increase step size
  If the new point is worse than x_1 ... x_n: decrease step size



Expansion and Contraction

Reflection:
  Calculate x_r = x_0 + r * (x_0 - x_{n+1})
  If f(x_1) <= f(x_r) < f(x_n): replace x_{n+1} with x_r; next iteration

Expansion:
  If the reflected point is better than the best, i.e. f(x_r) < f(x_1):
    Calculate x_e = x_0 + e * (x_0 - x_{n+1})
    If f(x_e) < f(x_r): replace x_{n+1} with x_e, else replace x_{n+1} with x_r
    Next iteration
  Else: contract

Contraction:
  Reflected point is no better than the second worst, i.e. f(x_r) >= f(x_n)
  Calculate x_c = x_{n+1} + c * (x_0 - x_{n+1})
  If f(x_c) <= f(x_{n+1}): replace x_{n+1} with x_c, else shrink

Shrinking:
  For all x_k, k = 2 ... n+1: x_k = x_1 + s * (x_k - x_1)
  Next iteration
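
The four moves can be put together as follows. This is a minimal sketch that follows the update rules as written on the slide, not a tuned implementation: the default coefficients r = 1, e = 2, c = 0.5, s = 0.5 are common textbook choices and are an assumption here, the function name `nelder_mead` is made up, and the demo minimizes a smooth toy function rather than an MT metric.

```python
from typing import Callable, List

Point = List[float]

def nelder_mead(f: Callable[[Point], float], simplex: List[Point],
                r: float = 1.0, e: float = 2.0, c: float = 0.5, s: float = 0.5,
                iterations: int = 300) -> Point:
    """Minimize f starting from n+1 points; return the best point found."""
    pts = [list(p) for p in simplex]
    for _ in range(iterations):
        pts.sort(key=f)                          # f(x_1) <= ... <= f(x_{n+1})
        best, second_worst, worst = pts[0], pts[-2], pts[-1]
        # Centroid x_0 of all points except the worst one.
        x0 = [sum(p[i] for p in pts[:-1]) / (len(pts) - 1) for i in range(len(best))]

        def along(start: Point, scale: float) -> Point:
            """start + scale * (x_0 - x_{n+1})"""
            return [a + scale * (m - w) for a, m, w in zip(start, x0, worst)]

        xr = along(x0, r)                        # reflection
        if f(xr) < f(best):                      # expansion
            xe = along(x0, e)
            pts[-1] = xe if f(xe) < f(xr) else xr
        elif f(xr) < f(second_worst):            # plain reflection accepted
            pts[-1] = xr
        else:                                    # contraction
            xc = along(worst, c)
            if f(xc) <= f(worst):
                pts[-1] = xc
            else:                                # shrink towards the best point
                pts = [best] + [[b + s * (v - b) for b, v in zip(best, p)]
                                for p in pts[1:]]
    return min(pts, key=f)

# Toy objective with its minimum at (1, -2).
def bowl(p: Point) -> float:
    return (p[0] - 1.0) ** 2 + (p[1] + 2.0) ** 2

solution = nelder_mead(bowl, [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print([round(v, 3) for v in solution])
```

For tuning, f would be the metric on the re-ranked n-best lists and each point a vector of scaling factors; since that function is piecewise constant, convergence there is much less well behaved than on this toy bowl.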


Changing the Simplex

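(Figure: the simplex before and after each of the four moves, reflection, expansion, contraction, and shrinking, drawn with the best point x_1, the worst point x_{n+1}, and the centroid x_0.)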


Powell Line Search

Select directions in search space, then

  Loop until convergence
    Loop over directions d
      Perform line search for direction d until convergence

Many variants
  Select directions
    Easiest is to use the model scores
    Or combine multiple scores
  Step size in line search

MER (Och 2003) is line search along models with smart selection of steps
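
The nested loop can be sketched as below. This is a simplified stand-in, not Powell's full method: the direction set stays fixed (one direction per model weight, the "easiest" choice above) and the line search is a plain grid whose step is halved when a round stops improving. The names `line_search` and `direction_search`, the grid size, and the toy objective are assumptions for illustration.

```python
from typing import Callable, List

Point = List[float]

def line_search(f: Callable[[Point], float], x: Point, d: Point,
                step: float, n_steps: int = 10) -> Point:
    """Try x + t*d for t on a symmetric grid; keep x if nothing is better."""
    best = x
    for i in range(-n_steps, n_steps + 1):
        cand = [xi + i * step * di for xi, di in zip(x, d)]
        if f(cand) < f(best):
            best = cand
    return best

def direction_search(f: Callable[[Point], float], x: Point,
                     directions: List[Point], step: float = 0.5,
                     tol: float = 1e-6, max_rounds: int = 100) -> Point:
    """Loop over all directions until a round no longer improves f."""
    for _ in range(max_rounds):
        before = f(x)
        for d in directions:
            x = line_search(f, x, d, step)
        if before - f(x) < tol:
            if step < tol:
                break                 # converged at the finest step size
            step /= 2                 # refine the grid and try again
    return x

# Toy objective with its minimum at (1.3, -0.7); directions are the two axes.
def bowl(p: Point) -> float:
    return (p[0] - 1.3) ** 2 + (p[1] + 0.7) ** 2

found = direction_search(bowl, [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
print([round(v, 3) for v in found])
```

The fixed grid is the weak part of this sketch: it evaluates many points that cannot change the 1-best translation, which is what the exact step selection in MER avoids.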


Minimum Error Training

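For each hypothesis we have
  Q = sum_k c_k * Q_k

Select one model k:
  Q = c_k * Q_k + sum_{n != k} c_n * Q_n = c_k * Q_k + Q_Rest

As a function of c_k, the total model score of a hypothesis is a straight line: Q_Rest is the intercept, and the individual model score Q_k gives the slope

(Figure: total model score plotted against c_k for one hypothesis with metric score WER = 8.)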



Source sentence 1
  Depending on the scaling factor c_k, different hyps are in 1-best position
  Set c_k so that the metric-best hyp is also the model-best

(Figure: model score against c_k for h11 (WER = 8), h12 (WER = 5), and h13 (WER = 4). Moving along c_k, the best hyp changes from h11 to h12 to h13, with WER 8, 5, 4.)



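Select the minimum number of evaluation points
  Calculate the intersection points
  Keep an intersection point only if the two hyps are minimum at that point
  Choose evaluation points between the intersection points

(Figure: model score against c_k for h11 (WER = 8), h12 (WER = 5), and h13 (WER = 4). Moving along c_k, the best hyp changes from h11 to h12 to h13, with WER 8, 5, 4.)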



Source sentence 1, now with different error scores
  Optimization would find a different c_k

=> Different metrics lead to different scaling factors

(Figure: model score against c_k for h11 (WER = 8), h12 (WER = 2), and h13 (WER = 4). Moving along c_k, the best hyp changes from h11 to h12 to h13, with WER 8, 2, 4.)



Sentence 2
  Best c_k is in a different range
  No matter which c_k, h22 would never be 1-best

(Figure: model score against c_k for h21 (WER = 2), h22 (WER = 0), and h23 (WER = 5). Moving along c_k, the best hyp changes from h21 to h23, with WER 2, 5.)



Multiple sentences

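(Figure: the lines of both sentences over the same c_k axis: h11 (WER = 8), h12 (WER = 5), h13 (WER = 4) for sentence 1 and h21 (WER = 2), h22 (WER = 0), h23 (WER = 5) for sentence 2. The best hyp changes from h11 to h12 to h13 for sentence 1 and from h21 to h23 for sentence 2. Summed over both sentences, the WER in the four resulting c_k intervals is 10, 7, 10, 9.)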
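
The line search over c_k for several sentences can be sketched as follows. The WER values are the ones from the slides; the slopes (Q_k) and intercepts (Q_Rest) are invented so that the 1-best order along c_k matches the figures, and scores are treated as costs, so the model-best hypothesis is the lowest line. The brute-force pairwise intersection test stands in for the more efficient envelope computation of Och (2003); all function names are made up for this example.

```python
from itertools import combinations
from typing import List, Tuple

# One hypothesis as a line in c_k: (slope Q_k, intercept Q_Rest, error count).
Line = Tuple[float, float, int]

def best_at(lines: List[Line], c: float) -> Line:
    """Model-best hypothesis at c: lowest total cost Q_k * c + Q_Rest."""
    return min(lines, key=lambda l: l[0] * c + l[1])

def breakpoints(lines: List[Line]) -> List[float]:
    """Intersection points at which both crossing hypotheses are minimal."""
    kept = []
    for a, b in combinations(lines, 2):
        if a[0] == b[0]:
            continue                              # parallel lines never cross
        c = (b[1] - a[1]) / (a[0] - b[0])
        cost = a[0] * c + a[1]
        if all(cost <= l[0] * c + l[1] + 1e-9 for l in lines):
            kept.append(c)
    return kept

def error_profile(sentences: List[List[Line]]) -> List[Tuple[float, int]]:
    """Total error at one evaluation point per interval between breakpoints."""
    cuts = sorted({c for s in sentences for c in breakpoints(s)})
    if not cuts:
        points = [0.0]
    else:
        inner = [(a + b) / 2 for a, b in zip(cuts, cuts[1:])]
        points = [cuts[0] - 1.0] + inner + [cuts[-1] + 1.0]
    return [(c, sum(best_at(s, c)[2] for s in sentences)) for c in points]

sentence1 = [(3.0, 0.0, 8), (2.0, 1.0, 5), (1.0, 4.0, 4)]   # h11, h12, h13
sentence2 = [(2.0, 0.0, 2), (1.5, 2.0, 0), (1.0, 2.0, 5)]   # h21, h22, h23

profile = error_profile([sentence1, sentence2])
print(profile)                                # [(0.0, 10), (1.5, 7), (2.5, 10), (4.0, 9)]
print(min(profile, key=lambda p: p[1]))       # (1.5, 7)
```

With these made-up lines, h22 never lies on the lower envelope, so its WER of 0 is unreachable by changing c_k alone, and the summed errors per interval reproduce the 10, 7, 10, 9 of the figure.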


Iterate Decoding - Optimization

N-best list is a (very restricted) substitute for the search space
  With updated feature weights we may have generated other (better) translations
  Some of the hyps in the n-best list would have been pruned

Iterate
  Re-translate with new feature weights
  Merge new translations with old translations (increases stability)
  Run optimizer over the larger n-best lists
  Repeat until no new translations, or improvement < epsilon, or just k times (typically 5-10 iterations)
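
The outer loop might look like the sketch below. Everything concrete in it is an assumption: `tune` takes the decoder and the optimizer as plain callables, the toy "decoder" just returns its 2 best of 3 fixed candidates (a crude model of pruning), the toy "optimizer" picks the best of three fixed weight vectors, and the "improvement < epsilon" stopping rule is left out for brevity.

```python
from typing import Callable, List, Set, Tuple

Hyp = Tuple[Tuple[float, ...], float]      # (feature scores Q_1..Q_n, error count)
Weights = Tuple[float, ...]

def tune(decode_nbest: Callable[[Weights], List[List[Hyp]]],
         optimize: Callable[[List[List[Hyp]], Weights], Weights],
         weights: Weights, max_iters: int = 10) -> Weights:
    """Alternate decoding and optimization until no new translations appear."""
    pools: List[Set[Hyp]] = []
    for _ in range(max_iters):
        new_lists = decode_nbest(weights)            # re-translate with current weights
        if not pools:
            pools = [set() for _ in new_lists]
        size_before = sum(len(p) for p in pools)
        for pool, nbest in zip(pools, new_lists):    # merge new with old translations
            pool.update(nbest)
        if sum(len(p) for p in pools) == size_before:
            break                                    # no new translations
        weights = optimize([sorted(p) for p in pools], weights)
    return weights

# Toy stand-ins (hypothetical), one source sentence with three possible translations.
SEARCH_SPACE: List[Hyp] = [((2.0, 5.0), 8.0), ((3.0, 3.0), 5.0), ((5.0, 1.0), 4.0)]

def cost(hyp: Hyp, w: Weights) -> float:
    return sum(c * q for c, q in zip(w, hyp[0]))

def toy_decode(w: Weights) -> List[List[Hyp]]:
    return [sorted(SEARCH_SPACE, key=lambda h: cost(h, w))[:2]]

def toy_optimize(nbest_lists: List[List[Hyp]], w: Weights) -> Weights:
    grid = [(1.0, 0.2), (1.0, 0.9), (0.2, 1.0)]
    def error(cw: Weights) -> float:
        return sum(min(nb, key=lambda h: cost(h, cw))[1] for nb in nbest_lists)
    return min(grid, key=error)

tuned = tune(toy_decode, toy_optimize, (1.0, 0.2))
print(tuned)   # (0.2, 1.0)
```

In this toy run the hypothesis with the lowest error only enters the pool in the second decoding pass, which is the reason for iterating instead of optimizing once on the first n-best list.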


Avoiding Local Minima

Optimization can get stuck in a local minimum

Remedies
  Fiddle around with the parameters of your optimization algorithm
  Larger n-best list -> more evaluation points
  Combine with a Simulated Annealing type approach (Smith & Eisner, 2007)
  Restart multiple times
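
Restarting is easy to wrap around any local optimizer. In this sketch a greedy hill climber stands in for Simplex or Powell, and the objective is a one-dimensional toy function with two minima; the function names, the search interval, and the fixed random seed are all assumptions for illustration.

```python
import random
from typing import Callable

def hill_climb(f: Callable[[float], float], x: float,
               step: float = 0.1, iters: int = 200) -> float:
    """Greedy local search: move to the better neighbor while it improves f."""
    for _ in range(iters):
        nxt = min((x - step, x + step), key=f)
        if f(nxt) >= f(x):
            break
        x = nxt
    return x

def with_restarts(f: Callable[[float], float],
                  optimize: Callable[[Callable[[float], float], float], float],
                  n_restarts: int, low: float, high: float, seed: int = 0) -> float:
    """Run the optimizer from random start points and keep the best result."""
    rng = random.Random(seed)
    starts = [rng.uniform(low, high) for _ in range(n_restarts)]
    return min((optimize(f, x0) for x0 in starts), key=f)

# Toy objective: local minimum near x = 2, better (global) minimum near x = -2.
def bumpy(x: float) -> float:
    return (x * x - 4.0) ** 2 + x

single = hill_climb(bumpy, 1.0)                        # stuck in the basin near x = 2
restarted = with_restarts(bumpy, hill_climb, 20, -3.0, 3.0)
print(round(single, 1), round(restarted, 1))
```

A single run started at x = 1.0 ends in the worse minimum, while the best of 20 random starts reaches the basin around x = -2.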


Random Restarts

Comparison Simplex/Powell (Alok, unpublished)
Comparison Simplex/ext. Simplex/MER (Bing Zhao, unpubl.)

Observations:
  Alok: Simplex ‘jumpier’ than Powell
  Bing: Simplex better than MER
  Both: you need many restarts


Optimizing NOT Towards References

Ideally, we want system output identical to the reference translations

But there is no guarantee that the system can generate the reference translations (under realistic conditions)
  E.g. we restrict the reordering window
  We have unknown words
  Reference translations may have words unknown to the system

Instead of forcing the decoder towards the reference translations, optimize towards the best translations generated by the system
  Find hypotheses with best metric score
  Use those as pseudo references
  Optimize towards the pseudo references
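
A small sketch of the pseudo-reference idea, assuming word-level edit distance as the metric and an n-best list of plain strings. The sentences and the helper names `word_errors`, `pseudo_reference`, and `rescore` are invented for this example.

```python
from typing import List, Tuple

def word_errors(hyp: str, ref: str) -> int:
    """Word-level edit distance (substitutions, insertions, deletions)."""
    h, r = hyp.split(), ref.split()
    prev = list(range(len(r) + 1))
    for i, hw in enumerate(h, 1):
        cur = [i]
        for j, rw in enumerate(r, 1):
            cur.append(min(prev[j] + 1,               # drop hypothesis word
                           cur[j - 1] + 1,            # insert reference word
                           prev[j - 1] + (hw != rw))) # match or substitute
        prev = cur
    return prev[-1]

def pseudo_reference(nbest: List[str], reference: str) -> str:
    """The hypothesis the system can actually produce with the best metric score."""
    return min(nbest, key=lambda hyp: word_errors(hyp, reference))

def rescore(nbest: List[str], pseudo_ref: str) -> List[Tuple[str, int]]:
    """Error counts relative to the pseudo reference, used as the tuning target."""
    return [(hyp, word_errors(hyp, pseudo_ref)) for hyp in nbest]

reference = "the cat sat on the mat"
nbest = ["a cat sat on mat", "the cat is on the mat", "cat on mat"]

pseudo = pseudo_reference(nbest, reference)
print(pseudo)                    # the cat is on the mat
print(rescore(nbest, pseudo))
```

The tuning target is now reachable by construction: at least one hypothesis in every n-best list has zero errors against its pseudo reference.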


Optimizing Towards Different Metrics

Automatic metrics have different characteristics
  Optimizing towards one does not mean that other metric scores will also go up
  Esp. different metrics prefer shorter or longer translations
    Typically: TER < BLEU < METEOR (< means ‘shorter translation’)

Mauser et al (2007) on Ch-En NIST 2005 test set
  Reasonably well behaved
  Resulting length of translation differs by more than 15%


Generalization to other Test Sets

Optimize on one set, test on multiple other sets
  Again Mauser et al, Ch-En
  Shown is behavior over Simplex optimization iterations
  Nice, nearly parallel development of metric scores

However, we had also observed brittle behavior
  Esp. when the ratio src_length / ref_length is very different between dev and eval test sets


Large Weight = Important Feature?

Assume we have c_LM = 1.0, c_TM = 0.55, c_WC = 3.2

Which feature is most important?

Cannot say!
  We want to re-rank the n-best lists
  Feature weights scale feature values such that they can compete

Example:
  Variation in LM and TM is larger than for WC
  Need a large weight for WC to make small differences effective

      Q_LM  Q_TM  Q_WC    Q
  H1    22    83     7  112
  H2    29    77     8  116
  H3    26    85     9  120

To know if a feature is important, remove it and look at the drop in metric score
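
The effect of the weights on the example can be checked with a few lines. The feature values and weights are taken from the slide; measuring "variation" as max minus min over the three hypotheses is an assumption made for this sketch.

```python
weights = {"LM": 1.0, "TM": 0.55, "WC": 3.2}
scores = {
    "H1": {"LM": 22, "TM": 83, "WC": 7},
    "H2": {"LM": 29, "TM": 77, "WC": 8},
    "H3": {"LM": 26, "TM": 85, "WC": 9},
}

def spread(feature: str) -> int:
    """Raw variation of one feature across the hypotheses (max - min)."""
    values = [hyp[feature] for hyp in scores.values()]
    return max(values) - min(values)

def weighted_spread(feature: str) -> float:
    """Variation after scaling: how much the feature can move the ranking."""
    return weights[feature] * spread(feature)

for name in weights:
    print(name, spread(name), round(weighted_spread(name), 2))
# LM 7 7.0
# TM 8 4.4
# WC 2 6.4
```

The raw word-count differences are the smallest (2 versus 7 and 8), but after scaling all three features vary by a comparable amount, so the large weight for WC says nothing about its importance.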


Open Issues

Should not all optimizers get the same results, if done right?
  The models are the same, it’s just finding the right mix
  If local minima can be avoided, then similarly good optima should be found

How to stay safe
  Avoid good optima close to ‘cliffs’
  Different configurations give very similar metric scores, pick one which is more stable

One hat fits all?
  Why one set of feature weights? How about different sets for
    Good/bad translations (tuning on tail: mixed results so far)
    Short/long sentences
    Begin/middle/end of sentence
    ...


Summary

Optimizing the system by modifying scaling factors (feature weights)

Different optimization approaches can be used
  Simplex, Powell most common
  MERT (Och) is similar to Powell, with pre-calculation of grid points

Many local optima, avoid getting stuck early
  Most effective: many restarts

Generalization
  To unseen test data: mostly ok, sometimes selection of dev set has big impact (length penalty!)
  To different metrics: reasonably stable (metrics are reasonably correlated in most cases)

Still open questions => more research needed
