Stephan Vogel - Machine Translation: Minimum Error Rate Training (Spring Semester 2011)


Stephan Vogel, Spring Semester 2011

Machine Translation

Minimum Error Rate Training


Overview

- Optimization approaches
  - Simplex
  - MER
- Avoiding local minima
- Additional considerations
  - Tuning towards different metrics
  - Tuning on different development sets


Tuning the SMT System

- We use different models in an SMT system
  - Models have simplifications
  - Trained on different amounts of data
- => Models have different levels of reliability, and scores have different ranges
- => Give different weights to different models: $Q = c_1 Q_1 + c_2 Q_2 + \dots + c_n Q_n$
- Find optimal scaling factors (feature weights) $c_1 \dots c_n$
- Optimal means: highest score for the chosen evaluation metric $M$, i.e. find $(c_1, \dots, c_n)$ such that $M(\arg\min_e Q(e, f))$ is high

Metric M is our objective function
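
The weighted combination and the argmin over hypotheses can be sketched in a few lines. This is only an illustration: the function names and the two-model cost values are made up, and scores are treated as costs (lower is better), in line with the argmin above.

```python
from typing import Sequence


def total_cost(weights: Sequence[float], scores: Sequence[float]) -> float:
    """Q = c1*Q1 + ... + cn*Qn for one hypothesis."""
    return sum(c * q for c, q in zip(weights, scores))


def model_best(weights: Sequence[float], hypotheses: Sequence[Sequence[float]]) -> int:
    """Index of the hypothesis with the lowest combined cost (the argmin)."""
    return min(range(len(hypotheses)), key=lambda i: total_cost(weights, hypotheses[i]))


# Hypothetical model costs (Q1, Q2) for two candidate translations.
hypotheses = [(2.0, 5.0), (4.0, 1.0)]
print(model_best((1.0, 1.0), hypotheses))  # combined costs 7.0 vs 5.0
print(model_best((1.0, 0.1), hypotheses))  # combined costs 2.5 vs 4.1
```

Changing only the scaling factors changes which hypothesis wins, which is the whole lever that tuning has.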


Problems

- The surface of the objective function is not nice
  - Not convex -> local minima (actually, many local minima)
  - Not differentiable -> gradient descent methods not readily applicable
- There may be dangerous areas ('boundary cliffs')
- Example:
  - Tune on dev set with short reference translations
  - Optimization leads towards short translations
  - New test set has long reference translations
  - Translations are now too short -> length penalty

[Figure labels: small change, big effect]


Brute Force Approach – Manual Tuning

- Decode with different scaling factors
  - Get a feeling for the range of good values
  - Get a feeling for the importance of models
    - LM is typically most important
    - Sentence length (word count feature) to balance the shortening effect of the LM
    - Word reordering is more or less effective depending on language
- Narrow down the range in which scaling factors are tested
- Essentially multi-linear optimization
- Works well for a small number of models
- Time consuming (CPU wise) if decoding takes a long time


Automatic Tuning

- Many algorithms to find (near) optimal solutions available
  - Simplex
  - Powell (line search)
  - MIRA (Margin Infused Relaxed Algorithm)
  - Specially designed minimum error training (Och 2003)
  - Genetic algorithm
- Note: models are not improved, only their combination
- Note: some parameters change performance of the decoder, but are not in Q
  - Number of alternative translations
  - Beam size
  - Word reordering restrictions


Automatic Tuning on N-best List

- Optimization algorithms need many iterations: too expensive to run full translations
- => Use n-best lists
  - e.g. for each of 500 source sentences 1000 translations
  - Changing the scaling factors results in re-ranking the n-best lists
  - Evaluate the new 1-best translations
  - Apply any of the standard optimization techniques
- Advantage: much faster
  - Can pre-calculate the counts (e.g. n-gram matches) for each translation to speed up evaluation
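
A minimal sketch of re-ranking n-best lists with pre-calculated counts, assuming each hypothesis stores its model costs plus an error count and reference length computed once in advance. The `Hyp` class, the field names, and all numbers are hypothetical, and a simple error rate stands in for whatever metric is actually used.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Hyp:
    features: Sequence[float]  # model costs Q_1 .. Q_n
    errors: int                # pre-calculated error count against the reference
    ref_len: int               # reference length, also pre-calculated


def rerank_error(weights: Sequence[float], nbest_lists: List[List[Hyp]]) -> float:
    """Corpus-level error rate of the 1-best hypotheses under the given weights."""
    total_errors = 0
    total_len = 0
    for nbest in nbest_lists:
        best = min(nbest, key=lambda h: sum(c * q for c, q in zip(weights, h.features)))
        total_errors += best.errors
        total_len += best.ref_len
    return total_errors / total_len


# Two source sentences with two hypotheses each (made-up values).
nbest_lists = [
    [Hyp((2.0, 5.0), 8, 10), Hyp((4.0, 1.0), 4, 10)],
    [Hyp((1.0, 3.0), 2, 5), Hyp((3.0, 0.5), 0, 5)],
]
print(rerank_error((1.0, 1.0), nbest_lists))
print(rerank_error((1.0, 0.1), nbest_lists))
```

No decoding happens inside `rerank_error`, so an optimizer can call it thousands of times cheaply.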


Simplex (Nelder-Mead)

- Start with n+1 random configurations
- Get the 1-best translation for each configuration -> objective function
- Sort points $x_k$ according to the objective function: $f(x_1) < f(x_2) < \dots < f(x_{n+1})$
- Calculate $x_0$ as the center of gravity of $x_1 \dots x_n$
- Replace the worst point with a point reflected through the centroid: $x_r = x_0 + r (x_0 - x_{n+1})$


Demo

- Obviously, we need to change the size of the simplex to enforce convergence
- Also, want to adjust the step size
  - If the new point is the best point: increase the step size
  - If the new point is worse than $x_1 \dots x_n$: decrease the step size



Expansion and Contraction

- Reflection:
  - Calculate $x_r = x_0 + r (x_0 - x_{n+1})$
  - If $f(x_1) \le f(x_r) < f(x_n)$: replace $x_{n+1}$ with $x_r$; next iteration
- Expansion: if the reflected point is better than the best, i.e. $f(x_r) < f(x_1)$
  - Calculate $x_e = x_0 + e (x_0 - x_{n+1})$
  - If $f(x_e) < f(x_r)$ then replace $x_{n+1}$ with $x_e$, else replace $x_{n+1}$ with $x_r$
  - Next iteration
  - Otherwise (reflected point not better than the best): contract
- Contraction: reflected point has $f(x_r) \ge f(x_n)$
  - Calculate $x_c = x_{n+1} + c (x_0 - x_{n+1})$
  - If $f(x_c) \le f(x_{n+1})$ then replace $x_{n+1}$ with $x_c$, else shrink
- Shrinking:
  - For all $x_k$, $k = 2 \dots n+1$: $x_k = x_1 + s (x_k - x_1)$
  - Next iteration
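
The four steps can be put together as follows. This is a sketch that follows the update rules as written on this slide, not a tuned implementation; the coefficient defaults (r = 1, e = 2, c = 0.5, s = 0.5) are common choices assumed here, and the quadratic test function is a stand-in for a real metric.

```python
from typing import Callable, List

Point = List[float]


def nelder_mead(f: Callable[[Point], float], simplex: List[Point],
                r: float = 1.0, e: float = 2.0, c: float = 0.5, s: float = 0.5,
                iterations: int = 200) -> Point:
    """Minimize f, starting from n+1 points in n dimensions."""
    pts = [list(p) for p in simplex]
    n = len(pts) - 1
    for _ in range(iterations):
        pts.sort(key=f)  # pts[0] is x_1 (best), pts[-1] is x_{n+1} (worst)
        best, second_worst, worst = pts[0], pts[-2], pts[-1]
        # Centroid x_0 of all points except the worst one.
        x0 = [sum(p[i] for p in pts[:-1]) / n for i in range(n)]

        def along(base: Point, factor: float) -> Point:
            # base + factor * (x_0 - x_{n+1})
            return [b + factor * (a - w) for b, a, w in zip(base, x0, worst)]

        xr = along(x0, r)  # reflection
        if f(best) <= f(xr) < f(second_worst):
            pts[-1] = xr
        elif f(xr) < f(best):  # expansion
            xe = along(x0, e)
            pts[-1] = xe if f(xe) < f(xr) else xr
        else:  # contraction, i.e. f(xr) >= f(x_n)
            xc = along(worst, c)
            if f(xc) <= f(worst):
                pts[-1] = xc
            else:  # shrink every point towards the best one
                pts = [best] + [[b + s * (x - b) for b, x in zip(best, p)]
                                for p in pts[1:]]
    return min(pts, key=f)


def bowl(x: Point) -> float:
    # Smooth toy objective with its minimum at (1, -2).
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2


result = nelder_mead(bowl, [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
print(result)
```

In tuning, `f` would re-rank the n-best lists with the candidate weights and return the error of the new 1-best translations.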


Changing the Simplex

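[Figure: the four simplex operations (reflection, expansion, contraction, shrinking), each drawn with the worst point $x_{n+1}$ and the centroid $x_0$; shrinking moves the points towards the best point $x_1$]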


Powell Line Search

- Select directions in search space, then
  - Loop until convergence
    - Loop over directions d
      - Perform line search for direction d until convergence
- Many variants
  - Select directions
    - Easiest is to use the model scores
    - Or combine multiple scores
  - Step size in line search
- MER (Och 2003) is line search along models with smart selection of steps
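
The nested loops can be sketched as below. This is a deliberately simplified variant: the directions are fixed (one per model, i.e. the coordinate axes), the line search just tries a fixed grid of step sizes, and the direction update of full Powell is left out. All names and the toy objective are assumptions for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Point = List[float]


def line_search(f: Callable[[Point], float], x: Point, d: Sequence[float],
                steps: Sequence[float]) -> Tuple[Point, float]:
    """Try x + t*d for every step size t and keep the best point found."""
    best_x, best_val = x, f(x)
    for t in steps:
        cand = [xi + t * di for xi, di in zip(x, d)]
        val = f(cand)
        if val < best_val:
            best_x, best_val = cand, val
    return best_x, best_val


def direction_search(f: Callable[[Point], float], x: Point,
                     directions: Sequence[Sequence[float]], steps: Sequence[float],
                     max_rounds: int = 50, tol: float = 1e-9) -> Tuple[Point, float]:
    """Loop over all directions until a full round no longer improves f."""
    val = f(x)
    for _ in range(max_rounds):
        previous = val
        for d in directions:
            x, val = line_search(f, x, d, steps)
        if previous - val < tol:
            break
    return x, val


def bowl(x: Point) -> float:
    # Toy objective with its minimum at (1, -2).
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2


axes = [(1.0, 0.0), (0.0, 1.0)]            # one direction per model
steps = [k * 0.25 for k in range(-8, 9)]   # step sizes from -2.0 to 2.0
point, value = direction_search(bowl, [0.0, 0.0], axes, steps)
print(point, value)
```

The fixed grid is the weak spot of this sketch: it can step over the optimum. MER replaces it with exactly those points where the 1-best hypothesis can change.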


Minimum Error Training

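- For each hypothesis we have $Q = \sum_k c_k Q_k$
- Select one scaling factor $c_k$: $Q = c_k Q_k + \sum_{n \ne k} c_n Q_n = c_k Q_k + Q_{Rest}$

[Figure: total model score of one hypothesis (metric score WER = 8) plotted as a straight line over $c_k$; $Q_{Rest}$ is the offset and the individual model score $Q_k$ gives the slope]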


Minimum Error Training

- Source sentence 1
- Depending on the scaling factor $c_k$, different hyps are in 1-best position
- Set $c_k$ to have the metric-best hyp also being model-best

[Figure: model score over $c_k$ for h11 (WER = 8), h12 (WER = 5) and h13 (WER = 4); along the $c_k$ axis the best hyp changes from h11 to h12 to h13, giving WER 8, 5, 4]


Minimum Error Training

- Select minimum number of evaluation points
  - Calculate intersection point
  - Keep only if hyps are minimum at that point
  - Choose evaluation points between intersection points

[Figure: model score over $c_k$ for h11 (WER = 8), h12 (WER = 5) and h13 (WER = 4); the intersection points split the $c_k$ axis into intervals with best hyp h11, h12, h13 and WER 8, 5, 4]
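
One way to find the relevant intersection points for a single sentence is a sweep over the lines sorted by slope, keeping only hypotheses that are model-best somewhere. This is a sketch under assumptions: scores are costs (lowest is best), the slopes and offsets are invented, and `h14` is a hypothetical extra hypothesis added to show a line that never becomes 1-best. Only the WER values 8, 5, 4 come from the slide.

```python
from typing import Dict, List, Tuple

Line = Tuple[float, float, str]  # (slope Q_k, offset Q_Rest, hypothesis id)


def lower_envelope(lines: List[Line]) -> List[Tuple[float, str]]:
    """Return (interval start on the c_k axis, id) for every hypothesis that
    has the lowest cost somewhere, ordered from left to right."""
    # For c_k -> -infinity the steepest line is lowest; ties prefer the lower offset.
    ordered = sorted(lines, key=lambda l: (-l[0], l[1]))
    hull: List[Tuple[float, float, float, str]] = []  # (start, slope, offset, id)
    for slope, offset, hid in ordered:
        if hull and hull[-1][1] == slope:
            continue  # parallel line with a higher offset never wins
        start = float("-inf")
        while hull:
            s0, m0, b0, _ = hull[-1]
            x = (offset - b0) / (m0 - slope)  # intersection with the previous line
            if x <= s0:
                hull.pop()  # previous line is never the minimum
            else:
                start = x
                break
        hull.append((start, slope, offset, hid))
    return [(start, hid) for start, _, _, hid in hull]


lines = [(3.0, 0.0, "h11"), (2.0, 5.0, "h14"), (1.0, 2.0, "h12"), (0.0, 10.0, "h13")]
wer: Dict[str, int] = {"h11": 8, "h12": 5, "h13": 4, "h14": 1}

segments = lower_envelope(lines)
best_start, best_hyp = min(segments, key=lambda seg: wer[seg[1]])
print(segments)
print(best_hyp, best_start)
```

Any $c_k$ inside the chosen interval gives the same 1-best hypothesis, so one evaluation point per interval is enough.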


Minimum Error Training

- Source sentence 1, now with different error scores
- Optimization would find a different $c_k$
- => Different metrics lead to different scaling factors

[Figure: model score over $c_k$ for h11 (WER = 8), h12 (WER = 2) and h13 (WER = 4); the best hyp changes from h11 to h12 to h13, giving WER 8, 2, 4]


Minimum Error Training

- Sentence 2
- Best $c_k$ in a different range
- No matter which $c_k$, h22 would never be 1-best

[Figure: model score over $c_k$ for h21 (WER = 2), h22 (WER = 0) and h23 (WER = 5); the best hyp changes from h21 to h23, giving WER 2, 5]


Minimum Error Training

- Multiple sentences

[Figure: model score over $c_k$ for sentence 1 (h11: WER = 8, h12: WER = 5, h13: WER = 4) and sentence 2 (h21: WER = 2, h22: WER = 0, h23: WER = 5); summed WER of the 1-best hyps per interval of $c_k$: 10, 7, 10, 9]
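
For several sentences the error counts of the 1-best hypotheses are summed within each interval of $c_k$. The sketch below takes the brute-force route: it collects all pairwise intersection points, then evaluates one point between each pair of neighbours and one beyond each end. The WER values are those of the two example sentences; the slopes and offsets are invented so that the interval sums come out as 10, 7, 10, 9.

```python
from itertools import combinations
from typing import List, Tuple

Hyp = Tuple[float, float, int]  # (slope Q_k, offset Q_Rest, error count)


def best_weight(sentences: List[List[Hyp]]) -> Tuple[int, float]:
    """Return (total errors, c_k) for the best evaluation point."""
    cuts = set()
    for nbest in sentences:
        for (m1, b1, _), (m2, b2, _) in combinations(nbest, 2):
            if m1 != m2:
                cuts.add((b2 - b1) / (m1 - m2))  # where the two lines cross
    ordered = sorted(cuts)
    if ordered:
        candidates = ([ordered[0] - 1.0]
                      + [(a + b) / 2 for a, b in zip(ordered, ordered[1:])]
                      + [ordered[-1] + 1.0])
    else:
        candidates = [0.0]

    def total_errors(ck: float) -> int:
        # Errors of the lowest-cost hypothesis of each sentence, summed.
        return sum(min(nbest, key=lambda h: h[0] * ck + h[1])[2] for nbest in sentences)

    return min((total_errors(ck), ck) for ck in candidates)


sentences = [
    [(3.0, 0.0, 8), (1.0, 2.0, 5), (0.0, 10.0, 4)],  # h11, h12, h13
    [(2.0, 0.0, 2), (1.5, 5.0, 0), (1.0, 4.0, 5)],   # h21, h22, h23
]
errors, ck = best_weight(sentences)
print(errors, ck)
```

The best point falls in the interval where h12 and h21 are both 1-best (5 + 2 = 7 errors), even though h22 with 0 errors exists: it is never model-best for any $c_k$.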


Iterate Decoding - Optimization

- N-best list is a (very restricted) substitute for the search space
  - With updated feature weights we may have generated other (better) translations
  - Some of the hyps in the n-best list would have been pruned
- Iterate
  - Re-translate with new feature weights
  - Merge new translations with old translations (increases stability)
  - Run optimizer over larger n-best lists
  - Repeat until no new translations, or improvement < epsilon, or just k times (typically 5-10 iterations)
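
The outer loop might look like the sketch below. Everything here is a toy stand-in: `decode` simply returns the two lowest-cost entries of a small fixed hypothesis space (imitating pruning), and the "optimizer" picks the best of a handful of candidate weight vectors. Only the decode, merge, optimize, repeat structure is the point.

```python
from typing import List, Sequence, Tuple

Hyp = Tuple[Tuple[float, ...], int]  # (model costs, pre-calculated error count)


def cost(weights: Sequence[float], hyp: Hyp) -> float:
    return sum(c * q for c, q in zip(weights, hyp[0]))


def decode(weights: Sequence[float], space: List[Hyp], n: int = 2) -> List[Hyp]:
    """Toy decoder: the n lowest-cost hypotheses under the current weights."""
    return sorted(space, key=lambda h: cost(weights, h))[:n]


def total_errors(weights: Sequence[float], pools: List[List[Hyp]]) -> int:
    return sum(min(pool, key=lambda h: cost(weights, h))[1] for pool in pools)


def tune(spaces: List[List[Hyp]], weights: Sequence[float],
         candidates: List[Sequence[float]], max_iter: int = 10) -> Sequence[float]:
    pools: List[List[Hyp]] = [[] for _ in spaces]
    for _ in range(max_iter):
        added = 0
        for pool, space in zip(pools, spaces):
            for hyp in decode(weights, space):   # re-translate with current weights
                if hyp not in pool:              # merge with the old translations
                    pool.append(hyp)
                    added += 1
        if added == 0:                           # no new translations: stop
            break
        # Toy optimizer over the merged n-best lists.
        weights = min(candidates, key=lambda w: total_errors(w, pools))
    return weights


space = [((1.0, 4.0), 5), ((2.0, 2.0), 3), ((4.0, 0.5), 0)]
candidates = [(1.0, 0.0), (1.0, 1.0), (1.0, 3.0)]
tuned = tune([space], (1.0, 0.0), candidates)
print(tuned)
```

In this toy run the error-free hypothesis is missing from the first n-best list and only appears after re-decoding with the updated weights, which is why a single optimization pass is not enough.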


Avoiding Local Minima

- Optimization can get stuck in a local minimum
- Remedies
  - Fiddle around with the parameters of your optimization algorithm
  - Larger n-best list -> more evaluation points
  - Combine with Simulated Annealing type approach (Smith & Eisner, 2007)
  - Restart multiple times


Random Restarts

- Comparison Simplex/Powell (Alok, unpublished)
- Comparison Simplex/ext. Simplex/MER (Bing Zhao, unpubl.)
- Observations:
  - Alok: Simplex 'jumpier' than Powell
  - Bing: Simplex better than MER
  - Both: you need many restarts
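
Random restarts wrap around any local optimizer. The sketch below uses a deliberately naive one-dimensional hill climber and a bumpy toy function with several local minima; the function, the step size, the search range, and the seed are all assumptions for illustration.

```python
import math
import random
from typing import Callable


def local_search(f: Callable[[float], float], x: float,
                 step: float = 0.1, max_steps: int = 1000) -> float:
    """Greedy descent: move to the better neighbour until none improves."""
    for _ in range(max_steps):
        best = min((x - step, x, x + step), key=f)
        if best == x:
            break
        x = best
    return x


def with_restarts(f: Callable[[float], float], n_restarts: int,
                  low: float, high: float, seed: int = 0) -> float:
    """Run the local search from random start points and keep the best result."""
    rng = random.Random(seed)
    results = [local_search(f, rng.uniform(low, high)) for _ in range(n_restarts)]
    return min(results, key=f)


def bumpy(x: float) -> float:
    # Not convex: a parabola with a sine wave on top.
    return 0.1 * x * x + math.sin(3.0 * x)


single = with_restarts(bumpy, 1, -10.0, 10.0)
many = with_restarts(bumpy, 20, -10.0, 10.0)
print(bumpy(single), bumpy(many))
```

With the same seed, the 20-restart run includes the single run as its first start, so it can only match or improve on it.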


Optimizing NOT Towards References

- Ideally, we want system output identical to the reference translations
- But there is no guarantee that the system can generate the reference translations (under realistic conditions)
  - E.g. we restrict the reordering window
  - We have unknown words
  - Reference translations may have words unknown to the system
- Instead of forcing the decoder towards the reference translations, optimize towards the best translations generated by the system
  - Find hypotheses with best metric score
  - Use those as pseudo references
  - Optimize towards the pseudo references
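
Selecting a pseudo reference can be sketched as follows, assuming word-level edit distance as the metric. The sentences are invented and the helper names are hypothetical. Errors of the remaining hypotheses are then counted against the pseudo reference instead of the real one.

```python
from typing import List, Sequence


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Word-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # match or substitute
        prev = cur
    return prev[-1]


def pseudo_reference(nbest: List[str], reference: str) -> str:
    """The hypothesis in the n-best list that is closest to the real reference."""
    return min(nbest, key=lambda h: edit_distance(h.split(), reference.split()))


reference = "the cat sat on the mat"
nbest = ["cat sat mat", "the cat sat on mat", "a cat is on mat"]

pseudo = pseudo_reference(nbest, reference)
errors = [edit_distance(h.split(), pseudo.split()) for h in nbest]
print(pseudo, errors)
```

By construction the pseudo reference is reachable by the system, so an error count of zero is attainable during tuning.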


Optimizing Towards Different Metrics

- Automatic metrics have different characteristics
- Optimizing towards one does not mean that other metric scores will also go up
- Esp. different metrics prefer shorter or longer translations
  - Typically: TER < BLEU < METEOR (< means 'shorter translation')
- Mauser et al (2007) on Ch-En NIST 2005 test set
  - Reasonably well behaved
  - Resulting length of translation differs by more than 15%


Generalization to other Test Sets

- Optimize on one set, test on multiple other sets
- Again Mauser et al, Ch-En
  - Shown is behavior over Simplex optimization iterations
  - Nice, nearly parallel development of metric scores
- However, we had also observed brittle behavior
  - Esp. when ratio src_length / ref_length is very different between dev and eval test sets


Large Weight = Important Feature?

Assume we have cLM = 1.0, cTM = 0.55, cWC = 3.2

Which feature is most important?

- Cannot say!!!
  - We want to re-rank the n-best lists
  - Feature weights scale feature values such that they can compete
- Example:
  - Variation in LM and TM larger than for WC
  - Need large weight for WC to make small differences effective
- To know if a feature is important, remove it and look at the drop in metric score

      QLM   QTM   QWC     Q
H1     22    83     7   112
H2     29    77     8   116
H3     26    85     9   120
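
The effect of the weights on how much each feature varies can be checked directly from the table. The sketch uses only the per-feature columns and the weights given on this slide, not the Q column (as listed, the H2 components sum to 114 while the table shows 116). The `spread` helper is a made-up name.

```python
features = {"H1": (22, 83, 7), "H2": (29, 77, 8), "H3": (26, 85, 9)}
names = ("LM", "TM", "WC")
weights = (1.0, 0.55, 3.2)  # c_LM, c_TM, c_WC


def spread(index: int, weight: float = 1.0) -> float:
    """Difference between the largest and smallest weighted value of one feature."""
    values = [weight * f[index] for f in features.values()]
    return max(values) - min(values)


for i, name in enumerate(names):
    print(name, spread(i), round(spread(i, weights[i]), 2))
```

Unweighted, the features vary by 7, 8 and 2 across the three hypotheses. Weighted, they vary by about 7.0, 4.4 and 6.4, so the large WC weight mainly brings a feature with small differences up to a range where it can compete.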


Open Issues

- Should not all optimizers get the same results, if done right?
  - The models are the same, it's just finding the right mix
  - If local minima can be avoided, then similarly good optima should be found
- How to stay safe
  - Avoid good optima close to 'cliffs'
  - Different configurations give very similar metric scores; pick one which is more stable
- One hat fits all?
  - Why one set of feature weights?
  - How about different sets for
    - Good/bad translations (tuning on tail: mixed results so far)
    - Short/long sentences
    - Begin/middle/end of sentence
    - ...


Summary

- Optimizing system by modifying scaling factors (feature weights)
- Different optimization approaches can be used
  - Simplex, Powell most common
  - MERT (Och) is similar to Powell, with pre-calculation of grid points
- Many local optima, avoid getting stuck early
  - Most effective: many restarts
- Generalization
  - To unseen test data: mostly ok, sometimes selection of dev set has big impact (length penalty!)
  - To different metrics: reasonably stable (metrics are reasonably correlated in most cases)
- Still open questions => more research needed