
Machine Translation

Day 20


EVALUATING MT

MT Evaluation

Source: ズキズキ痛みます。

16 human translations:

• I have a throbbing pain.
• I am experiencing a throbbing pain.
• I am suffering from a throbbing pain.
• I am feeling a throbbing pain.
• It is a throbbing pain.
• It's throbbing and it really hurts.
• It's painful and it's throbbing.
• It's throbbing with pain.
• It's in throbbing pain.
• It hurts so much it's throbbing.
• I've got a throbbing pain.
• I can feel a throbbing pain.
• I am suffering from a throbbing pain.
• I am experiencing a throbbing pain.
• I have a painful throbbing.
• I feel a painful throbbing.

Data from International Workshop on Spoken Language Translation

MT Evaluation

• No “right answer”!
• What can we test instead?
  – Human adequacy / fluency ratings
  – Human efficacy in an application (e.g. question answering from translated foreign documents vs. native documents)
  – Very accurate, but slow & expensive
• Agreement with reference translations
  – BLEU (BiLingual Evaluation Understudy: IBM)
  – Fast system development

BLEU (Papineni et al., ACL 2002)

• MT output:
  1: It is a guide to action which ensures that the military always obeys the commands of the party.
  2: It is to insure the troops forever hearing the activity guidebook that party direct.

• Human (reference) translations:
  1: It is a guide to action that ensures that the military will forever heed Party commands.
  2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
  3: It is the practical guide for the army always to heed the directions of the party.


BLEU: observations

1: It is a guide to action which ensures that the military always obeys the commands of the party.
2: It is to insure the troops forever hearing the activity guidebook that party direct.

• Observations
  – Word overlap is indicative
  – n-gram (word sequence) overlap is even more distinctive
  – Drawing from multiple reference translations helps

BLEU metric

• Compute n-gram precisions:
  $P_n = c(\text{matched } n\text{-grams}) \,/\, c(n\text{-grams in candidate})$
• Compute a brevity penalty (prevents candidates from deleting difficult words):
  $BP = \exp(\min(1 - r/c,\ 0))$, where $r$ = reference length and $c$ = candidate length
• Combine using the geometric mean:
  $\mathrm{BLEU} = BP \cdot \left( \prod_{i=1}^{n} P_i \right)^{1/n}$
• Produces a score on a 0-1 scale, often expressed as a “percentage” (i.e., multiplied by 100)
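To make the metric concrete, here is a minimal Python sketch of sentence-level BLEU. It follows the formulas for P_n and BP on this slide, with a few assumptions that the slide does not spell out: matches are clipped against the references (as in Papineni et al.), r is taken to be the reference length closest to the candidate length, tokenization is plain lowercasing and whitespace splitting, and n runs up to 4. The function names are illustrative only.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every n-gram (as a tuple of words) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU on a 0-1 scale, using clipped n-gram matches."""
    cand = candidate.lower().split()
    refs = [ref.lower().split() for ref in references]
    if not cand:
        return 0.0

    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        # An n-gram is credited at most as often as it occurs in any single reference.
        ref_max = Counter()
        for ref in refs:
            for gram, count in ngram_counts(ref, n).items():
                ref_max[gram] = max(ref_max[gram], count)
        matched = sum(min(count, ref_max[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if matched == 0:
            return 0.0  # any zero precision makes the geometric mean zero
        log_precision_sum += math.log(matched / total)

    c = len(cand)
    # Assumption: r is the reference length closest to the candidate length.
    r = min((len(ref) for ref in refs), key=lambda length: (abs(length - c), length))
    brevity_penalty = math.exp(min(1 - r / c, 0))
    return brevity_penalty * math.exp(log_precision_sum / max_n)


references = [
    "It is a guide to action that ensures that the military will forever heed Party commands.",
    "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions of the party.",
]
score_1 = bleu("It is a guide to action which ensures that the military always obeys "
               "the commands of the party.", references)
score_2 = bleu("It is to insure the troops forever hearing the activity guidebook "
               "that party direct.", references)
```

On the example from the earlier slide, MT output 1 scores above MT output 2, which shares no trigram with any reference and so receives a score of zero. Real evaluations compute BLEU over a whole test set rather than one sentence, which makes such zeros much rarer.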

BLEU results circa 2002

• Distinguishes humans from machines… [from Papineni et al., ACL 2002]
• …correlates well with human judgments [from G. Doddington, NIST]

However, nowadays we’re starting to see problems:
• Some systems score better than human translations
• In competitions, some “gaming of BLEU”
• Rule-based systems are at a disadvantage after tuning

MT Evaluation: Human

• Absolute evaluation
  – Given a reference translation, human evaluators are asked to rate translation quality on a scale of 1-4:
    4 = Ideal: grammatically correct, all information included
    3 = Acceptable: not perfect, but definitely comprehensible, AND with accurate transfer of all important information
    2 = Possibly acceptable: may be interpretable given context/time, some information transferred accurately
    1 = Unacceptable: absolutely not comprehensible and/or little or no information transferred accurately
• Relative evaluation
  – Human judges are presented with a reference translation and two machine translations in random order, and must pick the better of the two
  – Criteria for decision are left up to the individual judge

Absolute quality: Spanish-English

[Figure: histogram of number of sentences (0 to 120) by quality score (1 to 4, in steps of 0.5) for Babelfish and MSR-MT]

Average quality scores: Babelfish = 2.344, MSR-MT = 2.727

Extrinsic evaluation: Microsoft product support site

• Microsoft support knowledge base
  – Thousands of customer support articles available at http://support.microsoft.com
  – However, most are only available in English
  – Translating all articles by hand is too expensive
  – Instead we present unedited MT articles
  – Available in Spanish, French, German, Japanese, etc.
• Some of the publicly available data-driven translations (2002-2003)

[Screenshots: http://support.microsoft.com]

PSS survey results (Spanish)

• Overall satisfaction with the article (scale: 1 to 9)
  – 86.0% scored between 5 and 9; US English = 74.2%
• Technical accuracy of the article (1 to 9)
  – 75.3% scored between 5 and 9
• Task success
  – “Did the information in the (machine translated) knowledge base article help answer your question?” Yes:
    • Machine translated Spanish = 49.7%
    • Human translated Spanish = 51.2%
    • US English = 53.6%

WORD ALIGNMENT


A very simple MT system

• Get a translation dictionary
• Assign a uniform distribution over all translations of each source word
• Tokenize input sentence, replace each word with its English translation:
    weil    er gestern   gegangen ist
    because he yesterday gone     is
• Not terrible, but not very fluent

Simple Statistical Machine Translation

• Given foreign f, find best English translation e*:
  $e^* = \arg\max_e P(e \mid f)$
• Use Bayes’ rule to get the “noisy channel” model:
  $P(e \mid f) = P(f \mid e) \cdot P(e) \,/\, P(f)$
  $\arg\max_e P(e \mid f) = \arg\max_e P(f \mid e) \cdot P(e)$ (P(f) does not depend on e, so it drops out of the argmax)
• P(f | e) is the channel or translation model
• P(e) is the language model

Toy System A

• Channel model reversed, otherwise identical
  – Now gives a probability of source given target
  – Uniform distribution over all source translations of a given target word
• Word-based bigram model as language model
  – Improves translations in context
  – Improves fluency overall
• Looks like an HMM tagger:
  – Find Viterbi path through a lattice or trellis

Toy System A: search

[Figure: search lattice for “weil er gestern gegangen ist”. Candidate translations per source word: because; he, him, his; yesterday; gone, left; is, are, has, had. Each node carries the score of the best path reaching it, e.g. <s> 0, because -3.2, he -5.6, him -5.4, his -5.9, yesterday -8.3, gone -9.9, left -10.4.]

• Each partial hypothesis keeps track of the last word generated (for LM score) and the total score so far
• Only need to keep the best hypothesis ending in some word, since the bigram LM can’t see beyond that (Viterbi!)
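The lattice search can be sketched in a few lines of Python. This is only an illustration of the idea, not the system from the slides: the candidate translations are the ones shown in the lattice, but the bigram scores (`toy_bigram`, `favored`) are made-up stand-ins for a trained language model, and the function names are hypothetical.

```python
import math
from collections import defaultdict


def viterbi_translate(source, options, bigram_logprob):
    """Monotone word-for-word decoding with a bigram LM (a sketch of Toy System A)."""
    # Reversed channel model: uniform over the source words each target word can translate.
    sources_of = defaultdict(set)
    for src, targets in options.items():
        for tgt in targets:
            sources_of[tgt].add(src)

    # best[word] = (score, words so far) for the best hypothesis ending in `word`.
    best = {"<s>": (0.0, [])}
    for src in source:
        new_best = {}
        for tgt in options[src]:
            channel = -math.log(len(sources_of[tgt]))
            # Viterbi step: keep only the best way of reaching `tgt`.
            new_best[tgt] = max(
                ((score + bigram_logprob(prev, tgt) + channel, words + [tgt])
                 for prev, (score, words) in best.items()),
                key=lambda hyp: hyp[0],
            )
        best = new_best
    return max(best.values(), key=lambda hyp: hyp[0])


options = {
    "weil": ["because"],
    "er": ["he", "him", "his"],
    "gestern": ["yesterday"],
    "gegangen": ["gone", "left"],
    "ist": ["is", "are", "has", "had"],
}

# Made-up bigram scores: a handful of favored bigrams, everything else is unlikely.
favored = {("<s>", "because"), ("because", "he"), ("he", "yesterday"),
           ("yesterday", "left"), ("left", "had")}


def toy_bigram(prev, word):
    return -1.0 if (prev, word) in favored else -5.0


score, words = viterbi_translate("weil er gestern gegangen ist".split(), options, toy_bigram)
```

Because the language model only looks one word back, the dictionary `best` never needs more than one entry per target word, which is what keeps the search linear in sentence length.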

Learning the translation model

• Start from seminal work by IBM in the late 1980s and early 1990s
• They developed models for identifying word correspondences (word alignments) in parallel data


• Say we had some word-aligned parallel data
• How would we estimate a translation model?

[Figure: word-aligned sentence pairs “the house” / “la maison”, “the flower” / “la fleur”, “the blue house” / “la maison bleu”]


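• Say I had a model of P(french | english)
• How can I find alignments?

[Figure: the sentence pairs “the house” / “la maison”, “the flower” / “la fleur”, “the blue house” / “la maison bleu”, without alignment links, next to two translation tables]

P(· | blue):
  Word   Prob
  bleu   0.8
  …      …

P(· | the):
  Word   Prob
  la     0.3
  le     0.3
  les    0.2
  …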

Parameter estimation

• Given lists of parallel sentences (e, f)
• If we had the hidden alignments a, then we could estimate multinomial parameters based on counts:
  $c(e, f)$ := number of times e was aligned to f
  $c(e)$ := number of occurrences of e
  $t(f \mid e) := c(e, f) \,/\, c(e)$
• On the other hand, if we knew the parameters $t(\cdot \mid \cdot)$, we could find the most likely alignments
• Bit of a chicken-and-egg problem…
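The first half of the chicken-and-egg problem, estimating t(f | e) when alignments are known, is just counting. The sketch below assumes the three toy sentence pairs from the earlier slide with hand-written alignment links (the links are an assumption for illustration), and it counts c(e) over aligned occurrences only, which is the same as counting all occurrences when every English word is aligned exactly once.

```python
from collections import Counter


def estimate_t(aligned_pairs):
    """Relative-frequency estimate t(f | e) = c(e, f) / c(e) from known alignments.

    aligned_pairs: list of (english_tokens, french_tokens, links),
    where each link (i, j) says english_tokens[i] is aligned to french_tokens[j].
    """
    pair_count = Counter()
    e_count = Counter()
    for e_sent, f_sent, links in aligned_pairs:
        for i, j in links:
            pair_count[(e_sent[i], f_sent[j])] += 1
            e_count[e_sent[i]] += 1
    # Keys of the result are (f, e), read as "f given e".
    return {(f, e): n / e_count[e] for (e, f), n in pair_count.items()}


# Hypothetical gold alignments for the toy corpus.
aligned = [
    (["the", "house"], ["la", "maison"], [(0, 0), (1, 1)]),
    (["the", "flower"], ["la", "fleur"], [(0, 0), (1, 1)]),
    (["the", "blue", "house"], ["la", "maison", "bleu"], [(0, 0), (1, 2), (2, 1)]),
]
t_counts = estimate_t(aligned)
```

With these links every English word always maps to the same French word, so each estimated probability is 1.0; a larger corpus would spread the mass over several translations.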


Expectation-Maximization

• Enter the Expectation-Maximization algorithm
  – Method for optimizing parameters / finding hidden state in unsupervised problems
• A procedural description for now
  – Pick an initial set of parameters $t_0(f \mid e)$, set $k = 0$
  – Until convergence…
    • Find expected values of the hidden states $a_{k+1}$ for each pair assuming parameters $t_k$ are correct (Expectation)
    • Find the most likely parameters $t_{k+1}$ assuming that hidden states $a_{k+1}$ are correct (Maximization)
    • Increment k
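As a rough illustration of the procedure, here is a small EM loop for word-translation probabilities in Python. It assumes the simplest alignment model (every French word is generated by exactly one English word or by a [null] word, all positions equally likely), a uniform starting point, and a fixed number of iterations instead of a convergence test. The function name and the `(f, e)` key layout are choices made for this sketch.

```python
from collections import defaultdict


def train_model1(pairs, iterations=10):
    """EM for t(f | e); returns a dict keyed by (f, e)."""
    # Every English sentence gets a [null] word that can generate French words.
    pairs = [(["[null]"] + e_sent, f_sent) for e_sent, f_sent in pairs]
    f_vocab = {f for _, f_sent in pairs for f in f_sent}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform t_0

    for _ in range(iterations):
        count = defaultdict(float)  # expected c(e, f)
        total = defaultdict(float)  # expected c(e)
        # Expectation: distribute each French word over the English words
        # in its sentence, in proportion to the current t.
        for e_sent, f_sent in pairs:
            for f in f_sent:
                z = sum(t[(f, e)] for e in e_sent)
                for e in e_sent:
                    posterior = t[(f, e)] / z
                    count[(f, e)] += posterior
                    total[e] += posterior
        # Maximization: re-estimate t from the expected counts.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


corpus = [
    (["the", "house"], ["la", "maison"]),
    (["the", "flower"], ["la", "fleur"]),
    (["the", "blue", "house"], ["la", "maison", "bleu"]),
]
t_em = train_model1(corpus, iterations=20)
```

On this toy corpus the probabilities move in the intuitive direction: "la" co-occurs with "the" in every sentence, so it is increasingly explained by "the" (and [null]), which frees "maison" to pair with "house", "fleur" with "flower", and "bleu" with "blue".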


Model 1

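[Figure: the three sentence pairs, each English sentence extended with a [null] word: “[null] the house” / “la maison”, “[null] the flower” / “la fleur”, “[null] the blue house” / “la maison bleu”]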

Model 1, EM iteration 0

[Figure: all alignment links start out uniform. Each French word links to each English word (including [null]) with probability 0.33 in “the house” / “la maison” and “the flower” / “la fleur”, and with probability 0.25 in “the blue house” / “la maison bleu”.]


[Figures: three later EM snapshots on the same sentence pairs. The link probabilities sharpen from one snapshot to the next: the strongest link in the first pair (house-maison) rises from 0.42 to 0.46 to 0.64, in the second pair (flower-fleur) from 0.60 to 0.74 to 0.96, and in the third pair (blue-bleu) from 0.45 to 0.60 to 0.91.]

IBM word-based translation (Brown et al., 1993)

• Model P(f | e): French translations given English

[Figure: word alignment between “[null] I do not speak French” and “je ne parle pas français”]

Model 1

• Lots of simplifying assumptions:
  – All lengths are equally likely:
    $P(m \mid e) \sim \text{uniform} = \epsilon$
  – All word alignments are equally likely:
    $P(a_j \mid a_1^{j-1}, f_1^{j-1}, m, e) \sim \text{uniform} = 1 / (l + 1)$
  – French word depends only on the English word it’s aligned to:
    $P(f_j \mid a_1^{j}, f_1^{j-1}, m, e) \sim t(f_j \mid e_{a_j})$ (a multinomial, one per English word)
• Resulting model:
  $P(f, a \mid e) = \frac{\epsilon}{(l + 1)^m} \prod_{j=1}^{m} t(f_j \mid e_{a_j})$
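The Model 1 formula is easy to evaluate directly, and because it factors over French positions, the most likely alignment can be found one French word at a time. The Python sketch below shows both. The values 0.3 for t(la | the) and 0.8 for t(bleu | blue) come from the example tables earlier in the deck; the remaining table entries, the choice ε = 1, and the function names are assumptions made for illustration.

```python
def model1_prob(f_sent, e_sent, alignment, t, epsilon=1.0):
    """P(f, a | e) = epsilon / (l + 1)^m * prod_j t(f_j | e_{a_j}).

    alignment[j] is the English position for French word j; position 0 is [null].
    """
    e_ext = ["[null]"] + e_sent
    l, m = len(e_sent), len(f_sent)
    prob = epsilon / (l + 1) ** m
    for f_word, a_j in zip(f_sent, alignment):
        prob *= t.get((f_word, e_ext[a_j]), 0.0)
    return prob


def best_alignment(f_sent, e_sent, t):
    """Most likely alignment: each French word independently picks its best English word."""
    e_ext = ["[null]"] + e_sent
    return [
        max(range(len(e_ext)), key=lambda i: t.get((f_word, e_ext[i]), 0.0))
        for f_word in f_sent
    ]


# Partly hypothetical translation table, keyed by (f, e).
t_table = {
    ("la", "the"): 0.3,
    ("bleu", "blue"): 0.8,
    ("maison", "house"): 0.7,  # assumed value
    ("la", "house"): 0.1,      # assumed value
    ("maison", "blue"): 0.1,   # assumed value
}
english = ["the", "blue", "house"]
french = ["la", "maison", "bleu"]

alignment = best_alignment(french, english, t_table)
probability = model1_prob(french, english, alignment, t_table)
```

Here "la" aligns to position 1 ("the"), "maison" to position 3 ("house"), and "bleu" to position 2 ("blue"). The per-word argmax is only valid because Model 1 has no distortion or fertility terms; the later models need a real search.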


A generative story (IBM Models 1-2, HMM)

Notation:
• E, F: English, French vocabularies
• $e = e_1^l = (e_1, \ldots, e_l)$: English sentence, $e_i \in E$
• $f = f_1^m = (f_1, \ldots, f_m)$: French sentence, $f_j \in F$
• $a = a_1^m = (a_1, \ldots, a_m)$: word alignment, $a_j \in [0..l]$

$P(f, a \mid e) = P(m \mid e) \cdot \prod_{j=1}^{m} \left( P(a_j \mid a_1^{j-1}, f_1^{j-1}, m, e) \cdot P(f_j \mid a_1^{j}, f_1^{j-1}, m, e) \right)$

Exact: chain rule!

• $P(m \mid e)$: pick the length of the French sentence
• $\prod_{j=1}^{m}$: for each position in the French sentence…
• $P(a_j \mid a_1^{j-1}, f_1^{j-1}, m, e)$: pick the English word aligned to the French word in that position, then…
• $P(f_j \mid a_1^{j}, f_1^{j-1}, m, e)$: pick the French word in that position

Progression of alignment models

• Models of increasing complexity
  – Only Model 1 is convex
• Models 3, 4, 5 each capture new aspects of the sentence
  – Capture “fertility”
  – Different movement models
  – Each model can initialize its successor, which helps avoid local minima
• Freely available tools for this task
  – GIZA++
  – Berkeley aligner

Model   Translation   Distortion   Fertility
1       Yes           ---          ---
2       Yes           Abs          ---
HMM     Yes           Rel          ---
3       Yes           Abs          Yes
4       Yes           Rel          Yes
5       Yes           Rel          Yes

Toy System A’

• Our prior toy system used a uniform distribution for translations
• Now we can plug in Model 1 parameters
• Language model helps pick translations that are fluent
• Translation model helps pick translations that are adequate
• Looks just like an HMM!

Toy System A’

[Figure: search lattice for “weil er gestern gegangen ist” with candidate translations (because; he, him, his; yesterday; gone, left; is, are, has, had) and path scores on the nodes]

• Each translation is like a part-of-speech tag
• Becomes bigram LM + Model 1!

Some questions:

• What about standard translation dictionaries? Should we include them, and how?

• What translation phenomena are we covering and what are we missing?

• Does it work?


Toy System B

• System A: finds better translations in context, but can’t reorder
  – “er gestern gegangen ist” → “he yesterday left had” (should be “he had left yesterday”)
• System B: allow all possible permutations
  – Each hypothesis now remembers:
    • Last target word generated
    • Set of source words already translated
  – 5! = 120 permutations, 10! = 3.6M, 20! = 2.43e18
  – No way we can afford to keep all translations!
• Group into stacks based on count of words covered
  – Histogram pruning: limited number of hypotheses on any stack
  – Threshold pruning: only keep hypotheses within d of best on stack
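The two pruning rules can be combined in one small function. The Python sketch below assumes each stack is a list of (score, hypothesis) pairs with log-probability scores, so higher is better; the function name, the parameter names, and the example stack are illustrative only.

```python
def prune(stack, max_size, d):
    """Apply threshold pruning, then histogram pruning, to one stack.

    stack: list of (score, hypothesis) pairs; higher score is better.
    max_size: histogram limit on the number of hypotheses kept.
    d: threshold; hypotheses scoring more than d below the best are dropped.
    """
    if not stack:
        return []
    ranked = sorted(stack, key=lambda hyp: hyp[0], reverse=True)
    best_score = ranked[0][0]
    within_threshold = [hyp for hyp in ranked if hyp[0] >= best_score - d]
    return within_threshold[:max_size]


# Example stack with made-up hypotheses and scores.
stack = [(-3.2, "because"), (-5.6, "he"), (-5.4, "him"), (-5.9, "his"), (-8.3, "yesterday")]
kept = prune(stack, max_size=3, d=2.5)
```

In this example the threshold removes everything scoring below -5.7, and the histogram limit would cap the survivors at three. Both limits trade search errors for speed, so they are usually tuned on held-out data.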

Toy System B: search

[Figure: hypotheses for “weil er gestern gegangen ist” grouped into Stack 0, Stack 1, Stack 2, … by number of source words covered. Each hypothesis shows its last target word, its score, and a coverage bit vector over the source words, e.g. <s> 0 with 00000, because -3.2 with 10000, he -3.5 with 01000, yesterday -1.9 with 00100.]

Like an expanded Viterbi search, but each hypothesis also needs to remember which source words have been translated already!

Beyond Toy System B

• Many problems with this system:
  – System allows all possible reorderings, but some are much more likely than others
  – Contextual information is only captured by the target language model, not in the source
• Multiple paths from here:
  – Better word alignment
  – Phrase-based translation: learn bigger translation units (this is crucial!)
  – Better reordering models: syntax can help here

Word-based MT results

SRC: 对外经济贸易合作部今天提供的数据表明，今年至十一月中国实际利用外资四百六十九点五九亿美元，其中包括外商直接投资四百点零七亿美元。
REF: According to the data provided today by the Ministry of Foreign Trade and Economic Cooperation, as of November this year, China has actually utilized 46.959 billion US dollars of foreign capital, including 40.007 billion US dollars of direct investment from foreign businessmen.
WB: The Ministry of Foreign Trade and Economic Cooperation, including foreign direct investment 40.007 billion US dollars today provide data include that year to November china actually using 46.959 billion US dollars and

SRC: La politique de la haine
REF: Politics of hate
WB: The policy of the hatred

SRC: Nous avons signé le protocole.
REF: We did sign the memorandum of agreement.
WB: We have signed the protocol.

SRC: Où était le plan solide?
REF: But where was the solid plan?
WB: Where was the economic base?

Word alignment and phrase extraction (Koehn, Och, Marcu 2003)

[Figure: word alignment between “the blue house” and “a casa azul”: the-a, blue-azul, house-casa]

Extracted phrase pairs:
• the → a
• blue → azul
• house → casa
• blue house → casa azul
• the blue house → a casa azul
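The phrase pairs for “the blue house” / “a casa azul” can be reproduced with a short consistency check: a source span and a target span form a phrase pair when no alignment link crosses the boundary of the box they define. The Python below is a simplified sketch of that idea, not the exact procedure of Koehn, Och, and Marcu (2003); in particular it does not extend phrases over unaligned target words, and the function name and `max_len` parameter are assumptions.

```python
def extract_phrases(src, tgt, alignment, max_len=4):
    """Return the set of (source phrase, target phrase) pairs consistent with the alignment.

    alignment: set of (i, j) links meaning src[i] is aligned to tgt[j].
    """
    pairs = set()
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to the source span [i1, i2].
            tgt_positions = [j for (i, j) in alignment if i1 <= i <= i2]
            if not tgt_positions:
                continue
            j1, j2 = min(tgt_positions), max(tgt_positions)
            # Consistency: no target word inside [j1, j2] may link outside [i1, i2].
            if any(j1 <= j <= j2 and not (i1 <= i <= i2) for (i, j) in alignment):
                continue
            pairs.add((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs


src = ["the", "blue", "house"]
tgt = ["a", "casa", "azul"]
links = {(0, 0), (1, 2), (2, 1)}  # the-a, blue-azul, house-casa
phrase_pairs = extract_phrases(src, tgt, links)
```

Note that “the blue” yields no pair: its target span “a casa azul” contains “casa”, which is linked to “house” outside the source span.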

Phrase table

• Extract phrases from all sentence pairs
• Estimate P(src | tgt) with c(src, tgt) / c(tgt)

Portuguese            English                  Prob
ver                   see                      0.533
ver                   view                     0.129
ver                   to see                   0.044
ver                   viewing                  0.009
ver                   seeing                   0.008
ver                   watch                    0.007
…
ver o mundo através   view the world through   1.000
ver e adquirir        browse and purchase      1.000
ver ou editar         view or edit             0.875
ver filmes            watch movies             0.667

Word-based vs. phrase-based (BLEU score vs. training data size)

[Figure: BLEU score (20 to 30) against training data size (40k to 320k) for “Phrases from word alignment” and “Word-based” systems; from Koehn, Och, and Marcu 2003]

These systems, with sufficient data, produce better translations than rule-based systems… mostly.

Syntax in translation

• Phrases capture contextual translation and local reordering surprisingly well
• However, this information is brittle:
  – “author of the book → 本書的作者” tells us nothing about how to translate “author of the pamphlet” or “author of the play”
  – The Chinese phrase “NOUN1 的 NOUN2” becomes “NOUN2 of NOUN1” in English
• No information about global reordering
  – In Chinese, prepositional phrases often come before verbs; in English, they come after

Syntax-based source reordering

• Language is hierarchical, and our models should capture this
• Phrasal cohesion (Fox, 2002): most often, each source constituent translates to a contiguous target constituent
• Source parse trees can inform reordering
  – First parse the source sentence
  – Then use information about the source to guide reordering

Wang, Collins, Koehn (2007): Parse the Chinese, reorder like English

Some pertinent rules

Syntax-directed translation

• Begin by parsing source sentence
  – Syntactic analysis can guide reordering and inform translation
• One approach: Treelet translation (Quirk, Menezes, and Cherry, 2005)
  – Use dependency trees: minimal amount of syntactic information (just head node)

Treelet and template extraction

• Start from word-aligned sentence pairs: “the blue house” / “a casa azul”
• Parse source: the/DT blue/JJ house/NN
• Project tree: the source dependency tree is projected onto “a casa azul”
• Extract treelet pairs (treelet: connected subgraph of the dependency tree):
  – the → a
  – blue → azul
  – house → casa
  – blue house → casa azul
  – the blue house → a casa azul
  – the house → a casa
• Extract templates:

[Figure: the aligned trees with the words replaced by wildcards, leaving only the tags */DT, */JJ, */NN on the source side]
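Since a treelet is defined as a connected subgraph of the dependency tree, the treelets of a short sentence can be listed by brute force. The Python sketch below assumes that “house” heads both “the” and “blue” (consistent with the six treelet pairs listed for this example) and uses the fact that, in a rooted tree, a set of nodes is connected exactly when only one of its nodes has its head outside the set. The function name and the `heads` encoding are choices made for this sketch; a real system would not enumerate all subsets.

```python
from itertools import combinations


def treelets(heads):
    """Enumerate connected subgraphs of a dependency tree.

    heads: dict mapping each word index to the index of its head (None for the root).
    Returns a list of tuples of word indices, in sentence order.
    """
    nodes = sorted(heads)
    found = []
    for size in range(1, len(nodes) + 1):
        for subset in combinations(nodes, size):
            chosen = set(subset)
            # Nodes whose head lies outside the subset are the tops of its components.
            tops = [n for n in subset if heads[n] not in chosen]
            if len(tops) == 1:  # exactly one component: connected
                found.append(subset)
    return found


words = ["the", "blue", "house"]
heads = {0: 2, 1: 2, 2: None}  # assumed parse: "house" is the root
treelet_strings = {" ".join(words[i] for i in subset) for subset in treelets(heads)}
```

This yields six treelets, including the discontiguous “the house”, which a phrase-based system restricted to contiguous spans could not extract.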

Europarl English-Spanish

[Figure: BLEU (20% to 35%) on devtest, in-domain, and out-of-domain test sets for the Phrasal and Template systems]

Impact of preserving ambiguity

• Start with treelet systems
  – Technical English-German, English-Japanese
  – Newswire Chinese-English
• Translate each of k-best parses independently
• Keep the translation with the best score
• Evaluate using BLEU

parses   EG     EJ     CE
1        33.6   36.0   28.2
2        33.8   36.1   28.5
4        34.1   36.3   28.9
8        34.3   36.6   29.2
16       34.5   36.8   29.7
32       34.8   37.1   30.0

Target language syntax

• If we want a grammatical translation, shouldn’t we use a grammar?
• Use a parser in the target language instead
  – Translation becomes cross-lingual parsing: find the best English parse tree for a Chinese sentence
  – Great for translating into English or other languages with lots of linguistic resources
• Later approaches capture larger synchronous rules at a time (Marcu et al., 2006) and are pretty successful, though somewhat slow in comparison
  – Ongoing research to speed things up

What about morphology?

• Most of these approaches treat words as indivisible units

• Some recent work addresses this problem:
  – Phrasal translations of morpheme sequences (requires morphological segmentation)

Remaining limitations

• Most systems consider only a single sentence at a time
  – What about discourse phenomena?
  – Coreference?
• How do we handle unknown words?
• Where do we get the data?

Thanks!
