Machine Translation
Day 20
EVALUATING MT
MT Evaluation
Source: ズキズキ 痛み ます 。
16 human translations:
• I have a throbbing pain.
• I am experiencing a throbbing pain.
• I am suffering from a throbbing pain.
• I am feeling a throbbing pain.
• It is a throbbing pain.
• It's throbbing and it really hurts.
• It's painful and it's throbbing.
• It's throbbing with pain.
• It's in throbbing pain.
• It hurts so much it's throbbing.
• I've got a throbbing pain.
• I can feel a throbbing pain.
• I am suffering from a throbbing pain.
• I am experiencing a throbbing pain.
• I have a painful throbbing.
• I feel a painful throbbing.
Data from International Workshop on Spoken Language Translation
MT Evaluation
• No “right answer”!
• What can we test instead?
  – Human adequacy / fluency ratings
  – Human efficacy in an application (e.g. question answering from translated foreign documents vs. native documents)
  – Very accurate, but slow & expensive
• Agreement with reference translations
  – BLEU (BiLingual Evaluation Understudy: IBM)
  – Fast system development
BLEU (Papineni, ACL 2002)
• MT output:
  1: It is a guide to action which ensures that the military always obeys the commands of the party.
  2: It is to insure the troops forever hearing the activity guidebook that party direct.
• Human (reference) translations:
  1: It is a guide to action that ensures that the military will forever heed Party commands.
  2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
  3: It is the practical guide for the army always to heed the directions of the party.
BLEU: observations
1: It is a guide to action which ensures that the military always obeys the commands of the party.
2: It is to insure the troops forever hearing the activity guidebook that party direct.
• Observations
  – Word overlap is indicative
  – n-gram (word sequence) overlap is even more distinct
  – Drawing from multiple reference translations helps
BLEU metric
• Compute n-gram precisions:
  $P_n = c(\text{matched } n\text{-grams}) / c(n\text{-grams in candidate})$
• Compute a brevity penalty (prevents candidates from deleting difficult words):
  $BP = \exp(\min(1 - r/c,\ 0))$, where r = reference length, c = candidate length
• Combine using geometric mean:
  $BLEU = BP \cdot \left( \prod_{i=1}^{n} P_i \right)^{1/n}$
• Produces score on a 0-1 scale – often expressed as a “percentage” (e.g., * 100)
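The metric above can be sketched in a few lines of Python. This is a minimal sentence-level illustration, not the reference implementation: it assumes lowercased whitespace tokenization, clips each candidate n-gram count by its maximum count in any single reference (the usual reading of "matched n-grams"), and takes r as the reference length closest to the candidate length. The function names `ngrams` and `bleu` are chosen here for illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU sketch: BP * geometric mean of P_1..P_max_n."""
    cand = candidate.lower().split()
    refs = [ref.lower().split() for ref in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        # Assumption: a candidate n-gram is "matched" at most as often as it
        # appears in the single reference that contains it most often.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if matched == 0 or total == 0:
            return 0.0  # the geometric mean is zero if any precision is zero
        log_precisions.append(math.log(matched / total))
    c = len(cand)
    # Assumption: r is the reference length closest to the candidate length.
    r = min((abs(len(ref) - c), len(ref)) for ref in refs)[1]
    bp = math.exp(min(1 - r / c, 0))
    return bp * math.exp(sum(log_precisions) / max_n)


refs = [
    "It is a guide to action that ensures that the military will forever heed Party commands.",
    "It is the guiding principle which guarantees the military forces always being under the command of the Party.",
    "It is the practical guide for the army always to heed the directions of the party.",
]
cand1 = "It is a guide to action which ensures that the military always obeys the commands of the party."
cand2 = "It is to insure the troops forever hearing the activity guidebook that party direct."
score1, score2 = bleu(cand1, refs), bleu(cand2, refs)
```

On the example sentences, the first MT output shares long n-grams with the references, while the second shares no trigram with any of them, so its score collapses to zero. That collapse is one reason BLEU is normally computed over a whole test corpus rather than per sentence.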
BLEU results circa 2002
[from Papineni et al., ACL 2002] [from G. Doddington, NIST]
Distinguishes humans from machines… …correlates well with human judgments
• However nowadays we’re starting to see problems:
  – Some systems score better than human translations
  – In competitions, some “gaming of BLEU”
  – Rule based systems are at a disadvantage after tuning
MT Evaluation: Human
• Absolute evaluation
  – Given a reference translation, human evaluators are asked to rate translation quality on a scale of 1-4
    4 = Ideal: grammatically correct, all information included
    3 = Acceptable: not perfect, but definitely comprehensible, AND with accurate transfer of all important information
    2 = Possibly acceptable: may be interpretable given context/time, some information transferred accurately
    1 = Unacceptable: absolutely not comprehensible and/or little or no information transferred accurately
• Relative evaluation
  – Human judges are presented with a reference translation and two machine translations in random order, and must pick the better of the two
  – Criteria for decision are left up to individual judge
Absolute quality: Spanish-English
[Figure: histogram of number of sentences (0-120) per quality score (1-4, in steps of 0.5) for Babelfish and MSR-MT]
Average quality scores: Babelfish=2.344 MSR-MT=2.727
Extrinsic evaluation: Microsoft product support site
• Microsoft support knowledge base
  – Thousands of customer support articles available at http://support.microsoft.com
  – However, most are only available in English
  – Translating all articles by hand is too expensive
  – Instead we present unedited MT articles
  – Available in Spanish, French, German, Japanese, etc.
• Some of the publicly available data-driven translations (2002-2003)
PSS survey results (Spanish)
• Overall satisfaction with the article (scale: 1 to 9)
  – 86.0% scored between 5 and 9; US English = 74.2%
• Technical accuracy of the article (1 to 9)
  – 75.3% scored between 5 and 9
• Task success
  – “Did the information in the (machine translated) knowledge base article help answer your question?” – Yes:
    • Machine translated Spanish = 49.7%
    • Human translated Spanish = 51.2%
    • US English = 53.6%
WORD ALIGNMENT
A very simple MT system
• Get a translation dictionary
• Assign a uniform distribution over all translations of each source word
• Tokenize input sentence, replace each word with its English translation:
  weil er gestern gegangen ist
  because he yesterday gone is
• Not terrible, but not very fluent
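The recipe above fits in a few lines. The sketch below is illustrative only: the `DICTIONARY` entries are taken from the translation options shown for this example sentence later in the deck, and the assignment of each option to a German word is an assumption. Because the distribution is uniform, any option is as good as any other; the function takes the first one unless a random generator is passed in.

```python
import random

# Hypothetical translation dictionary; the word-to-option mapping is assumed.
DICTIONARY = {
    "weil": ["because"],
    "er": ["he", "him", "his"],
    "gestern": ["yesterday"],
    "gegangen": ["gone", "left"],
    "ist": ["is", "had", "are", "has"],
}


def translate_word_for_word(sentence, dictionary, rng=None):
    """Replace each source word with one of its dictionary translations."""
    output = []
    for word in sentence.lower().split():
        options = dictionary.get(word, [word])  # unknown words pass through
        # Uniform distribution: every option is equally likely, so either
        # sample one at random or simply take the first.
        output.append(rng.choice(options) if rng else options[0])
    return " ".join(output)


first_choice = translate_word_for_word("weil er gestern gegangen ist", DICTIONARY)
sampled = translate_word_for_word("weil er gestern gegangen ist", DICTIONARY, random.Random(0))
```

Nothing in this system looks at neighbouring words or word order, which is why the output keeps German word order.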
Simple Statistical Machine Translation
• Given foreign f, find best English translation e*:
  $e^* = \arg\max_e P(e \mid f)$
• Use Bayes’ rule to get “noisy channel” model:
  $P(e \mid f) = P(f \mid e) \cdot P(e) / P(f)$
  $\arg\max_e P(e \mid f) = \arg\max_e P(f \mid e) \cdot P(e)$
• P(f | e) is the channel or translation model
• P(e) is the language model
Toy System A
• Channel model reversed, otherwise identical
  – Now gives a probability of source given target
  – Uniform distribution over all source translations of a given target word
• Word-based bigram model as language model
  – Improves translations in context
  – Improves fluency overall
• Looks like an HMM tagger:
  – Find Viterbi path through a lattice or trellis
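A small sketch of this search, under stated assumptions: the `OPTIONS` table assigns the translation options from the example lattice to German words (an assumed mapping), and the bigram language model is a made-up stand-in that gives log-probability -1 to a handful of "likely" bigrams and -5 to everything else. The channel model is uniform over the source words that can produce a given target word, as the slide describes.

```python
import math

# Assumed mapping of translation options to source words.
OPTIONS = {
    "weil": ["because"],
    "er": ["he", "him", "his"],
    "gestern": ["yesterday"],
    "gegangen": ["gone", "left"],
    "ist": ["is", "had", "are", "has"],
}

# Hypothetical bigram LM: these pairs are "likely", everything else is not.
LIKELY_BIGRAMS = {
    ("<s>", "because"), ("because", "he"), ("he", "yesterday"),
    ("yesterday", "left"), ("left", "had"),
}


def bigram_logprob(prev, word):
    return -1.0 if (prev, word) in LIKELY_BIGRAMS else -5.0


def channel_logprob(target_word, options):
    """log P(f | e): uniform over all source words that e can translate."""
    sources = [f for f, targets in options.items() if target_word in targets]
    return math.log(1.0 / len(sources))


def viterbi_decode(source, options):
    """Monotone decoding: keep only the best hypothesis ending in each word."""
    best = {"<s>": (0.0, [])}  # last word -> (total score, words so far)
    for f in source:
        new_best = {}
        for e in options[f]:
            new_best[e] = max(
                ((score + bigram_logprob(prev, e) + channel_logprob(e, options), path + [e])
                 for prev, (score, path) in best.items()),
                key=lambda item: item[0],
            )
        best = new_best
    return max(best.values(), key=lambda item: item[0])


score, words = viterbi_decode("weil er gestern gegangen ist".split(), OPTIONS)
```

With this toy language model the best path is "because he yesterday left had": better word choices in context, but still in source order, since a monotone Viterbi search cannot reorder.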
Toy System A: search
[Figure: search lattice for “weil er gestern gegangen ist”. Rows of translation options: “because he yesterday gone is”, “him left had”, “his are”, “has”. Partial hypothesis scores: <s> 0, because -3.2, he -5.6, him -5.4, his -5.9, yesterday -8.3, gone -9.9, left -10.4, eat -10.3, eat -9.8]
• Each partial hypothesis keeps track of the last word generated (for LM score) and the total score so far
• Only need to keep the best hypothesis ending in some word – bigram LM can’t see beyond that (Viterbi!)
Learning the translation model
• Start from seminal work by IBM in the late 1980s and early 1990s
• They developed models for identifying word correspondences (word alignments) in parallel data
Learning the translation model
• Say we had some word aligned parallel data
• How would we estimate a translation model?
[Figure: word-aligned sentence pairs: “the house” / “la maison”, “the flower” / “la fleur”, “the blue house” / “la maison bleu”]
Learning the translation model
• Say I had a model of P(french | english)
• How can I find alignments?
[Figure: the sentence pairs “the house” / “la maison”, “the flower” / “la fleur”, “the blue house” / “la maison bleu”, with translation tables for two English words]
blue
Word  Prob
bleu  0.8
…     …
the
Word  Prob
la    0.3
le    0.3
les   0.2
…
Parameter estimation
• Given lists of parallel sentences (e, f)
• If we had the hidden alignments a, then we could estimate multinomial parameters based on counts
  c(e, f) := number of times e was aligned to f
  c(e) := number of occurrences of e
  t(f | e) := c(e, f) / c(e)
• On the other hand, if we knew the parameters t(∙ | ∙), we could find the most likely alignments
• Bit of a chicken-and-egg problem…
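The first half of that chicken-and-egg pair is easy to sketch: if alignments are known, estimation is just counting. The links below are hypothetical hand alignments for the three toy sentence pairs, and the sketch assumes every English word occurrence has exactly one link, so that c(e) equals the number of links leaving e.

```python
from collections import Counter

# Hypothetical hand-aligned links (english_word, french_word) for
# "the house" / "la maison", "the flower" / "la fleur",
# "the blue house" / "la maison bleu".
aligned_links = [
    ("the", "la"), ("house", "maison"),
    ("the", "la"), ("flower", "fleur"),
    ("the", "la"), ("house", "maison"), ("blue", "bleu"),
]


def estimate_t(links):
    """Relative-frequency estimate t(f | e) = c(e, f) / c(e)."""
    pair_counts = Counter(links)              # c(e, f)
    e_counts = Counter(e for e, _ in links)   # c(e), one link per occurrence
    return {(f, e): count / e_counts[e] for (e, f), count in pair_counts.items()}


t = estimate_t(aligned_links)
```

The resulting table is keyed as `t[(f, e)]`, read "probability of f given e".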
Expectation-Maximization
• Enter the Expectation-Maximization algorithm
  – Method for optimizing parameters / finding hidden state in unsupervised problems
• A procedural description for now
  – Pick an initial set of parameters $t_0(f \mid e)$, set k = 0
  – Until convergence…
    • Find expected values of the hidden states $a_{k+1}$ for each pair assuming parameters $t_k$ are correct (Expectation)
    • Find the most likely parameters $t_{k+1}$ assuming that hidden states $a_{k+1}$ are correct (Maximization)
    • Increment k
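The loop above can be sketched for the word-translation table t(f | e) as follows. This is a minimal illustration, assuming a uniform initial table, a fixed number of iterations instead of a convergence test, and a `[null]` token prepended to every English sentence; the function name `train_model1` is chosen here for illustration. The expectation step computes, for each French word, how likely each English word in the same sentence is to have generated it; the maximization step renormalizes those fractional counts.

```python
from collections import defaultdict

PAIRS = [
    (["the", "house"], ["la", "maison"]),
    (["the", "flower"], ["la", "fleur"]),
    (["the", "blue", "house"], ["la", "maison", "bleu"]),
]


def train_model1(pairs, iterations=10):
    """EM for a word translation table t[(f, e)] = t(f | e)."""
    f_vocab = {f for _, fs in pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform initial parameters
    for _ in range(iterations):
        count = defaultdict(float)  # expected c(e, f)
        total = defaultdict(float)  # expected c(e)
        for es, fs in pairs:
            es = ["[null]"] + es
            for f in fs:
                # Expectation: posterior probability that f aligns to each e.
                z = sum(t[(f, e)] for e in es)
                for e in es:
                    p = t[(f, e)] / z
                    count[(f, e)] += p
                    total[e] += p
        # Maximization: relative frequencies of the expected counts.
        t = defaultdict(float, {(f, e): c / total[e] for (f, e), c in count.items()})
    return t


t = train_model1(PAIRS, iterations=10)
```

In the first pass every link gets posterior 1/3 in the two-word pairs and 1/4 in the three-word pair, which matches the uniform starting point of the worked example; after a few iterations the table concentrates on pairs such as fleur/flower and bleu/blue.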
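Model 1
[Figure: the three sentence pairs with a [null] word added to each English sentence: “the house [null]” / “la maison”, “the flower [null]” / “la fleur”, “the blue house [null]” / “la maison bleu”]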
Model 1, EM iteration 0
[Figure: alignment link probabilities at initialization. Every link is uniform: 0.33 in “the house” / “la maison” and “the flower” / “la fleur” (three English words including [null]), 0.25 in “the blue house” / “la maison bleu” (four English words including [null])]
Model 1, EM iteration 1
[Figure: alignment link probabilities after EM iteration 1 for the same three sentence pairs. The strongest link in each pair has probability 0.42, 0.60, and 0.45 respectively]
Model 1, EM iteration 2
[Figure: alignment link probabilities after EM iteration 2 for the same three sentence pairs. The strongest link in each pair has probability 0.46, 0.74, and 0.60 respectively]
Model 1, EM iteration 6
[Figure: alignment link probabilities after EM iteration 6 for the same three sentence pairs. The strongest link in each pair has probability 0.64, 0.96, and 0.91 respectively]
IBM Word-based translation (Brown et al., 1993)
• Model P(f | e): French translations given English
[Figure: word alignment between English “[null] I do not speak French” and French “je ne parle pas francais”]
Model 1
• Lots of simplifying assumptions:
  – All lengths are equally likely
    $P(m \mid e) \sim \text{uniform} = \epsilon$
  – All word alignments are equally likely
    $P(a_j \mid a_1^{j-1}, f_1^{j-1}, m, e) \sim \text{uniform} = 1 / (l + 1)$
  – French word depends on English word it’s aligned to
    $P(f_j \mid a_1^{j}, f_1^{j-1}, m, e) \sim \text{multinomial over English words} = t(f_j \mid e_{a_j})$
• Resulting model
  $P(f, a \mid e) = \frac{\epsilon}{(l + 1)^m} \prod_{j=1}^{m} t(f_j \mid e_{a_j})$
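The resulting model can be evaluated directly for a given alignment. The sketch below is illustrative: the function name `model1_joint`, the choice ε = 1, and the two table entries are assumptions made for the example, with alignment position 0 reserved for the `[null]` word.

```python
def model1_joint(f, a, e, t, epsilon=1.0):
    """P(f, a | e) = epsilon / (l + 1)^m * prod_j t(f_j | e_{a_j})."""
    e = ["[null]"] + e            # position 0 is the null word
    l, m = len(e) - 1, len(f)
    prob = epsilon / (l + 1) ** m
    for j, f_word in enumerate(f):
        prob *= t[(f_word, e[a[j]])]
    return prob


# Hypothetical translation table entries, keyed as (french, english).
t = {("la", "the"): 0.5, ("maison", "house"): 0.8}
p = model1_joint(["la", "maison"], [1, 2], ["the", "house"], t)
```

Here l = 2 and m = 2, so the value is 1 / 3² ∙ 0.5 ∙ 0.8 ≈ 0.044. Because the alignment term is a constant, the best alignment under Model 1 can be found independently for each French word.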
A generative story (IBM Models 1-2, HMM)
$P(f, a \mid e) = P(m \mid e) \cdot \prod_{j=1}^{m} \left( P(a_j \mid a_1^{j-1}, f_1^{j-1}, m, e) \cdot P(f_j \mid a_1^{j}, f_1^{j-1}, m, e) \right)$
Exact – chain rule!
• $P(m \mid e)$: pick the length of the French sentence
• $\prod_{j=1}^{m}$: for each position in the French sentence…
• $P(a_j \mid \ldots)$: pick the English word aligned to the French word in that position, then…
• $P(f_j \mid \ldots)$: pick the French word in that position
Notation:
• E, F: English, French vocabularies
• $e = e_1^l = (e_1, \ldots, e_l)$: English sentence, $e_i \in E$
• $f = f_1^m = (f_1, \ldots, f_m)$: French sentence, $f_j \in F$
• $a = a_1^m = (a_1, \ldots, a_m)$: word alignment, $a_j \in [0..l]$
Progression of alignment models
• Models of increasing complexity
  – Only Model 1 is convex
• Models 3, 4, 5 each capture new aspects of the sentence
  – Capture “fertility”
  – Different movement models
  – Each model can initialize its successor – helps avoid local minima
• Freely available tools for this task
  – GIZA++
  – Berkeley aligner
Model  Translation  Distortion  Fertility
1      Yes          ---         ---
2      Yes          Abs         ---
HMM    Yes          Rel         ---
3      Yes          Abs         Yes
4      Yes          Rel         Yes
5      Yes          Rel         Yes
Toy System A’
• Our prior toy system used a uniform distribution for translations
• Now we can plug in Model 1 parameters
• Language model helps pick translations that are fluent
• Translation model helps pick translations that are adequate
• Looks just like an HMM!
Toy System A’
[Figure: search lattice for “weil er gestern gegangen ist”. Rows of translation options: “because he yesterday gone is”, “him left had”, “his are”, “has”. Partial hypothesis scores: <s> 0, because -3.2, he -5.6, him -5.4, his -5.9, yesterday -8.3, gone -9.9, left -10.4, eat -10.3, eat -9.8]
• Each translation is like a part-of-speech tag
• Becomes: Bigram LM + Model 1!
Some questions:
• What about standard translation dictionaries? Should we include them, and how?
• What translation phenomena are we covering and what are we missing?
• Does it work?
Toy System B
• System A: finds better translations in context, but can’t reorder
  “er gestern gegangen ist → he yesterday left had” (should be “he had left yesterday”)
• System B: allow all possible permutations
  – Each hypothesis now remembers:
    • Last target word generated
    • Set of source words already translated
  – 5! = 120 permutations, 10! = 3.6M, 20! = 2.43e18
  – No way we can afford to keep all translations!
• Group into stacks based on count of words covered
  – Histogram pruning: limited number of hypotheses on any stack
  – Threshold pruning: only keep hypotheses within d of best on stack
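The two pruning strategies are simple to state in code. The sketch below assumes a hypothesis is a tuple of (log score, last target word, coverage bit vector) and that higher scores are better; the three example hypotheses reuse scores shown for single-word hypotheses in the search figure, and the function names are chosen here for illustration.

```python
import math

# Number of orderings of n source words grows factorially.
PERMUTATIONS = {n: math.factorial(n) for n in (5, 10, 20)}


def stack_index(coverage):
    """A hypothesis lives on the stack for its number of covered source words."""
    return sum(coverage)


def histogram_prune(stack, max_size):
    """Keep at most max_size hypotheses, best score first."""
    return sorted(stack, key=lambda hyp: hyp[0], reverse=True)[:max_size]


def threshold_prune(stack, d):
    """Keep hypotheses whose score is within d of the best on the stack."""
    best = max(hyp[0] for hyp in stack)
    return [hyp for hyp in stack if hyp[0] >= best - d]


# (score, last target word, coverage over "weil er gestern gegangen ist")
stack_1 = [
    (-3.2, "because", (1, 0, 0, 0, 0)),
    (-3.5, "he", (0, 1, 0, 0, 0)),
    (-1.9, "yesterday", (0, 0, 1, 0, 0)),
]
kept_by_histogram = histogram_prune(stack_1, max_size=2)
kept_by_threshold = threshold_prune(stack_1, d=1.5)
```

Both calls drop the "he" hypothesis here. In practice the two are often combined: the threshold removes clearly hopeless hypotheses and the histogram bounds the worst-case stack size.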
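Toy System B: search
[Figure: stack decoding for “weil er gestern gegangen ist” with Stack 0, Stack 1, and Stack 2. Each hypothesis shows its last target word, score, and a coverage bit vector over the five source words, e.g. <s> 0 (00000); because -3.2 (10000); he -3.5 (01000); yesterday -1.9 (00100). Further hypotheses shown: he -5.8, because -5.2, yesterday -5.6. Rows of translation options: “because he yesterday gone is”, “him left had”, “his are”, “has”]
• Like an expanded Viterbi search, but each hypothesis also needs to remember which source words have been translated already!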
Beyond Toy System B
• Many problems with this system:
  – System allows all possible reorderings, but some are much more likely than others
  – Contextual information is only captured by the target language model, not in the source
• Multiple paths from here:
  – Better word alignment
  – Phrase-based translation: learn bigger translation units – this is crucial!
  – Better reordering models: syntax can help here
Word-based MT results
SRC: 对外经济贸易合作部今天提供的数据表明,今年至十一月中国实际利用外资四百六十九点五九亿美元,其中包括外商直接投资四百点零七亿美元。
REF: According to the data provided today by the Ministry of Foreign Trade and Economic Cooperation, as of November this year, China has actually utilized 46.959 billion US dollars of foreign capital, including 40.007 billion US dollars of direct investment from foreign businessmen.
WB: The Ministry of Foreign Trade and Economic Cooperation, including foreign direct investment 40.007 billion US dollars today provide data include that year to November china actually using 46.959 billion US dollars and
SRC: Le politique de la haine
REF: Politics of hate
WB: The policy of the hatred
SRC: Nous avons signé le protocole.
REF: We did sign the memorandum of agreement.
WB: We have signed the protocol.
SRC: Où était le plan solide?
REF: But where was the solid plan?
WB: Where was the economic base?
Word alignment and phrase extraction (Koehn, Och, Marcu 2003)
[Figure: word alignment grid for “the blue house” and “a casa azul”; phrase pairs are read off the grid one at a time]
• the ↔ a
• blue ↔ azul
• house ↔ casa
• blue house ↔ casa azul
• the blue house ↔ a casa azul
Phrase table
• Extract phrases from all sentence pairs
• Estimate P(src | tgt) with c(src, tgt) / c(tgt)
Portuguese            English                  Prob
ver                   see                      0.533
ver                   view                     0.129
ver                   to see                   0.044
ver                   viewing                  0.009
ver                   seeing                   0.008
ver                   watch                    0.007
…
ver o mundo atravês   view the world through   1.000
ver e adquirir        browse and purchase      1.000
ver ou editar         view or edit             0.875
ver filmes            watch movies             0.667
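Both steps can be sketched together. The extraction below is a simplified version of alignment-consistent phrase extraction: a source span and the target span it links to form a phrase pair only if no word inside the target span is aligned to a word outside the source span (it does not extend spans over unaligned words). The alignment for “the blue house” / “a casa azul” is an assumption read off the phrase pairs listed earlier, and the function names are chosen here for illustration.

```python
from collections import Counter


def extract_phrases(src, tgt, alignment, max_len=3):
    """Return phrase pairs consistent with a set of (src_idx, tgt_idx) links."""
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            linked = [j for (i, j) in alignment if i1 <= i <= i2]
            if not linked:
                continue
            j1, j2 = min(linked), max(linked)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistency: nothing inside the target span may link outside
            # the source span.
            if any(j1 <= j <= j2 and not (i1 <= i <= i2) for (i, j) in alignment):
                continue
            pairs.append((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs


def estimate_phrase_table(phrase_pairs):
    """P(src | tgt) = c(src, tgt) / c(tgt) over all extracted pairs."""
    pair_counts = Counter(phrase_pairs)
    tgt_counts = Counter(tgt for _, tgt in phrase_pairs)
    return {(s, t): c / tgt_counts[t] for (s, t), c in pair_counts.items()}


# Assumed alignment: the-a, blue-azul, house-casa.
alignment = {(0, 0), (1, 2), (2, 1)}
pairs = extract_phrases("the blue house".split(), "a casa azul".split(), alignment)
table = estimate_phrase_table(pairs)
```

Note that “the blue” is not extracted: its target span “a casa azul” contains “casa”, which is aligned to “house” outside the source span. With a single sentence pair every probability is 1.0; values like those in the table only emerge once counts are pooled over a whole corpus.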
Word-based vs. phrase-based (BLEU score vs. training data size)
[Figure: BLEU score (20-30) against training data size (40k, 80k, 160k, 320k) for “Phrases from word alignment” and “Word-based”; from Koehn, Och, and Marcu 2003]
These systems, with sufficient data, produce better translations than rule-based systems… mostly.
Syntax in translation
• Phrases capture contextual translation and local reordering surprisingly well
• However this information is brittle:
  – “author of the book → 本書的作者” tells us nothing about how to translate “author of the pamphlet” or “author of the play”
  – The Chinese phrase “NOUN1 的 NOUN2” becomes “NOUN2 of NOUN1” in English
• No information about global reordering
  – In Chinese, prepositional phrases often come before verbs; in English, they come after
Syntax-based source reordering
• Language is hierarchical – our models should capture this
• Phrasal cohesion (Fox, 2002): most often, each source constituent translates to a contiguous target constituent
• Source parse trees can inform reordering
  – First parse the source sentence
  – Then use information about the source to guide reordering
Wang, Collins, Koehn (2007): Parse the Chinese, reorder like English
Some pertinent rules
Syntax-directed translation
• Begin by parsing source sentence
  – Syntactic analysis can guide reordering and inform translation
• One approach: Treelet translation (Quirk, Menezes, and Cherry, 2005)
  – Use dependency trees: minimal amount of syntactic information (just head node)
Treelet and template extraction
• Start from word aligned sentence pairs: “the blue house” / “a casa azul”
• Parse source: the/DT blue/JJ house/NN
• Project tree: [Figure: source dependency tree projected onto “a casa azul”]
• Extract treelet pairs (treelet: connected subgraph of the dependency tree):
  – the ↔ a
  – blue ↔ azul
  – house ↔ casa
  – blue house ↔ casa azul
  – the blue house ↔ a casa azul
  – the house ↔ a casa
• Extract templates: [Figure: the aligned trees with words replaced by wildcards, keeping only part-of-speech tags on the source side: */DT, */JJ, */NN]
Europarl English-Spanish
[Figure: BLEU (20%-35%) on devtest, in-domain, and out-of-domain sets for Phrasal and Template systems]
Impact of preserving ambiguity
• Start with treelet systems
  – Technical English-German, English-Japanese
  – Newswire Chinese-English
• Translate each of k-best parses independently
• Keep the translation with the best score
• Evaluate using BLEU
parses  EG    EJ    CE
1       33.6  36.0  28.2
2       33.8  36.1  28.5
4       34.1  36.3  28.9
8       34.3  36.6  29.2
16      34.5  36.8  29.7
32      34.8  37.1  30.0
Target language syntax
• If we want a grammatical translation, shouldn’t we use a grammar?
• Use a parser in the target language instead
  – Translation becomes cross-lingual parsing: find the best English parse tree for a Chinese sentence
  – Great for translating into English or other languages with lots of linguistic resources
• Later approaches capture larger synchronous rules at a time (Marcu et al., 2006) and are pretty successful, though somewhat slow in comparison
  – Ongoing research to speed things up
What about morphology?
• Most of these approaches treat words as indivisible units
• Some recent work addresses this problem:
  – Phrasal translations of morpheme sequences (requires morphological segmentation)
Remaining limitations
• Most systems consider only a single sentence at a time
  – What about discourse phenomena?
  – Coreference?
• How do we handle unknown words?
• Where do we get the data?
Thanks!