Search Applications: Machine Translation
Next time: Constraint Satisfaction
Reading for today: See “Machine Translation Paper” under links
Reading for next time: Chapter 5
Homework Questions?
Agenda
Introduction to machine translation
Statistical approaches
Use of parallel data
Alignment
What functions must be optimized?
Comparison of A* and greedy local search (hill climbing) algorithms for translation
How they work
Their performance
Approach to Statistical MT
Translate from past experience
Observe how words, phrases, and sentences are translated
Given new sentences in the source language, choose the most probable translation in the target language
Data: large corpus of parallel text
E.g., Canadian parliamentary proceedings
Data
Example: Ce n’est pas clair. → It is not clear.
Quantity: 200 billion words (2004 MT evaluation)
Sources
Hansards: Canadian parliamentary proceedings
Hong Kong: official documents published in multiple languages
Newspapers published in multiple languages
Religious and literary works
Alignment – the first step
Which sentences or paragraphs in one language correspond to which paragraphs or sentences in another language? (Or what words?)
Problems
Translators don’t use word-for-word translations
Crossing alignments
Types of alignment
1:1 (90% of the cases)
1:2, 2:1
3:1, 1:3
Word-for-word gloss: With regard to [the] mineral waters and the lemonades-soft drinks [they encounter still more users]. Indeed [our survey] makes stand out [the sales] [clearly superior] [to those in 1987] for [cola-based drinks] especially.

French: Quant [aux eaux minerales et aux limonades], [elles rencontrent toujours plus d’adeptes]. En effet [notre sondage] fait ressortir [des ventes] [nettement superieures] [a celles de 1987] pour [les boissons a base de cola] notamment.

Translation: According to [our survey], 1988 [sales] of [mineral water and soft drinks] were [much higher] [than in 1987], reflecting [the growing popularity] of these products. [Cola drink] manufacturers [in particular] achieved above average growth rates.

An example of 2:2 alignment
Fertility: a word may be translated by more than 1 word
Notamment -> in particular (fertility 2)
Limonades -> soft drinks
Fertility 0: a word translated by 0 words
Des ventes -> sales
Les boissons a base de cola -> cola drinks
Many to many: Elles rencontrent toujours plus d’adeptes -> The growing popularity
Bead for sentence alignment
A group of sentences in one language that corresponds in content to some group of sentences in the other language
Either group can be empty
How much content has to overlap between sentences to count as an alignment?
An overlapping clause can be sufficient
Methods for alignment
Length based
Offset alignment
Word based
Anchors (e.g., cognates)
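The length-based method can be sketched as a small dynamic program over "beads" (1:1, 1:2, 2:1). This is a simplified stand-in for the Gale & Church approach: the cost here is a crude length-ratio penalty, not their Gaussian length model, and the sentence lengths are invented.

```python
# Length-based sentence alignment sketch: align by sentence length alone,
# allowing 1:1, 1:2, and 2:1 beads. Simplified stand-in for Gale & Church.
def align_by_length(src, tgt):
    """src, tgt: lists of sentence lengths (e.g., in characters).
    Returns a list of beads ((src indices), (tgt indices))."""
    INF = float("inf")
    n, m = len(src), len(tgt)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]  # cost[i][j]: best cost of src[:i] vs tgt[:j]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    def match(ls, lt):
        # Crude penalty for mismatched total lengths.
        return abs(ls - lt) / (ls + lt + 1)

    beads = [(1, 1), (1, 2), (2, 1)]  # (src sentences, tgt sentences) per bead
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            for di, dj in beads:
                if i + di <= n and j + dj <= m:
                    c = cost[i][j] + match(sum(src[i:i + di]), sum(tgt[j:j + dj]))
                    if c < cost[i + di][j + dj]:
                        cost[i + di][j + dj] = c
                        back[i + di][j + dj] = (di, dj)
    # Trace back the bead sequence.
    result, i, j = [], n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        result.append((tuple(range(i - di, i)), tuple(range(j - dj, j))))
        i, j = i - di, j - dj
    return list(reversed(result))

# Two source sentences; the second matches two shorter target sentences (a 1:2 bead).
print(align_by_length([30, 60], [29, 31, 28]))
```

With these toy lengths the aligner pairs the first sentences 1:1 and the long second source sentence with the two remaining target sentences.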
Word Based Alignment
Assume the first and last sentences of the texts align (anchors)
Then, until most sentences are aligned:
Form an envelope of alignments from the Cartesian product of the lists of sentences
Exclude alignments that cross anchors or are too distant
Choose pairs of words that tend to occur in alignments
Find pairs of source and target sentences which contain many possible lexical correspondences
The most reliable pairs augment the set of anchors
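The lexical-correspondence step can be sketched by counting which word pairs co-occur in already-aligned sentence pairs and keeping frequent pairs as candidate anchors. The data and the threshold here are invented for illustration; a real system would use association statistics rather than raw counts.

```python
# Count word-pair co-occurrence across aligned sentence pairs and keep
# pairs that co-occur at least min_count times as candidate correspondences.
from collections import Counter

def cooccurrence_pairs(aligned_pairs, min_count=2):
    counts = Counter()
    for src_sent, tgt_sent in aligned_pairs:
        for s in set(src_sent.split()):
            for t in set(tgt_sent.split()):
                counts[(s, t)] += 1
    return {pair for pair, c in counts.items() if c >= min_count}

pairs = cooccurrence_pairs([
    ("il est clair", "it is clear"),
    ("ce n'est pas clair", "it is not clear"),
])
print(("clair", "clear") in pairs)
```

With only two sentence pairs, noisy pairs such as ("clair", "it") also survive the threshold; real systems filter these out with association measures and larger corpora.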
The Noisy Channel Model for MT
Language Model: P(e)
Translation Model: P(f|e)
Decoder: e' = argmax_e P(e|f) = argmax_e P(e) P(f|e)
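By Bayes' rule, maximizing P(e|f) is equivalent to maximizing P(e)·P(f|e), so the decoder combines the language model and translation model. A toy illustration, with all probabilities invented:

```python
# Toy noisy-channel decision rule: e' = argmax_e P(e) * P(f|e).
# Both candidate translations and all probabilities are made up.
lm = {"it is not clear": 0.004, "it not is clear": 0.0001}  # P(e), language model
tm = {  # P(f|e) for the French input "ce n'est pas clair"
    "it is not clear": 0.3,
    "it not is clear": 0.4,  # the word-for-word order scores higher under the TM alone
}
best = max(lm, key=lambda e: lm[e] * tm[e])
print(best)
```

Even though the glossy word order wins under the translation model alone, the language model vetoes it, so the product picks the grammatical sentence.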
The problem
Language model constructed from a large corpus of English
Bigram model: probability of word pairs
Trigram model: probability of 3 words in a row
From these, compute sentence probability
Translation model can be derived from alignment
For any pair of English/French words, what is the probability that pair is a translation?
Decoding is the problem: Given an unseen French sentence, how do we determine the translation?
Language Model
Predict the next word given the previous words
P(w_n | w_1 ... w_{n-1})
Markov assumption: only the last few words affect the next word
Usual cases: bigram, trigram, 4-gram
Sue swallowed the large green ...
Parameter estimation
Bigram: 20,000 x 19,000 ≈ 400 million
Trigram: 20,000^2 x 19,000 ≈ 8 trillion
4-gram: 20,000^3 x 19,000 ≈ 1.6 x 10^17
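The bigram case can be sketched with maximum-likelihood counts over a toy corpus; the corpus and test sentence are invented, and a real model would also need smoothing for unseen pairs.

```python
# Minimal bigram language model with maximum-likelihood estimates,
# illustrating the Markov assumption P(w_n | w_1..w_{n-1}) ≈ P(w_n | w_{n-1}).
from collections import Counter

corpus = "sue swallowed the large green pill . the large dog barked .".split()
bigrams = Counter(zip(corpus, corpus[1:]))  # counts of adjacent word pairs
unigrams = Counter(corpus)

def p_bigram(prev, word):
    # MLE: count(prev, word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(words):
    # Product of bigram probabilities over the sentence.
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= p_bigram(prev, word)
    return p

print(p_bigram("the", "large"))                      # "large" follows "the" in 2 of 2 cases
print(sentence_prob("the large green".split()))
```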
Translation Model
For a particular word alignment, multiply the m translation probabilities:
P(Jean aime Marie | John loves Mary) = P(Jean|John) x P(aime|loves) x P(Marie|Mary)
Then sum the probabilities of all alignments
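The two steps above can be sketched directly: multiply word-pair probabilities for one alignment, then sum over alignments. For simplicity this toy restricts alignments to one-to-one permutations (ignoring fertility), and the word-pair probabilities are invented; a real system estimates them from aligned data, as in the IBM models.

```python
# Sketch of P(f|e): sum over one-to-one word alignments of the product
# of word-pair translation probabilities. All probabilities are toys.
from itertools import permutations

t = {  # t(french_word | english_word), invented values
    ("jean", "john"): 0.9, ("aime", "loves"): 0.8, ("marie", "mary"): 0.9,
}

def p_f_given_e(french, english):
    total = 0.0
    for perm in permutations(range(len(english))):  # each one-to-one alignment
        p = 1.0
        for j, i in enumerate(perm):
            p *= t.get((french[j], english[i]), 1e-6)  # tiny floor for unseen pairs
        total += p
    return total

print(p_f_given_e(["jean", "aime", "marie"], ["john", "loves", "mary"]))
```

The identity alignment contributes 0.9 x 0.8 x 0.9 = 0.648; the five crossing alignments contribute almost nothing because they multiply in the tiny floor probability.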
Decoding is NP-complete
When considering any word re-ordering:
Swapped words
Words with fertility > n (insertions)
Words with fertility 0 (deletions)
Usual strategy: examine a subset of likely possibilities and choose from that
Search error: decoder returns e' but there exists some e s.t. P(e|f) > P(e'|f)
Example Decoding Errors
Search error
Permettez que je donne un exemple à la chambre.
Let me give the House one example.
Let me give an example in the House.

Model error
Vous avez besoin de toute l’aide disponible.
You need all the help you can get.
You need of the whole benefits available.
Search
Traditional decoding method: stack decoder
A* algorithm
Deeply explores each hypothesis

Fast greedy algorithm
Much faster than A*
How often does it fail?

Integer programming method
Transforms to Traveling Salesman (see paper)
Very slow
Guaranteed to find the best choice
Large branching factors: Machine Translation
Input: sequence of n words, each with up to 200 possible target word translations.
Output: sequence of m words in the target language that has high score under some goodness criterion.
Search space: a 6-word French sentence has 10^300 distinct translation scores under the IBM M4 translation model. [Soricut, Knight, Marcu, AMTA’2002]
Stack decoder: A*
Initialize the stack with an empty hypothesis
Loop:
  Pop h, the best hypothesis off the stack
  If h is a complete sentence, output h and terminate
  For each possible next word w, extend h by adding w and push the resulting hypothesis onto the stack
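The loop above can be made runnable with a priority queue. This is a toy: it uses a single stack rather than one per hypothesis length, and the scoring function is an invented stand-in for the language- and translation-model score a real decoder would use.

```python
# Runnable sketch of the stack (priority-queue) decoder loop.
import heapq

def stack_decode(words, score, max_len):
    """Grow hypotheses word by word, always expanding the best one."""
    heap = [(-0.0, [])]                  # (negated score, hypothesis); empty start
    while heap:
        neg, hyp = heapq.heappop(heap)   # pop h, the best hypothesis
        if len(hyp) == max_len:          # h is complete: output it
            return hyp
        for w in words:                  # extend h by each possible next word w
            new = hyp + [w]
            heapq.heappush(heap, (-score(new), new))
    return None

# Toy score: reward words placed in the reference order "it is not clear".
ref = ["it", "is", "not", "clear"]
def score(hyp):
    return sum(1.0 for i, w in enumerate(hyp) if i < len(ref) and w == ref[i])

print(stack_decode(["clear", "is", "it", "not"], score, 4))
```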
Complications
It’s not a simple left-to-right translation
Because we multiply probabilities as we add words, shorter hypotheses will always win
Use multiple stacks, one for each length
Given fertility possibilities, when we add a new target word for an input source word, how many do we add?
Example
Hill climbing

function HillClimbing(problem, initial-state, queuing-fn)
  node ← MakeNode(initial-state(problem))
  while T do
    next ← Best(SearchOperator-fn(node, cost-fn))
    if (IsBetter-fn(next, node)) then node ← next; continue
    else if (GoalTest(node)) then return node
    else exit
  end while
  return Failure
MT (Germann et al., ACL-2001)
node ← targetGloss(sourceSentence)
while T do
  next ← Best(LocallyModifiedTranslationOf(node))
  if (IsBetter(next, node)) then node ← next; continue
  else print node; exit
end while
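The greedy loop can be made concrete as follows. Only a swap operator is implemented (standing in for the full set of local modifications), and the scoring function is a toy stand-in for P(e|f): it just counts words already in a reference position.

```python
# Runnable sketch of the greedy loop: start from an initial word order
# (standing in for the target gloss), move to the best local modification,
# and stop at a local optimum.
def hill_climb(node, neighbors, score):
    while True:
        next_node = max(neighbors(node), key=score, default=node)
        if score(next_node) > score(node):
            node = next_node          # IsBetter: keep climbing
        else:
            return node               # local optimum: stop and output node

def swaps(words):
    # Swap operator restricted to single word pairs.
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            out = list(words)
            out[i], out[j] = out[j], out[i]
            yield tuple(out)

ref = ("better", "bart", "than", "madonna")
score = lambda h: sum(a == b for a, b in zip(h, ref))  # toy stand-in for P(e|f)
print(hill_climb(("than", "bart", "better", "madonna"), swaps, score))
```

With a richer operator set and a real model score, this is the shape of the Germann et al. decoder; with a poor operator set it can stop on a local optimum ("climbing the wrong peak").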
Types of changes
Translate one or two words (j1e1j2e2)
Translate and insert (j e1 e2)
Remove word of fertility 0 (i)
Swap segments (i1 i2 j1 j2)
Join words (i1 i2)
Example
Total of 77,421 possible translations attempted
How to search better?
MakeNode(initial-state(problem))
RemoveFront(Q)
SearchOperator-fn(node, cost-fn);
queuing-fn(problem, Q, (Next,Cost));
Example 1: Greedy Search
MakeNode(initial-state(problem))
Machine Translation (Marcu and Wong, EMNLP-2002)
node ← targetGloss(sourceSentence)
while T do
  next ← Best(LocallyModifiedTranslationOf(node))
  if (IsBetter(next, node)) then node ← next; continue
  else print node; exit
end while
[Chart: translation scores for IBM vs. JM decoding, under p(E|F) gloss and p(E,F) gloss]
Climbing the wrong peak
Which sentence is more grammatical?
1. better bart than madonna , i say
2. i say better than bart madonna ,
Can you make a sentence with these words? a and apparently as be could dissimilar firing identical neural really so things thought two
Model validation
Model stress-testing
Language-model stress-testing
Input: bag of words
Output: best sequence according to a linear combination of:
  an n-gram LM
  a syntax-based LM (Collins, 1997)
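The stress test can be sketched by exhaustively scoring every ordering of the bag with a linear combination of two models. Both component "models" here are invented toys (a bigram whitelist and a crude first-word check), not a real n-gram or Collins-style LM, and exhaustive permutation only works for small bags.

```python
# Bag-of-words stress test sketch: score every permutation with a
# linear combination of two toy models and keep the best.
from itertools import permutations

def best_order(bag, lm1, lm2, w1=0.5, w2=0.5):
    return max(permutations(bag), key=lambda s: w1 * lm1(s) + w2 * lm2(s))

# Toy "n-gram LM": count adjacent pairs that appear on a whitelist.
bigram_whitelist = {("i", "say"), ("say", "better"), ("better", "bart"),
                    ("bart", "than"), ("than", "madonna")}
lm1 = lambda s: sum(p in bigram_whitelist for p in zip(s, s[1:]))
# Toy "syntax-based LM": prefer sentences that start with the subject.
lm2 = lambda s: float(s[0] == "i")

print(" ".join(best_order(["bart", "better", "i", "madonna", "say", "than"], lm1, lm2)))
```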
Size: 10-25 words long
[Chart: ID, E, search errors, and model errors for NGLM, NGLM+SBLM, NGLM+SBLM*]
Best searched
• 51.6: and so could really be a neural apparently thought things as dissimilar firing two identical
Original word order
• 64.3: could two things so apparently dissimilar as a thought and neural firing really be identical

Size: 3-7 words long
[Chart: ID, E, search errors, and model errors for NGLM, NGLM+SBLM, NGLM+SBLM*]
Best searched
• 32.3: i say better than bart madonna ,
Original word order
• 41.6: better bart than madonna , i say
SBLM*: trained on an additional 160k WSJ sentences.
End of Class Questions