Search Applications: Machine Translation
Next time: Constraint Satisfaction
Reading for today: See “Machine Translation Paper” under links
Reading for next time: Chapter 5
Homework Questions?
Agenda
Introduction to machine translation
Statistical approaches
Use of parallel data
Alignment
What functions must be optimized?
Comparison of A* and greedy local search (hill climbing) algorithms for translation
How they work
Their performance
Approach to Statistical MT
Translate from past experience
Observe how words, phrases, and sentences are translated
Given new sentences in the source language, choose the most probable translation in the target language
Data: large corpus of parallel text
E.g., Canadian parliamentary proceedings
Data
Example: Ce n’est pas clair. → It is not clear.
Quantity: 200 billion words (2004 MT evaluation)
Sources
Hansards: Canadian parliamentary proceedings
Hong Kong: official documents published in multiple languages
Newspapers published in multiple languages
Religious and literary works
Alignment – the first step
Which sentences or paragraphs in one language correspond to which paragraphs or sentences in another language? (Or what words?)
Problems
Translators don’t use word-for-word translations
Crossing alignments
Types of alignment
1:1 (90% of the cases)
1:2, 2:1
3:1, 1:3
Word-for-word gloss: With regard to [the] mineral waters and the lemonades-soft drinks [they encounter still more users]. Indeed [our survey] makes stand out [the sales] [clearly superior] [to those in 1987] for [cola-based drinks] especially.

French: Quant [aux eaux minerales et aux limonades], [elles rencontrent toujours plus d’adeptes]. En effet [notre sondage] fait ressortir [des ventes] [nettement superieures] [a celles de 1987] pour [les boissons a base de cola] notamment.

Translation: According to [our survey], 1988 [sales] of [mineral water and soft drinks] were [much higher] [than in 1987], reflecting [the growing popularity] of these products. [Cola drink] manufacturers [in particular] achieved above average growth rates.

An example of 2:2 alignment
Fertility: a word may be translated by more than 1 word
Notamment -> in particular (fertility 2)
Limonades -> soft drinks
Fertility 0: a word translated by 0 words
Des ventes -> sales
Les boissons a base de cola -> cola drinks
Many to many: Elles rencontrent toujours plus d’adeptes -> The growing popularity
Bead for sentence alignment
A group of sentences in one language that corresponds in content to some group of sentences in the other language
Either group can be empty
How much content has to overlap between sentences to count as an alignment?
An overlapping clause can be sufficient
Methods for alignment
Length based
Offset alignment
Word based
Anchors (e.g., cognates)
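The length-based method can be sketched as a small dynamic program over "beads" (1:1, 1:2, 2:1). This is a simplified stand-in for the Gale & Church approach: the cost here is a crude length-ratio penalty, not their Gaussian length model, and the sentence lengths are invented.

```python
# Length-based sentence alignment sketch: align by sentence length alone,
# allowing 1:1, 1:2, and 2:1 beads. Simplified stand-in for Gale & Church.
def align_by_length(src, tgt):
    """src, tgt: lists of sentence lengths (e.g., in characters).
    Returns a list of beads ((src indices), (tgt indices))."""
    INF = float("inf")
    n, m = len(src), len(tgt)
    cost = [[INF] * (m + 1) for _ in range(n + 1)]  # cost[i][j]: best cost of src[:i] vs tgt[:j]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0

    def match(ls, lt):
        # Crude penalty for mismatched total lengths.
        return abs(ls - lt) / (ls + lt + 1)

    beads = [(1, 1), (1, 2), (2, 1)]  # (src sentences, tgt sentences) per bead
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == INF:
                continue
            for di, dj in beads:
                if i + di <= n and j + dj <= m:
                    c = cost[i][j] + match(sum(src[i:i + di]), sum(tgt[j:j + dj]))
                    if c < cost[i + di][j + dj]:
                        cost[i + di][j + dj] = c
                        back[i + di][j + dj] = (di, dj)
    # Trace back the bead sequence.
    result, i, j = [], n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        result.append((tuple(range(i - di, i)), tuple(range(j - dj, j))))
        i, j = i - di, j - dj
    return list(reversed(result))

# Two source sentences; the second matches two shorter target sentences (a 1:2 bead).
print(align_by_length([30, 60], [29, 31, 28]))
```

With these toy lengths the aligner pairs the first sentences 1:1 and the long second source sentence with the two remaining target sentences.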
Word Based Alignment
Assume the first and last sentences of the texts align (anchors)
Then, until most sentences are aligned:
Form an envelope of alignments from the Cartesian product of the lists of sentences
Exclude alignments that cross anchors or are too distant
Choose pairs of words that tend to occur in alignments
Find pairs of source and target sentences which contain many possible lexical correspondences
The most reliable pairs augment the set of anchors
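The lexical-correspondence step can be sketched by counting which word pairs co-occur in already-aligned sentence pairs and keeping frequent pairs as candidate anchors. The data and the threshold here are invented for illustration; a real system would use association statistics rather than raw counts.

```python
# Count word-pair co-occurrence across aligned sentence pairs and keep
# pairs that co-occur at least min_count times as candidate correspondences.
from collections import Counter

def cooccurrence_pairs(aligned_pairs, min_count=2):
    counts = Counter()
    for src_sent, tgt_sent in aligned_pairs:
        for s in set(src_sent.split()):
            for t in set(tgt_sent.split()):
                counts[(s, t)] += 1
    return {pair for pair, c in counts.items() if c >= min_count}

pairs = cooccurrence_pairs([
    ("il est clair", "it is clear"),
    ("ce n'est pas clair", "it is not clear"),
])
print(("clair", "clear") in pairs)
```

With only two sentence pairs, noisy pairs such as ("clair", "it") also survive the threshold; real systems filter these out with association measures and larger corpora.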
The Noisy Channel Model for MT
Language Model: P(e)
Translation Model: P(f|e)
Decoder: e' = argmax_e P(e|f) = argmax_e P(e) P(f|e)
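By Bayes' rule, maximizing P(e|f) is equivalent to maximizing P(e)·P(f|e), so the decoder combines the language model and translation model. A toy illustration, with all probabilities invented:

```python
# Toy noisy-channel decision rule: e' = argmax_e P(e) * P(f|e).
# Both candidate translations and all probabilities are made up.
lm = {"it is not clear": 0.004, "it not is clear": 0.0001}  # P(e), language model
tm = {  # P(f|e) for the French input "ce n'est pas clair"
    "it is not clear": 0.3,
    "it not is clear": 0.4,  # the word-for-word order scores higher under the TM alone
}
best = max(lm, key=lambda e: lm[e] * tm[e])
print(best)
```

Even though the glossy word order wins under the translation model alone, the language model vetoes it, so the product picks the grammatical sentence.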
The problem
Language model constructed from a large corpus of English
Bigram model: probability of word pairs
Trigram model: probability of 3 words in a row
From these, compute sentence probability
Translation model can be derived from alignment
For any pair of English/French words, what is the probability that pair is a translation?
Decoding is the problem: Given an unseen French sentence, how do we determine the translation?
Language Model
Predict the next word given the previous words
P(w_n | w_1 ... w_{n-1})
Markov assumption: only the last few words affect the next word
Usual cases: bigram, trigram, 4-gram
Sue swallowed the large green ...
Parameter estimation
Bigram: 20,000 x 19,000 ≈ 400 million
Trigram: 20,000^2 x 19,000 ≈ 8 trillion
4-gram: 20,000^3 x 19,000 ≈ 1.6 x 10^17
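The bigram case can be sketched with maximum-likelihood counts over a toy corpus; the corpus and test sentence are invented, and a real model would also need smoothing for unseen pairs.

```python
# Minimal bigram language model with maximum-likelihood estimates,
# illustrating the Markov assumption P(w_n | w_1..w_{n-1}) ≈ P(w_n | w_{n-1}).
from collections import Counter

corpus = "sue swallowed the large green pill . the large dog barked .".split()
bigrams = Counter(zip(corpus, corpus[1:]))  # counts of adjacent word pairs
unigrams = Counter(corpus)

def p_bigram(prev, word):
    # MLE: count(prev, word) / count(prev)
    return bigrams[(prev, word)] / unigrams[prev]

def sentence_prob(words):
    # Product of bigram probabilities over the sentence.
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= p_bigram(prev, word)
    return p

print(p_bigram("the", "large"))                      # "large" follows "the" in 2 of 2 cases
print(sentence_prob("the large green".split()))
```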
Translation Model
For a particular word alignment, multiply the m translation probabilities:
P(Jean aime Marie | John loves Mary) = P(Jean|John) x P(aime|loves) x P(Marie|Mary)
Then sum the probabilities of all alignments
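The two steps above can be sketched directly: multiply word-pair probabilities for one alignment, then sum over alignments. For simplicity this toy restricts alignments to one-to-one permutations (ignoring fertility), and the word-pair probabilities are invented; a real system estimates them from aligned data, as in the IBM models.

```python
# Sketch of P(f|e): sum over one-to-one word alignments of the product
# of word-pair translation probabilities. All probabilities are toys.
from itertools import permutations

t = {  # t(french_word | english_word), invented values
    ("jean", "john"): 0.9, ("aime", "loves"): 0.8, ("marie", "mary"): 0.9,
}

def p_f_given_e(french, english):
    total = 0.0
    for perm in permutations(range(len(english))):  # each one-to-one alignment
        p = 1.0
        for j, i in enumerate(perm):
            p *= t.get((french[j], english[i]), 1e-6)  # tiny floor for unseen pairs
        total += p
    return total

print(p_f_given_e(["jean", "aime", "marie"], ["john", "loves", "mary"]))
```

The identity alignment contributes 0.9 x 0.8 x 0.9 = 0.648; the five crossing alignments contribute almost nothing because they multiply in the tiny floor probability.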
Decoding is NP-complete
When considering any word re-ordering:
Swapped words
Words with fertility > n (insertions)
Words with fertility 0 (deletions)
Usual strategy: examine a subset of likely possibilities and choose from that
Search error: decoder returns e' but there exists some e s.t. P(e|f) > P(e'|f)
Example Decoding Errors
Search error
Permettez que je donne un exemple à la chambre.
Let me give the House one example.
Let me give an example in the House.

Model error
Vous avez besoin de toute l’aide disponible.
You need all the help you can get.
You need of the whole benefits available.
Search
Traditional decoding method: stack decoder
A* algorithm
Deeply explores each hypothesis

Fast greedy algorithm
Much faster than A*
How often does it fail?

Integer programming method
Transforms to Traveling Salesman (see paper)
Very slow
Guaranteed to find the best choice
Large branching factors: Machine Translation
Input: sequence of n words, each with up to 200 possible target word translations.
Output: sequence of m words in the target language that has high score under some goodness criterion.
Search space: a 6-word French sentence has 10^300 distinct translation scores under the IBM M4 translation model. [Soricut, Knight, Marcu, AMTA’2002]
Stack decoder: A*
Initialize the stack with an empty hypothesis
Loop:
  Pop h, the best hypothesis off the stack
  If h is a complete sentence, output h and terminate
  For each possible next word w, extend h by adding w and push the resulting hypothesis onto the stack
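The loop above can be made runnable with a priority queue. This is a toy: it uses a single stack rather than one per hypothesis length, and the scoring function is an invented stand-in for the language- and translation-model score a real decoder would use.

```python
# Runnable sketch of the stack (priority-queue) decoder loop.
import heapq

def stack_decode(words, score, max_len):
    """Grow hypotheses word by word, always expanding the best one."""
    heap = [(-0.0, [])]                  # (negated score, hypothesis); empty start
    while heap:
        neg, hyp = heapq.heappop(heap)   # pop h, the best hypothesis
        if len(hyp) == max_len:          # h is complete: output it
            return hyp
        for w in words:                  # extend h by each possible next word w
            new = hyp + [w]
            heapq.heappush(heap, (-score(new), new))
    return None

# Toy score: reward words placed in the reference order "it is not clear".
ref = ["it", "is", "not", "clear"]
def score(hyp):
    return sum(1.0 for i, w in enumerate(hyp) if i < len(ref) and w == ref[i])

print(stack_decode(["clear", "is", "it", "not"], score, 4))
```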
Complications
It’s not a simple left-to-right translation
Because we multiply probabilities as we add words, shorter hypotheses will always win
Use multiple stacks, one for each length
Given fertility possibilities, when we add a new target word for an input source word, how many do we add?
Example
Hill climbing

function HillClimbing(problem, initial-state, queuing-fn)
  node ← MakeNode(initial-state(problem))
  while T do
    next ← Best(SearchOperator-fn(node, cost-fn))
    if (IsBetter-fn(next, node)) then node ← next; continue
    else if (GoalTest(node)) then return node
    else exit
  end while
  return Failure
MT (Germann et al., ACL-2001)
node ← targetGloss(sourceSentence)
while T do
  next ← Best(LocallyModifiedTranslationOf(node))
  if (IsBetter(next, node)) then node ← next; continue
  else print node; exit
end while
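The greedy loop can be made concrete as follows. Only a swap operator is implemented (standing in for the full set of local modifications), and the scoring function is a toy stand-in for P(e|f): it just counts words already in a reference position.

```python
# Runnable sketch of the greedy loop: start from an initial word order
# (standing in for the target gloss), move to the best local modification,
# and stop at a local optimum.
def hill_climb(node, neighbors, score):
    while True:
        next_node = max(neighbors(node), key=score, default=node)
        if score(next_node) > score(node):
            node = next_node          # IsBetter: keep climbing
        else:
            return node               # local optimum: stop and output node

def swaps(words):
    # Swap operator restricted to single word pairs.
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            out = list(words)
            out[i], out[j] = out[j], out[i]
            yield tuple(out)

ref = ("better", "bart", "than", "madonna")
score = lambda h: sum(a == b for a, b in zip(h, ref))  # toy stand-in for P(e|f)
print(hill_climb(("than", "bart", "better", "madonna"), swaps, score))
```

With a richer operator set and a real model score, this is the shape of the Germann et al. decoder; with a poor operator set it can stop on a local optimum ("climbing the wrong peak").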
Types of changes
Translate one or two words (j1e1j2e2)
Translate and insert (j e1 e2)
Remove word of fertility 0 (i)
Swap segments (i1 i2 j1 j2)
Join words (i1 i2)
Example
Total of 77,421 possible translations attempted
How to search better?
MakeNode(initial-state(problem))
RemoveFront(Q)
SearchOperator-fn(node, cost-fn);
queuing-fn(problem, Q, (Next,Cost));
Example 1: Greedy Search
MakeNode(initial-state(problem))
Machine Translation (Marcu and Wong, EMNLP-2002)
node ← targetGloss(sourceSentence)
while T do
  next ← Best(LocallyModifiedTranslationOf(node))
  if (IsBetter(next, node)) then node ← next; continue
  else print node; exit
end while
[Chart: translation scores for IBM vs. JM decoding, under p(E|F) gloss and p(E,F) gloss]
Climbing the wrong peak
Which sentence is more grammatical?
1. better bart than madonna , i say
2. i say better than bart madonna ,
Can you make a sentence with these words? a and apparently as be could dissimilar firing identical neural really so things thought two
Model validation
Model stress-testing
Language-model stress-testing
Input: bag of words
Output: best sequence according to a linear combination of:
  an n-gram LM
  a syntax-based LM (Collins, 1997)
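The stress test can be sketched by exhaustively scoring every ordering of the bag with a linear combination of two models. Both component "models" here are invented toys (a bigram whitelist and a crude first-word check), not a real n-gram or Collins-style LM, and exhaustive permutation only works for small bags.

```python
# Bag-of-words stress test sketch: score every permutation with a
# linear combination of two toy models and keep the best.
from itertools import permutations

def best_order(bag, lm1, lm2, w1=0.5, w2=0.5):
    return max(permutations(bag), key=lambda s: w1 * lm1(s) + w2 * lm2(s))

# Toy "n-gram LM": count adjacent pairs that appear on a whitelist.
bigram_whitelist = {("i", "say"), ("say", "better"), ("better", "bart"),
                    ("bart", "than"), ("than", "madonna")}
lm1 = lambda s: sum(p in bigram_whitelist for p in zip(s, s[1:]))
# Toy "syntax-based LM": prefer sentences that start with the subject.
lm2 = lambda s: float(s[0] == "i")

print(" ".join(best_order(["bart", "better", "i", "madonna", "say", "than"], lm1, lm2)))
```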
Size: 10-25 words long
[Chart: ID, E, search errors, and model errors for NGLM, NGLM+SBLM, NGLM+SBLM*]
Best searched
• 51.6: and so could really be a neural apparently thought things as dissimilar firing two identical
Original word order
• 64.3: could two things so apparently dissimilar as a thought and neural firing really be identical

Size: 3-7 words long
[Chart: ID, E, search errors, and model errors for NGLM, NGLM+SBLM, NGLM+SBLM*]
Best searched
• 32.3: i say better than bart madonna ,
Original word order
• 41.6: better bart than madonna , i say
SBLM*: trained on an additional 160k WSJ sentences.
End of Class Questions