Machine Translation Domain Adaptation

Day 19

2

PROJECT #2

MEMM tools

• Online description of project #2 has been updated with more information

Quick walk through

training.txt
    I/PRP left/VBD ./.
    John/NNP arrived/VBD ./.

training.feats
    PRP w0=I:1 w-1=<s>:1
    VBD w0=left:1 w-1=I:1
    . w0=.:1 w-1=left:1

    NNP w0=John:1 w-1=<s>:1
    VBD w0=arrived:1 w-1=John:1
    . w0=.:1 w-1=arrived:1

You write code to convert this to features!

“featurize.pl training.txt training.feats”
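The featurization step is yours to write. As a minimal sketch only (in Python rather than Perl, and not the project's reference solution), the conversion shown above could look like the following. The function names `featurize_sentence` and `featurize` are hypothetical, and the sketch assumes one sentence per input line and a blank line between sentences in the output.

```python
def featurize_sentence(line):
    """Turn one 'word/TAG word/TAG ...' line into 'TAG w0=...:1 w-1=...:1' rows."""
    rows = []
    prev_word = "<s>"  # sentence-start marker used as the previous word
    for token in line.split():
        # Split on the LAST slash so that tokens like './.' work.
        word, tag = token.rsplit("/", 1)
        rows.append(f"{tag} w0={word}:1 w-1={prev_word}:1")
        prev_word = word
    return rows


def featurize(text):
    """Featurize every non-empty line; sentences are separated by a blank line (assumed)."""
    blocks = ["\n".join(featurize_sentence(line))
              for line in text.splitlines() if line.strip()]
    return "\n\n".join(blocks) + "\n"
```

The same function would be applied unchanged to the test data, so that training and test features are built identically.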


trigram.model
    <binary gobbledygook>

Run memm_train to train this model

“memm_train --input training.feats --classifier trigram.model --markovOrder 2”


test.txt
    he/PRP arrived/VBD ./.
    John/NNP left/VBD ./.

Get some unseen test data…


test.feats
    PRP w0=he:1 w-1=<s>:1
    VBD w0=arrived:1 w-1=he:1
    . w0=.:1 w-1=arrived:1

    NNP w0=John:1 w-1=<s>:1
    VBD w0=left:1 w-1=John:1
    . w0=.:1 w-1=left:1

Use the same featurization code on test data

“featurize.pl test.txt test.feats”


test.tags
    PRP
    VBD
    .

    NNP
    VBD
    .

memm_test predicts tags (memm_test ignores first column; can include true tags)

“memm_test --input test.feats --classifier trigram.model --markovOrder 2 --output test.tags”

MEMM features

Actual features used by MEMM
    PRP w0=I:1 w-1=<s>:1 t[-1]=<s>:1 t[-1]=<s>,t[-2]=<s>:1
    VBD w0=left:1 w-1=I:1 t[-1]=PRP:1 t[-1]=PRP,t[-2]=<s>:1
    . w0=.:1 w-1=left:1 t[-1]=VBD:1 t[-1]=VBD,t[-2]=PRP:1
    <s> t[-1]=.:1 t[-1]=.,t[-2]=VBD:1
    NNP w0=John:1 w-1=<s>:1 t[-1]=<s>:1 t[-1]=<s>,t[-2]=<s>:1
    VBD w0=arrived:1 w-1=John:1 t[-1]=NNP:1 t[-1]=NNP,t[-2]=<s>:1
    . w0=.:1 w-1=arrived:1 t[-1]=VBD:1 t[-1]=VBD,t[-2]=NNP:1
    <s> t[-1]=.:1 t[-1]=.,t[-2]=VBD:1

You provide the word features (w0=…, w-1=…) and add the argument “--markovOrder 2”.

The MEMM adds the features about tag context (t[-1]=…, t[-2]=…) at training and test time.
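To make the tag-context step concrete, here is a rough sketch that reproduces the "actual features" listing from your own feature rows. It is not the memm_train implementation. The function name `add_tag_context` is hypothetical, and the end-of-sentence `<s>` row is included only because the listing shows one.

```python
def add_tag_context(rows, order=2):
    """rows: list of (tag, [feature strings]) for ONE sentence.

    Appends t[-1]=..., and t[-1]=...,t[-2]=... style features built from
    the previous tags, padding the history with '<s>' at sentence start.
    """
    out = []
    history = ["<s>"] * order
    # The extra ("<s>", []) row mirrors the end-of-sentence line in the listing.
    for tag, feats in rows + [("<s>", [])]:
        context = []
        for k in range(1, order + 1):
            parts = [f"t[-{j}]={history[-j]}" for j in range(1, k + 1)]
            context.append(",".join(parts) + ":1")
        out.append(" ".join([tag] + feats + context))
        history.append(tag)
    return out
```

With `order=2` each row gains one feature for the previous tag and one for the previous two tags, which is what “--markovOrder 2” requests.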

11

MACHINE TRANSLATION

12

Acknowledgments

• Many thanks to (for helpful content and input on content):
  – Chris Callison-Burch, Matt Post, & Adam Lopez (JHU)
  – Philipp Koehn & Barry Haddow (U Edinburgh)
  – Kevin Knight (ISI)

13

14

15

Translation: global problem and interesting research problem

Internet users by language, 2007: English 32%, Chinese 13%, Spanish 9%, Japanese 7%, French 5%, German 4%, Arabic 4%, Portuguese 4%, Other 21%

• Non-English Internet content and user communities are increasing explosively

• Human translation costs are excessive: major languages range from 10-50 cents per word

Result: the vast majority of published material remains untranslated!

16

Prevalence of MT on the Web

[Bar chart: Proportion of MT’d Content by language. Languages shown: Estonian, Hungarian, Slovenian, Slovak, Romanian, Latvian, Lithuanian. Bar values: 12.13%, 12.93%, 25.47%, 46.40%, 47.40%, 50.07%, 51.53%.]

From Rarrick et al., 2010

17

18

The Goal: (sentence) translation

• Translate source sentences into target sentences
  – For now, ignore discourse structure, co-reference, and phenomena across sentence boundaries

滴水之恩當以涌泉相報

A drop of water shall be returned with a burst of spring.

19

Types of MT systems

• Source of information
  – Rule based: People write rules to specify translations of words, phrases
  – Data-driven: Use learning techniques to derive translation “rules” from data sources (e.g., parallel corpora)

• Level of representation (modified Vauquois pyramid, top to bottom)
  – Interlingua
  – Semantic forms
  – Syntax trees
  – Phrases
  – Words

20

Advantages of data-driven translation

• We can model the genres of documents that we would like to model
  – Learn contextually appropriate translations for technical data, chat data, etc.
• Very flexible system
  – Given corpus C = ({x1,y1}, {x2,y2}, …) of sentence pairs
  – Translate(C, x) = y is a function of the training data and the input sentence
  – To build a new system (or optimize our old one) we just change the data
  – But… we need oodles of data to get “good” models

21

Statistical MT

• Learn word and phrase alignments from “parallel” data

22

Statistical MT

• Learn word and phrase alignments from “parallel” data
  – Parallel data?
  – Parallel documents?

26

Statistical MT

• Learn word and phrase alignments from “parallel” data
  – Start with parallel documents
    • Need parallel sentences
    • Sentence break and sentence align
  – Word align and produce word and phrase translation tables (our translation models)

27

28

29

Some Hmong

a house                 ib lub tsev
a new house             ib lub tsev tshiab
my new house            kuv lub tsev tshiab
eight new houses        yim lub tsev tshiab
my eight new houses     kuv yim lub tsev tshiab

30

Some More Hmong

the house               lub tsev

31

Even More Hmong

kuv pluag heev          I'm very poor
ib pluag mov            a meal
ib taig mov             a bowl of rice
ib taig zaub            a bowl of vegetables

33

Statistical MT

• Learn word and phrase alignments from “parallel” data
  – Start with parallel documents
    • Need parallel sentences
    • Sentence break and sentence align
  – Word align and produce word and phrase translation tables (our translation models)

• Use monolingual data to
  – Build language models
    • Inform ordering
    • Choose best translation from n-best list

36

Statistical MT Recipe

Start With
• Parallel sentences
  – Align words & phrases, & generate counts
• Monolingual data
• Decoding Algorithm

Build These Components
• Translation Model
  – Probs associated with aligned words & phrases – P(F|E)
• Language Model – P(E)
• Decoder
  – Maximizes P(F|E)*P(E)

37

Statistical Machine Translation

• Given foreign f, find best English translation e*
    $e^* = \arg\max_e P(e \mid f)$

• Use Bayes’ rule to get “noisy channel” model
    $P(e \mid f) = P(f \mid e) \cdot P(e) / P(f)$
    $\arg\max_e P(e \mid f) = \arg\max_e P(f \mid e) \cdot P(e)$

• P(f | e) is the channel or translation model
• P(e) is the language model
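As a toy illustration of the argmax above, the sketch below scores a fixed list of candidate translations by P(f | e) · P(e) in log space. All probabilities are invented for illustration, the names `TM`, `LM`, and `decode` are hypothetical, and a real decoder searches a far larger space than a hand-written candidate list.

```python
import math

# Hypothetical toy probabilities, invented for illustration only.
TM = {  # translation model: P(f | e)
    ("sus grupos", "its groups"): 0.2,
    ("sus grupos", "groups its"): 0.2,
    ("sus grupos", "his clients"): 0.001,
}
LM = {  # language model: P(e)
    "its groups": 0.01,
    "groups its": 0.0001,
    "his clients": 0.01,
}


def decode(f, candidates):
    """Return argmax_e P(f | e) * P(e), computed as a sum of logs."""
    def score(e):
        return math.log(TM[(f, e)]) + math.log(LM[e])
    return max(candidates, key=score)
```

In this toy setup the translation model cannot separate “its groups” from “groups its”; the language model breaks the tie, and P(f) never needs to be computed because it is the same for every candidate.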

38

Centauri/Arcturan [Knight, 1997]

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

Slides 38-74 adapted from Kevin Knight and CCB’s JHU crew

39

Centauri/Arcturan [Knight, 1997]

1a. ok-voon ororok sprok .
1b. at-voon bichat dat .

2a. ok-drubel ok-voon anok plok sprok .
2b. at-drubel at-voon pippat rrat dat .

3a. erok sprok izok hihok ghirok .
3b. totat dat arrat vat hilat .

4a. ok-voon anok drok brok jok .
4b. at-voon krat pippat sat lat .

5a. wiwok farok izok stok .
5b. totat jjat quat cat .

6a. lalok sprok izok jok stok .
6b. wat dat krat quat cat .

7a. lalok farok ororok lalok sprok izok enemok .
7b. wat jjat bichat wat dat vat eneat .

8a. lalok brok anok plok nok .
8b. iat lat pippat rrat nnat .

9a. wiwok nok izok kantok ok-yurp .
9b. totat nnat quat oloat at-yurp .

10a. lalok mok nok yorok ghirok clok .
10b. wat nnat gat mat bat hilat .

11a. lalok nok crrrok hihok yorok zanzanok .
11b. wat nnat arrat mat zanzanat .

12a. lalok rarok nok izok hihok mok .
12b. wat nnat forat arrat vat gat .

Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp

Annotations on the walk-through slides (same corpus and assignment as above):
• process of elimination
• cognate?
• zero fertility

51

Clients do not sell pharmaceuticals in Europe => Clientes no venden medicinas en Europa

It’s Really Spanish/English

1a. Garcia and associates .
1b. Garcia y asociados .

2a. Carlos Garcia has three associates .
2b. Carlos Garcia tiene tres asociados .

3a. his associates are not strong .
3b. sus asociados no son fuertes .

4a. Garcia has a company also .
4b. Garcia tambien tiene una empresa .

5a. its clients are angry .
5b. sus clientes estan enfadados .

6a. the associates are also angry .
6b. los asociados tambien estan enfadados .

7a. the clients and the associates are enemies .
7b. los clientes y los asociados son enemigos .

8a. the company has three groups .
8b. la empresa tiene tres grupos .

9a. its groups are in Europe .
9b. sus grupos estan en Europa .

10a. the modern groups sell strong pharmaceuticals .
10b. los grupos modernos venden medicinas fuertes .

11a. the groups do not sell zenzanine .
11b. los grupos no venden zanzanina .

12a. the small groups are not modern .
12b. los grupos pequenos no son modernos .

53

Your assignment, put these words in order: { jjat, arrat, mat, bat, oloat, at-yurp }

54

Reorder

5040 Possible Orderings!!

75

Language Model

• Use a standard n-gram language model for P(E).
• Trained on large monolingual corpus
  – 4- or 5-gram is typical
  – Often uses target side of parallel data + monolingual data
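For intuition about how an n-gram model informs ordering, here is a minimal bigram sketch. It is a simplification: real systems typically use 4- or 5-grams with better smoothing, whereas this sketch uses add-one smoothing over a tiny training set. The name `train_bigram_lm` is hypothetical.

```python
import math
from collections import Counter


def train_bigram_lm(sentences):
    """Return a function giving the add-one-smoothed bigram log P(sentence)."""
    context_counts, bigram_counts = Counter(), Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        context_counts.update(toks[:-1])               # every token that has a successor
        bigram_counts.update(zip(toks[:-1], toks[1:]))
    vocab = {w for s in sentences for w in s.split()} | {"</s>"}

    def logprob(sentence):
        toks = ["<s>"] + sentence.split() + ["</s>"]
        total = 0.0
        for prev, cur in zip(toks[:-1], toks[1:]):
            num = bigram_counts[(prev, cur)] + 1        # add-one smoothing
            den = context_counts[prev] + len(vocab)
            total += math.log(num / den)
        return total

    return logprob
```

Trained on a few English sentences, such a model assigns a higher score to a fluent word order than to a scrambled one, which is how it helps choose among candidate orderings.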

76

Translation Model

• “Phrase table”
  – N-gram pairs and probabilities

77

Statistical Machine Translation

78

EVALUATING MT

MT Evaluation

Source: ズキズキ 痛み ます 。

16 human translations:
• I have a throbbing pain.
• I am experiencing a throbbing pain.
• I am suffering from a throbbing pain.
• I am feeling a throbbing pain.
• It is a throbbing pain.
• It's throbbing and it really hurts.
• It's painful and it's throbbing.
• It's throbbing with pain.
• It's in throbbing pain.
• It hurts so much it's throbbing.
• I've got a throbbing pain.
• I can feel a throbbing pain.
• I am suffering from a throbbing pain.
• I am experiencing a throbbing pain.
• I have a painful throbbing.
• I feel a painful throbbing.

79

Data from International Workshop on Spoken Language Translation

80

MT Evaluation

• No “right answer”!
• What can we test instead?
  – Human adequacy / fluency ratings
  – Human efficacy in an application (e.g. question answering from translated foreign documents vs. native documents)
  – Very accurate, but slow & expensive
• Agreement with reference translations
  – BLEU (BiLingual Evaluation Understudy: IBM)
  – Fast system development

81

BLEU (Papineni, ACL 2002)

• MT output:
  1: It is a guide to action which ensures that the military always obeys the commands of the party.
  2: It is to insure the troops forever hearing the activity guidebook that party direct.

• Human (reference) translations:
  1: It is a guide to action that ensures that the military will forever heed Party commands.
  2: It is the guiding principle which guarantees the military forces always being under the command of the Party.
  3: It is the practical guide for the army always to heed the directions of the party.

84

BLEU: observations

1: It is a guide to action which ensures that the military always obeys the commands of the party.

2: It is to insure the troops forever hearing the activity guidebook that party direct.

• Observations
  – Word overlap is indicative
  – n-gram (word sequence) overlap is even more distinct
  – Drawing from multiple reference translations helps

85

BLEU metric

• Compute n-gram precisions:
    $P_n = c(\text{matched } n\text{-grams}) / c(n\text{-grams in candidate})$

• Compute a brevity penalty (prevent candidates from deleting difficult words):
    $BP = \exp(\min(1 - r/c,\ 0))$, where $r$ = reference length and $c$ = candidate length

• Combine using geometric mean:
    $\mathrm{BLEU} = BP \cdot \left(\prod_{i=1}^{n} P_i\right)^{1/n}$

• Produces score on a 0-1 scale – often expressed as a “percentage” (e.g., * 100)
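The metric above can be sketched in a few lines of Python. This is an illustrative single-sentence version, not a reference implementation. It makes several assumptions the slide does not spell out: tokens are split on whitespace, matched n-grams are clipped to the maximum count seen in any one reference, r is the reference length closest to the candidate length, and n runs up to 4. The names `ngrams` and `bleu` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, references, max_n=4):
    cand = candidate.split()
    refs = [r.split() for r in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        # For each n-gram, the most times it occurs in any single reference.
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        matched = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if matched == 0 or total == 0:
            return 0.0  # a zero precision makes the geometric mean zero
        log_precisions.append(math.log(matched / total))
    c = len(cand)
    # Assumed choice of r: the reference length closest to c (shorter one on ties).
    r = min((len(ref) for ref in refs), key=lambda length: (abs(length - c), length))
    bp = math.exp(min(1 - r / c, 0))
    return bp * math.exp(sum(log_precisions) / max_n)
```

Because of the geometric mean, a candidate with no matching n-gram at any order scores zero in this sketch; practical implementations compute the counts over a whole test set, which makes that far less likely.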

BLEU results circa 2002

[from Papineni et al., ACL 2002] [from G. Doddington, NIST]

Distinguishes humans from machines… …correlates well with human judgments

86

However, nowadays we’re starting to see problems:
  – Some systems score better than human translations
  – In competitions, some “gaming of BLEU”
  – Rule based systems are at a disadvantage after tuning

87

Next Time

• MT & Word Alignment
• Application of EM
