CSCI 599 MACHINE TRANSLATION

11-1-11 11:00

nlg.isi.edu/teaching/cs599mt/Introduction.pdf


INSTRUCTORS

David Chiang 蔣偉

Liang Huang 黃亮

Kevin Knight 武凯文


LANGUAGES ON THE WEB


LANGUAGES ON TWITTER


LANGUAGES IN LA


WHY DO WE NEED MT?


WHY IS MT HARD?

Coverage

chiliagon → ?


WHY IS MT HARD?

Ambiguity

The box is in the pen

• La caja está en la pluma (pen as a writing instrument)

• La caja está en el corral (pen as an enclosure)


WHY IS MT HARD?

Divergence

John swam across the lake

• Juan cruzó a nado el lago (literally, "Juan crossed the lake by swimming")

• Juan nadó tras el lago (keeps "swam" as the main verb)


IN THE BEGINNING

One naturally wonders if the problem of translation could conceivably be treated as a problem in cryptography. When I look at an article in Russian, I say: “This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.”

Warren Weaver, 1947


IBM-GEORGETOWN


BAR-HILLEL

• Syntactic transfer

• “Semantic barrier”: The box was in the pen


ALPAC REPORT


RULE-BASED MT

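[Figure: pyramid diagram. On each side (source language, target language) the levels rise from String through Syntax to Semantics, with Interlingua at the apex. Direct translation links the two String levels; Transfer links the two Syntax levels and the two Semantics levels.]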


STATISTICAL MT


A SCI-FI EXAMPLE (KNIGHT, 1997)

farok crrrok hihok yorok clok kantok ok-yurp

Your assignment: translate this Centauri sentence into Arcturan


1c. ok-voon ororok sprok .
1a. at-voon bichat dat .

2c. ok-drubel ok-voon anok plok sprok .
2a. at-drubel at-voon pippat rrat dat .

3c. erok sprok izok hihok ghirok .
3a. totat dat arrat vat hilat .

4c. ok-voon anok drok brok jok .
4a. at-voon krat pippat sat lat .

5c. wiwok farok izok stok .
5a. totat jjat quat cat .

6c. lalok sprok izok jok stok .
6a. wat dat krat quat cat .

7c. lalok farok ororok lalok sprok izok enemok .
7a. wat jjat bichat wat dat vat eneat .

8c. lalok brok anok plok nok .
8a. iat lat pippat rrat nnat .

9c. wiwok nok izok kantok ok-yurp .
9a. totat nnat quat oloat at-yurp .

10c. lalok mok nok yorok ghirok clok .
10a. wat nnat gat mat bat hilat .

11c. lalok nok crrrok hihok yorok zanzanok .
11a. wat nnat arrat mat zanzanat .

12c. lalok rarok nok izok hihok mok .
12a. wat nnat forat arrat vat gat .

farok crrrok hihok yorok clok kantok ok-yurp

(Knight, 1997)


A SCI-FI EXAMPLE (KNIGHT, 1997)

Centauri sentence: farok crrrok hihok yorok clok kantok ok-yurp

Arcturan words, still in Centauri order: jjat arrat mat bat oloat at-yurp

Next: put the Arcturan words in Arcturan order
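The word-by-word step of this exercise can be approximated with a simple co-occurrence heuristic. The sketch below is not the procedure from the slides and not a full alignment model: it scores each Centauri/Arcturan word pair with a sentence-level Dice coefficient over the twelve sentence pairs of the corpus and reports the best-scoring Arcturan word(s) for each word of the test sentence. The names `dice_table` and `best_targets` are made up for this illustration.

```python
from collections import Counter
from itertools import product

# Parallel corpus from the slides (Knight, 1997): (Centauri, Arcturan) pairs.
CORPUS = [
    ("ok-voon ororok sprok", "at-voon bichat dat"),
    ("ok-drubel ok-voon anok plok sprok", "at-drubel at-voon pippat rrat dat"),
    ("erok sprok izok hihok ghirok", "totat dat arrat vat hilat"),
    ("ok-voon anok drok brok jok", "at-voon krat pippat sat lat"),
    ("wiwok farok izok stok", "totat jjat quat cat"),
    ("lalok sprok izok jok stok", "wat dat krat quat cat"),
    ("lalok farok ororok lalok sprok izok enemok", "wat jjat bichat wat dat vat eneat"),
    ("lalok brok anok plok nok", "iat lat pippat rrat nnat"),
    ("wiwok nok izok kantok ok-yurp", "totat nnat quat oloat at-yurp"),
    ("lalok mok nok yorok ghirok clok", "wat nnat gat mat bat hilat"),
    ("lalok nok crrrok hihok yorok zanzanok", "wat nnat arrat mat zanzanat"),
    ("lalok rarok nok izok hihok mok", "wat nnat forat arrat vat gat"),
]


def dice_table(corpus):
    """Sentence-level Dice score for every co-occurring (source, target) word pair."""
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src, tgt in corpus:
        # Sets, so a word repeated within one sentence is counted once.
        src_words, tgt_words = set(src.split()), set(tgt.split())
        src_count.update(src_words)
        tgt_count.update(tgt_words)
        pair_count.update(product(src_words, tgt_words))
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in pair_count.items()
    }


def best_targets(word, table):
    """All target words tied for the highest Dice score with `word`."""
    scores = {t: v for (s, t), v in table.items() if s == word}
    if not scores:
        return []
    top = max(scores.values())
    return sorted(t for t, v in scores.items() if v == top)


TABLE = dice_table(CORPUS)
for w in "farok crrrok hihok yorok clok kantok ok-yurp".split():
    print(w, "->", best_targets(w, TABLE))
```

On this corpus the heuristic recovers farok, hihok, yorok and clok as jjat, arrat, mat and bat, matching the slide. It cannot separate kantok from ok-yurp, since both occur only in sentence 9 and tie on oloat and at-yurp, and it links crrrok to zanzanat because both occur only in sentence 11. Resolving those cases takes extra evidence such as the ok-/at- name pattern, word position, or a model in which words compete for alignments.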


1e. Garcia and associates .
1s. Garcia y asociados .

2e. Carlos Garcia has three associates .
2s. Carlos Garcia tiene tres asociados .

3e. his associates are not strong .
3s. sus asociados no son fuertes .

4e. Garcia has a company also .
4s. Garcia tambien tiene una empresa .

5e. its clients are angry .
5s. sus clientes estan enfadados .

6e. the associates are also angry .
6s. los asociados tambien estan enfadados .

7e. the clients and the associates are enemies .
7s. los clientes y los asociados son enemigos .

8e. the company has three groups .
8s. la empresa tiene tres grupos .

9e. its groups are in Europe .
9s. sus grupos estan en Europa .

10e. the modern groups sell strong pharmaceuticals .
10s. los grupos modernos venden medicinas fuertes .

11e. the groups do not sell zenzanine .
11s. los grupos no venden zanzanina .

12e. the small groups are not modern .
12s. los grupos pequenos no son modernos .

Clients do not sell pharmaceuticals in Europe .

(Knight, 1997)


TRANSLATION AS DECODING

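\[
\arg\max_e \, p(e \mid f) \;=\; \arg\max_e \, p(e, f) \;=\; \arg\max_e \, p(e)\, p(f \mid e)
\]

Here f is the observed source sentence and e ranges over candidate translations. The first equality holds because p(f) does not depend on e.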
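A minimal sketch of the decision rule arg max over e of p(e) p(f | e), using the divergence example from earlier in the deck. Every probability below is a hypothetical number invented for illustration, as are the candidate list and the function name `decode`; a real system would estimate p(e) from a language model and p(f | e) from a translation model, and would search a far larger candidate space.

```python
import math

F = "Juan cruzó a nado el lago"

CANDIDATES = [
    "John swam across the lake",
    "John crossed swimming the lake",
]

# Hypothetical language model p(e): fluent English gets more mass.
LM = {
    "John swam across the lake": 0.02,
    "John crossed swimming the lake": 0.0001,
}

# Hypothetical translation model p(f | e): the literal rendering
# explains the Spanish words more directly.
TM = {
    (F, "John swam across the lake"): 0.3,
    (F, "John crossed swimming the lake"): 0.6,
}


def decode(f, candidates, lm, tm):
    """Return the candidate e maximizing p(e) * p(f | e), computed in log space."""
    return max(candidates, key=lambda e: math.log(lm[e]) + math.log(tm[(f, e)]))


print(decode(F, CANDIDATES, LM, TM))
```

With these made-up numbers the translation model alone would prefer the literal candidate, and the language model term tips the choice to the fluent one, which is the practical appeal of factoring p(e, f) this way.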


OVERVIEW

• Word-based alignment and translation

• Language models

• Evaluation

• Phrase-based translation

• Discriminative training

• Interlude: Subword translation

• Syntax-based translation
