English-Persian SMT. Reza Saeedi [email protected]. WTLAB. Wednesday, May 25, 2011.

Page 1: English-Persian SMT

English-Persian SMT

Reza Saeedi [email protected]

WTLAB Wednesday, May 25, 2011

Page 2: English-Persian SMT

Outline

  • MT Introduction
  • SMT Introduction
  • Requirements for SMT
  • Evaluation metrics
  • English-Persian MT challenges
  • English-Persian SMT
    • System1
    • System2
  • Problems in English-Persian SMT

Page 3: English-Persian SMT

MT Introduction

Machine Translation is the automatic translation of text written in one natural language into another by computer.

There are several ways to do this:

  • Dictionary-based
  • Rule-based
  • Example-based
  • Statistical approach

Page 4: English-Persian SMT

SMT Introduction

The first ideas of statistical machine translation were proposed by Warren Weaver in 1947.

Statistical machine translation tries to learn how to translate by examining translations made by humans.

Page 5: English-Persian SMT

SMT Introduction (Cont.)

Statistical MT models take the view that every sentence in the target language is a translation of the source language sentence with some probability.

The best translation, of course, is the sentence that has the highest probability.

The key problems in statistical MT are:

  • estimating the probability of a translation
  • efficiently finding the sentence with the highest probability

Page 6: English-Persian SMT

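SMT Introduction (Cont.)

Given a source sentence f, we seek the target sentence e that maximizes P(e | f):

$\hat{e} = \arg\max_e P(e \mid f)$

By Bayes' rule:

$P(e \mid f) = \frac{P(e)\, P(f \mid e)}{P(f)}$

Since P(f) is constant for a given source sentence f, it can be dropped from the maximization:

$\arg\max_e P(e \mid f) = \arg\max_e P(e)\, P(f \mid e)$

Intuitively, P(e | f) therefore depends on two factors: P(e), the fluency of e (language model), and P(f | e), the faithfulness of e to f (translation model).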
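
As a rough illustration of the noisy-channel decision rule, the sketch below picks the candidate that maximizes P(e) * P(f | e). The candidate sentences and all probability values are invented for the example and do not come from any trained model; `best_translation` is a hypothetical helper name.

```python
import math

# Hypothetical language model scores P(e): how fluent each candidate is.
LM = {
    "the house is small": 0.020,
    "small the is house": 0.0001,
    "the home is little": 0.005,
}

# Hypothetical translation model scores P(f | e) for one fixed source sentence f:
# how faithful each candidate is to the source.
TM = {
    "the house is small": 0.30,
    "small the is house": 0.30,
    "the home is little": 0.10,
}

def best_translation(candidates, lm, tm):
    """Return argmax_e P(e) * P(f | e), computed in log space."""
    return max(candidates, key=lambda e: math.log(lm[e]) + math.log(tm[e]))

best = best_translation(list(LM), LM, TM)
print(best)
```

The scrambled candidate is as faithful as the first one in this toy setup, but the language model term penalizes it, which is the role fluency plays in the product.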

Page 7: English-Persian SMT

SMT Introduction (Cont.)

Figure credit: Philipp Koehn, http://homepages.inf.ed.ac.uk/pkoehn

Page 8: English-Persian SMT

Why SMT?

  • Better use of resources
  • No linguistic knowledge is needed
  • It can be used for any language pair

But we need a large training corpus.

Page 9: English-Persian SMT

Steps of SMT

Page 10: English-Persian SMT

Requirements for SMT

  • Bilingual and monolingual corpora:
    • The bilingual corpus needs two files aligned sentence by sentence (one file for the source language and the other for the target language)
    • Microsoft Bilingual Sentence Aligner

  • Language model:
    • We need a tool to compute P(e)
    • This step needs a monolingual corpus
    • SRILM: a tool for building N-gram language models
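
To make the role of the language model concrete, here is a minimal sketch of a bigram model with add-one smoothing that assigns a probability P(e) to a sentence. It is a toy stand-in and not SRILM's implementation; the three-sentence corpus and the function names `train_bigram_lm` and `sentence_prob` are assumptions made for the example.

```python
from collections import Counter

def train_bigram_lm(sentences):
    """Count bigrams and their histories in a whitespace-tokenized monolingual corpus."""
    history_counts, bigram_counts = Counter(), Counter()
    for s in sentences:
        tokens = ["<s>"] + s.split() + ["</s>"]
        history_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens[:-1], tokens[1:]))
    # Words that can be predicted: every corpus word plus the end-of-sentence marker.
    vocab = {w for s in sentences for w in s.split()} | {"</s>"}
    return history_counts, bigram_counts, len(vocab)

def sentence_prob(sentence, model):
    """P(e) as a product of add-one smoothed bigram probabilities."""
    history_counts, bigram_counts, vocab_size = model
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    p = 1.0
    for prev, cur in zip(tokens[:-1], tokens[1:]):
        p *= (bigram_counts[(prev, cur)] + 1) / (history_counts[prev] + vocab_size)
    return p

corpus = ["the house is small", "the house is big", "the car is small"]
model = train_bigram_lm(corpus)
p_fluent = sentence_prob("the house is small", model)
p_scrambled = sentence_prob("small is house the", model)
print(p_fluent > p_scrambled)
```

Add-one smoothing is chosen here only because it is short; real toolkits use stronger smoothing methods and higher-order N-grams.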

Page 11: English-Persian SMT

LM output

Page 12: English-Persian SMT

Requirements for SMT

  • Translation model:
    • We need a tool to compute P(f | e)
    • This step needs a bilingual corpus
    • GIZA++ (word alignment)
    • The output of this step is a phrase table

  • Decoder:
    • Searches for the best translation
    • Moses

Page 13: English-Persian SMT

Phrase table

Page 14: English-Persian SMT

Moses tool

Page 15: English-Persian SMT

The training steps

  • Prepare data
  • Run GIZA++
  • Align words
  • Get lexical translation table
  • Extract phrases
  • Score phrases
  • Build reordering model
  • Build generation models
  • Create configuration file
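
The word-alignment and lexical-translation steps above rest on models such as IBM Model 1, one of the models GIZA++ trains. The sketch below is a simplified EM estimate of the lexical translation probabilities t(f | e), without the NULL word and without the later models; it is not GIZA++'s code. The three sentence pairs use illustrative transliterated Persian ("in" = this, "khane" = house, "ketab" = book, "yek" = a), and `ibm_model1` is a hypothetical function name.

```python
from collections import defaultdict

def ibm_model1(pairs, iterations=10):
    """Estimate t(f | e) with EM from (source_tokens, target_tokens) sentence pairs."""
    f_vocab = {f for fs, _ in pairs for f in fs}
    t = defaultdict(lambda: 1.0 / len(f_vocab))  # uniform initialization
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: collect expected alignment counts.
        for fs, es in pairs:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac
        # M-step: renormalize the expected counts per target word.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

# Source f = English, target e = transliterated Persian (toy data).
pairs = [
    ("this house".split(), "in khane".split()),
    ("this book".split(), "in ketab".split()),
    ("a book".split(), "yek ketab".split()),
]
t = ibm_model1(pairs)
print(round(t[("book", "ketab")], 3), round(t[("this", "ketab")], 3))
```

Even on three sentence pairs, the co-occurrence pattern is enough for EM to prefer pairing "book" with "ketab" over "this" with "ketab".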

Page 16: English-Persian SMT

Evaluation metrics

  • BLEU (Bilingual Evaluation Understudy)
    • Developed at IBM
    • The closer a machine translation is to a professional human translation, the better it is

  • NIST
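
The idea behind BLEU can be sketched as clipped n-gram precision combined with a brevity penalty. The version below is a simplified sentence-level variant with a single reference and no smoothing, so it is only an approximation of the corpus-level metric used in practice; the function names `ngrams` and `bleu` are chosen for the example.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU against one reference, without smoothing."""
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        matches = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if matches == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(matches / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the brevity penalty.
    return bp * math.exp(sum(log_precisions) / max_n)

reference = "the cat sat on the mat"
exact = bleu("the cat sat on the mat", reference)
partial = bleu("the cat sat on the mat today", reference)
unrelated = bleu("a dog ran away quickly", reference)
print(exact, partial, unrelated)
```

Returning 0 whenever one n-gram order has no match is a known weakness of unsmoothed sentence-level BLEU, which is one reason the metric is normally computed over a whole test set.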

Page 17: English-Persian SMT

English-Persian MT challenges

  • The structure of Persian is very different from that of English
  • The structure of Persian is very complex
  • Little previous work has been done for this language pair
  • Effective SMT systems rely on very large bilingual corpora, but these are not readily available for the English/Persian language pair

Page 18: English-Persian SMT

English-Persian SMT

  • Few English-Persian MT systems have been developed
  • Most of them are purely rule-based
  • There are two works on English-Persian SMT:
    • Mohaghegh and Sarrafzadeh (Massey University)
    • Pilevar and Faili (Tehran University)

Page 19: English-Persian SMT

System1

Corpus: BBC news

Page 20: English-Persian SMT

System1 (Cont.)

Tools: SRILM, GIZA++, Moses

Page 21: English-Persian SMT

System1: Improved Language Modeling

Page 22: English-Persian SMT

System2

Corpus: Bilingual (TEP): film subtitles, 3 books, KDE4

Page 23: English-Persian SMT

System2 (Cont.)

Corpus: Monolingual: Hamshahri, film subtitles

Page 24: English-Persian SMT

System2 (Cont.)

Tools: SRILM, GIZA++, Moses

PersianSMT with 4-gram Sub-LM

Page 25: English-Persian SMT

Comparison of PersianSMT with Google Translate

Page 26: English-Persian SMT

Problems in English-Persian SMT

  • Compound verbs (alignment problem)
    • Use a phrase-based SMT system
    • But inflectional morphology remains a problem: the large number of inflected verb forms prevents the system from learning to translate all the individual forms of a compound verb

  • Persian treats personal pronouns as an optional element of the sentence (alignment problem)

Page 27: English-Persian SMT

Problems (Cont.)

  • The system fails to place the elements of the sentence in the right order
    • Use a phrase-based SMT system
    • Re-rank the n-best output list and/or reorder the output sentences
    • Prior to translation, reorder the input sentence using morpho-syntactic information, so that its word order better resembles that of the target language
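
As a rough illustration of reordering the input before translation, the sketch below moves verbs to the end of an English sentence so that its order looks more like Persian's subject-object-verb order. The single rule, the hand-assigned part-of-speech tags, and the function name `reorder_svo_to_sov` are all assumptions made for the example; a real system would use a tagger or parser and a much richer rule set.

```python
def reorder_svo_to_sov(tagged):
    """Move verb tokens to the end of the sentence, keeping punctuation last."""
    punct = [(w, t) for w, t in tagged if t == "PUNCT"]
    verbs = [(w, t) for w, t in tagged if t == "VERB"]
    rest = [(w, t) for w, t in tagged if t not in ("VERB", "PUNCT")]
    return rest + verbs + punct

# Hand-tagged toy input; the tags are assumed rather than produced by a tagger.
tagged = [
    ("Ali", "NOUN"),
    ("bought", "VERB"),
    ("a", "DET"),
    ("book", "NOUN"),
    (".", "PUNCT"),
]
reordered = [w for w, _ in reorder_svo_to_sov(tagged)]
print(" ".join(reordered))  # prints "Ali a book bought ."
```

After this step the English words already stand in roughly the order of their Persian counterparts, which leaves less long-distance reordering for the decoder.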

Page 29: English-Persian SMT

References

[1] A. Ramanathan, "Statistical Machine Translation", Ph.D. Seminar Report, Department of Computer Science and Engineering, Indian Institute of Technology, 2000.

[2] A. Lopez, "Statistical Machine Translation", ACM Computing Surveys, 2008.

[3] M. Mohaghegh and A. Sarrafzadeh, "The first English-Persian statistical machine translation", New Zealand Postgraduate Conference, 2009.

[4] M. Mohaghegh and A. Sarrafzadeh, "An analysis of the effect of training data variation in English-Persian Statistical Machine Translation", 2009 International Conference on Innovations in Information Technology (IIT 2009).

[5] M. Mohaghegh and A. Sarrafzadeh, "Performance evaluation of various training data in English-Persian statistical machine translation", to appear in Proceedings of the 10th International Conference on the Statistical Analysis of Textual Data (JADT 2010), Rome, Italy, June 9-11, 2010.

[6] M. Mohaghegh and A. Sarrafzadeh, "Improved Language Modeling for English-Persian Statistical Machine Translation", COLING 2010 / SIGMT Workshop, 23rd International Conference on Computational Linguistics, Beijing, China, 28 August 2010.

Page 30: English-Persian SMT

References (Cont.)

[7] M.T. Pilevar and H. Faili, "PersianSMT: A First Attempt to English-Persian Statistical Machine Translation", to appear in Proceedings of the 10th International Conference on the Statistical Analysis of Textual Data (JADT 2010).