Project Presentation on
Experiment With Different Models Of Statistical Machine Translation
Submitted by: Khyati Gupta (14483), Rakhi Sharma (14514)
Contents
• Problem Statement
• Objective
• About the Project
• Flowchart
• Work Done
• Conclusion
• Future Work
• References
Problem Statement
• Machine translation has been an active research field since the 1990s.
• However, little work has been done on Indian languages, and the current state of the art is weak because data resources are sparse.
• The success of an SMT system depends on the availability of a large parallel corpus.
• Such data is necessary to estimate translation probabilities reliably.
• We have worked on Hindi-to-English translation.
Objective
The objectives of our thesis are:
• Work on different models of Statistical Machine Translation (SMT).
• Report the results obtained.
• The SMT models studied are:
SMT
   String-based: Phrase
   Tree-based: Hierarchical, Syntax
Introduction
What is translation?
The process of converting text from one language to another so that the original message is retained in the target language.
Source language = the language whose text is to be translated.
Target language = the language into which the text is translated.
What is machine translation?
Machine translation is automated translation, or “translation carried out by a computer.” It is a task within Natural Language Processing that uses a bilingual data set and other language assets to build language and phrase models, which are then used to translate text from the source language into the target language.
About the Project
• Study the basics of SMT.
• Install Moses, IRSTLM and MGIZA.
• Study various models of SMT: phrase-based, syntax-based and hierarchical.
• Create a parallel corpus.
• Experiment with translation from Hindi to English using different models of SMT.
• Convert the parser's output into Moses format.
• Report results on the basis of the scores obtained.
• Identify the best model of SMT for a given corpus.
Flowchart of SMT
Bayesian Approach
• We apply the Bayesian approach, where f is the Hindi source sentence and e is a candidate English translation:
• Language model (LM): assigns a probability P(e) to any target string of words. It is a probability distribution over strings that attempts to reflect how frequently a string occurs as a sentence.
• Translation model (TM): assigns a probability P(f|e) to any pair of target and source strings.
• Decoder: determines the translation based on the probabilities of the LM and TM:
\hat{e} = \arg\max_e P(e \mid f) = \arg\max_e P(f \mid e)\, P(e)
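The decoding rule above can be illustrated with a toy example. This is only a sketch: the probability tables LM and TM and the candidate list are invented for illustration and do not come from the project's trained models.

```python
import math

# Hypothetical toy probabilities, invented for illustration only.
LM = {"give betel": 0.02, "betel give": 0.001}      # P(e)
TM = {("पान दो", "give betel"): 0.3,                 # P(f|e)
      ("पान दो", "betel give"): 0.3}

def decode(f, candidates):
    """Return the candidate e that maximises log P(f|e) + log P(e)."""
    def score(e):
        return math.log(TM[(f, e)]) + math.log(LM[e])
    return max(candidates, key=score)

best = decode("पान दो", ["give betel", "betel give"])
print(best)
```

Both candidates are equally likely under the translation model here, so the language model decides in favour of the fluent word order.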
Language Model
• A language model computes the probability of a sentence.
• Goal of the language model: detect good English.
• SMT uses the n-gram approach to compute the LM probability.
• The probability of a sentence is the product of the conditional probabilities of its component words.
• The probability of each word is conditioned on the preceding words.
• Likelihood of a sentence (chain rule): P(S) = P(w_1) \times P(w_2 \mid w_1) \times \dots \times P(w_n \mid w_1, \dots, w_{n-1})
• Bigram approximation: P(S) \approx P(w_1) \times P(w_2 \mid w_1) \times \dots \times P(w_n \mid w_{n-1})
• Example illustrating the bigram model: P(the barking dog) = P(the|<start>) P(barking|the) P(dog|barking)
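The bigram example can be reproduced with a few lines of code. This is a minimal sketch with maximum-likelihood estimates and no smoothing; the three-sentence corpus is invented, and real toolkits such as IRSTLM or SRILM add smoothing and back-off.

```python
from collections import Counter

def train_bigram(sentences):
    """Count histories and bigrams, with <start> prepended to each sentence."""
    histories, bigrams = Counter(), Counter()
    for s in sentences:
        tokens = ["<start>"] + s.split()
        histories.update(tokens[:-1])            # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return histories, bigrams

def sentence_prob(sentence, histories, bigrams):
    """P(S) as a product of P(w_i | w_{i-1}); 0.0 for unseen events."""
    tokens = ["<start>"] + sentence.split()
    p = 1.0
    for h, w in zip(tokens, tokens[1:]):
        if histories[h] == 0:
            return 0.0
        p *= bigrams[(h, w)] / histories[h]
    return p

corpus = ["the barking dog", "the dog", "the barking cat"]  # invented toy corpus
hist, bi = train_bigram(corpus)
p = sentence_prob("the barking dog", hist, bi)
print(p)
```

On this toy corpus the product is P(the|<start>) = 3/3, P(barking|the) = 2/3 and P(dog|barking) = 1/2.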
Translation Models
P(f|e) is called the translation model. It is used to give better scores to accurate and complete translations. It is trained on bilingual Hindi-English parallel data.
Approaches for translation models are:
1. Phrase-based translation
• The sequences of words are called blocks or phrases. They are typically not linguistic phrases but phrasemes found using statistical methods from corpora.
2. Hierarchical phrase-based translation
• Hierarchical phrase-based translation combines the strengths of phrase-based and syntax-based translation.
• It uses synchronous context-free grammar rules, but the grammars may be constructed by an extension of methods for phrase-based translation, without reference to linguistically motivated syntactic constituents.
3. Syntax-based model
• The syntax model works on the syntactic categories of words and uses a context-free grammar (CFG).
Decoding
• The task of decoding in machine translation is to find the best-scoring translation according to the model.
• Given a Hindi sentence f, it finds the English yield of the single best derivation that has Hindi yield f.
• The phrase-based model uses the beam search algorithm.
• Tree-based models use chart decoding.
System Overview
Component        Tool
Word Alignment   GIZA++, MGIZA
Library          Boost
Decoder          Moses 5
Language Model   IRSTLM, SRILM
Corpus           English-Hindi
Work Done
Data Pre-Processing Flowchart
Sources (PDF) → Convert PDF into JPEG → Optical character recognition → Bilingual Text Aligner
Data Conversion
• Convert PDF to JPEG
• OCR (using Indisenz)
• Bilingual text alignment (using Microsoft Aligner)
Corpus Preparation
To prepare the data for training the translation system, we have to perform the following steps:
• Tokenisation: This means that spaces have to be inserted between (e.g.) words and punctuation.
• Truecasing: The initial words in each sentence are converted to their most probable casing. This helps reduce data sparsity.
• Cleaning: Long sentences and empty sentences are removed as they can cause problems with the training pipeline, and obviously misaligned sentences are removed.
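The cleaning step can be sketched as a simple filter over sentence pairs. This is an illustrative stand-in, not the Moses cleaning script itself; the function name clean_corpus and the thresholds max_len and max_ratio are assumptions chosen for the example.

```python
def clean_corpus(pairs, max_len=80, max_ratio=9.0):
    """Drop empty, over-long and badly length-mismatched sentence pairs.

    pairs: iterable of (source_sentence, target_sentence) strings.
    max_len and max_ratio are assumed thresholds, not project settings.
    """
    kept = []
    for src, tgt in pairs:
        s, t = src.split(), tgt.split()
        if not s or not t:
            continue                                  # empty sentence
        if len(s) > max_len or len(t) > max_len:
            continue                                  # too long
        if max(len(s), len(t)) / min(len(s), len(t)) > max_ratio:
            continue                                  # probably misaligned
        kept.append((" ".join(s), " ".join(t)))
    return kept

pairs = [
    ("मेरे दोस्त के लिए पान दो", "give a betel for my friend"),
    ("", "empty source"),
    ("एक", "one two three four five six seven eight nine ten"),
]
cleaned = clean_corpus(pairs)
print(len(cleaned))
```

Only the first pair survives: the second has an empty source side and the third has a length ratio of 10.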
Training in Moses
1. Prepare data
• Training data has to be provided sentence aligned in two files, one for the foreign sentences, one for the English sentences
• The parallel corpus has to be converted into a format that is suitable to the GIZA++ toolkit.
• Two vocabulary files are generated and the parallel corpus is converted into a numberized format.
• The vocabulary files contain words, integer word identifiers and word count information.
2. Run GIZA++
• GIZA++ is a freely available implementation of the IBM models. We need it as an initial step to establish word alignments.
[Figure: word alignment between मेरे दोस्त के लिए पान दो and GIVE A BETEL FOR MY FRIEND]
3. Align words
• To establish word alignments based on the two GIZA++ alignments, a number of heuristics may be applied.
4. Get lexical translation table
• Estimate a maximum likelihood lexical translation table. We estimate the w(e|f) as well as the inverse w(f|e) word translation table.
5. Extract phrases
• All phrases are dumped into one big file.
6. Score phrases
• Estimate the phrase translation probability φ(e|f).
जहानाबाद *दरभंगा ||| darbhanga* navada* ||| 1 1 1 1 ||| 0-0 1-1 ||| 1 1
7. Build lexicalized reordering model
• Moses uses lexicalized reordering models for reordering.
8. Build generation models
• The generation model is built from the target side of the parallel corpus.
9. Create configuration file
• As a final step, a configuration file for the decoder is generated with all the correct paths for the generated model and a number of default parameter settings.
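Step 4 above (the maximum likelihood lexical translation table) can be sketched as relative-frequency counting over word-aligned sentence pairs. This is a simplified illustration: the function name lexical_table and the aligned toy corpus are invented, and unaligned (NULL) words are ignored here.

```python
from collections import Counter

def lexical_table(aligned_corpus):
    """Estimate w(e|f) = count(e, f) / count(f) from alignment links.

    aligned_corpus: iterable of (f_tokens, e_tokens, links), where links
    is a list of (f_index, e_index) pairs.
    """
    pair_counts, f_counts = Counter(), Counter()
    for f_tokens, e_tokens, links in aligned_corpus:
        for i, j in links:
            pair_counts[(e_tokens[j], f_tokens[i])] += 1
            f_counts[f_tokens[i]] += 1
    return {(e, f): c / f_counts[f] for (e, f), c in pair_counts.items()}

# Invented toy corpus; दो is aligned to "give" twice and to "two" once.
corpus = [
    ("पान दो".split(), "give betel".split(), [(0, 1), (1, 0)]),
    ("पानी दो".split(), "give water".split(), [(0, 1), (1, 0)]),
    ("दो किताब".split(), "two books".split(), [(0, 0), (1, 1)]),
]
table = lexical_table(corpus)
print(table[("give", "दो")], table[("two", "दो")])
```

The inverse table w(f|e) follows by swapping the roles of the two languages in the counts.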
Tuning
• Once training is over, the parameters of the log-linear model have to be tuned so that the system does not overfit the training data and produces the most desirable translations on any test set. This process is called tuning. The basic assumption behind tuning is that the model should be tuned against the metric used for evaluation.
• That is why the tuning technique is known as Minimum Error Rate Training (MERT).
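The idea of tuning log-linear weights against an evaluation measure can be sketched as follows. This is a toy stand-in, not the MERT implementation used by Moses: it uses a coarse grid search instead of MERT's line search, and a simple "1-best equals reference" accuracy instead of BLEU. The n-best lists and feature values are invented.

```python
def rerank_accuracy(weights, nbest_lists):
    """Fraction of sentences whose top-scoring candidate is the reference."""
    correct = 0
    for candidates in nbest_lists:
        best = max(
            candidates,
            key=lambda c: sum(w * h for w, h in zip(weights, c["features"])),
        )
        correct += best["is_reference"]
    return correct / len(nbest_lists)

def tune(nbest_lists, steps=10):
    """Grid search over two weights (lam, 1 - lam) that sum to one."""
    grid = [(k / steps, 1 - k / steps) for k in range(steps + 1)]
    return max(grid, key=lambda w: rerank_accuracy(w, nbest_lists))

# Invented development set: features are (LM log-prob, TM log-prob).
dev = [
    [{"features": (-2.0, -5.0), "is_reference": True},
     {"features": (-6.0, -1.0), "is_reference": False}],
    [{"features": (-3.0, -4.0), "is_reference": True},
     {"features": (-7.0, -2.0), "is_reference": False}],
]
best_weights = tune(dev)
print(best_weights, rerank_accuracy(best_weights, dev))
```

With all weight on the translation model the wrong candidates win; shifting weight toward the language model fixes both sentences, which is the kind of trade-off tuning searches for.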
Working of Models performed
1. Working of Phrase-Based Model
• The Hindi sentence is first broken down into phrases based on statistics drawn from parallel corpora.
• Then these Hindi phrases are translated into English phrases.
• The translated English phrases are reordered.
2. Working of Hierarchical Model
• All the phases performed by Moses in the hierarchical model are the same as in the phrase-based model, but the rule extraction of the hierarchical model differs from that of phrase-based SMT.
It includes:
Data Preparation
• Tokenization
• Truecasing
• Cleaning
Training
• Word alignment
• Rule extraction
• Glue rule
• Extract phrases with phrase extraction table
• Reordering model
• Language modelling
Decoding
Tuning
BLEU Score
Advantage of Hierarchical Model
• Hierarchical MT replaces the redundant rules used in phrase-based MT with a single rule.
• It also overcomes a problem of other models: it does not require annotated corpora at all, because the labelling is generated automatically.
• We are working on Hindi-to-English translation. English already has annotated data, and Hindi is labelled automatically by the hierarchical model.
• The grammar used is known as synchronous context-free grammar.
Synchronous Context Free Grammar
• An SCFG is a kind of context-free grammar that generates pairs of strings.
• Example: S -> (I, मैं)
• This rule translates “I” in English to मैं in Hindi.
• This rule consists of terminals only, but rules may consist of both terminals and non-terminals, as in the rule below.
• VP -> (V1 NP2, NP2 V1)
Rule Extraction with SCFG
• The hierarchical model not only reduces the size of the grammar; it also uses the same rules for parsing as well as translation.
Steps performed in rule extraction
• In the hierarchical model, phrases can contain gaps: intervening sub-phrases are replaced by the non-terminal X.
• Synchronization is required between sub-phrases. This model does not require a parser on the Hindi side because all phrases are labelled as X.
This allows us to build useful translation rules such as
X -> (X1 का X2, X2 of X1)
• Some examples
• भारत का प्रधान मंत्री -> Prime Minister of India
• जापान का प्रधान मंत्री -> Prime Minister of Japan
• चीन का वित्त मंत्री -> Finance Minister of China
• भारत का राष्ट्रीय पक्षी -> National bird of India
• The phrase-based model memorises all these phrases separately, but essentially all of them have the same structure, i.e. X -> (X1 का X2, X2 of X1),
• where X1 is India or “भारत” and X2 is Prime Minister or “प्रधान मंत्री”.
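The effect of the single hierarchical rule X -> (X1 का X2, X2 of X1) can be sketched as follows. This is a toy illustration, not the Moses chart decoder: the lexicon is hand-written for the examples above and the function name translate is hypothetical.

```python
# Hand-written toy lexicon covering the example phrases (an assumption).
LEXICON = {
    "भारत": "India",
    "जापान": "Japan",
    "चीन": "China",
    "प्रधान मंत्री": "Prime Minister",
    "वित्त मंत्री": "Finance Minister",
    "राष्ट्रीय पक्षी": "National bird",
}

def translate(phrase):
    """Translate with lexical rules plus X -> (X1 का X2, X2 of X1)."""
    phrase = phrase.strip()
    if phrase in LEXICON:
        return LEXICON[phrase]
    left, sep, right = phrase.partition(" का ")
    if sep:
        # The hierarchical rule swaps the order of the two sub-phrases.
        return f"{translate(right)} of {translate(left)}"
    raise KeyError(f"no rule covers: {phrase}")

for hindi in ["भारत का प्रधान मंत्री", "चीन का वित्त मंत्री"]:
    print(hindi, "->", translate(hindi))
```

Four short lexical entries and one reordering rule replace the separate memorised phrase pairs that a phrase-based table would need for every country and title combination.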
GLUE RULE
• Glue rules facilitate the concatenation of two trees originating from the same non-terminal. Here are the two glue rules:
• S -> (S1 X2, S1 X2)
• S -> (X1, X1)
• These two rules in conjunction can be used to concatenate discontiguous phrases. So, the input to the system is a sentence in Hindi and a set of SCFG rules extracted from the training set.
• To avoid a rule set of unmanageable size and to reduce decoding complexity, we typically set limits on the possible rules:
• At most 2 non-terminal symbols
• At least one but at most 5 words per language
• Span of at most 15 words
3. Working of Syntax Model
• Earlier models did not include any linguistic information in the training data, which produced grammatically incoherent output.
• The persistence of the reordering problem in translated text led to the development of the syntax-based model. In this model, Moses is trained on syntactic phrases on the target side.
• Syntactic information includes root word, word class and POS category. We apply syntactic parsing to the English side in our work.
ADVANTAGES
• Since Hindi and English are syntactically divergent, this model overcomes the reordering problem faced in the phrase-based and hierarchical models.
• Syntax-based MT performs well for structurally divergent languages. Hindi follows SOV order while English follows SVO order.
• This model improves the grammaticality of the resulting sentence.
MODEL
[Figure: the parse tree of "He adores listening to music" is transformed in three steps: reordering of child nodes, insertion of extra words, and translation of the leaf words.]
Translation: वह संगीत सुनने के प्यार करते हैं
Working
• The string-to-tree model accepts a Hindi string as input, searches across multiple candidate English parse trees, and finds the highest-scoring tree.
• Input is a string: व्यक्तिगत जीवन
• Translation rules:
• [SYM][X] personal [NN][X] [FRAG] ||| [SYM][X] व्यक्तिगत [NN][X] [X] ||| 0.0326378 0.6 0.0652757 1 ||| 0-0 1-1 2-2 ||| 0.285714 0.142857 0.142857 |||
• [SYM][X] personal life [FRAG] ||| [SYM][X] व्यक्तिगत जीवन [X] ||| 0.0326378 0.385714 0.0652757 0.6 ||| 0-0 1-1 2-2 ||| 0.285714 0.142857 0.142857 |||
• [SYM][X] personal life [TOP] ||| [SYM][X] व्यक्तिगत जीवन [X] ||| 0.0326378 0.385714 0.0652757 0.6 ||| 0-0 1-1 2-2 ||| 0.285714 0.142857 0.142857 |||
• Decoding by Translation Rules-
• [0..3]: [3..3]=</s> [0..2]=S : S ->S -> S </s> :0-0 : c=0 core=(0,-1,1,0,0,0,0,0,0) 0core=(0,-4,6,-11.5445,-5.99562,-7.46699,-1.60944,1.99979,-16.0431)
• [0..1]: [1..1]= X [0..0]=S : S ->S -> S X :0-0 1-1 : c=0 core=(0,-0,1,0,0,0,0,0.999896,0) 0core=(0,-2,3,-3.35156,-0.916291,-2.43527,0,0.999896,-7.74303)
• [0..0]: [0..0]=<s> : S ->S -> <s> :: c=0 core=(0,-1,1,0,0,0,0,0,0) 0core=(0,-1,1,0,0,0,0,0,0)
• [1..1]: [1..1]=personel : X ->X -> व्यक्तिगत :: c=0 core=(0,-1,1,-3.35156,-0.916291,-2.43527,0,0,0) 0core=(0,-1,1,-3.35156,-0.916291,-2.43527,0,0,-9.44562)
• The target tree it produces is:
(TOP <s> (S (NP personal) (NP (NN life))) </s>)
• Output is a string: personal life
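The translation rules shown on this slide use "|||" as a field separator. The sketch below splits one such line into its parts for inspection. The function name parse_rule and the field names are assumptions made for this example; the line itself is taken from the slide with the Hindi repaired.

```python
def parse_rule(line):
    """Split a '|||'-separated rule line into sides, scores and alignment."""
    fields = [f.strip() for f in line.split("|||")]
    side_a, side_b, scores, alignment = fields[:4]
    return {
        "sides": [side_a.split(), side_b.split()],
        "scores": [float(x) for x in scores.split()],
        "alignment": [
            tuple(int(i) for i in link.split("-")) for link in alignment.split()
        ],
    }

line = ("[SYM][X] personal life [FRAG] ||| [SYM][X] व्यक्तिगत जीवन [X] ||| "
        "0.0326378 0.385714 0.0652757 0.6 ||| 0-0 1-1 2-2 ||| "
        "0.285714 0.142857 0.142857 |||")
rule = parse_rule(line)
print(rule["sides"][0], rule["scores"], rule["alignment"])
```

The last token of each side is the bracketed label of the rule, and the alignment field pairs token positions on the two sides.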
4.Working of Hybrid Translation
• The main disadvantage of Statistical Machine Translation (SMT) is that it only translates phrases that were seen during training.
• Unseen phrases such as named entities are not translated.
• This leads to a low BLEU score. We can improve the BLEU score by translating named entities from an external source.
Working
Preprocessing → Translation by Moses Decoder → Postprocessing
Preprocessing of data:
Moses accepts data in the following format for hybrid translation:
आपको नए <n translation="monastery">आश्रम</n> के निर्माण के लिए कितने धन की आवश्यकता है
Translation by Moses decoder:
We translate normally using the Moses decoder, which is trained on our data. The translation produced by the Moses decoder is:
How much money you need for the construction of the new आश्रम?
Here the word आश्रम is left untranslated.
Post-processing:
The untranslated word can be translated by referring to the XML tags. The output obtained is:
How much money you need for the construction of the new monastery?
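The pre-processing and post-processing steps can be sketched as below. This is an illustration under stated assumptions: the dictionary NE_DICT and the helper names mark_up and post_process are hypothetical, input is assumed to be tokenised on whitespace, and the tag name n follows the slide's example.

```python
from xml.sax.saxutils import quoteattr

# Hypothetical external dictionary of named entities and unseen words.
NE_DICT = {"आश्रम": "monastery"}

def mark_up(sentence, ne_dict):
    """Wrap dictionary words in XML tags carrying their translation."""
    out = []
    for tok in sentence.split():
        if tok in ne_dict:
            out.append(f"<n translation={quoteattr(ne_dict[tok])}>{tok}</n>")
        else:
            out.append(tok)
    return " ".join(out)

def post_process(translation, ne_dict):
    """Replace source words the decoder left untranslated."""
    return " ".join(ne_dict.get(tok, tok) for tok in translation.split())

marked = mark_up("आपको नए आश्रम के निर्माण के लिए", NE_DICT)
fixed = post_process("construction of the new आश्रम ?", NE_DICT)
print(marked)
print(fixed)
```

Because matching is done on whole tokens, punctuation must be separated from words (as the tokenisation step does) for the lookup to succeed.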
Result of Hybrid Translation
• Exclusive: Only the XML-specified translation is used for the input phrase. Any phrases from the phrase table that overlap with that span are ignored.
• Inclusive: The XML-specified translation competes with all the phrase table choices for that span.
• Ignore: The XML-specified translation is ignored completely.
XML mode        Score
xml-exclusive   7.21
xml-inclusive   7.36
xml-ignore      6.18
Syntax Model Parsing Extended
BERKELEY PARSER
We used the Berkeley parser to parse English in our project. Since we had a parser for the English language, we trained our system on the string-to-tree and tree-to-string models.
Input: Economic Services
ENJU PARSER
With a wide-coverage probabilistic HPSG grammar and an efficient parsing algorithm, this parser can effectively analyze syntactic/semantic structures of English sentences and provide a user with phrase structures and predicate-argument structures.
Motivation
• Moses accepts training data for the syntax model in XML format:
• <tree label="NP"> <tree label="DET"> the </tree> <tree label="NN"> cat </tree> </tree>
• Many parsers are available, and each has its own idiosyncratic input and output format. Hence, we need to convert the output of these parsers into the Moses-compatible format for the syntax model. There are 3 wrapper scripts available in the Moses decoder under /scripts/training/wrapper for converting parser output into Moses format:
• parse-en-collins.perl: used with the Collins parser, available from MIT.
• parse-de-bitpar.perl: used with the BitPar parser, available from the University of Munich.
• parse-de-berkeley: used with the Berkeley parser, available from UC Berkeley.
• We used the Enju parser for our experiment, so we wrote a wrapper script to convert Enju parser output into the Moses-compatible format for syntax trees.
Format Conversion-
We designed a program to convert the XML output of the Enju parser to the Moses-compatible XML format. However, Enju and the Penn Treebank (PTB) have different syntactic categories.
Because the output of Enju is based on HPSG, which differs from the annotation policy of the PTB, its tree structures and syntactic categories are often different from those given by PTB-style annotation. Mappings between the two nevertheless provide a clear image of what Enju expresses, so we mapped Enju categories to PTB style for our experiment.
Steps:
1. For every <sentence> tag, start an output string by adding <tree label="TOP">.
2. For every <cons> tag:
   i. Retrieve its CAT value ($CAT_VALUE).
   ii. Retrieve its XCAT value ($XCAT_VALUE).
   iii. If the XCAT value of the CONS element is non-empty:
   iv. Find the corresponding POS tag by comparing it with the mapping table.
   v. Add a new tree tag to the output string by adding <tree label="CONS_POS">, where CONS_POS is the POS category derived from the mapping table.
3. For every <tok> tag:
   i. Retrieve its POS value ($POS_VALUE).
   ii. Add a new tree tag to the output string by adding <tree label="POS">, where POS is the POS category derived from the POS attribute of the tok tag.
4. For every closing </sentence> tag, add a new closing </tree> tag.
5. For every closing </cons> tag, add a new closing </tree> tag.
6. For every closing </tok> tag, add a new closing </tree> tag.
7. All unnecessary attributes are omitted.
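The steps above can be sketched in a few lines. This is not the wrapper script written in the project: the input is a hand-made, simplified Enju-like fragment, the mapping table ENJU_TO_PTB is a small hypothetical excerpt keyed on the CAT value only (XCAT handling is left out), and word text is not XML-escaped.

```python
import xml.etree.ElementTree as ET

# Hypothetical partial mapping from Enju categories to PTB-style labels.
ENJU_TO_PTB = {"S": "S", "NP": "NP", "NX": "NP", "VP": "VP", "VX": "VP", "DP": "DT"}

def convert(node):
    """Recursively rewrite sentence/cons/tok elements as Moses <tree> tags."""
    if node.tag == "sentence":
        label = "TOP"
    elif node.tag == "cons":
        cat = node.get("cat", "")
        label = ENJU_TO_PTB.get(cat, cat)      # fall back to the raw category
    elif node.tag == "tok":
        label = node.get("pos", "X")
    else:
        raise ValueError(f"unexpected tag: {node.tag}")
    parts = []
    if node.text and node.text.strip():
        parts.append(node.text.strip())        # the word inside a <tok>
    parts.extend(convert(child) for child in node)
    return f'<tree label="{label}"> ' + " ".join(parts) + " </tree>"

# Simplified, hand-made input; real Enju output carries many more attributes.
sample = (
    '<sentence><cons cat="NP" xcat="">'
    '<cons cat="DP" xcat=""><tok cat="D" pos="DT">the</tok></cons>'
    '<cons cat="NX" xcat=""><tok cat="N" pos="NN">cat</tok></cons>'
    "</cons></sentence>"
)
moses_tree = convert(ET.fromstring(sample))
print(moses_tree)
```

All attributes other than the label are dropped, as in step 7, and every opening tree tag gets exactly one closing tag.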
Challenges-
• The deep syntactic parser we used was Enju (Miyao and Tsujii, 2005), which is based on HPSG and outputs both (dependency-like) predicate-argument relations (Miyao, 2007) and phrase structure trees (although these do not follow the PTB scheme for phrase structure trees) in an XML format.
• The Berkeley parser is a phrase structure grammar parser based on the PTB grammar.
• The outputs of the two parsers differ in tree structure: Enju's syntactic representation is richer, which makes conversion challenging. The Enju parser produces strictly binary trees, while the Berkeley parser's trees are not restricted to binary branching. The trees also differ in the number of levels and in structure.
• This made the task of converting Enju output to Moses format difficult.
Conclusion -
• We trained the syntax model on the converted Enju output. There was no major effect on the BLEU score.
Result
INTERFACE
Phrase Based Translation
• Input: ये क्षेत्र यमुना पार कहलाते हैं वैसे ये नई दिल्ली से बहुत से पुलों द्वारा भली भांति जुड़े हुए हैं
• Output: it regions are caled yamuna par and they new delhi these are also joined by many bridges from
Hierarchical Based
• Input: ये क्षेत्र यमुना पार कहलाते हैं वैसे ये नई दिल्ली से बहुत से पुलों द्वारा भली भांति जुड़े हुए हैं
• Output: so these regions are caled yamuna par and they from new delhi पुलों by भली भांति जुड़े front are
Syntax Based
• Input: ये क्षेत्र यमुना पार कहलाते हैं वैसे ये नई दिल्ली से बहुत से पुलों द्वारा भली भांति जुड़े हुए हैं
• Output: it caled yamuna par regions are and it from new delhi of the world the very popular from पुलों by भली bridges from are
Corpus
Type            Source
Gyan Nidhi      Downloaded from Joshua
Miscellaneous   PM speech (July 2015), Budget data (2014), Vigyan Prashar magazine
ACL 2005        Made available by C-DAC, Noida
Agriculture     www.pib.gov.in, Govt. of India
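Result of Comparing Models of SMT
[Figure: "Comparison of SMT Models". Bar chart with the corpus on the x-axis (Agriculture, ACL 2005, Gyan Nidhi, Misc.), the models' score on the y-axis (0 to 16), and one bar per model (Phrase, Hierarchical, Syntax ST, Syntax TS). Bar labels as extracted: 3.48, 6.18, 3.61, 3.45, 3.27, 13.8, 4.3, 5.2, 2.93, 10.79, 3.21, 2.9, 1.2, 2.3, 0.9, 1.5.]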
Conclusion
We developed a Hindi-to-English translation system and compared the results obtained by various SMT models. During the course of this project, the phrase-based, hierarchical and syntax-based models were evaluated, and we conclude that the hierarchical model is the best approach for this task. The result was verified on several English-Hindi sentence corpora. The project has been completed and successfully tested.
Future Work
We need to:
• Perform and compare results of the factored model on Moses.
• Find and replace out-of-vocabulary (OOV) words.
• Compare the effect of replacing OOV words on the BLEU score.
• Transliterate unknown words.
• We propose using word2vec for hybrid translation to automate the process of generating dictionaries and phrase tables.
References
• Koehn, P., Och, F. J. and Marcu, D. Statistical Phrase-Based Translation. Information Sciences Institute, Department of Computer Science, University of Southern California.
• Koehn, P. (2004). Statistical significance tests for machine translation evaluation. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP).
• Zens, R. and Ney, H. (2004). Improvements in phrase-based statistical machine translation. In Proceedings of HLT-NAACL 2004.
• Behera, B. Hierarchical Phrase-Based Statistical Machine Translation System. M.Tech. project dissertation under the guidance of Prof. Pushpak Bhattacharyya, Department of Computer Science and Engineering, Indian Institute of Technology, Bombay.
• Koehn, P., Hoang, H., Birch, A., Callison-Burch, C., Federico, M., Bertoldi, N., Cowan, B., Shen, W., Moran, C., Zens, R., Dyer, C., Bojar, O., Constantin, A. and Herbst, E. (2007). Moses: open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the ACL on Interactive Poster and Demonstration Sessions, ACL ’07, pages 177–180, Stroudsburg, PA, USA. Association for Computational Linguistics.
• Chiang, D. (2005). A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, ACL ’05, pages 263–270, Stroudsburg, PA, USA. Association for Computational Linguistics.
• Sinha, R. M. K. and Thakur, A. (2005). Machine translation of bi-lingual hindi-english (hinglish) text. 10th Machine Translation Summit (MT Summit X), Phuket, Thailand, pages 149–156.
• Sachdeva, K., Srivastava, R., Jain, S. and Sharma, D. M. Hindi to English Machine Translation: Using Effective Selection in Multi-Model SMT. Language Technologies Research Center, International Institute of Information Technology, Hyderabad.
• Ahmed, A. and Hanneman, G. Syntax-Based Statistical Machine Translation: A Review.
• Aswani, N. and Gaizauskas, R. (2005). A hybrid approach to align sentences and words in English–Hindi parallel corpora. In Proceedings of the ACL Workshop on Building and Using Parallel Texts, pages 57–64, Ann Arbor, Michigan. Association for Computational Linguistics.
• Jurafsky, D. and Martin, J. H. (2008). Speech and Language Processing (2nd edition). Prentice Hall.