Upload
brice-summers
View
215
Download
2
Embed Size (px)
Citation preview
Does Syntactic Knowledge help English-Hindi SMT ?
Avinesh. PVS.
K . Taraka Rama,
Karthik Gali
Statistical Machine Translation MOSES (Koehn et al., 2007)
Started for European language pairs Now being used for linguistically distant pairs
English-Arabic English-Chinese
Surging interest in English-Hindi SMT Simple Syntactic and Morphological Processing
R. Ananthakrishnan et al., 2008 Global Lexical Selection and Sentence Reconstruction
S. Venkatapathy and S. Bangalore, 2006
Evaluated using BLEU score
Observed Problems with English-Hindi SMT
Low BLEU scores recorded for small data sets
Linguistically distant languages Morphological differences leading to data sparseness Problem of unknown words Reordering problems Lack of huge language models Quality of the reference translation BLEU score not directly proportional to the quality of the
translation
Our Approach
Proposed and Tested Solutions
Morphological differences leading to data sparseness Use stemming and dictionary based techniques
Problem of unknown words NER techniques and transliteration
Reordering issues Prior reordering of English side using transfer rules
Lack of domain-specific huge language models Using huge monolingual corpus for generating Language
Models Remove erroneous phrases translations from the phrase table
Data Sets
EILMT parallel corpus Training Set : 7000 sentences Development Set : 500 sentences Test Set : 500 sentences
IIIT-TIDES data set Training Set : 50,000 sentence Development Set : 1000 sentences Test Set : 1000 sentences
Reordering: Transfer Grammar
Transfer rules learned using Dependency tree of English sentence
Libin’s parser used for parsing the English side POS tags from both the sides
Example Rules
IN˜1_NN&_VB˜2 ==> NN&_IN˜1_VB˜2
English side of the training and test corpus are reordered
Reordering: Learning rules
Word-Alignment using GIZA++
For each node Consider child nodes
Check relative positions of node and child nodes in Source
Check relative positions of projections of word and child nodes in Target
Combine this information to form transfer rule
Reordering: Example
w2//NN
w1//IN w3//VBZ
...... t2 ....... t1.........t3...........
Source node
Alignment
Target Sentence
Learnt Rule: IN˜1_NN&_VB˜2 ==> NN&_IN˜1_VB˜2
Reordering: Simple Syntactic Rules
Movement of the verb in a sentence
30 % (~2100) sentences are compound and complex sentences Based on POS tags No deep syntactic information (like parsing) Handcrafted rules
Tag the corpus with Complex and Compound Sentence tags and, but, because, or mark the conjunctions Move the verbs on the both sides of the connective (e.g. and, but) to
the end of the conjuncts
Handling Unknown words
Dictionary Extract the root from the word on English side Generate TAM information Translate the root into Hindi using dictionary Map the TAM information to Hindi TAM Generate the Hindi word using root and TAM information
Transliteration Transliterate the NE Tool currently not available
Experiments with phrase table Observed that end-of-sentence markers such as (.) are
aligned wrongly Remove the phrase translations with (.)s aligned to words Found 10,000 of them (~2.5% of the total phrase table) Train and tune the system with EOS markers removed and
stored elsewhere Add the EOS markers after running the Decoder Evaluate the system Observed that the BLEU score increases The quality of the translation improves
Effect of language models
Experimented with 7K tourism corpus
No conclusion can be drawn Hugely varying results Cannot explain the variation in the results
Results - I
Without Tuning System
18.09Phrase Removal
18.35Additional Language Model
7K Hindi corpus
17.75Dictionary
17.70Reordering with TG
17.70Baseline
Results - II
With Tuning System
22.60Phrase Removal
20.68Additional Language Model
7K Hindi corpus
-Dictionary
-Reordering with TG
20.18Baseline
Conclusions & Future Work
Incorporate syntactic phrase into SMT Long Distance Reordering Models are required Current systems handle local reordering Robust TG rules for reordering the source side Build huge language models
ReferencesP. Koehn and H. Hoang. 2007. Factored translation models. In Proc. of the 2007 Conference on Empirical Methods in Natural Language Processing(EMNLP/Co-NLL).P. Koehn, F.J. Och, and D. Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on HumanLanguage Technology-Volume 1, pages 48–54. Association for Computational Linguistics Morristown, NJ, USA.P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, et al. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In ANNUAL MEETING-ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, volume 45, page 2.S. Kumar, W. Byrne, JOHNS HOPKINS UNIV BALTIMORE MD CENTER FOR LANGUAGE, and SPEECH PROCESSING (CLSP. 2004. MinimumBayes-Risk Decoding for Statistical Machine Translation.
ReferencesFranz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.F.J. Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for ComputationalLinguistics-Volume 1, pages 160–167. Association for Computational Linguistics Morristown, NJ, USA.K. Papineni, S. Roukos, T. Ward, and WJ Zhu. 2001. BLEU: a method for automatic evaluation of MT. Research Report, Computer Science RC22176(W0109-022), IBM Research Division, TJ Watson Research Center, 17.L. Shen. 2006. STATISTICAL LTAG PARSING. Ph.D. thesis, University of Pennsylvania.A. Stolcke. 2002. Srilm-an extensible language modeling toolkit, international conference spoken language processing. SRI, Denver, Colorado, Tech.Rep. S. Venkatapathy and S. Bangalore. Three models for discriminative machine translation using Global Lexical Selection and Sentence Reconstruction. InSSST.