Does Syntactic Knowledge help English- Hindi SMT ? Avinesh. PVS. K. Taraka Rama, Karthik Gali

Does Syntactic Knowledge help English-Hindi SMT ?

Avinesh. PVS.

K . Taraka Rama,

Karthik Gali

Statistical Machine Translation MOSES (Koehn et al., 2007)

Started for European language pairs Now being used for linguistically distant pairs

English-Arabic English-Chinese

Surging interest in English-Hindi SMT Simple Syntactic and Morphological Processing

R. Ananthakrishnan et al., 2008 Global Lexical Selection and Sentence Reconstruction

S. Venkatapathy and S. Bangalore, 2006

Evaluated using BLEU score

Observed Problems with English-Hindi SMT

Low BLEU scores recorded for small data sets

Linguistically distant languages Morphological differences leading to data sparseness Problem of unknown words Reordering problems Lack of huge language models Quality of the reference translation BLEU score not directly proportional to the quality of the

translation

Our Approach

Proposed and Tested Solutions

Morphological differences leading to data sparseness Use stemming and dictionary based techniques

Problem of unknown words NER techniques and transliteration

Reordering issues Prior reordering of English side using transfer rules

Lack of domain-specific huge language models Using huge monolingual corpus for generating Language

Models Remove erroneous phrases translations from the phrase table

Data Sets

EILMT parallel corpus Training Set : 7000 sentences Development Set : 500 sentences Test Set : 500 sentences

IIIT-TIDES data set Training Set : 50,000 sentence Development Set : 1000 sentences Test Set : 1000 sentences

Reordering: Transfer Grammar

Transfer rules learned using Dependency tree of English sentence

Libin’s parser used for parsing the English side POS tags from both the sides

Example Rules

IN˜1_NN&_VB˜2 ==> NN&_IN˜1_VB˜2

English side of the training and test corpus are reordered

Reordering: Learning rules

Word-Alignment using GIZA++

For each node Consider child nodes

Check relative positions of node and child nodes in Source

Check relative positions of projections of word and child nodes in Target

Combine this information to form transfer rule

Reordering: Example

w2//NN

w1//IN w3//VBZ

...... t2 ....... t1.........t3...........

Source node

Alignment

Target Sentence

Learnt Rule: IN˜1_NN&_VB˜2 ==> NN&_IN˜1_VB˜2

Reordering: Simple Syntactic Rules

Movement of the verb in a sentence

30 % (~2100) sentences are compound and complex sentences Based on POS tags No deep syntactic information (like parsing) Handcrafted rules

Tag the corpus with Complex and Compound Sentence tags and, but, because, or mark the conjunctions Move the verbs on the both sides of the connective (e.g. and, but) to

the end of the conjuncts

Handling Unknown words

Dictionary Extract the root from the word on English side Generate TAM information Translate the root into Hindi using dictionary Map the TAM information to Hindi TAM Generate the Hindi word using root and TAM information

Transliteration Transliterate the NE Tool currently not available

Experiments with phrase table Observed that end-of-sentence markers such as (.) are

aligned wrongly Remove the phrase translations with (.)s aligned to words Found 10,000 of them (~2.5% of the total phrase table) Train and tune the system with EOS markers removed and

stored elsewhere Add the EOS markers after running the Decoder Evaluate the system Observed that the BLEU score increases The quality of the translation improves

Effect of language models

Experimented with 7K tourism corpus

No conclusion can be drawn Hugely varying results Cannot explain the variation in the results

Results - I

Without Tuning System

18.09Phrase Removal

18.35Additional Language Model

7K Hindi corpus

17.75Dictionary

17.70Reordering with TG

17.70Baseline

Results - II

With Tuning System

22.60Phrase Removal

20.68Additional Language Model

7K Hindi corpus

-Dictionary

-Reordering with TG

20.18Baseline

Conclusions & Future Work

Incorporate syntactic phrase into SMT Long Distance Reordering Models are required Current systems handle local reordering Robust TG rules for reordering the source side Build huge language models

ReferencesP. Koehn and H. Hoang. 2007. Factored translation models. In Proc. of the 2007 Conference on Empirical Methods in Natural Language Processing(EMNLP/Co-NLL).P. Koehn, F.J. Och, and D. Marcu. 2003. Statistical phrase-based translation. In Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on HumanLanguage Technology-Volume 1, pages 48–54. Association for Computational Linguistics Morristown, NJ, USA.P. Koehn, H. Hoang, A. Birch, C. Callison-Burch, M. Federico, N. Bertoldi, B. Cowan, W. Shen, C. Moran, R. Zens, et al. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. In ANNUAL MEETING-ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, volume 45, page 2.S. Kumar, W. Byrne, JOHNS HOPKINS UNIV BALTIMORE MD CENTER FOR LANGUAGE, and SPEECH PROCESSING (CLSP. 2004. MinimumBayes-Risk Decoding for Statistical Machine Translation.

ReferencesFranz Josef Och and Hermann Ney. 2003. A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1):19–51.F.J. Och. 2003. Minimum error rate training in statistical machine translation. In Proceedings of the 41st Annual Meeting on Association for ComputationalLinguistics-Volume 1, pages 160–167. Association for Computational Linguistics Morristown, NJ, USA.K. Papineni, S. Roukos, T. Ward, and WJ Zhu. 2001. BLEU: a method for automatic evaluation of MT. Research Report, Computer Science RC22176(W0109-022), IBM Research Division, TJ Watson Research Center, 17.L. Shen. 2006. STATISTICAL LTAG PARSING. Ph.D. thesis, University of Pennsylvania.A. Stolcke. 2002. Srilm-an extensible language modeling toolkit, international conference spoken language processing. SRI, Denver, Colorado, Tech.Rep. S. Venkatapathy and S. Bangalore. Three models for discriminative machine translation using Global Lexical Selection and Sentence Reconstruction. InSSST.

Documents

Does Syntactic Knowledge help English- Hindi SMT ? Avinesh. PVS. K. Taraka Rama, Karthik Gali