Translating from Morphologically Complex Languages: A Paraphrase-Based Approach Preslav Nakov & Hwee Tou Ng


ACL’2011 : Preslav Nakov & Hwee Tou Ng

Overview

Statistical machine translation (SMT) systems typically assume that the word is the basic token unit of translation.

Problem: Data sparseness for languages with rich morphology.

Our solution: A paraphrase-based approach to translating morphological variants.


Introduction


Morphology in Statistical Machine Translation (SMT)

Traditionally, the word has been the basic token unit of translation. The earliest SMT models (the IBM models) were proposed for French and English, which have little morphology.

Most subsequent models remain word-atomic: phrase-based, hierarchical, treelet, and syntactic.


Treating the word as an atomic token unit of translation is:

Fine for languages with little morphology: English, French, Spanish, and Chinese (almost no morphology).

Inadequate for morphologically rich languages: Arabic, Turkish, Finnish (word inflections, word-attached clitics); German (compounds).


The Case of Malay

Malay has rich derivational morphology but is poor in:

word inflections (unlike Arabic, Turkish, Finnish)

word-attached clitics (unlike Arabic, Turkish, Finnish)

concatenated compounds (unlike German, Finnish)

Problem: classic methods do not work for Malay.

Solution: paraphrasing techniques at the word level, phrase level, and sentence level.


Related Work


Two general lines of research:

1. Inflected forms of the same word are used as equivalence classes or as possible alternatives in translation: stemming (Yang and Kirchhoff, 2006), lemmatization (Al-Onaizan et al., 1999; Goldwater and McClosky, 2005; Dyer, 2007), direct clustering (Talbot and Osborne, 2006), and factored models (Koehn and Hoang, 2007).

2. Word segmentation: compound words (Koehn and Knight, 2003; Yang and Kirchhoff, 2006), clitics attached to the preceding word (Habash and Sadat, 2006), and morpheme sequence representations (Lee, 2004; Dyer et al., 2008; Dyer, 2009).

Neither works well for Malay: it has very little inflectional morphology, if any; its compounds are not concatenated; and its clitics are rare.


Malay Morphology


The Malay Language

Malay is an Austronesian language with ~180M speakers. It is official in Malaysia, Indonesia, Singapore, and Brunei, and has two major standard versions (mutually intelligible):

Bahasa Malaysia (lit. ‘language of Malaysia’)

Bahasa Indonesia (lit. ‘language of Indonesia’).


Malay is an agglutinative language with very rich derivational morphology but nearly non-existent inflectional morphology.

Inflectionally, Malay is like Chinese:

no grammatical gender, number, or tense,

verbs are not marked for person, etc.


Malay Morphology

New word formation processes: affixation, compounding, reduplication.

Other morphological processes: clitic attachment.


New Word Formation Processes in Malay

Affixation: attaching affixes, which are not words, to a word.

prefixes (e.g., ajar/‘teach’ → pelajar/‘student’)

suffixes (e.g., ajar → ajaran/‘teachings’)

circumfixes (e.g., ajar → pengajaran/‘lesson’)

infixes (e.g., gigi/‘teeth’ → gerigi/‘toothed blade’)

Compounding: putting two or more existing words together, e.g., kereta/‘car’ + api/‘fire’ → keretapi or kereta api. Compounds are typically not concatenated.

Reduplication: word repetition, e.g., pelajar-pelajar/‘students’.


Clitics in Malay

Examples: duduk/‘sit down’ + lah → duduklah/‘please, sit down’; kereta + nya → keretanya/‘his car’.

Notes: Clitics are not affixes. Clitic attachment is NOT a word inflection process and NOT a word derivation process.


Translating Malay Morphology: A Paraphrase-Based Approach to Translating from Malay


Paraphrase-Based Approach to Morphology

Given a complex Malay word, we generate (a) morphologically simpler words from which it can be derived and (b) alternative word segmentations.

We treat these forms as potential paraphrases of the original word.

We use paraphrasing techniques at three levels: word level, phrase level, and sentence level.


Generating Simpler Morphological Variants

Given a complex Malay word, we generate:

1. words obtainable by affix stripping, e.g., pelajaran → pelajar, ajaran, ajar

2. words that are part of a compound word, e.g., kerjasama → kerja, sama

3. words appearing on either side of a dash, e.g., adik-beradik → adik, beradik

4. words without clitics, e.g., keretanya → kereta

5. clitic-segmented word sequences, e.g., keretanya → kereta nya

6. dash-segmented wordforms, e.g., aceh-nias → aceh - nias

7. combinations of the above.

Examples:

adik-beradiknya → adik-beradiknya, adik-beradik nya, adik-beradik, beradiknya, beradik nya, beradik, adik nya, adik

berpelajaran → berpelajaran, pelajaran, pelajar, ajaran, ajar
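The variant-generation rules above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the affix and clitic lists are tiny and not exhaustive, the lexicon is a toy stand-in for a real Malay word list, and the function names are hypothetical. It is not the authors' implementation, and it returns the variants without the original word itself.

```python
# Illustrative resources only (assumptions, not the authors' actual lists).
PREFIXES = ("ber", "pel", "pe", "mem", "me", "di")
SUFFIXES = ("kan", "an", "i")
CLITICS = ("nya", "lah", "kah")
LEXICON = {"ajar", "ajaran", "pelajar", "pelajaran", "kereta",
           "adik", "beradik", "kerja", "sama"}


def strip_affixes(word):
    """Return lexicon words reachable by repeatedly stripping one affix."""
    found, seen, stack = set(), {word}, [word]
    while stack:
        current = stack.pop()
        candidates = [current[len(p):] for p in PREFIXES if current.startswith(p)]
        candidates += [current[:-len(s)] for s in SUFFIXES if current.endswith(s)]
        for cand in candidates:
            if len(cand) >= 3 and cand not in seen:
                seen.add(cand)
                stack.append(cand)
                if cand in LEXICON:
                    found.add(cand)
    return found


def simpler_forms(base):
    """Affix-stripped forms, compound parts, and dash parts of one token."""
    forms = {base} | strip_affixes(base)
    for i in range(1, len(base)):  # compound split: both halves must be known
        if base[:i] in LEXICON and base[i:] in LEXICON:
            forms |= {base[:i], base[i:]}
    if "-" in base:
        parts = [p for p in base.split("-") if p]
        forms.add(" - ".join(parts))  # dash-segmented word sequence
        for part in parts:
            forms |= {part} | strip_affixes(part)
    return forms


def variants(word):
    """Candidate paraphrases of `word`; multi-token variants contain spaces."""
    out = set(simpler_forms(word))
    for clitic in CLITICS:
        if word.endswith(clitic) and len(word) > len(clitic) + 2:
            for form in simpler_forms(word[: -len(clitic)]):
                out.add(form)                # clitic dropped
                out.add(f"{form} {clitic}")  # clitic segmented off
    out.discard(word)
    return out
```

With these toy resources, `variants("keretanya")` yields `kereta` and `kereta nya`. Because the rules are purely string-based, a real system would need a full lexicon to filter out spurious stems.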


Word-Level Paraphrases

Given a dev/test sentence:

1. We generate a list of variants {w’} for each Malay word w.

2. We add them to the sentence, thus forming a lattice.
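The two steps can be sketched in Python as follows. This is a minimal illustration under assumptions: the arc representation `(source node, destination node, token)` and the names `build_lattice` and `variants_of` are hypothetical, and arc weights are left out.

```python
def build_lattice(sentence, variants_of):
    """Return the arcs (src, dst, token) of a word lattice for `sentence`.

    Nodes 0..n mark the original word boundaries. A multi-token variant gets
    fresh intermediate nodes so that it still spans exactly one original word.
    `variants_of` maps a word to an iterable of variant strings.
    """
    words = sentence.split()
    arcs = []
    next_node = len(words) + 1  # first unused node id
    for i, word in enumerate(words):
        arcs.append((i, i + 1, word))  # the original word is always kept
        for variant in sorted(variants_of(word)):
            tokens = variant.split()
            src = i
            for k, token in enumerate(tokens):
                if k == len(tokens) - 1:
                    dst = i + 1          # rejoin the main path
                else:
                    dst = next_node      # new intermediate node
                    next_node += 1
                arcs.append((src, dst, token))
                src = dst
    return arcs


table = {"keretanya": {"kereta", "kereta nya"}}
arcs = build_lattice("dia mahu membeli keretanya .", lambda w: table.get(w, set()))
```

Here the two-token variant `kereta nya` is routed through an extra node between positions 3 and 4. Node ids in this sketch are not topologically ordered, so a decoder input format may require renumbering.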


The lattice requires a weight for each arc. We set the weight to 1.0 for the original word w. For each paraphrase w’ of w, we use the probability Pr(w’|w), estimated using word-level pivoting over English.
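The formula shown on the slide is not preserved in this transcript. Assuming the standard pivoting formulation, in which the paraphrase probability is marginalized over the English words e that w can translate to, it would read:

```latex
\Pr(w' \mid w) = \sum_{e} \Pr(w' \mid e)\, \Pr(e \mid w)
```

Here e ranges over English words, and both conditional probabilities would be estimated from the word alignments of the training bi-text. The exact estimator the authors used is not stated here.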


Estimating the probability Pr(w’|w):
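The estimation details from the slide are not preserved in this transcript. As a hedged sketch, the function below computes pivoting probabilities from word-alignment links, assuming the standard formulation Pr(w’|w) = Σ_e Pr(w’|e) Pr(e|w) with both conditionals taken as plain relative frequencies. The function name and the input format are hypothetical.

```python
from collections import Counter, defaultdict


def pivot_probabilities(aligned_pairs):
    """Estimate Pr(w'|w) = sum_e Pr(w'|e) * Pr(e|w) from alignment links.

    `aligned_pairs` is an iterable of (malay_word, english_word) links.
    Both conditionals are relative frequencies (an assumption of this sketch).
    """
    pair_counts = Counter(aligned_pairs)
    malay_totals, english_totals = Counter(), Counter()
    for (m, e), c in pair_counts.items():
        malay_totals[m] += c
        english_totals[e] += c

    by_english = defaultdict(dict)  # e -> {malay word: Pr(malay word | e)}
    for (m, e), c in pair_counts.items():
        by_english[e][m] = c / english_totals[e]

    probs = defaultdict(lambda: defaultdict(float))  # w -> w' -> Pr(w'|w)
    for (w, e), c in pair_counts.items():
        p_e_given_w = c / malay_totals[w]
        for w2, p_w2_given_e in by_english[e].items():
            probs[w][w2] += p_w2_given_e * p_e_given_w
    return probs


links = [("keretanya", "car")] * 3 + [("kereta", "car"), ("keretanya", "his")]
probs = pivot_probabilities(links)
weight = probs["keretanya"]["kereta"]  # candidate arc weight for the variant
```

In the lattice, only the entries where w’ is one of the generated variants of w would be used as arc weights; the original word keeps weight 1.0.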


Sentence-Level Paraphrases

Word-level paraphrases in the dev/test data need matching phrases in the phrase table.

We therefore paraphrase the training data at the sentence level: for each paraphrasable word w and for each of its paraphrases w’, we create a version of the sentence with w substituted by w’.

We pair each paraphrased sentence with the original target.

Paraphrased bi-text:

dia mahu membeli keretanya . || she wants to buy his car .
dia mahu beli keretanya . || she wants to buy his car .
dia mahu membeli kereta . || she wants to buy his car .
dia mahu membeli kereta nya . || she wants to buy his car .
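The substitution procedure can be sketched in a few lines of Python. The function name and the `variants_of` callback are hypothetical, and the sketch substitutes one word at a time, which is how the example bi-text above is formed.

```python
def paraphrase_bitext(bitext, variants_of):
    """Yield (source, target) pairs for a paraphrased bi-text.

    Each original pair is emitted first, followed by one copy per
    (word, variant) substitution, always paired with the original target.
    """
    for source, target in bitext:
        yield source, target
        words = source.split()
        for i, word in enumerate(words):
            for variant in sorted(variants_of(word)):
                new_words = words[:i] + variant.split() + words[i + 1:]
                yield " ".join(new_words), target


table = {"membeli": {"beli"}, "keretanya": {"kereta", "kereta nya"}}
bitext = [("dia mahu membeli keretanya .", "she wants to buy his car .")]
paraphrased = list(paraphrase_bitext(bitext, lambda w: table.get(w, set())))
```

With this toy variant table, `paraphrased` contains the four sentence pairs of the example, in the same order.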


We build two phrase tables:

T_orig from the original training bi-text

T_par from the paraphrased bi-text

We merge these tables:

1. Keep all entries from T_orig.

2. Add those phrase pairs from T_par that are not in T_orig.

3. Add extra features:

F1: 1 if the entry came from T_orig, 0.5 otherwise.

F2: 1 if the entry came from T_par, 0.5 otherwise.

F3: 1 if the entry was in both tables, 0.5 otherwise.

The feature weights are set using MERT, and the number of features is optimized on the development set.
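The merging steps can be sketched as follows. The dictionary representation of a phrase table and the function name are assumptions made for illustration, and "came from" is read here as "is present in" the respective table. MERT tuning is not shown.

```python
def merge_phrase_tables(t_orig, t_par):
    """Merge two phrase tables given as {(src, tgt): [scores...]} dicts.

    Entries of t_orig keep their scores; entries found only in t_par are
    added. Three indicator-style features (F1, F2, F3) are appended.
    """
    merged = {}
    pairs = list(t_orig) + [p for p in t_par if p not in t_orig]
    for pair in pairs:
        in_orig, in_par = pair in t_orig, pair in t_par
        scores = t_orig[pair] if in_orig else t_par[pair]
        f1 = 1.0 if in_orig else 0.5             # present in T_orig
        f2 = 1.0 if in_par else 0.5              # present in T_par
        f3 = 1.0 if in_orig and in_par else 0.5  # present in both
        merged[pair] = list(scores) + [f1, f2, f3]
    return merged


t_orig = {("kereta", "car"): [0.5], ("dia", "she"): [0.4]}
t_par = {("kereta", "car"): [0.3], ("kereta nya", "his car"): [0.2]}
merged = merge_phrase_tables(t_orig, t_par)
```

The value 0.5 rather than 0 follows the slide. Choosing how many of the three features to keep would be a separate model-selection step on the development set.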


Phrase-Level Paraphrases

We further augment the phrase table with an extra feature, which is calculated using phrase-level pivoting:

1, for phrase pairs coming from T_orig

max_p Pr(p’|p), for phrase pairs coming from T_par

where p’ is a paraphrase of some original Malay phrase p.
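A minimal sketch of this feature, assuming the phrase-level pivoting probabilities have already been computed and are stored in a dictionary keyed by (p, p’). The function name, the data layout, and the fallback value 0.0 for phrases with no known source are all assumptions of this sketch.

```python
def pivot_feature(src_phrase, in_orig, paraphrase_prob):
    """Extra phrase-table feature for one entry.

    `paraphrase_prob` maps (p, p_prime) -> Pr(p_prime | p) for original
    Malay phrases p; it is assumed to be precomputed by phrase-level pivoting.
    """
    if in_orig:
        return 1.0
    scores = [prob for (p, p_prime), prob in paraphrase_prob.items()
              if p_prime == src_phrase]
    # Fallback of 0.0 is an assumption; a real system may prefer a small floor.
    return max(scores, default=0.0)


toy_probs = {("keretanya", "kereta"): 0.2, ("keretaku", "kereta"): 0.6}
feature = pivot_feature("kereta", False, toy_probs)  # best source phrase wins
```

Taking the maximum over p means an entry from the paraphrased table is scored by its most likely original phrase.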


Experiments and Evaluation


Data

Training bi-text: 350K sentence pairs (English: 10.4M words; Malay: 9.7M words)

Development bi-text: 2,000 sentence pairs (English: 63.4K words; Malay: 58.5K words)

Testing bi-text: 1,420 sentences (Malay: 28.8K words; English: 32.8K, 32.4K, and 32.9K words in 3 reference translations)

Language model (LM): 49.8M English words


Evaluation Results: BLEU


Detailed BLEU

Improvement for all n-grams used in BLEU


Evaluation With 5 Measures

Consistent improvement for 5 measures


Example Translations


Conclusion


Presented a novel approach to translating from a morphologically complex language that uses paraphrases at three levels of translation:

word level

phrase level

sentence level

Demonstrated the potential of the approach on Malay, which is derivationally rich but has almost no inflectional morphology.


Future Work

Improve the paraphrasing models: use a richer sense similarity model that combines monolingual and bilingual similarity (Chen et al., 2010).

Try phrase table paraphrasing instead of sentence-level paraphrasing (Nakov, 2008).

Try other morphologically complex languages and other SMT models.

The presented work is supported by research grant POD0713875.