Automatic Transliteration of Proper Nouns from Arabic to English

Mehdi M. Kashani, Fred Popowich, Anoop Sarkar
Simon Fraser University
Vancouver, BC
July 22, 2007

Second Workshop on Computational Approaches to Arabic Script-based Languages

Overview

• Problem Definition and Challenges
• Related Work
• Our Approach
• Evaluation
• Discussion

Transliteration

• Translation tools facilitate dialogue across cultures.
  • source language → target language
• Transliteration is a subtask dealing with transcribing a word written in one writing system into another writing system.
  • Forward Transliteration
    • محمد → Mohammed, Mohammad, Mohamed, Muhammad, …
  • Backward Transliteration
    • روبرت → Robert
• Our task: Arabic to English (for machine translation)

Challenges

• Not a 1-to-1 relationship
  • کاترين can be the equivalent for both Catherine and Katharine.
  • Context can disambiguate: Katharine Hepburn.
• Lack of diacritics in Arabic writings
  • Long vowels are always explicitly written.
    • ا ی و
  • Short vowels are omitted in writings.
    • محمد (with diacritics: مُحَمَّد)
    • Mohammed → Mhmmd
• Lack of certain sounds in Arabic
  • Popowich → Bobowij
• Different pronunciations based on the letter position in the word.
  • How is ي pronounced at the beginning?
  • How is ي usually pronounced in the middle or at the end?

Convention

• Cursive script, but not always.
  • م ی ه ا ر ب ا ابراهيم
• Right to left
  • From now on, Arabic words are shown letter by letter and from left to right.
  • م ی ه ا ر ب ا ابراهيم e b r a h i m


Related Work

• Stalls and Knight (1998): Arabic to English using a noisy channel model for phonemes.
• Al-Onaizan and Knight (2002): combining phonetic- and spelling-based methods.
  • They show that a spelling-based approach works better than a phonetic approach.
• Using parallel corpora (Samy et al., 2005) or comparable corpora (Sproat et al., 2006; Klementiev and Roth, 2006) to discover the transliterations.
  • Not very useful for the machine translation task.


Our Approach

• Consists of three phases
  • Phase 1 (generative): ignore diacritics and simply turn the Arabic letters into English letters.
    • م ح م د → m h mm d
  • Phase 2 (generative): use the best candidates from phase 1 to guess the omitted short vowels.
    • م ح م د & m h mm d → mo ha mm d
  • Phase 3 (comparative): compare the best candidates from phase 2 with entries in a monolingual dictionary.
    • mo ha mm d → mohammd → mohammed, muhammed, …

Training Data Preparation

• Extract name pairs from two different sources
  • Named entities annotated in the LDC Arabic Treebank 3
  • Arabic-English parallel news corpus tagged by an entity tagger
• In total, 9660 pairs are prepared.

Tools

• GIZA++ is used for alignment
  • Implementation of IBM Model 4
  • Output files are used to rearrange letters
  • Alignment score is used to filter out noise
• Cambridge Language Model Toolkit
• For us to use these tools…
  • our words are treated as "sentences"
  • our letters are treated as "words"

Preprocessing

• Noise Filtering
  • GIZA++ is run on the character-level training data
  • Bad pairs have low alignment scores and are filtered out
  • the 9660 pairs are reduced to 4255 pairs
• Normalizing the training data
  • Convert names to lower case.
  • Put space between word letters.
  • Add prefix (B) and suffix (E) to names.
  • example: if we were actually dealing with English
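The three normalization steps can be sketched in a few lines of Python. This is only an illustration: the function name `normalize_name` is hypothetical, and gluing B to the first letter and E to the last letter is an assumption suggested by tokens such as "Bm" and "dE" on the later lattice slides, not a documented format.

```python
def normalize_name(name: str) -> str:
    """Lower-case a name, separate its letters, and mark its boundaries."""
    # Lower-case the name and keep one token per letter.
    letters = [ch for ch in name.lower() if not ch.isspace()]
    if not letters:
        return ""
    # Assumption: B is attached to the first letter and E to the last one.
    letters[0] = "B" + letters[0]
    letters[-1] = letters[-1] + "E"
    # Letters are separated by spaces so that alignment tools see them as "words".
    return " ".join(letters)


print(normalize_name("Mohammed"))
```

With this convention each name becomes a "sentence" of letter tokens, which is the form GIZA++ and the language model toolkit expect here.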

Preprocessing

• Run GIZA++ with Arabic as the source and English as the target.
  • the most frequent sequences of English letters aligned to the same Arabic letter are added to the alphabet.
• Apply the new alphabet to the training data.
  • Example (aligned to م ح م د): m o h a m m e d → m o h a mm e d
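Applying the extended alphabet amounts to re-tokenizing each spaced name so that the new multi-letter symbols become single tokens. The sketch below assumes a greedy longest-match strategy and a hypothetical helper name `apply_alphabet`; the slides do not say how the merge is carried out.

```python
def apply_alphabet(spaced_name: str, multi_letter_symbols: set[str]) -> str:
    """Merge adjacent letters that form a symbol of the extended alphabet."""
    letters = spaced_name.split()
    longest = max((len(s) for s in multi_letter_symbols), default=1)
    merged = []
    i = 0
    while i < len(letters):
        # Assumption: try the longest possible merge first, then shorter ones.
        for width in range(min(longest, len(letters) - i), 0, -1):
            candidate = "".join(letters[i:i + width])
            if width == 1 or candidate in multi_letter_symbols:
                merged.append(candidate)
                i += width
                break
    return " ".join(merged)


print(apply_alphabet("m o h a m m e d", {"mm", "sh"}))
```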

Phase 1

• Run GIZA++ with Arabic as the source and English as the target.
• Remove English letters aligned to null from the training set.
  • Example: m o h a mm e d aligned to م ح م د
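As a rough illustration of the null-removal step, the sketch below represents an alignment as a list of (English symbol, Arabic index or `None`) pairs. This data structure and the name `drop_null_aligned` are hypothetical and are not the GIZA++ output format; the example alignment assumes that the vowels o, a, e are the null-aligned letters.

```python
from typing import Optional

# Each entry pairs an English symbol with the index of the Arabic letter it is
# aligned to, or None when the symbol is aligned to null (hypothetical format).
Alignment = list[tuple[str, Optional[int]]]


def drop_null_aligned(alignment: Alignment) -> list[str]:
    """Keep only the English symbols that are aligned to some Arabic letter."""
    return [symbol for symbol, arabic_index in alignment if arabic_index is not None]


example: Alignment = [
    ("m", 0), ("o", None), ("h", 1), ("a", None), ("mm", 2), ("e", None), ("d", 3),
]
print(drop_null_aligned(example))
```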

Phase 1

• Translation Model: run GIZA++ with English as the source and Arabic as the target.
  • gives $P(a_i \mid e_j)$
• Language Model: run the Cambridge LM toolkit on the English training set.
  • gives $P(e_j \mid e_{j-1})$
• Use unigram and bigram models for Viterbi training and a trigram model for rescoring.

$\hat{E} = \arg\max_E P(A \mid E)\, P(E)$

$A = a_0 \dots a_I$, $E = e_0 \dots e_J$

Phase 1

• Beam Search Decoding is used.
  • Relative Threshold Pruning.
• k best candidates are returned.

[Figure: decoding lattice over م ح م د, listing candidate English letters for each Arabic letter; name-initial candidates carry the prefix B and name-final candidates carry the suffix E.]
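A minimal sketch of beam search with relative threshold pruning is given below. It is not the decoder used in this work: it assumes exactly one English symbol per Arabic letter, a bigram language model, and a fixed floor probability for unseen bigrams. The tables `TOY_TRANS` and `TOY_BIGRAM` hold invented numbers for illustration only, and all names in the block are hypothetical.

```python
import math

# Invented toy tables: TOY_TRANS[a][e] stands for P(a|e),
# TOY_BIGRAM[(prev, e)] stands for P(e|prev).
TOY_TRANS = {
    "م": {"m": 0.6, "mm": 0.3, "n": 0.1},
    "ح": {"h": 0.9, "a": 0.1},
    "د": {"d": 0.9, "dh": 0.1},
}
TOY_BIGRAM = {
    ("<s>", "m"): 0.5, ("m", "h"): 0.4, ("h", "mm"): 0.3,
    ("h", "m"): 0.3, ("mm", "d"): 0.5, ("m", "d"): 0.2,
}


def beam_search(arabic, trans, bigram, k=3, rel_threshold=0.01, floor=1e-6):
    """Return up to k (english_symbols, log_score) pairs, best first."""
    beam = [((), 0.0)]  # (symbols so far, log probability)
    for a in arabic:
        expanded = []
        for symbols, score in beam:
            prev = symbols[-1] if symbols else "<s>"
            for e, p_a_given_e in trans.get(a, {}).items():
                p_lm = bigram.get((prev, e), floor)  # assumption: floor for unseen bigrams
                new_score = score + math.log(p_a_given_e) + math.log(p_lm)
                expanded.append((symbols + (e,), new_score))
        if not expanded:
            return []  # no translation option for this Arabic letter
        best = max(score for _, score in expanded)
        # Relative threshold pruning: keep hypotheses whose probability is at
        # least rel_threshold times the probability of the current best one.
        cutoff = best + math.log(rel_threshold)
        beam = [hyp for hyp in expanded if hyp[1] >= cutoff]
    return sorted(beam, key=lambda hyp: hyp[1], reverse=True)[:k]


results = beam_search(["م", "ح", "م", "د"], TOY_TRANS, TOY_BIGRAM)
for symbols, log_score in results:
    print(" ".join(symbols), round(log_score, 3))
```

Pruning relative to the best hypothesis keeps the beam small when one path clearly dominates and lets it widen when several paths are close in score.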

Phase 2

• Instead of removing the letters aligned to null, they are concatenated to their first immediate neighbor.
  • New letters (phrases) are formed.
• New translation and language models are created using the new training set.
  • Example: m o h a mm e d aligned to م ح م د
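The phrase-forming step might look like the sketch below. Two things are assumptions here: the alignment is a hypothetical list of (English symbol, Arabic index or `None`) pairs rather than real GIZA++ output, and "first immediate neighbor" is read as the preceding aligned symbol, falling back to the following one when a name starts with null-aligned letters.

```python
from typing import Optional


def merge_null_aligned(alignment: list[tuple[str, Optional[int]]]) -> list[str]:
    """Glue null-aligned symbols onto a neighbouring aligned symbol."""
    phrases: list[str] = []
    pending = ""  # null-aligned symbols seen before any aligned symbol
    for symbol, arabic_index in alignment:
        if arabic_index is None:
            if phrases:
                phrases[-1] += symbol  # assumption: attach to the left neighbour
            else:
                pending += symbol      # no left neighbour yet, hold for the next one
        else:
            phrases.append(pending + symbol)
            pending = ""
    return phrases


example = [("m", 0), ("o", None), ("h", 1), ("a", None), ("mm", 2), ("e", None), ("d", 3)]
print(merge_null_aligned(example))
```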

Phase 2

• Use phase 1 candidates.
• Phase 1 candidates: $e_0 | e_1 | \dots | e_n$
• Phase 2 phrases: $p_0 | p_1 | \dots | p_n$
• All the probabilities $P(a_i \mid p_i)$ where $p_i$ is not prefixed by the given $e_i$ are set to zero.
• The rest is similar to phase 1.
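The prefix constraint can be illustrated as a filter over a phrase table. In the sketch below, `phrase_probs[a][p]` stands for P(a|p) with invented values, and dropping an entry is treated as equivalent to setting its probability to zero; the function name and table layout are hypothetical.

```python
def constrain_to_candidate(arabic, candidate, phrase_probs):
    """For each position i, keep only the phrases that start with candidate[i]."""
    if len(arabic) != len(candidate):
        raise ValueError("candidate must have one symbol per Arabic letter")
    constrained = []
    for a, e in zip(arabic, candidate):
        # Phrases not prefixed by e are dropped, i.e. their probability is zero.
        allowed = {p: prob for p, prob in phrase_probs.get(a, {}).items()
                   if p.startswith(e)}
        constrained.append(allowed)
    return constrained


# Invented toy phrase table for illustration only.
toy_phrase_probs = {
    "م": {"mo": 0.4, "ma": 0.2, "no": 0.1, "m": 0.3},
    "ح": {"ha": 0.5, "a": 0.2, "h": 0.3},
}
print(constrain_to_candidate(["م", "ح"], ["m", "h"], toy_phrase_probs))
```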

Phase 2

• The same decoding technique is applied.
  • For each candidate of phase 1, l new names are generated: kl candidates overall.
• New combined score
  • NewScore = log(S1) + log(S2)

[Figure: phase 2 decoding lattice for the phase 1 candidate m/م h/ح mm/م d/د, listing candidate phrases for each position.]
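A small worked example of the combined score follows. It assumes S1 and S2 are probabilities in (0, 1], so adding their logarithms corresponds to multiplying them; the candidate names and numbers are invented, and `combine_scores` is a hypothetical helper.

```python
import math


def combine_scores(phase1, phase2):
    """Rank phase 2 names by log(S1) + log(S2).

    phase1 maps a phase 1 candidate to its score S1.
    phase2 maps a phase 1 candidate to a list of (name, S2) pairs.
    """
    combined = []
    for candidate, s1 in phase1.items():
        for name, s2 in phase2.get(candidate, []):
            combined.append((name, math.log(s1) + math.log(s2)))
    # Higher (less negative) log scores come first.
    return sorted(combined, key=lambda item: item[1], reverse=True)


# Invented scores: k = 2 phase 1 candidates with up to l = 2 expansions each.
toy_phase1 = {"mhmmd": 0.5, "mhmd": 0.25}
toy_phase2 = {
    "mhmmd": [("mohammd", 0.5), ("mahmmd", 0.1)],
    "mhmd": [("mohamd", 0.8)],
}
ranked = combine_scores(toy_phase1, toy_phase2)
print([name for name, _ in ranked])
```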

Phase 3

• 94646 first and last names.
  • US census bureau.
  • OAK System.
• All the entries are stripped of the vowels.
  • Francisco → frncsc
• Stripped versions of the candidates are compared to the stripped versions of the dictionary entries.
• If matched, the distance of the original names is computed.
  • Levenshtein (Edit) Distance.
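The dictionary lookup can be sketched as follows. The vowel set (a, e, i, o, u) is an assumption that is consistent with the Francisco example but not stated on the slide, the tiny dictionary is invented, and the helper names are hypothetical.

```python
VOWELS = set("aeiou")  # assumption: y is treated as a consonant


def strip_vowels(name: str) -> str:
    """Lower-case a name and remove its vowels."""
    return "".join(ch for ch in name.lower() if ch not in VOWELS)


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def dictionary_matches(candidate: str, dictionary: list[str]) -> list[tuple[str, int]]:
    """Entries whose vowel-stripped form equals the candidate's, closest first."""
    key = strip_vowels(candidate)
    matches = [(entry, levenshtein(candidate.lower(), entry.lower()))
               for entry in dictionary if strip_vowels(entry) == key]
    return sorted(matches, key=lambda m: (m[1], m[0]))


toy_dictionary = ["mohammed", "muhammad", "mohamed", "robert"]
print(strip_vowels("Francisco"))
print(dictionary_matches("mohammd", toy_dictionary))
```

For a dictionary of this size, indexing the entries by their stripped form in a hash map would avoid scanning every entry per candidate.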


Word Filtering

• To avoid adding every output that the HMM generates, a word filtering step is necessary.
• Web Filtering
  • Requires online queries for each execution.
  • Not suitable for most offline tasks.
• Language Model Filtering
  • Requires a rich and updated language model.
  • Google Unigram Model is used.
    • Over 13 million words with frequency over 200 on the internet
    • A huge FSA is built, and HMM candidates that are accepted by the FSA remain in the system.
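As a stand-in for the large FSA, the sketch below builds a trie, which is a tree-shaped deterministic finite-state acceptor, over a word list and keeps only the candidates it accepts. The word list is a tiny invented sample rather than the Google unigram data, and the function names are hypothetical.

```python
END = ""  # marks an accepting state; never collides with a single character


def build_acceptor(words):
    """Build a trie that accepts exactly the given words."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def accepts(acceptor, word):
    """Follow one transition per character and check for an accepting state."""
    node = acceptor
    for ch in word:
        node = node.get(ch)
        if node is None:
            return False
    return END in node


def filter_candidates(acceptor, candidates):
    """Keep the candidates accepted by the acceptor, preserving their order."""
    return [c for c in candidates if accepts(acceptor, c)]


toy_acceptor = build_acceptor(["mohammed", "muhammad"])
print(filter_candidates(toy_acceptor, ["mohammed", "mohammd", "muhammad"]))
```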

Score

• Final Score = S + D + R
  • S is the combined Viterbi score from the last two phases.
  • D is the Levenshtein Distance.
  • R is the number of repetitions.
• All the kl outputs from phase 2 are among the final outputs to accommodate those names not found in the dictionary (LD = 0).


Test Data Preparation

• Extracted from Arabic Treebank 2 part 2
  • 1167 transliteration pairs
  • First 300 pairs as development test set
  • Second 300 pairs as blind test set
• Filter out explicit translations or wrong pairs manually
  • 273 pairs for development test set
  • 291 pairs for blind test set

Distribution of Names

• Distribution of Seen and Unseen Names

             Seen   Unseen   Total
Dev Set       164      109     273
Blind Set     192       99     291

• Number of Alternatives for Names

             One   Two   Three   Four
Dev Set      161    85      22      5
Blind Set    185    79      20      7

Performance on Dev

                     Top 1   Top 2   Top 5   Top 10   Top 20
Single-phase HMM       44%     59%     73%      81%      85%
Double-phase HMM       45%     60%     72%      84%      88%
HMM+Dict.              52%     64%     73%      84%      88%

Performance on Blind

                     Top 1   Top 2   Top 5   Top 10   Top 20
Single-phase HMM       38%     54%     72%      80%      83%
Double-phase HMM       41%     57%     75%      82%      85%
HMM+Dict.              46%     61%     76%      84%      86%
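One plausible way to compute the Top n figures is sketched below. It assumes that a test name counts as correct when any of its reference spellings appears among the first n ranked outputs; the slides do not spell out this criterion, and the data and function name are invented for illustration.

```python
def top_n_accuracy(ranked_outputs, references, n):
    """Fraction of names with at least one reference among the top n outputs."""
    if not references:
        return 0.0
    hits = sum(
        1
        for outputs, refs in zip(ranked_outputs, references)
        if any(output in refs for output in outputs[:n])
    )
    return hits / len(references)


# Invented system outputs (best first) and reference alternatives.
toy_outputs = [["mohamed", "mohammed"], ["robrt", "rupert", "robert"]]
toy_references = [{"mohammed", "muhammad"}, {"robert"}]
for n in (1, 2, 5):
    print(n, top_n_accuracy(toy_outputs, toy_references, n))
```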


Discussion

• Does the "use of a dictionary help a lot"?
• You can never have enough training data
  • Rare alignments: N i e t z s c h e هت ش ی ن
• Issues with names with different origins
  • depends on the task
• Appropriate for incorporation into an MT system
• Issues introduced in the introduction
  • absence of short vowels (3)
  • ambiguity resolution (4)

Questions?
