Automatic Transliteration of Proper Nouns from Arabic to English
Mehdi M. Kashani, Fred Popowich, Anoop Sarkar
Simon Fraser University
Vancouver, BC
July 22, 2007
Second Workshop on Computational Approaches to Arabic Script-based Languages

Overview
• Problem Definition and Challenges
• Related Work
• Our Approach
• Evaluation
• Discussion

Transliteration
• Translation tools facilitate dialogue across cultures.
  • source language → target language
• Transliteration is a subtask that transcribes a word written in one writing system into another writing system.
  • Forward transliteration
    • محمد → Mohammed, Mohammad, Mohamed, Muhammad, …
  • Backward transliteration
    • روبرت → Robert
• Our task: Arabic to English (for machine translation)

Challenges
• Not a 1-to-1 relationship
  • کاترين can be the equivalent of both Catherine and Katharine.
  • Context can disambiguate: Katharine Hepburn.
• Lack of diacritics in Arabic writing
  • Long vowels are always written explicitly: ا ی و
  • Short vowels are omitted in writing.
    • With diacritics: مُحَمَّد; as usually written: محمد
    • English analogy: Mohammed written as Mhmmd
• Lack of certain sounds in Arabic
  • Popowich → Bobowij
• Different pronunciations depending on the letter's position in the word
  • How is ي pronounced at the beginning of a word?
  • How is ي usually pronounced in the middle or at the end of a word?

Convention
• Cursive script, but not always
  • م ی ه ا ر ب ا ابراهيم
• Right to left
  • From now on, Arabic words are shown letter by letter and from left to right.
  • م ی ه ا ر ب ا ابراهيم e b r a h i m

Related Work
• Stalls and Knight (1998): Arabic to English using a noisy channel model for phonemes.
• Al-Onaizan and Knight (2002): combining phonetic- and spelling-based methods
  • They show that a spelling-based approach works better than a phonetic approach.
• Using parallel corpora (Samy et al., 2005) or comparable corpora (Sproat et al., 2006; Klementiev and Roth, 2006) to discover transliterations
  • Not very useful for the machine translation task.

Our Approach
• Consists of three phases
  • Phase 1 (generative): ignore diacritics and simply turn the Arabic letters into English letters.
    • م ح م د → m h mm d
  • Phase 2 (generative): use the best candidates from phase 1 to guess the omitted short vowels.
    • م ح م د & m h mm d → mo ha mm d
  • Phase 3 (comparative): compare the best candidates from phase 2 with entries in a monolingual dictionary.
    • mo ha mm d → mohammd → mohammed, muhammed, …

Training Data Preparation
• Extract name pairs from two different sources
  • Named entities annotated in the LDC Arabic Treebank 3
  • Arabic-English parallel news corpus tagged by an entity tagger
• In total, 9660 pairs are prepared.

Tools
• GIZA++ is used for alignment
  • Implementation of IBM Model 4
  • Output files are used to rearrange letters
  • Alignment score is used to filter out noise
• Cambridge Language Model Toolkit
• For us to use these tools…
  • our words are treated as "sentences"
  • our letters are treated as "words"

Preprocessing
• Noise filtering
  • GIZA++ is run on the character-level training data.
  • Bad pairs have low alignment scores and are filtered out.
  • The 9660 pairs are reduced to 4255 pairs.
• Normalizing the training data
  • Convert names to lower case.
  • Put a space between the letters of each word.
  • Add a prefix (B) and a suffix (E) to names.
  • example: if we were actually dealing with English
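
The normalization steps can be sketched in a few lines of Python. This is only an illustration, not the authors' code: the function name `normalize_name` is hypothetical, and fusing B and E onto the first and last letters is an assumption suggested by tokens such as Bm and dE in the decoding lattice shown later in the slides.

```python
def normalize_name(name: str) -> str:
    """Lowercase a name, separate its letters, and mark its boundaries.

    Assumption: the B (begin) and E (end) markers are fused onto the
    first and last letters rather than kept as separate tokens.
    """
    letters = [ch for ch in name.lower() if not ch.isspace()]
    if not letters:
        raise ValueError("name must contain at least one letter")
    letters[0] = "B" + letters[0]    # prefix marker on the first letter
    letters[-1] = letters[-1] + "E"  # suffix marker on the last letter
    return " ".join(letters)         # one space between letters


print(normalize_name("Mohammed"))
```

With the letters separated by spaces, a word-level aligner such as GIZA++ can treat each name as a "sentence" and each letter as a "word".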

Preprocessing
• Run GIZA++ with Arabic as the source and English as the target.
  • The most frequent sequences of English letters aligned to the same Arabic letter are added to the alphabet.
• Apply the new alphabet to the training data.
  • Before: م ح م د ↔ m o h a m m e d
  • After: م ح م د ↔ m o h a mm e d
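
Applying an extended alphabet to a name can be sketched as a greedy longest-match pass over its letters. The slides do not say how the authors applied the new alphabet, so the greedy strategy, the function name `apply_alphabet`, and the example symbol set are all assumptions.

```python
from typing import List, Set


def apply_alphabet(letters: List[str], multi: Set[str]) -> List[str]:
    """Merge adjacent single letters into multi-letter symbols.

    `multi` holds the multi-letter symbols added to the alphabet.
    Greedy longest match is an assumption, not the authors' stated method.
    """
    max_len = max((len(symbol) for symbol in multi), default=1)
    out: List[str] = []
    i = 0
    while i < len(letters):
        # Try the longest possible symbol first, down to length 2.
        for n in range(min(max_len, len(letters) - i), 1, -1):
            chunk = "".join(letters[i:i + n])
            if chunk in multi:
                out.append(chunk)
                i += n
                break
        else:
            # No multi-letter symbol starts here: keep the single letter.
            out.append(letters[i])
            i += 1
    return out


print(" ".join(apply_alphabet("m o h a m m e d".split(), {"mm", "sh"})))
```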

Phase 1
• Run GIZA++ with Arabic as the source and English as the target.
• Remove English letters aligned to null from the training set.
  • م ح م د ↔ m o h a mm e d (removing the null-aligned o, a, and e leaves m h mm d)

Phase 1
• Translation model: run GIZA++ with English as the source and Arabic as the target.
• Language model: run the Cambridge LM toolkit on the English training set.
• Use unigram and bigram models for Viterbi training and a trigram model for rescoring.
• Noisy channel model: $\hat{E} = \arg\max_{E} P(A \mid E)\, P(E)$, where $A = a_0 \ldots a_I$ and $E = e_0 \ldots e_J$
  • Language model (bigram) term: $P(e_j \mid e_{j-1})$
  • Translation model term: $P(a_i \mid e_j)$

Phase 1
• Beam search decoding is used.
  • Relative threshold pruning
• The k best candidates are returned.
• [Figure: decoding lattice over م ح م د with candidate English letters at each position, e.g., Bp, Bm, Bs, Bsh, Bll, Bk at the start and dhE, fE, kE, dE, ghE, wE at the end]
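
A minimal sketch of beam search with relative threshold pruning under the noisy channel model P(A|E) P(E) follows. It is a simplification, not the authors' decoder: it emits exactly one English token per Arabic letter, uses only a bigram language model, and omits the B/E markers. All probabilities, the names `TRANSLATION`, `BIGRAM`, `UNSEEN`, and `beam_search`, and the threshold value are invented for illustration.

```python
import math
from typing import Dict, List, Tuple

# Toy, made-up probabilities for illustration only (not from the slides).
# TRANSLATION[a][e] stands in for P(a | e); BIGRAM[(prev, e)] for P(e | prev).
TRANSLATION: Dict[str, Dict[str, float]] = {
    "م": {"m": 0.6, "mm": 0.3, "n": 0.1},
    "ح": {"h": 0.8, "kh": 0.2},
    "د": {"d": 0.9, "t": 0.1},
}
BIGRAM: Dict[Tuple[str, str], float] = {
    ("<s>", "m"): 0.5, ("m", "h"): 0.4, ("h", "mm"): 0.3, ("h", "m"): 0.2,
    ("mm", "d"): 0.5, ("m", "d"): 0.3,
}
UNSEEN = 0.01  # crude floor for bigrams missing from the toy table


def beam_search(arabic: List[str], k: int = 3,
                threshold: float = 0.05) -> List[Tuple[List[str], float]]:
    """Return up to k (tokens, log score) pairs, best first."""
    beam: List[Tuple[List[str], float]] = [(["<s>"], 0.0)]
    for a in arabic:
        extended = []
        for tokens, score in beam:
            for e, p_trans in TRANSLATION[a].items():
                p_lm = BIGRAM.get((tokens[-1], e), UNSEEN)
                extended.append(
                    (tokens + [e], score + math.log(p_trans) + math.log(p_lm))
                )
        best = max(score for _, score in extended)
        # Relative threshold pruning: keep hypotheses whose probability is
        # at least `threshold` times that of the current best hypothesis.
        beam = [(t, s) for t, s in extended if s >= best + math.log(threshold)]
    beam.sort(key=lambda item: item[1], reverse=True)
    return [(tokens[1:], score) for tokens, score in beam[:k]]


for tokens, score in beam_search(["م", "ح", "م", "د"]):
    print(" ".join(tokens), round(score, 3))
```

Unlike a fixed-width beam, relative pruning lets the number of surviving hypotheses grow when several candidates score close to the best one and shrink when one candidate clearly dominates.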

Phase 2
• Instead of removing the letters aligned to null, they are concatenated to their first immediate neighbor.
  • New letters (phrases) are formed.
• New translation and language models are created using the new training set.
  • م ح م د ↔ m o h a mm e d

Phase 2
• Use phase 1 candidates.
  • Phase 1 candidates: e0|e1|…|en
  • Phase 2 phrases: p0|p1|…|pn
  • All the probabilities P(ai|pi) where pi is not prefixed by the given ei are set to zero.
• The rest is similar to phase 1.
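
The prefix constraint can be illustrated with a short sketch. The function name `constrain_phrases` and the probability table are hypothetical, and "prefixed by" is interpreted here as a plain string prefix, which is an assumption.

```python
from typing import Dict


def constrain_phrases(phrase_probs: Dict[str, float], letter: str) -> Dict[str, float]:
    """Zero out P(a_i | p_i) for every phrase p_i not prefixed by the
    phase 1 letter e_i at the same position."""
    return {
        phrase: (prob if phrase.startswith(letter) else 0.0)
        for phrase, prob in phrase_probs.items()
    }


# Hypothetical phase 2 table for one Arabic letter; the numbers are invented.
phase2_probs = {"m": 0.4, "mo": 0.3, "mou": 0.1, "n": 0.15, "no": 0.05}
print(constrain_phrases(phase2_probs, "m"))
```

The effect is that phase 2 only decides which vowels to attach; the consonant skeleton chosen in phase 1 is kept fixed.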

Phase 2
• The same decoding technique is applied.
  • For each candidate of phase 1, l new names are generated → kl candidates overall.
• New combined score
  • NewScore = log(S1) + log(S2)
• [Figure: phase 2 decoding lattice for the phase 1 candidate m/م h/ح mm/م d/د, with candidate phrases at each position, e.g., Bm, Bma, Bmo, Bmou at the start and dE, deE, diE, doE at the end]

Phase 3
• Monolingual dictionary: 94646 first and last names
  • US Census Bureau
  • OAK System
• All the entries are stripped of their vowels.
  • Francisco → frncsc
• Stripped versions of the candidates are compared to the stripped versions of the dictionary entries.
• If they match, the distance between the original names is computed.
  • Levenshtein (edit) distance
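
The dictionary comparison can be sketched as follows. This is an illustration under stated assumptions: the vowel set is a, e, i, o, u (y is treated as a consonant), the helper names `build_index` and `match_candidate` are hypothetical, and the three-entry dictionary is a stand-in for the 94646-name list.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

VOWELS = set("aeiou")  # assumption: 'y' is treated as a consonant


def strip_vowels(name: str) -> str:
    """Lowercase a name and drop its vowels, e.g. Francisco -> frncsc."""
    return "".join(ch for ch in name.lower() if ch not in VOWELS)


def levenshtein(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def build_index(dictionary: Iterable[str]) -> Dict[str, List[str]]:
    """Map each vowel-stripped form to the dictionary entries that produce it."""
    index: Dict[str, List[str]] = defaultdict(list)
    for entry in dictionary:
        index[strip_vowels(entry)].append(entry.lower())
    return index


def match_candidate(candidate: str, index: Dict[str, List[str]]) -> List[Tuple[str, int]]:
    """Return dictionary entries sharing the candidate's stripped form,
    with the edit distance between the original (unstripped) names."""
    matches = index.get(strip_vowels(candidate), [])
    scored = [(m, levenshtein(candidate.lower(), m)) for m in matches]
    return sorted(scored, key=lambda pair: (pair[1], pair[0]))


index = build_index(["Mohammed", "Muhammed", "Francisco"])
print(match_candidate("mohammd", index))
```

Indexing by the stripped form means each candidate needs one lookup followed by a handful of distance computations, rather than a distance computation against every dictionary entry.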

Word Filtering
• To avoid adding every output that the HMM generates, a word filtering step is necessary.
  • Web filtering
    • Requires online queries for each execution.
    • Not suitable for most offline tasks.
  • Language model filtering
    • Requires a rich and up-to-date language model.
• The Google unigram model is used.
  • Over 13 million words with frequency over 200 on the Internet
  • A huge FSA is built, and HMM candidates that are accepted by the FSA remain in the system.
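
The filtering step can be sketched with a trie, which is a simple acyclic deterministic finite-state acceptor. The slides do not describe how the FSA was built, so the class `TrieAcceptor`, the function `filter_candidates`, and the two-word vocabulary (a stand-in for the Google unigram list) are assumptions.

```python
from typing import Dict, Iterable, List


class TrieAcceptor:
    """Accepts exactly the words it was built from."""

    def __init__(self, words: Iterable[str]) -> None:
        self.root: Dict = {}
        for word in words:
            node = self.root
            for ch in word.lower():
                node = node.setdefault(ch, {})
            node[None] = True  # the None key marks a final (accepting) state

    def accepts(self, word: str) -> bool:
        node = self.root
        for ch in word.lower():
            if ch not in node:
                return False  # no transition for this letter
            node = node[ch]
        return None in node   # accept only if we stopped in a final state


def filter_candidates(candidates: Iterable[str], acceptor: TrieAcceptor) -> List[str]:
    """Keep only the candidates the acceptor recognizes, preserving order."""
    return [c for c in candidates if acceptor.accepts(c)]


vocabulary = TrieAcceptor(["mohammed", "muhammad"])
print(filter_candidates(["mohammed", "mohammd", "muhammad"], vocabulary))
```

A production system would more likely use a minimized automaton to keep memory manageable for millions of words; the trie is used here only because it is easy to read.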

Score
• Final Score = S + D + R
  • S is the combined Viterbi score from the last two phases.
  • D is the Levenshtein distance.
  • R is the number of repetitions.
• All the kl outputs from phase 2 are among the final outputs to accommodate those names not found in the dictionary (LD = 0).

Test Data Preparation
• Extracted from Arabic Treebank 2 part 2
  • 1167 transliteration pairs
  • First 300 pairs as development test set
  • Second 300 pairs as blind test set
• Filter out explicit translations or wrong pairs manually
  • 273 pairs for development test set
  • 291 pairs for blind test set

Distribution of Names
• Distribution of seen and unseen names

             Seen  Unseen  Total
  Dev Set     164     109    273
  Blind Set   192      99    291

• Number of alternatives for names

             One  Two  Three  Four
  Dev Set    161   85     22     5
  Blind Set  185   79     20     7

Performance on Dev

                    Top 1  Top 2  Top 5  Top 10  Top 20
  Single-phase HMM    44%    59%    73%     81%     85%
  Double-phase HMM    45%    60%    72%     84%     88%
  HMM+Dict.           52%    64%    73%     84%     88%

Performance on Blind

                    Top 1  Top 2  Top 5  Top 10  Top 20
  Single-phase HMM    38%    54%    72%     80%     83%
  Double-phase HMM    41%    57%    75%     82%     85%
  HMM+Dict.           46%    61%    76%     84%     86%
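
A top-k accuracy figure of the kind reported in these tables can be computed as sketched below. The slides do not define the metric precisely, so this assumes a test name counts as correct at top k when any of its reference alternatives appears among the first k ranked candidates. The function name `top_k_accuracy` and the example data are hypothetical.

```python
from typing import List, Sequence, Set


def top_k_accuracy(ranked: Sequence[List[str]],
                   references: Sequence[Set[str]], k: int) -> float:
    """Fraction of test names with at least one reference alternative
    among the first k ranked candidates."""
    if len(ranked) != len(references):
        raise ValueError("ranked and references must have the same length")
    if not ranked:
        raise ValueError("at least one test name is required")
    hits = sum(
        1
        for candidates, refs in zip(ranked, references)
        if any(c in refs for c in candidates[:k])
    )
    return hits / len(ranked)


# Invented system outputs (best first) and reference alternatives.
ranked_outputs = [["mohamed", "mohammed"], ["robert"], ["katrin", "catherine"]]
reference_sets = [{"mohammed", "muhammad"}, {"robert"}, {"katharine", "catherine"}]

for k in (1, 2):
    print(k, top_k_accuracy(ranked_outputs, reference_sets, k))
```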

Discussion
• Does the "use of a dictionary help a lot"?
• You can never have enough training data
  • Rare alignments: N i e t z s c h e ↔ ه ت ش ی ن
• Issues with names with different origins
  • depends on the task
• Appropriate for incorporation into an MT system
• Issues introduced in the introduction
  • absence of short vowels (3)
  • ambiguity resolution (4)

Questions?