Abhijeet Padhye Translit Final Stage

Linguistic Enrichment of Statistical Transliteration

MTP Final Stage Presentation

Guided by: Prof. Pushpak Bhattacharyya

Presented by: Abhijeet Padhye (06305902)

Department of Computer Science & Engineering, IIT Bombay

Presentation Pathway
  Problem Statement
  Motivation
  What is Transliteration?
  Syllables and their Structure
  Sonority Theory
  Concept of Schwa
  Proposed Transliteration Model
  Experiments and Results
  Discussions
  Conclusion and Future Work
  References

Problem Statement
  To exploit the phonological similarities of Roman and Devanagari in order to linguistically aid the process of statistical transliteration.

Motivation
  Transliteration is an important component of Machine Translation.
  When you cannot translate, transliterate.
  Critical in tackling the problem of OOV (out-of-vocabulary) words and proper nouns.
  Particularly important when translating named entities for CLIR.
  Transliteration is a phonetic translation process, so it is well placed to exploit phonetic and phonological properties.

What is Transliteration?
  A process of phonetically translating words, such as named entities or technical terms, from the source language alphabet to the target language alphabet.
  Examples: named entities like Gandhiji; OOV words like Namaskar.

Humans translate/transliterate frequently for different reasons

An example of how transliteration comes to rescue when no translations exist

Overview of Transliteration
  Source Word -> Transliteration Units -> Transliteration Units -> Target Word

Basics of syllables
  A syllable is a unit of spoken language consisting of a single uninterrupted sound, generally formed by a vowel that is preceded or followed by one or more consonants.
  Vowels are the heart of a syllable (the most sonorous element).
  Consonants act as sounds attached to vowels.

Syllable Structure
  Simple syllables: Baba = Ba + ba
  Complex syllables: Andrew = An + drew
    An: VC?
    drew: CVC?

Possible syllable structures
  The Nucleus is always present.
  Onset and Coda may be absent.
  Possible structures: V, CV, VC, CVC
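The structures above can be illustrated with a minimal Python sketch that maps a Roman-script syllable to its consonant/vowel skeleton. The function names and the letter-level vowel set are assumptions made for illustration; they are not part of the presented system. The sketch also shows why "drew" raises a question: its literal skeleton is CCVC, which only reduces to CVC if a consonant cluster is counted as a single onset.

```python
VOWEL_LETTERS = set("aeiou")  # assumption: letter-level approximation of vowels


def cv_skeleton(syllable: str) -> str:
    """Map each letter to V (vowel) or C (consonant)."""
    return "".join("V" if ch in VOWEL_LETTERS else "C" for ch in syllable.lower())


def collapse_runs(skeleton: str) -> str:
    """Collapse runs of the same symbol, so a cluster counts as one slot."""
    collapsed = []
    for symbol in skeleton:
        if not collapsed or collapsed[-1] != symbol:
            collapsed.append(symbol)
    return "".join(collapsed)


for syllable in ["ba", "an", "drew"]:
    skeleton = cv_skeleton(syllable)
    print(syllable, skeleton, collapse_runs(skeleton))
```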

Introduction to sonority theory
  The sonority of a sound is its loudness relative to other sounds with the same length, stress and pitch.
  Some sounds are more sonorous than others.
  Words in a language can be divided into syllables.
  Sonority theory distinguishes syllables on the basis of sounds.

Sonority Hierarchy
  Obstruents can be further classified into: Fricatives, Affricates, Stops

Sonority sequencing principle
  The sonority profile of a syllable must rise until its peak (nucleus), and then fall.
  Onset -> Peak (Nucleus) -> Coda

Example: ABHIJEET
  Fig : Sonority Profile 1 and Sonority Profile 2 for ABHIJEET

The concept of schwa
  The first letter of the IAL alphabet: {a}
  An unstressed and toneless neutral vowel.
  Some schwas are deleted and some are not.
  Schwa deletion is an important issue for grapheme-to-phoneme conversion.
  Handled using a well-established schwa deletion algorithm.
  Example: Priyatama (the last "a" changes the gender).

Proposed Transliteration Model
  Fig : Source language words and target language words pass through the Syllabification Modules to give source language syllables and target language syllables. These feed Moses Training (phrase translation tables) and SRILM (target language model). The Moses Decoder then takes source language words and produces the transliterated output.

Transliteration system workflow
  Syllabification of parallel lists of names in Roman and Devanagari.
  Using these parallel lists for:
    Alignment of syllables
    Training the Moses translation toolkit
    Language model generation using SRILM
  Decoding using the trained phrase-translation tables and language model.
  Comparing results to analyze performance.

Experiments and Results
  1. Syllabification of Roman and Devanagari words

Fig : Syllabification Algorithm

Syllabification results

A few examples
  Language: English, Hindi
  Correct: Soloman, Akbarkhan
  Incorrect: Venkatachalam

Transliteration Process
  Syllabification of a list of 10000 parallel names written in Roman and Devanagari, and preparation of a parallel aligned list of syllables.
  Training language models for the target language using the SRILM toolkit.
  Training Moses with an aligned corpus of 7500 names and the target language model as input.
  Testing with a list of 2500 proper names, using the trained model for transliteration.
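The data preparation step above can be sketched in Python as follows. Moses is trained on two line-aligned text files with whitespace-separated tokens, so each name becomes one line and each syllable one token. The helper names, the toy name pairs and their syllabifications are illustrative assumptions, not the presented system's code or data; only the 7500/2500 split comes from the slide.

```python
from typing import List, Sequence, Tuple

# One entry per name: (source syllables, target syllables).
SyllablePair = Tuple[List[str], List[str]]


def to_parallel_lines(pairs: Sequence[SyllablePair]) -> Tuple[List[str], List[str]]:
    """Render each name as one line, with syllables as space-separated tokens."""
    source_lines = [" ".join(source) for source, _ in pairs]
    target_lines = [" ".join(target) for _, target in pairs]
    return source_lines, target_lines


def split_corpus(pairs: Sequence, n_train: int) -> Tuple[Sequence, Sequence]:
    """Keep the first n_train pairs for training and the rest for testing."""
    return pairs[:n_train], pairs[n_train:]


def write_lines(path: str, lines: Sequence[str]) -> None:
    """Write one name per line, so line i of both files refers to the same name."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")


# Toy data (assumed for illustration only).
toy_pairs: List[SyllablePair] = [
    (["san", "deep"], ["सं", "दीप"]),
    (["ba", "ba"], ["बा", "बा"]),
]
source_lines, target_lines = to_parallel_lines(toy_pairs)
print(source_lines)
print(target_lines)
```

With the full list of 10000 names, `split_corpus(pairs, 7500)` would give the 7500-name training portion and the 2500-name test portion described above.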

Roman to Devanagari Transliteration

Fig : Result for Roman to Devanagari Transliteration

Fig : Top-n Inclusion results

Devanagari to Roman Transliteration

Fig : Result for Devanagari to Roman Transliteration

Fig : Top-n translation results

Comparison with Character n-gram based model
  Same experimental setup; transliteration units changed to n-grams.
    Bigrams (Sandeep: Sa, an, nd, de, ee, ep)
    Trigrams (Sandeep: San, and, nde, dee, eep)
    Quadrigrams (Sandeep: Sand, ande, ndee, deep)
  Observations suggest a performance improvement when syllables are used as transliteration units.
  n-gram based models ignore phonological properties such as unstressed vowels.
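The n-gram units listed above are overlapping character windows. A minimal Python sketch (the function name is an assumption) that generates them for the "Sandeep" example:

```python
from typing import List


def char_ngrams(word: str, n: int) -> List[str]:
    """Return all overlapping character n-grams of a word, left to right."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]


for n in (2, 3, 4):
    print(n, char_ngrams("Sandeep", n))
```

Unlike syllables, these windows are cut at fixed widths, so a unit such as "nd" or "nde" can straddle a syllable boundary.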

Fig : Comparison with N-gram based model

Comparison with State-of-the-art Systems
  Google transliteration engine and Quillpad used as benchmarks for comparison.
  A list of 1000 words written in the Roman alphabet used as test input.
  Our system outperforms Quillpad and falls just short of Google's results.
  More extensive training with a larger training set might improve system performance.

Fig : Comparison with State-of-the-art transliteration systems

Discussions
  Accents: Thoda or thora?
  Mapping of sounds: Mahaan, Kahaan
  Silent letters: Psychiatrist

Discussions (contd.)
  Improper schwa deletion: Venkatachalam
  Improper placement (onset or coda)
  Similar phonological structure but different pronunciation

Conclusion and Future Work
  Transliteration can prove critical in supporting Machine Translation.
  Phonologically aware transliteration units like syllables show strong signs of performance improvement.
  Syllable-based transliteration performs comparably to state-of-the-art systems.
  Syllabification algorithms should be improved further.
  The developed system should be supplied with a larger and more accurate training set.
  Some of the linguistic issues discussed above are very challenging cases for future work on transliteration.

References
  Pirkola A., Toivonen J., Keskustalo H., Visala K., Jarvelin K. 2003. Fuzzy Translation of Cross-Lingual Spelling Variants. In: Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
  Gao W., Lam W., and Wang K. 2004. Phoneme-based Transliteration of Foreign Names for OOV Problem. International Joint Conference on Natural Language Processing.
  Fujimura O. 1975. Syllable as a Unit of Speech Recognition. IEEE Transactions on Acoustics, Speech and Signal Processing.
  Koehn P. et al. 2007. Moses: Open Source Toolkit for Statistical Machine Translation. Annual Meeting of the Association for Computational Linguistics (ACL), Demonstration Session, Prague, Czech Republic.
  Laver J. 1994. Principles of Phonetics. Cambridge University Press. Pg. 114.
  Knight K. and Graehl J. 1997. Machine Transliteration. Proceedings of ACL 1997. Pg. 128-135.
  Stolcke A. 2002. SRILM - An Extensible Language Modeling Toolkit. In: Proceedings of the International Conference on Spoken Language Processing.
  Choudhury M. and Basu A. 2002. A Rule Based Schwa Deletion Algorithm for Hindi. Technical Report. Dept. of Comp. Sci. & Engg., Indian Institute of Technology, Kharagpur.

Background Theory

Approaches towards Transliteration

Complex Syllable structure

Fig : Detailed syllable structure

Fig : Complex syllables fitting in above structure

Sonority theory & syllables
  A syllable is a cluster of sonority, defined by a sonority peak acting as a structural magnet to the surrounding lower sonority elements.
  Represented as waves of sonority, or the Sonority Profile of that syllable.
  Onset -> Nucleus -> Coda

Sonority Hierarchy for English and Hindi

  Sound Segment    Letters contained
  Vowels           {"a", "e", "i", "o", "u"}
  Liquids          {"y", "r", "l", "v", "w"}
  Nasals           {"n", "m"}
  Fricatives       {"s", "z", "f", "th", "h", "sh", "x"}
  Affricates       {"ch", "j"}
  Stops            {"b", "d", "g", "p", "t", "k", "q", "c"}

Fig : Sonority hierarchy for English

  Sound segments for Hindi: Vowels, Matras, Liquids, Nasals, Fricatives, Affricates, Stops

Fig : Sonority hierarchy for Hindi
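The English hierarchy above can be turned into a sonority profile with a short Python sketch. The letter classes are taken from the table; the numeric ranks (1 for stops up to 6 for vowels), the greedy handling of the digraphs "th", "sh" and "ch", and the function names are assumptions made for illustration. The check for the sonority sequencing principle allows plateaus such as "ee", which is also an assumption.

```python
from typing import List

# Classes ordered from least to most sonorous; letters as in the table above.
SONORITY_CLASSES = [
    ("stop", ["b", "d", "g", "p", "t", "k", "q", "c"]),
    ("affricate", ["ch", "j"]),
    ("fricative", ["s", "z", "f", "th", "h", "sh", "x"]),
    ("nasal", ["n", "m"]),
    ("liquid", ["y", "r", "l", "v", "w"]),
    ("vowel", ["a", "e", "i", "o", "u"]),
]
# Assumed numeric scale: stop = 1, ..., vowel = 6.
RANK = {
    segment: rank
    for rank, (_, members) in enumerate(SONORITY_CLASSES, start=1)
    for segment in members
}


def segments(word: str) -> List[str]:
    """Split a word into table entries, preferring two-letter entries."""
    word = word.lower()
    result, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in RANK:
            result.append(pair)
            i += 2
        elif word[i] in RANK:
            result.append(word[i])
            i += 1
        else:
            raise ValueError(f"no sonority class for {word[i]!r}")
    return result


def sonority_profile(word: str) -> List[int]:
    """Return the sonority rank of each segment, left to right."""
    return [RANK[segment] for segment in segments(word)]


def obeys_ssp(syllable: str) -> bool:
    """True if sonority does not fall before the peak and does not rise after it."""
    profile = sonority_profile(syllable)
    peak = profile.index(max(profile))
    rising = all(a <= b for a, b in zip(profile[:peak], profile[1:peak + 1]))
    falling = all(a >= b for a, b in zip(profile[peak:], profile[peak + 1:]))
    return rising and falling


print(sonority_profile("abhijeet"))
print([obeys_ssp(s) for s in ["a", "bhi", "jeet", "abh"]])
```

Under these assumptions the candidate syllables "a", "bhi" and "jeet" each have a single peak, while a grouping such as "abh" does not, because sonority rises again after falling.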

Maximal Onset Principle
  The intervocalic consonants are maximally assigned to the onsets of syllables, in conformity with universal and language-specific conditions.
  When a word has two valid sets of syllables, the one with the maximum onset length is preferred.
  Example: Diploma
    Di + plo + ma (preferred)
    Dip + lo + ma
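The principle above can be illustrated with a minimal Python sketch that gives each intervocalic consonant cluster the longest legal onset. This is not the syllabification algorithm of the presented system: the function names are hypothetical, vowels are approximated by the letters a, e, i, o, u, and the small set of permitted two-consonant onsets stands in for the language-specific conditions.

```python
from typing import List

VOWEL_LETTERS = set("aeiou")
# Assumption: a small, incomplete set of legal two-consonant onsets.
LEGAL_ONSET_CLUSTERS = {
    "pl", "pr", "bl", "br", "tr", "dr", "kl", "kr", "gl", "gr", "fl", "fr",
}


def is_legal_onset(cluster: str) -> bool:
    """An empty or single-consonant onset is always allowed in this sketch."""
    return len(cluster) <= 1 or cluster in LEGAL_ONSET_CLUSTERS


def syllabify(word: str) -> List[str]:
    """Split a word so that each onset is as long as the onset rules permit."""
    word = word.lower()
    # Locate nuclei as maximal runs of vowel letters: (start, end) spans.
    nuclei, i = [], 0
    while i < len(word):
        if word[i] in VOWEL_LETTERS:
            j = i
            while j < len(word) and word[j] in VOWEL_LETTERS:
                j += 1
            nuclei.append((i, j))
            i = j
        else:
            i += 1
    if not nuclei:
        return [word]
    boundaries = [0]
    for (_, prev_end), (next_start, _) in zip(nuclei, nuclei[1:]):
        # Try the longest suffix of the cluster first, then shorter ones.
        for k in range(prev_end, next_start + 1):
            if is_legal_onset(word[k:next_start]):
                boundaries.append(k)
                break
    boundaries.append(len(word))
    return [word[a:b] for a, b in zip(boundaries, boundaries[1:])]


print(syllabify("diploma"))
print(syllabify("andrew"))
```

For "diploma" the cluster "pl" is a legal onset, so it is assigned whole to the second syllable, giving di + plo + ma rather than dip + lo + ma.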

Schwa deletion algorithm
  Procedure delete_schwa (DS)
  Input: word (string of alphabets)
  Output: input word with some schwas deleted.
  1. Mark all the full vowels, the consonants followed by vowels other than the inherent schwas (i.e. consonants with matras), and all the h's in the word as F, unless explicitly marked as half by use of halant.
  2. Mark all the consonants immediately followed by consonants or halants (i.e. consonants of conjugate syllables) as H.
  3. Mark all the remaining consonants, which are followed by implicit schwas, as U.
  4. If in the word, y is marked as U and preceded by i, I, ri, u or U, mark it F.
  5. If y, r, l or v are marked U and preceded by consonants marked H, then mark them F.
  6. If a consonant marked U is followed by a full vowel, then mark that consonant as F.
  7. While traversing the word from left to right, if a consonant marked U is encountered before any consonant or vowel marked F, then mark that consonant as F.
  8. If the last consonant is marked U, mark it H.
  9. If any consonant marked U is immediately followed by a consonant marked H, mark it F.
  10. While traversing the word from left to right, for every consonant marked U, mark it H if it is preceded by F and followed by F or U; otherwise mark it F.
  11. For all consonants marked H, if it is followed by a schwa in the original word, then delete the schwa from the word.
  12. The resulting new word is the required output.
  End procedure delete_schwa

Example of Schwa deletion

Fig : Application of Schwa deletion Algorithm

Examples: Correct Transliterations

  Source Language    Target Language    Examples
  English            Hindi              REYMOND, MUKUNDAN
  Hindi              English            SALEEMABEGAM, RAJGURU

Examples: Incorrect Transliterations

  Source Language    Target Language    Examples
  English            Hindi              VENKATACHALAM, DHRUVA
  Hindi              English            UNK, COMLATA