Upload
dangkhue
View
231
Download
3
Embed Size (px)
Citation preview
NLP TOOLS DEVELOPMENT FOR TAMIL LANGUAGELoganathan, UFAL
Overview
IntroductionMy work involving Tamil NLP
English – Tamil MTTamil Morphological Analyzer
Introduction – Indian Languages
23 official languages29 languages have
more than 1 million native speakers in India.
TamilApprox 67 million speakers in India
Introduction – Resources for Tamil
Tamil Editing/Unicode Support - Available
Dictionary – AvailableTamil lexicon, Winslow, Fabricius, McAlpin, Kathirvelu pillai – Published online by Univ. Of Chicago
Morphological Analyzer/Tagger – Partially Available
Phrase Structure/Dependency Parser – NO
Parallel Corpora – NO (publicly, readily)Active Bilingual websites: www.wsws.org, www.cinesouth.comTamilnadu government schoolbooks (with English translations)
My Work involving Tamil NLP
Enlgish – Tamil Translation System (Master’s Thesis)Morphological Analyzer
English – Tamil Translation System
General Differences
Morphological
Noun Cases Tamil suffixes English words
Nominative Ø Ø
Accusative ai Ø
Dative kku, ukku, ku Ø, to, for, at
Benefactive (u)kkuAka for
Instrumental Al with, of, by
Sociative Otu, utan with
Locative il, itam in, on, among, to, with, from
Ablative iliruwTu, itamiruwTu from
Genitive in, utaiya, aTu ‘s, of
General Differences
Syntactical difference
Kumar talked about linguistics
kumAr moziyiyalaip paRRip pEcinAn
குமார் ெமாழியியைலப் பற்றிப் ேபசினான்
moziyiyalaip paRRip kumAr pEcinAn
General Differences
Syntactical (complex sentences)
Pollution Control Authority’s regional officer said
mAcuk kattuppAttu vAriya aTikAri TeriviTTArmaTTiya amaiccarin karuTTil TangkaLaTu TuRaikku utanpAtu illai enamAcuk kattuppAttu vAriyam ceyalpaTavillai enRa
that his department is not agreeing with the central minister’s opinionthat Pollution Control Authority is not functioning
மாசுக்கட் ப்பாட் வாாியம் ெசயல்படவில்ைல என்றமத்திய அைமச்சாின் க த்தில் தங்கள ைறக்குஉடன்பா இல்ைல எனமாசுக்கட் ப்பாட் வாாிய அதிகாாி ெதாிவித்தார்
MCSCRC
MC SC RC-> RC SC MC
How hard is English-Tamil MT
The previous examples illustrateTamil -> SOV, English -> SVOTamil is a restricted free word order languageTamil is agglutinative
Difference occurs inSyntactical level i.e word orderingMorphological level
We need An efficient syntax reordering moduleMorphological generator
Approaches
Syntax Transfer Based MTStatistical Machine Translation (SMT)
Syntax Transfer MT
Syntax Transfer MT: Architecture
MT using Synchronous CFG
Source Grammar – Tree Adjoining Grammar
Tree Generating System introduced by Aravind JoshiTAG – Multilevel tree rewriting systemBasic units (Elementary trees)
Initial trees (Basic structures)Auxiliary trees (Recursive structures)
TAG – Formal Definition
TAG Operations
Substitution and Adjunction/ Ex: from XTAG Manual
TAG Derivation Structure
Derivation Tree: “Srini bought a book”/ Ex: from XTAG Manual
Parsing Lexicalized TAGs
Many parsing algorithms were suggested, including CYK parser for TAG, Head-Corner parsing algorithm, Bidirectional parsing algorithm and more recent work on Statistical LTAG parsing.For parsing source side, Yves Schabes algorithm was implemented in Java.
Tran
sfer
Gra
mm
ar
Experiments and Results
The entire translation system is written in Java.Implemented modules include LTAG parser for English, STAG system for syntax reordering of English into Tamil.Our system uses the same language resources developed for XTAG system for parsing the source side sentence. All XTAG related databases have been converted into Mysql format.
LTAG Tree Editor for Visualization
Collaborative effort between Amrita and CDAC
Experiments and Results
Sample Output
Statistical Machine Translation
EILMT English-Tamil Parallel Corpus
Monolingual Tamil Data
Health Tourism
6000 15000
#Sentences #Words
Training data 95464 >1.2 million
Test data 1000 12K
Statistical Machine Translation
Monolingual data: Sensitivity to Morphology
SMT – Results for Health Corpus
a) Sensitivity to sentence length & Morphology
b) Sensitivity to training corpus size & Morphology
SMT – Results for Tourism Corpus
a) Sensitivity to sentence length & Morphology
b) Sensitivity to training corpus size & Morphology
SMT – Sample Output
Tamil Morphological Analyzer
NLP @ Amrita – Morphological Analyzer for Tamil
Tamil is agglutinative
The major inflectional categories in Tamil are nouns and verbs.
Noun morphology of Tamil is simple compared to verb morphology.
Extremely simple paradigms were used to categorize the root words.
The lexicon includes 50000 nouns and few hundred verbs.
FSTs were used to build Morphological Analyzer/generator
Figure: Morph Generator FST
NLP @ Amrita – Morphological Analyzer for Tamil
Lexicon FST
Rule FST
TEnI+N+PL
TEnIkkaL
Morph Generator
Rule Invert FST
Lexicon inv FST
Morph Analyzer
TEnIkkaL
TEnI+N+PL
Figure: Finite State Transducer for Morphological Processing
FST composition
NLP @ Amrita – Morphological Analyzer for Tamil
Figure: Morphological Analyzer screenshot
Thank you