An Efficient Rule-Based System for Morphological Parsing of Tamil Language
தமிழ் உருபனியல் ஆய்வு (Tamil Morphological Analysis)
STUDENTS: Karthik S (106106029), Praveen Kumar (106106045), Venkataraman GB (106106073)
GUIDE:Dr. V. Gopalakrishnan
Final Semester Project, Department of Computer Science and Engineering, National Institute of Technology, Tiruchirappalli
May 2010
Agenda
Overview of the Project
NLP Applications – The Stakeholders
The problem at hand
The proposed solution
◦ Rule-Based Morphological Analysis
◦ Machine Learning
Where does it all fit in?
Need for Tamil Morphological Analysis
Resources Obtained
Implementation Details
Demonstration
Future Scope
08/04/2023 National Institute of Technology, Tiruchirappalli
WHO WHAT WHY WHERE HOW
Overview of the Project
Natural Language Processing
Morphological Analysis
Tamil Language
Morphing …
… And in Tamil
நடந்தான் (he walked)
நடந்தனர் (they walked)
நடக்கின்றாள் (she is walking)
நடப்பான் (he will walk)
நடக்கின்றான் (he is walking)
NLP Applications – The Stakeholders
WHO ARE THE STAKEHOLDERS?
Natural Language Processing applications like:
Stemming
Machine Translation
Speech Recognition
Information Retrieval
WHY ARE THESE APPLICATIONS THE STAKEHOLDERS?
The problem at hand
Morphological analysis of Tamil involves understanding the word structure and its inflections.
AGGLUTINATION IN TAMIL
Agglutination is the morphological process of adding affixes to the base of a word.
A typical Tamil verb form will have a number of suffixes showing person, number, mood, tense and voice.
INFLECTIONS IN TAMIL
பால் - Gender
எண் - Number
திணை - Class
காலம் - Tense
இடம் - Person
Example: vAlntukkontiruntēṉ [வாழ்ந்துகொண்டிருந்தேன்]
vAl - வாழ்: root (live)
intu - ந்து: voice marker (past tense object voice)
kontu - கொண்டு: tense marker (during)
irunta - இருந்த: aspect marker (past progressive)
ēn - ஏன்: person marker (first person, singular)
The proposed solution
There are two levels, called the lexical and surface levels. At the surface level, a word is represented in its original orthographic form. At the lexical level, a word is represented by denoting all of its functional components.
RULE-BASED MORPHOLOGICAL ANALYSIS
Analyzing word inflections using rules specified in Tamil grammar
அன் ஆன் அள் ஆள் அர் ஆர் பம்மார்
அஆ குடுதுறு என் ஏன் அல் அன்
அம் ஆம் எம் ஏம் ஓமொ டும்மூர்
கடதற ஐ ஆய் இம்மின் இர்ஈர்
ஈயர் கயவு மென்பவும் பிறவும்
வினையின் விகுதி பெயரினும் சிலவே
SURFACE LEVEL LEXICAL LEVEL
நன்னூல் (Nannool)
தொல்காப்பியம் (Tholkaappiyam)
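As a rough illustration of rule-based suffix checking, the sketch below lists every known ending that a word finishes with. It is not the project's code. It assumes words are already in a phoneme-level ASCII romanization (uppercase for long vowels, in the style of "vAl" used earlier, with "T" for ட and "t" for த), and the class name, the method name and the short ending list are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch only: find which known verb endings a romanized word could carry. */
final class SuffixCandidates {
    // A small, hand-picked subset of vigudhi in a hypothetical ASCII romanization.
    static final List<String> VIGUDHI =
            Arrays.asList("an", "An", "aL", "AL", "ar", "Ar", "En", "tu", "u");

    /** Returns every listed ending the word finishes with, longest first. */
    static List<String> candidates(String romanizedWord) {
        List<String> hits = new ArrayList<>();
        for (String s : VIGUDHI) {
            // Require a non-empty stem to remain after removing the ending.
            if (romanizedWord.length() > s.length() && romanizedWord.endsWith(s)) {
                hits.add(s);
            }
        }
        hits.sort((a, b) -> b.length() - a.length());
        return hits;
    }
}
```

With this list, "paTittAn" (படித்தான்) yields the single candidate "An", while "paTittatu" (படித்தது) yields both "tu" and "u", which is the kind of conflict the next slide discusses.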
The proposed solution
MACHINE LEARNING APPROACH
1. While checking for suffixes in a given word, strict application of the rules may allow more than one suffix, but only one of them is semantically possible.
விகுதி (suffix): படித்து – “உ”; படித்தது – “து” or “உ”?
The machine learning approach helps the system “learn” the correct parse for the word, so that when the same word is processed again the wrong possibilities are eliminated automatically.
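The slides do not say which learning method was used. The sketch below uses a plain lookup table as a stand-in, only to show the behaviour described: once a parse has been confirmed for a word, later requests for the same word return just that choice. The class and method names are hypothetical, and the romanized forms follow an ad hoc ASCII scheme.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch only: remembers which suffix was confirmed for a word. */
final class ConflictMemory {
    private final Map<String, String> resolved = new HashMap<>();

    /** Record the suffix that turned out to be correct for this word. */
    void learn(String word, String correctSuffix) {
        resolved.put(word, correctSuffix);
    }

    /** Returns the learnt choice if there is one, otherwise all candidates unchanged. */
    List<String> filter(String word, List<String> candidates) {
        String known = resolved.get(word);
        if (known != null && candidates.contains(known)) {
            return Collections.singletonList(known);
        }
        return candidates;
    }
}
```

For example, `filter("paTittatu", ["tu", "u"])` returns both candidates on first sight; after `learn("paTittatu", "tu")` (a choice made here purely for illustration) it returns only "tu".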
2. Two words might share the same inflectional part.
நடக்கின்றான், படிக்கின்றான்
The inflectional part of every word is learnt by the system. This avoids analysing the second word again from scratch.
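One way to picture this reuse is a cache keyed by the inflectional part. The sketch below is an assumption about how such a cache might look, not the project's implementation. The class name, the romanized strings and the tag text (written in the style of the tags shown later in the slides) are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: reuse the analysis of an inflection seen on an earlier word. */
final class InflectionCache {
    private final Map<String, String> analyses = new HashMap<>();

    /** Store the analysis of an inflectional part once it has been worked out. */
    void remember(String inflection, String analysis) {
        analyses.put(inflection, analysis);
    }

    /** If the word ends with a learnt inflection, returns "stem + analysis"; else null. */
    String analyse(String word) {
        String best = null;
        for (String inflection : analyses.keySet()) {
            if (word.length() > inflection.length() && word.endsWith(inflection)
                    && (best == null || inflection.length() > best.length())) {
                best = inflection;   // prefer the longest learnt inflection
            }
        }
        if (best == null) {
            return null;             // nothing learnt yet: fall back to full analysis
        }
        String stem = word.substring(0, word.length() - best.length());
        return stem + " + " + analyses.get(best);
    }
}
```

After analysing "naTakkinRAn" (நடக்கின்றான்) and calling `remember("kkinRAn", "kkinR<PRESENT TENSE> + An<3SM>")`, the word "paTikkinRAn" (படிக்கின்றான்) is answered from the cache as `paTi + kkinR<PRESENT TENSE> + An<3SM>`.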
Where does it all fit in ?
Characters: ப டி த் தா ன்
Word – Tokenization: படித்தான்
Morphological Analysis: படி - த்த் - ஆன்
Sentence Syntax Analysis: அவன் புத்தகத்தைப் படித்தான்
Semantic Analysis: Meaning of the sentence?
Need for Tamil Morphological Analysis
ENGLISH vs. TAMIL
TRANSLATION AND SEMANTIC ANALYSIS
அவன் மதுரைக்கு வந்தாள் -- Semantically wrong (masculine subject with a feminine verb ending)
Morphological analysis is needed to check the semantic correctness of a sentence. How should the above sentence be translated?
I came – நான் வந்தேன்
You came – நீ வந்தாய்
They came – அவர்கள் வந்தனர்
He came – அவன் வந்தான்
She came – அவள் வந்தாள்
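The agreement shown in these pairs can be checked mechanically once the verb ending has been identified. The sketch below is a minimal illustration under stated assumptions: it covers only the five pronouns and endings listed above and works on an ad hoc ASCII romanization. The tag names other than 3SM, and the class and method names, are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: does the verb ending agree with the subject pronoun? */
final class AgreementCheck {
    private static final Map<String, String> PRONOUN_TAG = new HashMap<>();
    private static final Map<String, String> ENDING_TAG = new HashMap<>();
    static {
        PRONOUN_TAG.put("nAn", "1S");      ENDING_TAG.put("En", "1S");     // I came
        PRONOUN_TAG.put("nI", "2S");       ENDING_TAG.put("Ay", "2S");     // you came
        PRONOUN_TAG.put("avan", "3SM");    ENDING_TAG.put("An", "3SM");    // he came
        PRONOUN_TAG.put("avaL", "3SF");    ENDING_TAG.put("AL", "3SF");    // she came
        PRONOUN_TAG.put("avarkaL", "3PL"); ENDING_TAG.put("anar", "3PL");  // they came
    }

    /** True only if both words are known and their person/number/gender tags match. */
    static boolean agrees(String pronoun, String verb) {
        String expected = PRONOUN_TAG.get(pronoun);
        if (expected == null) {
            return false;
        }
        for (Map.Entry<String, String> e : ENDING_TAG.entrySet()) {
            if (verb.endsWith(e.getKey())) {
                return e.getValue().equals(expected);
            }
        }
        return false;
    }
}
```

Here `agrees("avan", "vantAn")` holds, while `agrees("avan", "vantAL")` does not, mirroring the semantically wrong sentence above.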
Resources Obtained
EMILLE – CIIL TAMIL MONOLINGUAL CORPUS
Enabling Minority Language Engineering
Collaborative venture of
◦ Lancaster University, UK
◦ Central Institute of Indian Languages (CIIL), Mysore, India
Distributed by the European Language Resources Association [ELRA]
TAMIL WORDNET
The database is a semantic dictionary designed as a lexical network
Developed by
◦ Department of Linguistics of Tamil University
◦ AU-KBC Research Centre, Chennai
Tamil Wordnet resembles a traditional dictionary. It also contains valuable information about morphologically related words
Implementation Details - 1
Flowchart (boxes in order; the arrows did not survive extraction):
Input Tamil Word
Check in DB (decision: Yes / No)
C-V Segmentation
Root verb? (decision: Yes / No)
Backward Scanning of inflections
Classify and Remove Inflection
Conflict Resolution (Machine Learning)
Output
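The core loop suggested by the flowchart labels (scan backward, classify and remove an inflection, stop when a root verb remains) might look roughly like the sketch below. This is a simplified reading under stated assumptions, not the project's algorithm: it skips the database check and conflict resolution, takes the first matching inflection, and works on an ad hoc ASCII romanization. The tiny root and inflection tables and all names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Sketch only: strip inflections from the end of a word until a root remains. */
final class BackwardScanner {
    static final Set<String> ROOTS = new HashSet<>(Arrays.asList("paTi", "naTa"));
    static final Map<String, String> INFLECTIONS = new LinkedHashMap<>();
    static {
        INFLECTIONS.put("An", "3SM");
        INFLECTIONS.put("AL", "3SF");
        INFLECTIONS.put("tt", "PAST TENSE");
    }

    /** Returns the tagged parts, or null if no known inflection can be removed. */
    static String analyse(String word) {
        Deque<String> parts = new ArrayDeque<>();
        String rest = word;
        while (!ROOTS.contains(rest)) {                 // "Root verb?" decision
            String matched = null;
            for (String s : INFLECTIONS.keySet()) {     // backward scan: look at the end
                if (rest.length() > s.length() && rest.endsWith(s)) {
                    matched = s;
                    break;
                }
            }
            if (matched == null) {
                return null;
            }
            // Classify the inflection, then remove it from the word.
            parts.addFirst(matched + "<" + INFLECTIONS.get(matched) + ">");
            rest = rest.substring(0, rest.length() - matched.length());
        }
        parts.addFirst(rest + "<VERB_ROOT>");
        return String.join(" ", parts);
    }
}
```

For "paTittAn" (படித்தான்) this gives `paTi<VERB_ROOT> tt<PAST TENSE> An<3SM>`, matching the worked example on the next slide.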
Implementation Details - 2
படித்தான்
ப டி த் தா ன்
ப் - அ, ட் - இ, த், த் - ஆ, ன்
ப் அ ட் இ த் த் ஆ ன்
படி <VERB_ROOT>
த்த் <PAST TENSE>
ஆன் <3SM>
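The consonant-vowel (C-V) split shown in this example can be sketched as follows. This is one possible way to do it, assuming NFC-normalized input in which each vowel sign is a single code point. The class name is hypothetical, and Tamil characters are written as Unicode escapes so that the source stays ASCII.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch only: split Tamil text into bare consonants and independent vowels. */
final class CvSegmenter {
    private static final char PULLI = '\u0BCD';       // virama: marks a bare consonant
    private static final char INHERENT_A = '\u0B85';  // independent vowel "a"

    // Dependent vowel signs and, at the same index, the independent vowel each stands for.
    private static final String SIGNS =
            "\u0BBE\u0BBF\u0BC0\u0BC1\u0BC2\u0BC6\u0BC7\u0BC8\u0BCA\u0BCB\u0BCC";
    private static final String VOWELS =
            "\u0B86\u0B87\u0B88\u0B89\u0B8A\u0B8E\u0B8F\u0B90\u0B92\u0B93\u0B94";

    private static boolean isConsonant(char c) {
        return c >= '\u0B95' && c <= '\u0BB9';        // consonant range of the Tamil block
    }

    static List<String> segment(String word) {
        List<String> units = new ArrayList<>();
        for (int i = 0; i < word.length(); i++) {
            char c = word.charAt(i);
            if (!isConsonant(c)) {                    // independent vowel or other: keep as is
                units.add(String.valueOf(c));
                continue;
            }
            units.add("" + c + PULLI);                // emit the bare consonant
            char next = i + 1 < word.length() ? word.charAt(i + 1) : 0;
            int sign = SIGNS.indexOf(next);
            if (next == PULLI) {
                i++;                                  // already bare: skip the pulli
            } else if (sign >= 0) {
                units.add(String.valueOf(VOWELS.charAt(sign)));
                i++;                                  // skip the vowel sign
            } else {
                units.add(String.valueOf(INHERENT_A));  // no sign: inherent "a"
            }
        }
        return units;
    }
}
```

Applied to படித்தான், `segment` returns the eight units ப், அ, ட், இ, த், த், ஆ, ன், the same sequence as in the example above.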
Implementation Details - 3
UNICODE SUPPORT FOR TAMIL
U+0B80 – U+0BFF
GOOGLE TAMIL TRANSLITERATOR IME (Input Method)
Google Transliteration IME is an input method editor that allows users to enter Tamil text using a Roman keyboard
PROGRAMMING LANGUAGE
Java
DATABASES
MySQL databases, accessed through JDBC
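Since input is expected in the Tamil Unicode block, a simple guard can reject anything else before analysis. The sketch below uses the standard `Character.UnicodeBlock` API. The class and method names are hypothetical, and whether the project performed such a check is not stated in the slides.

```java
/** Sketch only: accept input only if it lies entirely in the Tamil block. */
final class TamilBlock {
    /** True if the text is non-empty and every code point is in U+0B80..U+0BFF. */
    static boolean isTamilWord(String text) {
        return !text.isEmpty() && text.codePoints()
                .allMatch(cp -> Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.TAMIL);
    }
}
```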
Implementation Details - 3
TRANSLITERATION MODULE
A simple transliterator module to enable conversion from Tamil to English and vice versa
Example:
◦ அ - a
◦ ஆ - aa
◦ க - ka
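A table-driven transliterator along these lines could be sketched as below. Only the Tamil-to-Roman direction is shown, and only a handful of letters are mapped. Apart from the three examples above, the letter choices, like the class and method names, are assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Sketch only: Tamil-to-Roman transliteration for a few letters. */
final class Transliterator {
    private static final char PULLI = '\u0BCD';
    private static final Map<Character, String> VOWELS = new HashMap<>();      // independent vowels
    private static final Map<Character, String> SIGNS = new HashMap<>();       // dependent vowel signs
    private static final Map<Character, String> CONSONANTS = new HashMap<>();  // without inherent "a"
    static {
        VOWELS.put('\u0B85', "a");
        VOWELS.put('\u0B86', "aa");
        VOWELS.put('\u0B87', "i");
        SIGNS.put('\u0BBE', "aa");
        SIGNS.put('\u0BBF', "i");
        CONSONANTS.put('\u0B95', "k");
        CONSONANTS.put('\u0B9F', "t");
        CONSONANTS.put('\u0BA4', "th");
        CONSONANTS.put('\u0BA9', "n");
        CONSONANTS.put('\u0BAA', "p");
    }

    static String toRoman(String tamil) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tamil.length(); i++) {
            char c = tamil.charAt(i);
            if (VOWELS.containsKey(c)) {
                out.append(VOWELS.get(c));
                continue;
            }
            String base = CONSONANTS.get(c);
            if (base == null) {
                out.append(c);                    // unmapped character: copy through
                continue;
            }
            out.append(base);
            char next = i + 1 < tamil.length() ? tamil.charAt(i + 1) : 0;
            if (next == PULLI) {
                i++;                              // bare consonant, no vowel
            } else if (SIGNS.containsKey(next)) {
                out.append(SIGNS.get(next));
                i++;
            } else {
                out.append('a');                  // inherent vowel, as in "ka"
            }
        }
        return out.toString();
    }
}
```

With these tables, க maps to "ka" and ஆ to "aa" as in the examples, and படித்தான் comes out as "patiththaan".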
HASH TABLE GENERATOR
The application uses two data files, containing lists of vigudhi (verb endings) and idainilai (medial tense markers).
The Java hash generator code loads the data from the workbooks, adds the entries to a hash table, serializes the table and writes it to an external data file, which can be loaded whenever the application requires access.
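The serialize-once, load-on-demand step can be illustrated with standard Java object serialization. The sketch below leaves out reading the workbooks and shows only writing a hash table to a file and reading it back. The class name, the method names and the choice of `HashMap<String, String>` are assumptions.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.HashMap;

/** Sketch only: persist a suffix table and load it again later. */
final class SuffixTableStore {
    /** Serializes the table to the given file. */
    static void save(HashMap<String, String> table, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(table);
        }
    }

    /** Reads back a table previously written by save. */
    @SuppressWarnings("unchecked")
    static HashMap<String, String> load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return (HashMap<String, String>) in.readObject();
        }
    }
}
```

The generator would call `save` once after building the table, and the analyser would call `load` at start-up instead of parsing the source data again.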
Future Scope
The algorithm can be extended to cover nouns and noun forms too.
The algorithm can be improved to incorporate stricter rules so as to reduce conflicts that arise in the output generated by the current system.
The algorithm can be extended for other agglutinative languages.
The various resources obtained as a part of this project, including the EMILLE-CIIL ELRA Corpus, the Tamil Wordnet Database and other tools can be used for further study, research and development in the field of Natural Language Processing at our college in the years to come.
References
A Novel Approach to Morphological Analysis for Tamil Language
◦ Anand Kumar M, Dhanalakshmi V, Rajendran S, Soman K P
Nannool and Tholkaapiyam
◦ Tamil grammar texts
The Morphological Generator and Parsing Engine for Tamil Verb Forms
◦ Ultimate Software Solution, Dindigul
Morphological Analyzer for Tamil
◦ Anandan P, Ranjani Parthasarathy, Geetha T.V. [2002]
◦ ICON 2002, RCILTS-Tamil, Anna University, India
Morphology. A Handbook on Inflection and Word Formation
◦ Daelemans Walter, G. Booij, Ch. Lehmann, and J. Mugdan (eds.) [2004]
Tamil Part-of-Speech tagger based on SVMTool
◦ Dhanalakshmi V, Anandkumar M, Vijaya M.S, Loganathan R, Soman K.P, Rajendran S [2008]
◦ Proceedings of the COLIPS International Conference on Asian Language Processing 2008 (IALP)
Unsupervised Learning of the Morphology of a Natural Language
◦ John Goldsmith [2001]
◦ Computational Linguistics, 27(2):153–198
Computational morphology of verbal complex
◦ Rajendran, S., Arulmozi, S., Ramesh Kumar, Viswanathan, S. [2001]
◦ Paper read in Conference at Dravidian University, Kuppam, December 26-29, 2001
Thank you