Ma Language Format Final


Tamil Morphological Analyser

Vijay Sundar Ram R, Menaka S and Sobha Lalitha Devi

AU-KBC Research Centre, MIT Campus of Anna University, Chennai-44

{sundar,menaka,[email protected]}



Abstract. In morphologically rich languages, a word carries more grammatical information because of rich suffixation. Morphological analysis is the process of segmenting a given word into its component morphemes and assigning the correct morphosyntactic information. In this paper, we discuss a method for developing a morphological analyser for Tamil, a morphologically rich Dravidian language. It is designed using the paradigm-based approach and Finite State Automata, which work efficiently in recursive tasks and consider only the current state when making a transition. We tested the morphological analyser on online web data, and it achieves a correctness of 91.70%.



1. Introduction. Morphological analysis of a word is the process of segmenting the word into its component morphemes and assigning the correct morphosyntactic information. For a given word, a morphological analyser (MA) returns its root word and word class, along with other grammatical information that depends on the word class. An MA returns all possible parses for a given word, without considering the context. An MA is essential for languages with rich inflectional and derivational morphology, such as the Dravidian languages (Tamil, Telugu, Malayalam and Kannada), the Finno-Ugric languages (Finnish, Estonian, Hungarian), Turkish, and the Indo-Aryan languages (Hindi, Bengali, Marathi, Gujarati). In Indo-European languages such as French and English, where few affixes attach to the root word, lemmatization, the process of obtaining the root word (lemma), serves the purpose of an MA. An MA is a vital tool in NLP applications. In morphologically rich languages, where multiple affixes attach to a root, the finer grammatical information that helps in building efficient NLP applications can be obtained only from an MA. An MA is required in most applications, such as information extraction, question answering and machine translation, and even in information retrieval to get the correct root word.

Several approaches have been attempted for MA. The two-level morphology approach of Kimmo Koskenniemi is one of the earliest; he tested this formalism on Finnish (Koskenniemi 1983). In this two-level representation, the surface level describes word forms as they occur in written text and the lexical level encodes lexical units such as stems and suffixes. The two-level rules define a mapping between the two levels and are represented as Finite State Automata. This approach is used for both recognizing and generating word forms. The formalism has also been used for other languages such as Arabic, Dutch, English, French, German, Italian, Japanese, Portuguese, Swedish and Turkish (Schulze 1994). A rule-based, heuristic analyser for Finnish nominal and verb forms was developed by Jäppinen (Jäppinen 1983). A word-grammar based morphological analyser for agglutinative languages, using the two-level formalism and a unification-based formalism, was introduced by Agirre and colleagues (Aduriz et al. 2000), who worked on Basque, a highly agglutinative language. An Arabic Finite State Transducer for morphological analysis was built by Beesley using the Xerox Finite State Transducer (XFST), by extensively reworking the lexicon and rules of a Kimmo-style system (Beesley 1996). Similarly, using XFST, Wintner developed an analyser for Hebrew (Wintner 2005) and Megerdoomian one for Persian (Megerdoomian 2004). Kemal Oflazer developed a Finite State Machine (FSM) based Turkish MA. For Swahili, Elwell identified the features present in a word using syllables and surface-level clues (Elwell 2008). A weighted Finite State approach was used to handle Finnish compound words. A Finite State Automata based MA was developed for Tamil (). For Bengali, an unsupervised methodology was used in developing an MA (Dasgupta and Ng 2007) and the two-level morphology approach was used to handle Bengali compound words. Rule-based MAs were developed for Sanskrit (Jha et al. 2007) and Oriya (Mohanty 2004).

In this paper, we present a methodology for the morphological analysis of Tamil, a morphologically rich language, using Finite State Automata (FSA) and the paradigm approach. The remainder of the paper is organized as follows. The following section gives a short description of the morphology of Tamil, explaining the inflections and derivations of nouns and verbs. The third section briefly describes the orthographic changes that occur in Tamil during affixation. Section four explains our approach to building the Tamil morphological analyser. Section 5 discusses the experiments done to evaluate the MA, and the paper ends with the conclusion section.

2. Tamil morphology. Tamil belongs to the South Dravidian family of languages. It is a verb-final language with relatively free word order. It is an inflectional language, and agglutination is another of its features.

Tamil morphology is characterized as agglutinative or concatenative, i.e., words are formed by successively adding suffixes to the root word in series. When suffixes attach to the root, several morphophonemic changes take place. The order in which suffixes attach to a root form determines the morphosyntax of the language, and the various changes that take place when a suffix attaches are called the morphophonemics.

The lexical categories in Tamil and their morphological processes are discussed below.

2.1. Nouns. Nouns form an important lexical category in the language and they take inflectional as well as

derivational suffixes.

Suffixation to a noun is not arbitrary: the suffixes attach in a particular order. The number suffix attaches to the noun root, followed by the case suffix. Postpositions follow case. This, in turn, is followed by the clitics.

The number suffix is null for singular and kaL for plural. In a few cases, the plural suffix is mAr.

After the number suffix, the stem takes the case suffix. Computationally, Tamil has 8 cases. Lehmann (Lehmann 1989) also classifies the Tamil case system in a similar manner.

INSERT Table 1. HERE

The next suffix in the series may be the disjunction clitic (o:), the coordination clitic (um) or the emphatic clitic (e:). After the addition of the above suffixes, the emphatic suffix (taan) is added. This may be followed by the fifth suffix, which can be interrogative (a:) or supposition (a:m).



The Morphosyntax of Noun inflections may be summarized as

root + {number} + {case} + {DISJ/COOR/EMPH} + {PSP} + {EMP} + {INT/SUPP}
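The ordering in the template above can be checked mechanically. The following is a minimal sketch in Python and is not part of the system described in this paper. The slot contents are assumptions: the case tags are taken from the glosses used in this paper, PL stands for the plural suffix, and PSP is a hypothetical tag for postpositions.

```python
# Optional slots in the order given by the noun template.
# Tag names are assumptions drawn from the glosses in this paper.
NOUN_SLOTS = [
    {"PL"},                                             # {number}
    {"ACC", "DAT", "LOC", "INS", "SOC", "ABL", "GEN"},  # {case}
    {"DISJ", "COOR", "EMPH"},                           # first clitic slot
    {"PSP"},                                            # postposition (hypothetical tag)
    {"EMP"},                                            # emphatic suffix
    {"INT", "SUPP"},                                    # interrogative / supposition
]


def follows_noun_template(tags):
    """Return True if the suffix tags respect the slot order, each slot used at most once."""
    slot = 0
    for tag in tags:
        # Skip slots that this tag cannot fill.
        while slot < len(NOUN_SLOTS) and tag not in NOUN_SLOTS[slot]:
            slot += 1
        if slot == len(NOUN_SLOTS):
            return False  # the tag does not fit in any remaining slot
        slot += 1  # this slot is now filled
    return True
```

For instance, the tag sequence PL, ACC, COOR is accepted, while ACC, PL is rejected because the number suffix must precede the case suffix.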

The following examples illustrate the inflections of a noun.

INSERT Table 2 HERE.

Productive suffixes of nouns.

1. The suffixes -ka:ran/-ka:ri denoting 'man/woman' as in pa:l-ka:ran (milk-man)/pa:l-ka:ri (milk-woman)

2. The suffixes -an/-i attach to a noun denoting a quality to derive the noun for the man/woman with the quality.

Eg. kuruTu + an -> kuruTan

'blindness' + MAS -> 'blind person(MAS)'

kuruTu + i -> kuruTi

'blindness' + FEM -> 'blind person(FEM)'

3. The suffix tanam attaches to a noun to show the habit.

Eg. piTiva:Tam + tanam -> piTiva:Tattanam

'stubbornness' + SUFF -> 'habit of stubbornness'

aTimai + tanam -> aTimaittanam

'slave' + SUFF -> 'slavery'

4. The negative suffix -inmai can be added to several nouns to add the meaning of -lessness.

Eg. tu:kkam + inmai -> tu:kkaminmai

'sleep' + SUFF -> 'sleeplessness'

payam + inmai -> payaminmai

'fear' + SUFF -> 'fearlessness'

The derivations that are possible from noun roots are discussed below.

Derivation of verbs from nouns.

1. Certain verbs like aLi/aTi/cey/koTu/paNNu add to a noun to form the corresponding verb. These are quite productive, especially in the case of loan-words.



Eg. ka:pi + aTi -> ka:piyaTi

'copy' + 'beat' -> 'copy'

va:y + aTi -> va:yaTi

'mouth' + 'beat' -> 'chatter'

veLLai + aTi -> veLLaiyaTi

'white' + 'beat' -> 'whitewash'

ca:vi + koTu -> ca:vikoTu

'key' + 'give' -> 'wind up'

kaTan + koTu -> kaTankoTu

'loan' + 'give' -> 'lend'

Derivation of adjectives from nouns

1. The Dravidian suffix for adjective formation is a.

Eg. azaku + a -> azakiya

'beauty' + ADJ -> 'beautiful'

2. The extremely productive suffix a:na (which is a frozen form reduced from the past tense relative participle a:kiya of the verb a:ku) attaches to almost any noun denoting quality.

Eg. azaku + a:na -> azaka:na


ve:kam + a:na -> ve:kama:na

'speed' + ADJ -> 'fast'

3. The bound postposition uLLa/uTaiya attaches to several nouns to produce adjectives.

Eg. azaku + uLLa -> azakuLLa


4. The suffix -a:vatu/-a:m is added to numerals to form the corresponding ordinal adjective.

Eg.

iraNTu + a:vatu -> iraNTa:vatu

'two' + ADJ -> 'second'

iraNTu + a:m -> iraNTa:m

'two' + ADJ -> 'second'



5. The place names with an ending a: shorten the last vowel to form the corresponding adjective.

Eg. intiya: + tu:tarakam -> intiya tu:tarakam

'India' + 'Embassy' -> 'Indian Embassy'

6. The negative suffix -aRRa adds to nouns to produce an adjective.

Eg. oLi + aRRa -> oLiyaRRa

'light' + 'without' -> 'without light'

Derivation of adverbs from nouns

1. The suffix a:ka is equally productive in deriving adverbs from nouns as a:na in deriving adjectives from nouns. Sometimes a:ka gets reduced to a:y.

Eg. azaku + a:ka -> azaka:ka

'beauty' + ADV -> 'beautifully'

azaku + a:y -> azaka:y

'beauty' + ADV -> 'beautifully'

2. The negative suffixes -inRi/-anRi/-aRRu add to the root noun to produce an adverb.

Eg. paNam + anRi -> paNamanRi

'money' + 'apart' -> 'apart from money'

pa:tuka:ppu + inRi -> pa:tuka:ppinRi

'safety' + 'without' -> 'without safety'

mati + aRRu -> matiyaRRu

'intelligence' + 'without' -> 'without intelligence'

3. The bound postposition e:Rpa adds to the noun in dative case to give an adverb.

Eg. vitikku + e:Rpa -> vitikke:Rpa

'fate-DAT' + 'according' -> 'according to fate'

Compound Nouns. Compound noun formation is productive in Tamil. Compounds may be formed by several strategies, as explained by Rajendran (Rajendran 2004). Some examples below illustrate the abundance of compound nouns in Tamil.

Eg.

vattam + me:jai -> vatta me:jai



'round' + 'table' -> 'round table'

kuzantai + paruvam -> kuzantaipparuvam

'child' + 'period' -> 'childhood'

marapu + aNu -> marapaNu

'tradition' + 'atom' -> 'gene'

coTTu + ni:r + pa:canam -> coTTu ni:rp pa:canam

'drop' + 'water' + 'irrigation' -> 'drip irrigation'

aTi + vayiRu -> aTivayiRu

'below' + 'stomach' -> 'abdomen'

ka:l + kaTTu -> ka:lkaTTu

'leg' + 'knot' -> 'marriage'

Pronouns. Pronouns in Tamil are a closed set of words. They have person, number and gender (PNG)

information in them.

The following table summarizes the pronouns in Tamil


Pronouns are part of nouns and hence behave like nouns. The above pronouns inflect by taking the Case suffix

and other suffixes that a noun takes. Since number is inherent in the pronoun, it doesn't take an explicit Number

suffix. The only exception is ivai which sometimes takes a redundant plural suffix kaL.

2.2. Verbs. Verb forms can be broadly classified into two types.

1. Finite verbs

2. Non-finite verbs

Finite verbs. The verb root takes the tense suffix first, followed by a fused PNG suffix. This can be followed by

any of the clitics.

The Tense can be Past/Present/Future if it is in the affirmative. The negative form does not take tense.

The PNG Suffixes may be as in the table below.


The Morphosyntax of finite verbs may be summarized as

root + Tense + PNG + {DISJ/COOR/EMPH/EMP/INT/SUPP}



root + INF + NEGVERB + {DISJ/COOR/EMPH/EMP/INT/SUPP}

The following examples illustrate the above morphosyntactic rule.


The verb root, after taking the tense suffix, may take the relative participle marker a instead of the PNG suffix to produce the relative participles.


The above inflections can be summarized as

root + Tense/NEG + RP

Only derivations can happen at this point. One of the pronominal endings avan (3SM) / avaL (3SF) / avar (3SH) / atu (3SN) / avai (3PN) / avarkaL (3PE) may agglutinate to the relative participle, thus forming a noun.

Now, the word behaves as a noun and takes noun inflections.

Eg.

paTi + tt + a + avan + o:Tu -> paTittavano:Tu

read + PST + RP + 3SM + SOC -> with the one(MAS) who read

Non-finite verbs. The verb root may directly take one of the following suffixes: Infinitive, Verbal Participle, Conditional, Concessive, Hortative, Optative. These forms may also have a negative suffix attaching to the root before taking on these suffixes. Some of these forms take an auxiliary verb like iru/viTu to produce the negative form.

Eg.

pa:Tu + a -> pa:Ta

'sing' + INF

pa:Tu + a: + a -> pa:Ta:ta

'sing' + NEG + INF

pa:Tu + i -> pa:Ti

'sing' + VBP

pa:Tu + a: + u -> pa:Ta:tu

'sing' + NEG + VBP

pa:Tu + a:l -> pa:Tina:l

'sing' + COND

pa:Tu + a: +viTu+ a:l -> pa:Ta:viTTa:l



'sing' + NEG + AUXV + COND

pa:Tu + a:lum -> pa:Tina:lum

'sing' + CONC

pa:Tu + a: +viTu +a:lum -> pa:Ta:viTTa:lum

'sing' + NEG + AUXV + CONC

pa:Tu + ala:m -> pa:Tala:m

'sing' + HORT

pa:Tu + a: + iru + ala:m -> pa:Ta:tirukkala:m

'sing' + NEG + AUXV + HORT

pa:Tu + aTTum -> pa:TaTTum

'sing' + OPT

pa:Tu + a: + iru + aTTum -> pa:Ta:tirukkaTTum

'sing' + NEG + AUXV + OPT

The Morphosyntax of Non-finite Verbs may be summarized as

root + {NEG} + INF/VBP/COND/CONC/HORT/OPT + {DISJ/COOR/EMPH} + {EMP} + {INT/SUPP}

Derivation of nouns from verbs

1. From the Relative participle (RP) forms, nouns can be derived by the pronominalisation process, i.e., one of the pronominal suffixes avan, avaL, avar, avarkaL, atu, avai attaches to the RP form to produce a noun. This is very productive.

2. The suffix -kai is added to some verbs to produce nouns.

Eg. cey + kai -> ceykai

'do' + SUFF -> 'act'

3. The suffix -tal is added to several verbs to form the corresponding noun.

Eg. paRa + tal -> paRattal

'fly' + SUFF -> 'flying'

makiz + tal -> makiztal

'enjoy' + SUFF -> 'enjoying'

vaLar + tal -> vaLartal

'grow' + SUFF -> 'growing'



4. The suffix -pu is added to some verbs to form the corresponding noun.

Eg. vaLar + pu -> vaLarppu

'grow' + SUFF -> 'bringing up'

ninai + pu -> ninaippu

'think' + SUFF -> 'thought'

5. One of the suffix sequences -v-atu/-p-atu/-pp-atu is added to any verb to denote the action denoted by the verb.

Eg. cey + v + atu -> ceyvatu

'do' + FUT + SUFF -> 'doing'

Derivation of adjectives from verbs

1. Suffixes like -ataRka:na / -avaRRukka:na / -ataRkuriya / -ataRke:RRa / -takka are actually frozen forms of agglutinating words that can add to a verb root to form an adjective.

For instance, if we consider ataRka:na,

it is atu + ku + a:na -> ataRka:na

'that' + DAT + ADJ ->'for that'

Now this can attach to a verb

cey + ataRka:na -> ceyvataRka:na

'do' + 'for that' -> 'for doing'

Derivation of adverbs from verbs

1. Suffixes like a:Rpo:la / ava:Ru / a:ka / ma:Ru / a:mal and certain postpositions like anRi add to verbs to form adverbs.

Eg. paTi + tt + a:Rpo:la -> paTitta:Rpo:la

'read' + PST + ADV -> 'as though ... was reading'

paTi + ma:Ru -> paTikkuma:Ru

'read' + ADV -> 'to read'

paTi + a:mal -> paTikka:mal

'read' + NEG -> 'without reading'

2. enRu and ena are synonymous complementizers which add to reduplicating onomatopoeic forms to form adverbs.

Eg. kalakala + enRu -> kalakalavenRu



ONOM + ADV -> 'happily'

kalakala + ena -> kalakalavena

ONOM + ADV -> 'happily'

2.3. Inflections of other categories. In Tamil, the other POS categories do not inflect, but they take the clitics that the nouns / verbs take.

Hence for any other category apart from the ones discussed above, the morphosyntax is

root + {DISJ/COOR/EMPH} + {EMP} + {INT/SUPP}

Agglutination. Agglutination is a feature of the Tamil language. Due to the highly agglutinating nature of this

language and the morphophonemic variations that take place at the point of agglutination, it is very difficult to

mark the word boundaries.

Eg. arapi + katal + in + araci -> arapikkatalinaraci

'Arabian' + 'sea' + GEN + 'queen' -> 'Queen of the Arabian Sea'

paTi + ttu + koL + NTu + iru + t + a + avan + ai -> paTittukkoNTirutavanai

'read' + VBP + AUXV + VBP + AUXV + PST + RP + PRON-3SM + ACC -> 'the one(MAS) who was

reading(OBJ)'

3. Orthographic Rules in Tamil. "The ways in which the morphemes of a given language are variously

represented by phonemic shapes can be regarded as a kind of code. This code is the orthographic system of the

language." (Hockett 1958:135). It is also known as Internal Sandhi.

The orthographic rules in Tamil given below were arrived at from the works of Pope (Pope 1979) and tamiz

(Subramanian and Gnanasundaram 2001).

1. When the root word ends in a vowel and the attaching suffix begins with any vowel, the glide v or y is added depending on the following rules.

INSERT Table 7 HERE.

2. When the root word ends in one of the long close vowels (i:/u:) and the attaching suffix/word begins with one of the stop consonants k/c/T/t/p/R, the stop consonant doubles at the end of the root.

Eg. i: + kaL -> i:kkaL

'fly' + PL -> 'flies'

3. When the root word is of two syllables, with a short first syllable and ends in u, and the attaching suffix/word begins with one of the stop consonants k/c/T/t/p/R, the stop consonant doubles at the end of the root. In all other cases where the root word ends in u and the attaching suffix/word begins with one of the stop consonants k/c/T/t/p/R, there is no change.

Eg. maTu + kaL -> maTukkaL

'hillock' + PL -> 'hillocks'

4. When the root word ends in TTu or ttu and the suffix starts with k/c/T/t/p/R, there is no change.

Eg. pa:TTu + kaL -> pa:TTukaL

'song' + PL -> 'songs'

5. When the root word ends in the labial nasal m, and the attaching suffix/word begins with one of the stop consonants k/c/T/t/p/R, the m is replaced by the homorganic nasal of the stop consonant.

Eg. maram + kaL -> marakaL

'tree' + PL -> 'trees'

6. When the root word ends in the labial nasal m, and the attaching suffix begins with a vowel, the m is replaced by the oblique suffix tt.

Eg. maram + ai -> marattai

'tree' + ACC -> 'tree(OBJ)'

7. When the root word has a short single syllable and ends with the nasal n, and the attaching suffix/word begins with a vowel, the n doubles.

Eg. pon + iliruntu -> ponniliruntu

'gold' + ABL -> 'from gold'

8. When the root ends in the velar nasal, and the attaching suffix starts with a vowel, the homorganic stop consonant is added in between.

Eg. manmo:hanci + a:l -> manmo:hancika:l

'Manmohan Singh'+INS -> 'by Manmohan Singh'

9. When the root word has a short single syllable ending in a glide (y/v), and the attaching suffix starts with a vowel, the glide doubles. Native words do not end in v.

Eg. poy + ai -> poyyai

'lie' + ACC -> 'lie(OBJ)'

lav + a:l -> lavva:l

'love' + INS -> 'by love'

10. When the root word has a short single syllable and ends with a lateral (l/L), or is a longer word that ends in such a word, and the attaching suffix/word begins with one of the stop consonants k/c/T/t/p/R, the lateral may be replaced with the homorganic stop consonant.

Eg. kal + kaL -> kaRkaL

'stone' + PL -> 'stones'

cekal + kaL -> cekaRkaL

'brick' + PL -> 'bricks'

poruL + kaL -> poruTkaL

'thing' + PL -> 'things'

11. When the root word has a short single syllable and ends with a lateral (l/L), and the attaching suffix starts with a vowel, the lateral doubles.

Eg. kaL + il -> kaLLil

'toddy' + LOC -> 'in toddy'

pal + il -> pallil

'tooth' + LOC -> 'in the tooth'

12. When the root ends in a stop consonant k/c/T/t/p/R/j, and the attaching suffix starts with a vowel, the consonant doubles.

Eg. maik + ai -> maikkai

'mike' + ACC -> 'mike(obj)'

Tip + il -> Tippil

'tip' + LOC -> 'in the tip'

jeT + il -> jeTTil

'jet' + LOC -> 'in the jet'

ko:c + o:Tu -> ko:cco:Tu

'coach' + SOC -> 'with the coach'

pert + ukku -> perttukku

'berth' + DAT -> 'to the berth'

haj + ukku -> hajjukku

'Haj' + DAT -> 'to Haj'

But when the stop consonant is p, and it is preceded by the modifier H to denote the labiodental

fricative, there is no doubling.



Eg. vulHp + in -> vulHpin

'wolf' + GEN -> 'wolf's'

When the root is a loan-word ending in a stop consonant that is voiced in the source language and preceded by a long vowel, and the attaching suffix starts with a vowel, there is no change.

Eg. la:lpa:k + il -> la:lpa:kil

'Lalbagh' + LOC -> 'in Lalbagh'

vik + ai -> vikkai

'wig' + ACC -> 'wig(OBJ)'

13. When the root ends in a stop (k/c/T/t/p/R), preceded by the homorganic nasal (//N//m/n), and the attaching suffix starts with a vowel, there is no change.

Eg. vik + il -> vikil

'wing' + LOC -> 'in the wing'

pec + il -> pecil

'bench' + LOC -> 'on the bench'

pat + a:l -> pata:l

'bandh' + INS -> 'due to bandh'

14. In cases where the root ends in a sibilant (s/sh), preceded by a short vowel, the sibilant doubles.

Eg. push + in -> pushshin

'Bush' + GEN -> 'Bush's'

pas + il -> passil

'bus' + LOC -> 'on the bus'

In other cases where the ending sibilant is not preceded by a short vowel, the suffix can attach without

any change. Sometimes, we observe that the s is replaced with c.

Eg. pars + il -> parcil /parsil

'purse' + LOC -> 'in the purse'

When the attaching suffix starts with a consonant, there may be no change or the s may change to cu.

Eg. ke:s + kaL -> ke:skaL/ke:cukaL

'case' + PL -> 'cases'



For some of the rules above, the behaviour differs depending on whether the final consonant of a loan-word is voiced or voiceless in the source language. This information cannot be encoded directly in the rules, so the default rule for the particular consonant ending is applied.
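A few of the rules in this section can be written down directly as string operations. The sketch below, in Python, is an illustration only and is not the implementation used in this paper. It covers rules 2, 6, 7, 9 and 11 and falls back to plain concatenation otherwise. The function name join and the regular expression used to detect a short single syllable are assumptions made for this example.

```python
import re

VOWELS = "aeiou"
STOPS = "kcTtpR"

# Assumed test for a short single syllable ending in n, y, v, l or L:
# optional consonants, one short vowel (no ':' length mark), one final consonant.
SHORT_MONOSYLLABLE = re.compile(r"^[^aeiou:]*[aeiou][nyvlL]$")


def join(root, suffix):
    """Attach suffix to root, applying a small subset of the orthographic rules."""
    if not suffix:
        return root
    first = suffix[0]
    if first in VOWELS:
        if root.endswith("m"):
            # Rule 6: final m is replaced by the oblique suffix tt.
            return root[:-1] + "tt" + suffix
        if SHORT_MONOSYLLABLE.match(root):
            # Rules 7, 9, 11: the final nasal, glide or lateral doubles.
            return root + root[-1] + suffix
    elif first in STOPS and root.endswith(("i:", "u:")):
        # Rule 2: the stop consonant doubles after a long close vowel.
        return root + first + suffix
    # Default: no orthographic change.
    return root + suffix
```

With this sketch, join("maram", "ai") gives marattai and join("kaL", "il") gives kaLLil, matching the examples for rules 6 and 11. Rules that depend on syllable counts or on the source language of a loan-word are left out deliberately.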

4. Our Approach. In this approach, we build an FSA using all possible suffixes, categorize the root word lexicon based on the paradigm approach to reduce the number of orthographic rules, and use morphosyntax rules to get the correct analysis for the given word. An FSA is used because the analysis of the word is done suffix by suffix. FSAs are a proven technology for efficient and fast processing.

When applying the formalism of two-level morphology to morphologically rich languages, there are some well-known limitations, such as:

1. Developing Finite State transducers that encode very complex two-level rules is not easy.

2. Morphological categories are not directly encoded as a part of the lexical form.

3. Lexical representation tends to be arbitrary.

4. The various diacritical features inserted into the lexical strings to ensure proper analysis make the Kimmo-style system awkward or impractical for generation (Beesley 1996).

In our approach the complex affixations are easily handled by FSA and in the FSA, the required

orthographic changes are handled in every state.

Our MA consists of three major components:

1. Finite State Automata, modeled using all possible suffixes (allomorphs).

2. A lexicon, categorized based on the paradigm approach.

3. Morphosyntax rules, for filtering the correct parses of the word.

4.1. Finite State Automata (FSA). An FSA is a model of behavior composed of a finite number of states and transitions between these states. It is an abstract device used for recognizing simple syntactic structures or patterns. An automaton is normally depicted by a directed graph, called a State Diagram, and can also be represented in tabular form as a State Table. As a string processing device, an FSA accepts strings as input and decides whether the structure is correct, that is, it either accepts or rejects the string. From a mathematical perspective it is regarded as a function mapping a set of strings to the set {Accept, Reject}. Based on the transitions allowed, an FSA is classified as non-deterministic (NDFSA) or deterministic (DFSA).



The requirements of a DFSA are:

1. There are no transitions on the empty string (no null transitions).

2. No state has two outgoing transitions on the same symbol.
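The two requirements can be checked over a list of transitions. The following Python sketch is only an illustration: the representation of a transition as a (source, symbol, target) triple, the use of the empty string for a null transition, and the name is_deterministic are all assumptions made here.

```python
def is_deterministic(transitions):
    """Check the two DFSA requirements on (source, symbol, target) triples."""
    seen = set()
    for source, symbol, _target in transitions:
        if symbol == "":
            return False  # requirement 1: no null transitions
        if (source, symbol) in seen:
            return False  # requirement 2: one outgoing transition per symbol
        seen.add((source, symbol))
    return True
```

A table with the transitions (0, "kaL", 1) and (0, "atu", 0) passes the check, while a table in which state 0 has two transitions on kaL does not.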

Modeling of Suffix based FSA. The FSA is modeled using all possible suffixes, i.e., all the allomorphs, where allomorphs are the different morphs by which a morpheme is manifested in different environments. Eg: th, nth, in, i are the allomorphs of the past tense marker.

Here the FSA is built by considering the suffixes from right to left, i.e., moving from the end of the word towards the root word. Our implementation differs from other finite-state MAs in that the suffixes themselves are the symbols which trigger the transitions. After determinisation, the DFSA reduces to two states. The first suffixes, i.e., those affixed immediately to the root word, trigger the transition from state 0 to state 1. The other suffixes, which are affixed after the first suffix, form a self-loop at state 0. A sample State Table is shown in Table 8 and a sample State Diagram is shown in Figure 1.


INSERT Figure 1. HERE

The word is parsed in the FSA by identifying suffix by suffix, from the last suffix to the first suffix. Whenever

the transition is triggered by the suffix, that suffix is stripped from the word and required orthographic

corrections are done.

Orthographic Rules in FSA. Orthographic rules are the spelling rules used to model the changes that occur in a word, usually when morphemes are combined (Jurafsky and Martin 2000). The characters that are deleted from the root word or the suffix when a suffix (allomorph) is affixed are stored after the suffix in the state table. An example is given below.

0 0 atu a

Consider the word makanuTaiyatu, which has two suffixes, uTaiya and atu. When the word is parsed in the FSA, the last suffix atu is identified first. It triggers a transition to the same state; the suffix is stripped from the current word and the orthographic correction character a is added. The remaining word makanuTaiya is then parsed further.
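The state table entry and the stripping step described above can be sketched as follows. This Python fragment is an illustration under the assumption that each entry lists the source state, the target state, the suffix and the correction characters, in that order. The names Row, parse_row and strip_suffix are hypothetical.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    source: int
    target: int
    suffix: str
    correction: str


def parse_row(line: str) -> Row:
    """Read one state table entry such as '0 0 atu a' (assumed field order)."""
    source, target, suffix, correction = line.split()[:4]
    return Row(int(source), int(target), suffix, correction)


def strip_suffix(word: str, row: Row) -> Optional[str]:
    """Strip the suffix and restore the deleted characters, or return None if it does not match."""
    if not word.endswith(row.suffix):
        return None
    return word[: len(word) - len(row.suffix)] + row.correction


row = parse_row("0 0 atu a")
remaining = strip_suffix("makanuTaiyatu", row)  # "makanuTaiya"
```

Because the entry loops from state 0 back to state 0, the remaining word would be passed through the table again to find the next suffix.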



Root Information in FSA. In the state table, for the suffixes that attach directly to the root word (those leading to the end state), the category of the root is added after the orthographic correction characters. A sample of the state table is given below.

0 1 kaL m N13

Consider the word marakaL, where kaL is the suffix added to the root word. When this word is parsed through the FSA, the suffix kaL triggers the transition from state 0 to state 1; the suffix is stripped from the current word and the orthographic correction is done. The remaining word maram is looked up in the given category of the root word lexicon. If it is found there, this parse is considered a valid parse for the input word.
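Putting the pieces together, the backward analysis with a root lookup can be sketched in Python as below. This is a simplified illustration, not the system's code. The entries for atu and kaL follow the sample state table lines in the text, with word forms spelled as they appear there. The entry for uTaiya, the category N1 and the contents of the lexicon are hypothetical values added so that the example runs.

```python
# (source state, target state, suffix, correction characters, root category)
TABLE = [
    (0, 0, "atu", "a", None),     # from the text: self-loop, restores 'a'
    (0, 1, "kaL", "m", "N13"),    # from the text: attaches to roots of category N13
    (0, 1, "uTaiya", "", "N1"),   # hypothetical entry for illustration
]

# Hypothetical root lexicon, keyed by paradigm category.
LEXICON = {"N13": {"maram"}, "N1": {"makan"}}


def analyse(word, suffixes=()):
    """Return every (root, category, suffixes) parse found by stripping suffixes from the end."""
    parses = []
    for _source, target, suffix, fix, category in TABLE:
        if not word.endswith(suffix) or len(word) <= len(suffix):
            continue
        rest = word[: -len(suffix)] + fix      # strip suffix, apply correction
        found = (suffix,) + suffixes           # keep suffixes in surface order
        if target == 1:
            # End state: the remainder must be a root of the stated category.
            if rest in LEXICON.get(category, ()):
                parses.append((rest, category, found))
        else:
            # Self-loop at state 0: keep stripping suffixes.
            parses.extend(analyse(rest, found))
    return parses
```

Under these assumptions, analyse("marakaL") returns the single parse with root maram and suffix kaL, and analyse("makanuTaiyatu") returns the root makan with the suffixes uTaiya and atu. A word with no matching root yields an empty list.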

4.2. Lexicon paradigm based approach. In the paradigm approach, we group the root words into different groups, where every word in a group undergoes similar orthographic changes (sandhi changes) when a suffix is added to it.

Consider the words paTam and varam. When these two words are inflected with the plural marker kaL, the last character m is deleted in both and kaL is added to form paTakaL and varakaL. As these two words show the same orthographic changes, they are grouped under the same paradigm.

In our task, we have categorized nouns into 36 paradigms and verbs into 34 paradigms. The lexicon has 44055 root words.

Apart from the root word lexicon, a suffix list with suffixes and the corresponding syntactic

information is used, as MA has to assign the correct morphosyntactic information to the component morphemes.

4.3. Morphosyntax Rules. These are a set of rules that specify which classes of morphemes can follow other classes of morphemes inside a word. For example, the plural marker can occur only immediately after the noun root word, and it can be followed by a case marker or clitic. This set of rules filters the correct parses of the word from those produced by the FSA. Here we have 286 rules.
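One simple way to express such rules is as a set of permitted adjacent class pairs. The Python sketch below is an assumption about how a rule of this kind could be encoded and is not the format of the 286 rules used in the system. The class names N, PL, CASE and CLITIC are hypothetical, and the pairs beyond those stated in the example above are inferred from the noun template in section 2.1.

```python
# Permitted (preceding class, following class) pairs.
ALLOWED = {
    ("N", "PL"),        # plural marker immediately after the noun root
    ("PL", "CASE"),     # plural may be followed by a case marker
    ("PL", "CLITIC"),   # ... or by a clitic
    ("N", "CASE"),      # inferred from the noun template (assumption)
    ("N", "CLITIC"),    # inferred from the noun template (assumption)
    ("CASE", "CLITIC"), # inferred from the noun template (assumption)
}


def is_valid(classes):
    """True if every adjacent pair of morpheme classes is permitted."""
    return all(pair in ALLOWED for pair in zip(classes, classes[1:]))


def filter_parses(parses):
    """Keep only the candidate parses that satisfy the adjacency rules."""
    return [parse for parse in parses if is_valid(parse)]
```

For example, a candidate parse with the classes N, CASE, PL is discarded because a plural marker may not follow a case marker.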

Handling of Compound Words. In morphologically rich and productive languages like Tamil, the occurrence of compound words is high. In compound words, only the last word of the compound is inflected. We have handled this as follows.



Step 1: Parse the suffixes from the last suffix to the first suffix in the word, and check for the root word in the category given by the FSA.

Step 2: If the root word is not matched, go to step 3.

Step 3: The root word is split based on syllables and checked against the root dictionary.

Step 4: Once a word is matched, the remaining part of the word is split similarly and compared with the root dictionary.

Step 5: If the complete root word is matched as a sequence of root words in the dictionary, these multiple words are given as the root, together with the suffix information, as the analysis.

Step 6: If the complete root is not matched even after splitting into multiple words, the word is analysed as an unknown word.
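Steps 3 to 6 can be sketched as a recursive dictionary split. The Python example below is a simplification: it tries every character position, longest prefix first, instead of splitting at syllable boundaries as the system does, and the function name and the small lexicon are hypothetical.

```python
def split_compound(root, lexicon):
    """Split root into a list of dictionary words, or return None if no full split exists."""
    if root in lexicon:
        return [root]
    # Try the longest candidate first; the real system splits on syllables.
    for cut in range(len(root) - 1, 0, -1):
        head, tail = root[:cut], root[cut:]
        if head in lexicon:
            rest = split_compound(tail, lexicon)
            if rest is not None:
                return [head] + rest
    return None  # step 6: treated as an unknown word


# Hypothetical root dictionary for the compounds discussed in this paper.
ROOTS = {"maN", "me:Tu", "iravu", "pakal"}
```

With this dictionary, split_compound("maNme:Tu", ROOTS) yields the two roots maN and me:Tu, and a string with no complete split returns None, which corresponds to the unknown-word case.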

Another form is the inflected verb agglutinated with a pronoun, which can itself be inflected, such as

vatavan -> va: + t + a + avan

come+root past RP pronoun

Here the relative participle verb vata is agglutinated with avan, a pronoun. We have handled this with a separate rule in the morphosyntax rule file.

The negative verb illai also agglutinates with the infinitive form of a verb to form one word, such as

varavillai -> va: + a + illai

Come+root inf negative verb

This is handled in the same way as the previous example, by adding a separate rule.

5. Evaluation. We have evaluated the system with two sets of web data: the first set consists of words collected from the general domain and the second of words collected from the tourism domain. The details of the evaluation are shown in Table 9.


The tourism documents have more compound words and more agglutination of words. In this domain, there are more named entities such as person names, place names and area-specific words. The sentences commonly end with a:kum, a copula verb. This verb is agglutinated to the preceding noun phrase, such as



u:ra:kum -> u:r + a:kum

place + copula verb

Similarly there are more compound nouns, such as

maNme:TukaLuTaiya -> maN+me:Tu + kaL + uTaiya

sand dune pl genitive

Compound root suffix

Consider the word maNme:TukaLuTaiya, in which kaL and uTaiya are the two suffixes. After removing the suffixes, the remainder is maNme:Tu, which does not match any entry in the root word dictionary. The word is split and compared with the root word list, and maN and me:Tu are found to be the two root words forming maNme:Tu. Similarly

iravupakalai -> iravu+pakal + ai

night+day accusative

Compound root Suffix

teyvacceyalil -> teyvam+ceyal + il

Compound root Suffix

Periodic updating of the root word lexicon will help in improving the performance of the system.

6. Conclusion. This paper describes the design and development of a Tamil morphological analyser using Finite State Automata and the paradigm approach. The complex suffixation is effectively handled using FSA. The system performs at an average precision of 91.70%.



References

Beesley, Kenneth R. 1996. Arabic Finite-State Morphological Analysis and Generation. Proceedings of the 16th International Conference on Computational Linguistics, Vol. 1. Copenhagen, Denmark. 89-94.

Elwell, Robert and Jason Baldridge. 2008. Using Syllables as Features in Morpheme Tagging in Swahili. Proceedings of the Fifth Midwest Computational Linguistics Colloquium, East Lansing.

Itziar Aduriz, Eneko Agirre, Izaskun Aldezabal, Inaki Alegria, Xabier Arregi, Jose Maria Arriola, Xabier Artola, Koldo Gojenola Galletebeitia, Montse Maritxalar, Kepa Sarasola, Miriam Urkia. 2000. A word-grammar based morphological analyzer for agglutinative languages. In Proceedings of COLING'2000. 1-7.

Jäppinen, H., Lehtola, A., Nelimarkka, E. and Ylilammi, M. 1983. Morphological Analysis of Finnish: A Heuristic Approach. Report B26, Helsinki University of Technology, Digital Systems Laboratory, Helsinki, Finland.

Jurafsky, Daniel and James H. Martin. 2000. Speech and Language Processing. Prentice Hall.

Hockett, Charles F. 1958. A Course in Modern Linguistics. New York: Macmillan.

Girish Nath Jha, Muktanand Agarwal, Subash, Sudhir K Mishra, Diwakar Mishra, Manji Bhadra, Surjit K Singh. 2007. Inflectional Morphology for Sanskrit. In Proceedings of First International Symposium on Sanskrit Computational Linguistics. 46-77.

Koskenniemi, Kimmo. 1983. Two-Level Morphology: A General Computational Model for Word-Form Recognition and Production. Publication No. 11. Helsinki: Department of General Linguistics, University of Helsinki.

Lehmann, T. 1989. A Grammar of Modern Tamil. Pondicherry: Pondicherry Institute of Linguistics and Culture.



Megerdoomian, Karine. 2004. Finite-State Morphological Analysis of Persian. In Proceedings of the Workshop on Computational Approaches to Arabic Script-based Languages. Coling 2004, University of Geneva, Switzerland.

Mohanty, S., Santi, P.K., Adhikary, K.P.D. 2004. Analysis and Design of Oriya Morphological Analyser: Some Tests with OriNet. In Proceedings of the Symposium on Indian Morphology, Phonology and Language Engineering, IIT Kharagpur.

Pope, G. U. 1904. A Handbook of the Tamil Language. 7th ed. First published Oxford. New Delhi: Asian Educational Services, 1979.

Sajib Dasgupta, Vincent Ng. 2007. Unsupervised morphological parsing of Bengali. Language Resources and Evaluation 40:3-4, pp. 311-330.

Sajib Dasgupta, Dewan Shahriar Hossain Pavel, Asif Iqbal Sarkar, Naira Khan and Mumit Khan. 2005. Morphological Analysis of Inflecting Compound Words in Bangla. Proc. 8th International Conference on Computer & Information Technology (ICCIT), Islamic University of Technology (IUT), Dhaka, Bangladesh.

Schulze, B. M. et al. 1994. DECIDE: Designing and Evaluating Extraction Tools for Collocations in Dictionaries and Corpora. MLAP Project 93-19.

Viswanathan, S., Ramesh Kumar, S., Kumara Shanmugam, B., Arulmozi, S. and Vijay Shanker, K. 2003. A Tamil Morphological Analyser. Proceedings of the International Conference on Natural Language Processing ICON 2003, Central Institute of Indian Languages, Mysore, India, pp. 31-39.

Yona, S. and Wintner, S. 2005. A finite-state morphological grammar of Hebrew. In Proceedings of the ACL-2005 Workshop on Computational Approaches to Semitic Languages, Ann Arbor.



Table 1. Tamil Case System.

Case           Case Suffix
Nominative     (none)
Accusative     ai
Dative         ku
Instrumental   a:l
Sociative      o:Tu/uTan
Locative       il/iTam
Ablative       iliruntu
Genitive       in/atu/uTaiya



Table 2. Inflections of a noun.

Root           Number   Case           Postposition  Clitic     Word
paiyan 'boy'   SG       NOM            -             -          paiyan 'boy'
paiyan 'boy'   SG       ai ACC         -             -          paiyanai 'boy(OBJ)'
paiyan 'boy'   SG       ku DAT         -             -          paiyanukku 'to the boy'
paiyan 'boy'   SG       a:l INS        -             -          paiyana:l 'by the boy'
paiyan 'boy'   SG       o:Tu SOC       -             -          paiyano:Tu 'with the boy'
pai 'bag'      SG       il LOC         -             -          paiyil 'in the bag'
pai 'bag'      SG       iliruntu ABL   -             -          paiyiliruntu 'from the bag'
pai 'bag'      SG       in GEN         -             -          paiyin 'bag's'
paiyan 'boy'   SG       ai ACC         -             e: EMPH    paiyanaiye: 'the boy(OBJ) himself'
paiyan 'boy'   SG       ai ACC         -             a: INT     paiyanaiya: 'the boy(OBJ)?'
paiyan 'boy'   SG       ukku DAT       -             ta:n EMPH  paiyanukkuta:n 'it is for the boy'
paiyan 'boy'   SG       ukku DAT       ku:Ta PSP     ta:n EMPH  paiyanukkukku:Tata:n 'it is also for the boy'
paiyan 'boy'   kaL PL   NOM            -             -          paiyankaL 'boys'
paiyan 'boy'   kaL PL   ai ACC         -             -          paiyankaLai 'boys(OBJ)'
paiyan 'boy'   kaL PL   ku DAT         -             -          paiyankaLukku 'to the boys'
paiyan 'boy'   kaL PL   a:l INS        -             -          paiyankaLa:l 'by the boys'
paiyan 'boy'   kaL PL   o:Tu SOC       -             -          paiyankaLo:Tu 'with the boys'
pai 'bag'      kaL PL   il LOC         -             -          paikaLil 'in the bags'
pai 'bag'      kaL PL   iliruntu ABL   -             -          paikaLiliruntu 'from the bags'
pai 'bag'      kaL PL   in GEN         -             -          paikaLin 'of the bags'
paiyan 'boy'   kaL PL   ai ACC         -             e: EMPH    paiyankaLaiye: 'the boys(OBJ) themselves'
paiyan 'boy'   kaL PL   ai ACC         -             a: INT     paiyankaLaiya: 'the boys(OBJ)?'
paiyan 'boy'   kaL PL   ukku DAT       -             ta:n EMPH  paiyankaLukkuta:n 'it is for the boys'
paiyan 'boy'   kaL PL   ai ACC         viTa PSP      a: INT     paiyankaLaiviTava: 'than boys(OBJ)?'



Table 3. Pronouns in Tamil.

                Non-neuter                                                          Neuter
Person          Singular      Plural                        Honorific         Singular     Plural
First Person    na:n 'I'      na:m 'We' (inclusive)         -                 na:n 'I'     na:m 'We' (inclusive)
                              na:ngkaL 'We' (exclusive)                                    na:ngkaL 'We' (exclusive)
Second Person   ni: 'You'     ni:ngkaL/ni:vir 'You'         ni:ngkaL 'You'    ni: 'You'    ni:ngkaL 'You'
Third Person    avan 'He'     avarkaL 'They'                avar 'He/She'     atu 'It'     avai 'Those'
                avaL 'She'



Table 4. PNG in Tamil

Person   Number               Gender               PNG Suffix
First    Singular             Masculine/Feminine   -e:n
First    Plural               Masculine/Feminine   -o:m
Second   Singular             Masculine/Feminine   -a:y
Second   Plural               Masculine/Feminine   -i:rkaL
Second   Singular Honorific   Masculine/Feminine   -i:r
Third    Singular             Masculine            -a:n
Third    Singular             Feminine             -a:L
Third    Plural               Masculine/Feminine   -a:rkaL
Third    Singular Honorific   Masculine/Feminine   -a:r
Third    Singular             Neuter               -atu
Third    Plural               Neuter               -ana
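
To show how the PNG suffixes of Table 4 could be used during analysis, here is a small Python sketch that identifies the PNG marker at the end of a finite verb by longest-suffix match. It is an illustration only: the function name png_of is hypothetical, the third-person tags follow the abbreviation list (3SM, 3SF, 3PE, 3SH, 3SN, 3PN), and the first- and second-person tags (1S, 1P, 2S, 2P, 2SH) are labels assumed here for the example.

```python
# PNG suffixes from Table 4; the 1st/2nd person tag names are assumptions.
PNG_SUFFIXES = {
    "e:n": "1S", "o:m": "1P",
    "a:y": "2S", "i:rkaL": "2P", "i:r": "2SH",
    "a:n": "3SM", "a:L": "3SF", "a:rkaL": "3PE", "a:r": "3SH",
    "atu": "3SN", "ana": "3PN",
}


def png_of(verb_form):
    """Return (suffix, tag) for the longest PNG suffix ending the form, or None."""
    for suffix in sorted(PNG_SUFFIXES, key=len, reverse=True):
        if verb_form.endswith(suffix):
            return suffix, PNG_SUFFIXES[suffix]
    return None


print(png_of("paTitta:n"))     # ('a:n', '3SM')
print(png_of("paTikkiRa:L"))   # ('a:L', '3SF')
```

Trying the longest suffixes first keeps a:rkaL and i:rkaL from being missed in favour of a shorter ending. Because the match is purely on the string, it can misfire on non-verb words that happen to end in one of these sequences.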



Table 5. Inflections of verbs.

Root          Tense/Inf+NEG            PNG       Clitics   Example
paTi 'read'   tt PST                   a:n 3SM   -         paTitta:n '(He) read'
paTi 'read'   kkiR PRE                 a:L 3SF   -         paTikkiRa:L '(She) is reading'
paTi 'read'   um FUT                   3SN       -         paTikkum '(It) will read'
paTi 'read'   tt PST                   a:n 3SM   a: INT    paTitta:na:? 'Did (he) read?'
paTi 'read'   a + illai INF+NEGVERB    -         a: INT    paTikkavillaiya:? 'Did not read?'



Table 6. Relative participle formation

Root   Tense   Relative Participle marker   Form
paTi   tt      a                            paTitta
paTi   kkiR    a                            paTikkiRa
paTi   um      -                            paTikkum
paTi   a:      a                            paTikka:ta



Table 7. Glides taken by words ending in a vowel.

Ending Vowel           Glide   Example
Mid Open short a       v       Native root words do not end in a
Mid Open long a:       v       ci:ta: + ai -> ci:ta:vai ('Sita' + ACC -> 'Sita(obj)')
Front Close short i    y       puli + uTan -> puliyuTan ('tiger' + SOC -> 'with the tiger')
Front Close long i:    y       ti: + a:l -> ti:ya:l ('fire' + INS -> 'due to fire')
Back Close short u     v       e:cu + ai -> e:cuvai ('Jesus' + ACC -> 'Jesus(obj)')
Back Close long u:     v       pu: + in -> pu:vin ('flower' + GEN -> 'flower's')
Front Mid short e      y       Native root words do not end in e; puNe + il -> puNeyil ('Pune' + LOC -> 'in Pune')
Front Mid long e:      y       Native root words do not end in e:; me: + il -> me:yil ('May' + LOC -> 'in May')
Back Mid short o       v       Native root words do not end in o
Back Mid long o:       v       Native root words do not end in o:; a:TTo: + in -> a:TTo:vin ('auto' + GEN -> 'auto's')
Diphthong ai           y       a:cai + a:l -> a:caiya:l ('desire' + INS -> 'due to desire')
Diphthong au           v       Native root words do not end in au; laknau + il -> laknauvil ('Lucknow' + LOC -> 'in Lucknow')
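
The glide rule of Table 7 can be expressed compactly in code. The Python sketch below is an assumption-laden illustration rather than the system's actual sandhi module: the function name join_with_glide and the lookup table are hypothetical, and only the glide insertion listed in the table is modelled (other sandhi processes, such as deletion of a final vowel, are ignored).

```python
# Final vowel -> glide, as listed in Table 7 (transliteration as in the paper).
GLIDE_BY_ENDING = {
    "a:": "v", "i:": "y", "u:": "v", "e:": "y", "o:": "v",
    "ai": "y", "au": "v",
    "a": "v", "i": "y", "u": "v", "e": "y", "o": "v",
}
VOWEL_STARTS = ("a", "i", "u", "e", "o")


def join_with_glide(root, suffix):
    """Join root and suffix, inserting y or v between two vowels."""
    # A glide is only needed when the suffix begins with a vowel.
    if not suffix.startswith(VOWEL_STARTS):
        return root + suffix
    # Check two-character endings (long vowels, diphthongs) before short vowels.
    for ending in sorted(GLIDE_BY_ENDING, key=len, reverse=True):
        if root.endswith(ending):
            return root + GLIDE_BY_ENDING[ending] + suffix
    # Root ends in a consonant: plain concatenation.
    return root + suffix


print(join_with_glide("puli", "uTan"))   # puliyuTan
print(join_with_glide("pu:", "in"))      # pu:vin
```

An analyser would apply the same table in reverse, removing a y or v that sits between a vowel-final root candidate and a vowel-initial suffix before looking the root up in the lexicon.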



Table 8. Sample of the State Table

Current State   Next State   Symbol
0               0            ai
0               0            utaiya
0               1            kal
0               1            ai
0               1            utaiya
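
As a rough illustration of how a state table of this kind can drive analysis, the following Python sketch walks a word from its right edge, consuming one suffix per transition and consulting only the current state. The transition entries, the state numbering, the root list and the function name analyse are hypothetical and chosen for the example; they do not reproduce the system's actual state table.

```python
# (current_state, suffix) -> next_state. Hypothetical entries for illustration.
TRANSITIONS = {
    (0, "ai"): 1,       # case marker at the end of the word
    (0, "uTaiya"): 1,   # case marker at the end of the word
    (0, "kaL"): 2,      # plural marker with no case marker after it
    (1, "kaL"): 2,      # plural marker to the left of a case marker
}
ROOTS = {"paiyan", "maN"}  # toy root lexicon (assumed)


def analyse(word, state=0):
    """Return all (root, suffixes) parses, scanning suffixes right to left."""
    results = []
    if word in ROOTS:
        results.append((word, []))
    for (source, suffix), target in TRANSITIONS.items():
        # Only transitions leaving the current state are considered.
        if source == state and word.endswith(suffix) and len(word) > len(suffix):
            for root, inner in analyse(word[: -len(suffix)], target):
                results.append((root, inner + [suffix]))
    return results


print(analyse("paiyankaLai"))  # [('paiyan', ['kaL', 'ai'])]
```

Because every parse that reaches a known root is collected, the function returns all possible analyses of a word rather than a single one, in line with the context-free behaviour described for the analyser.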



Figure 1. Sample State Diagram



Table 9. Evaluation of Morphological analyser

Types                                                             General Domain   Tourism Domain
Total number of words                                             50,000           50,000
Analysed words                                                    46,620           45,085
Error due to missing morphosyntax rules and state table entries   223              344
Error due to agglutination                                        485              531
Error due to missing root word                                    1,345            1,987
Input error                                                       1,327            2,053
Correctness of analysis                                           93.24%           90.17%
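
The figures in Table 9 can be cross-checked with a few lines of arithmetic. The short Python sketch below recomputes the correctness percentages as analysed words divided by total words and confirms that the four error categories account for all unanalysed words; the dictionary layout and the function names are assumptions made for this check.

```python
# Counts copied from Table 9.
RESULTS = {
    "General": {"total": 50000, "analysed": 46620, "errors": [223, 485, 1345, 1327]},
    "Tourism": {"total": 50000, "analysed": 45085, "errors": [344, 531, 1987, 2053]},
}


def correctness(domain):
    """Percentage of words analysed, rounded to two decimals."""
    row = RESULTS[domain]
    return round(100.0 * row["analysed"] / row["total"], 2)


def unanalysed(domain):
    """Number of words that did not receive an analysis."""
    row = RESULTS[domain]
    return row["total"] - row["analysed"]


for name, row in RESULTS.items():
    # The error categories should add up to the unanalysed words.
    print(name, correctness(name), unanalysed(name), sum(row["errors"]))

average = (correctness("General") + correctness("Tourism")) / 2
print(average)
```

The per-domain values come out as 93.24 and 90.17, the error counts sum to 3,380 and 4,915 unanalysed words respectively, and the mean of the two percentages is about 91.705, consistent with the reported overall figure of 91.70%.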



Table 10. Linguistic abbreviations.

Abbreviation   Full Form
3PE            3rd person Plural Epicene
3PN            3rd person Plural Neuter
3SF            3rd person Singular Feminine
3SH            3rd person Singular Honorific
3SM            3rd person Singular Masculine
3SN            3rd person Singular Neuter
ABL            Ablative
ACC            Accusative
ADJ            Adjective Suffix
ADV            Adverb Suffix
AUXV           Auxiliary Verb
CAUS           Causative
COND           Conditional
CONC           Concessive
COOR           Coordination Clitic
DAT            Dative
DISJ           Disjunction Clitic
EMPH           Emphatic Clitic
EMP            Emphatic Suffix
FEM            Feminine
FUT            Future Tense
GEN            Genitive
HORT           Hortative
INF            Infinitive
INS            Instrumental
INT            Interrogative
LOC            Locative
MAS            Masculine
NEG            Negative Suffix
NEGVERB        Negative Verb
OBJ            Object
ONOM           Onomatopoeic form
OPT            Optative
PL             Plural
PRON-3SM       Pronominal - 3rd person Singular Masculine
PRE            Present Tense
PSP            Postposition
PST            Past Tense
RP             Relative Participle
SOC            Sociative
SUFF           Suffix
SUPP           Supposition marker

i According to Schiffman (1994), the usual treatment of Tamil case (Arden 1942) is one where there are seven cases: the nominative (first case), accusative (second case), instrumental (third), dative (fourth), ablative (fifth), genitive (sixth), and locative (seventh). The vocative is sometimes given a place in the case system as an eighth case, although vocative forms do not participate in usual morphophonemic alternations, nor do they govern the use of any postpositions.
