Ma Language Format Final

  • 8/7/2019 Ma Language Format Final

    Tamil Morphological Analyser

    Vijay Sundar Ram R, Menaka S and Sobha Lalitha Devi

    AU-KBC Research Centre, MIT Campus of Anna University, Chennai-44

    {sundar,menaka,[email protected]}

    Abstract. In morphologically rich languages, a word carries more grammatical information because of rich suffixation. Morphological analysis is the process of segmenting a given word into its component morphemes and assigning the correct morphosyntactic information. In this paper, we discuss a method for developing a morphological analyser for Tamil, a morphologically rich Dravidian language. It is designed using the paradigm-based approach and Finite State Automata, which work efficiently in recursive tasks and consider only the current state when making a transition. We tested the morphological analyser on online web data, and it achieves a correctness of 91.70%.

    1. Introduction. Morphological analysis of a word is the process of segmenting the word into its component morphemes and assigning the correct morphosyntactic information. For a given word, a morphological analyser (MA) returns its root word and word class, along with other grammatical information depending on the word class. An MA returns all possible parses for a given word, without considering the context. An MA is essential for languages with rich inflectional and derivational morphology, such as the Dravidian languages (Tamil, Telugu, Malayalam and Kannada), Finno-Ugric languages (Finnish, Estonian, Hungarian), Turkish, and Indo-Aryan languages (Hindi, Bengali, Marathi, Gujarati). In languages such as French and English, where few affixes attach to the root word, lemmatization, the process of obtaining the root word (lemma), serves the purpose of an MA. MA is a vital tool in NLP applications. In morphologically rich languages, where there is multiple affixation, the finer grammatical information that helps in building efficient NLP applications can be obtained only from an MA. MA is required in most applications, such as information extraction, QA systems and machine translation, and even in information retrieval to get the correct root word.

    There have been several approaches to MA. The two-level morphology approach of Kimmo Koskenniemi is one of the earliest attempts; he tested this formalism on Finnish (Koskenniemi 1983). In this two-level representation, the surface level describes word forms as they occur in written text and the lexical level encodes lexical units such as stems and suffixes. The two-level rules define a mapping between the two levels and are represented as Finite State Automata. This approach is used for both recognizing and generating word forms. The formalism has also been used for other languages such as Arabic, Dutch, English, French, German, Italian, Japanese, Portuguese, Swedish and Turkish (Schulze 1994). A rule-based heuristic analyser for Finnish nominal and verb forms was developed by Jappinen (Jappinen 1983). A word-grammar based morphological analyser for agglutinative languages, using the two-level formalism and a unification-based formalism, was introduced by Agirve (Itziar 2000), who worked on Basque, a highly agglutinative language. An Arabic Finite State Transducer for morphological analysis using the Xerox Finite State Transducer (XFST) was built by Beesley, by extensively reworking the lexicon and rules in the Kimmo style (Beesley 1996). Similarly, using XFST, Wintner built an MA for Hebrew (Wintner 2005) and Karine built a Persian MA (Karine 2004). Kemal Oflazer developed a Finite State Machine (FSM) based Turkish MA. For Swahili, Robert Elwell identified the features present in a word from its syllables, using surface-level clues (Elwell 2008). A weighted Finite State approach was used to handle Finnish compound words. A Finite State Automata based MA was developed for Tamil (). In Bengali, an unsupervised methodology was used in developing an MA (Sajib Dasgupta,

    2007) and the two-level morphology approach was used to handle Bengali compound words. Rule-based MAs were developed for Sanskrit (Girish Nath Jha 2007) and Oriya (Mohanty 2004).

    In this paper, we present a methodology for the morphological analysis of Tamil, a morphologically rich language. We use Finite State Automata (FSA) and the paradigm approach. The remainder of the paper is organized as follows. Section 2 gives a short description of Tamil morphology, covering the inflections and derivations of nouns and verbs. Section 3 describes the orthographic changes that occur during affixation in Tamil. Section 4 explains our approach to building the Tamil morphological analyser. Section 5 discusses the experiments done to evaluate the MA, and Section 6 concludes the paper.

    2. Tamil morphology. Tamil belongs to the South Dravidian family of languages. It is a verb-final language with relatively free word order. It is an inflectional language, and agglutination is another of its features.

    Tamil morphology is characterized as agglutinative or concatenative, i.e., words are formed by successively adding suffixes to the root word in series. When suffixes attach to the root, several morphophonemic changes take place. The order in which suffixes attach to a root determines the morphosyntax of the language, and the various changes that take place when a suffix attaches are called the morphophonemics.

    The lexical categories in Tamil and their morphological processes are discussed below.

    2.1. Nouns. Nouns form an important lexical category in the language and they take inflectional as well as

    derivational suffixes.

    Suffixation to a noun is not arbitrary: the suffixes attach in a particular order. The number suffix attaches to the noun root, followed by the case suffix. Postpositions follow case. These, in turn, are followed by the clitics.

    The number suffix is null for singular and kaL for plural. In a few cases, the plural suffix is mAr.

    After the number suffix, the stem takes the case suffix. Computationally, Tamil has 8 cases. Lehmann (Lehmann 1989) also classifies the Tamil case system in a similar manner.

    INSERT Table 1. HERE

    The next suffix in the series may be the disjunction clitic (o:), the coordination clitic (um) or the emphatic clitic (e:). After these, the emphatic suffix (taan) is added. This may be followed by the fifth suffix, which can be interrogative (a:) or supposition (a:m).

    The Morphosyntax of Noun inflections may be summarized as

    root + {number} + {case} + {DISJ/COOR/EMPH} + {PSP} + {EMP} + {INT/SUPP}
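The template above can be read as an ordered sequence of optional slots. The following is a minimal Python sketch of that reading, assuming every suffix has already been labelled with a tag. The case tags are simply the gloss labels that appear in this paper's examples (Table 1 is not reproduced here), PL stands for the plural suffix, and the names NOUN_SLOTS and follows_noun_template are hypothetical.

```python
# Ordered optional slots taken from the noun template in the text.
# Tag names are illustrative assumptions, not the authors' tag set.
NOUN_SLOTS = [
    {"PL"},                                             # {number}
    {"ACC", "DAT", "INS", "SOC", "LOC", "ABL", "GEN"},  # {case}
    {"DISJ", "COOR", "EMPH"},                           # first clitic slot
    {"PSP"},                                            # postposition
    {"EMP"},                                            # emphatic suffix
    {"INT", "SUPP"},                                    # interrogative / supposition
]


def follows_noun_template(tags):
    """Return True if the suffix tags fill the slots in template order.

    Every slot is optional, but a slot can be used at most once and
    slots cannot be revisited once a later slot has been filled.
    """
    slot = 0
    for tag in tags:
        # Skip over slots that this tag cannot fill.
        while slot < len(NOUN_SLOTS) and tag not in NOUN_SLOTS[slot]:
            slot += 1
        if slot == len(NOUN_SLOTS):
            return False  # no remaining slot accepts this tag
        slot += 1  # the slot is now used up
    return True


print(follows_noun_template(["PL", "ACC", "COOR"]))  # plural, case, clitic
print(follows_noun_template(["ACC", "PL"]))          # case before number
```

A check of this kind only validates the order of the tags; it says nothing about the orthographic changes at each suffix boundary.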

    The following examples illustrate the inflections of a noun.

    INSERT Table 2 HERE.

    Productive suffixes of nouns.

    1. The suffixes -ka:ran/-ka:ri denote 'man/woman', as in pa:l-ka:ran (milk-man)/pa:l-ka:ri (milk-woman).

    2. The suffixes -an/-i attach to a noun denoting a quality to derive the noun for the man/woman with the quality.
    Eg. kuruTu + an -> kuruTan
    'blindness' + MAS -> 'blind person(MAS)'
    kuruTu + i -> kuruTi
    'blindness' + FEM -> 'blind person(FEM)'

    3. The suffix tanam attaches to a noun to show the habit.
    Eg. piTiva:Tam + tanam -> piTiva:Tattanam
    'stubbornness' + SUFF -> 'habit of stubbornness'
    aTimai + tanam -> aTimaittanam
    'slave' + SUFF -> 'slavery'

    4. The negative suffix -inmai can be added to several nouns to add the meaning of -lessness.
    Eg. tu:kkam + inmai -> tu:kkaminmai
    'sleep' + SUFF -> 'sleeplessness'
    payam + inmai -> payaminmai
    'fear' + SUFF -> 'fearlessness'

    The derivations that are possible from noun roots are discussed below.

    Derivation of verbs from nouns.

    1. Certain verbs like aLi/aTi/cey/koTu/paNNu add to a noun to form the corresponding verb. These are quite productive, especially in the case of loan-words.

    Eg. ka:pi + aTi -> ka:piyaTi

    'copy' + 'beat' -> 'copy'

    va:y + aTi -> va:yaTi

    'mouth' + 'beat' -> 'chatter'

    veLLai + aTi -> veLLaiyaTi

    'white' + beat -> 'whitewash'

    ca:vi + koTu -> ca:vikoTu

    'key' + 'give' -> 'wind up'

    kaTan + koTu -> kaTankoTu

    'loan' + 'give' -> 'lend'

    Derivation of adjectives from nouns

    1. The Dravidian suffix for adjective formation is a.
    Eg. azaku + a -> azakiya
    'beauty' + ADJ -> 'beautiful'

    2. The extremely productive suffix a:na (which is a frozen form reduced from the past tense relative participle a:kiya of the verb a:ku) attaches to almost any noun denoting quality.
    Eg. azaku + a:na -> azaka:na
    'beauty' + ADJ -> 'beautiful'
    ve:kam + a:na -> ve:kama:na
    'speed' + ADJ -> 'fast'

    3. The bound postposition uLLa/uTaiya attaches to several nouns to produce adjectives.
    Eg. azaku + uLLa -> azakuLLa
    'beauty' + ADJ -> 'beautiful'

    4. The suffix -a:vatu/-a:m is added to cardinal numerals to form the corresponding ordinal adjective.
    Eg. iraNTu + a:vatu -> iraNTa:vatu
    'two' + ADJ -> 'second'
    iraNTu + a:m -> iraNTa:m
    'two' + ADJ -> 'second'

    5. Place names ending in a: shorten the last vowel to form the corresponding adjective.
    Eg. intiya: + tu:tarakam -> intiya tu:tarakam
    'India' + 'Embassy' -> 'Indian Embassy'

    6. The negative suffix -aRRa adds to nouns to produce an adjective.
    Eg. oLi + aRRa -> oLiyaRRa
    'light' + 'without' -> 'without light'

    Derivation of adverbs from nouns

    1. The suffix a:ka is as productive in deriving adverbs from nouns as a:na is in deriving adjectives from nouns. Sometimes a:ka gets reduced to a:y.
    Eg. azaku + a:ka -> azaka:ka
    'beauty' + ADV -> 'beautifully'
    azaku + a:y -> azaka:y
    'beauty' + ADV -> 'beautifully'

    2. The negative suffixes -inRi/-anRi/-aRRu add to the root noun to produce an adverb.
    Eg. paNam + anRi -> paNamanRi
    'money' + 'apart' -> 'apart from money'
    pa:tuka:ppu + inRi -> pa:tuka:ppinRi
    'safety' + 'without' -> 'without safety'
    mati + aRRu -> matiyaRRu
    'intelligence' + 'without' -> 'without intelligence'

    3. The bound postposition e:Rpa adds to the noun in dative case to give an adverb.
    Eg. vitikku + e:Rpa -> vitikke:Rpa
    'fate-DAT' + 'according' -> 'according to fate'

    Compound Nouns. Compound noun formation is productive in Tamil. Compounds may be formed by several strategies, as explained by Rajendran (Rajendran 2004). The examples below illustrate the abundance of compound nouns in Tamil.

    Eg.

    vattam + me:jai -> vatta me:jai

    'round' + 'table' -> 'round table'

    kuzantai + paruvam -> kuzantaipparuvam

    'child' + 'period' -> 'childhood'

    marapu + aNu -> marapaNu

    'tradition' + 'atom' -> 'gene'

    coTTu + ni:r + pa:canam -> coTTu ni:rp pa:canam

    'drop' + 'water' + 'irrigation' -> 'drip irrigation'

    aTi + vayiRu -> aTivayiRu

    'below' + 'stomach' -> 'abdomen'

    ka:l + kaTTu -> ka:lkaTTu

    'leg' + 'knot' -> 'marriage'

    Pronouns. Pronouns in Tamil are a closed set of words. They have person, number and gender (PNG)

    information in them.

    The following table summarizes the pronouns in Tamil

    INSERT Table 3. HERE

    Pronouns are part of nouns and hence behave like nouns. The above pronouns inflect by taking the Case suffix

    and other suffixes that a noun takes. Since number is inherent in the pronoun, it doesn't take an explicit Number

    suffix. The only exception is ivai which sometimes takes a redundant plural suffix kaL.

    2.2. Verbs. Verb forms can be broadly classified into two types.

    1. Finite verbs

    2. Non-finite verbs

    Finite verbs. The verb root takes the tense suffix first, followed by a fused PNG suffix. This can be followed by

    any of the clitics.

    The Tense can be Past/Present/Future if it is in the affirmative. The negative form does not take tense.

    The PNG Suffixes may be as in the table below.

    INSERT Table 4. HERE

    The Morphosyntax of finite verbs may be summarized as

    root + Tense + PNG + {DISJ/COOR/EMPH/EMP/INT/SUPP}

    root + INF + NEGVERB + {DISJ/COOR/EMPH/EMP/INT/SUPP }

    The following examples illustrate the above morphosyntactic rule.

    INSERT Table 5. HERE

    The verb root, after taking the tense suffix, may take the relative participle marker a instead of the PNG suffix to produce the relative participles.

    INSERT Table 6. HERE

    The above inflections can be summarized as

    root + Tense/NEG + RP

    Only derivations can happen at this point. One of the pronominal endings avan (3SM) / avaL (3SF) / avar (3SH) / atu (3SN) / avai (3PN) / avarkaL (3PE) may agglutinate to the relative participle, thus forming a noun. The word then behaves as a noun and takes noun inflections.

    Eg.

    paTi + tt + a + avan + o:Tu -> paTittavano:Tu

    read + PST + RP + 3SM + SOC -> with the one(MAS) who read

    Non-finite verbs. The verb root may directly take one of the following suffixes: Infinitive, Verbal Participle, Conditional, Concessive, Hortative, Optative. These forms may also have a negative suffix attaching to the root before these suffixes. Some of these forms take an auxiliary verb like iru/viTu to produce the negative form.

    Eg.

    pa:Tu + a -> pa:Ta

    'sing' + INF

    pa:Tu + a: + a -> pa:Ta:ta

    'sing' + NEG + INF

    pa:Tu + i -> pa:Ti

    'sing' + VBP

    pa:Tu + a: + u -> pa:Ta:tu

    'sing' + NEG + VBP

    pa:Tu + a:l -> pa:Tina:l

    'sing' + COND

    pa:Tu + a: +viTu+ a:l -> pa:Ta:viTTa:l

    'sing' + NEG + AUXV + COND

    pa:Tu + a:lum -> pa:Tina:lum

    'sing' + CONC

    pa:Tu + a: +viTu +a:lum -> pa:Ta:viTTa:lum

    'sing' + NEG + AUXV + CONC

    pa:Tu + ala:m -> pa:Tala:m

    'sing' + HORT

    pa:Tu + a: + iru + ala:m -> pa:Ta:tirukkala:m

    'sing' + NEG + AUXV + HORT

    pa:Tu + aTTum -> pa:TaTTum

    'sing' + OPT

    pa:Tu + a: + iru + aTTum -> pa:Ta:tirukkaTTum

    'sing' + NEG + AUXV + OPT

    The Morphosyntax of Non-finite Verbs may be summarized as

    root + {NEG} + INF/VBP/COND/CONC/HORT/OPT + {DISJ/COOR/EMPH} + {EMP} + {INT/SUPP}

    Derivation of nouns from verbs

    1. From the relative participle (RP) forms, nouns can be derived by the pronominalisation process, i.e., one of the pronominal suffixes avan, avaL, avar, avarkaL, atu, avai attaches to the RP form to produce a noun. This is very productive.

    2. The suffix -kai is added to some verbs to produce nouns.
    Eg. cey + kai -> ceykai
    'do' + SUFF -> 'act'

    3. The suffix -tal is added to several verbs to form the corresponding noun.
    Eg. paRa + tal -> paRattal
    'fly' + SUFF -> 'flying'
    makiz + tal -> makiztal
    'enjoy' + SUFF -> 'enjoying'
    vaLar + tal -> vaLartal
    'grow' + SUFF -> 'growing'

    4. The suffix -pu is added to some verbs to form the corresponding noun.
    Eg. vaLar + pu -> vaLarppu
    'grow' + SUFF -> 'bringing up'
    ninai + pu -> ninaippu
    'think' + SUFF -> 'thought'

    5. One of the suffix sequences -v-atu/-p-atu/-pp-atu is added to any verb to denote the action denoted by the verb.
    Eg. cey + v + atu -> ceyvatu
    'do' + FUT + SUFF -> 'doing'

    Derivation of adjectives from verbs

    1. Suffixes like -ataRka:na / -avaRRukka:na / ataRkuriya / atarke:RRa / takka are actually frozen forms of agglutinating words that can add to a verb root to form an adjective.

    For instance, ataRka:na is formed as
    atu + ku + a:na -> ataRka:na
    'that' + DAT + ADJ -> 'for that'

    Now this can attach to a verb:
    cey + ataRka:na -> ceyvataRka:na
    'do' + 'for that' -> 'for doing'

    Derivation of adverbs from verbs

    1. Suffixes like a:Rpo:la / ava:Ru / a:ka / ma:Ru / a:mal and certain postpositions like anRi add to verbs to form adverbs.
    Eg. paTi + tt + a:Rpo:la -> paTitta:Rpo:la
    'read' + PST + ADV -> 'as though ... was reading'
    paTi + ma:Ru -> paTikkuma:Ru
    'read' + ADV -> 'to read'
    paTi + a:mal -> paTikka:mal
    'read' + NEG -> 'without reading'

    2. enRu and ena are synonymous complementizers which add to reduplicating onomatopoeic forms to form adverbs.
    Eg. kalakala + enRu -> kalakalavenRu

    ONOM + ADV -> 'happily'

    kalakala + ena -> kalakalavena

    ONOM + ADV -> 'happily'

    2.3. Inflections of other categories. In Tamil, the other POS categories do not inflect, but they take the clitics that the nouns / verbs take.

    Hence for any other category apart from the ones discussed above, the morphosyntax is

    root + {DISJ/COOR/EMPH} + {EMP} + {INT/SUPP}

    Agglutination. Agglutination is a feature of the Tamil language. Due to the highly agglutinating nature of this

    language and the morphophonemic variations that take place at the point of agglutination, it is very difficult to

    mark the word boundaries.

    Eg. arapi + katal + in + araci -> arapikkatalinaraci

    'Arabian' + 'sea' + GEN + 'queen' -> 'Queen of the Arabian Sea'

    paTi + ttu + koL + NTu + iru + t + a + avan + ai -> paTittukkoNTirutavanai

    'read' + VBP + AUXV + VBP + AUXV + PST + RP + PRON-3SM + ACC -> 'the one(MAS) who was

    reading(OBJ)'

    3. Orthographic Rules in Tamil. "The ways in which the morphemes of a given language are variously represented by phonemic shapes can be regarded as a kind of code. This code is the orthographic system of the language." (Hockett 1958:135). It is also known as Internal Sandhi.

    The orthographic rules in Tamil given below were arrived at from the works of Pope (Pope 1979) and tamiz (Subramanian and Gnanasundaram 2001).

    1. When the root word ends in a vowel and the attaching suffix begins with any vowel, the glide v or y is added depending on the following rules.

    INSERT Table 7 HERE.

    2. When the root word ends in one of the long close vowels (i:/u:) and the attaching suffix/word begins with one of the stop consonants k/c/T/t/p/R, the stop consonant doubles at the end of the root.

    Eg. i: + kaL -> i:kkaL

    'fly' + PL -> 'flies'

    3. When the root word is of two syllables, with a short first syllable, and ends in u, and the attaching suffix/word begins with one of the stop consonants k/c/T/t/p/R, the stop consonant doubles at the end of the

    root. In all other cases of root word ending in u, and the attaching suffix/word begins with one of the stop

    consonants k/c/T/t/p/R, there is no change.

    Eg. maTu + kaL -> maTukkaL

    'hillock' + PL -> 'hillocks'

    4. When the root word ends in TTu or ttu and the suffix starts with k/c/T/t/p/R, there is no change.
    Eg. pa:TTu + kaL -> pa:TTukaL
    'song' + PL -> 'songs'

    5. When the root word ends in the labial nasal m, and the attaching suffix/word begins with one of the stop consonants k/c/T/t/p/R, the m is replaced by the homorganic nasal of the stop consonant.

    Eg. maram + kaL -> marakaL

    'tree' + PL -> 'trees'

    6. When the root word ends in the labial nasal m, and the attaching suffix begins with a vowel, the m is replaced by the oblique suffix tt.
    Eg. maram + ai -> marattai
    'tree' + ACC -> 'tree(OBJ)'

    7. When the root word has a short single syllable and ends with the nasal n, and the attaching suffix/word begins with a vowel, the n doubles.
    Eg. pon + iliruntu -> ponniliruntu
    'gold' + ABL -> 'from gold'

    8. When the root ends in the velar nasal, and the attaching suffix starts with a vowel, the homorganic stop consonant is added in between.
    Eg. manmo:hanci + a:l -> manmo:hancika:l
    'Manmohan Singh' + INS -> 'by Manmohan Singh'

    9. When the root word has a short single syllable ending in a glide (y/v), and the attaching suffix starts with a vowel, the glide doubles. Native words do not end in v.

    Eg. poy + ai -> poyyai

    'lie' + ACC -> 'lie(OBJ)'

    lav + a:l -> lavva:l

    'love' + INS -> 'by love'

    10. When the root word has a short single syllable and ends with a lateral (l/L ), or if the root word is

    another word with such a word at the end, and the attaching suffix/word begins with one of the stop

    consonants k/c/T/t/p/R, the lateral may be replaced with the homorganic stop consonant.

    Eg. kal + kaL -> kaRkaL

    'stone' + PL -> 'stones'

    cekal + kaL -> cekaRkaL

    'brick' + PL -> 'bricks'

    poruL + kaL -> poruTkaL

    'thing' + PL -> 'things'

    11. When the root word has a short single syllable and ends with a lateral (l/L), and the attaching suffix starts with a vowel, the lateral doubles.

    Eg. kaL + il -> kaLLil

    'toddy' + LOC -> 'in toddy'

    pal + il -> pallil

    'tooth' + LOC -> 'in the tooth'

    12. When the root ends in a stop consonant k/c/T/t/p/R/j, and the attaching suffix starts with a vowel, the consonant doubles.

    Eg. maik + ai -> maikkai

    'mike' + ACC -> 'mike(obj)'

    Tip + il -> Tippil

    'tip' + LOC -> 'in the tip'

    jeT + il -> jeTTil

    'jet' + LOC -> 'in the jet'

    ko:c + o:tu -> ko:cco:tu

    'coach' + SOC -> 'with the coach'

    pert + ukku -> perttukku

    'berth' + DAT -> 'to the berth'

    haj + ukku -> hajjukku

    'Haj' + DAT -> 'to Haj'

    But when the stop consonant is p, and it is preceded by the modifier H to denote the labiodental

    fricative, there is no doubling.

    Eg. vulHp + in -> vulHpin

    'wolf' + GEN -> 'wolf's'

    When the root is a loan-word ending in a stop consonant that is voiced in the source language and preceded by a long vowel, and the attaching suffix starts with a vowel, there is no change.

    Eg. la:lpa:k + il -> la:lpa:kil

    'Lalbagh' + LOC -> 'in Lalbagh'

    vik + ai -> vikkai

    'wig' + ACC -> 'wig(OBJ)'

    13. When the root ends in a stop (k/c/T/t/p/R), preceded by the homorganic nasal (//N//m/n), and the attaching suffix starts with a vowel, there is no change.

    Eg. vik + il -> vikil

    'wing' + LOC -> 'in the wing'

    pec + il -> pecil

    'bench' + LOC -> 'on the bench'

    pat + a:l -> pata:l

    'bandh' + INS -> 'due to bandh'

    14. In cases where the root ends in a sibilant (s/sh), preceded by a short vowel, the sibilant doubles.
    Eg. push + in -> pushshin

    'Bush' + GEN -> 'Bush's'

    pas + il -> passil

    'bus' + LOC -> 'on the bus'

    In other cases where the ending sibilant is not preceded by a short vowel, the suffix can attach without

    any change. Sometimes, we observe that the s is replaced with c.

    Eg. pars + il -> parcil /parsil

    'purse' + LOC -> 'in the purse'

    When the attaching suffix starts with a consonant, there may be no change or the s may change to cu.

    Eg. ke:s + kaL -> ke:skaL/ke:cukaL

    'case' + PL -> 'cases'

    For some of the rules above, the behaviour differs depending on whether the final consonant of a loan-word is voiced or voiceless in the source language of the loan-word. It is not directly possible to encode this information in the rules. Hence the default rule for the particular consonant ending is applied.
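A few of the rules above are simple enough to express directly as string operations. The following is a minimal Python sketch of rules 7, 9, 11 and 12 only (consonant doubling before a vowel-initial suffix), written against the romanised examples in this section. The function name attach and the regular expression for a "short single syllable" are assumptions made for illustration, and rule 13, the loan-word voicing exceptions and all consonant-initial suffixes are deliberately left out.

```python
import re

VOWELS = "aeiou"
STOPS = "kcTtpRj"  # stop consonants listed in rule 12

# Optional consonants, one short vowel (no ':' length mark), then a final
# n, glide (y/v) or lateral (l/L): the roots covered by rules 7, 9 and 11.
SHORT_MONOSYLLABLE = re.compile(r"^[^aeiou:]*[aeiou][nyvlL]$")


def attach(root, suffix):
    """Join root and suffix, doubling the final consonant where rules 7, 9,
    11 or 12 ask for it. Anything else is concatenated unchanged."""
    if not (suffix and suffix[0] in VOWELS):
        return root + suffix  # consonant-initial suffixes are not modelled
    last = root[-1]
    if SHORT_MONOSYLLABLE.match(root):  # rules 7, 9, 11
        return root + last + suffix
    if last in STOPS:  # rule 12
        if last == "p" and root[-2:-1] == "H":
            return root + suffix  # Hp marks the labiodental fricative
        return root + last + suffix
    return root + suffix


print(attach("pon", "iliruntu"))  # rule 7
print(attach("kaL", "il"))        # rule 11
print(attach("maik", "ai"))       # rule 12
print(attach("vulHp", "in"))      # rule 12 exception
```

Because the sketch applies the default doubling to every final stop, a root such as la:lpa:k would also be doubled, which is the "default rule" behaviour described in the preceding paragraph rather than the exception listed under rule 12.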

    4. Our Approach. In this approach, we build an FSA using all possible suffixes, categorize the root word lexicon based on the paradigm approach to optimize the number of orthographic rules, and use morphosyntax rules to get the correct analysis for the given word. An FSA is used because the analysis of the word is done suffix by suffix. FSA are a proven technology for efficient and speedy processing.

    When applying the formalism of two-level morphology to morphologically rich languages, there are some well-known limitations, such as

    1, developing Finite State Transducers that encode very complex two-level rules is not easy.

    2, morphological categories are not directly encoded as a part of the lexical form.

    3, lexical representation tends to be arbitrary.

    4, the various diacritical features inserted into the lexical strings to ensure proper analysis make the Kimmo style awkward or impractical for generation (Beesley 1996).

    In our approach the complex affixations are easily handled by the FSA, and the required orthographic changes are handled in every state of the FSA.

    Our MA consists of three major components:

    1, a Finite State Automaton, modeled using all possible suffixes (allomorphs).

    2, a lexicon, categorized based on the paradigm approach.

    3, morphosyntax rules, to filter the correct parses of the word.

    4.1. Finite State Automata (FSA). An FSA is a model of behaviour composed of a finite number of states and transitions between these states. It is an abstract device used for recognizing simple syntactic structures or patterns. An automaton is normally depicted by a directed graph, called a State Diagram, and it can also be represented in tabular form as a State Table. As a string processing device, an FSA accepts strings as input and decides whether the structure is correct, that is, it either accepts or rejects the string. From a mathematical perspective it is regarded as a function mapping a set of strings to the set {Accept, Reject}. Based on its transitions, an FSA is classified as a non-deterministic FSA (NDFSA) or a deterministic FSA (DFSA).

    The requirements of a DFSA are

    1, there are no transitions involving the empty string ε (no null transitions).

    2, no state has two outgoing transitions based on the same symbol.
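The two requirements can be checked mechanically on a transition list. The following is a small Python sketch under the assumption that each transition is a (source state, symbol, target state) triple and that the empty string stands for a null transition; the function name is_deterministic is hypothetical.

```python
EPSILON = ""  # assumed encoding of a null transition


def is_deterministic(transitions):
    """Check the two DFSA requirements on (source, symbol, target) triples."""
    seen = set()
    for source, symbol, _target in transitions:
        if symbol == EPSILON:
            return False  # requirement 1: no null transitions
        if (source, symbol) in seen:
            return False  # requirement 2: one outgoing arc per symbol
        seen.add((source, symbol))
    return True


print(is_deterministic([(0, "atu", 0), (0, "kaL", 1)]))
print(is_deterministic([(0, "kaL", 0), (0, "kaL", 1)]))
```

Note that the sketch also rejects a triple that is listed twice, since it cannot tell a duplicate entry from a genuine second arc on the same symbol.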

    Modeling of Suffix based FSA. The FSA is modeled using all possible suffixes, i.e. all the allomorphs, where allomorphs are the different morphs through which a single morpheme is manifested in different environments. Eg: th, nth, in, i are the allomorphs of the past tense marker.

    Here the FSA is built by considering the suffixes from right to left, i.e. moving from the end of the word towards the root word. Our implementation differs from other finite-state MAs in that the suffixes themselves are the symbols that trigger the transitions. After determinisation, the DFSA reduces to two states. The suffixes that attach directly to the root word trigger the transition from state 0 to state 1, and the other suffixes, which attach after the first suffix, form a self-loop at state 0. A sample State Table is shown in Table 8 and a sample State Diagram is shown in Figure 1.

    INSERT Table 8. HERE

    INSERT Figure 1. HERE

    The word is parsed in the FSA by identifying suffix by suffix, from the last suffix to the first suffix. Whenever

    the transition is triggered by the suffix, that suffix is stripped from the word and required orthographic

    corrections are done.

    Orthographic Rules in FSA. Orthographic rules are the spelling rules used to model the changes that occur in a word, usually when morphemes are combined (Jurafsky 2000). The characters that are deleted from the root word or the suffix when a suffix (allomorph) is affixed are stored after the suffix in the state table. An example is given below.

    0 0 atu a

    Consider the word makanuTaiyatu, in this word there are two suffixes uTaiya and atu. When the word

    is parsed in the FSA, the last suffix atu is first identified. It triggers a transition to the same state and in the

    current word this suffix is stripped and the orthographic correction character a is added. Thus the remaining

    word makanuTaiya is further parsed.
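The stripping step in this example amounts to removing the suffix from the end of the word and appending the stored correction characters. A minimal Python sketch, where the function name strip_suffix is hypothetical and the arguments are the suffix and correction fields of the state-table entry "0 0 atu a":

```python
def strip_suffix(word, suffix, correction):
    """Remove `suffix` from the end of `word` and append the orthographic
    correction characters stored with that suffix in the state table."""
    if not word.endswith(suffix):
        raise ValueError(f"{word!r} does not end in {suffix!r}")
    return word[: len(word) - len(suffix)] + correction


# Entry "0 0 atu a": strip atu, then restore the deleted character a.
remaining = strip_suffix("makanuTaiyatu", "atu", "a")
print(remaining)  # makanuTaiya
```

The remaining word is then fed back into the same step for the next suffix (uTaiya in this example), whose own correction characters are not given in the text.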

    Root Information in FSA. For the suffixes that attach directly to the root word, i.e. those leading to the end state, the state table also stores the category of the root after the orthographic correction characters. A sample state-table entry is given below.

    0 1 kaL m N13

    Consider the word marakaL, where kaL is the suffix added to the root word. When this word is parsed through the FSA, the suffix kaL triggers the transition from state 0 to state 1; the suffix is stripped from the current word and the orthographic correction is made. The remaining word maram is then looked up in the given category (N13) of the root word lexicon. If it matches an entry in the root word lexicon, this parse is considered a valid parse for the input word.
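The two sample entries shown in this section ("0 0 atu a" and "0 1 kaL m N13") are enough to sketch the backward parse with a root lookup. The following Python sketch is an illustration under assumptions, not the authors' implementation: the field names, the tiny lexicon and the function analyse are hypothetical, and the word forms are copied exactly as they appear in the text.

```python
from typing import NamedTuple, Optional


class Transition(NamedTuple):
    source: int
    target: int
    suffix: str
    correction: str               # characters restored after stripping
    root_category: Optional[str]  # only set on arcs into the end state 1


# The two sample state-table entries quoted in the text.
STATE_TABLE = [
    Transition(0, 0, "atu", "a", None),
    Transition(0, 1, "kaL", "m", "N13"),
]

# Hypothetical root lexicon, keyed by paradigm category.
LEXICON = {"N13": {"maram"}}


def analyse(word):
    """Return every (root, category, suffixes) parse found by stripping
    suffixes from the end of the word towards the root."""
    parses = []

    def walk(rest, suffixes):
        for t in STATE_TABLE:
            if len(rest) <= len(t.suffix) or not rest.endswith(t.suffix):
                continue
            stem = rest[: len(rest) - len(t.suffix)] + t.correction
            found = [t.suffix] + suffixes
            if t.target == 1:
                # End state: the stem must be a root of the stored category.
                if stem in LEXICON.get(t.root_category, set()):
                    parses.append((stem, t.root_category, found))
            else:
                walk(stem, found)  # self-loop at state 0

    walk(word, [])
    return parses


print(analyse("marakaL"))
```

Collecting all matching parses, rather than stopping at the first one, follows the statement in the introduction that the MA returns every possible parse of a word.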

    4.2. Lexicon paradigm based approach. In the paradigm approach, we group the root words into different groups, where every word in a group undergoes similar orthographic changes (sandhi changes) when a suffix is added to it.

    Consider the words paTam and varam. When these two words are inflected with the plural marker kaL, the last character m is deleted in both words and kaL is added to form paTakaL and varakaL. As these two words show the same orthographic changes, they are grouped under the same paradigm.
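One way to picture this grouping is to compute, for each root and its inflected form, which characters are deleted from the root and which are added, and to put roots with the same signature together. The following Python sketch does this for the plural only, using word pairs copied as they appear in this paper; the function names are hypothetical, and a real paradigm would compare the behaviour across all suffixes rather than one.

```python
import os
from collections import defaultdict


def change_signature(root, inflected):
    """Return (deleted from root, added) relative to the shared prefix."""
    prefix = os.path.commonprefix([root, inflected])
    return root[len(prefix):], inflected[len(prefix):]


def group_by_paradigm(pairs):
    """Group roots whose inflected forms show the same change."""
    groups = defaultdict(list)
    for root, inflected in pairs:
        groups[change_signature(root, inflected)].append(root)
    return dict(groups)


PLURALS = [
    ("paTam", "paTakaL"),
    ("varam", "varakaL"),
    ("pa:TTu", "pa:TTukaL"),
    ("maTu", "maTukkaL"),
]

for signature, roots in group_by_paradigm(PLURALS).items():
    print(signature, roots)
```

Here paTam and varam share the signature ("m", "kaL") and land in one group, while pa:TTu and maTu each form their own.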

    In our task, we have categorized nouns into 36 paradigms and verbs into 34 paradigms. The lexicon has 44055 root words.

    Apart from the root word lexicon, a suffix list containing the suffixes and their corresponding syntactic information is used, as the MA has to assign the correct morphosyntactic information to the component morphemes.

    4.3. Morphosyntax Rules. These are a set of rules that specify which classes of morphemes can follow other classes of morphemes inside a word. For example, the plural marker can occur only immediately after the noun root word, and it can be followed by a case marker or a clitic. This set of rules selects the correct parses of the word from those produced by the FSA. We have 286 such rules.
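Rules of this kind can be pictured as a table of which morpheme class may follow which. The following Python sketch encodes the plural-marker example from the paragraph above and uses it to filter candidate parses. The class names and the table contents beyond that one example are assumptions for illustration and are not taken from the 286 rules.

```python
# Which morpheme classes may directly follow a given class.
# Only the PL row reflects the example in the text; the rest is assumed.
ALLOWED_NEXT = {
    "N_ROOT": {"PL", "CASE", "CLITIC"},
    "PL": {"CASE", "CLITIC"},
    "CASE": {"CLITIC"},
    "CLITIC": set(),
}


def is_valid(classes):
    """A parse is valid if it starts with a noun root and every adjacent
    pair of morpheme classes is licensed by ALLOWED_NEXT."""
    if not classes or classes[0] != "N_ROOT":
        return False
    return all(
        nxt in ALLOWED_NEXT.get(cur, set())
        for cur, nxt in zip(classes, classes[1:])
    )


def filter_parses(parses):
    """Keep only the candidate parses that satisfy the adjacency rules."""
    return [p for p in parses if is_valid(p)]


candidates = [
    ["N_ROOT", "PL", "CASE"],
    ["N_ROOT", "CASE", "PL"],  # plural not directly after the root
]
print(filter_parses(candidates))
```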

    Handling of Compound Words. In morphologically rich and productive languages like Tamil, the occurrence of compound words is high. In compound words, only the last word of the compound is inflected. We have handled this as follows.

    Step 1: Parse the suffixes from the last suffix to the first suffix in the word, and check for the root word in the given category in the FSA.

    Step 2: If the root word is not matched, go to step 3.

    Step 3: Split the root word based on syllables and check the parts against the root dictionary.

    Step 4: Once a word is matched, split the remaining part of the word similarly and compare it with the root dictionary.

    Step 5: If the complete root word is matched as a sequence of different root words in the dictionary, give these multiple words as the root, together with the suffix information, as the analysis.

    Step 6: If the complete root is not matched even after splitting into multiple words, give the analysis as unknown word.
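Steps 3 to 6 describe a dictionary-driven segmentation of the unmatched remainder. The following Python sketch illustrates the idea under a simplifying assumption: it tries every character position as a split point instead of syllable boundaries, and it ignores sandhi changes at the join. The function name split_compound and the small root set are hypothetical.

```python
def split_compound(stem, roots):
    """Split `stem` into a sequence of dictionary roots.

    Returns the list of roots, or None when no complete segmentation
    exists (step 6: the word is reported as unknown).
    """
    if stem == "":
        return []
    for i in range(1, len(stem) + 1):
        head = stem[:i]
        if head in roots:  # steps 3 and 4: match, then split the rest
            rest = split_compound(stem[i:], roots)
            if rest is not None:
                return [head] + rest  # step 5: complete match
    return None


ROOTS = {"maN", "me:Tu", "iravu", "pakal"}

print(split_compound("maNme:Tu", ROOTS))
print(split_compound("iravupakal", ROOTS))
print(split_compound("xyz", ROOTS))
```

Compounds whose parts change at the boundary are not covered by this sketch and would need the orthographic corrections applied at each candidate split.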

    Another form is an inflected verb agglutinated with a pronoun, which can itself be inflected, such as

    vantavan -> va: + nt + a + avan

    come+root past RP pronoun

    Here the relative participle verb vanta is agglutinated with avan, a pronoun. We have handled this by

    having a separate rule in the morphosyntax rule file.

    A similar case is the agglutination of an inflected verb with the negative verb illai: illai agglutinates with the

    infinitive form of the verb to form one word, such as

    varavillai -> va: + a + illai

    come+root inf negative verb

    This is handled in the same way as the previous example, by adding a separate rule.

    5. Evaluation. We have evaluated the system with two sets of web data: the first set consists of words collected

    from the general domain and the second of words collected from the tourism domain. The details of the evaluation are

    shown in Table 9.

    INSERT Table 9. HERE

    The tourism documents have more compound words and more agglutination of words. In this domain,

    there are more named entities, such as person names, place names and area-specific words. The sentences

    commonly end with a:kum, a copula verb. This verb is agglutinated to the preceding noun phrase, such as


    u:ra:kum -> u:r + a:kum

    place + copula verb

    Similarly there are more compound nouns, such as

    maNme:TukaLuTaiya -> maN+me:Tu + kaL + uTaiya

    sand dune pl genitive

    Compound root suffix

    Consider the word maNme:TukaLuTaiya, in which kaL and uTaiya are the two suffixes. After removing the

    suffixes, the remainder is maNme:Tu, which does not match any entry in the root word dictionary. The word is split and

    compared with the root word list, and maN and me:Tu are found to be the two root words forming maNme:Tu. Similarly

    iravupakalai -> iravu+pakal + ai

    night+day accusative

    Compound root Suffix

    teyvacceyalil -> teyvam+ceyal + il

    Compound root Suffix

    Periodic updating of the root word lexicon will help in improving the performance of the system.

    6. Conclusion. This paper describes the design and development of a Tamil morphological analyser using Finite State Automata and the paradigm approach. The complex suffixation is handled effectively using the FSA. The

    system performs with an average correctness of 91.70%.


    References

    Beesley, Kenneth R. 1996. Arabic Finite-State Morphological Analysis and Generation. Proceedings of the 16th

    International Conference on Computational Linguistics, Vol. 1. Copenhagen, Denmark. 89-94.

    Elwell, Robert and Jason Baldridge. 2008. Using Syllables as Features in Morpheme Tagging in Swahili.

    Proceedings of the Fifth Midwest Computational Linguistics Colloquium, East Lansing.

    Itziar Aduriz, Eneko Agirre, Izaskun Aldezabal, Inaki Alegria, Xabier Arregi, Jose Maria Arriola, Xabier

    Artola, Koldo Gojenola Galletebeitia, Montse Maritxalar, Kepa Sarasola, Miriam Urkia. 2000. A word-grammar

    based morphological analyzer for agglutinative languages. In Proceedings of COLING'2000. 1-7.

    Jäppinen, H., Lehtola, A., Nelimarkka, E. and Ylilammi, M. 1983. Morphological Analysis of Finnish: A

    Heuristic Approach. Report B26, Helsinki University of Technology, Digital Systems Laboratory, Helsinki,

    Finland.

    Jurafsky, Daniel and James H. Martin. 2000. Speech and Language processing. Prentice Hall.

    Hockett, Charles F. 1958. A course in modern linguistics. New York: Macmillan.

    Girish Nath Jha, Muktanand Agarwal, Subash, Sudhir K Mishra, Diwakar Mishra, Manji Bhadra, Surjit K

    Singh. 2007. Inflectional Morphology for Sanskrit. In Proceedings of the First

    International Symposium on Sanskrit Computational Linguistics. 46-77.

    Koskenniemi, Kimmo. 1983. Two-Level Morphology: A General Computational Model for Word-Form

    Recognition and Production. Publication No. 11. Helsinki: Department of General Linguistics, University of

    Helsinki.

    Lehmann, T. 1989. A Grammar of Modern Tamil, Pondicherry: Pondicherry Institute of Linguistics and Culture.


    Megerdoomian, Karine. 2004. Finite-State morphological analysis of Persian. In Proceedings of the Workshop

    on Computational Approaches to Arabic Script-based Languages. Coling 2004, University of Geneva,

    Switzerland.

    Mohanty, S., Santi, P.K., Adhikary, K.P.D. 2004. Analysis and Design of Oriya Morphological Analyser: Some

    Tests with OriNet. In Proceeding of symposium on Indian Morphology, phonology and Language Engineering,

    IIT Kharagpur.

    Pope, G. U. 1904. A handbook of the Tamil language. 7th ed. New Delhi, First published Oxford. Asian

    Educational Services, 1979.

    Sajib Dasgupta, Vincent Ng. 2007. Unsupervised morphological parsing of Bengali. Language Resources and

    Evaluation 40:3-4, pp 311-330

    Sajib Dasgupta, Dewan Shahriar Hossain Pavel, Asif Iqbal Sarkar, Naira Khan and Mumit Khan., 2005.

    Morphological Analysis of Inflecting Compound Words in Bangla, Proc. 8th International Conference on

    Computer & Information Technology (ICCIT), Islamic University of Technology (IUT), Dhaka, Bangladesh.

    Schulze, B. M. et al. 1994. DECIDE Designing and Evaluating Extraction Tools for Collocations in Dictionaries

    and Corpora. MLAP Project 93- 19.

    Viswanathan, S., Ramesh Kumar, S., Kumara Shanmugam, B., Arulmozi, S. and Vijay Shanker, K. (2003). A

    Tamil Morphological Analyser, Proceedings of the International Conference On Natural language processing

    ICON 2003, Central Institute of Indian Languages, Mysore, India, pp. 31-39.

    Yona, S. and Wintner, S. 2005. A finite-state morphological grammar of Hebrew. In Proceedings of the ACL-

    2005 Workshop on Computational Approaches to Semitic Languages, Ann Arbor.


    Table 1. Tamil Case System.

    Case           Case Suffix
    Nominative
    Accusative     ai
    Dative         ku
    Instrumental   a:l
    Sociative      o:Tu/uTan
    Locative       il/iTam
    Ablative       iliruntu
    Genitive       in/atu/uTaiya


    Table 2. Inflections of a noun.

    Root            Number   Case           Postposition  Clitic     Word
    paiyan 'boy'    SG       NOM                                     paiyan 'boy'
    paiyan 'boy'    SG       ai ACC                                  paiyanai 'boy(OBJ)'
    paiyan 'boy'    SG       ku DAT                                  paiyanukku 'to the boy'
    paiyan 'boy'    SG       a:l INS                                 paiyana:l 'by the boy'
    paiyan 'boy'    SG       o:Tu SOC                                paiyano:Tu 'with the boy'
    pai 'bag'       SG       il LOC                                  paiyil 'in the bag'
    pai 'bag'       SG       iliruntu ABL                            paiyiliruntu 'from the bag'
    pai 'bag'       SG       in GEN                                  paiyin 'bag's'
    paiyan 'boy'    SG       ai ACC                       e: EMPH    paiyanaiye: 'the boy(OBJ) himself'
    paiyan 'boy'    SG       ai ACC                       a: INT     paiyanaiya: 'the boy(OBJ)?'
    paiyan 'boy'    SG       ukku DAT                     ta:n EMPH  paiyanukkuta:n 'it is for the boy'
    paiyan 'boy'    SG       ukku DAT       ku:Ta PSP     ta:n EMPH  paiyanukkukku:Tata:n 'it is also for the boy'
    paiyan 'boy'    kaL PL   NOM                                     paiyankaL 'boys'
    paiyan 'boy'    kaL PL   ai ACC                                  paiyankaLai 'boys(OBJ)'
    paiyan 'boy'    kaL PL   ku DAT                                  paiyankaLukku 'to the boys'
    paiyan 'boy'    kaL PL   a:l INS                                 paiyankaLa:l 'by the boys'
    paiyan 'boy'    kaL PL   o:Tu SOC                                paiyankaLo:Tu 'with the boys'
    pai 'bag'       kaL PL   il LOC                                  paikaLil 'in the bags'
    pai 'bag'       kaL PL   iliruntu ABL                            paikaLiliruntu 'from the bags'
    pai 'bag'       kaL PL   in GEN                                  paikaLin 'of the bags'
    paiyan 'boy'    kaL PL   ai ACC                       e: EMPH    paiyankaLaiye: 'the boys(OBJ) themselves'
    paiyan 'boy'    kaL PL   ai ACC                       a: INT     paiyankaLaiya: 'the boys(OBJ)?'
    paiyan 'boy'    kaL PL   ukku DAT                     ta:n EMPH  paiyankaLukkuta:n 'it is for the boys'
    paiyan 'boy'    kaL PL   ai ACC         viTa PSP      a: INT     paiyankaLaiviTava: 'than boys(OBJ)?'


    Table 3. Pronouns in Tamil.

                     Non-neuter                                                  Neuter
    Person           Singular      Plural                     Honorific          Singular     Plural
    First Person     na:n 'I'      na:m 'We' (inclusive)                         na:n 'I'     na:m 'We' (inclusive)
                                   na:ngkaL 'We' (exclusive)                                  na:ngkaL 'We' (exclusive)
    Second Person    ni: 'You'     ni:ngkaL/ni:vir 'You'      ni:ngkaL 'You'     ni: 'You'    ni:ngkaL 'You'
    Third Person     avan 'He'     avarkaL 'They'             avar 'He/She'      atu 'It'     avai 'Those'
                     avaL 'She'


    Table 4. PNG in Tamil

    Person   Number               Gender               PNG Suffix
    First    Singular             Masculine/Feminine   -e:n
    First    Plural               Masculine/Feminine   -o:m
    Second   Singular             Masculine/Feminine   -a:y
    Second   Plural               Masculine/Feminine   -i:rkaL
    Second   Singular Honorific   Masculine/Feminine   -i:r
    Third    Singular             Masculine            -a:n
    Third    Singular             Feminine             -a:L
    Third    Plural               Masculine/Feminine   -a:rkaL
    Third    Singular Honorific   Masculine/Feminine   -a:r
    Third    Singular             Neuter               -atu
    Third    Plural               Neuter               -ana


    Table 5. Inflections of verbs.

    Root          Tense/Inf+NEG            PNG       Clitics   Example
    paTi 'read'   tt PST                   a:n 3SM             paTitta:n '(He) read'
    paTi 'read'   kkiR PRE                 a:L 3SF             paTikkiRa:L '(She) is reading'
    paTi 'read'   um FUT                   3SN                 paTikkum '(It) will read'
    paTi 'read'   tt PST                   a:n 3SM   a: INT    paTitta:na:? 'Did (he) read?'
    paTi 'read'   a + illai INF+NEGVERB              a: INT    paTikkavillaiya:? 'Did not read?'


    Table 6. Relative participle formation

    Root   Tense   Relative Participle marker   Form
    paTi   tt      a                            paTitta
    paTi   kkiR    a                            paTikkiRa
    paTi   um                                   paTikkum
    paTi   a:      a                            paTikka:ta


    Table 7. Glides that a word ending in a vowel takes.

    Ending Vowel            Glide   Example
    Mid Open short a        v       Native root words do not end in a
    Mid Open long a:        v       ci:ta: + ai -> ci:ta:vai
                                    'Sita' + ACC -> 'Sita(obj)'
    Front Close short i     y       puli + uTan -> puliyuTan
                                    'tiger' + SOC -> 'with the tiger'
    Front Close long i:     y       ti: + a:l -> ti:ya:l
                                    'fire' + INS -> 'due to fire'
    Back Close short u      v       e:cu + ai -> e:cuvai
                                    'Jesus' + ACC -> 'Jesus(obj)'
    Back Close long u:      v       pu: + in -> pu:vin
                                    'flower' + GEN -> 'flower's'
    Front Mid short e       y       Native root words do not end in e
                                    puNe + il -> puNeyil
                                    'Pune' + LOC -> 'in Pune'
    Front Mid long e:       y       Native root words do not end in e:
                                    me: + il -> me:yil
                                    'May' + LOC -> 'in May'
    Back Mid short o        v       Native root words do not end in o
    Back Mid long o:        v       Native root words do not end in o:
                                    a:TTo: + in -> a:TTo:vin
                                    'auto' + GEN -> 'auto's'
    Diphthong ai            y       a:cai + a:l -> a:caiya:l
                                    'desire' + INS -> 'due to desire'
    Diphthong au            v       Native root words do not end in au
                                    laknau + il -> laknauvil
                                    'Lucknow' + LOC -> 'in Lucknow'
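
    The glide rule of Table 7 can be sketched as a small Python function. This is only an illustration of the rule as tabulated: the function name and the list layout are assumptions, and other sandhi changes (deletion, doubling and so on) are deliberately left out.

```python
# Word-final vowel -> glide, following Table 7. The diphthongs are listed
# first so that "ai" and "au" are tested before the single vowels.
GLIDES = [
    ("ai", "y"), ("au", "v"),
    ("a:", "v"), ("i:", "y"), ("u:", "v"), ("e:", "y"), ("o:", "v"),
    ("a", "v"), ("i", "y"), ("u", "v"), ("e", "y"), ("o", "v"),
]
VOWEL_STARTS = ("a", "i", "u", "e", "o")


def join_with_glide(root, suffix):
    """Join root and suffix, inserting a glide between two vowels."""
    if suffix.startswith(VOWEL_STARTS):
        for ending, glide in GLIDES:
            if root.endswith(ending):
                return root + glide + suffix
    # Consonant-final root or consonant-initial suffix: no glide is inserted.
    return root + suffix


print(join_with_glide("ci:ta:", "ai"))
print(join_with_glide("puli", "uTan"))
print(join_with_glide("laknau", "il"))
```

    An analyser works in the opposite direction: when a suffix has been stripped and the remainder ends in y or v, the glide is removed before the root is looked up in the lexicon.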


    Table 8. Sample of the State Table

    Current State   Next State   Symbol
    0               0            ai
    0               0            utaiya
    0               1            kal
    0               1            ai
    0               1            utaiya
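
    To show how a state table of this kind can drive suffix stripping from the end of a word, here is a minimal Python sketch. The transition entries, the state numbering and the two-word root list are hypothetical and chosen only for the illustration; they are not the entries of Table 8 or of the system's state table.

```python
# Hypothetical state table: (current state, symbol) -> next state.
# State 0 is the start state at the end of the word; suffixes are read
# from right to left, so a case marker is seen before the plural marker.
TRANSITIONS = {
    (0, "ai"): 1,
    (0, "uTaiya"): 1,
    (0, "kaL"): 2,
    (1, "kaL"): 2,
}
ROOTS = {"paiyan", "pai"}   # toy root lexicon (assumption)


def analyse(word):
    """Return every (root, suffixes) analysis reachable through the table."""
    results = []

    def walk(rest, state, suffixes):
        # An analysis is complete when what remains is a known root.
        if rest in ROOTS:
            results.append((rest, suffixes))
        # Only the current state decides which suffixes may be read next.
        for (src, symbol), dst in TRANSITIONS.items():
            if src == state and rest.endswith(symbol) and len(rest) > len(symbol):
                walk(rest[: -len(symbol)], dst, [symbol] + suffixes)

    walk(word, 0, [])
    return results


print(analyse("paiyankaLai"))
print(analyse("paiyanaikaL"))
```

    The first word is analysed as paiyan + kaL + ai, while the second is rejected because this toy table has no transition that reads a case marker after the plural marker has been consumed.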


    Figure 1. Sample State Diagram


    Table 9. Evaluation of Morphological analyser

    Types                                                              General Domain   Tourism Domain
    Total number of words                                              50,000           50,000
    Analysed words                                                     46620            45085
    Error due to missing morphosyntax rules and state table entries   223              344
    Error due to agglutination                                         485              531
    Error due to missing root word                                     1345             1987
    Input error                                                        1327             2053
    Correctness of analysis                                            93.24%           90.17%
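
    The correctness figures follow directly from the counts in Table 9, assuming that correctness is the share of analysed words in the total, which is the reading that reproduces the reported percentages. The short Python check below uses only the numbers from the table; the dictionary layout and function name are illustrative.

```python
# Counts copied from Table 9. The error list is ordered as in the table:
# missing rules/state entries, agglutination, missing root word, input error.
RESULTS = {
    "General": {"total": 50000, "analysed": 46620, "errors": [223, 485, 1345, 1327]},
    "Tourism": {"total": 50000, "analysed": 45085, "errors": [344, 531, 1987, 2053]},
}


def correctness(row):
    """Percentage of words that received an analysis."""
    return 100 * row["analysed"] / row["total"]


for domain, row in RESULTS.items():
    unanalysed = row["total"] - row["analysed"]
    # The four error categories account for every unanalysed word.
    assert sum(row["errors"]) == unanalysed
    print(domain, round(correctness(row), 2), unanalysed)

average = sum(correctness(r) for r in RESULTS.values()) / len(RESULTS)
print("Average", average)
```

    The two domains give 93.24% and 90.17%, and their mean is 91.705%, which corresponds to the overall figure of 91.70% reported for the system.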


    Table 10. Linguistic abbreviations.

    Abbreviation   Full Form
    3PE            3rd person Plural Epicene
    3PN            3rd person Plural Neuter
    3SF            3rd person Singular Feminine
    3SH            3rd person Singular Honorific
    3SM            3rd person Singular Masculine
    3SN            3rd person Singular Neuter
    ABL            Ablative
    ACC            Accusative
    ADJ            Adjective Suffix
    ADV            Adverb Suffix
    AUXV           Auxiliary Verb
    CAUS           Causative
    COND           Conditional
    CONC           Concessive
    COOR           Coordination Clitic
    DAT            Dative
    DISJ           Disjunction Clitic
    EMPH           Emphatic Clitic
    EMP            Emphatic Suffix
    FEM            Feminine
    FUT            Future Tense
    GEN            Genitive
    HORT           Hortative
    INF            Infinitive
    INS            Instrumental
    INT            Interrogative
    LOC            Locative
    MAS            Masculine
    NEG            Negative Suffix
    NEGVERB        Negative Verb
    OBJ            Object
    ONOM           Onomatopoeic form
    OPT            Optative
    PL             Plural
    PRON-3SM       Pronominal - 3rd person Singular Masculine
    PRE            Present Tense
    PSP            Postposition
    PST            Past Tense
    RP             Relative Participle
    SOC            Sociative
    SUFF           Suffix
    SUPP           Supposition marker

    i According to Schiffman (Schiffman 1994), "Thus the usual treatment of Tamil case (Arden 1942) is one where there are seven cases--the nominative (first case), accusative (second case), instrumental (third), dative (fourth), ablative (fifth), genitive (sixth), and locative (seventh). The vocative is sometimes given a place in the case system as an eighth case, although vocative forms do not participate in usual morphophonemic alternations, nor do they govern the use of any postpositions."