CHAPTER 5
PART OF SPEECH TAGGER FOR KANNADA
Parts-of-speech (POS) tagging is a well-understood problem in NLP. Its importance stems from the fact that POS tagging is one of the first stages in many natural language processing pipelines. POS tagging is the process of assigning a part-of-speech tag or other lexical class marker to each word in a sentence, and it plays a crucial role in different fields of NLP, including MT. In linguistics, parts-of-speech tagging, also termed grammatical tagging or word-category disambiguation, is the process of marking up the words in a text or corpus as corresponding to a particular part of speech, based on both their definition and their context, that is, their relationship with adjacent and related words in a phrase, sentence, or paragraph. In other words, it can also be defined as the automatic annotation of a syntactic category for each word in a corpus. It is similar to the process of tokenization for computer languages. A part of speech is a grammatical category, commonly including verbs, nouns, adjectives, adverbs, determiners, and so on.
For English, there are many POS taggers employing machine learning techniques such as HMMs (Brants, 2000), transformation-based error-driven learning (Brill, 1995), decision trees (Black, 1992), maximum entropy methods (Ratnaparkhi, 1996), conditional random fields (Lafferty et al., 2001), support vector machines (Kudoh et al., 2001), etc. POS taggers are broadly classified into two categories: rule based and stochastic [155]. In the rule-based approach, hand-written rules are used to resolve tag ambiguity. Stochastic taggers are either HMM based, choosing the tag sequence which maximizes the product of word likelihood and tag sequence probability, or cue based, using decision trees or maximum entropy models to combine probabilistic features. The performance of a POS tagging model depends heavily on the corpus with which it is trained. The relative failure of rule-based approaches, the increasing availability of machine-readable text, and the increase in hardware capability along with its decrease in cost are some of the reasons researchers prefer corpus-based POS tagging.
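The HMM-style scoring described above can be sketched in a few lines. The probabilities below are toy values chosen for illustration only, not estimates from any corpus:

```python
# Toy emission (word likelihood) and transition (tag sequence) probabilities.
# These values are illustrative assumptions, not corpus estimates.
emission = {("ondu", "CRD"): 0.9, ("ondu", "NN"): 0.1}
transition = {("<s>", "CRD"): 0.3, ("<s>", "NN"): 0.5}

def hmm_score(words, tags):
    """Product of word likelihood and tag-sequence probability."""
    p, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        p *= emission.get((w, t), 0.0) * transition.get((prev, t), 0.0)
        prev = t
    return p

# The stochastic tagger chooses the tag sequence with the highest score.
best = max([("CRD",), ("NN",)], key=lambda ts: hmm_score(["ondu"], ts))
```

In practice the maximization is done over exponentially many sequences with dynamic programming rather than by enumeration.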
Some languages have a richer morphology than others, requiring the tagger to take into account a bigger set of feature patterns. The tagset size and ambiguity rate may also vary from language to language. Besides, if little data is available for training, the proportion of unknown words may be huge. Sometimes, morphological analyzers can be utilized to reduce the degree of ambiguity when facing unknown words. Thus, a POS tagger should be flexible with respect to the amount of information utilized and the context shape.
In a Dravidian language like Kannada, ambiguity is the key issue that must be addressed while designing a POS tagger. Words behave differently in different contexts, and hence the challenge is to correctly identify the POS tag of a token appearing in a particular context. The input to a tagging algorithm is a string of words of a natural language sentence and a specified tagset (a finite list of part-of-speech tags). The output is a single best POS tag for each word.
This chapter describes the development of a part-of-speech tagger for the Kannada language that can be used for analyzing and annotating Kannada texts. The SVM supervised machine learning classifier was used in the proposed system to address the POS tagging problem in Kannada. Prior to the development of this POS tagger, a linguistic study was conducted to determine the internal linguistic structure of the Kannada sentence. Based on this study, a suitable tagset for the Kannada language was developed, modelled on the AMRITA tagset. A corpus of texts extracted from Kannada newspapers and books was manually morphologically analyzed and tagged based on the developed tagset. A corpus of approximately fifty thousand words was used for training and testing the accuracy of the POS tagger generators.
5.1 COMPLEXITY IN KANNADA POS TAGGING
In Dravidian languages, and particularly in Kannada, nouns and verbs get inflected. Nouns inflect for number and case. Verbs inflect for tense and number, and are also adjectivalized and adverbialized. Verbs and adjectives are nominalized by means of certain nominalizers, while adjectives and adverbs do not inflect. Many postpositions in Kannada derive from nominal and verbal sources, so we often need to depend on syntactic function or context to decide whether a particular word is a noun, adjective, adverb, or postposition. This leads to the complexity of Kannada POS tagging. A noun may be categorized as common, proper, or compound. Similarly, a verb may be finite, infinite, gerund, or contingent. The contingent form of the verb is not found in any Dravidian language other than Kannada. Other parts of speech are likewise divided into their own subcategories.
For example, the Kannada word 'batti' (ಫತ್ತತ) in the following sentences is associated with different parts of speech.
Sentence 1: ಸಿೋತ್ದೋದಫತ್ತತಫದಲ್ಲಸಿದಳನ (Seete deepada batti badalisidaLu)
Sentence 2: ಬವಿಮನೋಯನಫತ್ತತಹ ೋಯಿತನ (baaviya neeru batti hooyitu)
In the first sentence, the word 'batti' (ಫತ್ತತ) is a noun, whereas in the second sentence it is a verb. This is not rare in natural languages, and a large percentage of word forms are ambiguous. Also, the parts of speech are not just the noun, pronoun, verb, and adverb; there are clearly many more categories and subcategories.
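Such lexical ambiguity is typically recorded as an ambiguity class per word form. A minimal sketch, using tags from the proposed tagset; the dictionary entries and glosses are illustrative assumptions:

```python
# Sketch of an ambiguity dictionary: each word form maps to the set of tags
# it can take. Entries are illustrative, not extracted from the thesis corpus.
ambiguity_class = {
    "batti": {"NN", "VF"},   # noun in sentence 1, verb in sentence 2 above
    "mattu": {"CNJ"},        # unambiguous conjunction ("and")
}

def is_ambiguous(word):
    """A word form is ambiguous when more than one tag is possible for it."""
    return len(ambiguity_class.get(word, set())) > 1
```

The tagger's job is exactly to pick one member of this set per occurrence, using context.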
5.2 PROPOSED TAGSET FOR KANNADA
Every natural language differs from the others, so each language needs a separate tagset. At the POS level we want to determine the word's POS category or tag, and that can be done using a tagset with a limited number of tags. Moreover, a large number of tags leads to increased complexity, which in turn reduces tagging accuracy. A Kannada POS tagset was developed considering the ambiguity and other peculiarities of the Kannada language. The proposed Kannada POS tagset resembles the Amrita Tamil tagset [156].
There are several POS tagsets for Indian languages created by a number of research groups. Baskaran proposed a common POS tagset framework for Indian languages in which he identified 12 categories as universal to any tagset. The TDIL tagset is the common unified XML-based POS schema for Indian languages based on W3C Internationalization; the schema has been developed to take into account the NLP requirements for Web-based services in Indian languages, and the standard specifies an XML POS schema for tagging. Various tagsets already exist for other South Dravidian languages like Tamil, namely the AUKBC tagset, the tagset from Vasuranganathan,
the CIIL tagset, etc. Vijayalaxmi F. Patil of LDC-IL proposed a Kannada tagset which consists of 39 tags. However, I have encountered the following problems with these tagsets:
1. For each word, both the grammatical categories and the grammatical features are considered. Hence every inflected word in the corpus needs to be split, which makes the tagging process very complex.
2. The number of tags is very large. This leads to increased complexity during POS tagging, which in turn reduces tagging accuracy.
3. For the simple POS level, a tagset with just the grammatical categories, excluding grammatical features, is more than sufficient. Also, we needed a tagset with minimum tags without compromising tagging efficiency.
The proposed tagset consists of 30 tags, in which inflections were not considered. Compound tags are used only for nouns (NNC) and proper nouns (NNPC). There are 5 tags for nouns, 1 for pronouns, 8 for verbs, 3 for punctuation, 2 for numbers, and 1 each for adjective, adverb, conjunction, echo, reduplication, intensifier, postposition, emphasis, determiner, complimentizer, and question word. The tags in the proposed tagset are described in Table 5.1 with an example for each.
Table 5.1: POS Tagset for Kannada

S.N  TAG      DESCRIPTION                EXAMPLE [ENGLISH]
1    <NN>     NOUN                       ಹನಡನಗ (huDuga) [boy]
2    <NNC>    COMPOUND NOUN              ಎತ್ತತನಫೆಂಡಿ (ettina banDi)
3    <NNP>    PROPER NOUN                ಕರ್ಾಟಕ (Karnataka)
4    <NNPC>   COMPOUND PROPER NOUN       ಅಫನ್ಲ್್ಱೆಂ (Abdul Kalam)
5    <CRD>    CARDINALS                  ಒೆಂದನ (ondu) [one]
6    <ORD>    ORDINALS                   ಒೆಂದನೆ (ondane) [first]
7    <PRP>    PRONOUN                    ಅವನನ (avanu) [he]
8    <ADJ>    ADJECTIVE                  ಸನೆಂದಯವದ (sundaravAda) [beautiful]
9    <ADV>    ADVERB                     ಳೋಗದಲ್ಲಿ (vEgadalli) [speedily]
10   <VNAJ>   VERB NONFINITE ADJECTIVE   ಫೆಂದಹನಡನಗ (banda huDuga) [the boy who came]
11   <VNAV>   VERB NONFINITE ADVERB      ಫೆಂದನಹ ೋದನನ (bandu hOdanu) [came and went back]
12   <VBG>    VERBAL GERUND              ಫಯನವ (baruva) [coming]
13   <VBC>    VERB CONTINGENT            ಫಯನಳೋನನ (baruvEnu) [might come]
14   <VF>     VERB FINITE                ಫರದೆನನ (baredenu) [wrote]
15   <VAX>    AUXILIARY VERB             ನೆ ೋಡನತ್ತತದೆ್ೋನೆ (nODuttiddEne) [was + ing]
16   <VINT>   VERB INFINITE              ನೆ ೋಡಲ್ನ (nODalu) [to see]
17   <CNJ>    CONJUNCTION                ಭತನತ (mattu) [and]
18   <CVB>    CONDITIONAL VERB           ನೆ ೋಡಿದರ (nODidare) [if seen]
19   <QW>     QUESTION WORDS             ಏಕೆ (Eke) [why]
20   <COM>    COMPLIMENTIZER             ಎೆಂಫ (enba)
21   <NNQ>    QUANTITY NOUN              ಸವಲ್ಪ (swalpa) [little]
22   <PPO>    POST POSITIONS             ತನಕ (tanaka) [till]
23   <DET>    DETERMINERS                ಆ (A)
24   <INT>    INTENSIFIER                ತನೆಂಬ (tunbA) [very]
25   <ECH>    ECHO WORDS                 ಅಪ್ಪಪತಪ್ಪಪ (appi tappi) [by mistake]
26   <EMP>    EMPHASIS                   ಮತರ (matra) [only]
27   <COMM>   COMMA                      ,
28   <DOT>    DOT                        .
29   <QM>     QUESTION MARK              ?
30   <RDW>    REDUPLICATION WORDS        ಟಟ (paTa paTa) [continuously]
5.2.1 Description of Tags in the POS Tag set
NN (Noun)
The tag NN is used for common nouns (general nouns) without differentiating them
based on the grammatical information.
NNC (Compound Noun)
Nouns that are compound are tagged using the tag NNC.
NNP (Proper Nouns)
The tag NNP tags the proper nouns.
NNPC (Compound Proper Nouns)
Compound proper nouns are tagged using the tag NNPC.
ORD (Ordinal)
Expressions denoting ordinals will be tagged as ORD.
CRD (Cardinal)
The tag CRD tags the cardinals (numbers) in the language.
PRP (Pronoun)
All pronouns are tagged using the tag PRP.
ADJ (Adjective)
All adjectives in the language will be tagged as ADJ.
ADV (Adverb)
ADV tag tags the adverbs in the language. This tag is used only for manner adverbs.
VNAJ (Verb Non-finite Adjective)
VNAJ tag tags the Verb Non-finite Adjective in the language.
VNAV (Verb Non-finite Adverb)
VNAV tag tags the Verb Non-finite Adverb in the language.
VBG (Verbal Gerund)
All Verbal Gerund in the language will be tagged as VBG.
VF (Verb Finite)
VF tag is used to tag the finite verbs in the language.
VAX (Auxiliary Verb)
VAX tags the auxiliary verbs in the language.
VINT (Verb Infinite)
VINT tags the Verb Infinite in the language. This is generally preceded by auxiliary
verbs or finite verbs.
CVB (Conditional Verb)
CVB tags the Conditional Verb in the language.
VBC (Verbal Contingent)
The VBC tag marks the verbal contingent, a special characteristic of the Kannada language which is not present in other South Dravidian languages like Tamil, Telugu, and Malayalam.
CNJ (Conjuncts, both coordinating and subordinating)
The tag CNJ can be used for tagging coordinating and subordinating conjuncts.
QW (Question Words)
The question words in the language will be tagged as QW.
COM (Complimentizer)
COM tags the Complimentizer in the language.
NNQ (Quantity Noun)
NNQ tags the Quantity Noun in the language.
PPO (Postposition)
All the Indian languages have the phenomenon of postpositions. Postpositions are
tagged using the tag PPO.
DET (Determiners)
The tag DET tags the determiners in the language.
INT (Intensifier)
Intensifier is used for intensifying adjectives or adverbs in a language.
EMP (Emphasis)
The tag EMP tags the Emphasis words in the language.
COMM (Comma)
The tag COMM tags the comma in a sentence.
DOT (Dot)
The tag DOT tags the dots (period) in a sentence.
QM (Question Mark)
The question marks in the language are tagged using the tag QM.
5.2.2 Alignment and Comparison of the Proposed Kannada Tagset with the TDIL Tagset
As noted earlier, the TDIL tagset is the common unified XML-based POS schema for Indian languages based on W3C Internationalization, developed to take into account the NLP requirements for Web-based services in Indian languages. Table 5.2 shows the alignment and comparison of the proposed Amrita Kannada tagset with the TDIL tagset.
Table 5.2: Alignment and comparison of the Amrita Kannada tagset with the TDIL tagset

TDIL Tagset for Dravidian Languages           Amrita Tagset for Kannada
Sl.No  Category        Label                  Sl.No  Category          Label
1      Noun            N                      1      Noun              N
1.1    Common          NN                     1.1    Common            NN
                                              1.1.1  Compound          NNC
1.2    Proper          NNP                    1.2    Proper            NNP
                                              1.2.1  Compound          NNPC
1.3    Nloc            NST                    merged with Post Position (PPO)
2      Pronoun         PR                     2      Pronoun           PRP
2.1    Personal        PRP
2.2    Reflexive       PRF
2.3    Relative        PRL
2.4    Reciprocal      PRC
2.5    Wh-word         PRQ
3      Demonstrative   DM
3.1    Deictic         DMD                    3      Determiner        DET
3.2    Relative        DMR
3.3    Wh-word         DMQ                    4      Question Words    QW
4      Verb            V                      5      Verb              V
4.1    Main            V                      5.1    Main              V
4.1.1  Finite          VF                     5.1.1  Finite            VF
4.1.2  Non-Finite      VNF                    5.1.2  Verbal Gerund     VBG
4.1.3  Infinitive      VINF                   5.1.3  Infinitive        VINF
4.2    Verbal          VN                     5.2    Verbal            VN
                                              5.2.1  Verb Nonfinite Adjective  VNAJ
                                              5.2.2  Verb Nonfinite Adverb     VNAV
4.3    Auxiliary       VAUX                   5.3    Auxiliary         VAX
                                              5.4    Contingent        VBC
5      Adjective       JJ                     6      Adjective         ADJ
6      Adverb          RB                     7      Adverb            ADV
7      Postposition    PSP                    8      Postposition      PPO
8      Conjunction     CC                     9      Conjunction       CNJ
8.1    Co-ordinator    CCD
8.2    Subordinator    CCS
8.2.1  Quotative       UT                     10     Complimentizer    COM
9      Particles       RP                     11     Emphasis          EMP
9.1    Default         RP
9.2    Classifier      CL
9.3    Interjection    INJ
9.4    Intensifier     INTF                   12     Intensifier       INT
9.5    Negation        NEG
10     Quantifiers     QT                     13     Quantifiers       Q
10.1   General         QTF                    13.1   Quantifier Noun   NNQ
10.2   Cardinals       QTC                    13.2   Cardinals         CRD
10.3   Ordinals        QTO                    13.3   Ordinals          ORD
11     Residuals       RD                     14     Residuals
11.1   Foreign word    RDF
11.2   Symbol          SYM                    14.1   Question Mark     QM
11.3   Punctuation     PUNC                   14.2   Comma             COMM
11.4   Unknown         UNK                    14.3   Dot               DOT
11.5   Echowords       ECH                    15     Echo Words        ECH
                                              16     Reduplication Words  RDW
5.3 SUPPORT VECTOR MACHINE BASED TAGGERS
Generally, tagging is required to be as accurate and as efficient as possible. But, certainly, there is a trade-off between these two desirable properties, because obtaining higher accuracy relies on processing more and more information, digging deeper and deeper into it. However, sometimes, depending on the kind of application, a loss in efficiency may be acceptable in order to obtain more precise results, or, the other way around, a slight loss in accuracy may be tolerated in favor of tagging speed.
Moreover, some languages have a richer morphology than others, requiring the tagger to take into account a bigger set of feature patterns [157]. The tagset size and ambiguity rate may also vary from language to language and from problem to problem. Besides, if little data is available for training, the proportion of unknown words may be huge. Sometimes, morphological analyzers can be utilized to reduce the degree of ambiguity when facing unknown words. Thus, a sequential tagger should be flexible with respect to the amount of information utilized and the context shape. Another very interesting property required of taggers is portability.
The SVMTool is intended to comply with all the requirements of modern NLP technology by combining simplicity, flexibility, robustness, portability, and efficiency with state-of-the-art accuracy. This is achieved by working in the SVM learning framework and by offering NLP researchers a highly customizable sequential tagger generator. The SVM-based tagger is robust and flexible in feature modelling and trains efficiently with very few parameters to tune. It can also tag thousands of words per second, which makes it practical for real NLP applications.
5.3.1 Problem Setting
Binarizing the classification problem and feature codification are the two important
steps in the problem setting.
5.3.1.1 Binarizing the Classification Problem
Tagging a word in context is a multi-class classification problem [154]. Since SVMs are binary classifiers, the problem must be binarized before applying them. The proposed POS tagger model applies a simple one-per-class binarization, i.e., an SVM is trained for every POS tag in order to distinguish between examples of that class and all the rest. When tagging a word, the most confident tag according to the predictions of all binary SVMs is selected.
However, not all training examples are considered for all classes. Instead, a dictionary is extracted from the training corpus with all possible tags for each word, and when considering the occurrence of a training word 'w' tagged as 't_i', this example is used as a positive example for class t_i and a negative example for all other classes 't_j' appearing as possible tags for 'w' in the dictionary. This avoids the generation of excessive and irrelevant negative examples and makes the training step faster.
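The dictionary-restricted binarization can be sketched as follows; `build_binary_examples` is a hypothetical helper name, and the toy corpus is illustrative:

```python
# Sketch of one-per-class binarization with dictionary-restricted negatives:
# an occurrence of word w tagged t_i is a positive example for t_i and a
# negative example only for the other tags listed for w in the dictionary.
def build_binary_examples(tagged_corpus):
    # dictionary: every tag seen for each word in the training corpus
    dictionary = {}
    for word, tag in tagged_corpus:
        dictionary.setdefault(word, set()).add(tag)
    examples = {}  # tag -> list of (word, +1 or -1)
    for word, tag in tagged_corpus:
        examples.setdefault(tag, []).append((word, +1))
        for other in dictionary[word] - {tag}:
            examples.setdefault(other, []).append((word, -1))
    return dictionary, examples

corpus = [("batti", "NN"), ("batti", "VF"), ("mattu", "CNJ")]
dic, ex = build_binary_examples(corpus)
```

Note that the unambiguous word "mattu" never generates a negative example, which is exactly the saving described above.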
5.3.1.2 Feature Codifications
Each example (event) is represented using the local context of the word for which the system will determine a tag (output decision). This local context, together with local information such as capitalization and affixes of the current token, helps the system make a decision even if the token has not been encountered during training. The proposed model considers a centered window of seven tokens, in which some basic and n-gram patterns are evaluated to form binary features such as "previous word is the", "two preceding tags are DET NN", etc. Each of the individual tags of an ambiguity class is also taken as a binary feature of the form "following word may be a NN". Therefore, with ambiguity classes and "maybe's", the proposed model avoids the two-pass solution, in which an initial first-pass tagging is performed in order to have the right contexts disambiguated for the second pass. Explicit n-gram features are also not strictly necessary in the SVM approach, because polynomial kernels account for the combination of features.
Additional features have been used to deal with the problem of unknown words. Features appearing fewer than a certain cut-off number of times may be ignored for the sake of robustness. Table 5.3 shows the rich feature set used in the proposed experiment.
Table 5.3: Feature Pattern Set

word features         w-3, w-2, w-1, w0, w+1, w+2, w+3
PoS features          p-3, p-2, p-1, p0, p+1, p+2, p+3
ambiguity classes     a0, a1, a2, a3
maybe's               m0, m1, m2, m3
word bigrams          (w-2, w-1), (w-1, w+1), (w-1, w0), (w0, w+1), (w+1, w+2)
PoS bigrams           (p-2, p-1), (p-1, a+1), (a+1, a+2)
word trigrams         (w-2, w-1, w0), (w-2, w-1, w+1), (w-1, w0, w+1),
                      (w-1, w+1, w+2), (w0, w+1, w+2)
PoS trigrams          (p-2, p-1, a0), (p-2, p-1, a+1), (p-1, a0, a+1), (p-1, a+1, a+2)
sentence info         punctuation ('.', '?', '!')
prefixes              s1, s1s2, s1s2s3, s1s2s3s4
suffixes              sn, sn-1sn, sn-2sn-1sn, sn-3sn-2sn-1sn
binary word features  initial upper case, all upper case, no initial capital letter(s),
                      all lower case, contains a (period / number / hyphen ...)
word length           integer
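A simplified sketch of how binary features might be extracted from the centred seven-token window, covering only a small subset of the patterns in Table 5.3; the feature-string format is an assumption of this sketch, not the SVMTool's internal encoding:

```python
# Sketch: binary features from a centred window of seven tokens.
# tags[j] is None for positions not yet disambiguated.
def extract_features(words, tags, i):
    pad_w = lambda j: words[j] if 0 <= j < len(words) else "<pad>"
    pad_t = lambda j: tags[j] if 0 <= j < len(tags) and tags[j] else "<?>"
    feats = set()
    for off in range(-3, 4):                        # word unigrams w-3 .. w+3
        feats.add(f"w{off}={pad_w(i + off)}")
    for off in range(-3, 0):                        # POS of already-tagged left context
        feats.add(f"p{off}={pad_t(i + off)}")
    feats.add(f"bigram={pad_w(i - 1)}_{pad_w(i)}")  # one word bigram (w-1, w0)
    feats.add(f"suffix={pad_w(i)[-2:]}")            # a lexicalized suffix feature
    return feats

f = extract_features(["avanu", "bandanu"], ["PRP", None], 1)
```

Each such string would become one dimension of the binary feature vector fed to the SVMs.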
5.4 PROPOSED POS TAGGING ALGORITHM AND ARCHITECTURE
A simple algorithm which shows the basic structure of the proposed architecture for
POS tagging is defined below.
Step 1: Take input text.
Step 2: Tokenize the input text (pre-editing).
Step 3: Manual tagging.
Step 4: Train the corpus.
Step 5: Tagging using SVM.
Step 5.1: Search for the token in the lexicon.
Step 5.2: If found, assign the appropriate tag from the lexicon.
Step 5.3: If not found, tag it using the SVM probabilities.
Step 6: Get the tagged output text.
Step 7: Insert the new words into the lexicon.
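The steps above can be sketched as a single tagging loop; `classify` stands in for the trained SVM decision and is an assumption of this sketch:

```python
# Sketch of the tagging loop: lexicon lookup first, classifier fallback for
# unknown tokens, then growing the lexicon with the newly tagged words.
def tag_text(tokens, lexicon, classify):
    tagged = []
    for tok in tokens:
        if tok in lexicon:                 # Step 5.1 / 5.2: known word
            tagged.append((tok, lexicon[tok]))
        else:                              # Step 5.3: unknown word via SVM
            tag = classify(tok)
            tagged.append((tok, tag))
            lexicon[tok] = tag             # Step 7: insert new word into lexicon
    return tagged

lex = {"mattu": "CNJ"}
out = tag_text(["mattu", "huDuga"], lex, classify=lambda t: "NN")
```

The real system keeps per-word tag distributions rather than a single tag, but the control flow is the same.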
Fig. 5.1 shows the proposed architecture for POS tagging. The architecture consists of different modules based on their functionalities. The SVMTool package consists of three main components, namely the model learner (SVMTlearn), the tagger (SVMTagger), and the evaluator (SVMTeval). Prior to tagging, SVM models (weight vectors and biases) are learned from a training corpus using the SVMTlearn component. Different models were learned for different strategies. Then, at tagging time, using the SVMTagger component, one may choose the tagging strategy that is most suitable for the purpose of the tagging. Finally, given a correctly annotated corpus and the corresponding SVMTagger-predicted annotation, the SVMTeval component is used to evaluate the performance of the tagger. The functionality of each module is explained briefly as follows:
Fig. 5.1: Architecture for POS tagging
5.4.1 Tokenize
Untagged sentences are downloaded from Kannada newspapers and commercial websites. The input text is then converted into a column format suitable for the SVM tool [154].
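A minimal sketch of this pre-editing step, assuming simple whitespace-and-punctuation tokenization (the actual module may tokenize differently):

```python
# Sketch: raw text to the one-token-per-line column format expected by the
# SVMTool components (sentence punctuation split off as its own token).
import re

def to_column_format(text):
    tokens = re.findall(r"[^\s.,?!]+|[.,?!]", text)
    return "\n".join(tokens)

cols = to_column_format("avanu bandanu .")
```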
5.4.2 Manual Tagging
The tokenizing module produces a corpus of untagged tokens, which was then tagged manually using the proposed AMRITA Kannada tagset. Training data must be in column format, i.e., a one-token-per-line corpus in a sentence-by-sentence fashion. The column separator is the blank space. The token is expected to be the first column of the line, and the tag to predict takes the second column in the output; the rest of the line may contain additional information. Initially, around 10,000 words were tagged manually. A sample of the training data is as follows:
ಕೆ ಡಗಿಗೆ<NNP>
ಆಗಮಿಸಿಯನವ<VBG>
ಬಿಜೆಪ್ಪ<NNP>
ಹಿರಿಮ<ADJ>
ರ್ಮಕ<NN>
ಱಲ್<NNPC>
ಕೃಷಣ<NNPC>
ಅಡ್ವಣಿ<NNPC>
ಅವಯನ<PRP>
ಶನಕರವಯ<NNP>
ಕೆ ಡವ<NNPC>
ಉಡನಗೆ<NNPC>
ತ್ ಟನು<VAX>
ಸೆಂಬರಮಿಸಿದರ<VAX>
, <COMM>
ಅವಯ<PRP>
ತ್ತು<NN>
ಕಭಱ<NNP>
ಅವಯ <PRP>
ಸ್ಥ್<ADV>
ನೋಡಿದಯನ<VF>
. <DOT>
5.4.3 Corpus Training
The tagged corpus is trained using the SVMTlearn component of the SVM-light tool [154]. SVM-light is an implementation of Vapnik's SVMs in C, developed by Thorsten Joachims. Training with the SVM produces a dictionary with a merged model and its feature set (lexicon). This lexicon contains the tagged Kannada words with their groups of tag probability parameters. SVMTlearn behavior is easily adjusted through a configuration file. The usage of the SVMTlearn module is shown below.
Usage: SVMTlearn [options] <config-file>
Options:
- V verbose 0: none verbose
1: low verbose [default]
2: medium verbose
3: high verbose
Example: SVMTlearn -V 2 config.svmt
These are the currently available config-file options:
Sliding window: The size of the sliding window for feature extraction can be adjusted. Also, the core position in which the word to disambiguate is located may be selected. By default, the window size is 5 and the core position is 2, starting at 0. For the proposed POS tagger, features were extracted using the default window size of 5.
Feature set: Three different kinds of feature types can be collected from the sliding
window:
Word features: word-form n-grams; usually unigrams, bigrams, and trigrams suffice. Also, whether the last word of the sentence is a punctuation mark ('.', '?', '!') is important.
POS features: annotated parts-of-speech and ambiguity-class n-grams, and "maybe's". As for words, considering unigrams, bigrams, and trigrams is enough. The ambiguity class for a certain word determines which POS tags are possible. A "maybe" states, for a certain word, that a certain POS may be possible, i.e., it belongs to the word's ambiguity class.
Lexicalized features: including prefixes and suffixes, capitalization, hyphenization, and similar information related to a word form. In Kannada, capitalization does not apply; all the other features are used.
Default feature sets for every model are defined.
Feature filtering: The feature space can be kept to a convenient size. Smaller models allow for higher efficiency. By default, no more than 100,000 dimensions are used. Also, features appearing fewer than n times can be discarded, which indeed helps the system both to fight overfitting and to exhibit higher accuracy. By default, features appearing just once are ignored.
SVM model compression: Weight vector components lower than a given threshold can be filtered out of the resulting SVM models, thus enhancing efficiency by decreasing the model size while still preserving the accuracy level. This is an interesting behavior of SVM models currently under study. In fact, when up to 70% of the weight components are discarded, accuracy remains stable, and it is not until 95% of the components are discarded that accuracy falls below the current state of the art (97.0%-97.2%).
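The compression step amounts to thresholding the weight vector; a minimal sketch with toy weights:

```python
# Sketch of SVM model compression: drop weight components whose magnitude
# falls below a threshold, shrinking the model while preserving accuracy.
def compress(weights, threshold):
    """weights: feature -> weight. Keep only components >= threshold in magnitude."""
    return {f: w for f, w in weights.items() if abs(w) >= threshold}

# Illustrative model fragment (feature names follow no real SVMTool format).
model = {"w0=mattu": 1.8, "suffix=nu": 0.004, "p-1=PRP": -0.9}
small = compress(model, threshold=0.01)
```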
C parameter tuning: In order to deal with noise and outliers in the training data, the soft-margin version of the SVM learning algorithm allows the misclassification of certain training examples when maximizing the margin. This balance can be automatically adjusted by optimizing the value of the C parameter of the SVMs. A local maximum is found by exploring accuracy on a validation set for different C values at progressively shorter intervals.
Dictionary repairing: The lexicon extracted from the training corpus can be automatically repaired either based on frequency heuristics or on a list of corrections supplied by the user. This makes the tagger robust to corpus errors. A heuristic threshold may also be specified in order to treat as tagging errors those (word_x, tag_y) pairs occurring less than a certain proportion of times relative to the number of occurrences of word_x. For example, with a threshold of 0.001, (run, DT) would be considered an error if the word run had been seen at least 1000 times but tagged as 'DT' only once. This kind of heuristic dictionary repairing does not harm the tagger performance; on the contrary, it may help a lot.
The repairing list must comply with the SVMTool dictionary format, i.e.:
<word> <N occurrences> <N possible tags> {<tag(i)> <N occurrences(i)>}
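The frequency heuristic can be sketched as follows; `repair` is a hypothetical helper name, and the counts are illustrative:

```python
# Sketch of frequency-based dictionary repairing: a (word, tag) pair is
# treated as a corpus error when it accounts for less than the threshold
# proportion of the word's total occurrences.
def repair(tag_counts, threshold):
    """tag_counts: word -> {tag: count}. Returns word -> surviving tag set."""
    repaired = {}
    for word, counts in tag_counts.items():
        total = sum(counts.values())
        repaired[word] = {t for t, c in counts.items() if c / total >= threshold}
    return repaired

# "run" seen 2000 times, only once as DT: 1/2000 < 0.001, so DT is dropped.
cleaned = repair({"run": {"VB": 1999, "DT": 1}}, threshold=0.001)
```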
Ambiguous classes: The list of POS tags presenting ambiguity is, by default, automatically extracted from the corpus but, if available, this knowledge can be made explicit. This acts in favor of the system's robustness.
Open classes: The list of POS tags with which an unknown word may be labelled is also, by default, automatically determined.
Backup lexicon: A morphological lexicon containing words that are not present in the training corpus may be provided; it can also be provided at tagging time. This file must comply with the SVMTool dictionary format.
Models
Five different kinds of models have been implemented. Models 0, 1, and 2 differ only in the features they consider. Models 3 and 4 are just like Model 0 with respect to feature extraction, but examples are selected in a different manner. Model 3 is for unsupervised learning: given an unlabelled corpus and a dictionary, at learning time it can only count on knowing the ambiguity class, and the POS information only for unambiguous words. Model 4 achieves robustness by simulating unknown words in the learning context at training time.
Model 0: This is the default model. The unseen context remains ambiguous. It was designed with the one-pass on-line tagging scheme in mind, i.e., the tagger goes either left-to-right or right-to-left making decisions, so past decisions feed future ones in the form of POS features. At tagging time, only the parts of speech of already disambiguated tokens are considered. For the unseen context, ambiguity classes are considered instead.
Model 1: This model considers the unseen context already disambiguated in a previous step. It is thus intended for a second pass, revisiting and correcting already tagged text.
Model 2: This model does not consider POS features at all for the unseen context. It is designed to work in a first pass, requiring Model 1 to review the tagging results in a second pass.
Model 3: The training is based on the role of unambiguous words. Linear classifiers are trained with examples of unambiguous words extracted from an unannotated corpus, so less POS information is available. The only additional information required is a morpho-syntactic dictionary.
Model 4: Errors caused by unknown words at tagging time punish the system severely. To reduce this problem, during learning some words are artificially marked as unknown in order to learn a more realistic model. The process is very simple: the corpus is divided into a number of folds, and before samples are extracted from each fold, a dictionary is generated from the remaining folds. Thus, the words appearing in a fold but not in the rest are unknown words to the learner.
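Model 4's unknown-word simulation can be sketched as follows; the folds below are toy word lists (the real implementation works on tagged corpora):

```python
# Sketch: split the corpus into folds; words appearing in a fold but nowhere
# in the remaining folds are treated as unknown when extracting examples
# from that fold.
def unknown_words_per_fold(folds):
    result = []
    for i, fold in enumerate(folds):
        seen_elsewhere = set()
        for j, other in enumerate(folds):
            if j != i:
                seen_elsewhere.update(other)
        result.append({w for w in fold if w not in seen_elsewhere})
    return result

folds = [["avanu", "bandanu"], ["avanu", "hooyitu"]]
unk = unknown_words_per_fold(folds)
```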
Table 5.4: Tag Counts

S.N  TAG      Counts     S.N  TAG      Counts
1    <NN>     18660      16   <VBC>    4
2    <NNC>    3692       17   <CNJ>    1308
3    <NNP>    2992       18   <CVB>    340
4    <NNPC>   2584       19   <QW>     920
5    <ORD>    58         20   <COM>    994
6    <CRD>    1064       21   <NNQ>    2364
7    <PRP>    1732       22   <PPO>    310
8    <ADJ>    1192       23   <DET>    898
9    <ADV>    412        24   <INT>    474
10   <VNAJ>   10         25   <ECH>    64
11   <VNAV>   32         26   <EMP>    292
12   <VBG>    1260       27   <COMM>   1278
13   <VF>     3758       28   <DOT>    4273
14   <VAX>    3410       29   <QM>     262
15   <VINT>   54         30   <RDW>    106
The proposed POS tagger was trained with a tagged Kannada corpus of 54,000 words. Table 5.4 shows the frequency of each tag in the corpus.
5.4.4 Tagging using SVM Tagger
Given a text corpus with one token per line and the path to a previously learned SVM model, including the automatically generated dictionary, SVMTagger performs the POS tagging of a sequence of words. The tagging is performed on-line, based on a sliding window which gives a view of the feature context to be considered at every decision. A word not currently available in the lexicon may still be tagged using the learned probability parameters (weights and biases).
The SVMTagger component works on standard input/output. It processes a one-token-per-line corpus in a sentence-by-sentence fashion. The token is expected to be the first column of the line, and the predicted tag will take the second column in the output. The rest of the line remains unchanged.
Example of input to the SVMTagger:
ಕೆ ಡಗಿಗೆ
ಆಗಮಿಸಿಯನವ
ಬಿಜೆಪ್ಪ
ಹಿರಿಮ
ರ್ಮಕ
ಱಲ್
ಕೃಷಣ
ಅಡ್ವಣಿ
ಅವಯನ
ಶನಕರವಯ
ಕೆ ಡವ
ಉಡನಗೆ
ತ್ ಟನು
ಸೆಂಬರಮಿಸಿದರ
,
ಅವಯ
ತ್ತು
ಕಭಱ
ಅವಯ
ಸ್ಥ್
ನೋಡಿದಯನ
.
SVMTagger Expected Output:
ಕೆ ಡಗಿಗೆ<NNP>
ಆಗಮಿಸಿಯನವ<VBG>
ಬಿಜೆಪ್ಪ<NNP>
ಹಿರಿಮ<ADJ>
ರ್ಮಕ<NN>
ಱಲ್<NNPC>
ಕೃಷಣ<NNPC>
ಅಡ್ವಣಿ<NNPC>
ಅವಯನ<PRP>
ಶನಕರವಯ<NNP>
ಕೆ ಡವ<NNPC>
ಉಡನಗೆ<NNPC>
ತ್ ಟನು<VAX>
ಸೆಂಬರಮಿಸಿದರ<VAX>
, <COMM>
ಅವಯ<PRP>
ತ್ತು<NN>
ಕಭಱ<NNP>
ಅವಯ <PRP>
ಸ್ಥ್<ADV>
ನೋಡಿದಯನ<VF>
. <DOT>
Usage : SVMTagger [options] <model>
Options:
- T <strategy>
0: one-pass (default; requires Model 0)
1: two-passes (revisiting results and relabelling; requires Model 2 and Model 1)
2: one-pass (robust against unknown words; requires Model 0 and Model 2)
3: one-pass (unsupervised learning models; requires Model 3)
4: one-pass (very robust against unknown words; requires Model 4)
5: one-pass (sentence-level likelihood; requires Model 0)
6: one-pass (robust sentence-level likelihood; requires Model 4)
- S <direction>
LR: left-to-right (default)
RL: right-to-left
LRL: both left-to-right and right-to-left
GLRL: both left-to-right and right-to-left
(global assignment, only applicable under a sentence level tagging strategy)
- K <n> weight filtering threshold for known words (default is 0)
- U <n> weight filtering threshold for unknown words (default is 0)
- Z <n> number of beams in beam search, only applicable under sentence-level strategies
(default is disabled)
- R <n> dynamic beam search ratio, only applicable under sentence-level strategies
(default is disabled)
- F <n> softmax function to transform SVM scores into probabilities (default is 1)
0: do_nothing
1: ln(e^score(i) / [sum:1<=j<=N:[e^score(j)]])
- A predictions for all possible parts-of-speech are returned
- B <backup_lexicon>
- L <lemmae_lexicon>
- EOS enable usage of the end-of-sentence string '<s>' (disabled by default; [!.?] used instead)
- V <verbose> -> 0: none verbose
1: low verbose
2: medium verbose
3: high verbose
4: very high verbose
Model: model location (path/name)
(Name as declared in the config-file NAME)
Example: SVMTagger -T 0 KANNADA <in.txt > out.txt
Strategies
Seven different tagging strategies have been implemented so far:
Strategy 0: This is the default strategy. It makes use of Model 0 in a greedy, on-line, one-pass fashion.
Strategy 1: As a first attempt to achieve robustness against error propagation, this strategy works in two passes, in an on-line greedy way. It uses Model 2 in the first pass and Model 1 in the second. In other words, in the first pass the unseen morpho-syntactic context remains ambiguous, while in the second pass the tag predicted in the first pass is available even for unseen tokens and is used as a feature.
Strategy 2: This strategy tries to achieve robustness by using two models at tagging time, namely Model 0 and Model 2. When all the words in the context are known it uses Model 0; otherwise it makes use of Model 2.
Strategy 3: It uses Model 3, again in a greedy and on-line manner. This unsupervised
learning strategy is still under experimentation.
Strategy 4: It simply uses Model 4 as is, in an on-line greedy fashion.
Strategy 5: Moving towards a more robust scheme, this strategy performs sentence-level tagging by means of dynamic programming (the Viterbi algorithm). It uses Model 0.
Strategy 6: Like Strategy 5, this strategy performs sentence-level tagging, this time applying Model 4.
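The sentence-level search behind Strategies 5 and 6 can be illustrated with a minimal Viterbi sketch. This is a simplification for exposition only: in SVMTool the per-token scores come from softmax-normalised SVM outputs (option -F 1), whereas here `score` is an arbitrary user-supplied log-score function.

```python
def viterbi(tokens, tags, score):
    """Pick the highest-scoring tag sequence for a whole sentence.

    score(prev_tag, tag, i) returns a log-score for assigning `tag` to
    tokens[i] after `prev_tag` (prev_tag is None at the first position).
    """
    n = len(tokens)
    # best[i][t] = best log-score of a tag sequence for tokens[:i+1] ending in t
    best = [{t: float("-inf") for t in tags} for _ in range(n)]
    back = [{} for _ in range(n)]
    for t in tags:
        best[0][t] = score(None, t, 0)
    for i in range(1, n):
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p] + score(p, t, i))
            best[i][t] = best[i - 1][prev] + score(prev, t, i)
            back[i][t] = prev
    # Recover the best sequence by following back-pointers from the end.
    last = max(tags, key=lambda t: best[n - 1][t])
    seq = [last]
    for i in range(n - 1, 0, -1):
        seq.append(back[i][seq[-1]])
    return list(reversed(seq))
```

Unlike the greedy one-pass strategies, this search may revise an early low-confidence decision if it enables a much better tagging of the rest of the sentence.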
5.4.5 Evaluation and Results
SVMTeval is used to evaluate the performance of the proposed POS tagger system. Given an SVMTagger-predicted tagging output and the corresponding gold standard, SVMTeval evaluates the performance in terms of accuracy. Based on the morphological dictionary created at training time by SVMTlearn, results may be presented for different sets of words, such as known vs. unknown words and ambiguous vs. unambiguous words. A different view of these same results can also be taken from the ambiguity-class perspective, i.e. words sharing the same kind of ambiguity may be considered together. Words sharing the same degree of disambiguation complexity, determined by the size of their ambiguity classes, can also be grouped.
Usage: SVMTeval [mode] <model> <gold> <pred>
- mode: 0 - complete report (everything)
1 - overall accuracy only [default]
2 - accuracy of known vs unknown words
3 - accuracy per level of ambiguity
4 - accuracy per kind of ambiguity
5 - accuracy per class
- model: model name
- gold: correct tagging file
- pred: predicted tagging file
Tagging accuracy is considered as the evaluation criterion for performance evaluation. The criterion used for evaluating the proposed models is based on exact tagging: a tag is considered correct only if it exactly matches the one in the gold standard. POS tagging accuracy is calculated using the following formula:
TA = Number of correctly tagged words / Number of test words.
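The exact-match criterion above amounts to a position-by-position comparison of the gold and predicted files. The following sketch illustrates it (a hypothetical helper, not part of SVMTeval; the word&lt;TAG&gt; line format follows the examples earlier in the chapter):

```python
def tagging_accuracy(gold_lines, pred_lines):
    """TA = correctly tagged words / total test words.

    Each line is a "word<TAG>" token; a prediction counts as correct
    only if the line matches the gold line exactly.
    """
    correct = total = 0
    for gold, pred in zip(gold_lines, pred_lines):
        total += 1
        if gold.strip() == pred.strip():
            correct += 1
    return correct / total if total else 0.0
```

SVMTeval additionally breaks this single number down by known/unknown words and by ambiguity class, as described above.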
The SVM model starts with a limited lexicon, and the system shows low accuracy when POS tagging is performed with it. The size of the training data was therefore increased step by step with more pre-edited data, which was tagged with the created SVM model and then manually corrected wherever tags were wrong. This process was repeated with more data to grow the lexicon. As the lexicon reached 54,000 words, the performance of the proposed system rose to 86%. Lexicon size versus accuracy is shown in Table 5.5; accuracy increased with the number of words in the lexicon.
Table 5.5: POS Tagger Result
No. of Words in Lexicon POS tagger Accuracy
10,000 48%
25,000 66%
54,000 86%
5.5 SUMMARY
The development of a POS tagger for a language with limited electronic resources can be very demanding. The proposed work presented a part-of-speech tagger for the Kannada language modelled using an SVM kernel. Prior to this development, a linguistic study was conducted to determine the internal linguistic structure of the Kannada sentence, and a suitable Kannada tagset was developed. A corpus of approximately fifty-four thousand words was used for training and testing the accuracy of the tagger. The experiments showed that accuracy increased with the number of words in the corpus. The proposed part-of-speech tagger was used to develop a syntactic parser model for the Kannada language. The
POS tagger can also be used to develop bilingual MT from Kannada to other Dravidian languages.
5.6 PUBLICATIONS
Antony P J and Soman K P: "Kernel Based Part of Speech Tagger for Kannada", International Conference on Machine Learning and Cybernetics 2010 (ICMLC 2010, China). The paper is archived in IEEE Xplore and the IEEE CS Digital Library.