CHAPTER 5
PART OF SPEECH TAGGER FOR KANNADA
Parts-of-speech (POS) tagging is a well-understood problem in NLP. Its importance stems from the fact that POS tagging is one of the first stages in many natural language processing pipelines. POS tagging is the process of assigning a part-of-speech tag or other lexical class marker to each word in a sentence, and it plays a crucial role in different fields of NLP, including MT. In linguistics, parts-of-speech tagging, also termed grammatical tagging or word-category disambiguation, is the process of marking up the words in a text or corpus as corresponding to a particular part of speech, based on both their definition and their context, that is, their relationship with adjacent and related words in a phrase, sentence, or paragraph. In other words, it can also be defined as the automatic annotation of a syntactic category for each word in a corpus. It is similar to the process of tokenization for computer languages. A part of speech is a grammatical category, commonly including verbs, nouns, adjectives, adverbs, determiners, and so on.
For English, there are many POS taggers employing machine learning techniques such as HMMs (Brants, 2000), transformation-based error-driven learning (Brill, 1995), decision trees (Black, 1992), maximum entropy methods (Ratnaparkhi, 1996), conditional random fields (Lafferty et al., 2001), support vector machines (Kudoh et al., 2001), etc. POS taggers are broadly classified into two categories: rule based and stochastic [155]. In the rule-based approach, hand-written rules are used to resolve tag ambiguity. Stochastic taggers are either HMM based, choosing the tag sequence which maximizes the product of word likelihood and tag sequence probability, or cue based, using decision trees or maximum entropy models to combine probabilistic features. The performance of a POS tagging model depends heavily on the corpus with which it is trained. The relative failure of rule-based approaches, the increasing availability of machine-readable text, and the increase in hardware capability along with its decrease in cost are some of the reasons researchers prefer corpus-based POS tagging.
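The HMM-style scoring described above can be sketched in a few lines. The probabilities below are toy values chosen for illustration only, not estimates from any corpus:

```python
# Toy emission (word likelihood) and transition (tag sequence) probabilities.
# These values are illustrative assumptions, not corpus estimates.
emission = {("ondu", "CRD"): 0.9, ("ondu", "NN"): 0.1}
transition = {("<s>", "CRD"): 0.3, ("<s>", "NN"): 0.5}

def hmm_score(words, tags):
    """Product of word likelihood and tag-sequence probability."""
    p, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        p *= emission.get((w, t), 0.0) * transition.get((prev, t), 0.0)
        prev = t
    return p

# The stochastic tagger chooses the tag sequence with the highest score.
best = max([("CRD",), ("NN",)], key=lambda ts: hmm_score(["ondu"], ts))
```

In practice the maximization is done over exponentially many sequences with dynamic programming rather than by enumeration.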
Some languages have a richer morphology than others, requiring the tagger to take into account a bigger set of feature patterns. The tagset size and ambiguity rate may also vary from language to language. Besides, if little data is available for training, the proportion of unknown words may be huge. Sometimes, morphological analyzers can be utilized to reduce the degree of ambiguity when facing unknown words. Thus, a POS tagger should be flexible with respect to the amount of information utilized and the context shape.
In a Dravidian language like Kannada, ambiguity is the key issue that must be addressed while designing a POS tagger. Words behave differently in different contexts, and hence the challenge is to correctly identify the POS tag of a token appearing in a particular context. The input to a tagging algorithm is a string of words of a natural language sentence and a specified tagset (a finite list of part-of-speech tags). The output is a single best POS tag for each word.
This chapter describes the development of a part-of-speech tagger for the Kannada language that can be used for analyzing and annotating Kannada texts. The SVM supervised machine learning classifier was used in the proposed system to address the POS tagging problem in Kannada. Prior to the development of this POS tagger, a linguistic study was conducted to determine the internal linguistic structure of the Kannada sentence. Based on this study, a suitable tagset for the Kannada language was developed, modelled on the AMRITA tagset. A corpus of texts extracted from Kannada newspapers and books was manually morphologically analyzed and tagged based on the developed tagset. A corpus of approximately fifty thousand words was used for training and testing the accuracy of the POS tagger generators.
5.1 COMPLEXITY IN KANNADA POS TAGGING
In Dravidian languages, and particularly in Kannada, nouns and verbs get inflected. Nouns inflect for number and case. Verbs inflect for tense and number, and are also adjectivalized and adverbialized. Verbs and adjectives are nominalized by means of certain nominalizers, while adjectives and adverbs do not inflect. Many postpositions in Kannada derive from nominal and verbal sources, so we often need to depend on syntactic function or context to decide whether a particular word is a noun, adjective, adverb, or postposition. This leads to the complexity of Kannada POS tagging. A noun may be categorized as common, proper, or compound. Similarly, a verb may be finite, infinite, gerund, or contingent. The contingent form of the verb is not found in any Dravidian language other than Kannada. Other parts of speech are likewise divided into their own subcategories.
For example, the Kannada word 'batti' (ಫತ್ತತ) in the following sentences is associated with different parts of speech.
Sentence 1: ಸಿೋತ್ದೋದಫತ್ತತಫದಲ್ಲಸಿದಳನ (Seete deepada batti badalisidaLu)
Sentence 2: ಬವಿಮನೋಯನಫತ್ತತಹ ೋಯಿತನ (baaviya neeru batti hooyitu)
In the first sentence, the word 'batti' (ಫತ್ತತ) is a noun, whereas in the second sentence it is a verb. This is not rare in natural languages, and a large percentage of word forms are ambiguous. Also, the parts of speech are not just the noun, pronoun, verb, and adverb; there are clearly many more categories and subcategories.
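Such lexical ambiguity is typically recorded as an ambiguity class per word form. A minimal sketch, using tags from the proposed tagset; the dictionary entries and glosses are illustrative assumptions:

```python
# Sketch of an ambiguity dictionary: each word form maps to the set of tags
# it can take. Entries are illustrative, not extracted from the thesis corpus.
ambiguity_class = {
    "batti": {"NN", "VF"},   # noun in sentence 1, verb in sentence 2 above
    "mattu": {"CNJ"},        # unambiguous conjunction ("and")
}

def is_ambiguous(word):
    """A word form is ambiguous when more than one tag is possible for it."""
    return len(ambiguity_class.get(word, set())) > 1
```

The tagger's job is exactly to pick one member of this set per occurrence, using context.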
5.2 PROPOSED TAGSET FOR KANNADA
Every natural language differs from the others, so each language needs a separate tagset. At the POS level we want to determine the word's POS category or tag, and that can be done using a tagset with a limited number of tags. Moreover, a large number of tags leads to increased complexity, which in turn reduces tagging accuracy. A Kannada POS tagset was developed considering the ambiguity and other peculiarities of the Kannada language. The proposed Kannada POS tagset resembles the Amrita Tamil tagset [156].
There are several POS tagsets for Indian languages created by a number of research groups. Baskaran proposed a common POS tagset framework for Indian languages in which he identified 12 categories as universal to any tagset. The TDIL tagset is the common unified XML-based POS schema for Indian languages based on W3C Internationalization; the schema has been developed to take into account the NLP requirements for Web-based services in Indian languages, and the standard specifies an XML POS schema for tagging. Various tagsets already exist for other South Dravidian languages like Tamil, namely the AUKBC tagset, the tagset from Vasuranganathan,
the CIIL tagset, etc. Vijayalaxmi F. Patil of LDC-IL proposed a Kannada tagset which consists of 39 tags. However, I have encountered the following problems with these tagsets:
1. For each word, both the grammatical categories and the grammatical features are considered. Hence every inflected word in the corpus needs to be split, which makes the tagging process very complex.
2. The number of tags is very large. This leads to increased complexity during POS tagging, which in turn reduces tagging accuracy.
3. For the simple POS level, a tagset with just the grammatical categories, excluding grammatical features, is more than sufficient. Also, we needed a tagset with minimum tags without compromising tagging efficiency.
The proposed tagset consists of 30 tags, in which inflections were not considered. Compound tags are used only for nouns (NNC) and proper nouns (NNPC). There are 5 tags for nouns, 1 for pronouns, 8 for verbs, 3 for punctuation, 2 for numbers, and 1 each for adjective, adverb, conjunction, echo, reduplication, intensifier, postposition, emphasis, determiner, complimentizer, and question word. The tags in the proposed tagset are described in Table 5.1 with an example for each.
Table 5.1: POS Tagset for Kannada

S.N  TAG      DESCRIPTION                EXAMPLE [ENGLISH]
1    <NN>     NOUN                       ಹನಡನಗ (huDuga) [boy]
2    <NNC>    COMPOUND NOUN              ಎತ್ತತನಫೆಂಡಿ (ettina banDi)
3    <NNP>    PROPER NOUN                ಕರ್ಾಟಕ (Karnataka)
4    <NNPC>   COMPOUND PROPER NOUN       ಅಫನ್ಲ್್ಱೆಂ (Abdul Kalam)
5    <CRD>    CARDINALS                  ಒೆಂದನ (ondu) [one]
6    <ORD>    ORDINALS                   ಒೆಂದನೆ (ondane) [first]
7    <PRP>    PRONOUN                    ಅವನನ (avanu) [he]
8    <ADJ>    ADJECTIVE                  ಸನೆಂದಯವದ (sundaravAda) [beautiful]
9    <ADV>    ADVERB                     ಳೋಗದಲ್ಲಿ (vEgadalli) [speedily]
10   <VNAJ>   VERB NONFINITE ADJECTIVE   ಫೆಂದಹನಡನಗ (banda huDuga) [the boy who came]
11   <VNAV>   VERB NONFINITE ADVERB      ಫೆಂದನಹ ೋದನನ (bandu hOdanu) [came and went back]
12   <VBG>    VERBAL GERUND              ಫಯನವ (baruva) [coming]
13   <VBC>    VERB CONTINGENT            ಫಯನಳೋನನ (baruvEnu) [might come]
14   <VF>     VERB FINITE                ಫರದೆನನ (baredenu) [wrote]
15   <VAX>    AUXILIARY VERB             ನೆ ೋಡನತ್ತತದೆ್ೋನೆ (nODuttiddEne) [was + ing]
16   <VINT>   VERB INFINITE              ನೆ ೋಡಲ್ನ (nODalu) [to see]
17   <CNJ>    CONJUNCTION                ಭತನತ (mattu) [and]
18   <CVB>    CONDITIONAL VERB           ನೆ ೋಡಿದರ (nODidare) [if seen]
19   <QW>     QUESTION WORDS             ಏಕೆ (Eke) [why]
20   <COM>    COMPLIMENTIZER             ಎೆಂಫ (enba)
21   <NNQ>    QUANTITY NOUN              ಸವಲ್ಪ (swalpa) [little]
22   <PPO>    POST POSITIONS             ತನಕ (tanaka) [till]
23   <DET>    DETERMINERS                ಆ (A)
24   <INT>    INTENSIFIER                ತನೆಂಬ (tunbA) [very]
25   <ECH>    ECHO WORDS                 ಅಪ್ಪಪತಪ್ಪಪ (appi tappi) [by mistake]
26   <EMP>    EMPHASIS                   ಮತರ (matra) [only]
27   <COMM>   COMMA                      ,
28   <DOT>    DOT                        .
29   <QM>     QUESTION MARK              ?
30   <RDW>    REDUPLICATION WORDS        ಟಟ (paTa paTa) [continuously]
5.2.1 Description of Tags in the POS Tag set
NN (Noun)
The tag NN is used for common nouns (general nouns) without differentiating them
based on the grammatical information.
NNC (Compound Noun)
Nouns that are compound are tagged using the tag NNC.
NNP (Proper Nouns)
The tag NNP tags the proper nouns.
NNPC (Compound Proper Nouns)
Compound proper nouns are tagged using the tag NNPC.
ORD (Ordinal)
Expressions denoting ordinals will be tagged as ORD.
CRD (Cardinal)
The tag CRD tags the cardinals (numbers) in the language.
PRP (Pronoun)
All pronouns are tagged using the tag PRP.
ADJ (Adjective)
All adjectives in the language will be tagged as ADJ.
ADV (Adverb)
ADV tag tags the adverbs in the language. This tag is used only for manner adverbs.
VNAJ (Verb Non-finite Adjective)
VNAJ tag tags the Verb Non-finite Adjective in the language.
VNAV (Verb Non-finite Adverb)
VNAV tag tags the Verb Non-finite Adverb in the language.
VBG (Verbal Gerund)
All Verbal Gerund in the language will be tagged as VBG.
VF (Verb Finite)
VF tag is used to tag the finite verbs in the language.
VAX (Auxiliary Verb)
VAX tags the auxiliary verbs in the language.
VINT (Verb Infinite)
VINT tags the Verb Infinite in the language. This is generally preceded by auxiliary
verbs or finite verbs.
CVB (Conditional Verb)
CVB tags the Conditional Verb in the language.
VBC (Verbal Contingent)
The VBC tag marks the verbal contingent, a special characteristic of the Kannada language which is not present in other South Dravidian languages like Tamil, Telugu, and Malayalam.
CNJ (Conjuncts, both coordinating and subordinating)
The tag CNJ can be used for tagging coordinating and subordinating conjuncts.
QW (Question Words)
The question words in the language will be tagged as QW.
COM (Complimentizer)
COM tags the Complimentizer in the language.
NNQ (Quantity Noun)
NNQ tags the Quantity Noun in the language.
PPO (Postposition)
All the Indian languages have the phenomenon of postpositions. Postpositions are
tagged using the tag PPO.
DET (Determiners)
The tag DET tags the determiners in the language.
INT (Intensifier)
Intensifier is used for intensifying adjectives or adverbs in a language.
EMP (Emphasis)
The tag EMP tags the Emphasis words in the language.
COMM (Comma)
The tag COMM tags the comma in a sentence.
DOT (Dot)
The tag DOT tags the dots (period) in a sentence.
QM (Question Mark)
The question marks in the language are tagged using the tag QM.
5.2.2 Alignment and Comparison of the Proposed Kannada Tagset with the TDIL Tagset
As noted earlier, the TDIL tagset is the common unified XML-based POS schema for Indian languages based on W3C Internationalization, developed to take into account the NLP requirements for Web-based services in Indian languages. Table 5.2 shows the alignment and comparison of the proposed Amrita Kannada tagset with the TDIL tagset.
Table 5.2: Alignment and comparison of the Amrita Kannada tagset with the TDIL tagset

TDIL Tagset for Dravidian Languages           Amrita Tagset for Kannada
Sl.No  Category        Label                  Sl.No  Category          Label
1      Noun            N                      1      Noun              N
1.1    Common          NN                     1.1    Common            NN
                                              1.1.1  Compound          NNC
1.2    Proper          NNP                    1.2    Proper            NNP
                                              1.2.1  Compound          NNPC
1.3    Nloc            NST                    merged with Post Position (PPO)
2      Pronoun         PR                     2      Pronoun           PRP
2.1    Personal        PRP
2.2    Reflexive       PRF
2.3    Relative        PRL
2.4    Reciprocal      PRC
2.5    Wh-word         PRQ
3      Demonstrative   DM
3.1    Deictic         DMD                    3      Determiner        DET
3.2    Relative        DMR
3.3    Wh-word         DMQ                    4      Question Words    QW
4      Verb            V                      5      Verb              V
4.1    Main            V                      5.1    Main              V
4.1.1  Finite          VF                     5.1.1  Finite            VF
4.1.2  Non-Finite      VNF                    5.1.2  Verbal Gerund     VBG
4.1.3  Infinitive      VINF                   5.1.3  Infinitive        VINF
4.2    Verbal          VN                     5.2    Verbal            VN
                                              5.2.1  Verb Nonfinite Adjective  VNAJ
                                              5.2.2  Verb Nonfinite Adverb     VNAV
4.3    Auxiliary       VAUX                   5.3    Auxiliary         VAX
                                              5.4    Contingent        VBC
5      Adjective       JJ                     6      Adjective         ADJ
6      Adverb          RB                     7      Adverb            ADV
7      Postposition    PSP                    8      Postposition      PPO
8      Conjunction     CC                     9      Conjunction       CNJ
8.1    Co-ordinator    CCD
8.2    Subordinator    CCS
8.2.1  Quotative       UT                     10     Complimentizer    COM
9      Particles       RP                     11     Emphasis          EMP
9.1    Default         RP
9.2    Classifier      CL
9.3    Interjection    INJ
9.4    Intensifier     INTF                   12     Intensifier       INT
9.5    Negation        NEG
10     Quantifiers     QT                     13     Quantifiers       Q
10.1   General         QTF                    13.1   Quantifier Noun   NNQ
10.2   Cardinals       QTC                    13.2   Cardinals         CRD
10.3   Ordinals        QTO                    13.3   Ordinals          ORD
11     Residuals       RD                     14     Residuals
11.1   Foreign word    RDF
11.2   Symbol          SYM                    14.1   Question Mark     QM
11.3   Punctuation     PUNC                   14.2   Comma             COMM
11.4   Unknown         UNK                    14.3   Dot               DOT
11.5   Echowords       ECH                    15     Echo Words        ECH
                                              16     Reduplication Words  RDW
5.3 SUPPORT VECTOR MACHINE BASED TAGGERS
Generally, tagging is required to be as accurate and as efficient as possible. But, certainly, there is a trade-off between these two desirable properties, because obtaining higher accuracy relies on processing more and more information, digging deeper and deeper into it. However, sometimes, depending on the kind of application, a loss in efficiency may be acceptable in order to obtain more precise results, or, the other way around, a slight loss in accuracy may be tolerated in favor of tagging speed.
Moreover, some languages have a richer morphology than others, requiring the tagger to take into account a bigger set of feature patterns [157]. The tagset size and ambiguity rate may also vary from language to language and from problem to problem. Besides, if little data is available for training, the proportion of unknown words may be huge. Sometimes, morphological analyzers can be utilized to reduce the degree of ambiguity when facing unknown words. Thus, a sequential tagger should be flexible with respect to the amount of information utilized and the context shape. Another very interesting property required of taggers is portability.
The SVMTool is intended to comply with all the requirements of modern NLP technology by combining simplicity, flexibility, robustness, portability, and efficiency with state-of-the-art accuracy. This is achieved by working in the SVM learning framework and by offering NLP researchers a highly customizable sequential tagger generator. The SVM-based tagger is robust and flexible in feature modelling and trains efficiently with very few parameters to tune. It can also tag thousands of words per second, which makes it practical for real NLP applications.
5.3.1 Problem Setting
Binarizing the classification problem and feature codification are the two important
steps in the problem setting.
5.3.1.1 Binarizing the Classification Problem
Tagging a word in context is a multi-class classification problem [154]. Since SVMs are binary classifiers, the problem must be binarized before applying them. The proposed POS tagger model applies a simple one-per-class binarization, i.e., an SVM is trained for every POS tag in order to distinguish between examples of that class and all the rest. When tagging a word, the most confident tag according to the predictions of all binary SVMs is selected.
However, not all training examples are considered for all classes. Instead, a dictionary is extracted from the training corpus with all possible tags for each word, and when considering the occurrence of a training word 'w' tagged as 't_i', this example is used as a positive example for class t_i and a negative example for all other classes 't_j' appearing as possible tags for 'w' in the dictionary. This avoids the generation of excessive and irrelevant negative examples and makes the training step faster.
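The dictionary-restricted binarization can be sketched as follows; `build_binary_examples` is a hypothetical helper name, and the toy corpus is illustrative:

```python
# Sketch of one-per-class binarization with dictionary-restricted negatives:
# an occurrence of word w tagged t_i is a positive example for t_i and a
# negative example only for the other tags listed for w in the dictionary.
def build_binary_examples(tagged_corpus):
    # dictionary: every tag seen for each word in the training corpus
    dictionary = {}
    for word, tag in tagged_corpus:
        dictionary.setdefault(word, set()).add(tag)
    examples = {}  # tag -> list of (word, +1 or -1)
    for word, tag in tagged_corpus:
        examples.setdefault(tag, []).append((word, +1))
        for other in dictionary[word] - {tag}:
            examples.setdefault(other, []).append((word, -1))
    return dictionary, examples

corpus = [("batti", "NN"), ("batti", "VF"), ("mattu", "CNJ")]
dic, ex = build_binary_examples(corpus)
```

Note that the unambiguous word "mattu" never generates a negative example, which is exactly the saving described above.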
5.3.1.2 Feature Codifications
Each example (event) is represented using the local context of the word for which the system will determine a tag (output decision). This local context, together with local information such as capitalization and affixes of the current token, helps the system make a decision even if the token has not been encountered during training. The proposed model considers a centered window of seven tokens, in which some basic and n-gram patterns are evaluated to form binary features such as "previous word is the", "two preceding tags are DET NN", etc. Each of the individual tags of an ambiguity class is also taken as a binary feature of the form "following word may be a NN". Therefore, with ambiguity classes and "maybe's", the proposed model avoids the two-pass solution, in which an initial first-pass tagging is performed in order to have the right contexts disambiguated for the second pass. Explicit n-gram features are also not strictly necessary in the SVM approach, because polynomial kernels account for the combination of features.
Additional features have been used to deal with the problem of unknown words. Features appearing fewer than a certain cut-off number of times may be ignored for the sake of robustness. Table 5.3 shows the rich feature set used in the proposed experiment.
Table 5.3: Feature Pattern Set

word features         w-3, w-2, w-1, w0, w+1, w+2, w+3
PoS features          p-3, p-2, p-1, p0, p+1, p+2, p+3
ambiguity classes     a0, a1, a2, a3
maybe's               m0, m1, m2, m3
word bigrams          (w-2, w-1), (w-1, w+1), (w-1, w0), (w0, w+1), (w+1, w+2)
PoS bigrams           (p-2, p-1), (p-1, a+1), (a+1, a+2)
word trigrams         (w-2, w-1, w0), (w-2, w-1, w+1), (w-1, w0, w+1),
                      (w-1, w+1, w+2), (w0, w+1, w+2)
PoS trigrams          (p-2, p-1, a0), (p-2, p-1, a+1), (p-1, a0, a+1), (p-1, a+1, a+2)
sentence info         punctuation ('.', '?', '!')
prefixes              s1, s1s2, s1s2s3, s1s2s3s4
suffixes              sn, sn-1sn, sn-2sn-1sn, sn-3sn-2sn-1sn
binary word features  initial upper case, all upper case, no initial capital letter(s),
                      all lower case, contains a (period / number / hyphen ...)
word length           integer
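A simplified sketch of how binary features might be extracted from the centred seven-token window, covering only a small subset of the patterns in Table 5.3; the feature-string format is an assumption of this sketch, not the SVMTool's internal encoding:

```python
# Sketch: binary features from a centred window of seven tokens.
# tags[j] is None for positions not yet disambiguated.
def extract_features(words, tags, i):
    pad_w = lambda j: words[j] if 0 <= j < len(words) else "<pad>"
    pad_t = lambda j: tags[j] if 0 <= j < len(tags) and tags[j] else "<?>"
    feats = set()
    for off in range(-3, 4):                        # word unigrams w-3 .. w+3
        feats.add(f"w{off}={pad_w(i + off)}")
    for off in range(-3, 0):                        # POS of already-tagged left context
        feats.add(f"p{off}={pad_t(i + off)}")
    feats.add(f"bigram={pad_w(i - 1)}_{pad_w(i)}")  # one word bigram (w-1, w0)
    feats.add(f"suffix={pad_w(i)[-2:]}")            # a lexicalized suffix feature
    return feats

f = extract_features(["avanu", "bandanu"], ["PRP", None], 1)
```

Each such string would become one dimension of the binary feature vector fed to the SVMs.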
5.4 PROPOSED POS TAGGING ALGORITHM AND ARCHITECTURE
A simple algorithm which shows the basic structure of the proposed architecture for
POS tagging is defined below.
Step 1: Take input text.
Step 2: Tokenize the input text (pre-editing).
Step 3: Manual tagging.
Step 4: Train the corpus.
Step 5: Tagging using SVM.
Step 5.1: Search for the token in the lexicon.
Step 5.2: If found, assign the appropriate tag from the lexicon.
Step 5.3: If not found, tag it using the SVM probabilities.
Step 6: Get the tagged output text.
Step 7: Insert the new words into the lexicon.
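The steps above can be sketched as a single tagging loop; `classify` stands in for the trained SVM decision and is an assumption of this sketch:

```python
# Sketch of the tagging loop: lexicon lookup first, classifier fallback for
# unknown tokens, then growing the lexicon with the newly tagged words.
def tag_text(tokens, lexicon, classify):
    tagged = []
    for tok in tokens:
        if tok in lexicon:                 # Step 5.1 / 5.2: known word
            tagged.append((tok, lexicon[tok]))
        else:                              # Step 5.3: unknown word via SVM
            tag = classify(tok)
            tagged.append((tok, tag))
            lexicon[tok] = tag             # Step 7: insert new word into lexicon
    return tagged

lex = {"mattu": "CNJ"}
out = tag_text(["mattu", "huDuga"], lex, classify=lambda t: "NN")
```

The real system keeps per-word tag distributions rather than a single tag, but the control flow is the same.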
Fig. 5.1 shows the proposed architecture for POS tagging. The architecture consists of different modules based on their functionalities. The SVMTool package consists of three main components, namely the model learner (SVMTlearn), the tagger (SVMTagger), and the evaluator (SVMTeval). Prior to tagging, SVM models (weight vectors and biases) are learned from a training corpus using the SVMTlearn component. Different models were learned for different strategies. Then, at tagging time, using the SVMTagger component, one may choose the tagging strategy that is most suitable for the purpose of the tagging. Finally, given a correctly annotated corpus and the corresponding SVMTagger-predicted annotation, the SVMTeval component is used to evaluate the performance of the tagger. The functionality of each module is explained briefly as follows:
Fig. 5.1: Architecture for POS tagging
5.4.1 Tokenize
Untagged sentences are downloaded from Kannada newspapers and commercial websites. The input text is then converted into a column format suitable for the SVM tool [154].
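A minimal sketch of this pre-editing step, assuming simple whitespace-and-punctuation tokenization (the actual module may tokenize differently):

```python
# Sketch: raw text to the one-token-per-line column format expected by the
# SVMTool components (sentence punctuation split off as its own token).
import re

def to_column_format(text):
    tokens = re.findall(r"[^\s.,?!]+|[.,?!]", text)
    return "\n".join(tokens)

cols = to_column_format("avanu bandanu .")
```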
5.4.2 Manual Tagging
The tokenizing module produces a corpus of untagged tokens, which was then tagged manually using the proposed AMRITA Kannada tagset. Training data must be in column format, i.e., a one-token-per-line corpus in a sentence-by-sentence fashion. The column separator is the blank space. The token is expected to be the first column of the line, and the tag to predict takes the second column in the output; the rest of the line may contain additional information. Initially, around 10,000 words were tagged manually. A sample of the training data is as follows:
ಕೆ ಡಗಿಗೆ<NNP>
ಆಗಮಿಸಿಯನವ<VBG>
ಬಿಜೆಪ್ಪ<NNP>
ಹಿರಿಮ<ADJ>
ರ್ಮಕ<NN>
ಱಲ್<NNPC>
ಕೃಷಣ<NNPC>
ಅಡ್ವಣಿ<NNPC>
ಅವಯನ<PRP>
ಶನಕರವಯ<NNP>
ಕೆ ಡವ<NNPC>
ಉಡನಗೆ<NNPC>
ತ್ ಟನು<VAX>
ಸೆಂಬರಮಿಸಿದರ<VAX>
, <COMM>
ಅವಯ<PRP>
ತ್ತು<NN>
ಕಭಱ<NNP>
ಅವಯ <PRP>
ಸ್ಥ್<ADV>
ನೋಡಿದಯನ<VF>
. <DOT>
5.4.3 Corpus Training
The tagged corpus is trained using the SVMTlearn component of the SVM-light tool [154]. SVM-light is an implementation of Vapnik's SVMs in C, developed by Thorsten Joachims. Training with the SVM produces a dictionary with a merged model and its feature set (lexicon). This lexicon contains the tagged Kannada words with their groups of tag probability parameters. SVMTlearn behavior is easily adjusted through a configuration file. The usage of the SVMTlearn module is shown below.
Usage: SVMTlearn [options] <config-file>
Options:
- V verbose 0: none verbose
1: low verbose [default]
2: medium verbose
3: high verbose
Example: SVMTlearn -V 2 config.svmt
These are the currently available config-file options:
Sliding window: The size of the sliding window for feature extraction can be adjusted. Also, the core position in which the word to disambiguate is located may be selected. By default, the window size is 5 and the core position is 2, starting at 0. For the proposed POS tagger, features were extracted using the default window size of 5.
Feature set: Three different kinds of feature types can be collected from the sliding
window:
Word features: word-form n-grams; usually unigrams, bigrams, and trigrams suffice. Also, whether the last word of the sentence is a punctuation mark ('.', '?', '!') is important.
POS features: annotated parts-of-speech and ambiguity-class n-grams, and "maybe's". As for words, considering unigrams, bigrams, and trigrams is enough. The ambiguity class for a certain word determines which POS tags are possible. A "maybe" states, for a certain word, that a certain POS may be possible, i.e., it belongs to the word's ambiguity class.
Lexicalized features: including prefixes and suffixes, capitalization, hyphenization, and similar information related to a word form. In Kannada, capitalization does not apply; all the other features are used.
Default feature sets for every model are defined.
Feature filtering: The feature space can be kept to a convenient size. Smaller models allow for higher efficiency. By default, no more than 100,000 dimensions are used. Also, features appearing fewer than n times can be discarded, which indeed helps the system both to fight overfitting and to exhibit higher accuracy. By default, features appearing just once are ignored.
SVM model compression: Weight vector components lower than a given threshold can be filtered out of the resulting SVM models, thus enhancing efficiency by decreasing the model size while still preserving the accuracy level. This is an interesting behavior of SVM models currently under study. In fact, when up to 70% of the weight components are discarded, accuracy remains stable, and it is not until 95% of the components are discarded that accuracy falls below the current state of the art (97.0%-97.2%).
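The compression step amounts to thresholding the weight vector; a minimal sketch with toy weights:

```python
# Sketch of SVM model compression: drop weight components whose magnitude
# falls below a threshold, shrinking the model while preserving accuracy.
def compress(weights, threshold):
    """weights: feature -> weight. Keep only components >= threshold in magnitude."""
    return {f: w for f, w in weights.items() if abs(w) >= threshold}

# Illustrative model fragment (feature names follow no real SVMTool format).
model = {"w0=mattu": 1.8, "suffix=nu": 0.004, "p-1=PRP": -0.9}
small = compress(model, threshold=0.01)
```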
C parameter tuning: In order to deal with noise and outliers in the training data, the soft-margin version of the SVM learning algorithm allows the misclassification of certain training examples when maximizing the margin. This balance can be automatically adjusted by optimizing the value of the C parameter of the SVMs. A local maximum is found by exploring accuracy on a validation set for different C values at progressively shorter intervals.
Dictionary repairing: The lexicon extracted from the training corpus can be automatically repaired either based on frequency heuristics or on a list of corrections supplied by the user. This makes the tagger robust to corpus errors. A heuristic threshold may also be specified in order to treat as tagging errors those (word_x, tag_y) pairs occurring less than a certain proportion of times relative to the number of occurrences of word_x. For example, with a threshold of 0.001, (run, DT) would be considered an error if the word run had been seen at least 1000 times but tagged as 'DT' only once. This kind of heuristic dictionary repairing does not harm the tagger performance; on the contrary, it may help a lot.
The repairing list must comply with the SVMTool dictionary format, i.e.:
<word> <N occurrences> <N possible tags> {<tag(i)> <N occurrences(i)>}
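The frequency heuristic can be sketched as follows; `repair` is a hypothetical helper name, and the counts are illustrative:

```python
# Sketch of frequency-based dictionary repairing: a (word, tag) pair is
# treated as a corpus error when it accounts for less than the threshold
# proportion of the word's total occurrences.
def repair(tag_counts, threshold):
    """tag_counts: word -> {tag: count}. Returns word -> surviving tag set."""
    repaired = {}
    for word, counts in tag_counts.items():
        total = sum(counts.values())
        repaired[word] = {t for t, c in counts.items() if c / total >= threshold}
    return repaired

# "run" seen 2000 times, only once as DT: 1/2000 < 0.001, so DT is dropped.
cleaned = repair({"run": {"VB": 1999, "DT": 1}}, threshold=0.001)
```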
Ambiguous classes: The list of POS tags presenting ambiguity is, by default, automatically extracted from the corpus but, if available, this knowledge can be made explicit. This acts in favor of the system's robustness.
Open classes: The list of POS tags with which an unknown word may be labelled is also, by default, automatically determined.
Backup lexicon: A morphological lexicon containing words that are not present in the training corpus may be provided; it can also be provided at tagging time. This file must comply with the SVMTool dictionary format.
Models
Five different kinds of models have been implemented. Models 0, 1, and 2 differ only in the features they consider. Models 3 and 4 are just like Model 0 with respect to feature extraction, but examples are selected in a different manner. Model 3 is for unsupervised learning: given an unlabelled corpus and a dictionary, at learning time it can only count on knowing the ambiguity class, and the POS information only for unambiguous words. Model 4 achieves robustness by simulating unknown words in the learning context at training time.
Model 0: This is the default model. The unseen context remains ambiguous. It was designed with the one-pass on-line tagging scheme in mind, i.e., the tagger goes either left-to-right or right-to-left making decisions, so past decisions feed future ones in the form of POS features. At tagging time, only the parts of speech of already disambiguated tokens are considered. For the unseen context, ambiguity classes are considered instead.
Model 1: This model considers the unseen context already disambiguated in a previous step. It is thus intended for a second pass, revisiting and correcting already tagged text.
Model 2: This model does not consider POS features at all for the unseen context. It is designed to work in a first pass, requiring Model 1 to review the tagging results in a second pass.
Model 3: The training is based on the role of unambiguous words. Linear classifiers are trained with examples of unambiguous words extracted from an unannotated corpus, so less POS information is available. The only additional information required is a morpho-syntactic dictionary.
Model 4: Errors caused by unknown words at tagging time punish the system severely. To reduce this problem, during learning some words are artificially marked as unknown in order to learn a more realistic model. The process is very simple: the corpus is divided into a number of folds, and before samples are extracted from each fold, a dictionary is generated from the remaining folds. Thus, the words appearing in a fold but not in the rest are unknown words to the learner.
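Model 4's unknown-word simulation can be sketched as follows; the folds below are toy word lists (the real implementation works on tagged corpora):

```python
# Sketch: split the corpus into folds; words appearing in a fold but nowhere
# in the remaining folds are treated as unknown when extracting examples
# from that fold.
def unknown_words_per_fold(folds):
    result = []
    for i, fold in enumerate(folds):
        seen_elsewhere = set()
        for j, other in enumerate(folds):
            if j != i:
                seen_elsewhere.update(other)
        result.append({w for w in fold if w not in seen_elsewhere})
    return result

folds = [["avanu", "bandanu"], ["avanu", "hooyitu"]]
unk = unknown_words_per_fold(folds)
```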
Table 5.4: Tag Counts

S.N  TAG      Counts     S.N  TAG      Counts
1    <NN>     18660      16   <VBC>    4
2    <NNC>    3692       17   <CNJ>    1308
3    <NNP>    2992       18   <CVB>    340
4    <NNPC>   2584       19   <QW>     920
5    <ORD>    58         20   <COM>    994
6    <CRD>    1064       21   <NNQ>    2364
7    <PRP>    1732       22   <PPO>    310
8    <ADJ>    1192       23   <DET>    898
9    <ADV>    412        24   <INT>    474
10   <VNAJ>   10         25   <ECH>    64
11   <VNAV>   32         26   <EMP>    292
12   <VBG>    1260       27   <COMM>   1278
13   <VF>     3758       28   <DOT>    4273
14   <VAX>    3410       29   <QM>     262
15   <VINT>   54         30   <RDW>    106
The proposed POS tagger was trained with a tagged Kannada corpus of 54,000 words. Table 5.4 shows the frequency of each tag in the corpus.
5.4.4 Tagging using SVM Tagger
Given a text corpus with one token per line and the path to a previously learned SVM model, including the automatically generated dictionary, SVMTagger performs the POS tagging of a sequence of words. The tagging is performed on-line, based on a sliding window which gives a view of the feature context to be considered at every decision. A word not currently available in the lexicon may still be tagged using the learned probability parameters (weights and biases).
The SVMTagger component works on standard input/output. It processes a one-token-per-line corpus in a sentence-by-sentence fashion. The token is expected to be the first column of the line, and the predicted tag will take the second column in the output. The rest of the line remains unchanged.
Example of input to the SVMTagger:
ಕೆ ಡಗಿಗೆ
ಆಗಮಿಸಿಯನವ
ಬಿಜೆಪ್ಪ
ಹಿರಿಮ
ರ್ಮಕ
ಱಲ್
ಕೃಷಣ
ಅಡ್ವಣಿ
ಅವಯನ
ಶನಕರವಯ
ಕೆ ಡವ
ಉಡನಗೆ
ತ್ ಟನು
ಸೆಂಬರಮಿಸಿದರ
,
ಅವಯ
ತ್ತು
ಕಭಱ
ಅವಯ
ಸ್ಥ್
ನೋಡಿದಯನ
.
SVMTagger Expected Output:
ಕೆ ಡಗಿಗೆ<NNP>
ಆಗಮಿಸಿಯನವ<VBG>
ಬಿಜೆಪ್ಪ<NNP>
ಹಿರಿಮ<ADJ>
ರ್ಮಕ<NN>
ಱಲ್<NNPC>
ಕೃಷಣ<NNPC>
ಅಡ್ವಣಿ<NNPC>
ಅವಯನ<PRP>
ಶನಕರವಯ<NNP>
ಕೆ ಡವ<NNPC>
ಉಡನಗೆ<NNPC>
ತ್ ಟನು<VAX>
ಸೆಂಬರಮಿಸಿದರ<VAX>
, <COMM>
ಅವಯ<PRP>
ತ್ತು<NN>
ಕಭಱ<NNP>
ಅವಯ <PRP>
ಸ್ಥ್<ADV>
ನೋಡಿದಯನ<VF>
. <DOT>
Usage : SVMTagger [options] <model>
Options:
- T <strategy>
0: one-pass (default; requires Model 0)
1: two-passes (revisiting results and relabelling; requires Model 2 and Model 1)
2: one-pass (robust against unknown words; requires Model 0 and Model 2)
3: one-pass (unsupervised learning models; requires Model 3)
4: one-pass (very robust against unknown words; requires Model 4)
5: one-pass (sentence-level likelihood; requires Model 0)
6: one-pass (robust sentence-level likelihood; requires Model 4)
- S <direction>
LR: left-to-right (default)
RL: right-to-left
LRL: both left-to-right and right-to-left
GLRL: both left-to-right and right-to-left
(global assignment, only applicable under a sentence level tagging strategy)
- K <n> weight filtering threshold for known words (default is 0)
- U <n> weight filtering threshold for unknown words (default is 0)
- Z <n> number of beams in beam search, only applicable under sentence-level strategies
(default is disabled)
- R <n> dynamic beam search ratio, only applicable under sentence-level strategies
(default is disabled)
- F <n> softmax function to transform SVM scores into probabilities (default is 1)
0: do_nothing
1: ln(e^score(i) / [sum:1<=j<=N:[e^score(j)]])
- A predictions for all possible parts-of-speech are returned
- B <backup_lexicon>
- L <lemmae_lexicon>
- EOS enable usage of the end-of-sentence string '<s>' (disabled by default; [!.?] used instead)
- V <verbose> -> 0: none verbose
1: low verbose
2: medium verbose
3: high verbose
4: very high verbose
Model: model location (path/name)
(Name as declared in the config-file NAME)
Example: SVMTagger -T 0 KANNADA <in.txt > out.txt
Strategies
Seven different tagging strategies have been implemented so far:
Strategy 0: This is the default strategy. It makes use of Model 0 in a greedy, on-line, one-pass fashion.
Strategy 1: As a first attempt to achieve robustness against error propagation, this strategy works in two passes, in an on-line greedy way. It uses Model 2 in the first pass and Model 1 in the second. In other words, in the first pass the unseen morpho-syntactic context remains ambiguous, while in the second pass the tag predicted in the first pass is available even for unseen tokens and is used as a feature.
Strategy 2: This strategy tries to achieve robustness by using two models at tagging time, namely Model 0 and Model 2. When all the words in the context are known it uses Model 0; otherwise it makes use of Model 2.
Strategy 3: It uses Model 3, again in a greedy and on-line manner. This unsupervised
learning strategy is still under experimentation.
Strategy 4: It simply uses Model 4 as is, in an on-line greedy fashion.
Strategy 5: Moving towards a more robust scheme, this strategy performs sentence-level tagging by means of dynamic programming (the Viterbi algorithm). It uses Model 0.
Strategy 6: Like Strategy 5, this strategy performs sentence-level tagging, this time applying Model 4.
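The sentence-level search behind Strategies 5 and 6 can be illustrated with a minimal Viterbi sketch. This is a simplification for exposition only: in SVMTool the per-token scores come from softmax-normalised SVM outputs (option -F 1), whereas here `score` is an arbitrary user-supplied log-score function.

```python
def viterbi(tokens, tags, score):
    """Pick the highest-scoring tag sequence for a whole sentence.

    score(prev_tag, tag, i) returns a log-score for assigning `tag` to
    tokens[i] after `prev_tag` (prev_tag is None at the first position).
    """
    n = len(tokens)
    # best[i][t] = best log-score of a tag sequence for tokens[:i+1] ending in t
    best = [{t: float("-inf") for t in tags} for _ in range(n)]
    back = [{} for _ in range(n)]
    for t in tags:
        best[0][t] = score(None, t, 0)
    for i in range(1, n):
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p] + score(p, t, i))
            best[i][t] = best[i - 1][prev] + score(prev, t, i)
            back[i][t] = prev
    # Recover the best sequence by following back-pointers from the end.
    last = max(tags, key=lambda t: best[n - 1][t])
    seq = [last]
    for i in range(n - 1, 0, -1):
        seq.append(back[i][seq[-1]])
    return list(reversed(seq))
```

Unlike the greedy one-pass strategies, this search may revise an early low-confidence decision if it enables a much better tagging of the rest of the sentence.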
5.4.5 Evaluation and Results
SVMTeval is used to evaluate the performance of the proposed POS tagger system. Given an SVMTagger-predicted tagging output and the corresponding gold standard, SVMTeval evaluates the performance in terms of accuracy. Based on the morphological dictionary created at training time by SVMTlearn, results may be presented for different sets of words, such as known vs. unknown words and ambiguous vs. unambiguous words. A different view of these same results can also be taken from the ambiguity-class perspective, i.e. words sharing the same kind of ambiguity may be considered together. Words sharing the same degree of disambiguation complexity, determined by the size of their ambiguity classes, can also be grouped.
Usage: SVMTeval [mode] <model> <gold> <pred>
- mode: 0 - complete report (everything)
1 - overall accuracy only [default]
2 - accuracy of known vs unknown words
3 - accuracy per level of ambiguity
4 - accuracy per kind of ambiguity
5 - accuracy per class
- model: model name
- gold: correct tagging file
- pred: predicted tagging file
Tagging accuracy is considered as the evaluation criterion for performance evaluation. The criterion used for evaluating the proposed models is based on exact tagging: a tag is considered correct only if it exactly matches the one in the gold standard. POS tagging accuracy is calculated using the following formula:
TA = Number of correctly tagged words / Number of test words.
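The exact-match criterion above amounts to a position-by-position comparison of the gold and predicted files. The following sketch illustrates it (a hypothetical helper, not part of SVMTeval; the word&lt;TAG&gt; line format follows the examples earlier in the chapter):

```python
def tagging_accuracy(gold_lines, pred_lines):
    """TA = correctly tagged words / total test words.

    Each line is a "word<TAG>" token; a prediction counts as correct
    only if the line matches the gold line exactly.
    """
    correct = total = 0
    for gold, pred in zip(gold_lines, pred_lines):
        total += 1
        if gold.strip() == pred.strip():
            correct += 1
    return correct / total if total else 0.0
```

SVMTeval additionally breaks this single number down by known/unknown words and by ambiguity class, as described above.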
The SVM model starts with a limited lexicon, and the system shows low accuracy when POS tagging is performed with it. The size of the training data was therefore increased step by step with more pre-edited data, which was tagged with the created SVM model and then manually corrected wherever tags were wrong. This process was repeated with more data to grow the lexicon. As the lexicon reached 54,000 words, the performance of the proposed system rose to 86%. Lexicon size versus accuracy is shown in Table 5.5; accuracy increased with the number of words in the lexicon.
Table 5.5: POS Tagger Result
No. of Words in Lexicon POS tagger Accuracy
10,000 48%
25,000 66%
54,000 86%
5.5 SUMMARY
The development of a POS tagger for a language with limited electronic resources can be very demanding. The proposed work presented a part-of-speech tagger for the Kannada language modelled using an SVM kernel. Prior to this development, a linguistic study was conducted to determine the internal linguistic structure of the Kannada sentence, and a suitable Kannada tagset was developed. A corpus of approximately fifty-four thousand words was used for training and testing the accuracy of the tagger. The experiments showed that accuracy increased with the number of words in the corpus. The proposed part-of-speech tagger was used to develop a syntactic parser model for the Kannada language. The
POS tagger can also be used to develop bilingual MT from Kannada to other Dravidian languages.
5.6 PUBLICATIONS
Antony P J and Soman K P: "Kernel Based Part of Speech Tagger for Kannada", International Conference on Machine Learning and Cybernetics 2010 (ICMLC 2010, China). The paper is archived in IEEE Xplore and the IEEE CS Digital Library.