CHAPTER 3
THEORETICAL BACKGROUND
3.1 GENERAL
Natural Language Processing (NLP) research has a long tradition in European
countries. It has taken giant leaps in the last decade with the advent of efficient
machine learning algorithms and the creation of large annotated corpora for various
languages. In a country like India, where more than a thousand languages are in use,
NLP is especially relevant. However, NLP research in Indian languages has mainly
focused on the development of rule-based techniques due to the lack of annotated
corpora. The prerequisites for developing NLP applications in the Tamil language are
the availability of speech corpora, annotated text corpora, parallel corpora, lexical
resources and computational models. The sparseness of these resources for Tamil is
one of the major reasons for the slow growth of NLP work in Tamil. Like the
processing of other languages, Tamil language processing involves morphological
analysis, syntax analysis and semantic analysis.
3.1.1 Tamil Language
Tamil belongs to the southern branch of the Dravidian languages, a family of around
twenty-six languages native to the Indian subcontinent. It flourished in India as a
language with rich literature during the Sangam period (300 BCE to 300 CE). Tamil
scholars categorize the history of the language into three periods: Old Tamil (300 BCE -
700 CE), Middle Tamil (700 - 1600) and Modern Tamil (1600 - present). Epigraphic
attestation of Tamil begins with rock inscriptions from the 3rd century BCE, written in
Tamil-Brahmi, an adapted form of the Brahmi script. The earliest extant literary text is
the தொல்காப்பியம் (tholkAppiyam), a work on grammar and poetics which describes
the language of the classical period. The Sangam literature contains about 50,000 lines
of poetry in 2381 poems attributed to 473 poets, including many women poets [9].
During the Modern Tamil period, i.e., in the early 20th century, the Chaste Tamil
Movement called for the removal of all Sanskrit and other foreign elements from
Tamil. It received support from Dravidian parties and nationalists who supported Tamil
independence. This led to the replacement of a significant number of Sanskrit loan
words by Tamil equivalents. An important factor specific to Tamil is the existence of
two main varieties of the language, colloquial Tamil and formal Tamil செந்தமிழ்
(sewthamiz), which are sufficiently divergent that the language is classed as diglossic.
Colloquial Tamil is used for most spoken communication, while formal Tamil is
spoken in a restricted number of high (formal) contexts, such as lectures and news
bulletins, and is also used in writing. The two varieties differ in terms of their lexis,
morphology, and segmental phonology.
Tamil is the official language of the Indian state of Tamil Nadu and one of the 22
languages under the Eighth Schedule of the Constitution of India. It is also one of the
official languages of the Union Territories of Puducherry and the Andaman & Nicobar
Islands, and of Sri Lanka and Singapore, and is widely spoken in Malaysia. Tamil
became the first legally recognized classical language of India in the year 2004 [9].
3.1.2 Tamil Grammar
Traditional Tamil grammar consists of five parts, namely எழுத்து (ezuththu), சொல்
(sol), பொருள் (poruL), யாப்பு (yAppu) and அணி (aNi). Of these, the last two are
applicable mostly to poetry. Table 3.1 gives additional information about these parts.
The tholkAppiyam (தொல்காப்பியம்) is the oldest work on the grammar of the Tamil
language [132].

Table 3.1 Tamil Grammar

Division             Meaning   Main grammar books
எழுத்து (ezuththu)     Letter    தொல்காப்பியம் (tholkAppiyam), நன்னூல் (wannUl)
சொல் (sol)            Word      தொல்காப்பியம் (tholkAppiyam), நன்னூல் (wannUl)
பொருள் (poruL)        Meaning   தொல்காப்பியம் (tholkAppiyam)
யாப்பு (yAppu)         Form      யாப்பெருங்கலாக்காரிகை (yApperungkalAkkArikai)
அணி (aNi)            Method    தண்டியலங்காரம் (thaNdiyalangkAram)
3.1.3 Tamil Characters
Tamil is written using a script called the vattEzuththu. The Tamil script has twelve
vowels uyirezuththu (உயிரெழுத்து) "soul-letters", eighteen consonants meyyezuththu
(மெய்யெழுத்து) "body-letters" and one character, the Aythaezuththu (ஆய்த எழுத்து)
"the hermaphrodite letter", which is classified in Tamil grammar as being neither a
consonant nor a vowel, though often considered part of the vowel set. The script,
however, is syllabic and not alphabetic.
The complete script, therefore, consists of the thirty-one letters in their independent
form and an additional 216 compound letters, representing a total of 247 combinations.
These compound letters are formed by adding a vowel marker to the consonant. The
details of the Tamil vowels are given in Table 3.2. Some vowels require the basic shape
of the consonant to be altered in a way that is specific to that vowel. Others are written
by adding a vowel-specific suffix to the consonant, yet others a prefix, and finally some
vowels require adding both a prefix and a suffix to the consonant. Table 3.3 lists vowel
letters across the top and consonant letters along the side, the combination of which
gives all Tamil compound (uyirmei) letters.
In every case, the vowel marker is different from the standalone character for the
vowel. The Tamil script is written from left to right. Vowels are also called the 'life'
(uyir) or 'soul' letters. Tamil vowels are divided into short (kuril) and long (nedil)
vowels, five of each type, and two diphthongs. Tamil compound (uyirmei) letters are
formed by adding a vowel marker to the consonant. There are 216 compound letters in
Tamil. The Tamil transliteration scheme is given in Appendix A.

Table 3.2 Tamil Vowels

Short vowel   Long vowel   Diphthong
அ (a)         ஆ (A)        ஐ (ai)
இ (i)         ஈ (I)        ஔ (au)
உ (u)         ஊ (U)
எ (e)         ஏ (E)
ஒ (o)         ஓ (O)
Table 3.3 Tamil Compound Letters

Cons ↓ / Vow →   அ (a)  ஆ (A)  இ (i)  ஈ (I)  உ (u)  ஊ (U)  எ (e)  ஏ (E)  ஐ (ai)  ஒ (o)  ஓ (O)  ஔ (au)
க் (k)     க    கா    கி    கீ    கு    கூ    கெ    கே    கை    கொ    கோ    கௌ
ங் (ng)    ங    ஙா    ஙி    ஙீ    ஙு    ஙூ    ஙெ    ஙே    ஙை    ஙொ    ஙோ    ஙௌ
ச் (s)     ச    சா    சி    சீ    சு    சூ    செ    சே    சை    சொ    சோ    சௌ
ஞ் (nj)    ஞ    ஞா    ஞி    ஞீ    ஞு    ஞூ    ஞெ    ஞே    ஞை    ஞொ    ஞோ    ஞௌ
ட் (d)     ட    டா    டி    டீ    டு    டூ    டெ    டே    டை    டொ    டோ    டௌ
ண் (N)     ண    ணா    ணி    ணீ    ணு    ணூ    ணெ    ணே    ணை    ணொ    ணோ    ணௌ
த் (th)    த    தா    தி    தீ    து    தூ    தெ    தே    தை    தொ    தோ    தௌ
ந் (w)     ந    நா    நி    நீ    நு    நூ    நெ    நே    நை    நொ    நோ    நௌ
ப் (p)     ப    பா    பி    பீ    பு    பூ    பெ    பே    பை    பொ    போ    பௌ
ம் (m)     ம    மா    மி    மீ    மு    மூ    மெ    மே    மை    மொ    மோ    மௌ
ய் (y)     ய    யா    யி    யீ    யு    யூ    யெ    யே    யை    யொ    யோ    யௌ
ர் (r)     ர    ரா    ரி    ரீ    ரு    ரூ    ரெ    ரே    ரை    ரொ    ரோ    ரௌ
ல் (l)     ல    லா    லி    லீ    லு    லூ    லெ    லே    லை    லொ    லோ    லௌ
வ் (v)     வ    வா    வி    வீ    வு    வூ    வெ    வே    வை    வொ    வோ    வௌ
ழ் (z)     ழ    ழா    ழி    ழீ    ழு    ழூ    ழெ    ழே    ழை    ழொ    ழோ    ழௌ
ள் (L)     ள    ளா    ளி    ளீ    ளு    ளூ    ளெ    ளே    ளை    ளொ    ளோ    ளௌ
ற் (R)     ற    றா    றி    றீ    று    றூ    றெ    றே    றை    றொ    றோ    றௌ
ன் (n)     ன    னா    னி    னீ    னு    னூ    னெ    னே    னை    னொ    னோ    னௌ
3.1.4 Morphological Richness of Tamil Language
Tamil is an agglutinative language. Tamil words consist of a lexical root to which one
or more affixes are attached. Mostly, Tamil affixes are suffixes. Tamil suffixes can be
derivational suffixes, which change either the part of speech of the word or its
meaning, or inflectional suffixes, which mark categories such as person, number, mood,
tense, etc. There is no absolute limit on the length and extent of agglutination, which
can lead to long words with a large number of suffixes, whose translation would
require several words or a sentence in English.
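The agglutination described above can be sketched as repeated suffix stripping against a morpheme inventory. The toy inventory, root lexicon and transliterated forms below (vIdu "house", kaL plural, il locative) are illustrative assumptions, not part of a real analyzer; a full segmenter must also handle sandhi changes.

```python
# A minimal sketch of peeling an agglutinated Tamil word into morphemes.
# The suffix list and root lexicon are tiny illustrative assumptions.

SUFFIXES = ["kaL", "il", "ai", "ukku"]   # plural and a few case markers
ROOTS = {"vIdu", "maram", "pU"}          # a toy root lexicon

def segment(word):
    """Greedily strip known suffixes from the right until a root remains."""
    morphemes = []
    while word not in ROOTS:
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix):
                morphemes.insert(0, suffix)
                word = word[: -len(suffix)]
                break
        else:
            return None  # no analysis found with this inventory
    return [word] + morphemes

print(segment("vIdukaLil"))   # vIdu + kaL + il : "in the houses"
```

The greedy right-to-left strategy mirrors the fact that Tamil inflection is almost entirely suffixing; real analyzers replace the greedy loop with finite-state machinery.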
Tamil is a morphologically rich language in which most of the morphemes combine
with the root words in the form of suffixes. Suffixes are used to perform the functions
of cases, the plural marker, euphonic increments and postpositions in the noun class.
Tamil verbs are inflected for tense, person, number, gender, mood and voice. Other
features of the Tamil language are the use of the plural for honorific nouns, frequent
echo words, and the null-subject property, i.e. not all sentences have an explicit
subject. Computationally, each root word can take more than ten thousand inflected
word forms, out of which only a few hundred will exist in a typical corpus [129].
Tamil is a consistently head-final language. The verb comes at the end of the clause,
with a typical word order of Subject-Object-Verb (SOV). However, Tamil allows the
word order to be changed, making it a relatively word-order-free language. Subject-
verb agreement is required for the grammaticality of a Tamil sentence.
3.1.5 Challenges in Tamil NLP
There are many issues that make Tamil language processing difficult. These relate to
the problems of representation and interpretation. Language computing requires a
precise representation of context. Natural languages are highly ambiguous and vague,
so achieving such representations is very hard. The various sources of ambiguity in
the Tamil language are described below.
3.1.5.1 Ambiguity in Morphemes
Tamil morphemes are ambiguous both in their grammatical category and in the
position they take in word construction.
Ambiguity in morpheme's grammatical category
A morpheme can have more than one grammatical category. For example, the
morphemes athu, ana and thu can occur either as nominalizing suffixes or as third-
person neuter suffixes.
Ambiguity in morpheme's position
The position at which a morpheme is suffixed also leads to ambiguity. Table 3.4
gives a few examples of morphemes and their possible grammatical features.
Table 3.4 Ambiguity in Morpheme's Position

Morpheme     Possible grammatical features
அ (a)        Infinitive; relative participle
கல் (kal)    Root; nominal suffix
ஆக (Aka)    Benefactive; adverbial suffix
த் (th)      Sandhi; tense
செய் (sey)   Root; auxiliary root

3.1.5.2 Ambiguity in Word Class
A word may be ambiguous in its part of speech or word class, i.e. it may have more
than one interpretation. For example, the word படி "padi" can take the noun class or
the verb class. The ambiguity has to be disambiguated by referring to the context.
padi - study (V) or step (N)
கீழே படி உள்ளது, கவனமாக செல்லவும். - step (N)
தினமும் பாடங்களை படி என ஆசிரியை மாணவர்களிடம் கூறினார். - study (V)
3.1.5.3 Ambiguity in Word Sense
Even though a word belongs to a specific grammatical category, it may be ambiguous
in sense. For instance, the Tamil word காட்டு "kAddu" has 11 senses in the noun class
and 18 senses in the verb class [kiriyAvin tharkAla Tamil akarAthi, 2006] [133]. For
example, the following sentence has two different meanings.
அவன் பாடல் கேட்டான்.
(He heard the song.)
(He asked for the song.)
3.1.5.4 Ambiguity in Sentence
A sentence may be ambiguous even if its words are not. For example, the following
sentence has two interpretations.
"நான் ஒரு அழகான பெண்ணையும் ஆணையும் பார்த்தேன்"
(I saw a beautiful woman and a man.)
(I saw a beautiful woman and a beautiful man.)
The words are not ambiguous, but the sentence is.
3.2 MORPHOLOGY
Morphology is the field within linguistics that studies the internal structure of words.
While words are generally accepted as being the smallest units of syntax, it is clear that
in most (if not all) languages, words can be related to other words by rules.
Morphology studies such patterns of word formation within and across languages, and
attempts to formulate rules that model the knowledge of the speakers of those
languages.
3.2.1 Types of Morphology
Morphology is traditionally classified into three main divisions: inflection, derivation,
and compounding. Inflectional morphology deals with the formation of the different
forms in the paradigm of a lexeme. In inflectional morphology, words undergo a
change in their form to express some grammatical function, but their syntactic category
remains unchanged. Many inflectional features appear on words to express agreement
(in person, number, and gender) as well as to express case, aspect, mood, and tense.
Derivational morphology is concerned with "the creation of a new lexeme via
affixation". In English, the process of word formation through derivation involves
two types of affixation: prefixation, which means placing a morpheme before a word,
e.g. un-happy; and suffixation, which means placing a morpheme after a word, e.g.
happi-ness. Derivation poses a problem for translation in that "not all derived words
have straightforward compositional translations as derived words". In English, for
example, the same meaning can be expressed by different affixes. Moreover, the same
affix can have more than one meaning. This can be exemplified by the suffix -er: it
can express the agent, as in player and singer, but it can also describe instruments, as
in mixer and cooker. In this way an affix can have a range of equivalents in the target
language, and the attempt to set up one-to-one correspondences for affixes would be
greatly misguided.
Compounding morphology is the process of forming a new word by combining two or
more words. Compounding involves combining complete word forms into a single
compound form; dog catcher is therefore a compound, because both dog and catcher
are complete word forms in their own right before the compounding process is applied,
and are subsequently treated as one form. An important notion in compounding is that
of the head: a compound noun is divided into a head and one or more modifiers. For
instance, in the compound noun watchtower, tower is the head and watch the modifier.
3.2.2 Lexemes
A lexical database is organized around lexemes, which include all the morphemes of a
language. A lexeme is conventionally listed in a dictionary as a separate entry.
Generally, a lexeme corresponds to a set of forms taken by a single word. For example,
in the English language, run, runs, ran and running are forms of the same lexeme "run".
3.2.3 Lemma and Stems
A lemma in morphology is the canonical form of a lexeme. In lexicography, this unit is
usually the citation form or headword by which it is indexed. Lemmas have special
significance in highly inflected languages such as Tamil. The process of determining
the lemma for a given word is called lemmatization.
A stem is the part of the word that never changes even when morphologically
inflected, whilst a lemma is the base form of the word. For example, for the word
"produced", the lemma is "produce", but the stem is "produc-", because there are
related words such as "production". In linguistic analysis, the stem is defined more
generally as the analyzed base form from which all inflected forms can be formed.
When phonology is taken into account, defining the stem as the unchangeable part of
the word is not useful, as can be seen in the phonological forms of the words in the
preceding example: "produced" vs. "production".
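The stem/lemma distinction above can be made concrete with a toy sketch built around the "produced" example from the text. The suffix table and the lemma lookup are illustrative assumptions; real lemmatizers rely on full lexicons and morphological rules.

```python
# Toy illustration of stemming vs. lemmatization.
# The ending list and lemma dictionary are small illustrative assumptions.

def stem(word):
    """Crude stemmer: chop off a known ending, keep the remainder."""
    for ending in ("tion", "ed", "ing", "s"):
        if word.endswith(ending):
            return word[: -len(ending)]
    return word

LEMMAS = {"produced": "produce", "production": "produce", "ran": "run"}

def lemmatize(word):
    """Lemmatization maps a form to its canonical dictionary entry."""
    return LEMMAS.get(word, word)

print(stem("produced"), stem("production"))   # both reduce to "produc"
print(lemmatize("produced"))                  # "produce"
```

Note how the stemmer yields the shared, non-word string "produc", while the lemmatizer returns the citation form "produce", exactly the contrast drawn in the paragraph above.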
3.2.4 Inflections and Word Forms
Given the notion of a lexeme, it is possible to distinguish two kinds of morphological
rules. Some morphological rules relate different forms of the same lexeme, while other
rules relate two different lexemes. Rules of the first kind are called inflectional rules,
while those of the second kind are called word-formation rules. The English plural, as
illustrated by dog and dogs, is an inflectional rule; compounds like dog-catcher or
dishwasher provide examples of word-formation rules. Informally, word-formation
rules form "new words" (that is, new lexemes), while inflection rules yield variant
forms of the "same" word (lexeme).
Derivation involves affixing bound (non-independent) forms to existing lexemes,
whereby the addition of the affix derives a new lexeme. One example of derivation is
the word independent, which is derived from the word dependent by prefixing it with
the derivational prefix in-, while dependent itself is derived from the verb depend.
3.2.5 Morphemes and Types
A morpheme is the minimal meaningful unit in a word. The concepts of word and
morpheme are different: a morpheme may or may not stand alone. One or several
morphemes compose a word.
• Free morphemes, like town and dog, can appear with other lexemes (as in town
hall or dog house) or they can stand alone, i.e. "free".
• Bound morphemes like "un-" appear only together with other morphemes to
form a lexeme. Bound morphemes in general tend to be prefixes and suffixes.
• Derivational morphemes can be added to a word to create (derive) another
word: for example, the addition of "-ness" to "happy" gives "happiness". They
carry semantic information.
• Inflectional morphemes modify a word's tense, number, aspect, and so on,
without deriving a new word or moving it to a new grammatical category (for
example, the morpheme "dog" written with the plural marker morpheme "-s"
becomes "dogs"). They carry grammatical information.
Agglutinative languages have words containing several morphemes that are always
clearly differentiable from one another in that each morpheme represents only one
grammatical meaning and the boundaries between those morphemes are easily
demarcated. The bound morphemes are affixes, and they may be individually
identified. Agglutinative languages tend to have a high number of morphemes per
word, and their morphology is highly regular [134].
3.2.6 Allomorphs
One of the largest sources of complexity in morphology is that a one-to-one
correspondence between meaning and form scarcely applies to every case in the
language. English has word-form pairs like ship/ships, ox/oxen, goose/geese, and
sheep/sheep, where the difference between the singular and the plural is signaled in a
way that departs from the regular pattern, or is not signaled at all. Even cases
considered "regular", with the final -s, are not so simple; the -s in dogs is not
pronounced the same way as the -s in cats, and in a plural like dishes an "extra" vowel
appears before the -s. These cases, where the same distinction is effected by
alternative forms of a "word", are called allomorphs.
3.2.7 Morpho-Phonemics
Morpho-phonology or morpho-phonemics studies the phonemic changes that occur
when one morpheme is inflected with another. This phenomenon is called 'sandhi' in
Tamil. Sandhi occurs very frequently in Tamil and must be taken care of when
building morphological analyzers or generators. For instance, the noun root 'pU'
(flower), when pluralized, becomes 'pUkkaL' instead of 'pUkaL': when the root is
monosyllabic, ending with a long vowel, and the following morpheme starts with a
vallinam consonant, the consonant geminates. Sandhi changes can occur between two
morphemes or between two words. Although sandhi rules mostly depend on the
phonemic properties of the morphemes, they sometimes depend on the grammatical
relations of the words on which they operate. For example, gemination may be invalid
when the words are in a subject-predicate relation, but valid if they are in a modifier-
modified relation. Sandhi changes can occur in four different ways: gemination,
insertion, deletion and modification. Gemination is a case of insertion where the
vallinam consonants double themselves. In general, insertion happens when new
characters are inserted between words or morphemes. Deletion happens when existing
characters at the end of the first word or the start of the second word are dropped.
Modification happens when characters get replaced by other characters with close
phonological properties.
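The gemination rule described above can be sketched as a single string-rewriting function. The transliteration classes below (long vowels, vallinam initials) are simplifying assumptions, and the monosyllable condition is deliberately omitted; full sandhi handling needs many more rules.

```python
# Rough sketch of one sandhi rule: a vallinam consonant geminates after
# a root ending in a long vowel (e.g. pU + kaL -> pUkkaL). The character
# classes are illustrative assumptions in a simple transliteration.

LONG_VOWELS = set("AIUEO")   # long vowels in this transliteration
VALLINAM = set("kcTtpR")     # hard (vallinam) consonant initials

def join(root, suffix):
    """Concatenate root + suffix, doubling a vallinam initial after a
    root-final long vowel (the monosyllable check is omitted here)."""
    if root[-1] in LONG_VOWELS and suffix[0] in VALLINAM:
        return root + suffix[0] + suffix
    return root + suffix

print(join("pU", "kaL"))   # pUkkaL, "flowers"
```

Writing sandhi as rewrite rules like this is what makes it natural to compile into the finite-state transducers commonly used in morphological analyzers.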
3.2.8 Morphotactics
The morphemes of a word cannot occur in random order. In every language, there are
well-defined ways to sequence the morphemes. The morphemes can be divided into a
number of classes, and the permissible morpheme sequences are normally defined in
terms of sequences of these classes. For instance, in Tamil, the case morphemes follow
the number morpheme in noun constructions: பூக்களை (பூ_கள்_ஐ) is valid, whereas
the other way around, பூஐக்கள் (பூ_ஐ_கள்), is invalid. The order in which
morphemes follow each other is strictly governed by a set of rules called morphotactics.
In Tamil, these rules play a very important role in word construction and derivation, as
the language is agglutinative and words are formed by long sequences of morphemes.
Rules of morphotactics also serve to disambiguate morphemes that occur in more than
one class of morphemes. The analyzer uses these rules to identify the structure of
words.
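The number-before-case constraint above can be sketched as a tiny finite-state check over morpheme classes. The class inventory and allowed transitions are illustrative assumptions covering only this noun pattern, not a full Tamil morphotactic grammar.

```python
# A minimal finite-state check of morpheme-class order for nouns:
# root, then an optional number marker, then an optional case marker.
# The transition table is an illustrative assumption.

TRANSITIONS = {
    "root":   {"number", "case", "end"},
    "number": {"case", "end"},
    "case":   {"end"},
}

def valid_order(classes):
    """Return True if a sequence of morpheme classes obeys the rules."""
    if not classes or classes[0] != "root":
        return False
    for prev, nxt in zip(classes, classes[1:] + ["end"]):
        if nxt not in TRANSITIONS.get(prev, set()):
            return False
    return True

print(valid_order(["root", "number", "case"]))   # pU + kaL + ai : True
print(valid_order(["root", "case", "number"]))   # *pU + ai + kaL : False
```

Because the rules are expressed over classes rather than individual morphemes, the same check also disambiguates a morpheme that belongs to more than one class: only the class assignment that yields a valid path survives.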
3.3 MACHINE LEARNING FOR NLP
3.3.1 Machine Learning
Machine learning deals with techniques that allow computers to automatically learn
and make accurate predictions based on past observations. The major focus of machine
learning is to extract information from data automatically, using computational and
statistical methods. Machine learning techniques are being used to solve various tasks
of natural language processing, including speech recognition, document categorization,
document segmentation, part-of-speech tagging, word-sense disambiguation, named
entity recognition, parsing, machine translation and transliteration.

There are two main tasks involved in machine learning: learning/training and
prediction. The system is given a set of examples called the training data. The primary
goal is to automatically acquire an effective and accurate model from the training data.
The training data provides the domain knowledge, i.e., the characteristics of the
domain from which the examples are drawn. This is a typical task for inductive
learning and is usually called concept learning or learning from examples. The larger
the amount of training data, the better the model will usually be. The second phase of
machine learning is prediction, wherein a set of inputs is mapped to the corresponding
target values. The main challenge of machine learning is to create a model with good
prediction performance on the test data, i.e., a model that generalizes well to unknown
data.
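The two phases just described can be illustrated with a deliberately simple learner. The nearest-centroid classifier below stands in for the more powerful learners discussed later, and the numeric data is invented for illustration; it plays the role of feature vectors extracted from text.

```python
# Minimal sketch of the learning and prediction phases using a
# nearest-centroid classifier. The data points are invented examples.

def train(examples):
    """Learning phase: compute one centroid per class from training data."""
    sums, counts = {}, {}
    for features, label in examples:
        counts[label] = counts.get(label, 0) + 1
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, v in enumerate(features):
            acc[i] += v
    return {lbl: [v / counts[lbl] for v in acc] for lbl, acc in sums.items()}

def predict(model, features):
    """Prediction phase: label a new input by its nearest class centroid."""
    def dist2(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))
    return min(model, key=lambda lbl: dist2(model[lbl]))

data = [([1.0, 1.0], "pos"), ([2.0, 1.0], "pos"),
        ([8.0, 9.0], "neg"), ([9.0, 8.0], "neg")]
model = train(data)
print(predict(model, [1.5, 1.2]))   # pos
```

The split between `train` (which consumes labelled examples and produces a model) and `predict` (which maps new inputs to target values) is exactly the two-phase structure described above, whatever the underlying learner.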
Machine learning algorithms are categorized based on the desired outcome of
the algorithm. Types of machine learning algorithms include supervised learning,
unsupervised learning, semi-supervised learning, reinforcement learning and
transduction [135]. In supervised learning the target function is completely specified
by the training data: there is a label associated with each example. If the label is
discrete, the task is called classification; for real-valued labels, the task becomes a
regression problem. Based on the examples in the training data, the label for a new
case is predicted. Hence, learning is not only a question of remembering but also of
generalizing to unseen cases. Any change in the learning system can be seen as
acquiring some kind of knowledge, so, depending on what the system learns, the
learning is categorized as
• Model learning: The system learns to predict values of an unknown function.
This is called prediction and is a task well known in statistics. If the function is
discrete, the task is called classification. For continuous-valued functions it is called
regression.
• Concept learning: The systems acquire descriptions of concepts or classes of
objects.
• Explanation-based learning: Using traces (explanations) of correct (or incorrect)
performances, the system learns rules for more efficient performance of unseen
tasks.
• Case-based (exemplar-based) learning: The system memorizes cases (exemplars)
of correctly classified data or correct performances and learns how to use them (e.g.
by making analogies) to process unseen data.
3.3.2 Support Vector Machines
The Support Vector Machine (SVM) represents a new approach to supervised pattern
classification which has been successfully applied to a wide range of pattern
recognition problems. As a supervised machine learning technique, SVM is attractive
because it has an extremely well-developed learning theory behind it: statistical
learning theory. SVM is based on strong mathematical foundations and results in
simple yet very powerful algorithms. A simple way to build a binary classifier is to
construct a hyperplane separating class members from non-members in the input
space. Unfortunately, most real-world problems involve non-separable data, for which
there does not exist a hyperplane that successfully separates the class members from
the non-class members in the training set. One solution to the inseparability is to map
the data into a higher-dimensional space and define a separating hyperplane in that
space. This higher-dimensional space is called the feature space, as opposed to the
input space occupied by the training examples. With an appropriately chosen feature
space of sufficient dimensionality, any consistent training set can be made separable.
However, translating the training set into a higher-dimensional space incurs both
computational and learning-theoretic costs. Representing the feature vectors
corresponding to the training set can be extremely expensive in terms of memory and
time. Furthermore, artificially separating the data in this way exposes the learning
system to the risk of finding trivial solutions that overfit the data.
Support Vector Machines elegantly sidestep both difficulties [136]. Support
vector machines avoid overfitting by choosing a specific hyperplane among the many
that can separate the data in the feature space: the maximum margin hyperplane, the
hyperplane that maximizes the minimum distance from the hyperplane to the closest
training point. The maximum margin hyperplane can be represented as a linear
combination of training points. Consequently, the decision function for classifying
points with respect to the hyperplane only involves dot products between points.
Furthermore, the algorithm that finds a separating hyperplane in the feature space can
be stated entirely in terms of vectors in the input space and dot products in the feature
space. Thus, a support vector machine can locate a separating hyperplane in the
feature space and classify points in that space without ever representing the space
explicitly, simply by defining a function, called a kernel function, that plays the role of
the dot product in the feature space. This technique avoids the computational burden
of explicitly representing the feature vectors.
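The kernel idea can be made concrete with a small numerical check, not part of the original text. For two-dimensional inputs, the degree-2 polynomial kernel K(x, z) = (x · z)² has the explicit feature map φ(x) = (x₁², √2·x₁x₂, x₂²); the sketch below verifies that the dot product in the feature space equals the kernel computed entirely in the input space, so φ never has to be formed.

```python
# Degree-2 polynomial kernel vs. its explicit feature map (2-d inputs).
# Verifies phi(x) . phi(z) == (x . z)^2 on one example pair.

import math

def phi(x):
    """Explicit degree-2 feature map for a 2-d input vector."""
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def kernel(x, z):
    """The same quantity computed directly in the input space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

x, z = (1.0, 2.0), (3.0, 0.5)
lhs = sum(a * b for a, b in zip(phi(x), phi(z)))
print(lhs, kernel(x, z))   # both equal (1*3 + 2*0.5)^2 = 16.0
```

Here the feature space is only 3-dimensional, so φ is cheap; the point of the kernel trick is that the same identity holds for kernels whose feature spaces are enormous or infinite, where forming φ(x) would be impossible.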
Another appealing feature of SVM classification is the sparseness of its
representation of the decision boundary. The location of the separating hyperplane in
the feature space is specified via real-valued weights on the training set examples.
Those training examples that lie far away from the hyperplane do not participate in its
specification and therefore receive weights of zero. Only the training examples that lie
close to the decision boundary between the two classes receive nonzero weights. These
training examples are called the support vectors, since removing them would change
the location of the separating hyperplane. It is believed that all the information about
classification in the training samples can be represented by these Support vectors. In a
typical case, the number of support vectors is quite small compared to the total number
of training samples.
The maximum margin criterion allows the SVM to select among multiple candidate
hyperplanes. However, for many data sets, the SVM may not be able to find any
separating hyperplane at all, either because the kernel function is inappropriate for the
training data or because the data contains mislabeled examples. The latter problem can
be addressed by using a soft margin that accepts some misclassifications of the training
examples. A soft margin can be obtained in two different ways. The first is to add a
constant factor to the kernel function output whenever the given input vectors are
identical. The second is to define a priori an upper bound on the size of the training set
weights. In either case, the magnitude of the constant factor added to the kernel, or of
the bound tied to the size of the weights, controls the number of training points that the
system misclassifies. The setting of this parameter depends on the specific data at
hand. Completely specifying a support vector machine therefore requires specifying
two parameters: the kernel function and the magnitude of the penalty for violating the
soft margin.
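The soft-margin trade-off can be written as the objective ½‖w‖² + C·Σᵢ max(0, 1 − dᵢ(w·xᵢ − γ)), where the second (hinge-loss) term penalizes margin violations and C is the penalty parameter just discussed. The tiny dataset and the fixed (w, γ) below are illustrative assumptions, chosen only to show how C weighs violations against margin width.

```python
# Evaluate the soft-margin SVM objective for a fixed hyperplane, to show
# how the penalty parameter C scales the cost of margin violations.
# The data points and the chosen (w, gamma) are invented for illustration.

def soft_margin_objective(w, gamma, data, C):
    """0.5*||w||^2 + C * sum of hinge losses over the training data."""
    margin_term = 0.5 * sum(v * v for v in w)
    hinge = sum(
        max(0.0, 1.0 - d * (sum(wi * xi for wi, xi in zip(w, x)) - gamma))
        for x, d in data
    )
    return margin_term + C * hinge

data = [((2.0, 0.0), +1), ((-2.0, 0.0), -1), ((0.5, 0.0), +1)]
w, gamma = (1.0, 0.0), 0.0
for C in (0.1, 10.0):
    print(C, soft_margin_objective(w, gamma, data, C))
```

The third point lies inside the margin, so it contributes a nonzero hinge loss; a large C makes that violation dominate the objective, while a small C lets the margin term dominate, which is exactly the trade-off the parameter controls.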
Thus, a support vector machine finds a nonlinear decision function in the input
space by mapping the data into a higher-dimensional feature space and separating it
there by means of a maximum margin hyperplane. The computational complexity of
the classification operation does not depend on the dimensionality of the feature space,
which can even be infinite. Overfitting is avoided by controlling the margin. The
separating hyperplane is represented sparsely as a linear combination of points: the
system automatically identifies a subset of informative points and uses them to
represent the solution. Finally, the training algorithm solves a simple convex
optimization problem. All these features make SVMs an attractive classification
system.
3.3.3 Geometrical Interpretation of SVM
Typically, the machine is presented with a set of training examples (x_i, y_i), where
the x_i are the real-world data instances and the y_i are the labels indicating which
class each instance belongs to. For the two-class pattern recognition problem, y_i = +1
or y_i = -1. A training example (x_i, y_i) is called positive if y_i = +1 and negative
otherwise. SVMs construct a hyperplane that separates the two classes (this can be
extended to multi-class problems). While doing so, the SVM algorithm tries to achieve
maximum separation between the classes.
Separating the classes with a large margin minimizes a bound on the expected
generalization error [137]. A 'minimum generalization error' means that when new
examples (data points with unknown class values) arrive for classification, the chance
of making an error in predicting the class to which they belong, based on the learned
classifier (hyperplane), should be minimal. Intuitively, such a classifier is one which
achieves the maximum separation margin between the classes. Figure 3.1 illustrates
the concept of 'maximum margin'. The two planes parallel to the classifier which pass
through one or more points in the data set are called 'bounding planes'. The distance
between these bounding planes is called the 'margin', and SVM 'learning' means
finding a hyperplane which maximizes this margin. The points (in the dataset) falling
on the bounding planes are called 'support vectors'. These points play a crucial role in
the theory, and hence the name support vector machines ('machine' here means
algorithm). Vapnik (1998) has shown that if the training vectors are separated without
errors by an optimal hyperplane, the expected error rate on a test sample is bounded
by the ratio of the expected number of support vectors to the number of training
vectors. Since this ratio is independent of the dimension of the problem, if one can
find a small set of support vectors, good generalization is guaranteed [136].
Figure 3.1 Maximum Margin and Support Vectors
In the case where the data points are as shown in Figure 3.2, one may simply minimize
the number of misclassifications whilst maximizing the margin with respect to the
correctly classified examples. In such a case it is said that the SVM training algorithm
allows a training error. There may be another situation wherein the points are clustered
such that the two classes are not linearly separable, as shown in Figure 3.3; that is, if
one tries a linear classifier, it may have to tolerate a large training error. In such cases,
one prefers a non-linear mapping of the data into some higher-dimensional space
called the 'feature space' F, where it is linearly separable. To distinguish between these
two spaces, the original space of data points is called the 'input space'. The hyperplane
in the feature space corresponds to a highly non-linear separating surface in the
original input space; hence the classifier is called a non-linear classifier.
Figure 3.2 Training Errors in Support Vector Machine
Figure 3.3 Non-linear Classifier
The process of mapping the data into a higher-dimensional space involves heavy
computation, especially when the data itself is of high dimension. However, there is no
need to do any explicit mapping to the higher-dimensional space in order to find the
hyperplane classifier; all computations can be done in the input space itself [138].
3.3.4 SVM Formulation
Notation used:

m = number of data points in the training set
n = number of features (variables) in the data
x_i = [x_i1, x_i2, ..., x_in]^T, an n-dimensional vector representing a data point in the 'input space'
d_i = target value of the i-th data point; it takes the value +1 or -1
d = [d_1, d_2, ..., d_m]^T, the vector of target values of the m data points
D = diag(d), the m x m diagonal matrix with d_1, d_2, ..., d_m on its diagonal
w = [w_1, w_2, ..., w_n]^T, the weight vector orthogonal to the hyperplane w_1 x_1 + w_2 x_2 + ... + w_n x_n - γ = 0; γ is a scalar which is generally known as the bias term
A = the m x n matrix whose i-th row is x_i^T; the m x m matrix AA^T, whose (i, j)-th element is x_i^T x_j, is called the linear kernel of the dataset
φ(·) = a nonlinear mapping function that maps an input vector x into a high dimensional feature vector φ(x)
K = the m x m matrix whose (i, j)-th element is φ(x_i)^T φ(x_j); K is called the non-linear kernel of the input dataset
Q = the m x m matrix whose (i, j)-th element is d_i d_j φ(x_i)^T φ(x_j); that is, Q = K .* (d d^T), where .* represents element-wise multiplication
From the geometric point of view, the support vector machine constructs an optimal hyperplane w^T x - γ = 0 between two classes of examples. The free parameters are the weight vector w, which is orthogonal to the hyperplane, and the threshold value γ. The aim is to find maximally separating bounding planes

w^T x - γ = 1
w^T x - γ = -1

such that data points with d = -1 satisfy the constraint w^T x - γ ≤ -1 and data points with d = +1 satisfy w^T x - γ ≥ 1. The perpendicular distance of the bounding plane w^T x - γ = 1 from the origin is |-γ + 1|/||w||, and the perpendicular distance of the bounding plane w^T x - γ = -1 from the origin is |-γ - 1|/||w||. The margin between the optimal hyperplane and either bounding plane is 1/||w||, and so the distance between the bounding planes is 2/||w||. The learning problem is then formulated as an optimization problem, as given below. The 'training of the SVM' consists of finding w and γ, given the matrix of data points A and the corresponding class vector d. Once w and γ are obtained, the decision boundary is w^T x - γ = 0 and the decision function is f(x) = sign(w^T x - γ). That is, for a new point x, the sign of w^T x - γ is assigned as the class value. The problem is easily solved in terms of its Lagrangian dual variables.
Minimize (1/2)||w||^2
subject to d_i (w^T x_i - γ) ≥ 1, i = 1, ..., m.
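The thesis solves this problem through its Lagrangian dual, but a rough feel for the primal can be had from a simple sub-gradient descent on the soft-margin objective. The sketch below is illustrative only; it uses a Pegasos-style primal update (an assumption, not the dual QP described here) on a toy linearly separable set:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: class +1 clustered around (2, 2), class -1 around (-2, -2)
X = np.vstack([rng.normal( 2.0, 0.5, (20, 2)),
               rng.normal(-2.0, 0.5, (20, 2))])
d = np.array([1.0]*20 + [-1.0]*20)
m = len(X)

lam, eta = 0.01, 0.1           # regularization strength and step size
w, gamma = np.zeros(2), 0.0
for _ in range(5000):
    margins = d * (X @ w - gamma)
    viol = margins < 1.0       # points violating d_i (w.x_i - gamma) >= 1
    grad_w = lam * w - (d[viol, None] * X[viol]).sum(axis=0) / m
    grad_g = d[viol].sum() / m
    w -= eta * grad_w
    gamma -= eta * grad_g

pred = np.sign(X @ w - gamma)  # decision function f(x) = sign(w.x - gamma)
print("training accuracy:", (pred == d).mean())
print("margin width 2/||w|| =", 2.0 / np.linalg.norm(w))
```

On this easily separable data the learned w and γ classify every training point correctly, and 2/||w|| gives the width between the bounding planes discussed above.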
3.4 VARIOUS APPROACHES FOR POS TAGGING
There are different approaches to POS tagging. Figure 3.4 shows the different POS tagging models. Most tagging algorithms fall into one of two classes: rule-based taggers or stochastic taggers.
3.4.1 Supervised POS Tagging
The supervised POS tagging models require pre-tagged corpora, which are used during training to learn rule sets, information about the tagset, word-tag frequencies etc. The learning tool generates trained models along with the statistical information. The performance of the models generally increases with the size of the pre-tagged corpus.
Figure 3.4 Classification of POS Tagging Models
(The figure divides POS tagging models into supervised and unsupervised approaches, each further divided into rule based (e.g., Brill), stochastic (n-gram based, maximum likelihood, and hidden Markov models with the Baum-Welch and Viterbi algorithms) and neural models.)
3.4.2 Unsupervised POS Tagging
Unlike the supervised models, the unsupervised POS tagging models do not require a pre-tagged corpus. Instead, they use advanced computational methods like the Baum-Welch algorithm to automatically induce tagsets, transformation rules etc. Based on this information, they either calculate the probabilistic information needed by stochastic taggers or induce the contextual rules needed by rule-based or transformation based systems.
3.4.3 Rule based POS Tagging
The rule based POS tagging models apply a set of hand written rules and use contextual
information to assign POS tags to words in a sentence. These rules are often known as
context frame rules. For example, a context frame rule might say something like:
“If an ambiguous/unknown word X is preceded by a Determiner and followed by a
Noun, tag it as an Adjective.”
On the other hand, the transformation based approaches use a pre-defined set of handcrafted rules as well as automatically induced rules that are generated during training. Some models also use information about capitalization and punctuation, the usefulness of which is largely dependent on the language being tagged. The earliest algorithms for automatically assigning part-of-speech tags were based on a two-stage architecture. The first stage used a dictionary to assign each word a list of potential parts of speech. The second stage used large lists of hand-written disambiguation rules to reduce this list to a single part of speech for each word [139].
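The two-stage idea, together with the context frame rule quoted above, can be sketched as follows (a toy illustration with an invented mini-lexicon, not a description of any particular tagger):

```python
# Stage 1: a dictionary assigns each word its list of potential tags.
lexicon = {
    "the":   ["DET"],
    "light": ["ADJ", "NOUN", "VERB"],   # ambiguous word
    "bulb":  ["NOUN"],
}

def disambiguate(words):
    """Stage 2: apply a hand-written context frame rule to prune tag lists."""
    candidates = [list(lexicon[w]) for w in words]
    tags = []
    for i, cands in enumerate(candidates):
        if len(cands) > 1:
            prev_det  = i > 0 and candidates[i-1] == ["DET"]
            next_noun = i + 1 < len(words) and candidates[i+1] == ["NOUN"]
            # Rule: ambiguous word between a Determiner and a Noun -> Adjective
            if prev_det and next_noun and "ADJ" in cands:
                cands = ["ADJ"]
        tags.append(cands[0])
    return tags

print(disambiguate(["the", "light", "bulb"]))   # ['DET', 'ADJ', 'NOUN']
```

Here "light" starts with three candidate tags, and the single contextual rule is enough to resolve it to ADJ.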
The ENGTWOL [140] tagger is based on the same two-stage architecture,
although both the lexicon and the disambiguation rules are much more sophisticated
than the early algorithms. The ENGTWOL lexicon is based on the two-level
morphology. It has about 56,000 entries for English word stems, counting a word with
multiple parts of speech (e.g. nominal and verbal senses of hit) as separate entries, and
of course not counting inflected and many derived forms. Each entry is annotated with
a set of morphological and syntactic features. In the first stage of the tagger, each word
is run through the two-level lexicon transducer and the entries for all possible parts of
speech are returned.
3.4.4 Stochastic POS Tagging
A stochastic approach involves frequency, probability or statistics. The simplest stochastic approach finds the most frequently used tag for a specific word in the annotated training data and uses this information to tag that word in the unannotated text. The problem with this approach is that it can come up with sequences of tags for sentences that are not acceptable according to the grammar rules of a language.

An alternative to the word frequency approach is the n-gram approach, which calculates the probability of a given sequence of tags. It determines the best tag for a word by calculating the probability that it occurs with the n previous tags, where the value of n is set to 1, 2 or 3 for practical purposes; these are known as the unigram, bigram and trigram models. The most common algorithm for implementing an n-gram approach when tagging a new text is the Viterbi algorithm, a search algorithm that avoids the polynomial expansion of a breadth first search by trimming the search tree at each level using the best m maximum likelihood estimates (MLE), where m represents the number of tags of the following word.
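A minimal Viterbi decoder for a bigram HMM tagger can be sketched as below; all the transition and emission probabilities are invented for illustration, not estimated from any corpus:

```python
# Toy bigram HMM tagger; the probabilities below are hypothetical.
tags = ["DET", "NOUN", "VERB"]
start = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans = {"DET":  {"DET": 0.05, "NOUN": 0.90, "VERB": 0.05},
         "NOUN": {"DET": 0.10, "NOUN": 0.30, "VERB": 0.60},
         "VERB": {"DET": 0.50, "NOUN": 0.40, "VERB": 0.10}}
emit  = {"DET":  {"the": 0.9},
         "NOUN": {"dog": 0.4, "barks": 0.05},
         "VERB": {"barks": 0.5}}
UNK = 1e-4   # emission probability for unseen (word, tag) pairs

def viterbi(words):
    # delta[t] = best path probability ending in tag t; ptr stores back-pointers
    delta = {t: start[t] * emit[t].get(words[0], UNK) for t in tags}
    back = []
    for w in words[1:]:
        prev, delta, ptr = delta, {}, {}
        for t in tags:
            best = max(tags, key=lambda p: prev[p] * trans[p][t])
            delta[t] = prev[best] * trans[best][t] * emit[t].get(w, UNK)
            ptr[t] = best
        back.append(ptr)
    # Follow back-pointers from the best final tag
    best = max(tags, key=lambda t: delta[t])
    path = [best]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["the", "dog", "barks"]))   # ['DET', 'NOUN', 'VERB']
```

At each position only the best predecessor per tag is kept, which is exactly the pruning that keeps the search linear in sentence length.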
The advantages of the statistical approach are:
• It is very robust and can process any input string.
• Training is automatic and very fast.
• It can be retrained for different corpora / tagsets without much effort.
• It is language independent.
• It minimizes human effort and human error.
3.4.5 Other Techniques
Apart from these, a few different approaches for tagging have been developed.
Support Vector Machines: This is a powerful machine learning method used for various applications in NLP and in other areas like bio-informatics, data mining, etc.
Neural Networks: These are potential candidates for the classification task since
they learn abstractions from examples [141].
Decision Trees: These are classification devices based on hierarchical clusters of
questions. They have been used for natural language processing such as POS Tagging.
“Weka” can be used for classifying the ambiguous words [141].
Maximum Entropy Models: These avoid certain problems of statistical
interdependence and have proven successful for tasks such as parsing and POS tagging.
Example-Based Techniques: These techniques find the training instance that is
most similar to the current problem instance and assume the same class for the new
problem instance as for the similar one.
3.5 VARIOUS APPROACHES FOR MORPHOLOGICAL
ANALYZER
3.5.1 Two level Morphological Analysis
Koskenniemi (1985) [26] describes two-level morphology as a “general, language
independent framework which has been implemented for a host of different languages
(Finnish, English, Russian, Swedish, German, Swahili, Danish, Basque, Estonian,
etc.)”. It consists of two representations and one relation.
The surface representation of a word-form:
This is the actual spelling of the final valid word. For example English words
eating and swimming, are both surface representations.
The lexical (also called morphophonemic) representation of a word-form:
This shows a simple concatenation of base forms and tags. Consider the
following examples showing the lexical and surface form of English words.
Lexical Form Surface Form
talk + Verb talk
walk + Verb + 3PSg walks
eat +Verb + Prog eating
swim +Verb + Prog swimming
It may be noted that the lexical representation (or form) is often invariant. In contrast, affixes and bases of the surface form tend to have alternating shapes, as can be seen in the above examples. The same tag "+Verb +Prog" is used with both eat and swim, but swim is realized as swimm in the context of ing, while eat shows no alternation in the context of ing. The rule component consists of rules which map the two representations to each other. Each rule is described through a Finite-State Transducer (FST). Figure 3.5 schematically depicts two-level morphology.
Figure 3.5 Two Level Morphology
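A single two-level rule, such as the consonant doubling that turns swim + ing into swimming, can be approximated in a few lines. This is a toy stand-in for the finite-state transducer that would implement the rule in a real system:

```python
VOWELS = set("aeiou")

def surface(stem, suffix):
    """Map a lexical form stem+suffix to its surface form.
    Rule: double a final consonant after a consonant-vowel-consonant stem
    when the suffix begins with a vowel (swim+ing -> swimming)."""
    if (suffix and suffix[0] in VOWELS and len(stem) >= 3
            and stem[-1] not in VOWELS
            and stem[-2] in VOWELS
            and stem[-3] not in VOWELS):
        stem = stem + stem[-1]
    return stem + suffix

print(surface("swim", "ing"))   # swimming
print(surface("eat",  "ing"))   # eating (no doubling: "ea" is a vowel pair)
print(surface("talk", "ing"))   # talking (no doubling: "lk" is a consonant pair)
```

The same tag sequence thus yields different surface shapes depending on the phonological context, which is precisely what the two-level rules encode.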
3.5.2 Unsupervised Morphological Analyzer
The definition of Unsupervised Learning of Morphology is given below.
“Input: Raw (un-annotated, non-selective) natural language text data.”
“Output: A description of the morphological structure
(there are various levels to be distinguished) of the language of the input text.”
Some approaches have explicit or implicit biases towards certain kinds of languages; they are nevertheless considered unsupervised learning of morphology. Morphology may be narrowly taken to include only derivational and grammatical affixation, where the number of affixes a root may take is finite and the order of affixation may not be permuted. A number of approaches focus on concatenative morphology and compounding only. All the works considered are designed to function on orthographic words, i.e., raw text data in an orthography that segments at the word level.
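The flavour of such unsupervised learning can be conveyed by a crude frequency heuristic: counting word-final substrings in raw text and proposing the most frequent ones as suffix candidates. This is only a toy sketch, far simpler than real unsupervised systems:

```python
from collections import Counter

# Raw, un-annotated input words
words = ["walking", "talking", "eating", "walked", "talked",
         "jumped", "walks", "talks", "eats"]

# Count every word-final substring of length 1..4 as a suffix candidate
candidates = Counter()
for w in words:
    for k in range(1, 5):
        if len(w) > k:                 # keep a non-empty stem
            candidates[w[-k:]] += 1

top = [s for s, _ in candidates.most_common(5)]
print(top)
```

Even on this tiny sample, productive English suffixes such as "ing" and "ed" surface among the most frequent candidates; real systems refine such counts with probabilistic models of stems and signatures.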
3.5.3 Memory based Morphological Analysis
The memory based learning approach models morphological analysis (including compounding) of complex word-forms as a sequence of classification tasks. MBMA (Memory-Based Morphological Analysis) is a memory-based learning system (Stanfill and Waltz, 1986) [142]. Memory-based learning is a class of inductive, supervised machine learning algorithms that learn by storing examples of a task in memory. Computational effort is invested on a "call-by-need" basis when solving new examples (henceforth called instances) of the same task. When a new instance is presented to a memory-based learner, it searches for the best matching instances in memory, according to a task-dependent similarity metric. When it has found the best matches (the nearest neighbors), it transfers their solution (classification, label) to the new instance.
3.5.4 Stemmer based Approach
A stemmer uses a set of rules containing a list of stems and replacement rules to strip off affixes. It is a program oriented approach in which the developer has to specify all possible affixes with replacement rules. The Porter algorithm is one of the most widely used stemming algorithms and it is freely available. The advantage of the stemmer approach is that it is well suited to highly agglutinative languages like the Dravidian languages for creating a Morphological Analyzer and Generator.
3.5.5 Suffix Stripping based Approach
For highly agglutinative languages such as the Dravidian languages, a Morphological Analyzer and Generator can be successfully built using the suffix stripping approach. An advantage of the Dravidian languages is that no prefixes or circumfixes exist for words; words are usually formed by adding suffixes to the root word serially. This property makes them well suited to a suffix stripping based Morphological Analyzer and Generator. Once a suffix is identified, the stem of the whole word can be obtained by removing that suffix and applying the proper orthographic (sandhi) rules. Using a set of dictionaries, such as a stem dictionary and a suffix dictionary, together with morphotactics and sandhi rules, a suffix stripping algorithm can successfully implement a MAG.
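The procedure can be sketched as a longest-suffix-match loop over a small suffix dictionary. The romanized toy forms and the segmentation below are invented for illustration; a real Tamil MAG needs full morphotactics and sandhi handling:

```python
stems = {"pati"}                     # toy stem dictionary (hypothetical root)
# toy suffix dictionary: surface suffix -> grammatical feature
suffixes = {"kkiR": "PRESENT", "aan": "3SG.M", "t": "PAST"}

def analyze(word):
    """Strip suffixes right to left, longest match first, until a stem remains."""
    features = []
    while word not in stems:
        for suf in sorted(suffixes, key=len, reverse=True):
            if word.endswith(suf):
                word = word[:-len(suf)]
                features.append(suffixes[suf])
                break
        else:
            return None              # no analysis found
    return word, list(reversed(features))

# toy segmentation: patikkiRaan = pati + kkiR + aan
print(analyze("patikkiRaan"))
```

The loop peels off "aan" and then "kkiR", stops when the remainder is found in the stem dictionary, and reports the features in left-to-right order, mirroring the serial suffixation described above.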
3.6 VARIOUS APPROACHES IN MACHINE TRANSLATION
From the time the idea of using machines for language translation was first proposed, many different approaches to machine translation have been proposed, implemented and put into use. The main approaches to machine translation are:
• Linguistic or Rule Based Approaches
o Direct Approach
o Interlingua Approach
o Transfer Approach
• Non-Linguistic Approaches
o Dictionary Based Approach
o Corpus Based Approach
o Example Based Approach
o Statistical Approach
• Hybrid Approach
Direct, Interlingua and Transfer approaches are linguistic approaches which require some sort of linguistic knowledge to perform translations, whereas the dictionary based, example based and statistical approaches fall under non-linguistic approaches that don't require any linguistic knowledge to translate the sentences. The hybrid approach is a combination of both linguistic and non-linguistic approaches.
3.6.1 Linguistic or Rule Based Approaches
Rule based approaches require a lot of linguistic knowledge during translation. They use grammar rules and computer programs to analyse the text and determine grammatical information and features for each word in the source language, translating it by replacing each word with a lexicon entry or word that has the same context in the target language. The rule based approach is the principal methodology that was developed in machine translation. Linguistic knowledge is required in order to write the rules for this type of approach, and these rules play a vital role during the different levels of translation. This approach is also called Theory based Machine Translation.
74
The benefit of the rule based machine translation method is that it can deeply examine the sentence at the syntactic and semantic levels. The complications of this method are the prerequisite of vast linguistic knowledge and the very large number of rules needed in order to cover all the features of a language. An advantage of the approach is that the developer has more control over the translations than is the case with corpus-based approaches. The three different approaches that require linguistic knowledge are as follows.
3.6.1.1 Direct Approach
The direct translation approach can be considered the first approach to machine translation. In this type of approach, the machine translation system is designed specifically for one particular pair of languages. There is no need to identify thematic roles and universal concepts in this approach. It involves analysing morphological information, identifying the constituents, reordering the words in the source language according to the word order pattern of the target language, replacing the words in the source language with the target language words using a lexical dictionary of that particular language pair and, as a last step, inflecting the words appropriately to produce the translation. Although this may look as if a lot of work has to be done to produce a translation, each of these steps is simple and can be accomplished easily, in a short span of time. Figure 3.6 illustrates the block diagram of the direct approach to machine translation.

This approach performs only a simple and minimal syntactic and semantic analysis, by which it differs from the other rule based translation systems such as the interlingua and transfer-based approaches. For this reason, the direct approach is considered ad-hoc and is generally found to be unsuitable for machine translation. Table 3.5 describes, with an example, how the sentence "he came late to school yesterday" is translated from English to Tamil using the direct approach.
Figure 3.6 Block Diagram of Direct Approach to Machine Translation

Table 3.5 An Example to Illustrate the Direct Approach
Input Sentence in English: He came late to school yesterday
After Morphological Analysis: He come PAST late to school yesterday
After Constituent Identification: <He><come PAST><late><to school><yesterday>
After Word Reordering: <He><yesterday><to school><late><come PAST>
After Dictionary Lookup: mtd; new;W gs;spf;F neuk; fHpj;J th
After Inflection (the final translated sentence): mtd; new;W gs;spf;F neuk; fHpj;J te;jhd;.
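The pipeline in the table can be mocked up in a few lines. Every resource below (the mini-lexicon, the hard-coded target word order and the romanized Tamil glosses) is an invented placeholder, not a working English-Tamil system:

```python
# Toy direct-approach pipeline; every resource below is a hypothetical stub.
def morph_analysis(words):
    irregular = {"came": ["come", "PAST"]}   # tiny morphological lexicon
    out = []
    for w in words:
        out.extend(irregular.get(w, [w]))
    return out

def reorder(tokens):
    """Reorder English chunks into Tamil-style SOV order (hard-coded here)."""
    order = ["he", "yesterday", "to", "school", "late", "come", "PAST"]
    return sorted(tokens, key=order.index)

def lookup(tokens):
    # Romanized placeholder glosses; "to" is absorbed into the case suffix.
    dictionary = {"he": "avan", "yesterday": "nETRu", "to": "",
                  "school": "paLLikku", "late": "nEram kazhithu",
                  "come": "va", "PAST": "-nt-aan"}
    return [dictionary[t] for t in tokens if dictionary[t]]

tokens = morph_analysis("he came late to school yesterday".split())
print(" ".join(lookup(reorder(tokens))))
```

Each stage corresponds to one row of Table 3.5: morphological analysis, reordering to the target word order, dictionary lookup and a final inflection marker.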
3.6.1.2 Interlingua Approach
The interlingua approach to machine translation mainly aims at transforming texts in the source language into a common representation which is applicable to many languages. Using this representation, the translation of the text into the target language is performed; it should be possible to translate into every language from the same interlingua representation, given the right rules.

The interlingua approach sees machine translation as a two stage process:
1. Analysing and transforming the source language texts into a common language independent representation.
2. Generating the text in the target language from the common language independent form.

The first stage is particular to the source language and doesn't require any knowledge about the target language, whereas the second stage is particular to the target language and doesn't require any knowledge of the source language. The main advantage of the interlingua approach is that it creates an economical multilingual environment that requires only 2n translation components to translate among n languages, whereas the direct approach requires n(n-1) translation systems. Table 3.6 shows the interlingua representation of the sentence, "he will reach the hospital in ambulance".
Table 3.6 An Example for Interlingua Representation
Predicate Reach
Agent Boy (Number: Singular)
Theme Hospital (Number: Singular)
Instrument Ambulance (Number: Singular)
Tense FUTURE
The concepts and relations that are used are the most important aspect of any interlingua-based system. The ontology should be powerful enough that all subtleties of meaning that can be expressed in any language are representable in the interlingua. The interlingua approach can be found more economical when translation is
Interlingua. Interlingua approach can be found more economical when translation is
car
inc
sho
3.6
Th
rep
Int
syn
or
gen
sen
sta
stru
exp
blo
rried out wit
creased, dram
own in the F
6.1.3 Transf
e less deter
presentations
erlingua app
ntactic or sem
semantic dep
The transf
neration. In
ntence struct
ge, transfor
ucture to tha
presses the t
ock diagram
th three or m
matically. T
igure 3.7.
fer Approac
rmined trans
s of the sour
proach. The
mantic infor
pending on t
fer model
the analysi
ture and the
mations are
at of the targ
tense, numb
of the transf
more languag
This is clearl
Figure 3.7
ch
sfer approac
rce and targe
e transfer ap
rmation of th
the need.
involves th
is stage, the
e constituent
e applied to
get language
ber, gender e
fer approach
77
ges but also
ly evident f
7 The Vauqu
ch has three
et language t
pproach can
he text. In ge
hree stages
e source lan
ts of the sen
the source
e. The gene
etc. in the ta
h.
the complex
from the Va
uois Triangl
e stages, co
texts, instead
n be done e
eneral, transf
which are
nguage sent
ntence are i
language p
ration stage
arget langua
xity of this a
auquois trian
le
omprising th
d of the two
either by co
fer can eithe
analysis,
tence is par
identified. In
parse tree to
translates th
age. Figure 3
approach ge
ngle which
he intellectu
o stages in th
onsidering th
er be syntact
transfer, an
rsed, and th
n the transfe
o convert th
he words an
3.8 shows th
ets
is
ual
he
he
tic
nd
he
fer
he
nd
he
thr
rep
sta
ord
fin
wo
pro
gen
sen
tran
sin
lan
Consider t
ee stages of
presentation
ge. The repr
der as result
al generatio
ords.
From the a
oduces a rep
nerates the f
ntence. Thus
nslate n lang
nce individua
nguages for e
Figure
the sentence
f the translati
after the an
resentation o
of the trans
on stage wh
above examp
presentation
final translat
s, using thi
guages, will
al transfer c
each directio
3.8 Block D
e, “he will c
ion of this se
nalysis stage
of the senten
fer stage of
ich replaces
ple, it will b
that is sour
tion from the
is approach
require ‘n’ a
components
on and ‘n’ ge
78
Diagram for
come to scho
entence usin
e of the tran
ce after reor
the transfer
s the source
be clear that
rce language
e target lang
in multilin
analyser com
are require
eneration com
r Transfer A
ool in bus”.
ng the transfe
nsfer approa
rdering it acc
approach is
e language w
t, the analys
e dependent
guage depend
ngual machin
mponents, n(
ed for transl
mponents.
Approach
Table 3.7
er approach.
ach is show
cording to th
s shown in T
words to tar
ser stage of
and the gen
dent represe
ne translatio
(n-1) transfe
lation betwe
illustrates th
The sentenc
wn in analys
he Tamil wor
Table 3.7. Th
rget languag
this approac
neration stag
entation of th
on system t
er componen
een a pair o
he
ce
sis
rd
he
ge
ch
ge
he
to
nts
of
79
Table 3.7 An Example for Transfer Approach
Input Sentence He will come to school in bus
Analysis <he><will come><to school><in bus>
Transfer <he><in bus><to school><will come>
Generation (Output) அவன் ேப ந்தில் பள்ளிக்கு வ வான்
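The economics of the three rule based architectures can be checked with a little arithmetic: the direct approach needs a separate system per ordered language pair, the interlingua approach needs one analyser and one generator per language, and the transfer approach needs per-language analysers and generators plus a transfer component per ordered pair:

```python
def direct_systems(n):
    return n * (n - 1)          # one full system per ordered language pair

def interlingua_systems(n):
    return 2 * n                # n analysers + n generators

def transfer_systems(n):
    return n + n * (n - 1) + n  # analysers + transfer components + generators

for n in (3, 5, 10):
    print(n, direct_systems(n), interlingua_systems(n), transfer_systems(n))
```

For ten languages this gives 90 direct systems versus only 20 interlingua components, which is the economy argued for above; the transfer approach pays n(n-1) transfer components on top of its analysers and generators.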
3.6.2 Non-Linguistic Approaches
The non-linguistic approaches are those which don't require any linguistic knowledge explicitly to translate texts from the source language to the target language. The only resource required by these approaches is data: either dictionaries, for the dictionary based approach, or bilingual and monolingual corpora, for the empirical or corpus based approaches.
3.6.2.1 Dictionary based Approach
The dictionary based approach to machine translation uses a dictionary for the language pair to translate texts in the source language to the target language. In this approach, word level translations are performed. The dictionary lookup can be preceded by pre-processing stages that analyse the morphological information and lemmatize the word before it is retrieved from the dictionary. This kind of approach can be used to translate the phrases in a sentence, but it is found to be of little use in translating a full sentence. This approach is very useful for accelerating human translation, by providing meaningful word translations and limiting the work of humans to correcting the syntax and grammar of the sentence.
3.6.2.2 Empirical or Corpus based Approach
The corpus based approaches don't require any explicit linguistic knowledge to translate a sentence, but a bilingual corpus of the language pair and a monolingual corpus of the target language are required to train the system. This approach has drawn a lot of interest worldwide.
3.6.2.3 Example based Approach

This approach to machine translation is a technique that is mainly based on how human beings interpret and solve problems. That is, humans normally split a problem into sub problems, solve each of the sub problems with the idea of how they solved this type of similar problem in the past, and integrate the solutions to solve the problem as a whole. This approach needs a huge bilingual corpus of the language pair between which translation has to be performed.

The EBMT system functions like a translation memory. A translation memory is a computer aided translation tool that is able to reuse previous translations. If the sentence or a similar sentence has been translated previously, the previous translation is returned. In contrast, the EBMT system can translate novel sentences and not just reproduce previous sentence translations. EBMT translates in three steps: matching, alignment, and recombination [143]. 1) In matching, the system looks in its database of previous examples and finds the pieces of text that together give the best coverage of the input sentence. The matching is done using various heuristics, from exact character match to matches using higher linguistic knowledge to calculate the similarity of words or identify generalized templates. 2) The alignment step is then used to identify which target words these matching strings correspond to. This identification can be done using existing bilingual dictionaries or correspondences automatically deduced from the parallel data. 3) Finally, these correspondences are recombined and the rejoined sentences are judged using either heuristic or statistical information. Figure 3.9 shows the block diagram of the example-based approach.

Figure 3.9 Block Diagram of EBMT System

In order to get a clear idea of this approach, consider the English sentence "He bought a home" and the Tamil translations given in Table 3.8.

Table 3.8 Example of English and Tamil Sentences
English: He bought a pen    Tamil: mtd; xU ngdh th';fpdhd;
English: He has a home      Tamil: mtDf;F xU tPL ,Uf;fpwJ

The parts of the sentence to be translated will be matched with these two sentences in the corpus. Here, the part of the sentence 'He bought' gets matched with the words in the first sentence pair and 'a home' gets matched with the words in the second sentence pair. Therefore, the corresponding Tamil parts of the matched segments of the sentences in the corpus are taken and combined appropriately. Sometimes, post-processing may be required in order to handle numbers and gender if the exact words are not available in the corpus.

3.6.2.4 Statistical Approach

The statistical approach to machine translation generates translations using statistical methods, deriving the parameters for those methods by analysing bilingual corpora. This approach differs from the other approaches to machine translation in many aspects. Figure 3.10 shows the simple block diagram of a Statistical Machine Translation (SMT) system.

Figure 3.10 Block Diagram of SMT System
The advantages of the statistical approach over other machine translation approaches are as follows:
• The enhanced usage of resources available for machine translation, such as manually translated parallel and aligned texts of a language pair, books available in both languages and so on. That is, large amounts of machine readable natural language text are available to which this approach can be applied.
• In general, statistical machine translation systems are language independent, i.e., they are not designed specifically for one pair of languages.
• Rule based machine translation systems are generally expensive, as they employ manual creation of linguistic rules, and these systems cannot be generalised to other languages, whereas statistical systems can be generalised for any pair of languages if a bilingual corpus for that particular language pair is available.
• Translations produced by statistical systems are more natural compared to those of other systems, as the system is trained on real texts from the bilingual corpus and the fluency of the sentence is guided by a monolingual corpus of the target language.
Statistical parameters are analysed and determined from the bilingual and monolingual corpora. Using these parameters, translation and language models are generated. Designing a statistical system for a particular language pair is a rapid process, because most of the work lies in creating the bilingual corpus for that particular language pair. In order to obtain better translations from this approach, the system needs at least two million words for a particular domain. Moreover, Statistical Machine Translation requires an extensive hardware configuration to create the translation models needed to reach average performance levels.
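The core decision rule of such a system, choosing the target sentence that maximizes the product of a translation model score and a language model score, can be sketched with invented toy probabilities (all numbers and romanized glosses below are hypothetical):

```python
# Hypothetical toy models; all numbers are invented for illustration.
candidates = ["avan oru veedu vaanginaan",   # "he bought a house" (romanized gloss)
              "avan veedu oru vaanginaan"]   # same words, ungrammatical order

def translation_model(target):
    # In a real SMT system this score comes from the aligned bilingual corpus;
    # here both candidates use the same words, so they score equally.
    return 0.2

def language_model(target):
    # A real system scores fluency with n-gram counts from a monolingual
    # corpus; here we simply reward one hand-picked bigram.
    return 0.5 if "oru veedu" in target else 0.01

best = max(candidates, key=lambda t: translation_model(t) * language_model(t))
print(best)
```

The translation model alone cannot distinguish the two candidates; it is the language model, trained on target-language text, that steers the system towards the fluent word order, which is why a monolingual corpus is listed among the required resources above.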
3.6.3 Hybrid Machine Translation System
The hybrid machine translation approach makes use of the advantages of both statistical and rule-based translation methodologies. Commercial translation systems such as Asia Online and Systran provide systems that were implemented using this approach. Hybrid machine translation approaches differ in a number of aspects:
Rule-based system with post-processing by the statistical approach: Here the rule based machine translation system produces translations for a given text from the source language to the target language. The output of this rule based system is then post-processed by a statistical system to provide better translations. Figure 3.11 shows the block diagram of this system.

Figure 3.11 Rule based Translation System with Post-processing

Statistical translation system with pre-processing by the rule based approach: In this approach a statistical machine translation system is combined with a rule based system that pre-processes the data before providing it for training and testing. The output of the statistical system can also be post-processed using the rule based system to provide better translations. The block diagram for this type of system is shown in Figure 3.12.

Figure 3.12 Statistical Machine Translation System with Pre-processing

3.7 EVALUATING STATISTICAL MACHINE TRANSLATION

This section presents evaluation methods for finding the quality of a machine translation system. Evaluation of machine translation is a very active field of research. There are two important types of evaluation techniques in machine translation: automatic evaluation and manual (human) evaluation. This subdivision shows how to evaluate the performance of an MT system both manually and automatically. The most reliable method for evaluating translation adequacy and fluency is through human evaluation, but human evaluation is a slow and expensive process. The judgments of more than one human evaluator are usually averaged. A quick, cheap and consistent approach is required to judge MT systems. A precise automated evaluation technique would require linguistic understanding. Methods for automatic evaluation usually find the similarity between the translation output and one or more reference translations.
3.7.1 Human Evaluation Techniques
Statistical Machine Translation outputs are very hard to evaluate. To judge the quality
of translation one may ask human translators to find the scores for a machine
translation output or compare a system output with a gold standard output. This gold
standard outputs are generated by human translators. In human evaluation, different
translators translated same sentence in different ways. There is no single correct answer
for the translation task because a sentence can be translated in different ways. The
reason for translation variation is choice of words, word order and style of translators.
So the machine translation quality is very hard to predict.
The human evaluation tasks provide the best insight into the performance of an MT system, but they come with major drawbacks: they are expensive and time consuming. To overcome some of these drawbacks, automatic evaluation metrics have been introduced. These are much faster and cheaper than human evaluation, and they are consistent, since they always produce the same evaluation for the same data. The disadvantage of automatic evaluation metrics is that their judgments are often not as accurate as those provided by a human. Automatic evaluation, however, is not tied to a realistic translation scenario: most often, evaluation is performed on sentences for which one or more gold standard reference translations already exist [143].
In the human evaluation method, the judges are presented with a gold-standard sentence and some translations. Table 3.9 shows the scales used for evaluation when the target language is English. Using these scales, the judges are asked to assign a score to each of the presented translations. Adequacy and fluency are the most widespread criteria for manual evaluation.
Table 3.9 Scales of Evaluation
Score Adequacy Fluency
5 All Flawless
4 Most Good
3 Much Non-native
2 Little Disfluent
1 None Incomprehensible
3.7.2 Automatic Evaluation Techniques
Automatic evaluation uses a computer program to judge whether one translation output is better than another. Currently, automatic evaluation metrics are widely used to evaluate machine translation systems, and systems are improved based on the rise and fall of these automatic scores. The major advantages of this technique are time and money: it takes little time to judge a huge number of outputs. In situations like everyday system evaluation, human evaluation can be too expensive, slow, and inconsistent. Therefore, a reliable automatic evaluation metric is very important to the progress of the machine translation field. In this section, the most widely used automatic evaluation metrics are described: BLEU, NIST, edit distance measures, and precision and recall.
3.7.2.1 BLEU Score
The first and most widely-used first automatic evaluation measure is BLEU (BiLingual
Evaluation Understudy) [144]. It was introduced by IBM in Papineni et.al. (2002). It
finds the geometric mean of modified n-gram precisions. BLEU considers not only
single word matches between the output and the reference sentence, but also n-gram
matches, up to some maximum n. It is the ratio of correct n-gram of a certain order n in
relation to the total number of generated n-gram of that order. The maximum order n
for n-gram to be matched is typically set to four. This mean is then called BLEU-4.
Multiple reference are also be used to compute BLEU. Evaluating system translation
against multiple reference translation provides a more robust assignment of the
translation quality [144]. The BLEU metric then takes the geometric mean of the scores
assigned to all n-gram lengths. Equation 3.1 shows the formula for BLEU, where N is
the order of n-grams that are used, usually 4, pn is a modified n-gram precision, where
each n-gram in the reference can be matched by at most one n-gram from the
hypothesis. BP is a brevity penalty, which is used to penalize too short translations. It is
based on the length of the hypothesis c, and the reference length r. If several references
are used, there are alternative ways of calculating the reference length, using the
closest, average or shortest reference length. BLEU can only be used to give accurate system-wide scores, since the geometric mean formulation means the score will be zero if there are no overlapping 4-grams, which is often the case for single sentences.
BLEU = BP · exp( (1/N) ∑_{n=1}^{N} log p_n )    (3.1)
BP = 1 if c > r, and BP = exp(1 − r/c) if c ≤ r
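Equation 3.1 can be sketched in code as follows. This is an illustrative, sentence-level reimplementation with a single reference, not the official BLEU scorer of [144]; the function and variable names are our own.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypothesis, reference, max_n=4):
    """Sketch of BLEU-4 for one hypothesis and one reference.

    p_n is the modified n-gram precision: each reference n-gram can be
    matched at most once.  BP penalizes hypotheses shorter than the
    reference (c <= r), as in equation 3.1.
    """
    hyp, ref = hypothesis.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        if matches == 0 or total == 0:
            return 0.0  # geometric mean is zero if any p_n is zero
        log_precisions.append(math.log(matches / total))
    c, r = len(hyp), len(ref)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * math.exp(sum(log_precisions) / max_n)
```

Note that sentence-level BLEU is harsh for exactly the reason given above: one missing 4-gram zeroes the score, so practical toolkits smooth the precisions or aggregate counts over a whole test corpus.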
3.7.2.2 NIST Metric
The NIST metric (Doddington, 2002) is an extension of the BLEU metric [145]. The
introduction of this metric tried to meet two characteristics of BLEU. First, the
geometric average of BLEU makes the overall score more sensitive to the modified
precision of the individual n’s, than if the arithmetic average is used. This may be a
problem if not many high n-gram matches exist. Second, all word forms are weighted
equally in BLEU; less frequent word forms may be more important for the translation than, for example, highly frequent function words, which NIST tries to
compensate for by introducing an information weight. Additionally, the BP is also
changed to have less impact for small variations in length. The information weight of
an n-gram abc is calculated by the following equation:
info(abc) = log₂( count(ab) / count(abc) )    (3.2)
This information weight is used in equation (3.4) instead of the actual count of
matching n-grams. In addition, the arithmetic average is used instead of the geometric,
and the BP is calculated based on the average reference length instead of the closest
reference length. The lengths of these are summed for the entire corpus (r) and the same
for the translations (t).
BP = exp( β · log² [ min(t/r, 1) ] )    (3.3)
NIST = BP · ∑_{n=1}^{N} [ ∑_{matching n-grams} info(n-gram) / (number of n-grams of order n in the system output) ]    (3.4)
The NIST metric is very similar to the BLEU metric, and their correlations with human
evaluations are also close. Perhaps NIST correlates a bit better with adequacy, while
BLEU correlates a bit better with fluency (Doddington, 2002) [145].
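The information weight of equation (3.2) can be estimated from reference-corpus counts, as in the following sketch (the function name and the count-based estimation are our own illustration):

```python
import math
from collections import Counter

def info_weights(reference_tokens, max_n=5):
    """info(w1..wn) = log2(count(w1..w_{n-1}) / count(w1..wn)), with
    counts taken from the reference corpus.  Rare n-grams that extend
    a frequent prefix receive a high information weight; frequent
    function-word n-grams receive a low one."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(reference_tokens) - n + 1):
            counts[tuple(reference_tokens[i:i + n])] += 1
    total = len(reference_tokens)  # denominator prefix for unigrams
    return {
        gram: math.log2((counts[gram[:-1]] if len(gram) > 1 else total) / c)
        for gram, c in counts.items()
    }
```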
3.7.2.3 Precision and Recall
In Automatic evaluation metrics each sentence in system translation is compared
against gold standard or human translations. This gold standard human translation is
called Reference translation. This precision and recall approach is based on word
matches. Precision is a fraction of retrieved docs that are relevant and Recall is defined
as fraction of relevant docs that are retrieved. This metric is mainly used in information
retrieval systems. The significant drawback of this metric while using in Machine
translation is, not considerable of word order.
Precision: the number of relevant documents retrieved by a search divided by the total number of documents retrieved by that search, i.e. P(relevant | retrieved).
Recall: the number of relevant documents retrieved by a search divided by the total number of existing relevant documents (which should have been retrieved), i.e. P(retrieved | relevant).
For example,
SMT OUTPUT: Israeli officials responsibility of airport safety
REFERENCE: Israeli officials are responsible for airport security
Here three output words (Israeli, officials, airport) match the reference.
Precision = Correct / Output length = 3 / 6 = 50%
Recall = Correct / Reference length = 3 / 7 = 42.85%
The F measure (weighted harmonic mean) is a combined measure that assesses the precision/recall tradeoff:
F = 2(P × R) / (P + R) = 2(0.5 × 0.4285) / (0.5 + 0.4285) ≈ 46%
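The worked example above can be reproduced with a short bag-of-words matcher. Note that this matching deliberately ignores word order, which is exactly the drawback pointed out above:

```python
from collections import Counter

def precision_recall_f(output, reference):
    """Word-level precision, recall and F-measure for one sentence,
    using bag-of-words matching (word order is ignored)."""
    out, ref = output.split(), reference.split()
    # Counter intersection clips each word to its reference count
    correct = sum((Counter(out) & Counter(ref)).values())
    p = correct / len(out)
    r = correct / len(ref)
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f

p, r, f = precision_recall_f(
    "Israeli officials responsibility of airport safety",
    "Israeli officials are responsible for airport security",
)
# matches: Israeli, officials, airport -> p = 3/6, r = 3/7, f ≈ 0.46
```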
3.7.2.4 Edit Distance Measures
Edit Distance Measures provide an estimate of translation quality based on the number
of changes which must be applied to the automatic translation so as to transform it into
a reference translation.
• WER- Word Error Rate (Nießen et al., 2000) [147]. This measure is based on
the Levenshtein distance (Levenshtein, 1966) [146] —the minimum number of
substitutions, deletions and insertions that have to be performed to convert the
automatic translation into a reference translation.
• PER- Position-independent Word Error Rate (Tillmann et al., 1997) [148]. A
shortcoming of the WER is that it does not allow reordering of words. In order
to overcome this problem, the position independent word error rate (PER)
compares the words in the two sentences without taking the word order into
account.
• TER- Translation Edit Rate (Snover et.al., 2006) [149]. TER measures the
amount of post-editing that a human would have to perform to change a system
output so it exactly matches a reference translation. Possible edits include
insertions, deletions, and substitutions of single words as well as shifts of word
sequences. All edits have equal cost.
TER = # of edits to closest reference / average # of reference words
The edits that TER considers are insertion, deletion and substitution of
individual words, as well as shifts of contiguous words. TER has also been
shown to correlate well with human judgment.
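WER, the simplest of these measures, can be sketched with the standard Levenshtein dynamic program over words (an illustrative implementation; PER and TER would change only the matching rules):

```python
def wer(hypothesis, reference):
    """Word Error Rate: minimum number of word substitutions,
    insertions and deletions turning the hypothesis into the
    reference, divided by the reference length."""
    hyp, ref = hypothesis.split(), reference.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i                          # delete remaining ref words
    for j in range(len(hyp) + 1):
        dp[0][j] = j                          # insert all hyp words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,            # deletion
                           dp[i][j - 1] + 1,            # insertion
                           dp[i - 1][j - 1] + cost)     # substitution/match
    return dp[len(ref)][len(hyp)] / len(ref)
```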
3.8 SUMMARY
This chapter provided background on Tamil language processing and the various approaches for developing linguistic tools and Machine Translation systems. It also gave an overview of the Tamil language and its morphology. Machine learning for Natural Language Processing and evaluation methods for Machine Translation were also discussed.