
Word Sense Disambiguation for Urdu Text by Machine Learning

Syed Zulqarnain Arif1, Muhammad Mateen Yaqoob1, Atif Rehman2, and Fuzel Jamil1

1Department of Computer Science, COMSATS Institute of Information Technology, Abbottabad, Pakistan

2Department of CS & IT, University of Sargodha, Sargodha, Pakistan

Abstract- The problem underlying this research is to solve the word sense disambiguation problem for Urdu language text. Although the problem is well studied for English text, work on Urdu is still in its infancy. Since Urdu is a morphologically rich language and its data is therefore very sparse, there is no guarantee that methods that perform well for English will also be effective for Urdu. The goal of this research is to propose a method for word sense disambiguation of ambiguous Urdu words and to develop a prototype system. We have proposed a two-stage method in which the first stage is a detection phase and the second stage is an identification phase. The detection task is performed by maintaining a lexicon of ambiguous words. The identification task is posed as a classification problem in which a classifier is trained for each ambiguous word using a word sense disambiguation specific test collection. Two well-studied text classification algorithms have been analyzed for classification: support vector machine and Naïve Bayes. The results show that the support vector machine has an advantage due to its intrinsic ability to handle sparse data and its lower sensitivity to the high dimensionality of the dataset.

Keywords: Machine Learning, Word Sense Disambiguation, Support Vector Machine, Naïve Bayes

I. INTRODUCTION

Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. The extraction is mainly applied to textual data that is readable by machines; however, techniques for automatic annotation and content extraction from multimedia are also considered part of IE. The main goal of IE is to automatically extract entities from data on the web, find the relationships among entities, and describe the attributes of entities. The extracted knowledge enables much richer forms of queries over abundant unstructured sources than is possible with keyword searches alone. Another objective of IE is to represent unstructured text in a form useful for deriving inferences based on the logical content found in the input data.

Assigning a correct sense automatically to a word that has more than one meaning is referred to as word sense disambiguation (WSD). Many words have several meanings, with the correct sense determined by the particular context, and WSD techniques are introduced to resolve this. Word sense disambiguation is the process of resolving the ambiguity and rectifying the use of a word in a specific frame of reference [1]. Humans easily identify the correct sense of a word with the help of context; WSD is the field that deals with modeling this human ability in computers. It is frequently observed that some words have multiple meanings depending on their part of speech (POS). In the following examples the same word (bear) is used as a verb and as a noun in different sentences.

I bear (verb) him no malice.

We sell all shapes and sizes of teddy bear (noun).

I can't bear (verb) this stifling humidity.

Bear is used as a noun in the second sentence, where it refers to a type of animal, while in the other sentences bear is used as a verb meaning "to tolerate". Such ambiguous words can be resolved with the help of part-of-speech information. However, a more challenging WSD task is to disambiguate words that have the same part of speech but different meanings. For instance, "bank" in English can refer to the raised edge of a river or to a financial institution.

Ambiguous words are often classified as cases of polysemy or homonymy. Polysemy refers to words that have more than one related sense; for example, bank as an organization or as the edge of a watercourse. Homonymy, on the other hand, is


used for words that are completely different words with identical spellings. For instance, "can" may be used as a modal verb ("I can see") or as a container ("She bought me an oil can"). The difference between polysemy and homonymy is often unclear. Word sense disambiguation (WSD) is the technique of identifying the meaning of a word according to the context of the sentence [2]. WSD is generally a difficult task that requires considerable effort in every language, and dealing with the languages of the South Asian region raises additional challenges. Since our focus is the Urdu language, some relevant characteristics of Urdu are described in the following paragraphs.

There is no concept of capitalization in any of the South Asian languages, including Urdu, whereas it is a prominent feature of almost all European languages. For example, English uses capitalization, and it is very useful for identifying a name in an English phrase, since the first letter of a name must be written in capital. The absence of capitalization in Urdu therefore makes WSD harder, especially for rule-based approaches, because capitalization can be an important heuristic for disambiguating words with different parts of speech.

There are many ambiguities in Urdu when it comes to names. There are names that are used both as common and as proper nouns, and distinguishing proper nouns from common nouns is one of the biggest challenges in WSD. For example, برکت (Barkat) and امین (Amin) are names that can also be used as common nouns. For any technique, whether knowledge engineering or machine learning, language resources are very important, and the same holds for WSD. One needs a collection of documents with labeled categories, such as [20]. Such a collection is often referred to as a test collection, and none is available for the Urdu language. Like other languages, Urdu is ambiguous: just as English has many words with more than one meaning (e.g. "bank" as a financial bank or a river bank), the Urdu word "sona" is ambiguous between gold and sleep.

Nowadays a huge amount of Urdu data is available on the internet in the form of blogs, forums, news websites, social media, etc. The situation is very different from a few years ago, when Urdu news websites scanned pieces of newspapers and displayed the images on their pages; today many of those websites use digital text instead of scanned documents. This trend creates both opportunities and challenges for gathering and processing Urdu text in order to better organize it, retrieve it and extract information from it. Since WSD is an important task for machine translation, there is a need to develop systems that retrieve, extract, classify and handle Urdu text data in the same way as for English and other European languages. Considering this development, the goal underlying this research is to apply state-of-the-art word sense disambiguation techniques to Urdu text documents in order to make information retrieval and extraction from Urdu documents more efficient.

The problem underlying this research paper is to solve the word sense disambiguation problem for Urdu language text. It deals with the question of how to identify the correct senses of ambiguous Urdu words. Although the problem is well studied for English text, the applicability of existing methods to Urdu text has received little study. Since Urdu is a morphologically rich language and its data is therefore very sparse, there is no guarantee that methods that perform well for English will also be effective for Urdu. We have proposed a two-stage method, shown in Figure 1, in which the first stage is a detection phase and the second stage is an identification phase. The detection task is performed by maintaining a lexicon of ambiguous words. The identification task is posed as a classification problem in which a classifier is trained for each ambiguous word using a word sense disambiguation specific test collection.

Figure-1: Proposed approach (input data flows into the detection phase, which identifies ambiguous words using a lexicon, and then into the identification phase, which identifies the sense of each word using word-specific models)


II. RELATED WORK

In this section we describe previous work related to this research. The related work is organized into two categories: traditional knowledge-based approaches and machine learning based approaches. The machine learning based approaches are further organized into three classes: supervised, unsupervised and semi-supervised methods.

A. Knowledge Based Approach for WSD

Text categorization started appearing in the early 1960s, and by the early 1990s it had become a major field within the information systems discipline. It is now applied in many applications such as email spam filtering and website classification [10]. The most popular approach to text categorization in the late 1980s was Knowledge Engineering (KE), in which a set of rules is defined manually by human experts to classify text documents into a predefined set of categories. In the 1990s KE lost its popularity and the machine learning (ML) approach became popular. In the ML approach, the classifier is built automatically from a set of pre-classified documents using ML techniques. The advantages of the ML approach over KE are:

Accuracy: the accuracy of the ML approach is comparable with that of the KE approach.

Cost: KE is costlier than ML, since knowledge engineers and domain experts are involved in building the classifier, whereas ML needs little human interaction to build a classifier [10].

Knowledge-based methods mainly try to avoid the need for the large amounts of training material required by supervised methods. They can be classified according to the type of resource they use: 1) machine-readable dictionaries; 2) thesauri; or 3) computational lexicons or lexical knowledge bases.

1. Dictionary Based Approach

The dictionary based approaches rely on an already built dictionary that holds the sentiment polarity of words, such as WordNet-Affect or SentiWordNet. SentiWordNet [21] is a lexical tool that parses the words within an opinion and searches for their match in the maintained dictionary. Based on the match, words are assigned scores for being positive or negative, and the positive and negative scores of all the words within an opinion are then accumulated separately. The final decision is made on the accumulated score: if the positive score is greater, the opinion is regarded as positive, and as negative otherwise.
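As a rough illustration of this score-accumulation idea (not the exact procedure of [21], and using a made-up mini-lexicon rather than SentiWordNet), the following sketch sums per-word positive and negative scores and picks the larger total:

```python
# Minimal sketch of dictionary-based polarity scoring with a hypothetical lexicon.
# Each entry maps a word to (positive_score, negative_score).
LEXICON = {"good": (0.8, 0.0), "bad": (0.0, 0.7), "excellent": (0.9, 0.0)}

def classify_opinion(text):
    pos_total = neg_total = 0.0
    for word in text.lower().split():
        pos, neg = LEXICON.get(word, (0.0, 0.0))  # unknown words contribute nothing
        pos_total += pos
        neg_total += neg
    return "positive" if pos_total > neg_total else "negative"

print(classify_opinion("the camera is good but the battery is bad"))  # -> "positive"
```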

These words or sets of words (opinions) are passed to a bootstrapping process using WordNet [21], an algorithm that uses WordNet for counting the words and assigning polarities. The technique is simple and efficient and produces good results, but it has the disadvantage that no effective way has been found to deal with context-dependent words. For instance, the words "large", "small", "cold" and "hot" can indicate a positive or a negative opinion depending on the product context and feature. The approach was used to generate a quick summary of opinions (i.e. the percentage of positive or negative opinions) related to different features of products [22]. The overall system is shown as a block diagram in Figure 2. After extracting opinions, part-of-speech tagging is employed to filter out irrelevant words, since adjectives are considered the relevant words; removing irrelevant words is referred to as feature pruning in the block diagram. In the next phase, the system removes infrequent features. Finally, the system determines the orientation of each opinion sentence (i.e. positive or negative). A bootstrapping technique is used for the semantic orientation of each opinion using WordNet [23], and the semantic orientation of the whole sentence is then determined from the dominant orientation of the opinion words in the sentence. An improvement on these methods is proposed in [24], [25] using more refined and sophisticated tools, such as the Sentiment Analyzer [26]. Another work on the dictionary based approach was performed in the context of movie recommendations [27].


Figure-2: Summary system

2. Machine-Readable Dictionaries (MRDs)

MRDs provide a ready-made source of information about word senses and knowledge about the world, which can be very useful for WSD and natural language understanding. Since the work of Lesk (1986), many researchers have used MRDs as a structured source of lexical knowledge for WSD systems. However, MRDs contain inconsistencies and were created for human use rather than for machine exploitation; much of the knowledge in a dictionary only becomes really useful when a complete WSD process is performed over the whole set of definitions [28]. See also the results of the Senseval-3 task devoted to disambiguating WordNet glosses [29].

WSD techniques using MRDs can be classified according to the lexical resource used (monolingual or bilingual MRDs), the MRD information exploited by the method (words in definitions, semantic codes, etc.), and the similarity measure used to relate words from the context to MRD senses.

The Lesk Algorithm

Words are disambiguated by finding the pair of dictionary senses with the greatest word overlap in their dictionary definitions. For example, when disambiguating the words in "pine cone", the definitions of the compatible senses both contain the words "evergreen" and "tree" in one dictionary. Figure 3 gives an example of the Lesk algorithm.

Figure-3: Lesk algorithm
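A minimal sketch of this overlap idea is given below; it uses a hypothetical two-entry sense dictionary rather than a real MRD, and simply counts shared words between each candidate definition and the surrounding context:

```python
# Simplified Lesk-style disambiguation: pick the sense whose definition shares
# the most words with the context. The sense inventory here is illustrative only.
SENSES = {
    "pine": {
        "tree": "kinds of evergreen tree with needle-shaped leaves",
        "grieve": "waste away through sorrow or illness",
    },
}

def simplified_lesk(word, context):
    context_words = set(context.lower().split())
    best_sense, best_overlap = None, -1
    for sense, definition in SENSES[word].items():
        overlap = len(context_words & set(definition.lower().split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense

print(simplified_lesk("pine", "a cone from an evergreen tree"))  # -> "tree"
```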


Graph-based methods consider the overall relatedness of senses and compute the semantic similarity of each pair of word senses with respect to a given lexical knowledge base, e.g. WordNet. Graph-based methods were applied in the early days of this research with some success, and more recent graph-based approaches have been shown to perform as well as supervised methods [30], or even better than them in particular domains [31], [32].

B. Machine learning approach for WSD

The machine learning domain offers a number of algorithms that can learn concepts or patterns from datasets. These algorithms are usually divided into three classes: supervised, unsupervised and semi-supervised. In supervised learning, the main requirement is an annotated dataset in which the ambiguous words are labeled. The task can then be seen as an instance of text categorization, where an algorithm is trained to classify text documents. A number of trainable supervised learning algorithms are available for this purpose, including the support vector machine and naïve Bayes. After training, the algorithm produces a hypothesis (or function) that can output the sense of an ambiguous word. A graphical view of the machine learning approach is given in Figure 4.

Figure-4: View of learning method to WSD

1. Supervised Learning

The study of algorithms that learn concepts under supervision, for example from a dataset together with the correct output for each example, is called supervised learning (SL). For the WSD problem, the standard way to provide supervision is an annotated dataset, i.e. a collection of documents that have been manually tagged by a person according to word sense. The workflow of supervised learning is shown in Figure 5, and the supervised learning method is shown in Figure 6.

Figure-5: Work flow of supervised learning

Figure-6: Supervised learning method

(Figure labels: learning methods comprise supervised, unsupervised and semi-supervised learning; known data with known responses are used to train a model, which then produces predicted responses for new data.)


For the WSD problem, supervised learning is the most effective family of techniques. It includes the support vector machine (SVM) [36], Bayesian classification [20], and decision trees [37], all of which are variants of the supervised learning approach.

2. Unsupervised Learning

In contrast to supervised learning, unsupervised learning relies on clustering the training data to find groups related to different senses. The groups can be used to approximate densities or the proximity of unknown word occurrences (with unknown senses). Unsupervised learning may also be used to reduce the dimensionality of the feature space, which may in turn be used to train supervised algorithms. The unsupervised learning method is shown in Figure 7 below.

Unsupervised methods are an active line of inquiry for WSD researchers. The fundamental hypothesis is that the same sense occurs in similar contexts, so senses can be induced from text by grouping word occurrences according to the similarity of their contexts [38]. This is called word sense induction. Afterwards, new occurrences of the word can be assigned to the closest induced clusters. It has been shown that word sense induction improves web search result grouping by increasing the quality of result clusters and the diversification of result lists [39], [40].

Figure-7: Unsupervised learning method

3. Semi-Supervised Learning

One disadvantage of the supervised learning approach is the preparation of a large test collection: with machine learning it is generally the case that the more data there is, the higher the effectiveness of the method. To deal with the cost of collecting large annotated datasets, an alternative is semi-supervised learning (SSL). The main idea of this approach is to initially train an algorithm on a small annotated dataset and then improve the under-trained model using unannotated data. Although the aim of SSL is close to that of supervised learning, the goal is to learn a concept from few tagged and many untagged documents. SSL techniques are still not mature; bootstrapping is one of the popular techniques used for SSL. A general framework of SSL [41] is given in Figure 8.

Figure-8: Frame work of semi-supervised learning
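One way to picture the bootstrapping idea is a simple self-training loop, sketched below under the assumption that a small labelled seed set and a pool of unlabelled contexts are already available; confidently predicted unlabelled examples are repeatedly added to the training set. This is only an illustration of the general SSL scheme, not the framework of [41], and a logistic-regression classifier is used here just because it exposes prediction probabilities:

```python
# Minimal self-training (bootstrapping) sketch for a sense classifier.
# seed_texts/seed_senses and unlabeled_texts are assumed to be provided elsewhere.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def self_train(seed_texts, seed_senses, unlabeled_texts, rounds=3, threshold=0.9):
    texts, senses = list(seed_texts), list(seed_senses)
    pool = list(unlabeled_texts)
    for _ in range(rounds):
        model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
        model.fit(texts, senses)
        if not pool:
            break
        probs = model.predict_proba(pool)              # confidence for each pooled context
        confident = np.max(probs, axis=1) >= threshold
        labels = model.classes_[np.argmax(probs, axis=1)]
        # Move confidently labelled contexts from the pool into the training set.
        texts += [t for t, ok in zip(pool, confident) if ok]
        senses += [l for l, ok in zip(labels, confident) if ok]
        pool = [t for t, ok in zip(pool, confident) if not ok]
    return model
```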


Supervised learning is one extreme, where the entire dataset is annotated with WSD concepts; the other extreme is unsupervised learning (USL), where a completely unannotated dataset is used, saving the time of tagging the dataset. Clustering is the best-known USL approach. These techniques rely on lexical resources (e.g. WordNet) and on lexical patterns and statistics computed over a large unannotated corpus.

C. WSD for Urdu Language

Some work has been done on applying WSD to the Urdu language. Urdu, spoken by more than 100 million people, is an important language, and a preliminary step has been taken towards resolving the ambiguity of words in an Urdu context using Bayesian classification. In the early efforts at disambiguating word senses, printed dictionaries were used; those efforts performed well on the data they were tuned to, but unfortunately their performance degraded at larger scale. In 1986 an electronic dictionary was used for WSD and proved valid for larger contexts; in this methodology, senses were identified using the sense definitions in the dictionary, and the method became well known. A dictionary based approach was also employed by Walker, which chooses the sense of a word by recognizing its semantic class.

III. RESEARCH METHODOLOGY

The purpose of this section is to describe our proposed methodology for Urdu WSD using the support vector machine. The design of a prototype system based on the proposed methodology is also presented. The methodology is shown as a block diagram in Figure 9. The main steps include pre-processing (i.e. tokenization, stop-word removal and feature weighting), feature selection, classifier training and evaluation. These steps are briefly described in the following sections.

Figure-9: Research methodology

One known shortcoming in Urdu natural language processing is the non-availability of corpora. Although some corpora have been built, they are not readily available and are specific to tasks such as part-of-speech tagging and named entity recognition. Since no data collection is available for developing and testing WSD methods, we have collected our own. A test collection is a collection of documents to which human indexers have assigned category names from a predefined set [51]; it is the most vital resource for text categorization. We have taken 21 Urdu words that have different senses depending on the context in which they are used. The list of words is shown in Table 1.




For each word, we have collected text documents from different online resources (e.g. Voice of America Urdu, BBC Urdu, and Express). The distribution of documents per word is shown in Table 2. To annotate the documents with the predefined word senses, we employed the services of three Urdu teachers from the Department of Urdu, University of Sargodha. The inter-annotator agreement between these annotators is estimated at 0.99.

Table-2: Document distribution over words

Word               Documents
Bajana (بجانا)      73
Bhagna (بھاگنا)     47
Bus (بس)            50
Dana (دانا)         51
Daraz (دراز)        45
Deya (دیا)          43
Jaha (جہاں)         49
Par (پر)            47
Sar (سر)            50
Tar (تار)           45
Soo (سو)            48
Kal (کل)            50
Kamar (کمر)         50
Qasam (قسم)         50
Kalna (کھلانا)       50
Kon (کون)           50
Lot (لوٹ)           50
Mlk (ملک)           62
Puri (پوری)          50
Sona (سونا)         53
Balay (بلا)          31

As discussed for the text categorization process, a few pre-processing steps are required to represent the textual data in a proper format for applying machine learning algorithms. The steps include tokenization, stop-word removal and feature weighting. To tokenize documents, a number of possibilities are available, including phrase-level, sentence-level and word-level tokenization. However, word-level tokenization has empirically been found more effective because it is statistically sounder [10], so we have adopted word-level tokenization.

Table-1: Words and corresponding senses

Word               Senses
Bajana (بجانا)      1. Knock  2. Clock-strike  3. Play
Bhagna (بھاگنا)     1. Avoid  2. Elope  3. Escape  4. Run
Bus (بس)            1. Bus  2. Enough
Dana (دانا)         1. Grain  2. Intelligent
Daraz (دراز)        1. Draw  2. Height
Deya (دیا)          1. Give  2. Oil lamp
Jaha (جہاں)         1. Where  2. World
Par (پر)            1. Upon  2. Wings
Sar (سر)            1. Head  2. Sir
Tar (تار)           1. Letter  2. Wire
Soo (سو)            1. Sleep  2. 100 rupees
Kal (کل)            1. Tomorrow  2. Total
Kamar (کمر)         1. Back  2. Name
Qasam (قسم)         1. Type  2. Swear
Kalna (کھلانا)       1. To eat  2. To play
Kon (کون)           1. Ice-cream  2. Shape  3. Who
Lot (لوٹ)           1. Come back  2. Loot
Mlk (ملک)           1. Cast  2. Country  3. Milk
Puri (پوری)          1. Dish  2. Whole world
Sona (سونا)         1. Sleep  2. Gold
Balay (بلا)          1. Call him  2. Ghosts


One key issue in text categorization is the curse of dimensionality, which not only degrades the effectiveness of classifiers but also makes some of them unscalable [52] [53]. The key idea for dealing with this issue is to reduce the size of the feature set, because not all features are useful for classification: there are many uninformative words such as auxiliary verbs, conjunctions, articles and other function words (which exhibit similar frequencies across all classes) that are better removed. These words are collectively known as stop words. Besides stop-word removal, more sophisticated methods include feature selection and feature extraction [53]. While feature selection methods select a subset of the existing features, feature extraction methods reduce the dimensionality of the feature space by projecting it onto a lower-dimensional space. Finally, we have adopted the standard TFIDF (term frequency-inverse document frequency) measure for assigning weights to features. There are two key requirements for feature selection: first, the selection should be performed automatically, and second, the selected set should contain a minimal set of informative words that improves the effectiveness of the classifiers [54]. To meet this objective, several feature selection techniques have been proposed in the literature; the most widely used are information gain, gain ratio, mutual information and chi-square [54]. Various studies have compared feature selection methods, e.g. [54], and in these empirical studies information gain has ranked highly among them. Following these recommendations, we have also employed information gain in our work. The method is briefly described in the following subsection.
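As a rough illustration of this pre-processing pipeline (word-level tokenization, stop-word removal and TFIDF weighting), the sketch below uses scikit-learn; the tiny Urdu stop-word list and the two example documents are made up for illustration and are not the paper's actual resources:

```python
# Illustrative pre-processing sketch: tokenize at word level, drop stop words,
# and weight the remaining features with TF-IDF.
from sklearn.feature_extraction.text import TfidfVectorizer

URDU_STOP_WORDS = ["کا", "کی", "کے", "ہے", "اور", "سے", "میں"]  # hypothetical short list

documents = [
    "سونا بہت مہنگا ہو گیا ہے",   # example context for one sense of an ambiguous word
    "رات کو جلدی سونا اچھا ہے",   # example context for another sense
]

vectorizer = TfidfVectorizer(analyzer="word", stop_words=URDU_STOP_WORDS)
X = vectorizer.fit_transform(documents)          # sparse document-term TF-IDF matrix
print(X.shape, list(vectorizer.get_feature_names_out()))
```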

Information Gain (IG)

Information gain (IG) is commonly employed as a term-goodness criterion in machine learning. It is a quantitative measure often used in text categorization to assess the worth of a feature, i.e. how well a single feature predicts the target classification of a document [55]. IG is usually defined with the help of entropy. Entropy is a measure from information theory and can be understood as the (im)purity of a dataset. For example, in a binary classification problem where the dataset S contains some positive and some negative examples (e.g. a document is spam or not), the entropy H of S is measured as:

H(S) = -p_{\oplus} \log_2 p_{\oplus} - p_{\ominus} \log_2 p_{\ominus}    (1)

In the above equation, p_{\oplus} denotes the proportion of positive examples and p_{\ominus} the proportion of negative examples in the dataset S. If the dataset is partitioned according to a feature, the IG of that feature is the expected reduction in entropy. Equation (2) gives the IG of a feature t relative to a dataset S, denoted IG(S, t):

IG(S, t) = H(S) - \sum_{v \in val(t)} \frac{|S_v|}{|S|} H(S_v)    (2)

where val(t) is the set of all values of the feature t and S_v is the subset of S in which feature t has value v.
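A small self-contained sketch of this computation is shown below for a binary (present/absent) term feature over a toy labelled dataset; the documents and labels are invented for illustration:

```python
# Illustrative sketch: information gain of a binary term feature for feature selection.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(S) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(docs, labels, term):
    """IG(S, t) = H(S) - sum over values of t of |S_v|/|S| * H(S_v).
    Here t is binary: the term is either present in a document or absent."""
    with_term = [lab for doc, lab in zip(docs, labels) if term in doc]
    without_term = [lab for doc, lab in zip(docs, labels) if term not in doc]
    n = len(labels)
    remainder = sum(
        (len(part) / n) * entropy(part) for part in (with_term, without_term) if part
    )
    return entropy(labels) - remainder

docs = [{"bank", "loan"}, {"bank", "river"}, {"loan", "money"}, {"river", "water"}]
labels = ["finance", "nature", "finance", "nature"]
print(information_gain(docs, labels, "loan"))  # higher IG means a more informative term
```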

A. Classification Method

We have posed WSD as a text classification task in which the input to the classifier is a textual document in tokenized form and the output is the sense of each (ambiguous) token. The overall idea is illustrated as a block diagram in Figure 10. Machine learning offers a variety of algorithms that can be utilized for this purpose, including naïve Bayes and support vector machines (SVM); a brief introduction to these algorithms can be found in [56]. Several empirical studies have comparatively evaluated the effectiveness of such algorithms and found that the SVM has an advantage over other classifiers in terms of effectiveness. The reasons behind the success of the SVM are described below, after the introduction of naïve Bayes and the SVM.

Figure-10: Classification model

1. Naïve Bayes Classifier

Naïve Bayes approaches in text categorization fall into two groups: naïve and non-naïve. The naïve approach assumes feature independence, while the non-naïve approach considers dependence between features. Independence means that word order is irrelevant, i.e. the presence of one word does not affect the presence or absence of another. In the naïve Bayes (NB) classifier, text categorization is viewed as estimating the posterior probabilities of categories given documents, i.e. P(c_i | d_j): the probability that the j-th document, represented by the weight vector d_j = <q_{1j}, q_{2j}, ..., q_{|T|j}>, where q_{kj} is the weight of the k-th feature in the j-th document, belongs to class c_i. These posterior probabilities are estimated using Bayes' theorem:

P(c_i | d_j) = \frac{P(d_j | c_i) P(c_i)}{P(d_j)}    (3)

where P(c_i) is the prior probability that a randomly selected document belongs to class c_i, P(d_j) is the probability that a randomly chosen document has weight vector d_j, and P(d_j | c_i) is the probability of document d_j given class c_i. Estimating P(c_i | d_j) directly is intractable because estimating P(d_j | c_i) involves the very high dimensional vector d_j. To make the computation of P(d_j | c_i) tractable, it is assumed that the coordinates of the document vector are conditionally independent of each other. Under this assumption, the term P(d_j | c_i) can be estimated as:

P(d_j | c_i) = \prod_{k=1}^{|T|} P(w_{kj} | c_i)    (4)
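A minimal sketch of such a classifier for a single ambiguous word is shown below using scikit-learn's multinomial naïve Bayes over word counts; the two training contexts and sense labels are invented and do not come from the paper's corpus:

```python
# Illustrative naive Bayes sense classifier for one ambiguous word (e.g. "sona").
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

contexts = [
    "سونا بہت مہنگا ہو گیا ہے",   # context suggesting the "Gold" sense (hypothetical)
    "رات کو جلدی سونا اچھا ہے",   # context suggesting the "Sleep" sense (hypothetical)
]
senses = ["Gold", "Sleep"]

nb_model = make_pipeline(CountVectorizer(), MultinomialNB())
nb_model.fit(contexts, senses)
# Should favour the "Gold" sense here, since the word مہنگا occurs only in the Gold context.
print(nb_model.predict(["سونا مہنگا ہے"]))
```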

2. Support Vector Machine (SVM)

The SVM is based on the structural risk minimization principle [36], where the objective is to find a hypothesis that guarantees the lowest true error (the error of the hypothesis on an unseen instance drawn from the same distribution as the training data). Direct estimation of the true error is not possible unless the learner knows the true target concept; however, according to the structural risk minimization principle, the true error can be bounded using the training error and the complexity of the hypothesis space. The complexity of the hypothesis space is represented by the well-known Vapnik-Chervonenkis (VC) dimension [51]. The task underlying the SVM is to minimize the true error of the resulting hypothesis by efficiently controlling the VC dimension of the hypothesis space [37].

Figure 11 gives a geometrical illustration of the SVM for a binary classification problem. Although the figure shows a two-dimensional space with linearly separable data points, the formulation generalizes to high-dimensional spaces and to problems that are not linearly separable. It can be seen from the figure that many hyperplanes can separate the examples into two classes; the SVM algorithm finds the decision boundary with the maximum distance (margin) that best separates the positive examples from the negative examples in the n-dimensional space.

Figure-11: A prototypical problem for learning support vector machine

The problem of finding the maximum-margin boundary is mathematically stated as:

\min_{w,b} \; \langle w \cdot w \rangle
\text{subject to } y_i (\langle w \cdot x_i \rangle + b) \geq 1, \quad i = 1, \dots, l

where l denotes the number of training examples, x_i is the input vector and y_i is the desired output. For computational convenience, the problem is reformulated as:

L(w, b, \alpha) = \frac{1}{2} \langle w \cdot w \rangle - \sum_{i=1}^{l} \alpha_i \left[ y_i (\langle w \cdot x_i \rangle + b) - 1 \right]    (5)

where \alpha_i \geq 0 are Lagrange multipliers. This Lagrangian formulation is often termed the primal formulation. By differentiating Equation (5) with respect to w and b and substituting the resulting values back into Equation (5), we can formulate the problem in the so-called dual form:

L(w, b, \alpha) = \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{l} y_i y_j \alpha_i \alpha_j \langle x_i \cdot x_j \rangle    (6)

When instances are not linearly separable, a mathematical transformation (known as the kernel trick) is applied to the data prior to learning the decision boundary. The objective of the transformation is to map data instances into a higher-dimensional feature space; the instance x_i in the new feature space is referred to as \phi(x_i).


An essential computational operation of the dual formulation is the dot product between data instances (see Equation (6)), which in the higher-dimensional space becomes \phi(x_i) \cdot \phi(x_j). Rather than performing this computation explicitly, which would be inefficient, kernel functions provide a convenient way to compute it:

\phi(x_i) \cdot \phi(x_j) = K(x_i, x_j)    (7)

Depending on the choice of kernel function, the support vector machine has three popular variants: SVM with a linear (or no) kernel, with a polynomial kernel, and with a radial basis function kernel. For training SVMs, LIBSVM [56] is considered a standard toolkit; it can be used from within WEKA (a standard machine learning toolkit) via a wrapper. Several studies describe advantages of the SVM over other classifiers. The study in [57] of five text categorization methods (neural networks, k-nearest neighbours, linear least squares fit, support vector machine, and naïve Bayes) shows that the support vector machine is a good classifier. Their focus was on the robustness of these methods under skewed category distributions, and they found that the support vector machine and k-nearest neighbours perform comparably when the number of positive training examples per category is small (fewer than 10).
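The sketch below illustrates the three kernel variants mentioned above using scikit-learn's SVC (which wraps LIBSVM); the handful of training contexts and sense labels are placeholders rather than the paper's data:

```python
# Illustrative comparison of SVM kernel variants for a sense classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

contexts = [
    "deposited money in the bank", "the bank approved the loan",
    "sat on the bank of the river", "the river bank was muddy",
]
senses = ["finance", "finance", "river", "river"]

for kernel in ("linear", "poly", "rbf"):
    clf = make_pipeline(TfidfVectorizer(), SVC(kernel=kernel))
    clf.fit(contexts, senses)
    print(kernel, clf.predict(["opened a bank account"]))
```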

B. Evaluation Method and Metrics

In the text categorization process, the assessment of a document classifier is experimental rather than analytical, because experimental evaluation measures the effectiveness of the classifier, i.e. its ability to take the right decisions. Three measures are usually used to judge a classifier:

Training efficiency: the time consumed in building a classifier from a dataset/corpus.

Classification efficiency: the average time needed to classify a document using the classifier.

Effectiveness: the tendency of the classifier to make correct decisions.

Among these measures, effectiveness is considered the most important and is the most widely used [63]. To evaluate a classifier, it is applied to a test corpus in order to measure its effectiveness over previously unseen test documents; the test corpus consists of a set of documents not used in the training phase. Different measures are used for this purpose, but the most common are precision, recall, accuracy and the F-measure [64]. Precision can be regarded as the conditional probability that a random document d classified under category c_i truly belongs to that category. It reflects the classifier's ability to place documents in the correct category relative to all documents placed in that category, both correctly and incorrectly [68]:

Precision = \frac{TP_i}{TP_i + FP_i}

where TP_i (true positives) is the number of documents correctly classified under the category and FP_i (false positives) is the number of documents incorrectly assigned to it.


Recall is the probability that a document d_x that should be classified under category c_i is indeed assigned to it:

Recall = \frac{TP_i}{TP_i + FN_i}

Precision and recall are also combined to characterize the performance of the classifier; the combined formula is:

F_{\beta} = \frac{(\beta^2 + 1) \, P \, R}{\beta^2 P + R}

where P denotes precision and R denotes recall. The positive parameter \beta expresses the goal of the evaluation task: if precision is considered more important than recall, \beta tends towards zero, and if recall is considered more important than precision, \beta tends towards infinity. Normally \beta is set to 1 to give equal importance to precision and recall. Accuracy with respect to c_i is the percentage of correct classification decisions. In binary text categorization, accuracy is not an adequate measure because the two categories c_i and \bar{c}_i are usually unbalanced (i.e. one contains many more examples than the other) [63].

Recall (R) = (categories found and correct / total categories correct) * 100

Precision and recall are the fundamental measures used in evaluating search strategies. In a database there is a set of records relevant to the search topic, and records are assumed to be either relevant or not relevant (these measures do not allow for degrees of relevance); the actual retrieved set may not perfectly match the set of relevant records. To obtain a good overall characterization of a classifier, precision and recall are often used together [84]. The F-measure can be defined as the harmonic mean of precision and recall:

F\text{-measure} = \frac{2 P R}{P + R}

The F-measure depends on precision and recall: if both are high, the F-measure is also high. Its value lies in the interval [0, 1]. The standard F-measure evaluates the category-wise prediction ability of the classifier [61]; its result (one or more numerical values) provides a measured distance between the human classification and the output of the automatic classifier.
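For concreteness, a small sketch of computing these metrics over predicted sense labels with scikit-learn is given below; the gold and predicted labels are fabricated for illustration:

```python
# Illustrative computation of precision, recall and F1 for predicted word senses.
from sklearn.metrics import precision_recall_fscore_support

y_true = ["Gold", "Sleep", "Gold", "Sleep", "Gold"]   # annotated (gold-standard) senses
y_pred = ["Gold", "Gold",  "Gold", "Sleep", "Sleep"]  # senses predicted by a classifier

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro"   # beta defaults to 1, weighting P and R equally
)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```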

IV. EXPERIMENTATION AND RESULTS

This section presents the results in terms of the effectiveness of the classifiers for WSD. The flow of the experimentation is shown in Figure 12 as a block diagram. As described above, two algorithms (SVM and naïve Bayes) were analyzed for the classification task. To characterize the performance of the classifiers, a standard 5-fold cross-validation test was used; cross-validation is a model validation technique that evaluates how well the statistical results of a model generalize to an independent dataset. The diagram shows the flow of the experiments:


Figure-12: Experimentation and results flow
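A compact sketch of such a 5-fold comparison for one word's labelled contexts is given below; load_word_dataset is a hypothetical helper standing in for reading one word's documents and sense labels from the collection:

```python
# Illustrative 5-fold cross-validation comparing NB and SVM on one ambiguous word.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

contexts, senses = load_word_dataset("sona")  # hypothetical loader: texts and sense labels

for name, estimator in [("Naive Bayes", MultinomialNB()), ("SVM", LinearSVC())]:
    pipeline = make_pipeline(TfidfVectorizer(), estimator)
    scores = cross_val_score(pipeline, contexts, senses, cv=5, scoring="f1_macro")
    print(f"{name}: mean macro-F1 = {scores.mean():.3f}")
```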

As described above, feature selection is used to select the most informative features; it not only makes classifiers computationally efficient but also improves their effectiveness [54]. We used information gain as the feature selection method, following the recommendation of [67]. The results of the feature selection process for the SVM are shown in Figure 13.

Figure-13: No of features vs f-measure of SVM

Results of the feature selection process for naïve Bayes are shown in Figure 14. It can clearly be seen that, on the same dataset, the SVM is more stable than naïve Bayes.

(Figure 13 plots the F-measure against the number of selected features for the words Tar, Sona, Sar, Puri, Daraz, Mulk and Khalana. Figure 12 blocks: the tagged WSD dataset is loaded into the evaluation tool (WEKA), the SVM and NB classifiers are applied, precision, recall and F-measure are computed for each, followed by a comparative analysis.)


Figure-14: No of features vs f-measure of NB

The tasks were performed in WEKA, a popular machine learning toolkit that provides built-in functionality for the pre-processing tasks as well as classifier training algorithms [68]. Based on the proposed methodology, a prototype application was developed in Java, shown in Figure 15. Currently, the application performs two main tasks: building the classifier model from the training data, and predicting the senses of words in a given document. The prediction process operates on a single text document. On loading the document into the application, the system performs two tasks: it detects the presence of ambiguous words (i.e. words that have more than one sense) and predicts the sense of each ambiguous word. The first task is performed by maintaining a list of ambiguous words, while the second is achieved by maintaining a collection of models; in both collections, words and their corresponding models share the same indexes.
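The detection-then-identification flow can be sketched as follows; the lexicon contents and the load_model_for helper are hypothetical placeholders, and the actual prototype is a Java/WEKA application rather than this Python outline:

```python
# Illustrative two-stage flow: lexicon-based detection, then per-word sense prediction.
AMBIGUOUS_WORDS = ["سونا", "بس", "کل"]                      # detection lexicon (illustrative)
MODELS = {w: load_model_for(w) for w in AMBIGUOUS_WORDS}     # hypothetical per-word classifiers

def disambiguate(document_text):
    results = {}
    for word in document_text.split():                       # word-level tokenization
        if word in MODELS:                                   # stage 1: detection
            results[word] = MODELS[word].predict([document_text])[0]  # stage 2: identification
    return results
```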

Figure-15: Prototype system

V. CONCLUSION AND FUTURE WORK

In this research paper we have presented word sense disambiguation for Urdu text documents using machine learning. As discussed earlier, there are various issues related to Urdu language processing, including the lack of a standard Urdu corpus and the incompatibility of existing NLP tools with Urdu; for this reason, WSD for Urdu text documents has been implemented in this work using the SVM. We presented an analysis of two classifiers, the support vector machine and naïve Bayes, and on the basis of their results we chose the best features and developed a prototype system using the standard SVM library for Urdu text documents.

To validate the results, we used k-fold cross-validation, in which the training dataset is divided into k mutually exclusive subsets; iteratively, one of the k subsets is used for testing the classifier and the other k-1 subsets are used for training [69]. A 5-fold cross-validation test was conducted to validate the classifiers, and significant results were produced on the developed corpus.



The results show that our approach is effective for Urdu WSD. We believe that the proposed system will help in information retrieval and that our test collection will also help other researchers working on WSD for the Urdu language.

In future we want to extend the test collection with more ambiguous words, and we also want to move the prototype system from supervised learning towards unsupervised learning. In this way we can build an online resource that will help Urdu natural language processing, so that other researchers can easily find a corpus for Urdu language processing, which is not available today. Finally, a comparative analysis of various machine learning algorithms could be performed to analyze the impact of the classifier on WSD.

REFERENCES

[1] R. Navigli, “Word sense disambiguation: A survey”, ACM Computing Surveys, Vol. 41 No. 2, pp. 1-69, 2009.

[2] E. Palta, Word Sense Disambiguation. Mumbai: Kanwal Rekhi School of Information Technology, 2007.

[3] M. Sanderson, “Word Sense Disambiguation and Information Retrieval”, In Proc. of 17th annual international

ACM SIGIR conference on Research and development in information retrieval, pp. 142-151, New York, USA, 1994.

[4] J. E. Ellman, I. Klincke, and J. I. Tait, “Word Sense Disambiguation by Information Filtering and Extraction”,

Computers and the Humanities, Vol. 34, pp. 127-134, 2000.

[5] M. Carpuat and D. Wu, “Improving Statistical Machine Translation using Word Sense Disambiguation”, In Proc.

of 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural

Language Learning, pp. 61-72, Prague, 2007.

[6] W. Weaver, “Machine Translation of Languages”, Cambridge, Massachusetts: MIT Press. pp. 15–23, 1955.

[7] G. Pilania, C. C. Wang, X. Jiang, S. Rajasekaran, and R. Ramprasad, “Accelerating materials property predictions

using machine learning” Scientific Reports, Vol. 3, 2013.

[8] L. Guo, “Applying Data Mining Techniques in Property/Casualty Insurance”, in CAS 2003 Winter Forum, Data

Management, Quality, and Technology Call Papers and Ratemaking Discussion Papers, CAS. 2003.

[9] F. Sebastiani, “Machine Learning in Automated Text Categorization”, ACM Computing Surveys, Vol. 34, No. 1,

pp. 1-47, 2002.

[10] V. Gupta and G. S. Lehal, “A Survey of Text Mining Techniques and Applications”, Journal of Emerging

Technologies in Web Intelligence, Vol. 1, No. 1, pp. 60-76, 2009.

[11] T. Joachims, “A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization”, In Proc.

of 14th International Conference on Machine Learning, pp. 143-151, 1997.


[12] D. Blei, “Probabilistic Topic Models”, Communications of the ACM, Vol. 54, No. 4, pp. 77-84, 2012.

[13] M. Ikonomakis and S. Kotsiantis, “Text Classification Using Machine Learning Techniques”, WSEAS

Transactions on Computers, Vol. 4, No. 8, pp. 966-974, 2005.

[14] S. Niharika and V. S. Latha, “A Survey on Text Categorization”, International Journal of Computer Trends and

Technology, Vol. 3, No. 1, pp. 39-45, 2012.

[15] A. Naseer and S. Hussain, "Supervised Word Sense Disambiguation for Urdu Using Bayesian Classification", In Proc. of Conference on Language & Technology (CLT10), pp. 1-5, Pakistan, 2010.

[16] M. Hu and B. Liu, “Mining and summarizing customer reviews”, In Proc. of 10th ACM SIGKDD international

conference on Knowledge discovery and data mining, pp. 168-177, 2004.

[17] A. Fahrni, and M. Klenner, “Old wine or warm beer: Target-specific sentiment analysis of adjectives”, In Proc.

of Symposium on Affective Language in Human and Machine, pp. 60-63, 2008.

[18] S. M. Kim, and E. Hovy, “Determining the sentiment of opinions”, In Proc. of 20th international conference on

Computational Linguistics, Association for Computational Linguistics, pp. 13-67, 2004.

[19] A. M. Popescu and O. Etzioni, “Extracting product features and opinions from reviews”, Natural language

processing and text mining, pp. 9-28, 2007.

[20] J. Yi, T. Nasukawa, R. Bunescu, and W. Niblack, “Sentiment analyzer: Extracting sentiments about a given topic

using natural language processing techniques”, In Proc. of 3rd IEEE International Conference on Data Mining, pp.

427-434, 2003.

[21] T. T. Thet, J. C. Na, C. S. G. Khoo, and S. Shakthikumar, “Sentiment analysis of movie reviews on discussion

boards using a linguistic approach”, In Proc. of 1st International CIKM workshop on Topic-sentiment analysis for

mass opinion, pp. 81-84, 2009.

[22] M. Castillo, F. Real, J. Atserias, and G. Rigau, “The TALP Systems for Disambiguating”, In Proc. of International

Workshop on the Evaluation of Systems for the Semantic Analysis of Text, pp. 93-96, 2004.

[23] R. Navigli, and P. Velardi, “Structural semantic interconnections: a knowledge-based approach to word sense

disambiguation”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, No.7, pp. 1075-1086,

2005.

[24] Navigli, Roberto, and Simone Paolo Ponzetto, “SemEval-2007 task 07: coarse-grained English all-words task”,

In Proc. of 4th International Workshop on Semantic Evaluations, pp. 30-35, Stroudsburg, PA, USA, 2007.

[25] Agirre, E., Lopez de Lacalle, A., Soroa, A., “Knowledge-based WSD on Specific Domains Performing better

than Generic Supervised WSD”, In Proc. of 21st international joint conference on Artificial intelligence, pp. 1501-

1506, San Francisco, CA, USA, 2009.


[26] Navigli, Roberto, and Mirella Lapata., “An experimental study of graph connectivity for unsupervised word sense

disambiguation”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 32 No.4, pp. 678-692, 2010.

[27] Ponzetto, Simone Paolo, and Roberto Navigli, "Knowledge-rich word sense disambiguation rivaling supervised

systems," In Proc. of 48th annual meeting of the association for computational linguistics. Association for

Computational Linguistics, 2010.

[28] C. J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge Discovery, pp. 121-167, 1998.

[29] H. Schütze, "Automatic word sense discrimination," Computational linguistics, vol. 24.1, pp. 97-123, 1998.

[30] Roberto Navigli, "A quick tour of word sense disambiguation, induction and related approaches," Theory and

practice of computer science, pp. 115-129, 2012.

[31] Di Marco, Antonio, and Roberto Navigli, "Clustering and diversifying web search results with graph-based word

sense induction," Computational Linguistics, Vol. 39 No.3, pp. 709-754, 2013.

[32] Thanh Phong Pham and Wee Sun Lee and Hwee Tou Ng, "Word Sense Disambiguation with Semi-Supervised

Learning," In Proc. of National Conference on Artificial Intelligence, London, 2005.

[33] Resnik, Philip, and David Yarowsky, "A perspective on word sense disambiguation methods and their

evaluation," in Proceedings of the ACL SIGLEX workshop on tagging text with lexical semantics: Why, what, and

how, Washington, 1997.

[34] Gliozzo, Alfio Massimiliano, Bernardo Magnini, and Carlo Strapparava, "Unsupervised Domain Relevance

Estimation for Word Sense Disambiguation," EMNLP, 2004.

[35] Siva Reddy, "Word Sense Disambiguation Using Semantic Categories, Domain Information and Knowledge

Sources. Diss.," In International Institute of Information Technology, Hyderabad, 2010.

[36] McCarthy, Diana, et al., "Unsupervised acquisition of predominant word senses," Computational Linguistics, vol.

33.4, pp. 553-590, 2007.

[37] Mohammad, Saif, and Graeme Hirst, "Distributional measures of concept-distance: A task-oriented evaluation,"

in Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. Association for

Computational Linguistics, 2006.

[38] Roberto Navigli, "A quick tour of word sense disambiguation, induction and related approaches," Theory and

practice of computer science. Springer Berlin Heidelberg, pp. 115-129, 2012.

[39] Navigli, Roberto, and Simone Paolo Ponzetto, "BabelNet: The automatic construction, evaluation and application

of a wide-coverage multilingual semantic network," Artificial Intelligence 193, pp. 217-250, 2012.


[40] Martinez, David, Oier Lopez De Lacalle, and Eneko Agirre., "On the Use of Automatically Acquired Examples

for All-Nouns Word Sense Disambiguation," J. Artif. Intell. Res.(JAIR) 33, pp. 79-107, 2008.

[41] Asma Naseer, "Supervised Word Sense Disambiguation for Urdu Using Bayesian Classification," in Center for

Research in Urdu Language Processing, Lahore, Pakistan.

[42] T. Mitchell, "Machine Learning," McGraw-Hill, 1997.

[43] G. Forman, "An Experimental Study of Feature Selection Metrics for Text," Journal of Machine Learning

Research, pp. 1289-1305, 2003.

[44] A. Dasgupta and M. W. Mahoney, "Feature selection methods for text classification," in Proc. of 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, pp. 230-239, 2007.

[45] M. Rogati and Y. Yang, "High-Performing Feature Selection for Text Classification," in Proc. of CIKM, Virginia, USA, 2002.

[46] D. D. Lewis, Y. Yang, T. G. Rose, and F. Li, "RCV1: A New Benchmark Collection for Text Categorization Research," Journal of Machine Learning Research, vol. 5, pp. 361-397, 2004.

[47] C.-C. Chang and C.-J. Lin, "LIBSVM: A Library for Support Vector Machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, 2011.

[48] Y. Yang and X. Liu, "A re-examination of text categorization methods," in Proc. of 22nd ACM SIGIR International Conference on Research and Development in Information Retrieval (SIGIR-99), Berkeley, pp. 42-49, 1999.

[49] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," in Proc. of 10th European Conference on Machine Learning (ECML-98), pp. 137-142, 1998.

[50] D. W. Aha, D. Kibler, "Instance-based Learning Algorithms. Machine Learning," Kluwer Academic Publishers,

Boston. Manufactured in The Netherlands, pp. 37-66, 1991.

[51] E. Gabrilovich and S. Markovitch, "Text categorization with many redundant features: using aggressive feature

selection to make SVMs competitive with C4.5," In Proceedings of the ICML, the 21st International Conference on

Machine Learning, pp. 321-328, 2004.

[52] Milne, David, Olena Medelyan, and Ian H.Witten, "Mining Domain-specific thesauri from Wikipedia: A case

study," in Proceedings of the 2006 IEEE/WIC/ACM international conference on web intelligence. IEEE Computer

Society, 2006.

[53] Timothy. Weal, "Utilizing Wikipedia categories for document Classification," in Evaluation, 2006.

[54] R. R. Bouckaert, E. Frank, and M. A. Hall, "WEKA—Experiences with a Java Open-Source Project," Journal of Machine Learning Research, vol. 11, pp. 2533-2541, 2010.


[55] M. Hall and I. H. Witten, "The WEKA data mining software: an update," SIGKDD Explorations, pp. 10-18,

2009.

[56] M. Hall and I. H. Witten, "The WEKA Data Mining Software: An Update," SIGKDD Explorations, vol. 11, no. 1, 2009.

[57] "Data mining – Practical Machine Learning tools (2nd Ed.)," in Morgan Kaufmann Publisher, An imprint of

Elsevier, San Francisco, CA, 2005.

[58] Y. Yang and J. O. Pedersen, "A Comparative Study on Feature Selection in Text Categorization," in Proc. of 14th International Conference on Machine Learning (ICML '97), San Francisco, CA, USA, pp. 412-420, 1997.

[59] A. Rehman, M. Mustafa, I. Israr, and M. Yaqoob, “Survey of wearable sensors with comparative study of noise

reduction ECG filters”, International Journal of Computing and Network Technology, Vol. 1, No. 1, pp. 61-81, 2013.

[60] D. Anguita and L. Ghelardoni, "The ‘K’ in K-fold cross validation," in European Symposium on Artificial Neural

Networks, Computational Intelligence and Machine Learning, Bruges, Belgium, 2012.

[61] Dubey, Ajay, and Vasudeva Varma, "Generation of Bilingual Dictionaries using Structural Properties," 2013.

[62] K. Kamran, S. Afzal, M. Yaqoob, M. Sharif, “A Comparative Survey on Vehicular Ad-hoc Network (VANET)

Routing Protocol using Heuristic and Optimistic Techniques”, Research Journal of Information Technology, Vol. 6

No. 2, pp. 14-24, 2015.
