Word Sense Disambiguation for Urdu Text by
Machine Learning Syed Zulqarnain Arif1, Muhammad Mateen Yaqoob1, Atif Rehman2, and Fuzel Jamil1
1Department of Computer Science, COMSATS Institute of Information Technology, Abbottabad, Pakistan
2Department of CS & IT, University of Sargodha, Sargodha, Pakistan
Abstract- The problem underlying this research is to solve the word sense disambiguation problem for Urdu language text. Although the problem is well studied for English text, work on Urdu is still in its infancy. Since Urdu is a morphologically rich language, and its data consequently very sparse, there is no guarantee that methods that perform well for English will also be effective for Urdu. The goal of this research is to propose a method for word sense disambiguation of ambiguous Urdu words and to develop a prototype system. We have proposed a two-stage method in which the first stage is a detection phase and the second stage is an identification phase. The detection task is performed by maintaining a lexicon of ambiguous words. The identification task is posed as a classification problem in which a classifier is trained for each ambiguous word using a word sense disambiguation specific test collection. Two well-studied text classification algorithms have been analyzed for classification: support vector machine and naïve Bayes. The results show that the support vector machine has an advantage due to its intrinsic ability to handle sparse data and its lower sensitivity to the high dimensionality of the dataset.
Keywords: Machine Learning, Word Sense Disambiguation, Support Vector Machine, Naïve Bayes
I. INTRODUCTION
Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. Techniques for automatic annotation and content extraction from multimedia are also considered part of IE. The main goal of IE is to automatically extract entities from data on the web, find the relationships among entities, and describe the attributes of entities. The extracted knowledge enables much richer forms of queries over abundant unstructured sources than are possible with keyword searches alone. Another objective of IE is to represent unstructured text in a form useful for deriving inferences based on the logical content found in the input data.
Automatically assigning the correct sense to a word that has more than one meaning is referred to as word sense disambiguation (WSD). Many words have several meanings, only one of which is correct in a particular context, and WSD techniques were introduced to resolve this. Word sense disambiguation is the process of resolving the ambiguity and rectifying the use of a word in a specific frame of reference [1]. Humans are good at identifying the correct sense of a word easily with the help of context; WSD is the domain that deals with modeling this human ability in computers. It is frequently observed that some words have multiple meanings depending on their use as different parts of speech (POS). In the following examples, the same word (bear) gives us the sense of a verb or a noun in different sentences.
I bear (verb) him no malice.
We sell all shapes and sizes of teddy bear (noun).
I can't bear (verb) this stifling humidity.
Bear is used as a noun in the second sentence, where it denotes a type of animal, while in the other sentences bear is used as a verb meaning "to tolerate". Such ambiguous words can be resolved with the help of parts of speech. However, a more challenging WSD task is to disambiguate words that have the same part of speech but different meanings. For instance, "bank" in English can mean either the "raised edge of a riverside" or an "investment organization".
Ambiguous words are often distinguished as cases of polysemy and homonymy. Polysemy refers to words that have more than one related sense; for example, bank can be an organization or a riverside. On the other hand, homonymy is
International Journal of Computer Science and Information Security (IJCSIS), Vol. 14, No. 5, May 2016
738 https://sites.google.com/site/ijcsis/ ISSN 1947-5500
referred to words that are completely dissimilar in meaning but have identical spellings. For instance, "can" may be used as a modal verb (I can see) or as a container (She bought me an oil can). The difference between polysemy and homonymy is often unclear. Word sense disambiguation (WSD) is the technique of identifying the meaning of a word according to the context of the sentence [2]. WSD is normally a difficult task that requires considerable effort in every language, and dealing with the languages of the South Asian region poses additional, different challenges. As our language of interest is Urdu, we describe some relevant characteristics of the Urdu language in the paragraphs below.
There is no concept of capitalization in any of the South Asian languages, including Urdu, whereas it is a prominent feature of almost all European languages. English, for example, has capitalization, and it is very useful for identifying a name in an English phrase, since the first letter of a name in English must be written in capital. The absence of capitalization in Urdu therefore makes WSD harder, especially when a rule-based approach is used, since capitalization can be an important heuristic for disambiguating words with different parts of speech.
Urdu also contains many ambiguities when it comes to names. There are names that are used both as common and as proper nouns, and distinguishing proper nouns from common nouns is one of the biggest challenges in WSD. For example, برکت (barkat) and امین (amin) are names, but they can also be used as common nouns. For any technique, whether knowledge engineering or machine learning, language resources are very important, and WSD is no exception: one needs a collection of documents with labeled categories, such as [20]. Such a collection is often referred to as a test collection and is not available for the Urdu language. Like other languages, Urdu is ambiguous: just as English has many words with more than one meaning (e.g., "bank" as a financial bank or a river bank), in Urdu "sona" is an ambiguous word meaning gold or sleep.
Nowadays a huge amount of Urdu data is available on the internet in the form of blogs, forums, news websites, social media, etc. The situation was different a few years ago, when Urdu news websites scanned pieces of newspapers and displayed the images; today many of those websites use digital text instead of scanned documents. This trend creates opportunities and challenges for gathering and processing Urdu text so as to better organize it, retrieve it, and extract information from it. Since WSD is an important task for machine translation, there is a need to develop systems that retrieve, extract, classify, and handle Urdu text data as is done for English and other European languages. Considering this development, the goal underlying this research is to apply state-of-the-art word sense disambiguation methods to Urdu text documents in order to make information retrieval and extraction from Urdu documents more efficient.
The problem underlying this research paper is to solve the word sense disambiguation problem for Urdu language text. It deals with the question of how to identify the correct senses of ambiguous Urdu words. Although the problem is well studied for English text, the applicability of those methods has not been studied much for Urdu. Since Urdu is a morphologically rich language, and its data consequently very sparse, there is no guarantee that methods that perform well for English will also be effective for Urdu. We have proposed a two-stage method, shown in Figure 1, in which the first stage is a detection phase and the second stage is an identification phase. The detection task is performed by maintaining a lexicon of ambiguous words. The identification task is posed as a classification problem in which a classifier is trained for each ambiguous word using a word sense disambiguation specific test collection.
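The detection phase amounts to a lookup of each token against the lexicon of ambiguous words. A minimal sketch in Python (the lexicon entries and the example sentence are hypothetical placeholders, not the paper's actual resources):

```python
# Detection phase: flag tokens that appear in a lexicon of ambiguous words.
# The lexicon below is a tiny illustrative stand-in (romanized Urdu words).
AMBIGUOUS_LEXICON = {"sona", "kal", "bus"}

def detect_ambiguous(tokens):
    """Return the positions of tokens found in the ambiguity lexicon."""
    return [i for i, tok in enumerate(tokens) if tok.lower() in AMBIGUOUS_LEXICON]

tokens = "main kal sona kharidon ga".split()
print(detect_ambiguous(tokens))  # [1, 2] -> 'kal' and 'sona' are ambiguous
```

Each flagged position would then be handed to the identification phase, which applies the word-specific classifier for that token.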
Figure-1: Proposed approach (input data is passed to the detection phase, which identifies ambiguous words using a lexicon, and then to the identification phase, which identifies the sense of each word using word-specific models)
II. RELATED WORK
In this section we describe previous work related to this research. The related work is organized into two categories: traditional knowledge-based approaches and machine learning based approaches. The machine learning based approaches are further organized into three classes: supervised, unsupervised, and semi-supervised methods.
A. Knowledge Based Approach for WSD
Text categorization began appearing in the early 1960s, and by the early 1990s it had become a major field in the information systems discipline. It is now applied in many settings, such as email spam filtering and website classification [10]. The most popular approach to text categorization in the late 1980s was knowledge engineering (KE), in which sets of rules are defined manually by human experts to classify text documents into a predefined set of categories. In the 1990s KE lost its popularity and the machine learning (ML) approach became dominant. In the ML approach, the classifier is built automatically from a set of pre-classified documents using ML techniques. The advantages of the ML approach over KE are:
Accuracy: the accuracy of the ML approach is comparable with that of the KE approach.
Cost: KE is costlier than ML, since knowledge engineers and domain experts are involved in building the classifier, while ML needs much less human interaction to build a classifier [10].
Knowledge-based methods mainly try to avoid the need for the large amounts of training material required by supervised methods. They can be classified according to the type of resource they use: 1) machine-readable dictionaries; 2) thesauri; or 3) computational lexicons or lexical knowledge bases.
1. Dictionary Based Approach
Dictionary based approaches rely on an already built dictionary that holds the sentiment polarities of words, such as WordNet-Affect or SentiWordNet. SentiWordNet [21] is a lexical tool that parses the words within an opinion and searches for their match in the maintained dictionary. Based on the matches, words are assigned scores for being positive or negative. The positive and negative scores of all the words within an opinion are then accumulated separately, and the final decision is made on the accumulated score: if the positive score is greater, the opinion is regarded as positive, and negative otherwise.
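The score-accumulation decision described above can be sketched in a few lines. The lexicon entries and scores below are made-up illustrations, not SentiWordNet values:

```python
# Dictionary-based polarity decision: each word maps to a hypothetical
# (positive_score, negative_score) pair; scores are summed per opinion.
LEXICON = {"good": (0.8, 0.0), "bad": (0.0, 0.7), "great": (0.9, 0.1)}

def classify_opinion(tokens):
    """Accumulate positive and negative scores and pick the larger side."""
    pos = sum(LEXICON.get(t, (0.0, 0.0))[0] for t in tokens)
    neg = sum(LEXICON.get(t, (0.0, 0.0))[1] for t in tokens)
    return "positive" if pos > neg else "negative"

print(classify_opinion(["a", "great", "phone"]))  # "positive"
```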
These words or sets of words (opinions) are passed to a bootstrapping process using WordNet [21], an algorithm that uses WordNet for counting the words and assigning the polarities. This technique is easy and efficient and produces good results, but it has the disadvantage that no efficient method is known for dealing with context dependent words. For instance, the words "large", "small", "cold", and "hot" can point toward a positive or a negative opinion depending on the product context and feature. The approach was used to generate a quick summary of opinions (i.e., the percentage of positive or negative opinions) related to different features of products [22]. The overall system is shown as a block diagram in Figure 2. After extracting opinions, part-of-speech tagging is employed to filter irrelevant words, since adjectives are considered the relevant words; removing irrelevant words is referred to as feature pruning in the block diagram. In the next phase, the system considers infrequent features for removal. Finally, the system determines the orientation (positive or negative) of each opinion sentence. A bootstrapping technique is used to obtain the semantic orientation of each opinion word using WordNet [23], and the semantic orientation of the whole sentence is then determined by the dominant orientation of the opinion words in the sentence. Improvements on these methods are proposed in [24], [25] using more refined and sophisticated tools, such as the Sentiment Analyzer [26]. Another work on the dictionary based approach was performed in the context of movie recommendations [27].
Figure-2: Summary system
2. Machine-Readable Dictionaries (MRDs)
Machine-readable dictionaries provide a ready-made source of information about word senses and knowledge about the world, which can be very useful for WSD and natural language understanding. Since the work of Lesk (1986), many researchers have used MRDs as a structured source of lexical knowledge for WSD systems. However, MRDs contain inconsistencies and are created for human use rather than for machine exploitation; much of the knowledge in a dictionary becomes really useful only when a complete WSD process is performed on the whole definitions [28] (see also the results of the Senseval-3 task devoted to disambiguating WordNet glosses [29]).
WSD techniques using MRDs can be classified according to: the lexical resource used (monolingual or bilingual MRDs); the MRD information exploited by the method (words in definitions, semantic codes, etc.); and the similarity measure used to relate words from the context to MRD senses.
The Lesk Algorithm
Words are disambiguated by finding the pair of dictionary senses with the greatest word overlap between their dictionary definitions. For example, when disambiguating the words in "pine cone", the definitions of the compatible senses both contain the words "evergreen" and "tree" in one dictionary. Figure 3 gives an example of the Lesk algorithm.
Figure-3: Lesk algorithm
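The overlap idea can be sketched as a simplified Lesk procedure that scores each candidate sense by the number of words its gloss shares with the context. The glosses below are illustrative, not taken from a real dictionary:

```python
# Simplified Lesk: choose the sense whose gloss has the largest word
# overlap with the context of the ambiguous word.
def simplified_lesk(word, context, glosses):
    """glosses: dict mapping sense name -> gloss string."""
    ctx = set(context.lower().split())
    best, best_overlap = None, -1
    for sense, gloss in glosses.items():
        overlap = len(ctx & set(gloss.lower().split()))
        if overlap > best_overlap:
            best, best_overlap = sense, overlap
    return best

glosses = {
    "pine_tree": "evergreen tree with needle shaped leaves",
    "pine_longing": "feel a lingering longing for something",
}
print(simplified_lesk("pine", "the evergreen tree in the yard", glosses))
```

Here the "evergreen tree" context shares two words with the first gloss and none with the second, so the tree sense wins.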
A related family of approaches considers the global relatedness of senses, computing the semantic similarity of each pair of word senses against a provided lexical knowledge base, e.g., WordNet. Graph-based methods were applied in the early days of this line of research with some success, and more recent graph-based approaches have been shown to perform as well as supervised methods [30], or even better on particular domains [31], [32].
B. Machine learning approach for WSD
The machine learning domain offers a number of algorithms that can learn concepts and patterns from datasets. These algorithms are commonly divided into three classes: supervised, unsupervised, and semi-supervised. In supervised learning, the main requirement is an annotated dataset in which the ambiguous words are labeled. The task can be seen as an instance of text categorization, in which an algorithm is trained to classify text documents. A number of trainable supervised learning algorithms are available for this purpose, including the support vector machine and naïve Bayes. After training, the algorithm produces a hypothesis (a function) that can output the sense of an ambiguous word. A graphical view of the machine learning approach is presented in Figure 4.
Figure-4: View of learning method to WSD
1. Supervised Learning
The study of algorithms that learn concepts under supervision (e.g., by being given a dataset together with the correct outputs) is called supervised learning (SL). For the WSD problem, the standard way to provide supervision is an annotated dataset, i.e., a collection of documents that have been manually tagged by a person according to word sense. The workflow of supervised learning is shown in Figure 5, and the method itself in Figure 6.
Figure-5: Work flow of supervised learning (learning methods divide into supervised, semi-supervised, and unsupervised learning)
Figure-6: Supervised learning method (known data and known responses are used to build a model; the model is then applied to new data to produce predicted responses)
Supervised learning is the best studied technique for the WSD problem. It includes the support vector machine (SVM) [36], Bayesian classification [20], and decision trees [37], among others; all of these are variants of the supervised learning approach.
2. Unsupervised Learning
In contrast to supervised learning, unsupervised learning relies on clustering the training data to find groups related to different senses. The groups can be used to approximate densities or the proximity of unknown words (with unknown senses). Unsupervised learning may also be used to reduce the dimensionality of the feature space, which in turn may be used to train supervised algorithms. The unsupervised learning method is shown in Figure 7 below.
Unsupervised methods are an active line of inquiry for WSD researchers. The fundamental hypothesis is that the same sense occurs in similar contexts, so senses can be induced from text by grouping word occurrences according to the similarity of their contexts [38]. This is called word sense induction. Afterwards, new occurrences of the word can be assigned to the closest induced cluster. It has been shown that word sense induction improves web search result grouping by increasing the quality of result clusters and the diversification of result lists [39], [40].
Figure-7: Unsupervised learning method
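The induction idea, grouping occurrences of an ambiguous word by the similarity of their contexts, can be sketched with a greedy clustering over context-word overlap. The contexts and the similarity threshold below are illustrative assumptions, not the method of [38]:

```python
# Minimal word-sense-induction sketch: greedily cluster occurrences of an
# ambiguous word by Jaccard similarity of their context word sets.
def induce_senses(contexts, threshold=0.2):
    clusters = []  # each cluster: (set of context words, member indices)
    for i, ctx in enumerate(contexts):
        words = set(ctx)
        for bag, members in clusters:
            if len(words & bag) / len(words | bag) >= threshold:
                bag |= words        # grow the cluster's word bag in place
                members.append(i)
                break
        else:
            clusters.append((words, [i]))
    return [members for _, members in clusters]

# Contexts of 'sona': two gold-market uses, one sleep use.
contexts = [["gold", "price"], ["gold", "market"], ["bed", "night"]]
print(induce_senses(contexts))  # [[0, 1], [2]]
```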
3. Semi-Supervised Learning
One disadvantage of the supervised learning approach is the preparation of a large test collection. With machine learning, it is generally true that the more data is available, the higher the effectiveness of the method. To deal with the cost of collecting large datasets, an alternative is to adopt semi-supervised learning (SSL). The main idea of the approach is to initially train an algorithm with a small annotated dataset and then improve this undertrained model with unannotated data. Although the objective of SSL is close to that of supervised learning, the goal is to learn a concept from few tagged and many untagged documents. SSL techniques are still not mature; bootstrapping is one of the popular techniques used for SSL. The general framework of SSL [41] is given in Figure 8.
Figure-8: Frame work of semi-supervised learning
Supervised learning is one extreme, in which the whole dataset is annotated with the WSD concept. The other extreme is unsupervised learning (USL), in which a completely unannotated dataset is used; this saves the time of tagging the dataset. Clustering is the best known USL approach. Such techniques are based on lexical resources (e.g., WordNet) and on lexical patterns and statistics computed over a large unannotated corpus.
C. WSD for Urdu Language
Some work has been done on WSD for the Urdu language. Urdu, being spoken by more than 100 million people, is an important language. A preliminary step toward resolving the ambiguity of words in an Urdu context used Bayesian classification. In the early efforts at disambiguating word senses, dictionaries in the form of hard copies were used; those efforts performed well on the training data, but unfortunately their performance degraded at larger scale. In 1986 an electronic dictionary was used for WSD and was valid for larger contexts; in this methodology, senses were recognized using the sense definitions in the dictionary, and the method became well known. A dictionary based approach was also employed by Walker, which chooses the sense of a word by recognizing the semantic class of the word.
III. RESEARCH METHODOLOGY
The purpose of this section is to describe our proposed methodology for Urdu WSD using the support vector machine. The design of a prototype system using the proposed methodology is also presented. The methodology is shown as a block diagram in Figure 9. The main steps include pre-processing (i.e., tokenization, stop-word removal, and feature weighting), feature selection, classifier training, and evaluation. These steps are briefly described in the following sections.
Figure-9: Research methodology
One of the known shortcomings in Urdu natural language processing is the non-availability of corpora. Although some corpora have been built, they are not readily available and are specific to tasks such as part-of-speech tagging and named entity recognition. Since no data collection is available for testing and developing WSD methods, we have collected our own collection. A test collection is a collection of documents to which human indexers have assigned category names from a predefined set [51]; it is the most vital resource for text categorization. We have taken 21 Urdu words that have different senses depending on the context in which they are used. The list of words is shown in Table 1.
Table-1: Words and corresponding senses
For each word, we collected text documents from different online resources (for example, Voice of America, BBC Urdu, and Express). The distribution of documents per word is shown in Table 2. To annotate documents with the predefined word senses, we employed the services of three Urdu teachers from the Department of Urdu, University of Sargodha. The inter-annotator agreement between these annotators is estimated at 0.99.
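Pairwise agreement of this kind is commonly quantified with Cohen's kappa, which corrects raw agreement for chance. The paper does not state which agreement statistic was used, so the following is only one plausible sketch, on made-up annotations:

```python
from collections import Counter

# Cohen's kappa for two annotators over the same items:
# kappa = (observed agreement - chance agreement) / (1 - chance agreement)
def cohens_kappa(a, b):
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n          # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb[k] for k in ca) / (n * n)       # chance agreement
    return (po - pe) / (1 - pe)

ann1 = ["gold", "gold", "sleep", "gold", "sleep"]
ann2 = ["gold", "gold", "sleep", "sleep", "sleep"]
print(round(cohens_kappa(ann1, ann2), 3))
```

With three annotators, the pairwise kappas would typically be averaged.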
Table-2: Document distribution over words
Word      Documents      Word       Documents
Bajana    73             Kal        50
Bhagna    47             Kamar      50
Bus       50             Qasam      50
Dana      51             Kalna      50
Daraz     45             Kon        50
Deya      43             Lot        50
Jaha      49             Mlk        62
Par       47             puri       50
sar       50             sona       53
tar       45             balay      31
soo       48
As discussed in the text categorization process, a few pre-processing steps are required to represent the textual data in a proper format for applying machine learning algorithms. The steps include tokenization, stop-word removal, and feature weighting. To tokenize documents, a number of possibilities are available, including phrase-level tokenization, sentence-level tokenization, and word-level tokenization. However, word-level tokenization is empirically found to be
Word      Senses                                   Word       Senses
Bajana    1. Knock  2. Clock-strike  3. Play       Kal        1. Tomorrow  2. Total
Bhagna    1. Avoid  2. Elope  3. Escape  4. Run    Kamar      1. Back  2. Name
Bus       1. Bus  2. Enough                        Qasam      1. Type  2. Swear
Dana      1. Grain  2. Intelligent                 Kalna      1. To eat  2. To play
Daraz     1. Draw  2. Height                       Kon        1. Ice-cream  2. Shape  3. Who
Deya      1. Give  2. Oil lamp                     Lot        1. Come back  2. Loot
Jaha      1. Where  2. World                       Mlk        1. Cast  2. Country  3. Milk
Par       1. Upon  2. Wings                        puri       1. Dish  2. Whole world
sar       1. Head  2. Sir                          sona       1. Sleep  2. Gold
tar       1. Letter  2. Wire                       balay      1. Call him  2. Ghosts
soo       1. Sleep  2. 100 rupees
more effective because it is statistically sounder [10]; therefore, we have adopted the word-level tokenization method. One key issue in text categorization is the curse of dimensionality, which not only degrades the effectiveness of classifiers but also makes some classifiers unscalable [52], [53]. The key idea for dealing with this issue is to reduce the size of the feature set, because not all features are good for classification: there are many uninformative words, such as auxiliary verbs, conjunctions, articles, and function words (which exhibit similar frequencies over all classes), that are better removed. These words are collectively known as stop words. Besides removing stop words, more sophisticated methods include feature selection and feature extraction [53]. While feature selection methods select a subset of the existing features, feature extraction methods reduce the dimensionality of the feature space by projecting it onto a reduced-dimensional space. Finally, we have adopted the standard TFIDF (term frequency inverse document frequency) measure for assigning weights to features. There are two key requirements behind feature selection: first, the selection should be performed automatically, and second, the selection should produce a minimal set of informative words that improves the effectiveness of the classifiers [54]. To meet this objective, several feature selection techniques have been proposed in the literature; the most widely used are information gain, gain ratio, mutual information, and chi-square [54]. Various studies have been conducted to compare feature selection methods, e.g., [54], and as a result of these empirical studies, information gain has ranked highly among them. Following these recommendations, we have also employed information gain in our work. The method is briefly described in the following subsection.
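The TFIDF weighting adopted above can be sketched over a toy tokenized corpus as follows. The documents are illustrative, and real systems typically add smoothing and normalization variants:

```python
import math
from collections import Counter

# TF-IDF: weight = term frequency * log(N / document frequency).
def tfidf(corpus):
    """corpus: list of token lists. Returns one {term: weight} dict per doc."""
    n = len(corpus)
    df = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weights

corpus = [["sona", "gold", "market"], ["sona", "sleep", "night"]]
w = tfidf(corpus)
# "sona" appears in every document, so its IDF (and thus its weight) is zero,
# while sense-bearing words like "gold" get positive weight.
```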
Information Gain (IG)
Information gain is commonly employed as a term-goodness criterion in ML. It is a quantitative measure, often used in text categorization to find the worthiness of a feature, i.e., how well the single feature separates documents according to the target classification [55]. IG is often defined with the help of entropy. Entropy is a measure commonly used in information theory and can be understood as a measure of the (im)purity of a dataset. For example, in a binary classification problem where S is a dataset containing some positive and some negative examples (e.g., a document either is spam or is not), the entropy H of S is measured as:

H(S) = -p⊕ log2 p⊕ - p⊖ log2 p⊖ (1)

In the above equation, p⊕ denotes the ratio of positive examples and p⊖ the ratio of negative examples in the dataset S. If the dataset is partitioned according to a feature, then the IG of the feature is measured as the expected reduction in entropy. Equation 2 measures the IG of a feature t relative to a dataset S, denoted IG(S, t):

IG(S, t) = H(S) - Σ v∈val(t) (|Sv| / |S|) H(Sv) (2)

where val(t) represents the set of all values of feature t and Sv is the subset of S in which feature t has value v.
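The entropy and information-gain computation described above can be sketched directly in Python for a toy binary-labeled dataset (the feature and labels are illustrative):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a list of class labels (Equation 1 generalized)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(dataset, feature):
    """dataset: list of (feature_dict, label) pairs. IG of one feature."""
    labels = [y for _, y in dataset]
    base = entropy(labels)
    by_value = {}
    for x, y in dataset:                      # partition S by feature value
        by_value.setdefault(x[feature], []).append(y)
    remainder = sum(len(sub) / len(dataset) * entropy(sub)
                    for sub in by_value.values())
    return base - remainder

data = [({"has_gold": 1}, "gold"), ({"has_gold": 1}, "gold"),
        ({"has_gold": 0}, "sleep"), ({"has_gold": 0}, "sleep")]
print(info_gain(data, "has_gold"))  # 1.0: the feature fully separates senses
```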
A. Classification Method
We have posed WSD as a text classification task in which the input to the classifier is a textual document in tokenized form and the output is the sense of each ambiguous token (i.e., word). The overall idea is illustrated in the form of a block diagram as shown
in Figure 10. Machine learning offers a variety of algorithms that can be utilized for this purpose, including naïve Bayes and the support vector machine (SVM); a brief introduction to these algorithms can be found in [56]. Several empirical studies have comparatively evaluated the effectiveness of these algorithms and found that SVM has an advantage over other classifiers in terms of effectiveness. There are several reasons behind the success of SVM, described below after the introduction of naïve Bayes and SVM.
Figure-10: Classification model
1. Naïve Bayes Classifier
Naïve Bayes approaches to text categorization fall into two groups: naïve and non-naïve. The naïve approach assumes feature independence, while the non-naïve approach considers dependence between features. Independence means that word order is irrelevant (i.e., the presence of one word does not affect the presence or absence of another). In the naïve Bayes (NB) classifier, text categorization is viewed as estimating the posterior probabilities of categories given documents, i.e., P(ci|dj): the probability that the jth document (represented by a weight vector dj = <q1j, q2j, ..., q|T|j>, where qkj is the weight of the kth feature in the jth document) belongs to class ci. These posterior probabilities are estimated using Bayes' theorem as:

P(ci|dj) = P(dj|ci) P(ci) / P(dj) (3)

where P(ci) is the prior probability that an arbitrarily (randomly) selected document belongs to class ci, P(dj) is the probability that an arbitrarily chosen document has weight vector dj, and P(dj|ci) is the probability of document dj given class ci. However, the estimation of P(ci|dj) is intractable because estimating P(dj|ci) involves the very high dimensional vector dj. To make the computation of P(dj|ci) tractable, it is assumed that the document vector coordinates are conditionally independent of each other. Under this assumption, the term P(dj|ci) can be estimated as:

P(dj|ci) = ∏ k=1..|T| P(wkj|ci) (4)
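The estimation in Equations 3 and 4 can be sketched as a small multinomial naïve Bayes classifier with add-one smoothing. The training documents below are toy stand-ins for the two senses of "sona"; a real system would use the larger weighted feature representation described in the methodology:

```python
import math
from collections import Counter, defaultdict

# Multinomial naïve Bayes: argmax_c log P(c) + sum_w log P(w|c),
# with add-one (Laplace) smoothing over the vocabulary.
class NaiveBayes:
    def fit(self, docs, labels):
        self.priors = Counter(labels)
        self.counts = defaultdict(Counter)
        self.vocab = set()
        for doc, y in zip(docs, labels):
            self.counts[y].update(doc)
            self.vocab.update(doc)
        return self

    def predict(self, doc):
        def score(c):
            total = sum(self.counts[c].values()) + len(self.vocab)
            return (math.log(self.priors[c]) +
                    sum(math.log((self.counts[c][w] + 1) / total) for w in doc))
        return max(self.priors, key=score)

docs = [["gold", "price", "market"], ["night", "bed", "tired"]]
nb = NaiveBayes().fit(docs, ["gold", "sleep"])
print(nb.predict(["gold", "market"]))  # "gold"
```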
2. Support Vector Machine (SVM)
SVM is based on the idea known as the structural risk minimization principle [36], in which the objective is to find a hypothesis that can guarantee the lowest true error (the error of the hypothesis when classifying an unseen instance drawn from
the same distribution as the training data). Direct estimation of the true error is not possible unless the learner knows the true target concept. However, according to the structural risk minimization principle, the true error can be bounded using the training error and the complexity of the hypothesis space. The complexity of a hypothesis space is represented using the well known Vapnik-Chervonenkis (VC) dimension [51]. The task underlying SVM is to minimize the true error of the resulting hypothesis by efficiently controlling the VC dimension of the hypothesis space [37].
Figure 11 gives a geometrical illustration of SVM for a binary classification problem. Although the figure shows a two-dimensional space with linearly separable data points, the idea generalizes to high dimensional spaces and to problems that are not linearly separable. It can be seen from the figure that there can be many hyperplanes separating the examples into two classes; the SVM algorithm finds the decision boundary (i.e., the boundary with maximum distance, or margin) that best separates the positive examples from the negative examples in the n-dimensional space.
Figure-11: A prototypical problem for learning support vector machine
The problem of finding the maximum-margin boundary is mathematically outlined as:

minimize w,b <w·w>
subject to yi (<w·xi> + b) ≥ 1, i = 1, ..., l

where l denotes the number of training examples, xi is the input vector, and yi is the desired output. For computational convenience, the problem is reformulated as:

L(w, b, α) = (1/2) <w·w> - Σ i=1..l αi [yi (<w·xi> + b) - 1] (5)

where αi ≥ 0 are Lagrange multipliers. This Lagrangian formulation is often termed the primal formulation. By differentiating Equation 5 with respect to w and b and substituting their values back into Equation 5, we can formulate the problem in the so-called dual form:

L(w, b, α) = Σ i=1..l αi - (1/2) Σ i,j=1..l yi yj αi αj <xi·xj> (6)
When instances are not linearly separable, a sophisticated mathematical transformation (known as the kernel trick) is applied to the data prior to learning the decision boundary. The objective of the transformation is to map data instances into a higher dimensional feature space; the instance xi in the new feature space is referred to as φ(xi). An essential
computational operation of the dual formulation is the dot product between data instances (notice from
Equation 6), which in the higher-dimensional space is computed as φ(x_i) · φ(x_j).
Rather than carrying out this computationally inefficient explicit mapping, kernel functions provide a convenient way to obtain the same value directly:

\phi(x_i) \cdot \phi(x_j) = K(x_i, x_j) \qquad (7)
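As a small numerical illustration of Equation 7 (a hypothetical sketch, not taken from the paper's experiments), the degree-2 polynomial kernel K(x, z) = ⟨x · z⟩² corresponds to the explicit feature map φ(x) = (x₁², x₂², √2·x₁x₂), and both routes give the same dot product:

```python
import numpy as np

def phi(x):
    # Explicit degree-2 feature map for a 2-D input:
    # phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2)
    x1, x2 = x
    return np.array([x1**2, x2**2, np.sqrt(2) * x1 * x2])

def poly_kernel(x, z):
    # Degree-2 polynomial kernel, computed in the ORIGINAL space
    return np.dot(x, z) ** 2

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

explicit = np.dot(phi(x), phi(z))   # dot product in the mapped space
via_kernel = poly_kernel(x, z)      # same value, no explicit mapping needed

print(explicit, via_kernel)
```

The kernel evaluation never materializes the higher-dimensional vectors, which is what makes the trick attractive when the feature space is very large.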
Depending on the choice of kernel function, Support Vector Machine has three popular variants: SVM with a linear
(or no) kernel, with a polynomial kernel, and with a radial basis function kernel. For training SVMs, LIBSVM [56] is
considered a standard toolkit; it can be utilized from WEKA (a standard machine learning toolkit) through a
wrapper function. SVM also has documented advantages over other classifiers: a study [57] of five text categorization
methods (Neural Networks, k-Nearest Neighbours, Linear Least Squares Fit, Support Vector Machine, and Naïve
Bayes) shows that Support Vector Machine is a good classifier. Its focus was on the robustness of these methods
over skewed category distributions, and it found that Support Vector Machine and k-Nearest Neighbour perform
comparably when the number of positive training examples per category is small (less than 10).
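The paper's experiments train SVMs with LIBSVM through WEKA; purely as an illustrative sketch (on invented toy data, not the Urdu dataset), the maximum-margin objective can also be minimized directly with a subgradient method on the regularized hinge loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary problem: positives around (2, 2), negatives around (-2, -2)
X = np.vstack([rng.normal(2.0, 0.5, (20, 2)), rng.normal(-2.0, 0.5, (20, 2))])
y = np.array([1.0] * 20 + [-1.0] * 20)

# Batch subgradient descent on the regularized hinge loss:
#   min_w  (lam/2)||w||^2 + (1/n) sum_i max(0, 1 - y_i <w, x_i>)
lam, w = 0.01, np.zeros(2)
for t in range(1, 501):
    eta = 1.0 / (lam * t)                      # decaying step size
    viol = y * (X @ w) < 1                     # points violating the margin
    grad = lam * w - (y[viol, None] * X[viol]).sum(axis=0) / len(y)
    w -= eta * grad

accuracy = (np.sign(X @ w) == y).mean()
print("training accuracy:", accuracy)
```

This is only a sketch of the hinge-loss objective; production toolkits such as LIBSVM solve the dual problem with far more careful numerics and support the kernel variants discussed above.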
B. Evaluation Method and Metrics
In text categorization, the assessment of a document classifier is experimental rather than analytical:
experimental evaluation measures the effectiveness of the classifier, i.e. its aptitude for
taking the right decisions. Three measures are usually used to evaluate a classifier:
Training efficiency: the time consumed in building a classifier from a dataset/corpus.
Classification efficiency: the average time needed to classify a document using the trained classifier.
Effectiveness: the tendency of the classifier to make correct decisions.
Among these measures, effectiveness is considered the most important and is the most widely used [63]. To evaluate a
classifier, it is applied to a test corpus, a set of documents not used in the training/learning phase, so that its
effectiveness over previously unseen documents can be measured. Different measures are used for this, the most
common being precision, recall, accuracy and F-measure [64]. Precision can be viewed as the conditional probability
that a random document d classified under category c_i indeed belongs to that category. It reflects the classifier's
ability to place a document under the correct category, relative to all documents placed in that category, both
correctly and incorrectly [68]:

Precision = TP_i / (TP_i + FP_i)

where TP_i (true positives) is the number of documents correctly classified under the category and FP_i (false
positives) is the number of documents incorrectly classified under it.
Recall is the probability that a document d that should be classified under category c_i actually is:

Recall = TP_i / (TP_i + FN_i)
Precision and recall are often combined to characterize the performance of the classifier with a single number:

F_β = (β² + 1) · P · R / (β² · P + R)

where P denotes precision and R denotes recall. β is a positive parameter expressing the target of the evaluation
task: as β approaches zero, precision is weighted as more important than recall, and as β approaches infinity,
recall dominates. Normally β is set to 1 to give equal importance to both precision and recall. Accuracy with
respect to c_i is the percentage of correct classification decisions. In binary text categorization accuracy is not
an adequate measure, because the two categories c_i and its complement are typically unbalanced (i.e. one contains
many more examples than the other) [63].
Precision and recall are also the fundamental measures used in evaluating search strategies. In a database there is a
set of records related to the search topic; records are assumed to be either relevant or not (these measures do not
allow for degrees of relevancy), and the actual retrieved set may not perfectly match the set of relevant records. To
get a good overall picture of classifier performance, precision and recall are therefore often used together [84]:
the F-measure is defined as the harmonic mean of precision and recall.
F-measure = 2 · P · R / (P + R)
The F-measure depends directly on precision and recall: if both are high, the F-measure is high as well, and its value
lies in the interval [0, 1]. The formula above is the standard F-measure, which evaluates the category-wise prediction
abilities of the classifier [61]. The resulting numerical value provides a measured distance between the human
classification and the target automatic classifier.
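The effectiveness measures above reduce to a few lines of arithmetic. The counts in the sketch below are hypothetical, chosen only to exercise the formulas:

```python
def precision_recall_f(tp, fp, fn, beta=1.0):
    """Compute precision, recall and F_beta from raw counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_beta = (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)
    return precision, recall, f_beta

# Hypothetical counts for one ambiguous word's sense category:
# 40 correct assignments, 10 false positives, 20 misses
p, r, f1 = precision_recall_f(tp=40, fp=10, fn=20)
print(round(p, 3), round(r, 3), round(f1, 3))   # 0.8 0.667 0.727
```

With β = 1 the formula collapses to the harmonic mean 2PR/(P + R), matching the F-measure definition above.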
IV. EXPERIMENTATION AND RESULTS
The results are presented in this section in terms of the effectiveness of the classifiers for WSD. The flow of
experimentation is shown below in figure 12 using a block diagram. As described above, two algorithms (SVM and Naïve
Bayes) are analyzed for the classification tasks. To characterize the performance of the classifiers, standard 5-fold
cross-validation is used. Cross-validation is a model-validation technique that evaluates how well the statistical
results produced by a model generalize to an independent data set. The diagram shows the flow of experiments:
Figure-12: Experimentation and results flow
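The 5-fold cross-validation used in the experiments can be sketched as follows (a minimal illustration with made-up indices, not the actual WEKA implementation):

```python
def k_fold_splits(n_items, k=5):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Each of the k mutually exclusive folds serves once as the test set
    while the remaining k-1 folds form the training set.
    """
    indices = list(range(n_items))
    fold_size = n_items // k
    for i in range(k):
        start = i * fold_size
        # The last fold absorbs any remainder
        end = start + fold_size if i < k - 1 else n_items
        test_idx = indices[start:end]
        train_idx = indices[:start] + indices[end:]
        yield train_idx, test_idx

folds = list(k_fold_splits(20, k=5))
print(len(folds))       # 5 train/test pairs
print(folds[0][1])      # first test fold: [0, 1, 2, 3]
```

Each document thus appears in exactly one test fold, and the classifier's scores are averaged over the k runs.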
As described above, feature selection is used to select the most informative features. Feature selection not only makes
classifiers computationally efficient but also improves their effectiveness [54]. We have used information gain as the
feature selection method, following the recommendation of [67]. Results of the feature selection process for SVM
are shown in figure 13.
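Information gain for a single term can be computed from a 2x2 contingency table of term presence against category membership. The sketch below is illustrative only; the counts are invented:

```python
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

def information_gain(n11, n10, n01, n00):
    """Information gain of a term for a binary category.

    n11: docs containing the term and in the category
    n10: docs containing the term, not in the category
    n01: docs without the term, in the category
    n00: docs without the term, not in the category
    """
    n = n11 + n10 + n01 + n00
    p_cat = (n11 + n01) / n                    # P(category)
    p_term = (n11 + n10) / n                   # P(term present)
    h_c = entropy([p_cat, 1 - p_cat])          # H(C)
    h_c_term = entropy([n11 / (n11 + n10), n10 / (n11 + n10)])
    h_c_noterm = entropy([n01 / (n01 + n00), n00 / (n01 + n00)])
    return h_c - p_term * h_c_term - (1 - p_term) * h_c_noterm

# A term that perfectly predicts the category has maximal gain:
print(information_gain(50, 0, 0, 50))               # 1.0
# An uninformative term has zero gain:
print(round(information_gain(25, 25, 25, 25), 6))   # 0.0
```

Ranking terms by this score and keeping the top k is the feature-selection step; the figures then plot classifier F-measure against the number of features retained.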
Figure-13: No of features vs f-measure of SVM
Results of the feature selection process for Naïve Bayes are shown in figure 14. It is clearly seen that, on the same
data set, SVM is more stable than Naïve Bayes.
[Figure 13 plots F-measure against the number of selected features for the seven ambiguous words: Tar, SONA, SAR, PURI, DARAZ, MULK and KHALANA.]
[Figure 12 block diagram: the tagged UNER/WSD dataset is loaded into the evaluation tool (WEKA); the SVM and NB classifiers are applied; and the precision, recall and F-measure results of each are comparatively analyzed.]
Figure-14: No of features vs f-measure of NB
The tasks are performed in WEKA, a popular machine learning toolkit that has built-in functionality not only for
pre-processing tasks but also for classifier training [68]. Based on the proposed methodology, a prototype
application was developed in Java, as shown in figure 15. Currently, the application performs two main tasks:
modelling the classifiers from the training data, and predicting the senses of words in a given document. The
prediction process operates on a single text document. On loading the document into the application, the system
performs two tasks: it detects the presence of ambiguous words (i.e. words that have more than one sense) and
predicts the sense of each ambiguous word. The first task is performed by maintaining a list of ambiguous words,
while the second is achieved by maintaining a collection of trained models; in both collections, words and their
corresponding models share the same indexes.
Figure-15: Prototype system
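The shared-index lookup between the ambiguous-word list and the model collection can be sketched as follows (the word strings and model names here are hypothetical placeholders, not the actual Urdu lexicon):

```python
# Detection lexicon and model collection share the same indexes, so finding
# a word's position in one list also locates its trained classifier in the other.
ambiguous_words = ["sona", "tar", "mulk"]            # detection lexicon
models = ["model_sona", "model_tar", "model_mulk"]   # same order as the lexicon

def disambiguate(document_tokens):
    """Return the classifier responsible for each ambiguous token found."""
    hits = []
    for token in document_tokens:
        if token in ambiguous_words:                 # detection phase
            idx = ambiguous_words.index(token)       # shared index
            hits.append((token, models[idx]))        # identification model
    return hits

print(disambiguate(["yeh", "sona", "aur", "mulk"]))
```

In the real system each model entry would be a trained SVM or Naïve Bayes classifier rather than a string, but the index-aligned lookup is the same.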
V. CONCLUSION AND FUTURE WORK
In this research paper we presented word sense disambiguation for Urdu text documents using machine learning. There
are various issues in Urdu language processing, including the lack of a standard Urdu corpus and incompatibility
issues of NLP tools for Urdu, which have been discussed earlier. For this reason, WSD for Urdu text documents has
been implemented in this work using SVM. We presented an analysis of two classifiers, support vector machine and
Naïve Bayes; on the basis of their results we chose the best features and developed the prototype system for Urdu
text documents using the standard SVM library.
To validate the results, we used k-fold cross-validation, where the training dataset is divided into k mutually
exclusive subsets; iteratively, one of the k subsets is used for testing the classifier and the other k-1 subsets for
training it [69]. A 5-fold cross-validation test was conducted to validate the classifiers. Significant results have been
produced on the developed corpus. The results show that our approach is very effective for Urdu WSD. We believe the
proposed system will help in information retrieval systems, and that our test collection dataset will also help other
researchers working in the field of WSD for the Urdu language.
In future work we want to extend the test collection with more ambiguous words, and to move our prototype system
from supervised to unsupervised learning. By doing so we can build an online resource that will help in natural
language processing, and other researchers will readily find a corpus for Urdu language processing, which is not
available today. Finally, a comparative analysis of various machine learning algorithms should be performed to
analyze the impact of the classifier on WSD.
REFERENCES
[1] R. Navigli, “Word sense disambiguation: A survey”, ACM Computing Surveys, Vol. 41 No. 2, pp. 1-69, 2009.
[2] E. Palta, Word Sense Disambiguation. Mumbai: Kanwal Rekhi School of Information Technology, 2007.
[3] M. Sanderson, “Word Sense Disambiguation and Information Retrieval”, In Proc. of 17th annual international
ACM SIGIR conference on Research and development in information retrieval, pp. 142-151, New York, USA, 1994.
[4] J. E. Ellman, I. Klincke, and J. I. Tait, “Word Sense Disambiguation by Information Filtering and Extraction”,
Computers and the Humanities, Vol. 34, pp. 127-134, 2000.
[5] M. Carpuat and D. Wu, “Improving Statistical Machine Translation using Word Sense Disambiguation”, In Proc.
of 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural
Language Learning, pp. 61-72, Prague, 2007.
[6] W. Weaver, “Machine Translation of Languages”, Cambridge, Massachusetts: MIT Press. pp. 15–23, 1955.
[7] G. Pilania, C. C. Wang, X. Jiang, S. Rajasekaran, and R. Ramprasad, “Accelerating materials property predictions
using machine learning” Scientific Reports, Vol. 3, 2013.
[8] L. Guo, “Applying Data Mining Techniques in Property/Casualty Insurance”, in CAS 2003 Winter Forum, Data
Management, Quality, and Technology Call Papers and Ratemaking Discussion Papers, CAS. 2003.
[9] F. Sebastiani, “Machine Learning in Automated Text Categorization”, ACM Computing Surveys, Vol. 34, No. 1,
pp. 1-47, 2002.
[10] V. Gupta and G. S. Lehal, “A Survey of Text Mining Techniques and Applications”, Journal of Emerging
Technologies in Web Intelligence, Vol. 1, No. 1, pp. 60-76, 2009.
[11] T. Joachims, “A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization”, In Proc.
of 14th International Conference on Machine Learning, pp. 143-151, 1997.
[12] D. Blei, “Probabilistic Topic Models”, Communications of the ACM, Vol. 54, No. 4, pp. 77-84, 2012.
[13] M. Ikonomakis and S. Kotsiantis, “Text Classification Using Machine Learning Techniques”, WSEAS
Transactions on Computers, Vol. 4, No. 8, pp. 966-974, 2005.
[14] S. Niharika and V. S. Latha, “A Survey on Text Categorization”, International Journal of Computer Trends and
Technology, Vol. 3, No. 1, pp. 39-45, 2012.
[15] A. Naseer and S. Hussain, “Supervised Word Sense Disambiguation for Urdu Using Bayesian Classification”,
In Proc. of Conference on Language & Technology (CLT10), pp. 1-5, Pakistan, 2010.
[16] M. Hu and B. Liu, “Mining and summarizing customer reviews”, In Proc. of 10th ACM SIGKDD international
conference on Knowledge discovery and data mining, pp. 168-177, 2004.
[17] A. Fahrni, and M. Klenner, “Old wine or warm beer: Target-specific sentiment analysis of adjectives”, In Proc.
of Symposium on Affective Language in Human and Machine, pp. 60-63, 2008.
[18] S. M. Kim, and E. Hovy, “Determining the sentiment of opinions”, In Proc. of 20th international conference on
Computational Linguistics, Association for Computational Linguistics, pp. 13-67, 2004.
[19] A. M. Popescu and O. Etzioni, “Extracting product features and opinions from reviews”, Natural language
processing and text mining, pp. 9-28, 2007.
[20] J. Yi, T. Nasukawa, R. Bunescu, and W. Niblack, “Sentiment analyzer: Extracting sentiments about a given topic
using natural language processing techniques”, In Proc. of 3rd IEEE International Conference on Data Mining, pp.
427-434, 2003.
[21] T. T. Thet, J. C. Na, C. S. G. Khoo, and S. Shakthikumar, “Sentiment analysis of movie reviews on discussion
boards using a linguistic approach”, In Proc. of 1st International CIKM workshop on Topic-sentiment analysis for
mass opinion, pp. 81-84, 2009.
[22] M. Castillo, F. Real, J. Atserias, and G. Rigau, “The TALP Systems for Disambiguating”, In Proc. of International
Workshop on the Evaluation of Systems for the Semantic Analysis of Text, pp. 93-96, 2004.
[23] R. Navigli, and P. Velardi, “Structural semantic interconnections: a knowledge-based approach to word sense
disambiguation”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 27, No.7, pp. 1075-1086,
2005.
[24] Navigli, Roberto, and Simone Paolo Ponzetto, “SemEval-2007 task 07: coarse-grained English all-words task”,
In Proc. of 4th International Workshop on Semantic Evaluations, pp. 30-35, Stroudsburg, PA, USA, 2007.
[25] Agirre, E., Lopez de Lacalle, A., Soroa, A., “Knowledge-based WSD on Specific Domains Performing better
than Generic Supervised WSD”, In Proc. of 21st international joint conference on Artificial intelligence, pp. 1501-
1506, San Francisco, CA, USA, 2009.
[26] Navigli, Roberto, and Mirella Lapata., “An experimental study of graph connectivity for unsupervised word sense
disambiguation”, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 32 No.4, pp. 678-692, 2010.
[27] Ponzetto, Simone Paolo, and Roberto Navigli, "Knowledge-rich word sense disambiguation rivaling supervised
systems," In Proc. of 48th annual meeting of the association for computational linguistics. Association for
Computational Linguistics, 2010.
[28] C. J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition," Data Mining and Knowledge
Discovery, Vol. 2, pp. 121-167, 1998.
[29] H. Schütze, "Automatic word sense discrimination," Computational linguistics, vol. 24.1, pp. 97-123, 1998.
[30] Roberto Navigli, "A quick tour of word sense disambiguation, induction and related approaches," Theory and
practice of computer science, pp. 115-129, 2012.
[31] Di Marco, Antonio, and Roberto Navigli, "Clustering and diversifying web search results with graph-based word
sense induction," Computational Linguistics, Vol. 39 No.3, pp. 709-754, 2013.
[32] Thanh Phong Pham and Wee Sun Lee and Hwee Tou Ng, "Word Sense Disambiguation with Semi-Supervised
Learning," In Proc. of National Conference on Artificial Intelligence, London, 2005.
[33] Resnik, Philip, and David Yarowsky, "A perspective on word sense disambiguation methods and their
evaluation," in Proceedings of the ACL SIGLEX workshop on tagging text with lexical semantics: Why, what, and
how, Washington, 1997.
[34] Gliozzo, Alfio Massimiliano, Bernardo Magnini, and Carlo Strapparava, "Unsupervised Domain Relevance
Estimation for Word Sense Disambiguation," EMNLP, 2004.
[35] Siva Reddy, "Word Sense Disambiguation Using Semantic Categories, Domain Information and Knowledge
Sources. Diss.," In International Institute of Information Technology, Hyderabad, 2010.
[36] McCarthy, Diana, et al., "Unsupervised acquisition of predominant word senses," Computational Linguistics, vol.
33.4, pp. 553-590, 2007.
[37] Mohammad, Saif, and Graeme Hirst, "Distributional measures of concept-distance: A task-oriented evaluation,"
in Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing. Association for
Computational Linguistics, 2006.
[38] Roberto Navigli, "A quick tour of word sense disambiguation, induction and related approaches," Theory and
practice of computer science. Springer Berlin Heidelberg, pp. 115-129, 2012.
[39] Navigli, Roberto, and Simone Paolo Ponzetto, "BabelNet: The automatic construction, evaluation and application
of a wide-coverage multilingual semantic network," Artificial Intelligence 193, pp. 217-250, 2012.
[40] Martinez, David, Oier Lopez De Lacalle, and Eneko Agirre., "On the Use of Automatically Acquired Examples
for All-Nouns Word Sense Disambiguation," J. Artif. Intell. Res.(JAIR) 33, pp. 79-107, 2008.
[41] A. Naseer, "Supervised Word Sense Disambiguation for Urdu Using Bayesian Classification," Center for
Research in Urdu Language Processing, Lahore, Pakistan, 2010.
[42] T. Mitchell, Machine Learning, McGraw-Hill, 1997.
[43] G. Forman, "An Extensive Empirical Study of Feature Selection Metrics for Text Classification," Journal of
Machine Learning Research, Vol. 3, pp. 1289-1305, 2003.
[44] A. Dasgupta and M. W. Mahoney, "Feature selection methods for text classification," in Proceedings of
the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA,
2007, pp. 230-239.
[45] M. Rogati and Y. Yang, "High-performing feature selection for text classification," in CIKM, Virginia,
USA, 2002.
[46] Y. Yang, F. Li, D. D. Lewis, and T. G. Rose, "RCV1: A New Benchmark Collection for Text
Categorization Research," Journal of Machine Learning Research, Vol. 5, pp. 361-397, 2004.
[47] C.-C. Chang and C.-J. Lin, "LIBSVM: A Library for Support Vector Machines," ACM Transactions
on Intelligent Systems and Technology, Vol. 2, No. 3, 2011.
[48] Y. Yang and X. Liu, "A re-examination of text categorization methods," in SIGIR-99, 22nd ACM International
Conference on Research and Development in Information Retrieval, Berkeley, 1999, pp. 42-49.
[49] T. Joachims, "Text Categorization with Support Vector Machines: Learning with Many Relevant Features," in
10th European Conference on Machine Learning, 1998, pp. 137-142.
[50] D. W. Aha, D. Kibler, "Instance-based Learning Algorithms. Machine Learning," Kluwer Academic Publishers,
Boston. Manufactured in The Netherlands, pp. 37-66, 1991.
[51] E. Gabrilovich and S. Markovitch, "Text categorization with many redundant features: using aggressive feature
selection to make SVMs competitive with C4.5," In Proceedings of the ICML, the 21st International Conference on
Machine Learning, pp. 321-328, 2004.
[52] Milne, David, Olena Medelyan, and Ian H.Witten, "Mining Domain-specific thesauri from Wikipedia: A case
study," in Proceedings of the 2006 IEEE/WIC/ACM international conference on web intelligence. IEEE Computer
Society, 2006.
[53] T. Weale, "Utilizing Wikipedia Categories for Document Classification," 2006.
[54] R. R. Bouckaert, E. Frank, M. A. Hall, et al., "WEKA—Experiences with a Java Open-Source Project," Journal of
Machine Learning Research, Vol. 11, pp. 2533-2541, 2010.
[55] M. Hall and I. H. Witten, "The WEKA data mining software: an update," SIGKDD Explorations, pp. 10-18,
2009.
[56] M. Hall and I. H. Witten, "The WEKA Data Mining Software: An Update," SIGKDD Explorations, Vol. 11,
No. 1, 2009.
[57] I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques (2nd Ed.), Morgan
Kaufmann, an imprint of Elsevier, San Francisco, CA, 2005.
[58] Y. Yang and J. O. Pedersen, "A comparative study on feature selection in text categorization," in ICML '97,
Proceedings of the Fourteenth International Conference on Machine Learning, San Francisco, CA, USA, 1997, pp.
412-420.
[59] A. Rehman, M. Mustafa, I. Israr, and M. Yaqoob, “Survey of wearable sensors with comparative study of noise
reduction ECG filters”, International Journal of Computing and Network Technology, Vol. 1, No. 1, pp. 61-81, 2013.
[60] D. Anguita and L. Ghelardoni, "The ‘K’ in K-fold cross validation," in European Symposium on Artificial Neural
Networks, Computational Intelligence and Machine Learning, Bruges, Belgium, 2012.
[61] Dubey, Ajay, and Vasudeva Varma, "Generation of Bilingual Dictionaries using Structural Properties," 2013.
[62] K. Kamran, S. Afzal, M. Yaqoob, M. Sharif, “A Comparative Survey on Vehicular Ad-hoc Network (VANET)
Routing Protocol using Heuristic and Optimistic Techniques”, Research Journal of Information Technology, Vol. 6
No. 2, pp. 14-24, 2015.