
CONNECTION SCIENCE, 2016
http://dx.doi.org/10.1080/09540091.2016.1141874

Word sense disambiguation in evolutionary manner

Saad Adnan Abed, Sabrina Tiun and Nazlia Omar

Fakulti Teknologi dan Sains Maklumat, Universiti Kebangsaan Malaysia, UKM Bangi, Selangor Darul Ehsan, Malaysia

ABSTRACT
The task of assigning the proper meaning to an ambiguous word in a particular context is termed word sense disambiguation (WSD). We propose a genetic algorithm, improved by local search techniques, to maximise the overall semantic similarity or relatedness of a given text. Local search is used because of the inefficiency of population-based algorithms (e.g. the genetic algorithm) in exploiting the search space. Firstly, the proposed method assigns all potential senses for each word using a WordNet sense inventory. Then, the improved genetic algorithm is applied to determine a coherent set of senses that carries the maximum similarity or relatedness score based on information content and gloss overlap methods, namely the extended Lesk algorithm and Jiang and Conrath (jcn). The obtained results outperformed other related unsupervised methods when tested on the same benchmark dataset. It can be concluded that the proposed method is an effective solution for unsupervised WSD.

ARTICLE HISTORY
Received 28 April 2015
Accepted 11 January 2016

KEYWORDS
Word sense disambiguation; genetic algorithm; local search; WordNet; semantic similarity; semantic relatedness

1. Introduction

Ambiguity is a general property of English text due to the multi-sense characteristics of most English words. Therefore, an application that deals with English text, such as information retrieval (IR) or machine translation (MT), has to identify the proper meaning of each word in its context before it can achieve the final task of the application; otherwise, the final task of such applications will yield inaccurate results. The task of identifying the meaning of a word in its context is known as word sense disambiguation (WSD). This task was initially formulated as a distinct computational problem at the beginning of MT in the 1940s; Weaver (1955) first introduced the problem in a computational context in his well-known memorandum on translation. A WSD task makes use of information about the word being disambiguated along with the context words. This information is collected from either an annotated corpus or a dictionary. An annotated corpus requires intensive hand tagging and is quite difficult to collect; on the other hand, methods based on an annotated corpus can yield accurate results, but only for a limited number of words. Recently, Yu, Li, Hong, Li, and Mei (2015) proposed an approach of rule extraction by feature attributes for the English preposition on. This method exploited 600 samples to achieve the disambiguation task for one preposition.

CONTACT Saad Adnan Abed [email protected]

© 2016 Taylor & Francis


Alternatively, a dictionary is a flexible resource that can be used for numerous words. A decisive approach to resolving word senses using the Oxford Advanced Learner's Dictionary of Current English (OALD) was proposed by Lesk (1986). Consequently, the task of disambiguating word senses remained knowledge-based or dictionary-based. Yarowsky (1992) proposed combining co-occurrence data from large corpora with information in Roget's thesaurus for learning disambiguation rules of Roget classes. This method made use of the large scale of the thesaurus based on co-occurrence data.

Due to the difficulty of collecting adequate annotated corpora (for supervised methods) and the dearth of dictionary-based methods, using an unannotated corpus seemed to be the most promising solution for a WSD task. Unsupervised approaches are able to avoid knowledge acquisition bottlenecks (Gale, Church, & Yarowsky, 1992). These approaches rely on the notion that the same senses of a word tend to have similar neighbouring words. Here, word senses are induced from input text via clustering of word occurrences, after which new occurrences are classified into the induced clusters. These approaches do not rely on a labelled dataset and, in their purest form, do not take advantage of any machine-readable resources such as dictionaries, thesauri or ontologies. Unsupervised WSD carries out word sense discrimination, i.e. it distributes the occurrences of a word into a number of classes by checking whether any two occurrences belong to the same sense (Schütze, 1998). Admittedly, the aim of unsupervised WSD approaches differs from that of knowledge-based and supervised methods: they detect sense clusters instead of allocating sense labels. Schütze (1998) proposed context-group discrimination, an algorithm that clusters the occurrences of an ambiguous word into clusters of senses based on the contextual similarity among the occurrences. In the context of clustering approaches, Lin (1998) proposed a well-known approach to word clustering which consists of the identification of words W = (w1, ..., wk) that are similar (possibly synonymous) to a target word w0. Later, Pantel and Lin (2002) also proposed an algorithm called clustering by committee for word clustering. An effective method that incorporated web-scale phrase clustering results for context matching was presented by Ji (2010), which outperformed the accuracy of a typical supervised classifier.

A distinct approach to word sense discrimination is known as co-occurrence graphs, which has achieved a certain amount of success in recent explorations. An example of this approach is a graph-based algorithm for sequence data labelling by means of random walks on graphs encoding label dependencies (Sinha & Mihalcea, 2007). Likewise, Navigli and Lapata (2010) explored the use of co-occurrence graphs for large-scale WSD. They endeavoured to identify the most important node in the graph, which represented the proper sense for the given word. The reader may refer to the survey by Navigli (2009) for more details about word clustering and co-occurrence graph methods, as well as all other WSD approaches.

The approaches most similar to co-occurrence graph methods were those that attempted to formulate WSD as a combinatorial problem. In such methods, each pair of senses of the words in the context was ranked according to a cost function, and the sequence of senses that carried the best cost value was nominated as the final solution. The cost function indicated how coherent the senses were from a semantic similarity or relatedness point of view, and was chosen based on the assumption that the overall semantic relatedness of the contextual word senses should be maximised.


Pedersen, Banerjee, and Patwardhan (2005) explored various cost functions for maximising the overall semantic relatedness so that WSD could be performed via brute force. Different strategies for maximising the overall semantic relatedness using meta-heuristic algorithms started with Cowie, Guthrie, and Guthrie (1992), who applied simulated annealing to lexical disambiguation. Likewise, the genetic algorithm (Gelbukh, Sidorov, & Han, 2003; Hausman, 2011; Zhang, Zhou, & Martin, 2008), ant colony optimisation (Nguyen & Ock, 2013; Schwab, Goulian, Tchechmedjiev, & Blanchon, 2011), the harmony search algorithm (Abed, Tiun, & Omar, 2015) and the artificial bee colony algorithm (Abualhaija & Zimmermann, 2014) have also been applied to maximise the semantic relatedness in WSD applications. In this paper, we propose a population-based meta-heuristic algorithm, the genetic algorithm, improved with a local search technique for WSD. Our proposed method extends the method of Zhang et al. (2008), who proposed a genetic word sense disambiguation algorithm for nouns that in turn relied on Wu and Palmer (1994) to obtain a similarity score for the nouns being disambiguated. In contrast, our proposed method performs WSD for all parts-of-speech based on hybridised semantic similarity and relatedness methods. This hybridisation is composed of the adapted Lesk algorithm (Banerjee & Pedersen, 2002) and the jcn method (Jiang & Conrath, 1997). Moreover, the exploitation behaviour of the proposed genetic algorithm was improved by invoking local search techniques. Hence, the genetic algorithm in our method maximises the overall coherence (semantic similarity or relatedness) score of the set of words being disambiguated.

1.1. Complexity of WSD

The difficulty of WSD is a well-understood problem among earlier researchers. This difficulty is what led to the abandonment of MT in the 1960s, when it was argued that no existing or imaginable program could make a computer select the proper sense of an ambiguous word, because of the need to model general world knowledge (Bar-Hillel, 1960). For example, a program could not discern that the word pen was used in its enclosure sense in the following passage: Little John was looking for his toy box. Finally he found it. The box was in the pen. John was very happy. The complexity of the WSD task is attributed to the two types of word sense, namely homograph and polysemy senses. A homograph is a coarse-grained distinction between totally unrelated senses of the same word. For instance, the first two senses of the word "bank" are "sloping land" and "depository financial institution". These two senses can be easily distinguished as they belong to completely different domains. The other type of word sense is polysemy, which is more difficult to handle than homographs due to the complexity of distinguishing this type of sense. Polysemy involves fine-grained distinctions whereby word senses can somehow be related. An example of polysemy is the other senses of the word "bank" relating to "financial institution", "bank building" and "a money box" (WordNet 3.1). In reference to these examples of word senses, most WSD errors lie in polysemy-type word senses rather than the homograph type.
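To make the homograph/polysemy distinction concrete, the short sketch below lists the noun senses of "bank" from WordNet. It uses NLTK's WordNet interface, which is an assumed toolkit choice for illustration; the paper specifies WordNet as the sense inventory but no particular implementation:

# Hypothetical illustration: the sense inventory of "bank" in WordNet,
# accessed through NLTK (a toolkit choice assumed here, not the paper's).
from nltk.corpus import wordnet as wn

for synset in wn.synsets('bank', pos=wn.NOUN):
    print(synset.name(), '-', synset.definition())

# The output mixes homograph readings (sloping land vs. depository
# financial institution) with polysemous ones (bank building, money box);
# the latter are the harder cases for WSD.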

1.2. Problem description

WSD is the problem of assigning the proper sense to an ambiguous word in a particular context. Usually, WSD performs this sense-assignment process for one or more texts. In WSD, the text T can be viewed as a sequence of words $w_1, w_2, w_3, \ldots, w_n$, where n is the size of the text. If we suppose that each word w has k senses based on a specific dictionary, then WSD has to match a suite of senses to the words in T, presented as $w_1^k, w_2^k, w_3^k, \ldots, w_n^k$. Thus, if T has n words and each word has k senses (e.g. five senses per word), the number of possible combinations will be $5^n$. Since the number of senses differs from one word to another, the number of sense combinations for T is equal to:

$$Senses(w_1) \times Senses(w_2) \times \cdots \times Senses(w_{Size(T)}) \tag{1}$$
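As a concrete illustration of Equation (1), the following sketch counts the sense combinations for a toy context; NLTK's WordNet interface is again an assumed toolkit choice:

# Sketch of Equation (1): the size of the sense-combination space,
# using NLTK's WordNet as the sense inventory (an assumed choice).
from math import prod
from nltk.corpus import wordnet as wn

text = ['bank', 'interest', 'rate', 'rise']        # toy context
sense_counts = [max(1, len(wn.synsets(w))) for w in text]
print(sense_counts, '->', prod(sense_counts), 'combinations')

Even a four-word context yields thousands of combinations, which motivates the intelligent search techniques discussed below.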

The main problem of WSD is how to determine a coherent combination of senses. The original Lesk algorithm, formulated by Lesk (1986) to resolve the ambiguity of two words, is shown in the following pseudo code:

Algorithm 1: Original Lesk algorithm
Input: w1, w2
max = Gloss(S1(w1)) ∩ Gloss(S1(w2))
Best1 = Best2 = 1
for i = 1 to Senses(w1) do
    for j = 1 to Senses(w2) do
        OverlapScore = Gloss(Si(w1)) ∩ Gloss(Sj(w2))
        if OverlapScore > max then
            max ← OverlapScore
            Best1 ← i
            Best2 ← j
        end
    end
end
Output: Best1, Best2
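A runnable rendering of Algorithm 1 might look as follows. It is a minimal sketch assuming NLTK's WordNet glosses and a plain shared-word overlap count (Lesk (1986) originally used OALD glosses):

# Sketch of Algorithm 1: choose the sense pair with the largest gloss overlap.
from nltk.corpus import wordnet as wn

def gloss_words(synset):
    # Bag of lower-cased gloss tokens for one sense.
    return set(synset.definition().lower().split())

def original_lesk(w1, w2):
    best1 = best2 = 0
    best_overlap = -1
    for i, s1 in enumerate(wn.synsets(w1)):
        for j, s2 in enumerate(wn.synsets(w2)):
            overlap = len(gloss_words(s1) & gloss_words(s2))
            if overlap > best_overlap:
                best_overlap, best1, best2 = overlap, i, j
    return best1, best2

print(original_lesk('pine', 'cone'))   # indices of the winning sense pair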

This algorithm determines the proper senses of both ambiguous words based on the highest overlap score between two senses. However, if the ambiguity problem is given a bag of words, the Lesk algorithm will encounter a huge number of sense combinations. The most promising solution to this problem is to adopt intelligent search techniques, which are capable of exploring this huge space of combinations. The core constituent of an intelligent search technique is the existence of an evaluation unit. The Lesk algorithm and all semantic relatedness or similarity methods can contribute to the evaluation task of the search. For instance, the gloss overlap notion, which is the basis of the Lesk algorithm, evaluates the sense combinations as follows:

$$\sum_{i=1}^{n-1} \sum_{j=i+1}^{ws} Gloss(S_i) \cap Gloss(S_j), \tag{2}$$

where n and ws are the size of the text and the window size of the measure, respectively. In general, there are two main types of evaluation functions for a WSD problem: relatedness and similarity measures. These types of measures have several variants. This work exploits two measures, one belonging to the relatedness family and the other to the similarity family. The next section explains these measures in detail and determines the role of each in the evaluation function of the proposed algorithm.

2. Semantic similarity and relatedness measures

When corpus resources are absent, the decisive solution to a WSD problem is the use of semantic similarity and relatedness methods. These methods aim to perform the task of assigning word senses via quantification of the similarity or relatedness between two concepts. Semantic relatedness is more general than semantic similarity, since two concepts can be related without being similar.

Semantic similarity terms are used in very particular senses, as they refer to the relationship between two synsets based on information found in a specific hierarchy. Furthermore, when based on WordNet, similarity judgements are configured between pairs of verbs or pairs of nouns, because there is no link between the noun and verb hierarchies. A number of methods are applied to measure the similarity between two terms, each based on a particular notion. These methods depend on the structure of WordNet to estimate a numeric degree specifying how similar two concepts are. The simplest of these methods is to count the physical path length between two concepts. The notion of counting the physical length has limitations, since the paths between very specific concepts (e.g. pitchfork) reveal much smaller distinctions in semantic similarity than the path lengths of very general concepts (e.g. idea) (Pedersen, 2010). Hence, we investigated the use of the jcn method (Jiang & Conrath, 1997), which is based on the notion of information content and yields higher similarity values for very specific concepts and lower values for very general concepts. The jcn method can only provide similarity values between pairs of verbs or pairs of nouns. Thus, we also use a relatedness method adapted from Lesk (Banerjee & Pedersen, 2002), which provides a relatedness value for all parts-of-speech as it is based on the gloss overlap notion.

2.1. Information content measure

The semantic similarity between two concepts can be calculated based on the frequency of the concept that subsumes the two concepts being measured. This type of similarity calculation is based on information content (Resnik, 1995), which is the measure of the specificity of a given concept (Equation (3)). The information content of the lowest common subsumer (LCS) of two concepts is defined as the similarity between these concepts, as shown in Equation (4) (Resnik, 1995).

$$IC(C) = -\log(P(C)), \tag{3}$$

$$Sim_{Res}(C_1, C_2) = IC(LCS(C_1, C_2)). \tag{4}$$
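Equations (3) and (4) can be tried directly: the sketch below computes the Resnik similarity from pre-counted information content. The NLTK WordNet interface and its Brown-corpus IC file are assumptions for illustration, not the paper's setup:

# Sketch of Equations (3)-(4): Resnik similarity as the IC of the LCS.
# NLTK and its pre-computed Brown-corpus IC file are assumed here.
from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')
hill = wn.synset('hill.n.01')
coast = wn.synset('coast.n.01')

# res_similarity returns IC(LCS(hill, coast)), i.e. Equation (4).
print(hill.res_similarity(coast, brown_ic))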

The information content notion was a clear improvement over the path length notion (Rada, Mili, Bicknell, & Blettner, 1989), as the latter only considered the physical distance between two concepts. However, the Resnik method ignored the information content of the concepts being measured, which led to an inaccurate similarity ratio. Hence, Lin (1998) and Jiang and Conrath (1997) attempted to make use of the information content of the concepts being measured along with the information content of their LCS. These methods were formulated in different ways to refine the Resnik method, and are presented as follows:


Figure 1. A fragment of the WordNet hierarchy (Lin, 1998).

$$Dist_{jcn}(C_1, C_2) = IC(C_1) + IC(C_2) - 2 \times IC(LCS(C_1, C_2)), \tag{5}$$

$$Sim_{Lin}(C_1, C_2) = \frac{2 \times IC(LCS(C_1, C_2))}{IC(C_1) + IC(C_2)}. \tag{6}$$

The jcn method combines a lexical taxonomy structure with statistical information from a specific corpus. To explain the similarity calculation, the two concepts "hill" and "coast" are considered. The similarity score between these two words is estimated with the jcn method over the WordNet hierarchy shown in Figure 1.

In Figure 1, the numbers in parentheses represent the probability of each concept in a specific corpus. With regard to Figure 1, the similarity between the concepts "hill" and "coast" via the jcn measure is computed as follows:

$$Dist_{jcn}(hill, coast) = IC(hill) + IC(coast) - 2 \times IC(LCS(hill, coast)).$$

Figure 1 shows that the LCS of the "hill" and "coast" concepts is "geological-formation". The core notion of this method is to calculate the information content of the LCS together with that of its target concepts. To calculate the information content for these concepts, we apply the formula declared in Equation (3). Hence, the information content of geological-formation, hill and coast is 2.75, 4.72 and 4.67, respectively. Finally, the semantic distance between the target concepts is calculated via the formula declared in Equation (5) as follows:

$$Dist_{jcn}(hill, coast) = 4.72 + 4.67 - 2 \times 2.75 = 3.89.$$
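The worked example can be checked in a couple of lines; the IC values below are the ones quoted above, derived from the corpus probabilities in Figure 1:

# Re-checking the worked jcn example with the quoted IC values.
ic = {'geological-formation': 2.75, 'hill': 4.72, 'coast': 4.67}

# Equation (5): distance = IC(c1) + IC(c2) - 2 * IC(LCS(c1, c2)).
dist = ic['hill'] + ic['coast'] - 2 * ic['geological-formation']
print(dist)   # 3.89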

The jcn measure was used in this study to estimate the similarity between concepts that belong to the noun or verb hierarchies. As with all similarity methods, the jcn measure cannot estimate the similarity of concepts from different parts-of-speech, because it depends on the WordNet hierarchies, and these hierarchies do not mix different parts-of-speech.


It should be mentioned that the jcn method does not measure the similarity between two concepts if one of them has no occurrence record; it produces zero as the similarity value in this case. This is because the information content is unknown for concepts with zero occurrence probability.

Consequently, this study made use of a gloss overlap approach, adapted from the Lesk method, to measure the relatedness of those concept pairs whose similarity score is zero. Furthermore, we dedicated the adapted Lesk method to measuring the relatedness among concepts from different parts-of-speech, as shown in the next section.

2.2. Gloss overlap method

The gloss overlap notion is hierarchy-independent, because it measures the commonality of glosses. It is more general than the similarity methods, since it measures the relatedness between any two concepts regardless of part-of-speech. The basic algorithm of the gloss overlap method was proposed by Lesk (1986) to count the number of common words between two concept glosses, as shown in Section 1.2.

The Lesk algorithm considers the overlaps between the glosses of the target word and the glosses of neighbouring words in the given text. This poses a significant limitation, since dictionary glosses tend to be fairly short and do not provide adequate vocabulary to make fine-grained distinctions (Banerjee & Pedersen, 2002). To overcome this limitation, Banerjee and Pedersen defined a new method that expands the basic gloss overlap to accommodate the glosses of concepts known to be related to the concepts being compared. The extended gloss overlap method takes two concepts as input, and the output is a numeric value specifying the relatedness between them. The formula proposed by Banerjee and Pedersen (2002) to calculate the relatedness between two concepts (C1 and C2) is shown in Equation (7) below.

$$Relatedness(C_1, C_2) = \sum_{i=1}^{N} score(R_{1i}(C_1), R_{2i}(C_2)), \tag{7}$$

where N is the number of relation pairs in RELPAIRS, which is defined as a reflexive relation as shown below:

$$RELPAIRS = \{(R_1, R_2) \mid R_1, R_2 \in RELS;\ \text{if } (R_1, R_2) \in RELPAIRS \text{, then } (R_2, R_1) \in RELPAIRS\}$$

RELS is a set of relations defined in the WordNet semantic network. For instance, if RELS = {hypo, hype, gloss}, these relations constitute the pairs RELPAIRS = {(hypo, hypo), (hype, hype), (gloss, gloss), (gloss, hype), (hype, gloss)}. Hence, the adapted Lesk algorithm calculates the relatedness score between two concepts as follows:

$$\begin{aligned} Relatedness(C_1, C_2) = {} & score(hypo(C_1), hypo(C_2)) + score(hype(C_1), hype(C_2)) \\ & + score(gloss(C_1), gloss(C_2)) + score(gloss(C_1), hype(C_2)) \\ & + score(hype(C_1), gloss(C_2)). \end{aligned}$$

With regard to the selected pairs above, the relatedness score between C1 and C2 is indeed equal to the relatedness score between C2 and C1. In this study, the adapted Lesk method was responsible for measuring the relatedness of concepts that belong to different parts-of-speech, i.e. (noun, verb), (adjective, verb) and (adjective, noun) pairs, as well as adjective and adverb pairs. The jcn and adapted Lesk methods constituted an integral unit that measures the semantic similarity or relatedness between two concepts (see Equations (8) and (9)). This unit represents the objective function of the meta-heuristic algorithm, namely the genetic algorithm used in this study to maximise overall semantic similarity or relatedness (coherence), as the rest of this paper will show.
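To make the extended overlap concrete, here is a minimal sketch of Equation (7) for the RELPAIRS choice listed above. NLTK's WordNet is assumed, and score() is simplified to a plain shared-word count, whereas Banerjee and Pedersen (2002) weight longer phrasal overlaps more highly:

# Sketch of extended gloss overlap (Equation (7)) for one RELPAIRS choice.
from nltk.corpus import wordnet as wn

def related_gloss(synset, relation):
    # Gloss tokens of every synset reached via `relation`.
    words = []
    for related in relation(synset):
        words.extend(related.definition().lower().split())
    return words

def score(words1, words2):
    # Simplified overlap: shared-word count, not phrasal overlaps.
    return len(set(words1) & set(words2))

def relatedness(s1, s2):
    gloss = lambda s: [s]                   # the synset's own gloss
    hype = lambda s: s.hypernyms()
    hypo = lambda s: s.hyponyms()
    relpairs = [(hypo, hypo), (hype, hype), (gloss, gloss),
                (gloss, hype), (hype, gloss)]
    return sum(score(related_gloss(s1, r1), related_gloss(s2, r2))
               for r1, r2 in relpairs)

print(relatedness(wn.synset('car.n.01'), wn.synset('road.n.01')))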

3. Evolutionary WSD

The maximum similarity or relatedness can be estimated in a brute-force manner for the target words in a particular context (Pedersen et al., 2005). However, brute force is computationally intensive when applied to contexts that involve many words with multiple meanings. Therefore, various approaches have been applied to reduce this cost by exploiting meta-heuristic search algorithms, such as Cowie et al. (1992), Zhang et al. (2008), Hausman (2011) and Nguyen and Ock (2013). The proposed methodology in this paper was motivated by the strength of population-based algorithms in maximising the similarity or relatedness (coherence). In particular, we used the genetic algorithm, an evolutionary population-based search algorithm (Holland, 1975). Population-based algorithms explore more areas of the search space than local search algorithms. However, population-based algorithms are incompetent at exploiting the solution space, which leaves solutions without significant improvement. We therefore addressed this problem by invoking a local search technique, a Hill-Climbing algorithm that improved the exploitation performance of the genetic algorithm in the manner of memetic techniques (Moscato & Norman, 1992).

In general, the genetic algorithm begins with a number of individuals (chromosomes) and then attempts to evolve these individuals over subsequent generations. In a WSD task, each individual encodes a sequence of integers (genes), which represent the sense numbers of the given words to be disambiguated. The generic procedure of the genetic algorithm goes through the following steps:

• Population initialisation.
• Solution selection, based on a specific selection mechanism.
• Production of new solutions (offspring) from the selected solutions via crossover and mutation operations.
• Offspring evaluation, based on a specific objective function.
• Repetition of the procedure from the second step until the stopping criterion is reached.

In WSD, the genetic algorithm searches for the best sense combination based on the above steps. It can be applied to the WSD task by treating a set of words as a set of integers, each of which corresponds to a sense number in the given context. For instance, if a set of words consists of six words, the corresponding numeric individual for the first sense of those words is {1, 1, 1, 1, 1, 1}. In other words, if a given context of 50 words has six senses per word, there are $6^{50}$ possible solutions. In this paper, the genetic algorithm attempted to find the best solution among this set of potential solutions; a minimal sketch of the encoding follows, and the next subsections show how the algorithm was implemented for WSD.
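The sketch below illustrates this integer encoding; the sense counts are toy values rather than WordNet-derived ones:

# Sketch: integer encoding of a WSD individual. Gene j holds a sense
# index for word j, bounded by that word's sense count (toy values here).
import random

sense_counts = [4, 6, 2, 9, 3, 5]            # senses per word

def random_individual(counts):
    return [random.randint(1, c) for c in counts]

first_senses = [1] * len(sense_counts)       # the {1, 1, 1, 1, 1, 1} example
print(first_senses, random_individual(sense_counts))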


3.1. Population initialisation

The population of the proposed genetic algorithm was initialised by distributing all senses of each word across the population individuals. Thus, the maximum number of senses in the given text represented the population size. In WordNet, word senses are ordered by frequency, which means that the less frequent senses reside at the end of the sense list. Thus, we only considered the first 15 senses for nouns, adjectives and adverbs, but the first 20 senses for verbs, as verbs have more senses. In consequence, the initialisation process of the population is as follows:

Algorithm 2: Population initialisation
Input: w1, w2, ..., wn    /* n is the text size */
pop ← empty list of individuals
w = maxsenses{w1, w2, ..., wn}
if w.pos = verb then
    if w.SenseCount ≥ 20 then
        popsize = 20
    else
        popsize = w.SenseCount
    end
else
    if w.SenseCount ≥ 15 then
        popsize = 15
    else
        popsize = w.SenseCount
    end
end
for i = 1 to popsize do
    for j = 1 to n do
        individual_j = i mod SenseCount(w_j)
    end
    pop_i ← individual
end
Output: pop
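A runnable reading of Algorithm 2 could look like this; the (pos, sense count) pairs are toy inputs, whereas the paper draws them from WordNet 3.0:

# Sketch of Algorithm 2: population size is the capped maximum sense count,
# and gene j of individual i cycles through word j's senses (i mod count,
# shifted to stay in the 1..count index range).
def init_population(words):
    # words: list of (pos, sense_count) pairs for the text.
    pos_max, count_max = max(words, key=lambda w: w[1])
    cap = 20 if pos_max == 'verb' else 15
    popsize = min(cap, count_max)
    population = []
    for i in range(1, popsize + 1):
        individual = [(i % count) + 1 for _, count in words]
        population.append(individual)
    return population

pop = init_population([('noun', 4), ('verb', 12), ('adj', 3)])
print(len(pop), pop[0])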

Obviously, the population size is the maximum sense count of the given words. We assigned the sense count of each word based on WordNet 3.0 (Miller, 1995), as we tested our method on a set of files from the SemCor 3.0 corpus. Since the maximum sense count varies from one text to another, we did not fix the population size to a specific value; hence, each input text was processed as an isolated WSD instance.

3.2. Solution evaluation: the objective function

In population-based algorithms, an important function evaluates the quality of the individuals; this function is used to achieve the objective of the given problem. In this work, our proposed method attempted to maximise the overall semantic similarity or relatedness of the given text. Thus, we used hybridised similarity and relatedness methods to represent the objective function of our proposed method (see Equations (8) and (9)).


$$J_1 = \sum_{i=1}^{N-1} \sum_{j=i+1}^{WS} 2 \times IC(LCS(C_i, C_j)) - (IC(C_i) + IC(C_j)), \tag{8}$$

$$J_2 = \sum_{i=1}^{N-1} \sum_{j=i+1}^{WS} \sum_{k=1}^{NP} score(R_{1k}(C_i), R_{2k}(C_j)), \tag{9}$$

where N, WS and NP are the solution length, the window size of surrounding concepts, and the number of relation pairs, respectively. Surrounding concepts may lie before or after the target concept; therefore, j in the above equations can be decreased or increased. These equations are based on the jcn and adapted Lesk methods, and the fitness function of the genetic algorithm can be represented as an integral unit as follows:

$$J = \begin{cases} J_1, & \text{for the pairs (noun, noun) or (verb, verb)}, \\ J_2, & \text{otherwise}. \end{cases}$$

Previously, we pointed out that the adapted Lesk algorithm measures the semantic relatedness of all pairs of concepts except noun or verb pairs. In addition, our method uses the adapted Lesk algorithm whenever the jcn method fails to estimate the similarity of a noun or verb pair (i.e. when the information content of the concepts being measured is unknown). A sketch of this hybrid objective follows.
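In the sketch below, the senses are assumed to be NLTK-style synset objects with a pos() accessor, and jcn_sim and lesk_relatedness stand in for Equations (8) and (9); all three are assumptions for illustration:

# Sketch of the hybrid fitness J: jcn for noun-noun and verb-verb pairs,
# with an extended-Lesk fallback when jcn returns 0 (unknown IC), and
# extended Lesk for all mixed part-of-speech pairs.
def pair_coherence(c1, c2, jcn_sim, lesk_relatedness):
    if c1.pos() == c2.pos() and c1.pos() in ('n', 'v'):
        sim = jcn_sim(c1, c2)
        if sim != 0:                  # 0 signals a missing occurrence record
            return sim
    return lesk_relatedness(c1, c2)

def fitness(senses, window, jcn_sim, lesk_relatedness):
    # Sum pairwise coherence over a sliding window of senses (Eqs. (8)-(9)).
    total = 0.0
    for i in range(len(senses) - 1):
        for j in range(i + 1, min(i + 1 + window, len(senses))):
            total += pair_coherence(senses[i], senses[j],
                                    jcn_sim, lesk_relatedness)
    return total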

3.3. Genetic operators

In a genetic algorithm, the evolution of the solutions is conducted via two operators that work on preselected solutions. These operators are:

Crossover. The selected solutions are recombined to produce new solutions. We performed simple crossover at one point, as shown in Figure 2.

Mutation. This operator attempts to improve the solutions produced by the crossover operator; mutation may reach solutions in the search space that crossover alone could not.

In WSD, each word has a specific number of senses, which is represented in a single gene of the genetic algorithm. Therefore, crossover and mutation must produce new individuals that respect the upper and lower boundaries of each gene. In our crossover operation, individuals are produced without checking the gene boundaries, because the crossover recombines the corresponding genes of each individual, as shown in Figure 2 and in the sketch below. On the other hand, we always checked the gene boundaries after each mutation operation to maintain the feasibility of the solution.

Figure 2. Single point crossover.
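A minimal sketch of both operators under the integer encoding (the mutation rate is an illustrative choice, not the paper's setting):

# Sketch of the genetic operators. Single-point crossover needs no bounds
# check because genes keep their positions; mutation redraws a gene within
# that word's sense bounds to keep the individual feasible.
import random

def crossover(parent1, parent2):
    point = random.randint(1, len(parent1) - 1)     # cut point
    return (parent1[:point] + parent2[point:],
            parent2[:point] + parent1[point:])

def mutate(individual, sense_counts, rate=0.1):
    child = list(individual)
    for j, count in enumerate(sense_counts):
        if random.random() < rate:
            child[j] = random.randint(1, count)     # stays within bounds
    return child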

3.4. Local search operator

We used a Hill-Climbing algorithm to perform the local improvement process. In our method, the local search operator was applied periodically to each newly produced individual, i.e. after the genetic operators. The Hill-Climbing algorithm for WSD was based on a simple modification of the sense combinations; specifically, we used a swap operator as the neighbourhood structure. In WSD, each swap movement must satisfy the sense boundaries to preserve the feasibility of the solutions. A minimal sketch of this operator follows.
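In the sketch below, fitness is the hybrid objective from Section 3.2 and the step budget is an illustrative choice:

# Sketch of the swap-based Hill-Climbing local search: exchange two genes
# and keep the move only if it is feasible and improves the fitness.
import random

def hill_climb(individual, sense_counts, fitness, steps=50):
    best = list(individual)
    best_score = fitness(best)
    for _ in range(steps):
        i, j = random.sample(range(len(best)), 2)
        neighbour = list(best)
        neighbour[i], neighbour[j] = neighbour[j], neighbour[i]
        # A swap must respect each position's sense boundaries.
        if neighbour[i] > sense_counts[i] or neighbour[j] > sense_counts[j]:
            continue
        score = fitness(neighbour)
        if score > best_score:
            best, best_score = neighbour, score
    return best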

4. Experimental results and discussion

The results were obtained using a genetic algorithm whose population size equals the maximum number of senses in the given text; in general, the population size will not exceed 20 solutions for verbs and 15 for other parts-of-speech. The objective function used to evaluate the individuals of the genetic algorithm was based on the hybridisation of semantic similarity and relatedness methods, namely the jcn and adapted Lesk algorithms.

The dataset used in this experiment was a set of files from the SemCor 3.0 corpus (Miller, Leacock, Tengi, & Bunker, 1993). In order to compare our results with the most closely related works, namely Zhang et al. (2008) and Hausman (2011), we used the same set of SemCor files that they tested on: br-a01, b13, c0l, d02, e22, r05, g14, h21, j0l, k01, k11, l09, m02, n05, p07, r04, r06, r08, r09. The comparison was based on the following metrics:

• Coverage (C): the ratio of all answered senses to the total number of senses:

$$Coverage = \frac{\text{all answered senses}}{\text{total number of senses}}. \tag{10}$$

• Precision (P): the ratio of correctly answered senses to the total possible answered senses:

$$Precision = \frac{\text{correctly answered senses}}{\text{total possible answered senses}}. \tag{11}$$

• Recall (R): the ratio of correctly answered senses to the total number of senses:

$$Recall = \frac{\text{correctly answered senses}}{\text{total number of senses}}. \tag{12}$$

• F-measure: the harmonic mean of precision and recall:

$$F\text{-}measure = \frac{2 \times Precision \times Recall}{Precision + Recall}. \tag{13}$$
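Equations (10)-(13) reduce to three counts; the sketch below computes all four metrics from toy values (not the paper's results):

# Sketch of Equations (10)-(13) from raw counts (toy values, not results).
def wsd_metrics(correct, answered, total):
    coverage = answered / total                                  # Eq. (10)
    precision = correct / answered                               # Eq. (11)
    recall = correct / total                                     # Eq. (12)
    f_measure = 2 * precision * recall / (precision + recall)    # Eq. (13)
    return coverage, precision, recall, f_measure

print(wsd_metrics(correct=680, answered=930, total=1000))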


Table 1. The obtained results using the jcn method.

POS    C (%)    P (%)    R (%)    F-measure (%)
Noun   100      71.3     71.3     71.3
Verb   82.39    54.14    44.61    48.91

Table 2. The obtained results using the adapted Lesk method.

POS    C (%)    P (%)    R (%)    F-measure (%)
Noun   100      72.20    72.20    72.20
Verb   82.39    55.30    45.55    49.94
Adj.   100      68.13    68.13    68.13
Adv.   100      65.94    65.94    65.94
All    95.15    66.60    63.36    64.92

Table 3. The results obtained using the jcn and adapted Lesk methods.

POS    C (%)    P (%)    R (%)    F-measure (%)
Noun   100      73.12    73.12    73.12
Verb   82.39    56.08    46.21    50.66
Adj.   100      68.13    68.13    68.13
Adv.   100      65.94    65.94    65.94
All    95.15    66.97    63.85    65.37


Our proposed method was first applied using the jcn and adapted Lesk methods separately; the results after hybridising these methods are presented in Table 3. The reported experiments were run with a window of six words.

Table 1 outlines the results for the parts-of-speech whose similarity was measured by the jcn method. In our proposed method, the jcn semantic similarity method achieved F-measures of 71.3% and 48.91% for nouns and verbs, respectively. Since the jcn method is not able to measure the similarity for all parts-of-speech, as seen in Table 1 (only nouns and verbs are reported), the proposed methodology used the adapted Lesk method alongside jcn. The results obtained using the adapted Lesk method are shown in Table 2, whilst Table 3 shows the results of combining the Lesk and jcn methods.

Table 2 shows the results of the proposed disambiguation methodology using the adapted Lesk method to measure the relatedness between each pair of concepts. The results obtained using the adapted Lesk algorithm outperformed those obtained using the jcn method, because the latter cannot estimate the similarity of a given pair of concepts if one of them has no occurrence record. However, the jcn method only measures the similarity for nouns and verbs. Therefore, combining the adapted Lesk algorithm with the jcn method yielded effective measurement for all parts-of-speech, as shown in Table 3.


Figure 3. The effectiveness of the local search technique.

Combining the jcn method with the adapted Lesk algorithm showed improvements. The adjectives and adverbs were not affected by this combination, as the jcn method cannot estimate the similarity of these parts-of-speech. To show the effectiveness of the local search operator (described in Section 3.4), we implemented our method without this operator and compared it with the entire proposed method, as shown in Figure 3. The local search operator had a significant impact on determining the proper senses.

In comparison to related works, the proposed method outperformed the approaches of Zhang et al. (2008) and Hausman (2011), both of which applied a genetic algorithm to maximise overall similarity. In the method of Hausman (2011), the precision over nine SemCor files was 70%, which is lower than our precision for the noun part-of-speech; moreover, we report precision over 19 SemCor files. Zhang et al. (2008) reported an average precision of 71.96% for nouns, while our method achieved 73.12%. Practically, Zhang et al. (2008) exploited a knowledge-based method built on the Wu and Palmer (1994) similarity measure, while our method exploits relatedness and similarity methods that are considered more intensive. In comparison to Rosso, Masulli, and Buscaldi (2003), the proposed method attained a better recall for nouns at 73.12%, while Rosso et al. (2003) obtained 62.19%. On the other hand, Rosso et al. (2003) obtained a better precision of 80.91%, while our method obtained only 73.12%; this was due to our coverage of 100%, whereas Rosso et al. (2003) had 76.81% coverage. In general, the proposed evolutionary algorithm attained a better recall value in comparison to the aforementioned related works.

5. Conclusion

This study presented the use of a genetic algorithm to maximise the overall similarity and relatedness for the WSD task. The genetic algorithm for WSD was based on a combination of similarity and relatedness methods as a fitness function to find an effective solution, and this was done fruitfully. One of the most pertinent issues surrounding population-based algorithms (e.g. the genetic algorithm) is their ineffectiveness in exploiting the solution space; this issue was addressed in this study via the use of a local search technique. In consequence, our proposed method yielded a disambiguation accuracy competing with the state-of-the-art methods that follow knowledge-based approaches in an unsupervised framework, as shown by applying our method to the same benchmark dataset used by others. However, further studies can be conducted to improve the proposed methodology. In future work, we plan to implement other similarity measures to serve as the fitness function of the genetic algorithm. The genetic algorithm can also be further improved using population diversity techniques, as well as by investigating more local search techniques that can improve the quality of the search.

Acknowledgments

The authors would like to express their gratitude to Universiti Kebangsaan Malaysia for supporting this work.

Disclosure statement

No potential conflict of interest was reported by the authors.

Funding

This research project was funded by the Malaysian Government under research grant ERGS/1/2013/ICT07/UKM/03/1.

References

Abed, S. A., Tiun, S., & Omar, N. (2015). Harmony search algorithm for word sense disambiguation. PLoS One, 10(9), e0136614.

Abualhaija, S., & Zimmermann, K.-H. (2014). D-Bees: A novel method inspired by bee colony optimization for solving word sense disambiguation. arXiv preprint arXiv:1405.1406.

Banerjee, S., & Pedersen, T. (2002). An adapted Lesk algorithm for word sense disambiguation using WordNet (pp. 136–145). Berlin: Springer.

Bar-Hillel, Y. (1960). The present status of automatic translation of languages. Advances in Computers, 1, 91–163.

Cowie, J., Guthrie, J., & Guthrie, L. (1992). Lexical disambiguation using simulated annealing. Proceedings of the 14th conference on computational linguistics – volume 1 (pp. 359–365). Stroudsburg, PA: Association for Computational Linguistics.

Gale, W. A., Church, K. W., & Yarowsky, D. (1992). A method for disambiguating word senses in a large corpus. Computers and the Humanities, 26(5–6), 415–439.

Gelbukh, A., Sidorov, G., & Han, S. Y. (2003). Evolutionary approach to natural language word sense disambiguation through global coherence optimization. WSEAS Transactions on Computers, 2(1), 257–265.

Hausman, M. (2011). A genetic algorithm using semantic relations for word sense disambiguation (Thesis). Colorado Springs.

Holland, J. H. (1975). Adaptation in natural and artificial systems: An introductory analysis with applications to biology, control, and artificial intelligence. Michigan: University of Michigan Press.


Ji, H. (2010). One sense per context cluster: Improving word sense disambiguation using web-scale phrase clustering. Universal communication symposium (IUCS) (pp. 181–184).

Jiang, J. J., & Conrath, D. W. (1997). Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings of the international conference on research in computational linguistics (ROCLING X), Taiwan. arXiv preprint cmp-lg/9709008.

Lesk, M. (1986). Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. Proceedings of the 5th annual international conference on systems documentation (pp. 24–26). New York, NY: ACM.

Lin, D. (1998). An information-theoretic definition of similarity. ICML (Vol. 98, pp. 296–304). Winnipeg, Manitoba.

Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11), 39–41.

Miller, G. A., Leacock, C., Tengi, R., & Bunker, R. T. (1993). A semantic concordance. Proceedings of the workshop on human language technology (pp. 303–308). Stroudsburg, PA: Association for Computational Linguistics.

Moscato, P., & Norman, M. G. (1992). A memetic approach for the traveling salesman problem: Implementation of a computational ecology for combinatorial optimization on message-passing systems. Parallel Computing and Transputer Applications, 1, 177–186.

Navigli, R. (2009). Word sense disambiguation: A survey. ACM Computing Surveys (CSUR), 41(2), article no. 10. doi:10.1145/1459352.1459355

Navigli, R., & Lapata, M. (2010). An experimental study of graph connectivity for unsupervised word sense disambiguation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(4), 678–692.

Nguyen, K.-H., & Ock, C.-Y. (2013). Word sense disambiguation as a traveling salesman problem. Artificial Intelligence Review, 40(4), 405–427.

Pantel, P., & Lin, D. (2002). Discovering word senses from text. Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining (pp. 613–619). New York, NY: ACM.

Pedersen, T. (2010). Information content measures of semantic similarity perform better without sense-tagged text. Human language technologies: The 2010 annual conference of the North American chapter of the Association for Computational Linguistics (pp. 329–332). Stroudsburg, PA: Association for Computational Linguistics.

Pedersen, T., Banerjee, S., & Patwardhan, S. (2005). Maximizing semantic relatedness to perform word sense disambiguation. University of Minnesota Supercomputing Institute research report UMSI, 25.

Rada, R., Mili, H., Bicknell, E., & Blettner, M. (1989). Development and application of a metric on semantic nets. IEEE Transactions on Systems, Man and Cybernetics, 19(1), 17–30.

Resnik, P. (1995, November). Using information content to evaluate semantic similarity in a taxonomy. International joint conference on artificial intelligence (IJCAI-95) (pp. 448–453). arXiv preprint cmp-lg/9511007.

Rosso, P., Masulli, F., & Buscaldi, D. (2003). Word sense disambiguation combining conceptual distance, frequency and gloss. Proceedings of the 2003 international conference on natural language processing and knowledge engineering (pp. 120–125). Beijing: IEEE.

Schütze, H. (1998). Automatic word sense discrimination. Computational Linguistics, 24(1), 97–123.

Schwab, D., Goulian, J., Tchechmedjiev, A., & Blanchon, H. (2011, December). Ant colony algorithm for the unsupervised word sense disambiguation of texts: Comparison and evaluation. COLING (pp. 2389–2404).

Sinha, R. S., & Mihalcea, R. (2007, September). Unsupervised graph-based word sense disambiguation using measures of word semantic similarity. IEEE sixth international conference on semantic computing (Vol. 7, pp. 363–369). Irvine, CA: ICSC. doi:10.1109/ICSC.2007.87

Weaver, W. (1955). Translation. Machine Translation of Languages, 14, 15–23.

Wu, Z., & Palmer, M. (1994). Verbs semantics and lexical selection. Proceedings of the 32nd annual meeting of the Association for Computational Linguistics (pp. 133–138). Stroudsburg, PA: Association for Computational Linguistics.


Yarowsky, D. (1992). Word-sense disambiguation using statistical models of Roget's categories trained on large corpora. Proceedings of the 14th conference on computational linguistics – volume 2 (pp. 454–460). Stroudsburg, PA: Association for Computational Linguistics.

Yu, J., Li, C., Hong, W., Li, S., & Mei, D. (2015). A new approach of rules extraction for word sense disambiguation by features of attributes. Applied Soft Computing, 27, 411–419.

Zhang, C., Zhou, Y., & Martin, T. (2008). Genetic word sense disambiguation algorithm. Second international symposium on intelligent information technology application (IITA '08) (Vol. 1, pp. 123–127). Shanghai: IEEE.
