Smart combination of web measures for solving semantic similarity problems

Smart combination of webmeasures for solving semantic

similarity problemsJorge Martinez-Gil and Jose F. Aldana-Montes

Department of Computer Languages and Computing Sciences,University of Malaga, Malaga, Spain

Abstract

Purpose – Semantic similarity measures are very important in many computer-related fields.Previous works on applications such as data integration, query expansion, tag refactoring or textclustering have used some semantic similarity measures in the past. Despite the usefulness of semanticsimilarity measures in these applications, the problem of measuring the similarity between two textexpressions remains a key challenge. This paper aims to address this issue.

Design/methodology/approach – In this article, the authors propose an optimization environmentto improve existing techniques that use the notion of co-occurrence and the information available onthe web to measure similarity between terms.

Findings – The experimental results using the Miller and Charles and Gracia and Mena benchmarkdatasets show that the proposed approach is able to outperform classic probabilistic web-basedalgorithms by a wide margin.

Originality/value – This paper presents two main contributions. The authors propose a noveltechnique that beats classic probabilistic techniques for measuring semantic similarity between terms.This new technique consists of using not only a search engine for computing web page counts, but asmart combination of several popular web search engines. The approach is evaluated on the Miller andCharles and Gracia and Mena benchmark datasets and compared with existing probabilistic webextraction techniques.

Keywords Similarity measures, Web intelligence, Web search engines, Information integration,Information searches, Internet

Paper type Research paper

IntroductionThe study of semantic similarity between terms is an important part of manycomputer-related fields (Zhu et al., 2010). Semantic similarity between terms changesover time and across domains. The traditional approach to solve this problem hasconsisted of using manually compiled taxonomies such as WordNet (Budanitsky andHirst, 2006). The problem is that many terms (proper nouns, brands, acronyms, newwords, and so on) are not covered by dictionaries; therefore similarity measures thatare based on dictionaries cannot be used directly in these tasks (Bollegala et al., 2007).

The current issue and full text archive of this journal is available at

www.emeraldinsight.com/1468-4527.htm

The authors wish to thank all the anonymous reviewers for their comments and suggestions,which have helped to improve this work. The authors thank Lisa Huckfield for proofreading thismanuscript. This work was funded by the Spanish Ministry of Innovation and Science throughthe project ICARIA: From Semantic Web to Systems Biology, Project Code: TIN2008-04844 andby the Regional Government of Andalucia through the Pilot Project for Training on AppliedSystems Biology, Project Code: P07-TIC-02978.

OIR36,5

724

Received 13 July 2011Accepted 11 October 2011

Online Information ReviewVol. 36 No. 5, 2012pp. 724-738q Emerald Group Publishing Limited1468-4527DOI 10.1108/14684521211276000

However, we think that the great advances in web research have provided newopportunities for developing more accurate solutions.

In fact, with the increase in larger and larger collections of data resources on theworldwide web (WWW), the study of web extraction techniques has become one of themost active areas for researchers. We consider techniques of this kind very useful forsolving problems related to semantic similarity because new expressions areconstantly being created and also new meanings are assigned to existing expressions(Bollegala et al., 2007). Manually maintaining databases to capture these newexpressions and meanings is very difficult, but it is, in general, possible to find all ofthese new expressions in the WWW (Yadav, 2010).

Therefore our approach considers the chaotic and exponential growth of the WWWto be not only the problem, but also the solution. In fact we are interested in threecharacteristics of the web:

(1) It is one of the biggest and most heterogeneous databases in the world, andpossibly the most valuable source of general knowledge. Therefore the webfulfils the properties of domain independence, universality and maximumcoverage proposed by Gracia and Mena (2008).

(2) It is close to human language, and therefore can help to address problemsrelated to natural language processing.

(3) It provides mechanisms to separate relevant from non-relevant information, orrather the search engines do. We will use these search engines to our benefit.

One of the most outstanding works in this field is the definition of the “normalisedGoogle distance” (Cilibrasi and Vitanyi, 2007). This distance is a measure of semanticrelatedness derived from the number of hits returned by the Google search engine for agiven (set of) keyword(s). The idea behind this measure is that keywords with similarmeanings from a natural language point of view tend to be close according to theGoogle distance, while words with dissimilar meanings tend to be farther apart. In factCilibrasi and Vitanyi (2007, p. 1) state: “We present a new theory of similarity betweenwords and phrases based on information distance and Kolmogorov complexity. To fixthoughts, we used the World Wide Web (WWW) as the database, and Google as thesearch engine. The method is also applicable to other search engines and databases”.Our work is about those web search engines; more specifically we are going to use notonly Google, but a selected set of the most popular ones.

In this work we are going to mine the web, using web search engines to determinethe degree of semantic similarity between terms. It should be noted that under nocircumstances can data from experiments presented in this work be considered todemonstrate that one particular web search engine is better than another or that theinformation it provides is more accurate. In fact we show that the best results areobtained when weighting all of them in a smart way. Therefore the main contributionsof this work are:

. We propose a novel technique that beats classic probabilistic techniques formeasuring semantic similarity between terms. This new technique consists ofusing not only a search engine for computing webpage counts, but a smartcombination of several popular web search engines. The smart combination is

Semanticsimilarityproblems

725

obtained using an elitist genetic algorithm that is able to adjust the weights ofthe combination formula in an efficient manner.

. We evaluate our approach on the Miller and Charles (1998) and Gracia and Mena(2008) benchmark datasets and compare it with existing probabilistic webextraction techniques.

The rest of this work is organised as follows. The next section describes several usecases where our work can be helpful, followed by a discussion of related works. Thenwe describe the preliminary technical definitions that are necessary to understand ourproposal. After that we present our contribution, which consists of an optimisationschema for a weighted combination of popular web search engines. The subsequentsection shows the data that we have obtained from an empirical evaluation of ourapproach. Finally we reach conclusions and suggest future lines of research.

Use casesIdentifying semantic similarities between terms is not only an indicator of mastery of alanguage, but a key aspect in a lot of computer-related fields too. It should be noted thatsemantic similarity measures can help computers to distinguish one object fromanother, group them based on the similarity, classify a new object as part of a group,predict the behaviour of the new object or simplify all the data into reasonablerelationships. There are many disciplines that can benefit from these capabilities, forexample data integration, query expansion, tag refactoring and text clustering.

Data integrationNowadays data from a large number of webpages are collected in order to provide newservices. In such cases extraction is only part of the process. The other part is theintegration of the extracted data to produce a coherent database because different sitestypically use different data formats (Halevy et al., 2006). Integration means matchingcolumns in different data tables that contain the same kind of information (e.g. productnames) and matching values that are semantically equivalent but representeddifferently in other sites (e.g. “cars” and “automobiles”). Unfortunately only limitedintegration research has been done so far in this field.

Query expansionQuery expansion is the process of reformulating queries in order to improve retrievalperformance in information retrieval tasks (Vechtomova and Karamuftuoglu, 2007). Inthe context of web search engines, query expansion involves evaluating what termswere typed into the search query area and expanding the search query to matchadditional webpages. Query expansion involves techniques such as finding synonymsof words (and searching for the synonyms as well) or fixing spelling errors andautomatically searching for the corrected form or suggesting it in the results, forexample.

Web search engines invoke query expansion to increase the quality of user searchresults. It is assumed that users do not always formulate search queries using the mostappropriate terms. Inappropriateness in this case may be because the system does notcontain the terms typed by the user.

OIR36,5

726

Tag refactoringNowadays the popularity of tags in websites has markedly increased, but tags can beproblematic because lack of control means they are prone to inconsistency andredundancy. It is well known that if tags are freely chosen (instead of taken from agiven set of terms), synonyms (multiple tags for the same meaning), normalisation ofwords and even heterogeneity of users are likely to arise, lowering the efficiency ofcontent indexing and searching contents (Urdiales-Nieto et al., 2009). Tag refactoring(also known as tag cleaning or tag gardening) is very important in order to avoidredundancies when labelling resources in the WWW.

Text clusteringText clustering is closely related to the concept of data clustering. Text clustering is amore specific technique for document organisation, automatic topic extraction and fastinformation retrieval and filtering (Song et al., 2009).

A web search engine often returns many webpages in response to a broad query,making it difficult for users to browse or to identify relevant information. Clusteringmethods can be used to automatically group the retrieved webpages into a list oflogical categories.

Text clustering involves the use of descriptors and descriptor extraction.Descriptors are sets of words that describe the contents within the cluster.Examples of text clustering include web document clustering for search users.

Related workIn addition to their multiple applications in the natural language processing field, it iswidely accepted that similarity measures are essential to solve many problems such asclassification, clustering, and retrieval problems. For this reason several different wayshave been proposed over the last few years to measure semantic similarity (Budanitskyand Hirst, 2006). According to the sources exploited and the way in which they areused, different families of methods can be identified. These families are:

. taxonomies of concepts;

. feature-based approaches; and

. the web as corpus paradigm.

Among taxonomies of concepts the most popular method for calculating similaritybetween two words consists of finding the length of the shortest path connecting thetwo words in a taxonomy (Budanitsky and Hirst, 2006). If a word has two or moremeanings, then multiple paths may exist between the two words. In such cases it ispossible to choose the shortest path between any two meanings of the words(optimistic approach) or the largest path between them (pessimistic approach). Aproblem frequently acknowledged with this approach is that it relies on the notion thatall links in the taxonomy represent uniform distances.

An advantage of our method compared to the taxonomy-based semantic similaritymeasures is that our method requires no taxonomies for computing the semanticsimilarity. Therefore the proposed method can be applied in many tasks where suchtaxonomies do not exist or are not up-to-date.

In feature-based approaches it is possible to estimate semantic similarity accordingto the amount of common and non-common features (Petrakis et al., 2006); for features


727

authors typically consider taxonomical information found in an ontology and conceptdescriptions retrieved from dictionaries.

The problem of feature-based approaches is that the feature detection stage requiresfeatures to be located accurately and reliably. This is a non-trivial task. However it isnot easy to determine whether two words share a common feature, due to the problemof feature matching. Our approach does not require feature detection or matching.

Using the web as a knowledge corpus (Lapata and Keller, 2005) unsupervisedmodels demonstrably perform better when n-gram counts are obtained from the webrather than from any other corpus (Keller and Lapata, 2003). In fact Resnik and Smith(2003) extracted sentences from the web to create parallel corpora for machinetranslation. We have identified two research lines related to the use of web measures:

(1) measures based on web hits; and

(2) measures based on text snippets.

Web hitsOne of the best measures was introduced by Turney (2001) and it consists of apoint-wise mutual information measure using the number of hits returned by a websearch engine to recognise synonyms (Sanchez et al., 2010). Meanwhile, Bollegala et al.have proposed several methods which use web hits for measuring semantic similarity(Bollegala et al., 2007), mining personal name aliases (Bollegala et al., 2008), ormeasuring relational similarity (Bollegala et al., 2009).

Text snippetsIt should be noted that snippets are brief windows of text extracted by a search enginearound the query term in a document, and provide information regarding the queryterm. Methods addressing this issue were proposed by Sahami and Heilman (2006),who measured the semantic similarity between two queries using snippets returned forthose queries. For each query they collected snippets from a search engine andrepresented each snippet as a TF-IDF-weighted vector. Chen and Lin (2006) proposed adouble-checking model using text snippets returned by a web search engine tocompute semantic similarity between words.

Technical preliminariesGiven two terms a and b, the problem which we are addressing consists of trying tomeasure the semantic similarity between them. Semantic similarity is a concept thatextends beyond synonymy and is often called semantic relatedness in the literature.According to Bollegala et al. (2007) a certain degree of semantic similarity can beobserved not only between synonyms (e.g. lift and elevator), but also betweenmeronyms (e.g. car and wheel), hyponyms (leopard and cat), related words (e.g. bloodand hospital) as well as between antonyms (e.g. day and night). In this work we focuson optimising web extraction techniques that try to measure the degree of synonymybetween two given terms using the information available on the WWW.

Below are the definitions for the concepts that we are going to use.A similarity measure sm is a function sm : m1 £ m2 ! R that associates the

similarity of two input terms m1 and m2 to a similarity score sc [ R in the range [0,1]which states the confidence for the relation m1 and m2 to be true. In this way a

OIR36,5

728

similarity score of 0 stands for complete inequality of the input terms and 1 for thesemantic equivalence of m1 and m2.

A hit (also known as page count) is an item found by a search engine to matchspecified search conditions. More formally we can define a hit as the functionhit : expr ! N , which associates an expression with a natural number whichascertains the popularity of expr in the subset of the WWW indexed by the searchengine. A value of 0 stands for no popularity and the bigger the value, the bigger itsassociated popularity. Moreover the function “hit” has many possible implementations.In fact, every web search engine implements it a different way.

Similarity measures based on web hitsIf we look at the literature we can find many similarity measures. For examplemeasures based on hits, which are a kind of similarity measure calculated based on theco-occurrence (Dagan et al., 1999) of the terms in the WWW, and thus the number ofhits for each individual term to be compared and their conjunction. Four of the mostpopular measures of this kind are pointwise mutual information (PMI) (Turney, 2001),dice, overlap coefficient, and Jaccard (Manning and Schutze, 1999). When thesemeasures are used on the web, it is necessary to add the prefix “Web”, WebPMI,WebDice, and so on. All of them are considered probabilistic because given a webpagecontaining one of the terms a or b, these measures represent different ways to computethe probability of that webpage also containing the other term (Manning and Schutze,1999). These are their corresponding formulae:

WebPMIða; bÞ ¼pða; bÞ

pðaÞ · pðbÞ;

WebDiceða; bÞ ¼2 · pða; bÞ

pðaÞ þ pðbÞ;

WebOverlapða; bÞ ¼pða; bÞ

min½pðaÞ; pðbÞ�;

WebJaccardða; bÞ ¼pða; bÞ

pðaÞ þ pðbÞ2 pða; bÞ:

On the WWW probabilities of term co-occurrence can be expressed by hits. In factthese formulas are measures for the probability of co-occurrence of the terms a and b(Cilibrasi and Vitanyi, 2007). The probability of a specific term is given by the numberof hits returned when a given search engine is presented with this search term dividedby the overall number of webpages possibly returned. The joint probability p(a, b) isthe number of hits returned by a web search engine, containing both the search term aand the search term b divided by the overall number of webpages possibly returned.Estimation of the number of webpages possibly returned by a search engine has beenstudied in the past (Bar-Yossef and Gurevich, 2006). In this work we set the overallnumber of webpages as 1010 according to the number of indexed pages reported byBollegala et al. (2007).


729

Despite its simplicity using hits as a measure of co-occurrence of two terms presentsseveral advantages. It is surprisingly good at recognising synonyms because it seemsto be empirically supported that synonyms often appear together in webpages(Cilibrasi and Vitanyi, 2007). Moreover if the search terms never occur (or almost never)together on the same webpages, but do occur separately, this kind of measure willpresent very low similarity values.

ContributionTraditionally web extraction techniques which are based on the notion of web hitshave used a given web search engine (frequently Google) in order to collect thenecessary information. We could see no compelling reason for considering Google themost appropriate web search engine in order to collect information and we began to useother search engines.

When we collected and processed the data from a wide variety of search engines, wesaw that Google was the best source. However we discovered that the average mean ofthe results was better than the results from Google. So we felt that an optimisedcombination of those search engines could be even better than the average means.Based on this we designed an experiment in order to test our hypothesis.

Figure 1 shows the solution proposed for the computation of the optimised webextraction approaches. The key of the architecture is an elitist genetic algorithm (i.e. analgorithm which selects the better individuals in each iteration), with a small mutation ratethat generates a vector of numerical weights and associates them with their correspondingweb search engine. The final solution can be expressed in the form of a numeric weightvector. It should be noted that the output of this architecture is a function that tries tosimulate the behaviour of the (group of) human(s) who solve the input benchmark dataset.

For the implementation of the function “hit”, we have chosen the following searchengines from among the most popular in the Alexa ranking (see www.alexa.com):

. Google;

. Yahoo!;

. Altavista;

. Bing; and

. Ask.

It should be noted that our approach does not include the automatic selection ofappropriate web search engines because it assumes that all of them are offered initially,and those associated with a weight of 0 will be automatically deselected.

The vector of numerical weights is encoded in binary format in each of the genesbelonging to the domain of the genetic algorithm. For example a vector of 20 bits canrepresent one weight of 20 bits, two weights of 10 bits, four weights of five bits, or fiveweights of four bits. It is necessary to implement a function to convert each binaryweight into decimal format before calculating the fitness of the weight vector. Thenumber of decimal possibilities for each weight is 2bits; for example, in the case ofchoosing a vector of five weights of four bits, it is possible to represent five weightswith 24 ¼ 16 different values. These different values can range from 0 to 15, from 8 to27, from 0 to 1 (where ticks have double precision) and so on. This depends on theimplementation for the conversion from binary format into decimal format.

OIR36,5

730

The comparison between the benchmark datasets and our results is made using thePearson’s correlation coefficient, a statistical measure which allows comparing twomatrices of numeric values. In this way we can obtain evidence about how ourapproach can replicate human judgment. The results can be in the interval [21, 1],where 21 represents the worst case (totally different values) and 1 represents the bestcase (totally equivalent values).

Our approach is efficient and scalable because it only processes the snippets (nodownloading of webpages is necessary) for the result pages from the web searchengines. Moreover, scalability is provided by our genetic algorithm, which allows us toexpand the set of web search engines without causing any kind of technical problem.This would not be possible if we used a brute force strategy, for example.

It should be noted that problems related to the speed of the process, and moreimportantly to the restrictions on the maximum number of allowed queries per engine,are solved by means of a preliminary phase which loads the results retrieved from the

Figure 1.Solution proposed for the

computation of thesemantic similarity

measure


731

search engines into appropriate data structures. We consider this information valid fora period of seven days, because the content indexed by the search engines growsdynamically.

EvaluationThe evaluation of our approach is divided into:

. a preliminary study to choose the appropriate parameters for configuring theoptimisation environment;

. the empirical data that we have collected from the experiments;

. a study on the quality of the optimised function;

. the effect of the operators on the final result; and

. the information necessary to repeat the experiments.

Preliminary studyThis was a preliminary study of the configuration parameters for the environment.

For the number of genes per chromosome we selected such values as 5, 10 and 20. Astudy using a t-test distribution showed that the differences between samples obtainedby means of these parameters are not statistically significant. Therefore we selected 20genes per chromosome.

For the number of individuals in the population we selected such values as 20, 50and 100. Again a t-test statistical distribution showed that the differences betweenthese samples are not statistically significant. We thus selected a population of 100individuals.

For the crossover and mutation fraction we chose a high value for the crossoverbetween genes and a small percentage for mutations, because we require a classicalconfiguration for the genetic algorithm.

For the combination formula our preliminary study showed that the weighted sumis better than the weighted product and the weighted square.

After ten independent executions we noticed that the genetic algorithm did notimprove the results beyond the 200th generation, so we set a limit of 200 generationsfor the algorithm. The final results were computed by means of the average mean fromthese ten generations. We evaluated our approach using the Miller and Charles (1998)benchmark dataset, which is a dataset of term pairs rated by a group of 38 people.Term pairs are rated on a scale from 0 (no similarity) to 4 (complete similarity). TheMiller-Charles dataset benchmark is a subset of Rubenstein and Goodenough’s (1965)original benchmark dataset of 65 term pairs. Although Miller and Charles’s experimentwas carried out many years later than Rubenstein and Goodenough’s, Bollegala et al.(2007) state that the two sets of ratings are highly correlated. Therefore theMiller-Charles ratings can be considered a good benchmark dataset for evaluatingsolutions that involve semantic similarity measures.

ResultsIn this section we present the empirical results of our experiments. It should be notedthat all figures, except those for the Miller-Charles ratings, are normalised into valuesin the [0, 1] range for ease of comparison. Pearson’s correlation coefficient is invariantagainst a linear transformation (Bollegala et al., 2007).

OIR36,5

732

Table I shows the collected data using several search engines for solving theMiller-Charles benchmark dataset. The measure that we used is normalised Googledistance (Cilibrasi and Vitanyi, 2007). As we commented previously Google is the bestsearch engine for our purposes; however the average mean and the median presenteven better results. Other kinds of statistical measures such as mode, maximum andminimum were considered but not listed because their associated results were no betterthan the current ones.

Figure 2 shows us the behaviour of the average mean in relation to theMiller-Charles benchmark dataset. As can be seen the behaviour is similar, however itcan be improved by using more elaborated formulas than the average mean, as weshow in our second experiment.

Meanwhile Table II shows us the results from using several probabilistic webextraction algorithms described previously. The data represent the numerical valuesobtained before and after the optimisation process. The classic score is obtained usingonly the Google search engine, while the optimised score is obtained by means of thePearson’s correlation coefficient obtained after 200 generations of an elitist geneticalgorithm that tries to maximise the weighted sum of some of the most popular websearch engines.

M-C Google Ask Altavista Bing Yahoo! Average Median

cord-smile 0.13 0.05 0.25 1.00 0.00 0.00 0.26 0.05rooster-voyage 0.08 0.24 1.00 0.00 0.00 0.00 0.25 0.00noon-string 0.08 0.50 1.00 0.00 0.00 0.00 0.30 0.00glass-magician 0.11 1.00 1.00 1.00 0.00 0.01 0.60 1.00monk-slave 0.55 1.00 1.00 1.00 0.00 0.00 0.60 1.00coast-forest 0.42 1.00 1.00 1.00 0.02 0.01 0.61 1.00monk-oracle 1.10 1.00 1.00 1.00 0.00 0.00 0.60 1.00lad-wizard 0.42 0.04 1.00 1.00 0.00 0.00 0.41 0.04forest-graveyard 0.84 1.00 1.00 1.00 0.00 0.01 0.60 1.00food-rooster 0.89 1.00 1.00 1.00 0.00 0.00 0.60 1.00coast-hill 0.87 1.00 1.00 1.00 0.06 0.02 0.62 1.00car-journey 1.16 0.17 1.00 1.00 0.01 0.00 0.44 0.17crane-implement 1.68 0.14 1.00 0.00 0.00 0.00 0.23 0.00brother-lad 1.66 0.18 1.00 1.00 0.00 0.00 0.44 0.18bird-crane 2.97 1.00 1.00 1.00 0.13 0.09 0.64 1.00bird-cock 3.05 1.00 1.00 1.00 0.07 0.07 0.63 1.00food-fruit 3.08 1.00 1.00 1.00 0.03 0.01 0.61 1.00brother-monk 2.82 1.00 1.00 1.00 0.11 0.04 0.63 1.00asylum-madhouse 3.61 1.00 1.00 1.00 0.00 0.00 0.60 1.00furnace-stove 3.11 0.46 1.00 1.00 0.00 1.00 0.69 1.00magician-wizard 3.50 1.00 1.00 1.00 0.04 0.98 0.80 1.00journey-voyage 3.84 1.00 1.00 1.00 0.00 0.00 0.60 1.00coast-shore 3.70 1.00 1.00 1.00 0.02 0.08 0.62 1.00implement-tool 2.95 1.00 1.00 1.00 0.00 0.02 0.60 1.00boy-lad 3.76 1.00 1.00 1.00 0.18 0.02 0.64 1.00automobile-car 3.92 1.00 1.00 1.00 0.01 0.34 0.67 1.00midday-noon 3.42 1.00 1.00 1.00 0.07 0.00 0.61 1.00gem-jewel 3.84 1.00 1.00 1.00 0.39 0.05 0.69 1.00Correlation 1.00 0.47 0.26 0.35 0.43 0.34 0.61 0.54

Table I.Summary results for the

Miller-Charlesbenchmark using several

search engines


733

Lastly, Figure 3 is a histogram of the results that clearly show our approachsignificantly outperforms classic probabilistic web extraction techniques. In somecases there is an improvement of more than twice the original score.

Although our hypothesis has been tested using measures based on web hits, we seeno problem in using other kinds of web extraction techniques, and thus theoptimisation environment can be used to optimise other kinds of web extractiontechniques if these techniques are able to be parameterised.

Measure Classic score Optimised score Improvement

WebPMI 0.548 0.678 23.72WebDice 0.267 0.622 132.95WebOverlap 0.382 0.554 45.02WebJaccard 0.259 0.565 118.14

Table II.Results from probabilisticweb extractionalgorithms before andafter the optimisationprocess for theMiller-Charles dataset

Figure 2.Comparison betweenMiller-Charles benchmarkdataset and average meanfrom several web searchengines

Figure 3.Existing techniques can besignificantly improved byusing our approach

OIR36,5

734

Quality of the obtained functionPerhaps the most important part of this work is to determine whether the trainedoptimal weights for the previous dataset can maintain the same level of improvementfor other term pairs. In order to clarify this we used the function optimised using theMiller-Charles benchmark dataset for solving a different benchmark.

For this we have chosen the Gracia and Mena (2008) benchmark dataset, which issimilar to our previous dataset. It contains 30 pairs of English nouns, with some kind ofrelationship present in most of them, i.e. similarity (person, soul), meronymy (hour,minute), frequent association (penguin, Antarctica), and others.

As can be seen in Table III the quality of the optimised function when solving theGracia-Mena benchmark dataset is very good, or at least it is better to use theoptimised function that the classical approach.

Effect of the operatorsIn our experiments we chose the weighted sum because the results are generally betterthan the results for the weighted product or the weighted square. The effects of theoperators in the optimisation environment can be seen in Table IV.

The weighted sum is the most appropriate operator in all cases, because values arealways better than those obtained using a weighted product or a weighted square. Thisis why we chose the weighted sum as the operator for our optimisation environment.

In future work we can try to test the effects of other kinds of statistical measuressuch as median, mode, maximum and minimum for the values retrieved from thedifferent search engines, and so on. These measures were not considered in this work,because they do not have an associated numeric value that can be optimised using ourapproach.

Data for repeating the experimentsRelated to the conditions of the experiment, we used Google, Ask, Altavista, Bing andYahoo! as web search engines for implementing the function “hit”.

Measure G-M G-M0 Improvement


Note: The G-M column shows the results from Google; G-M0 shows the results from the smartfunction obtained in the previous experiment

Table III.Quality of the optimised

function when solving theGracia-Mena benchmark

Measure Weighted sum Weighted product Weighted square


Table IV.Effect of the operators inthe final results from theoptimisation environment


735

The elitist genetic algorithm was configured taking the following parameters intoaccount (fitness and search space were explained in the previous section):

. 20 genes per chromosome;

. a population of 100 individuals;

. 0.98 for crossover fraction;

. 0.05 for mutation fraction;

. we allow 200 generations; and

. the goal is to optimise the weighted sum.

We chose the weighted sum because the results are generally better than the results forthe weighted product.

It should be noted when repeating the experiments that the results can vary slightlyover time because the content indexed by the web search engines is not static.

Conclusions and outlookIn this work we have presented a novel approach that generalises and extends previousproposals for determining the semantic similarity between terms using web searchengines. The optimisation environment consists of several web extraction techniquesthat use a selected set of simple web search engines (instead of a single web searchengine), which are combined in order to achieve satisfactory results.

We have provided an analysis of the most popular algorithms for extracting webinformation by using web hits, and characterised their relative applicability as blackboxes in our optimisation environment. It is necessary to bear in mind that the successof our approach depends largely on the kind of underlying simple web search enginesincluded and the heterogeneity and soundness of the benchmark datasets fordetermining their associated weights.

We have implemented a prototype for our optimisation schema following anapproach based on genetic algorithms that is highly scalable, and thus can beexpanded with many simple web search engines. Our proposal has been optimisedusing a widely used benchmark dataset (Miller and Charles, 1998) and applied to solveanother widely used but different benchmark dataset (Gracia and Mena, 2008). Wehave shown that our approach significantly outperforms the classical probabilistictechniques by a wide margin when solving the two datasets.

As future work we propose to compare the knowledge provided by onlineencyclopaedias and that provided by the web search engines. The idea behind thisproposal is to use not only simple measures such as those based on web hits, but tobenefit from the structured nature of these encyclopaedias in order to use and optimisemore complex web extraction techniques. The goal is to further improve semanticsimilarity detection techniques. In this way semantic interoperability between people,computers or simply agents in the WWW might become possible.

References

Bar-Yossef, Z. and Gurevich, M. (2006), “Random sampling from a search engine’s index”,Proceedings of the World Wide Web Conference, ACM Press, New York, NY, pp. 367-76.

OIR36,5

736

Bollegala, D., Matsuo, Y. and Ishizuka, M. (2007), “Measuring semantic similarity between wordsusing web search engines”, Proceedings of the World Wide Web Conference, ACM Press,New York, NY, pp. 757-66.

Bollegala, D., Matsuo, Y. and Ishizuka, M. (2009), “Measuring the similarity between implicitsemantic relations from the web”, Proceedings of the World Wide Web Conference,ACM Press, New York, NY, pp. 651-60.

Bollegala, D., Honma, T., Matsuo, Y. and Ishizuka, M. (2008), “Mining for personal name aliaseson the web”, Proceedings of the World Wide Web Conference, ACM Press, New York, NY,pp. 1107-8.

Budanitsky, A. and Hirst, G. (2006), “Evaluating WordNet-based measures of lexical semanticrelatedness”, Computational Linguistics, Vol. 32 No. 1, pp. 13-47.

Chen, H. and Lin, W. (2006), “Novel association measures using web search with doublechecking”, Proceedings of COLING/ACL, Association for Computational Linguistics,Stroudsburg, PA, pp. 1009-16.

Cilibrasi, R. and Vitanyi, P.M. (2007), “The Google similarity distance”, IEEE Transactions inKnowledge Data Engineering, Vol. 19 No. 13, pp. 370-83.

Dagan, I., Lee, L. and Pereira, F. (1999), “Similarity-based models of word co-occurrenceprobabilities”, Machine Learning, Vol. 34 No. 1, pp. 43-69.

Gracia, J. and Mena, E. (2008), “Web-based measure of semantic relatedness”, Proceedings ofWISE, Springer, Berlin, pp. 136-50.

Halevy, A., Rajaraman, A. and Ordille, J. (2006), “Data integration: the teenage years”,Proceedings of VLDB, pp. 9-16, available at: www.vldb.org/conf/2006/p9-halevy.pdf(accessed 16 March 2012).

Keller, F. and Lapata, M. (2003), “Using the web to obtain frequencies for unseen bigrams”,Computational Linguistics, Vol. 29 No. 3, pp. 459-84.

Lapata, M. and Keller, F. (2005), “Web-based models of natural language processing”,ACM Transactions on Speech and Language Processing, Vol. 2 No. 1, pp. 1-31.

Manning, C.D. and Schutze, H. (1999), Foundations of Statistical Natural Language Processing,MIT Press, Cambridge, MA.

Miller, G. and Charles, W. (1998), “Contextual correlates of semantic similarity”, Language andCognitive Processes, Vol. 6 No. 1, pp. 1-28.

Petrakis, E.G.M., Varelas, G., Hliaoutakis, A. and Raftopoulou, P. (2006), “X-similarity:computing semantic similarity between concepts from different ontologies”, Journal ofDigital Information Management, Vol. 4 No. 4, pp. 233-7.

Resnik, P. and Smith, N.A. (2003), “The web as a parallel corpus”, Computational Linguistics,Vol. 29 No. 3, pp. 349-80.

Rubenstein, H. and Goodenough, J.B. (1965), “Contextual correlates of synonymy”,Communications of the ACM, Vol. 8 No. 10, pp. 627-33.

Sahami, M. and Heilman, T. (2006), “A web-based kernel function for measuring the similarity ofshort text snippets”, Proceedings of the World Wide Web Conference, ACM Press,New York, NY, pp. 377-86.

Sanchez, D., Batet, M., Valls, A. and Gibert, K. (2010), “Ontology-driven web-based semanticsimilarity”, Journal of Intelligent Information Systems, Vol. 35 No. 3, pp. 383-413.

Song, W., Hua Li, C. and Cheol Park, S. (2009), “Genetic algorithm for text clustering usingontology and evaluating the validity of various semantic similarity measures”, ExpertSystems with Applications, Vol. 36 No. 5, pp. 9095-104.


737

Turney, P.D. (2001), “Mining the web for synonyms: PMI-IR versus LSA on TOEFL”,Proceedings of ECML, Springer, Berlin, pp. 491-502.

Urdiales-Nieto, D., Martinez-Gil, J. and Aldana-Montes, J.F. (2009), “MaSiMe: a customizedsimilarity measure and its application for tag cloud refactoring”, Proceedings of OTMWorkshops, Springer, Berlin, pp. 937-46.

Vechtomova, O. and Karamuftuoglu, M. (2007), “Query expansion with terms selected usinglexical cohesion analysis of documents”, Information Processing Management, Vol. 43No. 4, pp. 849-65.

Yadav, S.B. (2010), “A conceptual model for user-centered quality information retrieval on theWorld Wide Web”, Journal of Intelligent Information Systems, Vol. 35 No. 1, pp. 91-121.

Zhu, L., Ma, Q., Liu, C., Mao, G. and Yang, W. (2010), “Semantic-distance based evaluation ofranking queries over relational databases”, Journal of Intelligent Information Systems,Vol. 35 No. 3, pp. 415-45.

About the authorsDr Jorge Martinez-Gil is a Senior Researcher in the Department of Computer Languages andComputing Sciences at the University of Malaga (Spain). His main research interests are relatedto interoperability in the worldwide web. In fact, his PhD thesis addressed ontologymeta-matching and reverse ontology matching problems. Dr Martinez-Gil has published severalpapers in prestigious journals like SIGMOD Record, Knowledge and Information Systems,Knowledge Engineering Review, and so on. Moreover, he is a reviewer for conferences andjournals related to the data and knowledge engineering field. Jorge Martinez-Gil is thecorresponding author and can be contacted at: [email protected]

Professor Dr Jose F. Aldana-Montes is currently a Professor in the Department of Languagesof Computing Sciences at the Higher Computing School of the University of Malaga (Spain) andHead of Khaos Research, a group that researches the semantic aspects of databases. DrAldana-Montes has more than 20 years of experience in research relating to several aspects ofdatabases, semi-structured data and semantic technologies and its application to such fields asbioinformatics and tourism. He is the author of several relevant papers in top bioinformaticjournals and conferences. He has taught theoretical and practical aspects of databases at allpossible university levels from undergraduate courses to PhD.

OIR36,5

738

To purchase reprints of this article please e-mail: [email protected] visit our web site for further details: www.emeraldinsight.com/reprints

Documents

Smart combination of web measures for solving semantic similarity problems