
Undefined 1 (2009) 1–5
IOS Press

EQUATER - An Unsupervised Data-driven Method to Discover Equivalent Relations in Large Linked Datasets

Editor(s): Name Surname, University, Country
Solicited review(s): Name Surname, University, Country
Open review(s): Name Surname, University, Country

Ziqi Zhang a,*, Anna Lisa Gentile a, Eva Blomqvist b, Isabelle Augenstein a and Fabio Ciravegna a

a Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello, Sheffield, S1 4DP
E-mail: {ziqi.zhang,a.gentile,i.augenstein,f.ciravegna}@sheffield.ac.uk
b Department of Computer and Information Science, Linköping University, 581 83 Linköping, Sweden
E-mail: [email protected]

Abstract. The Web of Data is currently undergoing an unprecedented level of growth thanks to the Linked Open Data effort. One escalated issue is the increasing level of heterogeneity in the published resources, which seriously hampers the interoperability of Semantic Web applications. A decade of effort in the research of Ontology Alignment has contributed a rich literature dedicated to such problems. However, existing methods can still be limited when applied to the domain of Linked Open Data, where the widely adopted assumption of ‘well-formed’ ontologies breaks due to the larger degree of incompleteness, noise and inconsistency found both in the schemata and in the data described by them. Such problems become even more noticeable in the task of aligning relations, which is very important but insufficiently addressed. This article contributes to this particular problem by introducing EQUATER, a domain- and language-independent and completely unsupervised method to align equivalent relations across schemata based on their shared instances. EQUATER comprises a novel similarity measure able to cope with the unbalanced population of schema elements, an unsupervised technique to automatically decide similarity cutoff thresholds to assert equivalence, and an unsupervised clustering process to discover groups of equivalent relations across different schemata. The current version of EQUATER is particularly suited for a more specific yet realistic case: addressing alignment within a single large Linked Dataset, a problem that is becoming increasingly prominent as collaborative authoring is adopted by many large-scale knowledge bases. Using three datasets created based on DBpedia (the largest of which is based on a real problem currently concerning the DBpedia community), we show encouraging results from a thorough evaluation involving four baseline similarity measures and over 15 comparative models obtained by replacing EQUATER components with their alternatives: EQUATER improves significantly over baseline models in terms of F1 measure (mostly between 7% and 40%). It always scores the highest precision and is also among the top performers in terms of recall. Together with the released datasets to encourage comparative studies, this work contributes valuable resources to the related area of research.

Keywords: ontology alignment, ontology mapping, Linked Data, DBpedia, similarity measure

1. Introduction

The Web of Data is currently seeing remarkable growth under the Linked Open Data (LOD) community effort. The LOD cloud currently contains over 870 datasets and more than 62 billion triples1. It is becoming a gigantic, constantly growing and extremely valuable knowledge source useful to many applications [30,15]. Following the rapid growth of the Web of Data is the increasingly pressing issue of heterogeneity: multiple vocabularies exist to describe overlapping or even the same domains, and the same objects are labeled with different identifiers. The former is usually referred to as schema-level heterogeneity and the latter as data- or instance-level heterogeneity. It is widely recognized that LOD datasets are currently characterized by dense links at the data level but very sparse links at the schema level [40,24,14]. This may hamper the usability of data at large scale and decreases interoperability between Semantic Web applications built on LOD datasets. This work explores this issue and particularly studies linking relations across different schemata in the LOD domain, a problem that is currently under-represented in the literature.

* Corresponding author. E-mail: [email protected].
1 http://stats.lod2.eu/, visited on 01-11-2013

Research in the area of Ontology Alignment [13,41] has contributed a plethora of methods towards solving heterogeneity on the Semantic Web. The mainstream work [36,28,37,21,25,29,43,8,6,38] is archived under the Ontology Alignment Evaluation Initiative (OAEI) [16]. However, it has been criticized that these methods are tailored to cope with nicely structured and well defined ontologies [17], which differ from LOD ontologies characterized by noise and incompleteness [39,40,47,14,17,51].

Despite such rich literature, we notice that aligning heterogeneous relations is not yet well addressed, especially in the LOD context. Recent research has found this problem to be harder than, e.g., aligning classes or concepts [18,14,5]. Relation names are more diverse than concept names [5], and the synonymy and polysemy problems are also more typical [14,5]. This makes aligning relations in the LOD domain more challenging. Structural information about relations is particularly lacking [14,51], and the inconsistency between the intended meaning of schemata and their usage in data is more widespread [18,14,17].

Further, a limitation common to nearly all existing methods is the need to set a cutoff threshold on the computed similarity scores in order to assert correspondences. It is known that the performance of different methods is very sensitive to thresholds [37,29,46,19,5], while finding optimal thresholds requires expensive tuning and the availability of annotations; unfortunately, the thresholds are often context-dependent and require re-tuning for different tasks [22,42].

This work focuses specifically on linking heterogeneous relations in the LOD domain. We introduce EQUATER (EQUivalent relATion findER), a completely unsupervised method for discovering equivalent relations for specific concepts, using only data-level evidence without any schema-level information. EQUATER has three components: (1) a similarity measure that computes pair-wise similarity between relations, designed to cope with the unbalanced (and particularly sparse) population of schemata in LOD datasets; (2) an unsupervised method for detecting cutoff thresholds based on patterns discovered in the data; and (3) an unsupervised clustering process that groups equivalent relations, potentially discovering relation alignments among multiple schemata. The principle of EQUATER is to study the shared instances between two relations, a feature that makes it particularly suited for aligning relations found in a single, large Linked Dataset, such as DBpedia. Although Ontology Alignment is usually concerned with linking schemata and data instances across different datasets, the usage of heterogeneous resources is also common within a single dataset, and is becoming an increasingly prominent problem as the practice of collaborative authoring encourages integration with existing large LOD datasets by various parties, who often fail to conform to a universal schema. As a realistic scenario, the DBpedia mappings portal2 is a community effort dedicated to such a problem. Nevertheless, we also discuss how EQUATER can be improved to address cross-dataset alignment.

To thoroughly evaluate EQUATER, we use a number of datasets collected in a controlled manner, including one based on the practical problem faced by the DBpedia mappings portal. We create a large number of comparative models to assess EQUATER along the following dimensions: its similarity measure, its capability of coping with datasets featuring unbalanced usage of schemata, automatic threshold detection, and clustering. We report encouraging results from these experiments. EQUATER successfully discovers equivalent relations across multiple schemata, and the similarity measure of EQUATER is shown to significantly outperform all baselines in terms of F1 (maximum improvement of 0.47, or 47%). It also handles unbalanced populations of schema elements and shows stability against several alternative models. Meanwhile, the automatic threshold detection method is shown to be very competitive - it even outperforms the supervised models on one dataset in terms of F1. Overall, we believe that EQUATER provides an effective solution to practical problems in the LOD domain.

2 http://mappings.dbpedia.org/index.php/Mapping_en, visited on 01 August 2014

In the remainder of this paper, Section 2 discusses related work; Section 3 introduces the method; Section 4 describes a series of designed experiments and Section 5 discusses results, followed by conclusions in Section 6.

2. Related Work

2.1. Scope and terminology

An alignment between a pair of ontologies is a set of correspondences between entities across the ontologies [13,41]. Ontology entities are usually: classes defining the concepts within the ontology; individuals denoting the instances of these classes; literals representing concrete data values; datatypes defining the types that these values can have; and properties comprising the definitions of possible associations between individuals, called object properties, or between one individual and a literal, called datatype properties [25]. Properties connect other entities to form statements, called triples, each consisting of a subject, a predicate (i.e., a property) and an object3. A correspondence asserts that a certain relation holds between two ontological entities, and the most frequently studied relations are equivalence and subsumption. Ontology alignment is often discussed at the ‘schema’ or ‘instance’ level, where the former usually addresses the alignment of classes and properties, and the latter the alignment of individuals. This work belongs to the domain of schema-level alignment.

As we shall discuss, in the LOD domain, data are not necessarily described by formal ontologies, but sometimes by vocabularies that are simple renderings of relational databases [40]. Therefore, in the following, wherever possible, we will use the more general terms schema or vocabulary instead of ontology, and relation and concept instead of property and class. When we use the terms class or property we mean them strictly in the formal ontology sense unless otherwise stated.

A fundamental operation in discovering ontology alignments is matching pairs of individual entities. Such methods are often called ‘matchers’ and are usually divided into three categories depending on the type of data they work on [13,41]. Terminological matchers work on textual strings such as URIs, labels, comments and descriptions defined for different entities within an ontology. The family of string similarity metrics has been widely employed for this purpose [5]. Structural matchers make use of the hierarchy and relations defined between ontological entities. They are closely related to measures of semantic similarity, or relatedness in a more general sense [50]. Extensional matchers exploit the data that constitute the actual population of an ontology or schema in the general sense, and are therefore often referred to as instance- or data-based methods. For a concept, ‘instances’ or ‘populated data’ are individuals in formal ontology terms; for a relation, these depend on the specific matcher, but are typically defined based on the triples containing the relation. Matchers compute a degree of similarity between entities within a certain numerical range, and use cutoff thresholds to assert correspondences.

3 To be clear, we will always use ‘object’ in the context of triples; we will always use ‘individual’ to refer to object instances of classes.

In the following, we discuss the state-of-the-art in three sub-sections: the mainstream work conducted under the annual OAEI campaigns; work particularly addressing ontology alignment in the LOD domain; and work specifically looking at aligning relations across different schemata.

2.2. OAEI and state-of-the-art

The OAEI maintains a number of well-known public datasets for evaluating ontology alignment methods, and hosts annual campaigns to compare the performance of different systems under a uniform framework. Work under this paradigm has been well summarized in [13,41]. A predominant pattern shared by these works [36,28,37,21,25,29,43,8,6,38,19] is the strong preference for terminological and structural matchers over extensional methods [41]. There is also a trend of using a combination of different matchers (either across or within categories), since it is argued that the suitability of a matcher depends on the scenario and therefore combining several matchers could improve alignment quality [38]. However, an associated problem is finding an optimal ‘configuration’ of the combination, such as tuning the weights associated with different matchers [36,37,29,43,38]. Without these, multi-matcher methods can even underperform single matchers [29]. Several studies have been carried out in this direction, such as Hu et al. [21] and Li et al. [29], which build on the notions of linguistic and structural ‘comparability’ of two ontologies; Nagy et al. [36,37], which uses the Dempster-Shafer theory [44] to combine the evidence given by different matchers to arrive at a coherent interpretation; and Ngo et al. [38], which uses supervised learning to learn an optimal setting from training data. Most studies showed that ontology alignment can benefit from the use of background knowledge sources such as WordNet and Wikipedia [36,28,37,25,8,19].

General limitations Despite their success at OAEI campaigns, these methods suffer from several limitations even without considering the complications due to the LOD domain or relation heterogeneity. First, it is well known that terminological matchers easily fail when the linguistic features of two entities differ largely. While structural matchers can solve this issue to a certain degree, the problem is that both categories are very ontology-specific. As shown by Jean-Mary et al. [25] and Gruetze et al. [17], two ontologies constructed for the same domain using the same data resources by different experts can be vastly dissimilar in terms of taxonomy and terminological features (e.g., both the DBpedia4 and YAGO5 ontologies represent knowledge extracted from Wikipedia).

On the other hand, even strong similarity between entities measured at the terminological or structural level does not always imply strict equivalence, since ontological schemata may be interpreted in different ways by data publishers, thus creating inconsistency between their intended meanings and actual usage patterns in data [39,17,51] (e.g., foaf:Person6 may represent researchers in a scientific publication dataset, but artists in a music dataset).

Many state-of-the-art methods use a combination of different matchers. This does not necessarily solve the problems but can significantly increase complexity. Similarity computation is typically quadratic; the more matchers are combined, the larger the search space grows and the more computation is required. Additional effort is also required to carefully combine the output from different matchers in a coherent way. As we shall discuss in the next sections, tougher challenges arise in the LOD domain and in the problem of aligning heterogeneous relations, as some of these problems become even more typical.

4 http://dbpedia.org/Ontology, visited on 01-11-2013.
5 http://www.mpi-inf.mpg.de/yago-naga/yago/, visited on 01-11-2013
6 foaf: http://xmlns.com/foaf/0.1/

2.3. Ontology alignment in the LOD domain

Some characteristics of Linked Data require particular attention when adapting ontology alignment methods from the classic OAEI scenario to the LOD domain. First and foremost, vocabulary definitions are often highly heterogeneous and incomplete [17]. Textual features such as labels and comments for concepts and relations, which are used by almost every method documented by the OAEI, are non-existent in some large ontologies. In particular, many vocabularies generated from (semi-)automatically created large datasets are based on simple renderings of relational databases and are unlikely to contain such information. For instance, Fu et al. [14] showed that the DBpedia ontology contains little linguistic information about relations except their names. The problem of inconsistency between the intended definitions and the actual usage of concepts and relations, discussed before, is particularly prominent in the LOD domain [39,47,17], making such information unreliable evidence even when it is available. Empirically, Jain et al. [23] and Cruz et al. [6] showed that the top-performing systems in ‘classic’ ontology alignment settings such as the OAEI do not have a clear advantage over others in the LOD domain.

Another feature of the Linked Data environment is the presence of large volumes of data and the availability of many interconnected information sources [39,40,46,17]. Thus, extensional matchers can be better suited to the problem of ontology alignment in the LOD domain, as they provide valuable insights into the contents and meaning of schema entities from the way they are used in data [39,47].

The majority of state-of-the-art methods in the LOD domain employ extensional matchers. Nikolov et al. [39] proposed to recursively compute concept mappings and entity mappings based on each other. Suchanek et al. [46] built a holistic model that starts by initializing probabilities of correspondences based on instance overlap (for both concepts and relations), then iteratively re-computes probabilities until convergence. However, equivalence between relations is not addressed.

Parundekar et al. [40] discussed aligning ontologies that are defined at different levels of granularity, which is common in the LOD domain. As a concrete example, they mapped the only class in the GeoNames7 ontology - geonames:Feature - to a well defined one, such as one from the DBpedia ontology, by using the notion of ‘restriction class’. Slabbekoorn et al. [45] explored a similar problem: matching a domain-specific ontology to a general-purpose ontology.

7 http://www.geonames.org/

Jain et al. [24] proposed BLOOMS+, the idea of which is to build representations of concepts as subtrees of the Wikipedia category hierarchy, then determine equivalence between concepts based on the overlap of their representations. Both structural and extensional matchers are used and combined. Cruz et al. [7] created a customization of the AgreementMaker system [8] to address ontology alignment in the LOD context, and achieved better average precision but worse recall than BLOOMS+. Gruetze et al. [17] and Duan et al. [12] also used extensional matchers in the LOD context but focused on improving the computational efficiency of the algorithms.

A review of these methods shows that many have a strong preference towards using extensional matchers for ontology alignment in the LOD domain, but they typically focus on aligning concepts, not relations. As we shall discuss in the following, aligning relations involves new challenges.

2.4. Matching relations

Compared to concepts, aligning relations is generally considered to be harder [18,14,5]. The challenges concerning the LOD domain can become more noticeable when dealing with relations. In terms of linguistic features, relation names can be more diverse than concept names because they frequently involve verbs, which appear in a wider variety of forms than nouns, and contain more functional words such as articles and prepositions [5]. The synonymy and polysemy problems are common. Verbs in relation names are more generic than nouns in concept names and, therefore, generally have more synonyms [14,5].

The same relation names are frequently found to bear different meanings in different contexts [18]; e.g., in the DBpedia dataset ‘before’ is used to describe the relationship between consecutive space missions, or between consecutive Academy Award winners [14]. This polysemy issue causes widespread inconsistency between the definitions of relations and their actual usage in data. Indeed, Gruetze et al. [17] suggested that definitions of relations should be ignored when they are studied in the LOD domain due to such issues.

In terms of structural features, Zhao et al. [51] showed that relations may not have a domain or range defined in the LOD domain. Moreover, we carried out a test on the ‘well-formed’ ontologies released on the OAEI-2013 website, and found that among 21 downloadable8 ontologies only 7 defined a relation hierarchy, with an average depth of only 3. Fu et al. [14] also showed that hierarchical relations between DBpedia properties are very rare.

For these reasons, terminological and structural matchers [53,25,29,38] can be seriously hampered when applied to matching relations, particularly in the LOD domain. Indeed, Cheatham et al. [5] compared a wide selection of string similarity metrics on several tasks and showed their performance on matching relations to be inferior to that on matching concepts. Thus, in line with [39,12], we argue in favour of extensional matchers.

We notice only a few related works that specifically focus on matching relations based on data-level evidence. Fu et al. [14] studied mapping relations in the DBpedia dataset. Their method uses three types of features: data-level, terminological, and structural. Similarity is computed using three types of matchers corresponding to these features. Zhao et al. [52,51] first created triple sets, each corresponding to a specific subject that is an individual, such as dbr:Berlin. Initial groups of equivalent relations are then identified for each specific subject: if, within the triple set containing the subject, two lexically different relations have identical objects, they are considered equivalent. The initial groups are then pruned by a large collection of terminological and structural matchers, applied to relation names and objects to discover fuzzy matches.

Many extensional matchers used for matching concepts could be adapted to matching relations. One popular strategy is to compare the size of the overlap in the instances of two relations against their total combined size, such as the Jaccard and Dice metrics (or similar) used in [22,12,14,17]. However, we argue that in the LOD domain, the usage of vocabularies can be extremely unbalanced due to the collaborative nature of LOD. Data publishers have limited knowledge about the available vocabularies to describe their data, and in the worst cases they simply do not bother [47]. As a result, concepts and relations from different vocabularies bearing the same meaning can have different population sizes. In such cases, the above strategy is unlikely to succeed, as suggested in [39].

Another potential issue is that current work assumes relation equivalence to be ‘global’, while it has been suggested that the interpretation of relations should be context-dependent, and argued that equivalence should be studied in a concept-specific context, because relations are essentially defined with respect to concepts [18,14]. Global equivalence cannot deal with the polysemy issue, such as the previously illustrated example of ‘before’ bearing different meanings in different contexts. Further, to our knowledge, there is currently no public dataset specifically for aligning relations in the LOD domain, and current methods [14,52,51] have been evaluated on smaller datasets than those used in this study.

8 http://oaei.ontologymatching.org/2013/, visited on 01-11-2013. Some datasets were unavailable at the time.

Additionally, it can be argued that aligning relations is closely related to the extensive literature on mapping database schemata. To name a few, Madhavan et al. [34] and Do and Rahm [10] studied methods of combining a number of different matchers to align database tables. Doan et al. [11] introduced a supervised model, using features similar to those used by extensional and terminological matchers. Kang and Naughton [27] suggested that attributes from two database tables can be aligned based on pair-wise attribute correlations detected within each individual table. This complements the classical extensional and terminological methods. Madhavan et al. [33] looked at using a corpus of schemata as additional evidence (in addition to the schemata to be matched) for attribute mapping. Apart from the intrinsic limitations of these matching methods, which are highly similar to those used in the ontology community, the LOD domain introduces new problems such as data sparsity and noise, which can invalidate such methods or require adaptation.

2.5. The cutoff thresholds in matchers

To date, nearly all existing matchers require a cutoff threshold to assert a correspondence between entities. The performance of a matcher can be very sensitive to thresholds, and finding an optimal point is often necessary to warrant the effectiveness of a matcher [37,29,46,19,5]. Such thresholds are typically decided based on annotated data (e.g., [29,43,18]), or even arbitrarily in certain cases. In the first case, expensive effort must be spent on annotation and training. In both cases, the thresholds are often context-dependent and require re-tuning for different tasks [22,43,42].

Another approach, adopted in [12] and [14], is to sort the matching results in descending order of the similarity score and pick only the top-k results. This suffers from the same problem as cutoff thresholds, since the value of k can be different in different contexts (e.g., in [12] it varied from 1 to 86 in the ground truth). To the best of our knowledge, our work is the first to study the problem of automatically deciding thresholds based on data in ontology alignment.

2.6. Remark

To conclude this section on related work, we argue that aligning relations across schemata in the LOD domain is an important research problem that is currently under-represented in the literature on ontology alignment. The characteristics of relations found in schemata from the LOD domain, i.e., incomplete (or lacking) definitions, inconsistency between the intended meaning of schemata and their usage in data, and very large amounts of data instances, advocate a renewed inspection of existing ontology alignment methods. We believe the solution rests in extensional methods that provide insights into the meaning of relations based on data, and unsupervised methods that alleviate the need for threshold tuning.

In these directions, we developed a prototype [49] specifically to study aligning equivalent relations in the LOD domain. We proposed a different extensional matcher designed to reduce the impact of unbalanced populations, and a rule-based clustering that employs a series of cutoff thresholds to assert equivalence between relation pairs and discover groups of equivalent relations specific to individual concepts. The method showed very promising results in terms of precision, and was later used in constructing knowledge patterns based on data [2,48]. The work described in this article builds on our prototype but largely extends it in several dimensions: (1) the matcher is largely revised and extended; (2) we add a method of automatic threshold detection based on data; (3) we add an unsupervised machine-learning clustering approach to discover groups of equivalent relations; (4) we augment and re-annotate the datasets, which we make publicly available; (5) we carry out an extensive and thorough evaluation against a large set of comparative models, together with an in-depth analysis of the task of aligning relations in the LOD domain.

We focus on equivalence only because, firstly, it is considered the major issue in ontology alignment and is the focus of the majority of related work; secondly, hierarchical structures for relations are very rare, especially in the LOD domain.


Table 1. Notations used in this paper and their meaning

<x, r, y>   <dbr9:Sydney_Opera_House, dbo10:openningDate, ‘1973’>,
            <dbr:Royal_Opera_House, dbo:openningDate, ‘1732’>,
            <dbr:Sydney_Opera_House, dbpp11:yearsactive, ‘1973’>
r1          dbo:openningDate
r2          dbpp:yearsactive
arg(r1)     (dbr:Sydney_Opera_House, ‘1973’), (dbr:Royal_Opera_House, ‘1732’)
args(r1)    dbr:Sydney_Opera_House, dbr:Royal_Opera_House
argo(r1)    ‘1973’, ‘1732’

3. The EQUATER Method

3.1. Task formalization

In this section we describe EQUATER - our domain- and language-independent method for finding equivalent relations in LOD datasets. EQUATER belongs to the category of extensional matchers according to [13,41], and only uses the instances of relations as evidence to predict equivalence. In the following, we write <x, r, y> to represent triples, where x, y and r are variables representing the subject, object and relation respectively. We call x, y the arguments of r, and let arg(r) = (x, y) return the pairs of x and y between which r holds. We call such argument pairs the instances of r. We also call x the subject of r, and let args(r) = x return the subjects of any triples that contain r. Likewise, we call y the object of r and let argo(r) = y return the objects of any triples that contain r. Table 1 shows examples using these notations.
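To make the notation concrete, the following minimal Python sketch (ours, not part of EQUATER; the function names simply mirror the paper's notation) implements arg, args and argo as set operations over the triples of Table 1:

# Triples are (subject, relation, object) tuples, taken from Table 1.
triples = [
    ("dbr:Sydney_Opera_House", "dbo:openningDate", "1973"),
    ("dbr:Royal_Opera_House", "dbo:openningDate", "1732"),
    ("dbr:Sydney_Opera_House", "dbpp:yearsactive", "1973"),
]

def arg(r, triples):
    # arg(r): the set of (subject, object) pairs, i.e. the instances of r.
    return {(x, y) for (x, rel, y) in triples if rel == r}

def args(r, triples):
    # args(r): the subjects of all triples that contain r.
    return {x for (x, rel, y) in triples if rel == r}

def argo(r, triples):
    # argo(r): the objects of all triples that contain r.
    return {y for (x, rel, y) in triples if rel == r}

assert arg("dbo:openningDate", triples) == {
    ("dbr:Sydney_Opera_House", "1973"),
    ("dbr:Royal_Opera_House", "1732")}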

EQUATER takes as input a URI representing a specific concept C and a set of triples <x, r, y> whose subjects are individuals of C, or formally type(x) = C. In other words, we study the relations that link C with everything else. The intuition is that such relations may carry meanings that are specific to the concept (recall the example of the DBpedia relation ‘before’ in the context of different concepts).

Our task can be formalized as follows: given the set of triples <x, r, y> such that the x are instances of a particular concept, i.e., type(x) = C, determine 1) for any pair (r1, r2) derived from <x, r, y>, whether r1 ≡ r2; and 2) create clusters of relations that are mutually equivalent. To approach the first goal, we first introduce a data-driven similarity measure (Section 3.2). For a specific concept, we hypothesize that there exists only a handful of truly equivalent relation pairs (true positives) with high similarity scores; however, there can be a large number of relation pairs with low similarity scores (false positives) due to noise in the data caused by, e.g., misuse of schemata or pure coincidence. Therefore, we propose to automatically detect concept-specific thresholds based on patterns in the similarity scores of the relation pairs of the concept. Pairs with scores beyond the threshold are then considered to be equivalent (Section 3.3). For the second goal, we apply unsupervised clustering to the set of equivalent pairs and create clusters of mutually equivalent relations (Section 3.4). Clustering effectively discovers equivalence transitivity or invalidates pair-wise equivalence. It may also discover alignments among multiple schemata at the same time, while state-of-the-art alignment models usually align pairs of schemata.

9 dbr: http://dbpedia.org/resource/
10 dbo: http://dbpedia.org/ontology/
11 dbpp: http://dbpedia.org/property/

3.2. Measure of similarity

The goal of the measure is to assess the degree of similarity between a pair of relations within a concept-specific context, as illustrated in Figure 1. The measure consists of three components, the first two of which were previously introduced in our prototype [49].

Fig. 1. The similarity measure computes a numerical score for pairs of relations. r3 and r5 have a score of 0.

3.2.1. Triple agreement

Triple agreement evaluates the degree to which two relations share argument pairs in triples. Equation 1 first computes the overlap (intersection) of the argument pairs of two relations.

arg∩(r1, r2) = arg(r1) ∩ arg(r2) (1)

Then the triple agreement is a function that returns a value between 0 and 1.0:


ta(r1, r2) = max{ |arg∩(r1, r2)| / |arg(r1)| , |arg∩(r1, r2)| / |arg(r2)| }  (2)

The intuition of triple agreement is that if two relations r1 and r2 have a large overlap of argument pairs relative to the size of either relation, they are likely to have an identical meaning. We choose the maximum of the two values in Equation 2 rather than balancing them, as this copes with the unbalanced usage of different schemata in LOD datasets, the problem discussed in Section 2.4. As an example, consider Figure 2. The overlap of argument pairs between r1 and r2 has size 4, which is relatively large for r1 but rather insignificant for r2. ta chooses the maximum of the two, giving a strong indication of equivalence between the relations. We note that similar forms have been used in [39] for discovering similar concepts and in [40,46] for studying subsumption relations between concepts. However, we believe this form is suited to finding equivalent relations due to the largely unbalanced populations of different vocabularies, as well as the lack of hierarchical structures for relations, as discussed in Section 2.4. We confirm this empirically in the experiments in Section 5.

Fig. 2. Illustration of triple agreement.
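As a sketch in Python (the population sizes below are illustrative assumptions echoing Figure 2, which only fixes the overlap at 4):

def ta(arg_r1, arg_r2):
    # Triple agreement (Equation 2): overlap of argument pairs relative
    # to each relation's population; max copes with unbalanced populations.
    if not arg_r1 or not arg_r2:
        return 0.0
    overlap = len(arg_r1 & arg_r2)
    return max(overlap / len(arg_r1), overlap / len(arg_r2))

# Hypothetical populations: r1 sparsely used (5 pairs), r2 densely (100).
arg_r1 = {("s%d" % i, "o%d" % i) for i in range(5)}
arg_r2 = {("s%d" % i, "o%d" % i) for i in range(4)} | \
         {("t%d" % i, "u%d" % i) for i in range(96)}
print(ta(arg_r1, arg_r2))  # max(4/5, 4/100) = 0.8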

3.2.2. Subject agreement

Subject agreement provides a complementary view by looking at the degree to which two relations share the same subjects. The motivation for having sa in addition to ta can be illustrated by Figure 3. The example produces a low ta score due to the small overlap in the argument pairs of r1 and r2. A closer look reveals that although r1 and r2 have 7 and 11 argument pairs, they have only 3 and 4 distinct subjects respectively, of which two are shared. This indicates that both r1 and r2 are 1-to-many relations. Again, due to publisher preferences or lack of knowledge, triples may describe the same subject (e.g., dbr:London) using heterogeneous relations (e.g., dbo:birthPlaceOf, dbpp:placeOfOriginOf) with different sets of objects (e.g., {dbr:Marc_Quinn, dbr:David_Haye, dbr:Alan_Keith} for dbo:birthPlaceOf and {dbr:Alan_Keith, dbr:Adele_Dixon} for dbpp:placeOfOriginOf). ta does not discriminate such cases.

Fig. 3. Illustration of subject agreement.

Subject agreement captures this situation by hypothesizing that two relations are likely to be equivalent if (α) a large number of subjects are shared between them and (β) a large number of such subjects also have shared objects.

sub∩(r1, r2) = args(r1) ∩ args(r2) (3)

sub∪(r1, r2) = args(r1) ∪ args(r2) (4)

α(r1, r2) = |sub∩(r1, r2)| / |sub∪(r1, r2)|  (5)

β(r1, r2) = ( Σx∈sub∩(r1,r2) [ 1 if ∃y : (x, y) ∈ arg∩(r1, r2); 0 otherwise ] ) / |sub∩(r1, r2)|  (6)

Both α and β return a value between 0 and 1.0, and subject agreement combines them to return a value in the same range:

sa(r1, r2) = α(r1, r2) · β(r1, r2)  (7)

Equation 5 evaluates the degree to which two relations share subjects, based on the intersection and the union of the subjects of the two relations. Equation 6 counts the number of shared subjects that have at least one overlapping object. The higher β, the more the two relations ‘agree’ on their shared subjects sub∩. For each subject shared between r1 and r2 we count 1 if they have at least 1 object in common and 0 otherwise. Since both r1 and r2 can be 1-to-many relations, few overlapping objects could simply mean that one is densely populated while the other is not, which does not mean they ‘disagree’. The agreement sa(r1, r2) balances the two factors by taking their product. As a result, relations that have a high sa will share many subjects (α), a large proportion of which also share at least one object (β). Following the example in Figure 3, it is easy to calculate α = 0.4, β = 1.0 and sa = 0.4.
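A Python sketch of sa, with hypothetical data mirroring the Figure 3 numbers (7 vs. 11 argument pairs, 3 vs. 4 subjects, 2 subjects shared, both with a common object):

def sa(arg_r1, arg_r2):
    # Subject agreement (Equations 5-7).
    subs1 = {x for (x, y) in arg_r1}
    subs2 = {x for (x, y) in arg_r2}
    inter, union = subs1 & subs2, subs1 | subs2
    if not inter:
        return 0.0
    alpha = len(inter) / len(union)          # Equation 5
    shared_pairs = arg_r1 & arg_r2
    beta = sum(1 for x in inter              # Equation 6
               if any(s == x for (s, o) in shared_pairs)) / len(inter)
    return alpha * beta                      # Equation 7

arg_r1 = {("a","1"),("a","2"),("a","3"),("b","1"),("b","4"),
          ("c","1"),("c","5")}               # 7 pairs, 3 subjects
arg_r2 = {("a","1"),("a","6"),("b","1"),("b","7"),("d","1"),("d","2"),
          ("d","3"),("e","1"),("e","2"),("e","3"),("e","4")}  # 11 pairs, 4 subjects
print(sa(arg_r1, arg_r2))  # alpha = 2/5 = 0.4, beta = 1.0, sa = 0.4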

3.2.3. Knowledge confidence modifier

Although ta and sa compute similarity scores from different dimensions, as argued by [22], in practice datasets often have imperfections due to incorrectly annotated instances, data sparseness and ambiguity, so that basic statistical measures of co-occurrence might be inappropriate if interpreted naively. Specifically, in our case, the divisional equations of the ta and sa components can be considered comparisons between two items - sets of elements in this case. From a cognitive point of view, to make a meaningful comparison of two items we must possess adequate knowledge about each, such that we ‘know’ what we are comparing and can confidently identify their ‘difference’. We thus hypothesize that our confidence in the outcome of a comparison depends directly on the amount of knowledge we possess about the compared items. We can then solve the problem by solving two sub-tasks: (1) quantifying knowledge and (2) defining ‘adequacy’.

The quantification of knowledge can be built on the principle of human inductive learning - learning by examples. The intuition is that, given a task (e.g., learning to recognize horses) of which no a-priori knowledge is given, humans are capable of generalizing examples and inducing knowledge about the task. Exhausting all examples is unnecessary: typically our knowledge converges after seeing a certain number of examples, and we learn little from additional ones - a situation that indicates the notion of ‘adequacy’. Such a learning process can be modeled by ‘learning curves’, which are designed to capture the relation between how much we experience (examples) and how much we learn (knowledge). Therefore, we propose to approximate the modeling of confidence with models of learning curves. In the context of EQUATER, the items we need knowledge of are the pairs of relations to be compared. Practically, each is represented as a set of instances, i.e., examples. Thus our knowledge about the relations can be modeled by learning curves corresponding to the number of examples (i.e., argument pairs) in each set.

We propose to model this problem based on the theory of Dewey [9], who suggests that human learning follows an ‘S-shaped’ curve, as shown in Figure 4. As we begin to observe examples, our knowledge grows slowly, as we may not be able to generalize over limited cases. This is followed by a steep ascending phase where, with enough experience and new re-assuring evidence, we start ‘putting things together’ and gaining knowledge at a faster pace. This rapid progress continues until we reach convergence, an indication of ‘adequacy’ beyond which the addition of examples adds little to our knowledge.

Fig. 4. The logistic function modelling knowledge confidence

Empirically, we model such a curve using the logistic function shown in Equation 8, where T denotes a set of argument pairs of the relations we want to understand, kc is shorthand for knowledge confidence (between 0.0 and 1.0), representing the amount of knowledge or level of confidence corresponding to different numbers of examples, and n denotes the number of examples at which one gains adequate knowledge about the set and becomes fully confident about comparisons involving the set (hence the corresponding relation it represents).

kc(|T|) = lgt(|T|) = 1 / (1 + e^(n/2 − |T|))  (8)
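A Python sketch of the logistic model; n = 10 is one of the parameter values explored later in Section 4.3:

import math

def kc_lgt(t_size, n):
    # Equation 8: knowledge confidence for a set with |T| = t_size
    # observed argument pairs.
    return 1.0 / (1.0 + math.exp(n / 2.0 - t_size))

for size in (0, 5, 10, 20):
    print(size, round(kc_lgt(size, n=10), 3))
# 0 -> 0.007, 5 -> 0.5, 10 -> 0.993, 20 -> 1.0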

It could be argued that other learning curves (e.g., exponential) could be used as alternatives, or that we could use simple heuristics instead (e.g., discard any relations that have fewer than n argument pairs). However, we believe that the logistic model better fits the problem: the exponential model usually implies rapid convergence, which is hardly the case in many real learning situations, while the simplistic threshold-based model may harm recall. We show an empirical comparison in Section 5.

Next, we revise ta and sa into takc and sakc respectively by integrating the kc measure into the equations:

takc(r1, r2) = max{ |arg∩(r1, r2)| / |arg(r1)| · kc(|arg(r1)|) , |arg∩(r1, r2)| / |arg(r2)| · kc(|arg(r2)|) }  (9)

sakc(r1, r2) = α(r1, r2) · β(r1, r2) · kc(|arg∩(r1, r2)|)  (10)

where the choice of the kc function can be either lgt or one of the alternative models detailed in Section 4. The choice of the kc function does not break the mathematical consistency of the formula. In Equation 9, our confidence in a ta score depends on our knowledge of either arg(r1) or arg(r2) (i.e., the denominators). Note that the denominator is always a superset of the numerator, so we do not need to quantify our knowledge of the latter separately: intuitively, if we know the denominator we also know its elements and its subsets. Likewise, in Equation 10, our confidence in an sa score depends on our knowledge of the shared argument pairs between r1 and r2, as all other components in the equation are essentially subsets of this set. Both equations return a value between 0 and 1.0.

Finally, the similarity of r1 and r2 is:

e(r1, r2) = ( takc(r1, r2) + sakc(r1, r2) ) / 2  (11)
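Putting the pieces together, a sketch of the overall measure, assuming the sa and kc_lgt helpers sketched above are in scope:

def e_similarity(arg_r1, arg_r2, n=10):
    # Equation 11: average of the kc-weighted agreements.
    overlap = arg_r1 & arg_r2
    ta_kc = 0.0
    if arg_r1 and arg_r2:
        # Equation 9: each ratio weighted by confidence in its denominator.
        ta_kc = max(
            len(overlap) / len(arg_r1) * kc_lgt(len(arg_r1), n),
            len(overlap) / len(arg_r2) * kc_lgt(len(arg_r2), n))
    # Equation 10: sa weighted by confidence in the shared argument pairs.
    sa_kc = sa(arg_r1, arg_r2) * kc_lgt(len(overlap), n)
    return (ta_kc + sa_kc) / 2.0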

3.3. Determining thresholds

After computing similarity scores for the relation pairs of a specific concept, we need to interpret the scores and determine the minimum score that justifies equivalence between two relations (Figure 5). This is also known as the mapping selection problem. As discussed before, one typically derives a threshold from training data or makes an arbitrary decision. Such solutions do not generalize, and the supervised method also requires expensive annotations.

Fig. 5. Deciding a threshold beyond which pairs of relations are considered to be truly equivalent.

We use an unsupervised method that determines thresholds automatically based on patterns observed in the data. We hypothesize that a concept may have only a handful of equivalent relation pairs, whose similarity scores should be significantly higher than those of the non-equivalent, noisy pairs that happen to have non-zero similarity scores due to imperfections in the data such as spelling errors, schema misuse, or mere coincidence. For example, Figure 6 shows that the scores (e) of 101 pairs of relations of the DBpedia concept Book, ranked by e (> 0), appear to form a long-tailed pattern consisting of a small population with high similarity and a very large population with low similarity.

Fig. 6. The long-tailed pattern in similarity scores between relations computed using e. t could be the boundary threshold.

On this basis, we propose to separate the non-zero scored relation pairs into two groups based on the principle of maximizing the difference in similarity scores between the groups. While a wide range of data classification and clustering methods can be applied for this purpose, here we use an unsupervised method - Jenks natural breaks [26].

Jenks natural breaks aims to minimize within-class variance while maximizing between-class variance. Given i, the expected number of groups in the data, the algorithm starts by dividing the data into i arbitrary groups, followed by an iterative process that optimizes the ‘goodness of variance fit’ based on two figures: the sum of squared deviations between classes, and the sum of squared deviations from the array mean. The resulting optimal classification is called the Jenks natural breaks.

Empirically, given a continuous variable (i.e., e) and an array of data values (i.e., the similarity scores of the relation pairs for a concept C), we apply Jenks natural breaks to the values with i = 2 to break them into two sets. The boundary value t is the threshold used to separate the relation pairs that we consider equivalent from those that we consider non-equivalent.
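For i = 2, Jenks natural breaks reduces to scanning every split point of the sorted scores and keeping the one that minimizes the within-class squared deviation; a minimal Python sketch (the scores are invented to mimic the long tail of Figure 6):

def jenks_two_class_threshold(scores):
    xs = sorted(scores)

    def ssd(values):
        # Sum of squared deviations from the mean.
        if not values:
            return 0.0
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best_split, best_cost = 1, float("inf")
    for i in range(1, len(xs)):
        cost = ssd(xs[:i]) + ssd(xs[i:])
        if cost < best_cost:
            best_split, best_cost = i, cost
    # t: the lowest score of the high-similarity group.
    return xs[best_split]

scores = [0.02, 0.03, 0.05, 0.06, 0.08, 0.10, 0.12, 0.70, 0.80, 0.90]
print(jenks_two_class_threshold(scores))  # 0.7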

3.4. Clustering

So far, Sections 3.2 and 3.3 have described our method for answering the first question set out at the beginning of this section, i.e., predicting relation equivalence. The proposed method studies each relation pair independently of other pairs. This may not be sufficient for discovering equivalent relations, for two reasons. First, two relations may be equivalent even though no supporting data are present. For example, in Figure 7 we can assume r1 ≡ r3 based on transitivity, although no data directly supports a positive similarity score between them. Second, a relation may be equivalent to multiple relations (e.g., r2 ≡ r1 and r2 ≡ r3) from different schemata, thus forming a cluster; furthermore, some equivalence links may be too weak to hold when compared to the cluster context (e.g., e(r1, r4) is much lower than the other links in the cluster of r1, r2, and r3).

Fig. 7. Clustering discovers transitive equivalence and invalidates weak links.

The second goal of EQUATER is to address such issues by clustering mutually equivalent relations for a concept. Essentially, clustering brings in additional context for deciding pair-wise equivalence, which may lead to discovering transitive equivalence and invalidating weak links. Potentially, this also allows creating alignments between multiple schemata at the same time. Given {ri, rj : e(ri, rj) ≥ t}, the set of equivalent relation pairs discovered before, we identify the number of distinct relations h and create an h × h distance matrix M. The value of each cell mi,j, (0 ≤ i, j < h), is defined as:

mi,j = maxE − e(ri, rj) (12)

where maxE is the maximum possible similarity given by a measure (e.g., in Equation 11, maxE = 1.0). We then use the group-average agglomerative clustering algorithm [35], which takes M as input and creates clusters of equivalent relations. To automatically decide the optimal number of clusters, we use the well-known Calinski and Harabasz [3] stopping rule.
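A sketch of this step using SciPy's hierarchical clustering; average linkage corresponds to the group-average criterion, while the stopping rule is simplified here to a fixed cluster count rather than the Calinski and Harabasz rule:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_relations(relations, sim, max_e=1.0, n_clusters=2):
    # Build the h x h distance matrix M of Equation 12.
    h = len(relations)
    m = np.zeros((h, h))
    for i in range(h):
        for j in range(h):
            if i != j:
                m[i, j] = max_e - sim(relations[i], relations[j])
    # Group-average agglomerative clustering over the condensed matrix.
    z = linkage(squareform(m, checks=False), method="average")
    labels = fcluster(z, t=n_clusters, criterion="maxclust")
    return dict(zip(relations, labels))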

4. Experiment Settings

Following the two goals described at the beginning of Section 3, we design a series of experiments to thoroughly evaluate EQUATER in terms of its capability of predicting the equivalence of two relations of a concept (pair equivalence) and of grouping equivalent relations (clustering). Different settings are created along three dimensions by selecting from several choices of (1) similarity measure, (2) threshold detection method and (3) model of kc.

4.1. Measures of similarity

We compare the proposed similarity measure against four baselines. Our criteria for the baseline measures are: 1) to cover different types of matchers; 2) to focus on methods that have been shown practically effective in the LOD context and, where possible, particularly for aligning relations; 3) to include some of the best performing methods for this particular task.

The first is a string similarity measure, the Levenshtein distance metric (lev), which has proved to be one of the best performing terminological matchers for aligning both relations and classes [5]. Specifically, we measure the string similarity (or distance) between the URIs of two relations, but we remove namespaces from the relation URIs before applying the metric. As a result, dbpp:name and foaf:name are both normalized to name and thus receive the maximum similarity score.

The second is the semantic similarity measure by Lin (lin) [31], which uses both WordNet's hierarchy and word distributional statistics as features to assess the similarity of two words. Thus two lexically different words (e.g., ‘cat’ and ‘dog’) can also be similar. Since URIs often contain strings that are concatenations of multiple words (e.g., ‘birthPlace’), we use simple heuristics to split them into multiple words when necessary (e.g., ‘birth place’). Semantic similarity measures are also popular techniques in ontology alignment.

The third is the extensional matcher proposed by [14] (fu) to address particularly the problem of aligning relations in DBpedia:


fu(r1, r2) = ( |args(r1) ∩ args(r2)| / |args(r1) ∪ args(r2)| ) · ( |argo(r1) ∩ argo(r2)| / |argo(r1) ∪ argo(r2)| )  (13)

The fourth baseline is the ‘corrected’ Jaccard function proposed by Isaac et al. [22]. The original Jaccard function has been used in a number of studies concerning mapping concepts across ontologies [22,12,42]. Isaac et al. [22] showed that it is one of the best performing measures in their experiments; however, they also pointed out that one issue with Jaccard is its inability to consider the absolute sizes of the two compared sets. As an example, Jaccard does not distinguish the case of 100/100 from that of 1/1. In the latter case, there is little evidence to support the score (both 1.0). To address this, they introduced a ‘corrected’ Jaccard measure (jc):

jc(r1, r2) = √( |arg(r1) ∩ arg(r2)| · (|arg(r1) ∩ arg(r2)| − 0.8) ) / |arg(r1) ∪ arg(r2)|  (14)
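A Python sketch of jc, illustrating how the correction separates the 1/1 and 100/100 cases that plain Jaccard scores identically:

import math

def jc(arg_r1, arg_r2):
    # 'Corrected' Jaccard (Equation 14): discounts scores resting on
    # very small overlaps.
    inter = len(arg_r1 & arg_r2)
    union = len(arg_r1 | arg_r2)
    if union == 0 or inter == 0:
        return 0.0
    return math.sqrt(inter * (inter - 0.8)) / union

one = {("x", "y")}
many = {(str(i), str(i)) for i in range(100)}
print(jc(one, one))    # sqrt(1 * 0.2) / 1 ≈ 0.447
print(jc(many, many))  # sqrt(100 * 99.2) / 100 ≈ 0.996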

4.2. Methods of detecting thresholds

We compare three different methods of threshold detection. The first is Jenks natural breaks (jk), which is part of EQUATER and was discussed in Section 3.3. For the second method, we use the k-means clustering algorithm [32] (km) for unsupervised threshold detection. K-means takes the same input as jk and creates two clusters such that each data value belongs to the cluster with the nearest mean. The boundary value that separates the two clusters is used as the threshold. Since both methods find boundaries based on the data in an unsupervised manner, we are able to define concept-specific thresholds that may fit better than an arbitrarily determined global threshold.

Next, we also use a supervised method (denoted by s) to derive a uniform threshold for all concepts based on annotated data. To do so, suppose we have a set of m concepts; for each concept, we create the pairs of relations found in the data and ask humans to annotate each pair (detailed in Section 4.5). This becomes the training data from which we derive a uniform threshold. We then choose a similarity measure to be evaluated, use it to score each pair, and rank the results by score. Using the annotations, we can evaluate accuracy at each rank; at a certain rank the accuracy is maximized. We record the similarity score at this rank and use it as the optimal threshold for that concept. Due to differences in concept-specific data, we expect to obtain a different optimal threshold for each of the m concepts in the training data. In reality, however, the thresholds for new data are unknown a-priori. Therefore, we use the average of all thresholds derived from the training concepts as an approximation and use it for testing.
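A sketch of this supervised baseline under our reading of the procedure; for concreteness we score each rank cutoff by F1 rather than the unspecified ‘accuracy’, and the input format is our own assumption:

def supervised_threshold(concepts):
    # concepts: one list per training concept of (score, is_equivalent)
    # tuples. Returns the average of the per-concept optimal thresholds.
    per_concept = []
    for pairs in concepts:
        ranked = sorted(pairs, key=lambda p: p[0], reverse=True)
        total_pos = sum(1 for _, gold in ranked if gold)
        best_f1, best_score, tp = 0.0, 1.0, 0
        for k, (score, gold) in enumerate(ranked, start=1):
            tp += int(gold)
            precision = tp / k
            recall = tp / total_pos if total_pos else 0.0
            denom = precision + recall
            f1 = 2 * precision * recall / denom if denom else 0.0
            if f1 > best_f1:
                best_f1, best_score = f1, score
        per_concept.append(best_score)
    return sum(per_concept) / len(per_concept)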

4.3. Models of kc

We compare EQUATER's logistic model (lgt) of kc against two alternative models. The first is a naive threshold-based model that discards any relations that have fewer than n argument pairs. Intuitively, n can be considered the minimum number of examples needed to ensure that a relation has 'sufficient' data evidence to 'explain' itself. Following this model, if either r1 or r2 in a pair has fewer than n triples, their ta and sa scores will be 0, because there is insufficient evidence in the data and hence we 'know' too little about them to evaluate similarity. Such a strategy is adopted in [22]. To denote this alternative method we use −n.

The second is an exponential model, denoted by exp and shown in Figure 8. We model such a curve using the exponential function shown in Equation 15, where k is a scalar that controls the speed of convergence and |T| returns the number of observed examples in terms of argument pairs.

$$kc(|T|) = exp(|T|) = 1 - e^{-|T| \cdot k} \tag{15}$$

Fig. 8. The exponential function modelling knowledge confidence

For each model we need to define a parameter. For lgt, we need to define n, the number of examples above which we obtain adequate knowledge and therefore maximum confidence. Our decision is inspired by empirical experience in bootstrap learning, in which a machine learns a task starting from a handful of examples. Carlson et al. [4] suggest that typically 10 to 15 examples are sufficient to bootstrap learning of relations from free-form natural language texts. In other words, we consider that 10 to 15 examples are required to 'adequately' explain the meaning of a relation. Based on this intuition, we experiment with n = 10, 15, and 20. Likewise, this also applies to the model −n, for which we experiment with 10, 15 and 20 as thresholds.

We apply the same principle to the exp model. However, the scalar k is only indirectly related to the number of examples. As described before, it affects the speed of convergence; thus by setting appropriate values, the knowledge confidence score returned by the function effectively reaches its maximum at different numbers of examples. We choose k = 0.55, 0.35 and 0.25, which correspond to kc effectively reaching its maximum of 1.0 at 10, 15 and 20 examples respectively.
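As a quick sanity check of this mapping from k to a number of examples, consider k = 0.55 at |T| = 10:

$$kc(10) = 1 - e^{-10 \times 0.55} = 1 - e^{-5.5} \approx 1 - 0.0041 \approx 0.996,$$

i.e., within half a percent of the asymptotic maximum of 1.0; k = 0.35 at |T| = 15 and k = 0.25 at |T| = 20 give comparably saturated values.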

Additionally, we also compare against a version of EQUATER's similarity measure without kc, denoted by k̄c, which simply combines ta and sa in their original forms. Note that this can be considered the prototype similarity measure12 we developed in [49].

4.4. Creation of settings

By taking different choices along the three dimensions above, we create different models for experimentation. We denote each setting in the form msr^kc_thd, where msr, kc and thd are variables each representing one dimension (similarity measure, kc model and threshold detection method respectively). Note that the variable kc only applies to EQUATER's similarity measure. Thus j_s means scoring relation pairs using the Jaccard function, then finding the threshold based on training data; while e^lgt_jk is EQUATER in its original form, i.e., using EQUATER's similarity measure (with the logistic model of knowledge confidence) and Jenks Natural Breaks for automatic threshold detection. Figure 9 shows a contingency chart along the msr and thd dimensions, with the third dimension included as a variable kc. The output from each setting is then clustered using the same algorithm.

12 Readers may notice that we dropped the 'cardinality ratio' component from the prototype, since we discovered that this component may negatively affect performance.

Fig. 9. Different settings based on the choices of three dimensions. kc is a variable whose value could be lgt (Equation 8), exp (Equation 15), or −n.

The metrics we use for evaluating pair accuracy are the standard Precision, Recall and F1; the metrics for evaluating clustering are the standard purity, inverse purity and F1 [1].

4.5. Dataset preparation

The OAEI archives a fair number of datasets for evaluating ontology alignment systems. However, we do not use these because, as discussed before, they do not represent the particular characteristics of the LOD domain; also, the number of aligned relations is very small - less than 2‰ (56) of the mappings found in their gold standard datasets are equivalent relations13. Instead, we study the problem of heterogeneous relations on DBpedia. Although DBpedia is a single dataset, we believe it is an adequate testbed for the problem for several reasons. First and foremost, multiple vocabularies are used in the dataset, including RDFS, Dublin Core14, WGS84 Geo15, FOAF, SKOS16, the DBpedia ontology, the original Wikipedia templates and so on. In particular, the DBpedia ontology is extremely rich in relations: the current DBpedia ontology version 3.9 covers 529 concepts and 2,333 different relations17. Previous researchers have already noted the prevailing issue of relation heterogeneity in the DBpedia dataset [18,14]. The majority is found between the DBpedia ontology and other vocabularies, especially the original Wikipedia templates, due to the enormous number of relations in both vocabularies. A Wikipedia template usually defines a concept and its properties18. When populated, templates become infoboxes, which are processed to extract triples that form the backbone of the DBpedia dataset. Currently, data described by relations in the DBpedia ontology and by the original Wikipedia template properties co-exist and account for a very large population of the DBpedia dataset.

13 Based on the downloadable datasets as of 01-11-2013.
14 dc=http://purl.org/dc/elements/1.1/
15 geo=http://www.w3.org/2003/01/geo/wgs84_pos#
16 skos=http://www.w3.org/2004/02/skos/core#
17 http://dbpedia.org/Ontology

The disparity between the different vocabularies in DBpedia is such a pressing issue that the team has dedicated particular effort to addressing it, known as the DBpedia mappings portal. The DBpedia mappings portal is a website that invites collaborative effort to create mappings from certain structured content on Wikipedia to the manually curated DBpedia ontology. One task is mapping Wikipedia templates to concepts in the DBpedia ontology, and then mapping properties in the templates to relations of the mapped concepts. Such mappings are useful both for tidying up the existing DBpedia dataset and for future data publication on DBpedia. On the one hand, they can improve information retrieval from DBpedia, as we have already shown in [2,48]. On the other hand, the DBpedia Extraction Framework can use such mappings in the future to homogenize information extracted from Wikipedia before generating structured information in RDF. It is known that manually creating such mappings requires significant work; as a result, as of November 2013, less than 55% of the mappings between Wikipedia template properties and relations in the DBpedia ontology are complete19.

Further, DBpedia is the most representative LOD dataset, as it is predominantly used in research concerning Linked Data. It is also currently the largest hub connecting multiple datasets in the LOD domain; thus a large majority of LOD datasets can benefit from reduced heterogeneity on DBpedia. All these facts make DBpedia an interesting and reasonable testbed for the problem.

We collected three datasets for our experiments. The first dataset is created based on the mappings published on the DBpedia mappings portal. We processed the DBpedia mappings webpages as of 30 Sep 2013 and created a dataset containing 203 DBpedia concepts. Each concept has a page that defines the mapping from a Wikipedia template to a DBpedia concept, and lists a number of mapping pairs from template properties to the relations of the corresponding concept in the DBpedia ontology. We extracted a total of 5388 mappings and use them as a gold standard (dbpm). However, there are three issues with this dataset. First, the community portal focuses on mapping the DBpedia ontology to the original Wikipedia templates; therefore, mappings between the DBpedia ontology and other vocabularies are rare. Second, due to the ongoing nature of the mapping task, the dataset is largely incomplete; therefore, we only use this dataset for evaluating recall. Third, it has been noticed that the mappings created are not always strictly 'equivalence': some infrequent mappings such as 'broader-than' have also been included. Overall, the dbpm dataset should not be considered a perfect gold standard for the task.

18 Not in formal ontology terms, but rather as Wikipedia terminology.
19 http://mappings.dbpedia.org/server/statistics/en/, visited on 01-11-2013

For this reason, we manually created a dataset based on 40 DBpedia (DBpedia ontology version 3.8) and YAGO20 concepts. The choice of these concepts is based on the QALD1 question answering dataset21 for Linked Data. For each concept, we query the DBpedia SPARQL endpoint using the following query template to retrieve all triples related to the concept22.

SELECT * WHERE {
  ?s a <[Concept_URI]> .
  ?s ?p ?o .
}

Next, we build a set P containing unordered pairs of predicates from these triples and consider them as candidate relation pairs for the concept. We also use a stop list of relation URIs to filter out meaningless relations that usually describe Wikipedia meta-level information, e.g., dbpp:wikiPageID, dbpp:wikiPageUsesTemplate. Each of the measures listed in Section 4.1 is then applied to compute the similarity of the pairs in this set, producing either a zero or a non-zero score. We then create a set cP containing the pairs given a non-zero score by any of the measures, and ask human annotators to annotate cP. Note that cP ⊂ P and may not exclusively contain all true positives of the concept, since there can be equivalent pairs of relations that obtain a zero similarity score by all the measures. However, we believe it is a reasonable approximation. Moreover, it would be extremely expensive to annotate the set of all pairs completely.

20 http://www.mpi-inf.mpg.de/yago-naga/yago/
21 http://greententacle.techfak.uni-bielefeld.de/~cunger/qald1/evaluation/dbpedia-test.xml
22 Note that DBpedia by default returns a maximum of 50,000 triples per query. We did not incrementally build the exhaustive result set for each concept, since we believe the data size is sufficient for experimental purposes.
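A sketch of this candidate-generation step (the stop list shown is illustrative, not the full list used in the study):

from itertools import combinations

STOP_RELATIONS = {
    'http://dbpedia.org/property/wikiPageID',
    'http://dbpedia.org/property/wikiPageUsesTemplate',
}

def candidate_pairs(triples):
    # triples: iterable of (s, p, o) tuples for one concept.
    # Returns the set P of unordered pairs of distinct predicates,
    # with meta-level relations filtered out by the stop list.
    predicates = {p for _, p, _ in triples if p not in STOP_RELATIONS}
    return set(combinations(sorted(predicates), 2))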

The data was annotated by four computer scientists, and the annotation took three weeks, of which one week was spent on creating guidelines. Annotators could also query DBpedia for triples containing a relation to assist their interpretation. However, a notable number of relations are still incomprehensible. These often have peculiar names and are rarely used (e.g., dbpp:n, dbpp:wikt). Pairs containing such relations cannot be annotated and are ignored in the evaluation. On average, it takes 0.5 to 1 hour to annotate one concept. We measured inter-annotator agreement using a sample dataset based on the method by [20], and the IAA is 0.8. The dataset is then randomly split into a development set (dev) containing 10 concepts for developing our measure and a test set (test) containing 30 concepts for evaluation.

To encourage comparative studies in the future, we publish all datasets and associated resources used in this study23. The statistics of the three datasets are shown in Table 2. Figure 10 shows the ranges of the percentage of true positives in the dev and test datasets. To our knowledge, this is by far the largest annotated dataset for evaluating relation alignment in the LOD domain.

4.6. General process

Given a concept from any of the three datasets, we query the DBpedia SPARQL endpoint to obtain a triple dataset and create a candidate set of relation pairs following the same procedure described above. For each setting created according to Section 4.4, we apply the methods to the triple dataset to (1) compute a similarity score for each relation pair and determine whether they should be considered truly equivalent based on the score, and (2) create clusters of equivalent relations for the concept.

Fig. 10. % of true positives in dev and test. Diamonds indicate the mean.

23 http://staffwww.dcs.shef.ac.uk/people/Z.Zhang/resources/jws2014/data_release.zip. The cached DBpedia query results are also released.

                                  Dev    Test   dbpm
Concepts                          10     30     203
Relation pairs (P.)               2316   6657   5388
True positive P.                  473    868    -
P. with incomprehensible          316    549    -
  relations (I.R.)
% of triples with I.R.            0.2%   0.2%   -
Schemata in datasets              dbo, dbpp, rdfs, skos, dc, geo, foaf

Table 2: Dataset statistics

The output from (1) is then evaluated against the three gold standard datasets described above. To evaluate clustering, we derived gold standard clusters from the three pair-equivalence gold standards by assuming equivalence transitivity, i.e., if r1 is equivalent to r2, which is equivalent to r3 in the gold standard, then the three relations are grouped in a single cluster. We only consider clusters of positive pairs, as the larger number of negative pairs (which results in a large number of single-element clusters) may bias the evaluation.
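Deriving the gold clusters from pair equivalences by transitivity amounts to computing connected components, for example with a small union-find (a sketch; the names are ours):

def gold_clusters(equivalent_pairs):
    # equivalent_pairs: iterable of (r1, r2) gold equivalences.
    parent = {}

    def find(r):
        parent.setdefault(r, r)
        while parent[r] != r:
            parent[r] = parent[parent[r]]  # path halving
            r = parent[r]
        return r

    for r1, r2 in equivalent_pairs:
        parent[find(r1)] = find(r2)  # union the two components

    clusters = {}
    for r in parent:
        clusters.setdefault(find(r), set()).add(r)
    return list(clusters.values())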

5. Results and discussion

5.1. Difficulty of the task

Annotating relation equivalence is a non-trivial task. The annotation process cost many person-days, with a resulting average IAA of 0.8; the lowest bound is 0.68 and the highest 0.87. It has been found that LOD datasets are characterized by a notable degree of noise. As Table 2 shows, about 8 to 14% of pairs contain incomprehensible relations. Such relations have peculiar names (e.g., dbpp:v, dbpp:trW of dbo:University) or ambiguous names (e.g., dbpp:law, dbpp:bio of dbo:University). They are undocumented and have little usage in the data, which makes them difficult to interpret. Moreover, there is also a high degree of inconsistent usage of relations. A typical example is dbo:railwayPlatforms of dbo:Station. It is used to represent the number of platforms in a station, but also the types of platforms in a station. These findings are in line with [14].

Table 2 and Figure 10 both show that the dataset is overwhelmed by negative examples. On average, less than 25% of the non-zero similarity pairs are true positives, and in extreme cases this drops to less than 6% (e.g., 20 out of 370 relation pairs of yago:EuropeanCountries are true positives). These findings suggest that finding equivalent relations on Linked Data is indeed a challenging task.


       lev    jc     fu        lin
t      0.43   0.07   0.1       0.65
min    0.06   0.01   6×10^-6   0.14
max    0.77   0.17   0.31      1.0

Table 3: Optimal thresholds t for each baseline similarity measure derived from dev

              t      min    max
lgt, n=10     0.24   0.06   0.62
lgt, n=15     0.22   0.05   0.62
lgt, n=20     0.2    0.06   0.39
exp, k=0.55   0.32   0.06   0.62
exp, k=0.35   0.31   0.06   0.62
exp, k=0.25   0.33   0.11   0.62
−n, n=10      0.29   0.06   0.62
−n, n=15      0.28   0.07   0.59
−n, n=20      0.28   0.07   0.59

Table 4: Optimal thresholds (t) for different variants of the similarity measure of EQUATER, derived from dev


Table 3 shows the learned thresholds for each of the baseline similarity measures based on the dev data, and Table 4 shows the learned thresholds for different variants of EQUATER obtained by replacing its kc component. In all cases, the learned thresholds span a wide range, suggesting that the optimal thresholds for deciding equivalence are indeed data-specific, and that finding these values can be difficult.

5.2. EQUATER performance

In Table 5 we show the results of EQUATER on the three datasets with varying n in the lgt knowledge confidence function. All figures are averages over all concepts in a dataset. Figure 11 shows the ranges of performance scores for different concepts in each dataset. Table 6 shows example clusters of equivalent relations discovered for different concepts; it shows that EQUATER manages to discover alignments between multiple schemata used in DBpedia.

On average, EQUATER obtains 0.65∼0.67 F1 in predicting pair equivalence on dev and 0.59∼0.61 F1 on test.

n of lgt           10     15     20
Pair equivalence
  dev, F1          0.67   0.66   0.65
  test, F1         0.61   0.60   0.59
  dbpm, R          0.68   0.66   0.66
Clustering
  dev, F1          0.74   0.74   0.74
  test, F1         0.70   0.70   0.70
  dbpm, R          0.72   0.70   0.70

Table 5: Results of EQUATER on all datasets. R - Recall

Fig. 11. Performance ranges on a per-concept basis for dev, test and dbpm. R - Recall, pe - pair equivalence, c - clustering

Concept        Example cluster
dbo:Actor      dbpp:birthPlace, dbo:birthPlace, dbpp:placeOfBirth
dbo:Book       dbpp:name, foaf:name, dbpp:titleOrig, rdfs:label
dbo:Company    dbpp:website, foaf:website, dbpp:homepage, dbpp:url

Table 6: Example clusters of equivalent relations.

These translate to 0.74 and 0.70 clustering accuracy on the two datasets respectively. For dbpm, we obtain a recall between 0.66 and 0.68 for pair equivalence, and between 0.70 and 0.72 for clustering. It is interesting to note that EQUATER appears to be insensitive to the varying values of n. This stability is a desirable feature, since it may make tuning the model unnecessary and therefore leaves the method less prone to overfitting. It also confirms the hypothesized analogy between the amount of seed data needed to bootstrap relation learning and the number of examples needed to obtain maximum knowledge confidence in EQUATER.

Figure 11 shows that the performance of EQUATER can vary depending on specific concepts. To understand the errors, we randomly sampled 100 false positive and 100 false negative examples from the test dataset, and 200 false negative examples from the dbpm dataset, then manually analyzed them and divided them into several types24. The prevalence of each type is shown in Table 7.

5.2.1. False positives
The first main source of errors is a high degree of semantic similarity: e.g., dbpp:residence and dbpp:birthPlace of dbo:TennisPlayer are highly semantically similar but non-equivalent. The second type of error is due to low variability in the objects of a relation: semantically dissimilar relations can have the same datatype and many overlapping values by coincidence. The overlap is caused by some relations having a limited range of object values, which is especially typical for relations with boolean datatype because they have only two possible values. The third type of error is entailment: e.g., for dbo:EuropeanCountries, dbpp:officialLanguage entails dbo:language because the official languages of a country are a subset of the languages spoken in a country. These could be considered cases of subsumption, which account for less than 15%. Finally, some of the errors are arguably due to an imperfect gold standard, as analysers sometimes disagree with the annotations (see Table 8).

5.2.2. False negatives
The first common type of error is due to the representation of objects. For instance, for dbo:AmericanFootballPlayer, dbo:team is associated mostly with resource URIs (e.g., 'dbr:Detroit_Lions') while dbpp:teams is associated mostly with literal objects (e.g., '* Detroit Lions') that are typically lexicalizations of the names of the resources. The second type is due to different datatypes: e.g., for dbo:Building, dbpp:startDate typically has literal objects indicating years, while dbo:buildingStartDate usually has precise literal date values as objects. Thirdly, the lexicalization of objects can be different. An example of this category is dbpp:dialCode and dbo:areaCode of dbo:Settlement, where the objects of the two relations are represented in three different ways, e.g. '0044', '+44', '44'. Many false negatives are due to sparsity: e.g., dbpp:oEnd and dbo:originalEndPoint of dbo:Canal have in total only 2 triples.

24 Analysis based on the DBpedia SPARQL service as of 31-10-2013. Inconsistency should be anticipated if different versions of the datasets are used.

Error Type                   Prevalence (%)
False Positives
  Semantically similar       52.4
  Low variability            29.1
  Entailment                 14.6
  Arguable gold standard     3.88
False Negatives
  Object representation      25.1
  Different datatype         24.7
  Noisy relation             19.4
  Different lexicalisations  11.7
  Sparsity                   10.5
  Limitation of method       5.67
  Arguable gold standard     2.83

Table 7: Relative prevalence of error types.

There are also noisy relations, whose lexicalization appears to be inconsistent with how they are used. Usually the lexicalization is ambiguous, as in the dbo:railwayPlatforms example discussed before. Some errors are simply due to the limitation of our method, i.e., our method still fails to identify equivalence even when sufficient, good-quality data are available, possibly due to inappropriate automatic threshold selection. Further, arguable gold standard cases also exist (e.g., dbpp:champions and dbo:teams of dbo:SoccerLeague are mapped to each other in the dbpm dataset).

We then also manually inspected some of the worst performing concepts in the dbpm dataset, and noticed that some of the poor results are due to extremely small gold standards. For example, dbo:SportsTeamMember and dbo:Monument have only 3 true positives each in their gold standard and, as a result, EQUATER scored 0 in recall. However, we believe that these gold standards are largely incomplete. For example, we consider most of the proposals by EQUATER in Table 8 to be correct.

Some of the error types mentioned above could be rectified by modifying EQUATER. For example, we could incorporate string similarity metrics, which may help with errors due to the representation of objects and different lexicalizations of objects. Regular expressions could be used to parse values in order to match data at the semantic level, e.g., for dates, weights, and lengths; this could be useful for solving errors due to different datatypes. Other error groups are much harder to prevent: even annotators often struggled to distinguish between semantically similar and equivalent relations, or to understand what a relation is supposed to mean.
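For instance, a value normaliser along the following lines could reconcile the dial-code variants mentioned above before computing argument overlap (a hypothetical illustration, not part of EQUATER):

import re

def normalize_phone_code(value):
    # Map '0044', '+44' and '44' to the same canonical form '44'.
    digits = re.sub(r'\D', '', value)   # drop '+' and other non-digits
    return digits.lstrip('0') or '0'    # drop the '00' international prefix

assert len({normalize_phone_code(v) for v in ('0044', '+44', '44')}) == 1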


Pair equivalence

             dev F1                       test F1                      dbpm R
msr \ thd    jk       km       s          jk       km       s          jk       km       s
lev          0.16∼18  0.16∼18  0.17∼19    0.12∼14  0.12∼14  0.13∼15    0.07∼08  0.08∼09  0.07∼08
f            0.18∼20  0.20∼22  0.09∼11    0.20∼21  0.22∼23  0.10∼11    0.21∼23  0.24∼26  0.03∼04
lin          0.27∼29  0.29∼31  0.28∼30    0.19∼21  0.19∼21  0.19∼21    0.39∼40  0.37∼38  0.38∼39
jc           0.09∼11  0.11∼12  0.01∼03    0.10∼11  0.12∼13  0.07∼08    0.09∼10  0.10∼12  -0.05∼-0.04

Clustering

             dev F1                       test F1                      dbpm R
msr \ thd    jk       km       s          jk       km       s          jk       km       s
lev          0.35     0.36     0.36       0.40     0.41     0.39       0.02∼04  0.04∼06  0.03∼05
f            0.15∼16  0.17∼18  0.06∼07    0.23     0.25     0.11       0.21∼23  0.24∼26  0.01∼03
lin          0.45     0.47     0.47       0.41     0.42     0.42       0.37∼39  0.37∼39  0.39∼41
jc           0.06∼07  0.07∼08  0.00∼01    0.14∼15  0.17     0.06       0.08∼10  0.09∼11  -0.07∼-0.05

Table 9: Improvement of EQUATER over the different baselines. The highest improvements on each dataset are highlighted in bold. Negative improvements are highlighted in italic.

r1             r2              # x,y argument pairs
dbo:synonym    dbp:otherName   6
rdfs:label     foaf:name       10
rdfs:label     dbp:name        10
rdfs:comment   dbo:abstract    41
dbp:material   dbo:material    10
dbp:city       dbo:city        5

Table 8: The equivalent relations for dbo:Monument proposed by EQUATER but considered false positive according to the gold standard.

5.3. EQUATER vs. baselines

Next, in Table 9 we show the improvement of EQUATER over the different models that use a baseline similarity measure. Since the performance of EQUATER depends on the parameter n in the similarity measure, we show the ranges between the minimum and maximum improvement due to the choice of n.

It is clear from Table 9 that EQUATER (unsupervised) significantly outperforms most baseline models, whether supervised or unsupervised. Exceptions are noted against jc_s in the clustering task on the dev dataset, where EQUATER achieves comparable results, and on the dbpm dataset in both the pair equivalence and clustering tasks, where EQUATER underperforms jc_s in terms of recall. However, as discussed before, the dbpm gold standard has many issues; furthermore, we are unable to evaluate precision on this dataset, while the results on the dev and test sets suggest that EQUATER has more balanced performance. The relatively larger improvement over unsupervised baselines than over supervised baselines may suggest that the scores produced by EQUATER exhibit a more 'separable' distribution pattern (e.g., as in Figure 6) for unsupervised threshold detection.

Figures 12a and 12b compare the balance between precision and recall of EQUATER against the baselines on the dev and test datasets. For EQUATER we use three different shapes to represent models with different n values in lgt; for the baseline models we use different shapes to represent different similarity measures, and different colours (black, white and grey) to represent different thd choices. It is clear that EQUATER always outperforms the baselines in terms of precision, and also finds the best balance between precision and recall, thus resulting in the highest F1.

Interesting to note is the inconsistent performance of the string similarity baselines (lev_jk, lev_km, lev_s) between the pair equivalence experiments and the clustering experiments. While in the pair equivalence experiments they obtain between 0.45 and 0.5 F1 (second best among the baselines) on both dev and test, with arguably balanced precision and recall, in the clustering experiments the figures sharply drop to 0.3∼0.4 (second worst among the baselines), skewed towards very high recall and very low precision. This suggests that the string similarity scores are non-separable by clustering algorithms, creating larger clusters that favour recall over precision.

A very similar pattern is also noted for the semantic similarity baselines (lin_jk, lin_km, lin_s). In fact, the semantic similarity and string similarity baselines generally obtain much worse results than the other baselines, which are extensional matchers; a strong indication that the latter are a better fit for aligning relations in the LOD domain. This can be partially attributed to the fact that relation URIs can be very noisy, and many do not comply with naming conventions and rules (e.g., 'birthplace' instead of 'birthPlace').

Fig. 12a. Balance between precision and recall for EQUATER and baselines on dev. pe - pair equivalence, c - clustering. The dotted lines are F1 references.

5.4. Variations of EQUATER components

In this section, we compare EQUATER against several alternative designs based on alternative choices of knowledge confidence (kc) functions and threshold detection (thd) methods. We pair the different kc models described in Section 4.3 with the different threshold detection methods described in Section 4.2 to create variants of EQUATER, and compare them against the original EQUATER method.

Fig. 12b. Balance between precision and recall for EQUATER andbaselines on test.

In addition, we also create the similarity measure e^k̄c, which only takes ta and sa without the knowledge confidence factor. Combined with the different choices of thd, we obtain models that represent our earlier prototype in [49]. Moreover, we also select jc as the best performing baseline similarity measure and use the corresponding baseline settings (jc_jk, jc_km, jc_s) as comparative references.

5.4.1. Alternative kc models
Figure 13a compares variations of EQUATER obtained by alternating the kc functions under each thd method. Since each of the functions lgt, exp and −n requires a parameter to be set, we show the ranges of performance obtained with different settings of their parameter. These are represented as black caps on top of each bar: the bigger the cap, the wider the range between the minimum and the maximum performance obtainable by tuning these parameters. Firstly, under the same chosen threshold detection method, settings with kc outperform the best baseline in most cases. This suggests that ta and sa are indeed more effective indicators of relation equivalence than the other metrics.


Fig. 13. Comparing variations of EQUATER by (a) alternating kc functions under each threshold detection method, and (b) alternating thd methods under each knowledge confidence function (incl. without kc).

It also suggests that the issue of unbalanced populations of schemata in the LOD domain is very common. Secondly, we can see that the accuracy of EQUATER as measured by F1 does benefit from the integration of kc functions; the changes are also substantial on the test set. Combining this with the results on the dbpm set, it seems that kc functions may trade off recall for precision to achieve an overall higher F1. Thirdly, in terms of the three kc functions, the performance given by the exp and −n models appears to be volatile, since changing their parameters caused considerable variation in performance in most cases. This also caused several variants of EQUATER to underperform the baseline with the same thd setting.

Analyzing the precision and recall trade-off for the different kc functions shows that without kc, the similarity measure of EQUATER tends to favour high recall but perhaps loses too much precision. Any kc function thus has the effect of re-balancing towards precision. Among the three, the exp function generally favours recall over precision, the threshold-based model favours precision over recall, while the lgt function finds the best balance. Details of this part of the analysis can be found in Appendix A.

5.4.2. Alternative thd methods
Figure 13b is a re-arranged view of Figure 13a: it compares variations of EQUATER by alternating the thd methods under each kc function. This gives a better view for comparing the different choices of thd. Generally it appears that, regardless of the kc function, the jk method has a slight advantage over km. The unsupervised jk variants of EQUATER also obtain performance close to their supervised counterparts in many cases, and even achieve the best performance on the test set (excluding k̄c). This suggests that Jenks Natural Breaks is an effective method for automatic threshold detection for EQUATER.

5.4.3. Limitations of EQUATER
The current version of EQUATER is limited in a number of ways. First and foremost, being an extensional matcher, it requires relations to have shared instances in order to work. This is usually a reasonable requirement for an individual dataset, and hence the experiments based on DBpedia have shown it to be very effective. However, in a cross-dataset context, concepts and instances would have to be aligned first in order to apply EQUATER. This is because different datasets often use different URIs to refer to the same entities; as a result, counting the overlap of a relation's arguments would have to go beyond the syntactic level.

A basic and simplistic solution could be a pre-processing step that maps concepts and instances from different datasets using existing 'sameAs' mappings, as done by Parundekar et al. [40] and Zhao et al. [52]. Unfortunately, when such mappings are unavailable, EQUATER requires other methods to first align concepts and instances in order to work effectively in cross-dataset settings. We therefore consider the second major limitation of EQUATER to be that it is a partial ontology alignment method addressing a specific but practical issue: aligning relations only. Ideally, EQUATER should be extended to iteratively align relations based on aligned concepts and instances, and vice-versa.

Despite these limitations, we consider EQUATER a valuable contribution to the literature, as it specifically targets a gap in related work, and the lessons learned could be useful for future development.

6. Conclusions

This article explored the problem of aligning heterogeneous resources in LOD datasets. Heterogeneity decreases the quality of the data and may eventually hamper its usability at large scale. It is a major research problem concerning the Semantic Web community, and significant effort has been made to address it in the area of ontology alignment. While most work has studied mapping concepts and individuals, relation heterogeneity in LOD datasets is becoming an increasingly pressing issue but still remains much less studied. The annotation practice undertaken in this work has shown that the task is challenging even for humans.

This article makes a particular contribution to this problem by introducing EQUATER, a domain- and language-independent and unsupervised method to align relations based on their shared instances. Currently, EQUATER fits best with aligning relations from different schemata used in a single Linked Dataset, a practical problem that has emerged with the increasing collaborative effort in creating very large Linked Datasets. It can potentially be used in cross-dataset settings, provided that concepts and instances across the datasets are aligned to ensure that relations have shared instances.

A series of experiments has been designed to thoroughly evaluate EQUATER in two tasks: predicting relation pair equivalence and discovering clusters of equivalent relations. These experiments have confirmed the advantage of EQUATER: compared to baseline models, including both supervised and unsupervised versions, it makes significant improvements in terms of F1 measure, and always scores the highest precision. Compared to different variants of EQUATER, the logistic model of knowledge confidence achieves the best scores in most cases and is seen to give stable performance regardless of its parameter setting, while the alternatives suffer from a higher degree of volatility that occasionally causes them to underperform the baselines. The Jenks Natural Breaks method for automatic threshold detection also proves to have a slight advantage over the k-means alternative, and even outperformed the supervised method on the test set. Although EQUATER does not achieve the best recall on the dbpm dataset, we believe its results are still encouraging and that it would achieve the most balanced performance had we been able to evaluate precision. Overall, we believe that it may potentially speed up the practical mapping task currently concerning the DBpedia community.

As future work, we will explore methods to addressthe previously discussed limitations of EQUATER.

Acknowledgement. Part of this research has been sponsored by the EPSRC funded project LODIE: Linked Open Data for Information Extraction, EP/J019488/1.

References

[1] Javier Artiles, Julio Gonzalo, and Satoshi Sekine. The SemEval-2007 WePS evaluation: Establishing a benchmark for the web people search task. In Proceedings of the 4th International Workshop on Semantic Evaluations, SemEval '07, pages 64–69, Stroudsburg, PA, USA, 2007. Association for Computational Linguistics.

[2] E. Blomqvist, Z. Zhang, A. Gentile, I. Augenstein, and F. Ciravegna. Statistical knowledge patterns for characterizing linked data. In Workshop on Ontology and Semantic Web Patterns (4th edition) - WOP2013 at ISWC2013, 2013.

[3] T. Calinski and J. Harabasz. A dendrite method for cluster analysis. Communications in Statistics, 3(1):1–27, 1974.

[4] A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E.R. Hruschka Jr., and T.M. Mitchell. Toward an architecture for never-ending language learning. In Proceedings of the Conference on Artificial Intelligence (AAAI), pages 1306–1313. AAAI Press, 2010.

[5] Michelle Cheatham and Pascal Hitzler. String similarity metrics for ontology alignment. In Proceedings of the 12th International Semantic Web Conference, ISWC'13, pages 294–309, Berlin, Heidelberg, 2013. Springer-Verlag.

[6] Isabel F. Cruz, Matteo Palmonari, Federico Caimi, and Cosmin Stroe. Towards 'on the go' matching of linked open data ontologies. In Pavel Shvaiko, Jérôme Euzenat, Fausto Giunchiglia, and Bin He, editors, IJCAI Workshop Discovering Meaning On the Go in Large and Heterogeneous Data (LHD-11), Barcelona, Spain, July 2011.

[7] Isabel F. Cruz, Matteo Palmonari, Federico Caimi, and Cosmin Stroe. Building linked ontologies with high precision using subclass mapping discovery. Artificial Intelligence Review, 40(2):127–145, August 2013.

[8] Isabel F. Cruz, Cosmin Stroe, Michele Caci, Federico Caimi, Matteo Palmonari, Flavio Palandri Antonelli, and Ulas C. Keles. Using AgreementMaker to align ontologies for OAEI 2010. In Proceedings of the ISWC International Workshop on Ontology Matching (OM), CEUR Workshop Proceedings, 2010.

[9] Russell Dewey. Chapter 7: Cognition. Psychology: An Introduction, 2007.

[10] Hong-Hai Do and Erhard Rahm. COMA: A system for flexible combination of schema matching approaches. In Proceedings of the 28th International Conference on Very Large Data Bases, VLDB '02, pages 610–621. VLDB Endowment, 2002.

[11] AnHai Doan, Pedro Domingos, and Alon Y. Halevy. Reconciling schemas of disparate data sources: A machine-learning approach. SIGMOD Record, 30(2):509–520, May 2001.

[12] Songyun Duan, Achille Fokoue, Oktie Hassanzadeh, Anastasios Kementsietsidis, Kavitha Srinivas, and Michael J. Ward. Instance-based matching of large ontologies using locality-sensitive hashing. In Proceedings of the 11th International Conference on The Semantic Web - Volume Part I, ISWC'12, pages 49–64, Berlin, Heidelberg, 2012. Springer-Verlag.

[13] J. Euzenat and P. Shvaiko. Ontology Matching. Springer, 2007.

[14] Linyun Fu, Haofen Wang, Wei Jin, and Yong Yu. Towards better understanding and utilizing relations in DBpedia. Web Intelligence and Agent Systems, 10(3):291–303, 2012.

[15] Anna Lisa Gentile, Ziqi Zhang, Isabelle Augenstein, and Fabio Ciravegna. Unsupervised wrapper induction using linked data. In Proceedings of the Seventh International Conference on Knowledge Capture, K-CAP '13, pages 41–48, New York, NY, USA, 2013. ACM.

[16] Bernardo Cuenca Grau, Zlatan Dragisic, Kai Eckert, Jérôme Euzenat, Alfio Ferrara, Roger Granada, Valentina Ivanova, Ernesto Jiménez-Ruiz, Andreas Oskar Kempf, Patrick Lambrix, Christian Meilicke, Andriy Nikolov, Heiko Paulheim, Dominique Ritze, Francois Scharffe, Pavel Shvaiko, Cássia Trojahn, and Ondrej Zamazal. Preliminary results of the ontology alignment evaluation initiative 2013. In The Eighth International Workshop on Ontology Matching at ISWC2013, OM'13, 2013.

[17] Toni Gruetze, Christoph Böhm, and Felix Naumann. Holistic and scalable ontology alignment for linked open data. In Christian Bizer, Tom Heath, Tim Berners-Lee, and Michael Hausenblas, editors, Proceedings of the 5th Linked Data on the Web (LDOW) Workshop at the 21st International World Wide Web Conference (WWW), CEUR Workshop Proceedings. CEUR-WS.org, 2012.

[18] Lushan Han, Tim Finin, and Anupam Joshi. GoRelations: an intuitive query system for DBpedia. In Proceedings of the 2011 Joint International Conference on The Semantic Web, JIST'11, pages 334–341, Berlin, Heidelberg, 2012. Springer-Verlag.

[19] Sven Hertling and Heiko Paulheim. WikiMatch - using Wikipedia for ontology matching. In Proceedings of the 7th International Workshop on Ontology Matching (OM-2012), collocated with the 11th International Semantic Web Conference (ISWC-2012), pages 37–48, 2012.

[20] George Hripcsak and Adam Rothschild. Agreement, the F-measure, and reliability in information retrieval. Journal of the American Medical Informatics Association, 12:296–298, 2005.

[21] Wei Hu, Yuzhong Qu, and Gong Cheng. Matching large ontologies: A divide-and-conquer approach. Data & Knowledge Engineering, 67(1):140–160, October 2008.

[22] Antoine Isaac, Lourens Van Der Meij, Stefan Schlobach, and Shenghui Wang. An empirical study of instance-based ontology matching. In Proceedings of the 6th International Semantic Web Conference and 2nd Asian Semantic Web Conference, ISWC'07/ASWC'07, pages 253–266, Berlin, Heidelberg, 2007. Springer-Verlag.

[23] Prateek Jain, Pascal Hitzler, Amit P. Sheth, Kunal Verma, and Peter Z. Yeh. Ontology alignment for linked open data. In Proceedings of the 9th International Semantic Web Conference, ISWC2010, pages 402–417, Berlin, Heidelberg, 2010. Springer-Verlag.

[24] Prateek Jain, Peter Z. Yeh, Kunal Verma, Reymonrod G. Vasquez, Mariana Damova, Pascal Hitzler, and Amit P. Sheth. Contextual ontology alignment of LOD with an upper ontology: A case study with PROTON. In Proceedings of the 8th Extended Semantic Web Conference on The Semantic Web: Research and Applications - Volume Part I, ESWC'11, pages 80–92, Berlin, Heidelberg, 2011. Springer-Verlag.

[25] Yves R. Jean-Mary, E. Patrick Shironoshita, and Mansur R. Kabuka. Ontology matching with semantic verification. Web Semantics, 7(3):235–251, September 2009.


[26] George Jenks. The data model concept in statistical mapping. International Yearbook of Cartography, 7:186–190, 1967.

[27] Jaewoo Kang and Jeffrey F. Naughton. On schema matching with opaque column names and data values. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, SIGMOD '03, pages 205–216, New York, NY, USA, 2003. ACM.

[28] Patrick Lambrix and He Tan. SAMBO - a system for aligning and merging biomedical ontologies. Web Semantics, 4(3):196–206, September 2006.

[29] Juanzi Li, Jie Tang, Yi Li, and Qiong Luo. RiMOM: A dynamic multistrategy ontology alignment framework. IEEE Transactions on Knowledge and Data Engineering, 21(8):1218–1232, August 2009.

[30] Girija Limaye, Sunita Sarawagi, and Soumen Chakrabarti. Annotating and searching web tables using entities, types and relationships. Proceedings of the VLDB Endowment, 3(1-2):1338–1347, 2010.

[31] D. Lin. An information-theoretic definition of similarity. In Proceedings of the 15th International Conference on Machine Learning, ICML '98, pages 296–304, San Francisco, CA, USA, 1998. Morgan Kaufmann Publishers Inc.

[32] J. B. MacQueen. Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. University of California Press, 1967.

[33] Jayant Madhavan, Philip A. Bernstein, AnHai Doan, and Alon Halevy. Corpus-based schema matching. In Proceedings of the 21st International Conference on Data Engineering, ICDE '05, pages 57–68, Washington, DC, USA, 2005. IEEE Computer Society.

[34] Jayant Madhavan, Philip A. Bernstein, and Erhard Rahm. Generic schema matching with Cupid. In Proceedings of the 27th International Conference on Very Large Data Bases, VLDB '01, pages 49–58, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.

[35] Fionn Murtagh. Multidimensional clustering algorithms. In COMPSTAT Lectures 4. Würzburg: Physica-Verlag, 1985.

[36] Miklos Nagy, Maria Vargas-Vera, and Enrico Motta. Multi-agent ontology mapping framework in the AQUA question answering system. In Proceedings of the 4th Mexican International Conference on Advances in Artificial Intelligence, MICAI'05, pages 70–79, Berlin, Heidelberg, 2005. Springer-Verlag.

[37] Miklos Nagy, Maria Vargas-Vera, and Enrico Motta. DSSim - managing uncertainty on the semantic web. In Pavel Shvaiko, Jérôme Euzenat, Fausto Giunchiglia, and Bin He, editors, The 2nd International Workshop on Ontology Matching (OM-2007), volume 304 of CEUR Workshop Proceedings. CEUR-WS.org, 2007.

[38] DuyHoa Ngo, Zohra Bellahsene, and Remi Coletta. A generic approach for combining linguistic and context profile metrics in ontology matching. In Proceedings of the 2011 Confederated International Conference on On the Move to Meaningful Internet Systems - Volume Part II, OTM'11, pages 800–807, Berlin, Heidelberg, 2011. Springer-Verlag.

[39] Andriy Nikolov, Victoria Uren, Enrico Motta, and Anne Roeck. Overcoming schema heterogeneity between linked semantic repositories to improve coreference resolution. In Proceedings of the 4th Asian Conference on The Semantic Web, ASWC '09, pages 332–346, Berlin, Heidelberg, 2009. Springer-Verlag.

[40] Rahul Parundekar, Craig A. Knoblock, and José Luis Ambite. Linking and building ontologies of linked data. In Proceedings of the 9th International Semantic Web Conference on The Semantic Web - Volume Part I, ISWC'10, pages 598–614, Berlin, Heidelberg, 2010. Springer-Verlag.

[41] Pavel Shvaiko and Jérôme Euzenat. Ontology matching: State of the art and future challenges. IEEE Transactions on Knowledge and Data Engineering, 25(1):158–176, January 2013.

[42] Balthasar A. C. Schopman, Shenghui Wang, Antoine Isaac, and Stefan Schlobach. Instance-based ontology matching by instance enrichment. Journal on Data Semantics, 1(4):219–236, 2012.

[43] Md. Hanif Seddiqui and Masaki Aono. An efficient and scalable algorithm for segmented alignment of ontologies of arbitrary size. Web Semantics, 7(4):344–356, December 2009.

[44] G. Shafer. A Mathematical Theory of Evidence. Princeton University Press, 1976.

[45] Kristian Slabbekoorn, Laura Hollink, and Geert-Jan Houben. Domain-aware ontology matching. In Proceedings of the 11th International Conference on The Semantic Web - Volume Part I, ISWC'12, pages 542–558, Berlin, Heidelberg, 2012. Springer-Verlag.

[46] Fabian M. Suchanek, Serge Abiteboul, and Pierre Senellart. PARIS: Probabilistic alignment of relations, instances, and schema. Proceedings of the VLDB Endowment, 5(3):157–168, November 2011.

[47] Johanna Völker and Mathias Niepert. Statistical schema induction. In Proceedings of the 8th Extended Semantic Web Conference on The Semantic Web: Research and Applications - Volume Part I, ESWC'11, pages 124–138, Berlin, Heidelberg, 2011. Springer-Verlag.

[48] Z. Zhang, A. Gentile, E. Blomqvist, I. Augenstein, and F. Ciravegna. Statistical knowledge patterns: Identifying synonymous relations in large linked datasets. In The 12th International Semantic Web Conference and the 1st Australasian Semantic Web Conference, Sydney, Australia, 2013.

[49] Ziqi Zhang, Anna Lisa Gentile, Isabelle Augenstein, Eva Blomqvist, and Fabio Ciravegna. Mining equivalent relations from linked data. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics - Short Papers (ACL Short Papers 2013), ACL'13. ACL, 2013.

[50] Ziqi Zhang, Anna Lisa Gentile, and Fabio Ciravegna. Recent advances in methods of lexical semantic relatedness - a survey. Natural Language Engineering, FirstView:1–69, April 2012.

[51] Lihua Zhao and Ryutaro Ichise. Graph-based ontology analysis in the linked open data. In Proceedings of the 8th International Conference on Semantic Systems, I-SEMANTICS '12, pages 56–63, New York, NY, USA, 2012. ACM.

[52] Lihua Zhao and Ryutaro Ichise. Mid-ontology learning from linked data. In Proceedings of the 2011 Joint International Conference on The Semantic Web, JIST'11, pages 112–127, Berlin, Heidelberg, 2012. Springer-Verlag.

[53] Jiwei Zhong, Haiping Zhu, Jianming Li, and Yong Yu. Conceptual graph matching for semantic search. In Proceedings of the 10th International Conference on Conceptual Structures: Integration and Interfaces, ICCS '02, pages 92–196, London, UK, 2002. Springer-Verlag.


Appendix

A. Precision and recall obtained with different kc functions

Figure 14 complements Figure 13a by comparing the balance between precision and recall for different variants of EQUATER using the dev and test sets. We use different shapes to represent different kc functions, and different colours (black, white and grey) to represent different parameter settings for each kc function. It is clear that without kc functions, the similarity measure of EQUATER tends to favour high recall but perhaps loses too much precision. All kc functions have the effect of balancing towards precision, due to the constraints on the number of examples required to compute similarity confidently. Among the three, the exp model generally produces the highest recall with a trade-off in precision. To a certain extent, this confirms our belief that the knowledge confidence score under the exponential model may converge too fast: it may be over-confident on small sets of examples, causing EQUATER to over-predict equivalence. On the other hand, the threshold-based model trades off recall for precision. The variants with the lgt model generally find the best balance; in fact, under unsupervised settings, they achieve the best or close-to-best precision.

The lgt model also warrants more stability, since changing its parameters caused little performance variation (note that the different coloured squares are generally clustered together, while the different coloured triangles and diamonds are far apart). Although occasionally variants with the exp model may outperform those based on lgt (e.g., when thd = km in the clustering experiment on dev), the difference is small; in these cases their performance is more dependent on the setting of the parameter and can sometimes underperform the baselines. Based on these observations, we argue that the lgt model of knowledge confidence is better than exp, −n, or k̄c.

B. Exploration during the development of EQUATER's similarity measure

In this section we present some earlier analysis that helped us during the development of EQUATER. These analyses helped us to identify useful features for evaluating relation equivalence, as well as unsuccessful features which we abandoned in EQUATER. We analyzed EQUATER's components ta and sa from a different perspective to understand whether they could be useful indicators of equivalence (Section B.1). We also explored another dimension, the ranges of relations (Section B.2). The intuition is that ranges provide additional information about relations. Unfortunately, our analysis showed that ranges derived for relations from data are highly inconsistent and therefore not discriminative features for this task. As a result they were not used by EQUATER. We carried out all analysis using the dev dataset only.

B.1. ta and sa

We applied ta and sa separately to each relation pair in the dev dataset, then studied the distribution of ta and sa scores for true positives and true negatives. Figure 15 shows that both ta and sa create different distributional patterns of scores for positive and negative examples in the data. Specifically, the majority of true positives receive a ta score of 0.2 or higher and an sa score of 0.1 or higher, while the majority of true negatives receive ta < 0.15 and sa < 0.1. Based on such distinctive patterns we concluded that ta and sa could be useful indicators for discovering equivalent relations.

Fig. 15. Distribution of ta and sa scores for true positive and true negative examples in dev.

B.2. Ranges of relations

We also explored several ways of deriving the ranges of a relation to be considered in measuring similarity. One simplistic method is to use ontological definitions. For example, the range of dbo:birthPlace of the concept dbo:Actor is defined as dbo:Place according to the DBpedia ontology. However, this does not work for relations that are not formally defined in ontologies, such as any predicates in the dbpp namespace, which are very common in the datasets.

Instead, we chose to define the ranges of a relation based on its objects r(y) in the data. One approach is to extract the classes of the objects y : r(y) and expect a dominant class over all objects of the relation. Thus we started by querying the DBpedia SPARQL endpoint with the following query template:


Fig. 14. Balance between precision and recall for EQUATER and its variant forms. pe - pair equivalence, c - clustering. The dotted lines are F1 references.

SELECT ?o ?range WHERE {
  ?s [RDF Predicate URI] ?o .
  ?s a [Concept URI] .
  OPTIONAL { ?o a ?range . }
}

Next, we counted the frequency of each distinct value of the variable ?range and calculated its fraction with respect to all values. We found three issues that make this approach unreliable. First, if a subject s has an rdf:type triple defining its type c (i.e., s rdf:type c), it appears that DBpedia creates additional rdf:type triples for the subject with every superclass of c. For example, there are 20 rdf:type triples for dbr:Los_Angeles_County,_California, and the objects of these triples include owl:Thing, yago:Object100002684 and gml:_Feature (gml: Geography Markup Language). These triples significantly skew the data statistics, while incorporating ontology-specific knowledge to resolve the hierarchies can be an expensive process due to the unknown number of ontologies involved in the data. Second, even if we were always able to choose the most specific class according to each involved ontology for each subject, we noticed a high degree of inconsistency across different subjects in the data. For example, this gives us 13 most specific classes as candidate ranges for dbo:birthPlace of dbo:Actor, and the dominant class is dbo:Country, representing just 49% of the triples containing the relation. Other ranges include scg:Place, dbo:City, yago:Location (scg: schema.org), etc. The third problem is that for values of ?o that are literals, no ranges will be extracted in this way (e.g., the values of ?range extracted using the above SPARQL template for the relation dbpp:othername are empty when the ?o values are literals).

For these reasons, we abandoned these two methods and proposed instead to use several simple heuristics to classify the objects of triples into categories based on their datatype, and to use these categories as ranges. Thus, given the set of argument pairs r(x, y) of a relation, we classified each object value into one of six categories: URI, number, boolean, date or time, descriptive text containing over ten tokens, and short string for everything else. A similar scheme is used in [51]. Although these range categories are very high-level, they cover all data and may provide limited but potentially useful information for comparing relations.
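A sketch of such a classifier (the exact rules here are our guess at plausible heuristics; the paper only names the six categories):

import re

def range_category(value):
    # Classify a triple object into one of six coarse range categories.
    v = value.strip()
    if re.match(r'^(https?|ftp)://\S+$', v):
        return 'URI'
    if v.lower() in ('true', 'false'):
        return 'boolean'
    if re.match(r'^[+-]?\d+([.,]\d+)?$', v):
        return 'number'
    if re.match(r'^\d{4}-\d{2}(-\d{2})?$', v):  # e.g. 1999-05-21
        return 'date or time'
    if len(v.split()) > 10:
        return 'descriptive text'
    return 'short string'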

We developed a measure called maximum range agreement to examine the degree to which two relations use the same range in their data. Let RG_{r1,r2} denote the set of shared ranges discovered for the relations r1 and r2 following the above method, and frac(rg^i_{r1}) denote the fraction of triples containing the relation r1 whose range is the i-th element in RG_{r1,r2}. We define the maximum range agreement (mra) of a pair of relations as:

$$mra(r_1, r_2) = \begin{cases} 0 & \text{if } RG_{r_1,r_2} = \emptyset \\ \max_i \big\{ frac(rg^i_{r_1}) + frac(rg^i_{r_2}) \big\} & \text{otherwise} \end{cases} \tag{16}$$
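Spelled out in code, under the assumption that each relation carries the list of range categories of its triples (the helper names are ours):

def mra(ranges1, ranges2):
    # ranges1 / ranges2: lists of range categories, one per triple
    # of r1 and r2 respectively (Equation 16).
    shared = set(ranges1) & set(ranges2)
    if not shared:
        return 0.0
    def frac(rg, ranges):
        return ranges.count(rg) / len(ranges)
    return max(frac(rg, ranges1) + frac(rg, ranges2) for rg in shared)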

The intuition is that if two relations are equivalent, each of them should have a dominant range as seen in their triple data (thus a high value of frac(rg^i_r) for both r1 and r2), and their dominant ranges should be consistent. Unfortunately, as Figure 16 shows, mra has little discriminating power in separating true positives from true negatives. As a result, we did not use it in EQUATER.

Fig. 16. Distribution of mra scores for true positive and true negative examples in dev.

In the error analysis, the errors due to incompatible datatypes might potentially benefit from the range information of relations. However, the proposed six categories of ranges may have been too general to be useful.