15
Research Article Building Integrated Ontological Knowledge Structures with Efficient Approximation Algorithms Yang Xiang 1 and Sarath Chandra Janga 2 1 Department of Biomedical Informatics, e Ohio State University, Columbus, OH 43210, USA 2 Department of BioHealth Informatics, Indiana University-Purdue University Indianapolis, Indianapolis, IN 46202, USA Correspondence should be addressed to Sarath Chandra Janga; [email protected] Received 22 October 2014; Revised 30 December 2014; Accepted 1 January 2015 Academic Editor: Dongchun Liang Copyright © 2015 Y. Xiang and S. C. Janga. is is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. e integration of ontologies builds knowledge structures which brings new understanding on existing terminologies and their associations. With the steady increase in the number of ontologies, automatic integration of ontologies is preferable over manual solutions in many applications. However, available works on ontology integration are largely heuristic without guarantees on the quality of the integration results. In this work, we focus on the integration of ontologies with hierarchical structures. We identified optimal structures in this problem and proposed optimal and efficient approximation algorithms for integrating a pair of ontologies. Furthermore, we extend the basic problem to address the integration of a large number of ontologies, and correspondingly we proposed an efficient approximation algorithm for integrating multiple ontologies. e empirical study on both real ontologies and synthetic data demonstrates the effectiveness of our proposed approaches. In addition, the results of integration between gene ontology and National Drug File Reference Terminology suggest that our method provides a novel way to perform association studies between biomedical terms. 1. Introduction In recent years, ontologies are becoming increasingly impor- tant in knowledge engineering. Generally speaking, an ontol- ogy is a collection of concepts and their relations. It has wide applications in computer science and life science. For exam- ple, in computer science, semantic web uses web ontology language (OWL) to represent knowledge bases [1]. In life sci- ences, numerous important data structures and tools are built on ontologies. One of the most popular ontologies is gene ontology. Researchers use it frequently to measure the enrich- ment of gene clusters and to identify potential biomarkers. Two most famous ontology databases in the biomedical field are Unified Medical Language System (UMLS) [2] and NCBO BioPortal (https://bioportal.bioontology.org/). e former has more than 100 ontology datasets and the latter has more than 300 ontology datasets. Although ontologies can be modeled as a directed graph, many ontologies are in fact hierarchical trees or have hier- archical tree-like structures. In the BioPortal website, users can find basic hierarchical properties of an ontology, such as the maximum depth and the maximum number of children. In the UMLS, the hierarchical structure of an ontology is documented in the “MRHIER.RRF” file with each line being a path from a term to its root. We can build a hierarchical tree from these paths by merging the common nodes starting from the root. Because the hierarchical structures of some ontologies are in fact directed acyclic graphs, the hierarchical tree may contain some duplicated concepts. To simplify our study, we treat them as independent concepts in this work. An important knowledge discovery task is to identify knowledge associations. In life science, this task includes finding the associations between diseases and genes [3, 4] and between phenotypes and genotypes [5]. With the presence of ontologies, such a task has been extended from identifying the associations between terms to the associations between ontologies as a whole. For the latter, we should not only consider the term associations, but also the term associations in the context of their ontological structures. For example, if the parent and children of term ( from ontology ) are Hindawi Publishing Corporation BioMed Research International Volume 2015, Article ID 501528, 14 pages http://dx.doi.org/10.1155/2015/501528

Research Article Building Integrated Ontological Knowledge

  • Upload
    others

  • View
    3

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Research Article Building Integrated Ontological Knowledge

Research ArticleBuilding Integrated Ontological Knowledge Structures withEfficient Approximation Algorithms

Yang Xiang1 and Sarath Chandra Janga2

1Department of Biomedical Informatics The Ohio State University Columbus OH 43210 USA2Department of BioHealth Informatics Indiana University-Purdue University Indianapolis Indianapolis IN 46202 USA

Correspondence should be addressed to Sarath Chandra Janga scjangaiupuiedu

Received 22 October 2014 Revised 30 December 2014 Accepted 1 January 2015

Academic Editor Dongchun Liang

Copyright copy 2015 Y Xiang and S C Janga This is an open access article distributed under the Creative Commons AttributionLicense which permits unrestricted use distribution and reproduction in any medium provided the original work is properlycited

The integration of ontologies builds knowledge structures which brings new understanding on existing terminologies and theirassociations With the steady increase in the number of ontologies automatic integration of ontologies is preferable over manualsolutions in many applications However available works on ontology integration are largely heuristic without guarantees on thequality of the integration results In this work we focus on the integration of ontologies with hierarchical structures We identifiedoptimal structures in this problem and proposed optimal and efficient approximation algorithms for integrating a pair of ontologiesFurthermore we extend the basic problem to address the integration of a large number of ontologies and correspondingly weproposed an efficient approximation algorithm for integrating multiple ontologies The empirical study on both real ontologiesand synthetic data demonstrates the effectiveness of our proposed approaches In addition the results of integration between geneontology and National Drug File Reference Terminology suggest that our method provides a novel way to perform associationstudies between biomedical terms

1 Introduction

In recent years ontologies are becoming increasingly impor-tant in knowledge engineering Generally speaking an ontol-ogy is a collection of concepts and their relations It has wideapplications in computer science and life science For exam-ple in computer science semantic web uses web ontologylanguage (OWL) to represent knowledge bases [1] In life sci-ences numerous important data structures and tools are builton ontologies One of the most popular ontologies is geneontology Researchers use it frequently tomeasure the enrich-ment of gene clusters and to identify potential biomarkersTwo most famous ontology databases in the biomedical fieldareUnifiedMedical Language System (UMLS) [2] andNCBOBioPortal (httpsbioportalbioontologyorg) The formerhas more than 100 ontology datasets and the latter has morethan 300 ontology datasets

Although ontologies can be modeled as a directed graphmany ontologies are in fact hierarchical trees or have hier-archical tree-like structures In the BioPortal website users

can find basic hierarchical properties of an ontology such asthe maximum depth and the maximum number of childrenIn the UMLS the hierarchical structure of an ontology isdocumented in the ldquoMRHIERRRFrdquo file with each line beinga path from a term to its root We can build a hierarchicaltree from these paths by merging the common nodes startingfrom the root Because the hierarchical structures of someontologies are in fact directed acyclic graphs the hierarchicaltree may contain some duplicated concepts To simplify ourstudy we treat them as independent concepts in this work

An important knowledge discovery task is to identifyknowledge associations In life science this task includesfinding the associations between diseases and genes [3 4] andbetween phenotypes and genotypes [5] With the presence ofontologies such a task has been extended from identifyingthe associations between terms to the associations betweenontologies as a whole For the latter we should not onlyconsider the term associations but also the term associationsin the context of their ontological structures For example ifthe parent and children of term 119886 (119886 from ontology 119860) are

Hindawi Publishing CorporationBioMed Research InternationalVolume 2015 Article ID 501528 14 pageshttpdxdoiorg1011552015501528

2 BioMed Research International

Disease

Disease

Infectious disease

Infectious disease

Cancer

Cancer

Skin cancer

Skin cancer

Flu

Flu

Hepatitis B

Hepatitis B

Medicine

Medicine

Radiation

Radiation Vaccination

Vaccination

Brachy-therapy

Brachy-therapy

Radio-isotape

Radio-isotape

Influenza vaccine

Influenza vaccine

HBV vaccine

HBV vaccine

Thyroid cancer

Thyroid cancer

middot middot middot

middot middot middot

middot middot middot

Figure 1 A simple example of integrating two hypothetical ontologies

similar to those of term 119887 (119887 from ontology 119861)Then 119886may bea good choice to be associated with 119887

Early studies on ontology integration relied on domainexperts to manually set up the integration rules [6] Howeverthis approach cannot meet the ever increasing volume ofontology datasets Automatic ontology integration methodshave been developed to address this issue However as wewill see in the discussions of Section 13 these methods areoften heuristic or have not demonstrated the effectivenessin integrating a large volume of ontology datasets Thusour goal is to develop an ontology integration method thatis able to deliver optimal or close-to-optimal solutions forintegrating a large volume of ontology datasets (particularlyfrom the biomedical domain) As discussed above we focuson ontologies with hierarchical tree-like structures whichare often available in the biomedical domain In additionwe assume the ontology term closeness measurement isavailable This assumption is reasonable because many appli-cations are able to identify ontology term similarities viaadditional data sources For example a closeness matrixbetween two sets of biomedical terms can be generated byusingUMLS knowledge discoverymethods such as kDLS [7]Our problem can be formally described as follows

11 Problem Formulation The basic ontology integrationproblem in our work can be formulated as follows Givenontology tree structures 119879

119860and 119879

119861and a closeness matrix

119872119860119861 how can we efficiently generate an integrated ontology

tree structure 119879119860119861

meeting the following two basic criteria

(1) For any two vertices 119909 and 119910 in 119879119860(or 119879119861) the lowest

common ancestor LCA119879119860(119909 119910) (or LCA

119879119861(119909 119910)) is

contained by LCA119879119860119861(119909 119910)

(2) It holds that argmax119879119860119861119891(119879119860119861) = sum

119883isin119881(119879119860119861)119872119860119861(119883)

Here 119872119860119861(119883) is the entry value in the closeness

matrix for the corresponding two vertices (one from

119879119860and the other from 119879

119861) contained in the node 119883

119872119860119861(119883) = 0 if 119883 a node of 119879

119860119861 contains only one

vertex from 119879119860or 119879119861

We name 119891119879119860119861

the cohesion function of the integrated ontol-ogy119879119860119861

and its value is the overall cohesion score of integrat-ing 119879119860and 119879

119861into 119879

119860119861 Correspondingly we define function

119892(119879119860 119879119861) = max

119879119860119861(sumVisin119881(119879119860119861)119872119860119861(V)) as the maximum

cohesion function for integrating the ontologies 119879119860and 119879

119861

and its value is the maximum overall cohesion score (orsimplymaximum cohesion score) In a hierarchical ontologythe common part of any two terms can be described by theirlowest common ancestor For example in Figure 1 flu andhepatitis B are both infectious diseases and flu and cancerare both diseases Thus we use Criterion (1) to ensure thatthe basic logic of an ontology is preserved after integration

An example of integrating two hypothetical ontologiesthat satisfy Criterion (1) is given in Figure 1 To facilitate theunderstanding of our problem definition we also provideanother example of integrating two ontologies in Figure 2 Aswe can see in Figure 2 the lowest common ancestor of nodescontaining thyroid cancer and infectious disease is the nodecontaining cancer instead of disease We conclude that theintegration is a violation of Criterion (1) In fact we can easilysee that there aremultiple pairs of nodeswith incorrect lowestcommon ancestors in Figure 2

In Section 22 wewill extend the basic problemdefinitionto handle the integration of multiple (gt2) ontologiesThe twobasic criteria will be extended correspondingly

12 Main Contributions We made the following major con-tributions in this work

(i) We proposed a novel ontology integration problemthat optimizes the cohesion function We identifiedoptimal structures in this problem and developed

BioMed Research International 3

Disease

Infectious disease

Cancer

Skin cancer

Flu

Hepatitis B

Medicine

Radiation Vaccination

Brachy-therapy

Radio-isotape

Influenza vaccine

HBV vaccine

Thyroid cancer middot middot middot

Figure 2 Another example of integrating the two ontologies in Figure 1 This integration violates Criterion (1)

optimal as well as efficient approximation solutionsfor this problem

(ii) We extended the basic problem to handle the integra-tion of large number of ontologies and we developedboth greedy and fast approximation algorithms forthe extended problem

(iii) We studied the proposed algorithms on both real andsynthetic datasets and confirmed their effectiveness inintegrating large volume of ontology datasets

13 Related Work Automatic ontology generation and inte-gration are desirable in many applications and have beenstudied in the past decade Although available methods forautomatic ontology generation produce ontologies from agiven type of data such as gene networks [8] textual data [9]dictionary [10] and schemata [11 12] they do not contributeto the integration of different types of ontologies which willbring innovative results on annotationknowledge reuse andassociation studies To address this issue a number of studieshave been focused on ontology integration [13 14] and itsmedical domain applications [15] The ontology integrationmethods available and used in these works can be generallyclassified into three categories

Manually or Semiautomatic Setups In [6] the authors pre-sented a methodology for ontology integration by custom-tailored integration operations which include algebraic-specified 39 basic operations and 12 nonbasic operationsderived from them The authors identified a set of criteriasuch as modularize specialize and diversify each hierarchyfor guiding the knowledge integration In [16] the authorsdesigned a semiautomatic approach for ontology mergingThe ontology developers will be assisted by the system andguided to tasks needing their interventions

Using Machine Learning Methods Reference [17] describesan ontology mapping system GLUE that uses machinelearning techniques for building ontology mappings Specif-ically GLUE uses multiple learning strategies Each type ofinformation from the ontologies is handled by a different

learning strategy The authors demonstrate that GLUE workseffectively on taxonomy ontologies Similarly [18] also usedmultistrategy learning in matching pair identification How-ever the ontologies used in the experiments of [18] containless than 10 nodes Although [17] studied the integrationof larger ontologies in the experiments those ontologiescontain only around 34 to 176 nodesmuch smaller thanmanyontologies used in the biomedical field

Using Heuristic Approaches Many automatic ontology inte-gration methods [19 20] fall into this categoryThey performontology integrations by using heuristic approaches from dif-ferent perspectives For example [20] uses heuristic policiesfor selecting axioms and candidate merging pairs From aquite different angle [19] uses view-based query for guidingthe ontology integrations

Thesemethods have a fewmajorweaknesses including (1)lack of a systematicmeasurement to quantify the goodness forthe ontology integration (2) being generally heuristic withno theoretical results to show that the proposed integrationapproach is globally optimal or close to optimal (3) being notfor integrating large volume of ontologies These weaknessesmotivated us to develop efficient and near optimal solutionsfor integrating large ontology datasets

2 Methods

21 Integrating a Pair of Ontologies In this section wefocus on the basic problem of integrating two ontologies asformulated in Section 11 We will prove optimal structuresin the problem and propose an optimal and an efficientapproximation solution for this problem These solutions arealso the basis for solving the problem of integrating a largenumber of ontologies as described in Section 22

211 Brutal-Force and Heuristic Solutions Given Criteria(1) and (2) a brutal-force approach will pick up a best solu-tion fromall the solutions that startwith integration involvingat least one of the roots of the two ontology trees and iter-atively integrate their descendants Considering an extremecase where each ontology tree is a path of 119899 vertices we

4 BioMed Research International

push (0 0) into queue 119902 Integrating virtual roots of the two ontologieswhile |119902| gt 0 do(119886 119887) = 119901119900119901(119902)for 119894 isin children of 119886 on 119879

119860do

119898119886119909 119904119888119900119903119890 = minus1119905119900 119898119890119903119892119890 = 119873119880119871119871119905119900 119898119890119903119892119890 119903119900119900119905 = 119873119880119871119871for 119895 isin children of 119887 on 119879

119861do

if 119895 is chosen thencontinue The subtree rooted at 119895 can only be chosen once for integration as illustrated in Figure 3

elseidentify vertex 119896 in the subtree rooted at 119895 such that argmax

119896(119872119860119861(119894 119896)120573

distance(119896119895))

if 119872119860119861(119894 119896) gt 119898119886119909 119904119888119900119903119890 then

119898119886119909 119904119888119900119903119890 = 119872119860119861(119894 119896)

119905119900 119898119890119903119892119890 = 119896119905119900 119898119890119903119892119890 119903119900119900119905 = 119895

end ifend if

end forif 119905119900 119898119890119903119892119890 = 119873119880119871119871 thenbreak

elsesave merge pair (119894 119905119900 119898119890119903119892119890) in 119879

119860119861

mark 119905119900 119898119890119903119892119890 119903119900119900119905 as chosenpush (119894 119905119900 119898119890119903119892119890) into queue 119902

end ifend for

end whilereturn 119879

119860119861

Algorithm 1 HeuristicMerge(119879119860 119879119861119872119860119861)

conclude that such a brutal-force approach needs to pick upa best solution from an exponential number of solutionsThebrutal-force approach is clearly not acceptable for integratinglarge ontologies and may not even work for ontologies withonly a few dozens of vertices

A heuristic solution can be developed by following anidea similar to the above brutal-force approach Howeverinstead of trying all possibilities the heuristic solution willgreedilymerge vertices following the topological orderWhenselecting a matching vertex for vertex 119886 from ontology119879119860 the heuristic approach will greedily select a vertex 119887

from allowable candidates in 119879119861and iteratively apply such

selections to descendants of 119886 According to Criterion (1) if 119886is associated with 119887 then none of 119886rsquos descendants are allowedto be associated with vertices other than 119887rsquos descendants Inaddition if 119886rsquos child 1198861015840 is associated with 119887rsquos child 1198871015840 or itsdescendants then none of 119886rsquos other children are allowed tobe associated with 1198871015840 or its descendants any more Given thisa greedy choice may very easily end up in a local optimum bychoosing a best matching vertex at one step while denyingintegrating opportunities of other vertices that may lead to abetter final solution

It is easy to see that the deeper a vertex being chosen forintegration is the more integration opportunities are lost Toalleviate such a situation we propose a greedy approach byconsidering the relative depth (rdepth) of a chosen vertexwith regard to an allowable vertex closest to the root That is

given a vertex 119886 from 119879119860 a vertex 119887 from 119879

119861is chosen when

119872119860119861(119886 119887)120573

rdepth(119887) is maximized When 120573 = 1 the depthinformation does not take effective and when 120573 = infin eachvertex will only be associated with an allowable vertex closestto the root

Algorithm 1 describes the pseudocode of the heuristicintegration It starts by integrating virtual roots of the twoontologies After that the integration will be carried outiteratively from top to bottom by following Criterion (1) andthe heuristic strategy described above In the empirical studywe will see that the heuristic algorithm works better whenthe depth information is considered However in terms ofthe overall cohesion score it is no match for our optimal andapproximation solutions as described below

212 Optimal and Approximation Solutions By dividing theintegration of two trees into node merging and subtreeintegrations we have identified optimal structures in thebasic problem as stated by Lemmas 1 and 2 These optimalstructures make it possible for us to develop efficient algo-rithms (Algorithms 2 and 3) that achieve optimal or approx-imation solutions In the following we first describe thetwo important lemmas suggesting the optimal structures andtheir proofs before describing our proposed algorithms

Lemma 1 Let 119903119886be the root of tree119879

119860and 119903119887the root of tree119879

119861

Let119879119860minus119903119886

and119879119861minus119903119887

represent two sets of sub trees rooted at 119903119886of

BioMed Research International 5

Case (2a) Case (2b) Case (3a) Case (3b)

Case (1) Case (2) Case (3)

y2

y2

y2 y2 y2

y2

y1

y1

y1 y1y1

y1

Ty1

Ty2

x

xx

xx

x

ra rb

Txy1y2

tx

Figure 3 An illustration of three cases in the proof of Lemma 2

Sort vertices in 119879119860and 119879

119861in the topological order

for 119894 = |119881(119879119860)| to 0 do

for 119895 = |119881(119879119861)| to 0 do

119904119888119900119903119890 = 119872119860119861(119894 119895) + 119872119886119909119872119886119905119888ℎ(119879

119860minus119894 119879119861minus119895)

119904119888119900119903119890119886= 119872119886119909119872119886119905119888ℎ(119879

119860minus119894 119879119895)

119904119888119900119903119890119887= 119872119886119909119872119886119905119888ℎ(119879

119894 119879119861minus119895)

119888119900ℎ119890119904119894119900119899 119898119886119905119903119894119909(119894 119895) = max(119904119888119900119903119890 119904119888119900119903119890119886 119904119888119900119903119890

119887)

end forend forreturn 119888119900ℎ119890119904119894119900119899 119898119886119905119903119894119909

Algorithm 2 BuildCohesionMatrix(119879119860 119879119861)

119879119860and 119903119887of119879119861 respectively119879

119860minus119903119886does not include 119903

119886and119879119861minus119903119887

does not include 119903119887 One has 119892(119879

119860 119879119861) = max(119872

119860119861(119903119886 119903119887) +

119892(119879119860minus119903119886 119879119861minus119903119887) 119892(119879119860 119879119861minus119903119887) 119892(119879119860minus119903119886 119879119861))

Proof We can divide the integration of tree 119879119860with tree 119879

119861

into two cases according to the merging of their roots

(1) The roots of 119879119860and 119879

119861are merged together

(2) The roots of 119879119860and 119879

119861are not merged together

For case (1) it is clear that the cohesion score is119872119860119861(119903119886 119903119887)+

119892(119879119860minus119903119886 119879119861minus119903119887)

For case (2) we conclude that either 119879119860is integrated with

119879119861minus119903119887

(119903119887is out of integration) or 119879

119861is integrated with 119879

119860minus119903119886

(119903119886is out of integration) Otherwise we will have a merged

tree 119879119860119861

with two roots 119903119886and 119903

119887 a contradiction to the

fact that 119879119860119861

is a tree Therefore the cohesion score is either119892(119879119860 119879119861minus119903119887) or 119892(119879

119860minus119903119886 119879119861)

Combining cases (1) and (2) and according to Criterion(2) we have

119892 (119879119860 119879119861) = max (119872

119860119861(119903119886 119903119887) + 119892 (119879

119860minus119903119886 119879119861minus119903119887)

119892 (119879119860 119879119861minus119903119887) 119892 (119879

119860minus119903119886 119879119861))

(1)

Lemma 2 Let 119879119860minus119903119886

and 119879119861minus119903119887

represent two sets of treesobtained by removing the root vertices 119903

119886from 119879

119860minus119903119886and 119903119887

from 119879119861minus119903119887

One has

119892 (119879119860minus119903119886 119879119861minus119903119887) = max( sum

(119879119909 119879119910)isin119877

(119892 (119879119909 119879119910))) (2)

Here 119879119909isin 119879119860minus119903119886

119879119910isin 119879119861minus119903119887

and 119877 is a matching of trees in119879119860minus119903119886

with trees in 119879119861minus119903119887

Proof To prove this lemma we first prove that for any tree119879119909isin 119879119860minus119903119886

it can be integrated with no more than one treein 119879119861minus119903119887

We will prove this claim by contradiction Assumethere are two trees 119879

1199101isin 119879119861minus119903119887

and 1198791199102isin 119879119861minus119903119887

and theyintegrate with a tree119879

119909isin 119879119860minus119903119886

into an integrated tree11987911990911991011199102

There are three cases for the root 119903 of 119879

11990911991011199102 as illustrated in

Figure 3

(1) 119903 contains only the root of 119879119909

(2) 119903 contains only the root of 1198791199101or 1198791199102

(3) 119903 contains the root of 119879119909and the root of 119879

1199101or 1198791199102

For case (1) the lowest common ancestor of the roots of 1198791199101

and 1198791199102in the integrated tree 119879

11990911991011199102will no longer contain

their lowest common ancestor in 119879119861 a contradiction to

6 BioMed Research International

push (0 0119873119880119871119871) into queue 119902while |119902| gt 0 do(119886 119887 119888) = 119901119900119901(119902)if 119886 = minus1 thenmake 119879

119861(119887) as a subtree of 119879

119860119861rooted at 119887 119879

119861(119887) is a sub tree of 119879

119861rooted at 119887

else if 119887 = minus1 thenmake 119879

119860(119886) as a subtree of 119879

119860119861rooted at 119886 119879

119860(119886) is a sub tree of 119879

119860rooted at 119886

elselet 119889 = (119886 119887) and make 119889 a child of 119888 on 119879

119860119861

119904119888119900119903119890 = 119872119860119861(119886 119887) + 119872119886119909119872119886119905119888ℎ(119879

119860minus119886 119879119861minus119887)

119904119888119900119903119890119886= 119872119886119909119872119886119905119888ℎ(119879

119860minus119886 119879119887)

119904119888119900119903119890119887= 119872119886119909119872119886119905119888ℎ(119879

119886 119879119861minus119887)

if 119904119888119900119903119890 gt= 119904119888119900119903119890119886ampamp 119904119888119900119903119890 gt= 119904119888119900119903119890

119887then

push matching results of 119879119860minus119886 119879119861minus119887

with 119889 into 119902else if 119904119888119900119903119890

119886gt= 119904119888119900119903119890

119887then

push matching results of 119879119860minus119886 119879119887with 119889 into 119902

elsepush matching results of 119879

119886 119879119861minus119887

with 119889 into 119902end if

end ifend whilereturn 119879

119860119861

Algorithm 3 BuildNewOnto(cohesion matrix 119879119860 119879119861)

Criterion (1) For cases (2) and (3) the root of 1198791199102(or the root

of 1198791199101) will be the descendant of the root of 119879

1199101(or the

root of 1198791199102) in the integrated tree 119879

1199091199101 1199102 a contradiction to

Criterion (1) For integration involving more than two treesfrom 119879

119861minus119903119887 we can still follow the above procedure to reach

contradictions Thus the claim is provenWithout loss of generality we can see that for any

tree 119879119910isin 119879119861minus119903119887

it can be integrated with no more thanone tree in 119879

119860minus119903119886 Therefore the integration between 119879

119860minus119903119886

and 119879119861minus119903119887

corresponds to a matching in a weighted bipar-tite graph in which two sets of nodes represent treesfrom 119879

119860minus119903119886and 119879

119861minus119903119887 respectively and edges represent

corresponding cohesion scores According to Criterion (2)119892(119879119860minus119903119886 119879119861minus119903119887) = max(sum

(119879119909 119879119910)isin119877(119892(119879119909 119879119910))) and we con-

clude that 119892(119879119860minus119903119886 119879119861minus119903119887) corresponds to the weight of amax-

imum weighted matching in the above bipartite graph

Given Lemma 2 we can see that the following corollary iscorrect

Corollary 3 Define MaxMatch(119883 119884) = sum(119879119909 119879119910)isin119877

119892(119879119909 119879119910)

where 119877 is a maximum weighted matching of trees in forests119883 and 119884 given 119892(119879

119909 119879119910) for any tree pair (119879

119909 119879119910) isin 119883 times 119884

One concludes that for any two forests 119879119860and 119879

119861 119892(119879119860 119879119861) =

MaxMatch(119879119860 119879119861)

With Lemma 1 and Corollary 3 we are able to designan efficient dynamic programming algorithm achieving theglobal optimum for the ontology integration problem Thepseudocode for calculating the maximum cohesion score isdescribed in Algorithm 2 which visits ontology vertices inreverse topological order when filling up the cohesionmatrix

At the end of Algorithm 2 the cohesion matrix is filledup with optimal cohesion scores and the maximum cohesionscore is saved at entry (0 0) as described byTheorem 4

Theorem 4 It holds that cohesion matrix(119894 119895) = 119892(119879119860minus119894

119879119861minus119895) and cohesion matrix(0 0) = 119892(119879

119860 119879119861)

Proof We will prove this theorem by mathematical induc-tion

Let |119881(119879119860)| = 119899 and |119881(119879

119861)| = 119898 it is easy to see

that cohesion matrix(119899119898) is optimal because 119899 and 119898correspond to leaf nodes in the topological order Thus119879119860minus119899

and 119879119861minus119898

are empty sets and cohesion matrix(119899119898) =119872119860119861(119899119898) = 119892(119879

119899 119879119898)

When 119894 = 119899 and 119895 lt 119898 according to Lemma 1 to integrate119879119899with 119879

119895 either 119879

119899(the leaf vertex) is merged with 119879

119895or 119879119899

is integrated with 119879119861minus119895

The maximum score of integrating119879119899with 119879

119861minus119895is available at the time cohesion matrix(119899 119895) is

being calculated because of the reverse topological visitThusaccording to both Lemma 1 and Corollary 3 we concludethat 119892(119879

119899 119879119895) = max(119872

119860119861(119899 119895)MaxMatch(119879

119899 119879119861minus119895)) =

cohesion matrix(119899 119895) Similarly we can conclude that119892(119879119894 119879119898) = max(119872

119860119861(119894 119898)MaxMatch(119879

119860minus119894 119879119898)) =

cohesion matrix(119894 119898)When 119894 lt 119899 and 119895 lt 119898 again according to

Lemma 1 and Corollary 3 we have 119892(119879119894 119879119895) = max(score

score119886 score

119887) = cohesion matrix(119894 119895) Due to the reserve

topological visit MaxMatch(119879119860minus119894 119879119861minus119895) MaxMatch(119879

119860minus119894

119879119895) and MaxMatch(119879

119894 119879119861minus119895) are available at the time

cohesion matrix(119894 119895) is being calculated

Given the definition of 119892(119879119860 119879119861) Theorem 4 in fact

proves that the proposed approach achieves the global opti-mum However the global optimal solution is built upon

BioMed Research International 7

themaximumweightedmatching (recall Corollary 3) As dis-cussed in Section 213 the maximum weighted matching istime-consuming and therefore we propose an approximationsolution in that section

Although Algorithm 2 builds the cohesion matrix withoptimal cohesion scores it does not construct the integratedontological knowledge structure We may save the ontologyintegration details along with the cohesion scores Howeverthat will cost 119874(1198993) memory space (assuming each ontologyhas 119874(119899) vertices) and significantly reduce the capacity ofthe algorithm in handling large ontology integrations Quiteinterestingly we find that it is not necessary to save the inte-gration details in order to construct the integrated ontologyThe construction can be done by a process reverse to theconstruction of cohesionmatrix as described inAlgorithm 3

Algorithm 3 uses the cohesion matrix constructed byAlgorithm 2 and builds the integrated ontology tree still byfollowing Lemma 1 and Corollary 3 but in a reverse way ofAlgorithm 2 The construction is performed in a Breadth-First fashion which uses a queue 119902 to maintain triples Eachtriple (119886 119887 119888) is an association of three elements 119886 is thematched vertex from ontology 119879

119860 119887 is the matched vertex

from ontology 119879119861 and 119888 is their parent on the merged

ontology 119879119860119861 By following the basic idea of the proof of

Theorem 4 we can show that Algorithm 3 builds an optimalintegrated ontology with the cohesion matrix provided fromAlgorithm 2 We omit the proof for succinctness

213 Time Complexity Analysis and an Approximation Solu-tion Assume an ontology size is 119874(119899) The cohesion matrixhas 119874(1198992) entries to fill up The computation for each entryis a matching whose time complexity depends on the imple-mentation The maximum weighted matching takes 119874(1198993)using the famous Hungarian algorithm [21] and althoughit achieves optimum it is too costly for large ontologiesThe maximal weighted matching however takes 119874(1198992 log 119899)time and more importantly results in an overall 119874(1198992 log 119899)time complexity for Algorithm 2 The analysis is given inthe following For each matching the algorithm will accesspreviously filled entries and each entry will be accessed onlyonce and be involved only once in a sorting of 119874(log 119899)time This is because each entry corresponds to two verticeswhose cohesion score will be accessed when calculating thecohesion score of their parents Thus we conclude that thetotal time complexity of calculating the cohesionmatrix usingmaximal weightedmatching is119874(1198992+1198992 log 119899) = 119874(1198992 log 119899)Since building the new ontology has the same time com-plexity as building the cohesion matrix this is also the totaltime complexity for integrating two ontologies by maximalweightedmatchingThemaximal weightedmatching also hasa guaranteed lower bound on the results It achieves a (12)-approximation solution (ie the overall cohesion score willbe at least 12 of the optimal cohesion score) as pointed out in[22]

Since themaximalweightedmatching results in an overallgood performance on time complexity and approximationrate we used the maximal weighted matching in our empir-ical study for Algorithms 2 and 3 Readers may also choose

othermatching algorithms (such as the one described in [22])to achieve slightly better approximation rates However theweightedmatching is a replaceablemodule for our algorithmsand it is not the focus of this work to build a fast and close-to-optimal weighted matching algorithm

Compared to the time complexity of integrating ontolo-gies by the dynamic programming approach as describedin Algorithms 2 and 3 the heuristic approach described atthe beginning of this section also has 119874(1198992) in the worstcase However we conjecture that the heuristic approach hasa much smaller average time complexity because in eachstep the heuristic approach may exclude a large number ofmatching opportunities

22 Integrating Multiple Ontologies In the previous sectionwe proposedmethods for integrating two ontologies In somebiomedical applications [23 24] we are interested in theassociations involving more than two objects Integrationof multiple ontologies of these objects will generate aninnovative view on these complex relationships Similar tothe basic problem formulation we can formulate themultipleontology integration as follows

Given 119896 ontology trees 1198791 1198792 119879

119896and a closeness

matrix119872119894119895for any two trees 119879

119894and 119879

119895 how can we efficiently

generate an integrated ontology tree 11987912119896

meeting thefollowing criteria

(1) For any two vertices 119909 and 119910 in a tree 119879119894 their

lowest common ancestor LCA119879119894(119909 119910) is contained by

LCA11987912119896(119909 119910)

(2) It holds that argmax11987912119896119891(11987912119896

) =

sum119883isin119881(119879119860119861)

sum119906isin119883Visin119883120590(119906)lt120590(V)119872120590(119906)120590(V)(119906 V) Here

119872120590(119906)120590(V)(119906 V) is the entry value in the closeness

matrix for two vertices 119906 and V (one from tree 119879120590(119906)

and the other from tree 119879120590(V)) contained in the node

119883 For a vertex V from an original ontology 120590(V) isdefined as its original ontology ID

Again we name the function 11989111987912119896

the cohesion of theintegrated ontology 119879

12119896 For each node 119883 in the inte-

grated ontology we define its weight as 119908119890119894119892ℎ119905(119883) =sum119906isin119883Visin119883120590(119906)lt120590(V)119872120590(119906)120590(V)(119906 V) Correspondingly we define

function119892(1198791 1198792 119879

119896) = max

11987912119896(sum119883isin119881(11987912119896)

119908119890119894119892ℎ119905(119883))

as the maximum cohesion function for integrating theontologies 119879

1 1198792 119879

119896 As we can see in the above formula-

tion the overall cohesion score of integration is the summedweight of each node which is the sum of pairwise closenessscores

The formulation of multiple ontology integration is sim-ilar to the basic version and it is not difficult to showthat optimal structures described in Lemmas 1 and 2 canbe extended to a high dimension However the extensionof algorithms described in Section 212 for integrating twoontologies is not feasible for solving the multiple ontologyintegrationThis is because if we need to extend Algorithms 2and 3 to this problem we need to build a cohesionmatrix of 119896dimensions It implies that we need at least 119874(119899119896) operationsto fill up the score matrix assuming the size of an ontology

8 BioMed Research International

A B C D

A 10 6 8

B 10 7 5

C 6 7 9

D 8 5 9

AB C D

AB

C 9

D 9

lowastlowast

lowast

lowast mdash

mdash

mdash

mdash

mdash

mdash

mdash

(lowast) Entries to update

Figure 4 An example of the InterOntology matrixrsquos change at the first iteration in Algorithm 4 for integrating four ontologies

build the 119896 times 119896 InterOntology matrixfor 119894 = 1 to 119896 minus 1 doidentify the active tree pair ⟨119879

119883 119879119884⟩ that corresponds

to the highest score in the InterOntology matrixintegrate 119879

119883and 119879

119884into 119879

119883119884

mark 119879119883and 119879

119884as inactive

update relationship matricesupdate InterOntology matrix

end forreturn 119879

12119899

Algorithm 4 GreedyMultiInt(T = 1198791 1198792 119879

119896)

is 119874(119899) This is clearly not acceptable for high dimensionalontology integration

221 Greedy Approach From the above discussion we cansee that direct extension of Algorithms 2 and 3 for integratingtwo ontologies is practically not feasible for integrating alarge number of ontologies However we can still use thesealgorithms for integrating multiple ontologies by iterativelyintegrating two ontologies and generating a new closenessmatrix Given the ontologies 119879

1 1198792 119879

119896 we can first

integrate 1198791and 119879

2into 119879

12and then build the closeness

matrix between 11987912

and 1198793using the relationship matrices

between 1198791and 119879

2and between 119879

1and 119879

3 Specifically

assume 119883 is a node on the integrated ontology 11987912 and 119883

contains a vertex 119886 from 1198791and a vertex 119887 from 119879

2 Then

the entry (119883 119888) of the closeness matrix between 11987912

and 1198793

is 11987211987912 1198793(119883 119888) = 119872

1198791 1198793(119886 119888) + 119872

1198792 1198793(119887 119888) After the new

closeness matrix is generated we can continue integrating11987912

and 1198793into 119879

123and generating another new closeness

matrix We will eventually get the integrated ontology 11987912119896

by repeating the above process To facilitate the followingdiscussions we name the above approach the basic multipleintegration approach

Although the basic multiple integration approach canfinish integrating multiple ontologies it blindly integratesontologies without using any cohesion information betweenontologies that may lead to a better integration resultTo improve the basic multiple integration we propose agreedy approach that uses the cohesion information betweenontologies to guide the integration The basic steps of thegreedy approach are outlined in Algorithm 4 To facilitate

the understanding of Algorithm 4 we use Figure 4 toillustrate an example of the InterOntology matrixrsquos change atthe first iteration of integrating four ontologies A B C andD

The key idea in Algorithm 4 is to maintain an InterOn-tology matrix which guides the integration Initially thismatrix is filled with the overall cohesion score of everypair of ontologies In each step this matrix is updated withoverall cohesion scores between newly integrated ontologyand existing active ontologies The integration will take placebetween two active ontologies which have the highest scorein the InterOntology matrix

When we used the overall cohesion score between twoontologies to update the InterOntology matrix we observedan interesting phenomenon that the integration inmost casesis a process continuously expanding an integrated ontologyConsequently the greedy approach is likely to yield a resultsimilar to the basic approach

This phenomenon can be explained by the definition ofmaximum cohesion function which takes into account allpairwise closeness between merged terms Thus the moreontologies contained in an integrated ontology are the morelikely it will have high overall cohesion scores with otherontologies As a result it creates unfairness for the integrationselection To fix this issue we use the adjusted overall cohe-sion scores in updating the InterOntology matrix as follows

Given an ontology 119879119883and an ontology 119879

119884where 119883 and

119884 are nonempty sets of ontology IDs we define the adjustedcohesion score between 119879

119883and 119879

119884as 119860119889119895119862119900ℎ(119879

119883 119879119884) =

sum119885isin119881(119879119883119884)

sum119909isin119885119910isin119885120590(119909)isin119883120590(119910)isin119884

119872120590(119909)120590(119910)

(119909 119910)|119883||119884| where119879119883119884

is the integrated ontology built by Algorithms 2 and 3The adjusted cohesion score is in fact the weight increase byintegrating 119879

119883and 119879

119884 divided by the size of 119883 times the

size of 119884 that is119860119889119895119862119900ℎ(119879119883 119879119884) = sum

119885isin119881(119879119883119884)(119908119890119894119892ℎ119905(119885))minus

sum119883isin119881(119879119883)

(119908119890119894119892ℎ119905(119883)) minus sum119884isin119881(119879119884)

119908119890119894119892ℎ119905(119884)|119883||119884| For eachnode merging closeness scores will be added to the totalweight when the merging takes place between vertices fromontology set 119883 and vertices from ontology set 119884 Thus theweight increase by integrating 119879

119883and 119879

119884is proportional to

the number of ontologies in119883 times the number of ontologiesin 119884 and consequently we averaged the weight increase by|119883||119884|

222 Fast Approximation Algorithm Although the basicmultiple integration and the greedy multiple integrationapproaches discussed above are able to integrate multipleontologies none of them provide any guarantee on the results

BioMed Research International 9

A

A

7

10 8

9

6

5

9

D

C

B D

C

B

ge 7

ge8

Figure 5 An illustration of vertexedge contraction and weight updates in an iteration of Algorithm 5 Each vertex represents an ontology

for 119894 = 1 to 119896 minus 1 dofor 119895 = 119894 + 1 to 119896 dopush ⟨119892(119879

119894 119879119895) 119894 119895⟩ into a set 119878 ordered by the first element in descending order

end forend forwhile |119878| gt 0 do⟨119911 119909 119910⟩ = 119901119900119901(119878)if adding (119909 119910) does not form a cycle inG thenadding (119909 119910) toGintegrate 119879

119862(119909)and 119879

119862(119910)into 119879

119862(119909)cup119862(119910) 119862(V) is a set of vertices including V that form a connected component inG

end ifend whilereturn 119879

12119899

Algorithm 5 FastMultiInt(T = 1198791 1198792 119879

119896)

in comparison with the optimal solutions By studyingthe maximum cohesion scores between ontologies undera graph setting we identified an approximation structureand developed an approximation algorithm for integratingmultiple ontologiesWename it fast approximation algorithmbecause it not only has a lower bound on the results but alsoruns faster than the greedy multiple integration algorithmproposed above

The fast approximation algorithm for integratingmultipleontologies is sketched in Algorithm 5 It only calculates themaximum cohesion score between every pair of ontologiesonce during the initial stage and uses this informationthroughout the integration process even after it becomesstaleMore importantly this approach not only saves the timefor recalculating the maximum cohesion scores but also pro-vides a lower bound guarantee as stated inTheorem 5 whosecorrectness is built on two important lemmas (Lemmas 6 and7) which will be described subsequently

Theorem 5 The tree weight of the integrated tree 11987912119896

obtained by FastMultiInt algorithm is at least 1(119896 minus 1) theweight of the optimal solution

Proof We will use Lemmas 6 and 7 to prove this theoremThe proofs of Lemmas 6 and 7 are provided after the proofof this theorem To facilitate the proof of this theorem we

build a fully connected weighted graph G in which eachnode corresponds to a tree for integration and the weightof each edge corresponds to the weight increase (initiallythis is the cohesion score) for integrating the correspondingtrees According to Lemma 6 119892(119879

1 1198792 119879

119896) (ie optimal

cohesion score) is no more than the summed weight of edgesinG (Claim 1)

Given G the integration by Algorithm 5 is a processof 119896 minus 1 node contractions After each contraction theadjacent edge weights (cohesion scores) will be updatedaccordingly According to Lemma 7 the weight of an updatededge will only increase over (or at least remain the same as)the maximum weight of the two contracted edges (Figure 5provides an illustration of vertexedge contraction andweightupdates) Thus the overall cohesion score of the integrationby Algorithm 5 is no less than the weight of the maximumspanning tree ofG (Claim 2)

It is easy to see that the weight of a maximum spanningtree is no less than 1(119896minus1) of the summed edge weight ofGgiven the simple observation that each edge in G is either anedge of the maximum spanning tree or adjacent to an edge ofthe maximum spanning tree with an equal or heavier weight(Claim 3)

Combining Claims 1 2 and 3 we complete the proof ofthis theorem

Lemma 6 It holds that 119892(1198791 1198792 119879

119896) le sum1le119894lt119895le119896

119892(119879119894 119879119895)

10 BioMed Research International

Proof According to the problem definition

119892 (1198791 1198792 119879

119896)

= max( sum

119883isin119881(11987912119896)

119908119890119894119892ℎ119905 (119883))

= max( sum

119883isin119881(11987912119896)

sum

119906isin119883Visin119883120590(119906)lt120590(V)119872120590(119906)120590(V) (119906 V))

= ( sum

1le119894lt119895le119896

11989111987912119896

(119879119894119895)) le sum

1le119894lt119895le119896

119892 (119879119894 119879119895)

(3)

11989111987912119896(119879119894119895) is the cohesion score of 119879

119894119895whose integration is

induced from 11987912119896

Lemma 7 It holds that 119892(119879119875119876 119879119878) minus 119891(119879

119875119876) ge max(119892(119879

119875

119879119878) 119892(119879119876 119879119878))

Proof According to the problem definition integrating 119879119875119876

with 119879119878will result in an integrated tree 119879

119880where 119880 = 119875 cup

119876 cup 119878 and 119892(119879119875119876 119879119878) = max

119879119880(sum119883isin119881(119879119880)

119908119890119894119892ℎ119905(119883)) =

max119879119880(119891(119879119875119878) + 119891(119879

119876119878) + 119891(119879

119875119876)) where 119879

119875119878 119879119876119878 and

119879119875119876

are induced from 119879119880 Since 119879

119875119876has been determined

we have 119892(119879119875119876 119879119878) = max

119879119880(119891(119879119875119878) + 119891(119879

119876119878)) + 119891(119879

119875119876)

Without loss of generally let us assume max(119891(119879119875119878)) ge

max(119891(119879119876119880)) Then by restricting the integration between

119879119875and 119879

119878in 119879119880following the integration that leads to

argmax119879119875119878119891(119879119875119878) we will get a cohesion score no less than

119892(119879119875 119879119878) Thus we complete the proof for 119892(119879

119875119876 119879119878) minus

119891(119879119875119876) ge max(119892(119879

119875 119879119878) 119892(119879119876 119879119878))

223 Time Complexity Analysis For the fast approximationalgorithm (Algorithm 5) the time complexity for generatinggraph G (calculating the overall cohesion score for everypair of ontologies) is119874(11989621198992 log 119899) assumingwe usemaximalweighted matching Each integration will take 119874(1198992 log 119899)with an update of at most 119896 closeness matrices which takes119874(119896lowast119899

2) There are at most 119896 integrations therefore the total

time complexity is still 119874(11989621198992 log 119899)Following the above analysis we conclude that the time

complexity of the greedy multiple integration algorithm isthe same as the fast approximation algorithm However itrequires updating of InterOntology matrix which takes anexcessive 119874(11989621198992 log 119899) time The empirical study shows thatthe fast approximation algorithm is much faster than thegreedy multiple integration algorithm

Finally it is easy to see that the basic multiple integrationapproach takes 119874(1198961198992 log 119899) time and is the fastest but itsoverall cohesion scores are the worst as we will see in theempirical study

224 Limitations Integrating multiple ontologies may facetwo potential problems in real applications First how canwe efficiently generate a closeness matrix for every pair ofontologies to be integrated Our current method kDLS or

onGrid is efficient for generating the closenessmatrix for onepair of ontologies in most cases but not efficient enough forgenerating closeness matrices for many pairs of ontologiesSecond not every pair of ontologies can be meaningfullyintegrated It remains a problem to efficiently identify thefeasibility of integrating a pair of ontologies Therefore themain purpose of Section 22 is to demonstrate that ourproposed approach can be extended to integrate multipleontologies and we use synesthetic datasets in Section 33 tostudy the performance of algorithms proposed in Section 22

3 Results and Discussion

We would like to study the performances of the proposedontology integration methods by experiments on both realand synthetic datasets We implemented five approaches inC++

(1) Heuristic heuristic approach for integrating twoontologies as described in Section 211

(2) Approximate approximate approach (Algorithms 2and 3) with maximal weighted matching for guaran-teeing the (12)-approximation rate

(3) Basic basic multiple integration approach as de-scribed in Section 221

(4) Greedy greedy multiple integration approach(Algorithm 4)

(5) FastApproximate fast approximation multipleintegration approach (Algorithm 5)

In the following we report our study on the performancesof (1) and (2) for integrating two ontologies on real datasetsand (3) (4) and (5) for integrating multiple ontologies onsynthetic datasets All the experiments are carried out on aLinux cluster with 24GHz AMD Opteron processors

31 Integrating a Pair of Ontologies The knowledge of drug-gene relationships is desirable in many pharmacology appli-cations [25 26] By integrating the gene ontology and thedrug ontology we will be able to obtain rich information onthe associations between drugs and genes under the ontologystructures Thus in this set of experiments we simulate realworld knowledge discovery applications by integrating tworeal ontologies gene ontology (GO) and National Drug FileReference Terminology (NDFRT) Both were obtained fromthe Unified Medical Language System (version 2012AA)The closeness matrices between GO terms and drug termswere generated using onGrid [27] with a 4-neighborhoodbroadcast range (ie 119896 = 4 with regard to [7]) onGridfollows the kDLS approach [7] and measures the closenessbetween two concepts based on the discovered paths (withlength greater than one) between them However unlikekDLS onGrid takes into consideration of concept semantictypes in the closeness measurement In the study performedin [27] the advantages of onGrid over kDLS are wellillustrated

The overall cohesion scores of Heuristic andApproximate on integrating GO and NDFRT are listed

BioMed Research International 11

Table 1 Cohesion scores of integrating real datasets

Depth GO term number NDFRT term number Cohesion scoresHeuristic 120573 = 1 Heuristic 120573 = 6 Heuristic 120573 = 100 Approximate

3 66 6004 00505331 0229958 021999 1246964 710 6972 0290392 0284363 146585 938355 5355 14582 0285923 137714 0528289 3390566 16231 32841 0307941 0341588 0673406 740293

Table 2 Top 5 matched terms by the Approximate algorithm (depth = 6)

Rank GO terms NDFRT terms Closeness score1 C1135918 smooth muscle contractile fiber C0282606 muscle neoplasms 1632052 C0010813 cytokinesis C0086376 GTP-binding proteins 1159673 C0027747 axon terminus C0030584 parovarian cyst 1133524 C1155065 T cell activation C0007082 carcinoembryonic antigen 1008795 C0007595 cell growth C0294028 BRCA2 protein 0945284

Table 3 Top 5 matched terms by the Heuristic algorithm (depth = 6 120573 = 100)

Rank GO terms NDFRT terms Closeness score1 C0031845 biological process C0042890 VITAMINS 02448622 C1166607 cellular component C1657248 apoptosome 01428573 C0027540 tissue death C0065932 MENADIOL 004342194 C0025519 metabolic process C0042849 VITAMIN B 003702485 C0030012 oxidation-reduction process C0027996 NICOTINIC ACID 00327109

in Table 1 To observe the integration over the ontologysize change in each experiment we use the ontology treestructure from the root to the specified depth (first columnin Table 1) for integration The sizes of the ontology termsinvolved in the integration are listed in the second and thirdcolumns of Table 1

Recall in Section 211 120573rdepth(119887) is used to regulate theselection of vertices from high depths Thus we tested theHeuristic under 120573 = 1 (depth information is nullified)and 120573 = 100 (the vertex depth plays a critical role inthe selection) Since nonleaf vertices in these datasets havearound 6 children on average we heuristically add a set ofexperiments by setting 120573 = 6 so that 120573rdepth(119887) will be close tothe number of vertices excluded from the future integration

From Table 1 we can see that Heuristic performs betterwhen using the depth information to regulate the selectionof vertices However Approximate is much better thanHeuristic at all settings Compared to the best cohesionscores of Heuristic in each row of Table 1 Approximateconstructs an integrated ontology with the overall cohe-sion score ranging from 54 times to 1099 times that ofHeuristicThis clearly demonstrates the effectiveness of theproposed Approximate approach Nevertheless the heuris-tic approach has a much faster average running time as aresult of excluding a large number of matching opportunitiesin each step

Although the running time of Approximate is longerthan Heuristic it takes less than two hours to finishintegrating two ontologies with about 16119896 and 33119896 vertices

Most of the biomedical ontologies are smaller than or similarto these sizes and Approximate approach will benefit theassociation study of these ontologies For extremely largeontology pairs in which Approximate is unable to finish theintegrationwithin a reasonable time Heuristicmay providea quick view on their integration

32 Understanding the Merged Ontology Terms To under-stand what terms are merged in integrating real ontologieswe use the integration of GO and NDFRT at depth 6 asan example Tables 2 and 3 list the top 5 pairs of mergedterms (sorted by their closeness scores) by Heuristicand Approximate respectively As mentioned abovethese scores are from the closeness matrix generatedby onGrid based on the discovered paths betweenthem For example ldquoC1155065T-Cell Activation minus minusis physiologic effect of chemical or drug minus minus gt C0393002Carcinoembryonic Antigen Peptide 1 minus minus has target minus minus gtC0007082Carcinoembryonic Antigenrdquo is such a path

From Tables 2 and 3 we can observe that theApproximate algorithm merges terms with much highersimilar scores than the Heuristic algorithm Quiteinterestingly we observed that the top ranked merging inTable 3 is between ldquobiological processrdquo and ldquoVITAMINSrdquoThe ldquobiological processrdquo is an abstract term which is veryclose to the root of the GO ontology Such a fact suggests thatthe top level terms will likely preempt the merging choicesover their descendants As a result of this greedy approach

12 BioMed Research International

0

2000

4000

6000

8000

10000

12000

14000

16000

18000

100 200 300 400 500 600 700 800 900 1000

Ove

rall

cohe

sion

scor

e of i

nteg

ratio

n

Individual ontology size

FastApproximateBasic

Greedy

Figure 6 The change of overall cohesion score over the increase ofthe size of each ontology The number of ontologies is fixed at 10

0

100000

200000

300000

400000

500000

600000

700000

800000

10 20 40 70 80 90 100

Ove

rall

cohe

sion

scor

e of i

nteg

ratio

n

Number of ontologies 30 50 60

FastApproximateBasic

Greedy

Figure 7 The change of overall cohesion score over the increase ofthe number of ontologies The size of each ontology is fixed at 100

the Heuristic algorithm will end at a local optimum whichis far from being optimal

A snapshot of ontology integration by Approximate asshown in Figure 10 provides a good insight on the algorithmwork In each bracket of two merged terms the left partis the closeness score and the right part is the cohesionscore We can observe that most closeness scores are zero orclose to zero while the corresponding cohesion scores aremuch higher This is understandable because the snapshot isprimarily on the top level terms of both ontologies For theseterms they have a large number of subclass (descendant)terms and optimizing the integration of their subclass termsfar outweighs integrating of themselves The result of such

0

200

400

600

800

1000

1200

1400

1600

100 200 300 400 500 600 700 800 900 1000

Inte

grat

ion

time (

s)

Individual ontology size

FastApproximateBasic

Greedy

Figure 8 The change of integration time over the increase of thesize of each ontology The number of ontologies is fixed at 10

0

500

1000

1500

2000

2500In

tegr

atio

n tim

e (s)

10 20 40 70 80 90 100Number of ontologies

30 50 60

FastApproximateBasic

Greedy

Figure 9 The change of integration time over the increase of thenumber of ontologies The size of each ontology is fixed at 100

integration provides novel knowledge of association betweenontology terms That is even if two terms are not that closeaccording to some closeness measurement they can be struc-turally associated under their ontology context For examplethe GO term ldquobiological processrdquo is merged with the NDFRTterm ldquochemical ingredientsrdquo even their closeness score iszero from the onGrid output However such integrationis interesting because it shows that the merging is tryingto link the chemical compounds with the biologicalcellularprocesses so that corresponding associations between thecellular processes and chemical structures can be establishedThis demonstrates the purpose of integrating two ontologiesthat is identifying associations with respect to both termsimilarities and structural contexts

BioMed Research International 13

component

Biologicalprocess Cellular

SignalingNeoplasms

National Drug File- Reference Terminology

Chemical ingredients

Diseases manifestations or physiologic states

CarbohydratesResponse to

stimulusOrganic chemicals

Cell part Extracellular region part

Skin and connective

tissue diseases

[0 7403]

[0 4833] [0 2315]

[00042 262] [00016 215][014 899] [0 307]

Amino acids peptides and

proteins

Cellular process Proteins

Cell adhesion involved in heart morphogenesis

Zyxin

Synapse Nervous system diseases

Asymmetric synapse Brain

neoplasmsSymmetric

synapse Granulomatousdisease chronic

[0036 2009]

[0023 0023]

[0034 0034] [0023 0023] [0006 0006]

[0098 062]

ontology

middot middot middot

middot middot middot

middot middot middot middot middot middot middot middot middot

middot middot middot middot middot middot middot middot middot

middot middot middotmiddot middot middot

Gene

NeuronalRNA

granule

Traumanervoussystem

Figure 10 A snapshot of ontology integration between GO and NDFRT by Approximate (depth = 6) Blue nodes are terms from GO andred nodes are terms from NDFRT Solid lines are edges and dashed lines are paths consisting of 1 or more edges In each bracket the left partis the closeness score and the right part is the cohesion score

In fact there are multiple studies to justify the struc-tural associations seen in Figure 10 such as the associationbetween ldquosignalingrdquo and ldquocarbohydratesrdquo [28] and the asso-ciation between ldquoextracellular region partrdquo and ldquoskin andconnective tissue diseasesrdquo [29]

In addition we have noticed a number of meaningfulintegrations betweenGO terms and neurological terms in theNDFRT For example synapse is a brain related structure andthe term ldquosymmetric synapserdquo is associated with ldquotraumardquoand the term ldquoasymmetric synapserdquo is associated with ldquobrainneoplasmsrdquo Similarly it is reasonable to see that ldquoneuronalRNA granulerdquo is integrated with ldquogranulomatous diseaserdquo agranule associated disease As another example it is veryinteresting to notice that ldquozyxinrdquo is associated with ldquocelladhesion involved in heart morphogenesisrdquo and that providesa link with the formation of heart

The above observations suggest a novel way of using ourontology integration method to perform association studiesbetween biomedical concepts

33 Integrating Multiple Ontologies In the following exper-iments we will study the performances of Basic Greedyand FastApproximate in integrating multiple ontologiesAll the three approaches are built upon Approximate whichperforms very well in the previous study for integrating tworeal ontology datasets

We use two sets of synthetic datasets in this study In thefirst set of datasets we fix the number of ontologies to be 10and vary the size of each ontology from 100 to 1000 In thesecond set of datasets we fix the size of each ontology to be100 and vary the number of ontologies from 10 to 100 All theontologies are randomly generated by constructing aminimalspanning tree from a randommatrixThe relationshipmatrixbetween every pair of ontologies is also randomly generatedwith entry values ranging from 0 to 1 For each experiment

we generate 10 randomdatasets and the results reported in thefollowing are the average results over the 10 random datasets

The overall cohesion scores of the three approachesover different ontology sizes and over different numbers ofontologies were reported in Figures 6 and 7 respectivelyFastApproximate outperforms all the other approachesin Figure 6 which is consistent with the analysis of itsapproximation rate However Greedy slightly outperformsFastApproximate in Figure 7 especially when the ontologynumber is large This is understandable because when thenumber of ontology (119896) increases the approximation rate (asstated in Theorem 5) decreases and becomes less significantThis result also justifies the choice of adjusted cohesion scorefor Greedy as described at the end of Section 221

The integration time of the three approaches over differ-ent ontology sizes and over different numbers of ontologieswas reported in Figures 8 and 9 respectively These figuresare consistent with the time complexity analysis given inSection 223 In particular we noticed that the integrationtime of Greedy deteriorates sharply over the ontologynumber increase In contrast FastApproximate is muchmore scalable and has a time curve similar to Basic

These results suggest that FastApproximate has the bestoverall performance in integrating multiple ontologies

4 Conclusions

In this work we started with a basic problem on integrating apair of ontology tree structures with a given closeness matrixand later we advanced the basic problem to the problem ofintegrating large number of ontologies We proved optimalstructures in the basic problem and developed both optimaland efficient approximation solutions Although the multipleontology integration problem has similar optimal structuresit is not feasible to extend the optimal and efficient approx-imation solutions for the basic problem to efficiently handle

14 BioMed Research International

multiple ontology integration To tackle the challenge of inte-grating a large number of ontologies we developed both aneffective greedy approach and a fast approximation approachThe empirical study not only confirms our analysis on theefficiency of the proposedmethod but also demonstrates thatourmethod can be used effectively for biomedical associationstudies

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

References

[1] D L McGuinness F van Harmelen andM K Smith Owl webontology language overview W3C recommendation 10(2004-03)10 2004

[2] O Bodenreider ldquoThe unifiedmedical language system (UMLS)integrating biomedical terminologyrdquo Nucleic Acids Researchvol 32 supplement 1 pp D267ndashD270 2004

[3] X Wu R Jiang M Q Zhang and S Li ldquoNetwork-based globalinference of human disease genesrdquo Molecular Systems Biologyvol 4 no 1 2008

[4] B Linghu E S Snitkin Z Hu Y Xia and C DeLisi ldquoGenome-wide prioritization of disease genes and identification ofdisease-disease associations from an integrated human func-tional linkage networkrdquo Genome Biology vol 10 no 9 articleR91 2009

[5] D Botstein and N Risch ldquoDiscovering genotypes underlyinghumanphenotypes past successes formendelian disease futureapproaches for complex diseaserdquo Nature Genetics vol 33 pp228ndash237 2003

[6] H S Pinto and J P Martins ldquoA methodology for ontologyintegrationrdquo in Proceedings of the 1st International Conferenceon Knowledge Capture (K-CAP rsquo01) pp 131ndash138 ACM 2001

[7] Y Xiang K Lu S L James T B Borlawsky K Huang and P RO Payne ldquoK-Neighborhood decentralization a comprehensivesolution to index the UMLS for large scale knowledge discov-eryrdquo Journal of Biomedical Informatics vol 45 no 2 pp 323ndash336 2012

[8] J Dutkowski M Kramer M A Surma et al ldquoA gene ontologyinferred from molecular networksrdquo Nature Biotechnology vol31 no 1 pp 38ndash45 2013

[9] R Navigli P Velardi and A Gangemi ldquoOntology learningand its application to automated terminology translationrdquo IEEEIntelligent Systems vol 18 no 1 pp 22ndash31 2003

[10] J Jannink andGWiederhold ldquoThesaurus entry extraction froman on-line dictionaryrdquo in Proceedings of the Fusion ConferenceSunnyvale Calif USA 1999

[11] C Papatheodorou A Vassiliou and B Simon ldquoDiscovery ofontologies for learning resources using word-based clusteringrdquoin World Conference on Educational Multimedia Hypermediaand Telecommunications pp 1523ndash1528 2002

[12] D L Rubin M Hewett D E Oliver T E Klein and RB Altman ldquoAutomating data acquisition into ontologies frompharmacogenetics relational data sources using declarativeobject definitions and xmlrdquoPacific SymposiumonBiocomputingpp 88ndash99 2001

[13] N F Noy ldquoSemantic integration a survey of ontology-basedapproachesrdquo ACM SIGMOD Record vol 33 no 4 pp 65ndash702004

[14] S Abels L Haak and A Hahn ldquoIdentification of commonmethods used for ontology integration tasksrdquo in Proceedingsof the 1st International Workshop on Interoperability of Het-erogeneous Information Systems (IHIS rsquo05) pp 75ndash78 ACMNovember 2005

[15] A Gangemi D Pisanelli and G Steve ldquoOntology integrationexperiences withmedical terminologiesrdquo in Formal Ontology inInformation Systems vol 46 pp 98ndash94 IOS Press AmsterdamThe Netherlands 1998

[16] N F Noy and M A Musen ldquoSMART automated support forontology merging and alignmentrdquo in Proceedings of the 12thWorkshop on Knowledge Acquisition Modelling and Manage-ment (KAW 99) Banff Canada 1999

[17] A Doan J Madhavan P Domingos and A Halevy ldquoLearningtomap between ontologies on the semantic webrdquo in Proceedingsof the 11th International Conference onWorldWideWeb (WWWrsquo02) pp 662ndash673 ACM May 2002

[18] J Xie F Liu and S-U Guan ldquoTree-structure based ontologyintegrationrdquo Journal of Information Science vol 37 no 6 pp594ndash613 2011

[19] D Calvanese G De Giacomo andM Lenzerini ldquoA frameworkfor ontology integrationrdquo inTheEmerging SemanticWebSelectedPapers from the First Semantic Web Working Symposium pp201ndash214 2002

[20] O Udrea L Getoor and R J Miller ldquoLeveraging data andstructure in ontology integrationrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo07) pp 449ndash460 ACM June 2007

[21] H W Kuhn ldquoThe Hungarian method for the assignmentproblemrdquoNaval Research Logistics Quarterly vol 1-2 pp 83ndash971955

[22] D E Drake Vinkemeier and S Hougardy ldquoA linear-timeapproximation algorithm for weighted matchings in graphsrdquoACMTransactions on Algorithms vol 1 no 1 pp 107ndash122 2005

[23] A J J Wood W E Evans and H L McLeod ldquoPharmacoge-nomicsmdashdrug disposition drug targets and side effectsrdquo TheNew England Journal of Medicine vol 348 no 6 pp 538ndash5492003

[24] R B Altman ldquoPharmgkb a logical home for knowledge relatinggenotype to drug response phenotyperdquoNature Genetics vol 39no 4 article 426 2007

[25] R B Kim D OrsquoShea and G R Wilkinson ldquoInterindivid-ual variability of chlorzoxazone 6-hydroxylation in men andwomen and its relationship to CYP2E1 genetic polymorphismsrdquoClinical Pharmacology amp Therapeutics vol 57 no 6 pp 645ndash655 1995

[26] T N Ferraro and R J Buono ldquoThe relationship between thepharmacology of antiepileptic drugs and human gene variationan overviewrdquo Epilepsy and Behavior vol 7 no 1 pp 18ndash36 2005

[27] A Albin X Ji T B Borlawsky et al ldquoEnabling online studiesof conceptual relationships between medical terms developingan efficient web platformrdquo JMIR Medical Informatics vol 2 no2 article e23 2014

[28] S Chandrasekaran J W Dean III M S Giniger and M LTanzer ldquoLaminin carbohydrates are implicated in cell signal-ingrdquo Journal of Cellular Biochemistry vol 46 no 2 pp 115ndash1241991

[29] J Uitto and D Kouba ldquoCytokine modulation of extracellularmatrix gene expression relevance to fibrotic skin diseasesrdquoJournal of Dermatological Science vol 24 pp S60ndashS69 2000

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 2: Research Article Building Integrated Ontological Knowledge

2 BioMed Research International

Disease

Disease

Infectious disease

Infectious disease

Cancer

Cancer

Skin cancer

Skin cancer

Flu

Flu

Hepatitis B

Hepatitis B

Medicine

Medicine

Radiation

Radiation Vaccination

Vaccination

Brachy-therapy

Brachy-therapy

Radio-isotape

Radio-isotape

Influenza vaccine

Influenza vaccine

HBV vaccine

HBV vaccine

Thyroid cancer

Thyroid cancer

middot middot middot

middot middot middot

middot middot middot

Figure 1 A simple example of integrating two hypothetical ontologies

similar to those of term 119887 (119887 from ontology 119861)Then 119886may bea good choice to be associated with 119887

Early studies on ontology integration relied on domainexperts to manually set up the integration rules [6] Howeverthis approach cannot meet the ever increasing volume ofontology datasets Automatic ontology integration methodshave been developed to address this issue However as wewill see in the discussions of Section 13 these methods areoften heuristic or have not demonstrated the effectivenessin integrating a large volume of ontology datasets Thusour goal is to develop an ontology integration method thatis able to deliver optimal or close-to-optimal solutions forintegrating a large volume of ontology datasets (particularlyfrom the biomedical domain) As discussed above we focuson ontologies with hierarchical tree-like structures whichare often available in the biomedical domain In additionwe assume the ontology term closeness measurement isavailable This assumption is reasonable because many appli-cations are able to identify ontology term similarities viaadditional data sources For example a closeness matrixbetween two sets of biomedical terms can be generated byusingUMLS knowledge discoverymethods such as kDLS [7]Our problem can be formally described as follows

11 Problem Formulation The basic ontology integrationproblem in our work can be formulated as follows Givenontology tree structures 119879

119860and 119879

119861and a closeness matrix

119872119860119861 how can we efficiently generate an integrated ontology

tree structure 119879119860119861

meeting the following two basic criteria

(1) For any two vertices 119909 and 119910 in 119879119860(or 119879119861) the lowest

common ancestor LCA119879119860(119909 119910) (or LCA

119879119861(119909 119910)) is

contained by LCA119879119860119861(119909 119910)

(2) It holds that argmax119879119860119861119891(119879119860119861) = sum

119883isin119881(119879119860119861)119872119860119861(119883)

Here 119872119860119861(119883) is the entry value in the closeness

matrix for the corresponding two vertices (one from

119879119860and the other from 119879

119861) contained in the node 119883

119872119860119861(119883) = 0 if 119883 a node of 119879

119860119861 contains only one

vertex from 119879119860or 119879119861

We name 119891119879119860119861

the cohesion function of the integrated ontol-ogy119879119860119861

and its value is the overall cohesion score of integrat-ing 119879119860and 119879

119861into 119879

119860119861 Correspondingly we define function

119892(119879119860 119879119861) = max

119879119860119861(sumVisin119881(119879119860119861)119872119860119861(V)) as the maximum

cohesion function for integrating the ontologies 119879119860and 119879

119861

and its value is the maximum overall cohesion score (orsimplymaximum cohesion score) In a hierarchical ontologythe common part of any two terms can be described by theirlowest common ancestor For example in Figure 1 flu andhepatitis B are both infectious diseases and flu and cancerare both diseases Thus we use Criterion (1) to ensure thatthe basic logic of an ontology is preserved after integration

An example of integrating two hypothetical ontologiesthat satisfy Criterion (1) is given in Figure 1 To facilitate theunderstanding of our problem definition we also provideanother example of integrating two ontologies in Figure 2 Aswe can see in Figure 2 the lowest common ancestor of nodescontaining thyroid cancer and infectious disease is the nodecontaining cancer instead of disease We conclude that theintegration is a violation of Criterion (1) In fact we can easilysee that there aremultiple pairs of nodeswith incorrect lowestcommon ancestors in Figure 2

In Section 22 wewill extend the basic problemdefinitionto handle the integration of multiple (gt2) ontologiesThe twobasic criteria will be extended correspondingly

12 Main Contributions We made the following major con-tributions in this work

(i) We proposed a novel ontology integration problemthat optimizes the cohesion function We identifiedoptimal structures in this problem and developed

BioMed Research International 3

Disease

Infectious disease

Cancer

Skin cancer

Flu

Hepatitis B

Medicine

Radiation Vaccination

Brachy-therapy

Radio-isotape

Influenza vaccine

HBV vaccine

Thyroid cancer middot middot middot

Figure 2 Another example of integrating the two ontologies in Figure 1 This integration violates Criterion (1)

optimal as well as efficient approximation solutionsfor this problem

(ii) We extended the basic problem to handle the integra-tion of large number of ontologies and we developedboth greedy and fast approximation algorithms forthe extended problem

(iii) We studied the proposed algorithms on both real andsynthetic datasets and confirmed their effectiveness inintegrating large volume of ontology datasets

13 Related Work Automatic ontology generation and inte-gration are desirable in many applications and have beenstudied in the past decade Although available methods forautomatic ontology generation produce ontologies from agiven type of data such as gene networks [8] textual data [9]dictionary [10] and schemata [11 12] they do not contributeto the integration of different types of ontologies which willbring innovative results on annotationknowledge reuse andassociation studies To address this issue a number of studieshave been focused on ontology integration [13 14] and itsmedical domain applications [15] The ontology integrationmethods available and used in these works can be generallyclassified into three categories

Manually or Semiautomatic Setups In [6] the authors pre-sented a methodology for ontology integration by custom-tailored integration operations which include algebraic-specified 39 basic operations and 12 nonbasic operationsderived from them The authors identified a set of criteriasuch as modularize specialize and diversify each hierarchyfor guiding the knowledge integration In [16] the authorsdesigned a semiautomatic approach for ontology mergingThe ontology developers will be assisted by the system andguided to tasks needing their interventions

Using Machine Learning Methods Reference [17] describesan ontology mapping system GLUE that uses machinelearning techniques for building ontology mappings Specif-ically GLUE uses multiple learning strategies Each type ofinformation from the ontologies is handled by a different

learning strategy The authors demonstrate that GLUE workseffectively on taxonomy ontologies Similarly [18] also usedmultistrategy learning in matching pair identification How-ever the ontologies used in the experiments of [18] containless than 10 nodes Although [17] studied the integrationof larger ontologies in the experiments those ontologiescontain only around 34 to 176 nodesmuch smaller thanmanyontologies used in the biomedical field

Using Heuristic Approaches Many automatic ontology inte-gration methods [19 20] fall into this categoryThey performontology integrations by using heuristic approaches from dif-ferent perspectives For example [20] uses heuristic policiesfor selecting axioms and candidate merging pairs From aquite different angle [19] uses view-based query for guidingthe ontology integrations

Thesemethods have a fewmajorweaknesses including (1)lack of a systematicmeasurement to quantify the goodness forthe ontology integration (2) being generally heuristic withno theoretical results to show that the proposed integrationapproach is globally optimal or close to optimal (3) being notfor integrating large volume of ontologies These weaknessesmotivated us to develop efficient and near optimal solutionsfor integrating large ontology datasets

2 Methods

21 Integrating a Pair of Ontologies In this section wefocus on the basic problem of integrating two ontologies asformulated in Section 11 We will prove optimal structuresin the problem and propose an optimal and an efficientapproximation solution for this problem These solutions arealso the basis for solving the problem of integrating a largenumber of ontologies as described in Section 22

211 Brutal-Force and Heuristic Solutions Given Criteria(1) and (2) a brutal-force approach will pick up a best solu-tion fromall the solutions that startwith integration involvingat least one of the roots of the two ontology trees and iter-atively integrate their descendants Considering an extremecase where each ontology tree is a path of 119899 vertices we

4 BioMed Research International

push (0 0) into queue 119902 Integrating virtual roots of the two ontologieswhile |119902| gt 0 do(119886 119887) = 119901119900119901(119902)for 119894 isin children of 119886 on 119879

119860do

119898119886119909 119904119888119900119903119890 = minus1119905119900 119898119890119903119892119890 = 119873119880119871119871119905119900 119898119890119903119892119890 119903119900119900119905 = 119873119880119871119871for 119895 isin children of 119887 on 119879

119861do

if 119895 is chosen thencontinue The subtree rooted at 119895 can only be chosen once for integration as illustrated in Figure 3

elseidentify vertex 119896 in the subtree rooted at 119895 such that argmax

119896(119872119860119861(119894 119896)120573

distance(119896119895))

if 119872119860119861(119894 119896) gt 119898119886119909 119904119888119900119903119890 then

119898119886119909 119904119888119900119903119890 = 119872119860119861(119894 119896)

119905119900 119898119890119903119892119890 = 119896119905119900 119898119890119903119892119890 119903119900119900119905 = 119895

end ifend if

end forif 119905119900 119898119890119903119892119890 = 119873119880119871119871 thenbreak

elsesave merge pair (119894 119905119900 119898119890119903119892119890) in 119879

119860119861

mark 119905119900 119898119890119903119892119890 119903119900119900119905 as chosenpush (119894 119905119900 119898119890119903119892119890) into queue 119902

end ifend for

end whilereturn 119879

119860119861

Algorithm 1 HeuristicMerge(119879119860 119879119861119872119860119861)

conclude that such a brutal-force approach needs to pick upa best solution from an exponential number of solutionsThebrutal-force approach is clearly not acceptable for integratinglarge ontologies and may not even work for ontologies withonly a few dozens of vertices

A heuristic solution can be developed by following anidea similar to the above brutal-force approach Howeverinstead of trying all possibilities the heuristic solution willgreedilymerge vertices following the topological orderWhenselecting a matching vertex for vertex 119886 from ontology119879119860 the heuristic approach will greedily select a vertex 119887

from allowable candidates in 119879119861and iteratively apply such

selections to descendants of 119886 According to Criterion (1) if 119886is associated with 119887 then none of 119886rsquos descendants are allowedto be associated with vertices other than 119887rsquos descendants Inaddition if 119886rsquos child 1198861015840 is associated with 119887rsquos child 1198871015840 or itsdescendants then none of 119886rsquos other children are allowed tobe associated with 1198871015840 or its descendants any more Given thisa greedy choice may very easily end up in a local optimum bychoosing a best matching vertex at one step while denyingintegrating opportunities of other vertices that may lead to abetter final solution

It is easy to see that the deeper a vertex being chosen forintegration is the more integration opportunities are lost Toalleviate such a situation we propose a greedy approach byconsidering the relative depth (rdepth) of a chosen vertexwith regard to an allowable vertex closest to the root That is

given a vertex 119886 from 119879119860 a vertex 119887 from 119879

119861is chosen when

119872119860119861(119886 119887)120573

rdepth(119887) is maximized When 120573 = 1 the depthinformation does not take effective and when 120573 = infin eachvertex will only be associated with an allowable vertex closestto the root

Algorithm 1 describes the pseudocode of the heuristicintegration It starts by integrating virtual roots of the twoontologies After that the integration will be carried outiteratively from top to bottom by following Criterion (1) andthe heuristic strategy described above In the empirical studywe will see that the heuristic algorithm works better whenthe depth information is considered However in terms ofthe overall cohesion score it is no match for our optimal andapproximation solutions as described below

212 Optimal and Approximation Solutions By dividing theintegration of two trees into node merging and subtreeintegrations we have identified optimal structures in thebasic problem as stated by Lemmas 1 and 2 These optimalstructures make it possible for us to develop efficient algo-rithms (Algorithms 2 and 3) that achieve optimal or approx-imation solutions In the following we first describe thetwo important lemmas suggesting the optimal structures andtheir proofs before describing our proposed algorithms

Lemma 1 Let 119903119886be the root of tree119879

119860and 119903119887the root of tree119879

119861

Let119879119860minus119903119886

and119879119861minus119903119887

represent two sets of sub trees rooted at 119903119886of

BioMed Research International 5

Case (2a) Case (2b) Case (3a) Case (3b)

Case (1) Case (2) Case (3)

y2

y2

y2 y2 y2

y2

y1

y1

y1 y1y1

y1

Ty1

Ty2

x

xx

xx

x

ra rb

Txy1y2

tx

Figure 3 An illustration of three cases in the proof of Lemma 2

Sort vertices in 119879119860and 119879

119861in the topological order

for 119894 = |119881(119879119860)| to 0 do

for 119895 = |119881(119879119861)| to 0 do

119904119888119900119903119890 = 119872119860119861(119894 119895) + 119872119886119909119872119886119905119888ℎ(119879

119860minus119894 119879119861minus119895)

119904119888119900119903119890119886= 119872119886119909119872119886119905119888ℎ(119879

119860minus119894 119879119895)

119904119888119900119903119890119887= 119872119886119909119872119886119905119888ℎ(119879

119894 119879119861minus119895)

119888119900ℎ119890119904119894119900119899 119898119886119905119903119894119909(119894 119895) = max(119904119888119900119903119890 119904119888119900119903119890119886 119904119888119900119903119890

119887)

end forend forreturn 119888119900ℎ119890119904119894119900119899 119898119886119905119903119894119909

Algorithm 2 BuildCohesionMatrix(119879119860 119879119861)

119879119860and 119903119887of119879119861 respectively119879

119860minus119903119886does not include 119903

119886and119879119861minus119903119887

does not include 119903119887 One has 119892(119879

119860 119879119861) = max(119872

119860119861(119903119886 119903119887) +

119892(119879119860minus119903119886 119879119861minus119903119887) 119892(119879119860 119879119861minus119903119887) 119892(119879119860minus119903119886 119879119861))

Proof We can divide the integration of tree 119879119860with tree 119879

119861

into two cases according to the merging of their roots

(1) The roots of 119879119860and 119879

119861are merged together

(2) The roots of 119879119860and 119879

119861are not merged together

For case (1) it is clear that the cohesion score is119872119860119861(119903119886 119903119887)+

119892(119879119860minus119903119886 119879119861minus119903119887)

For case (2) we conclude that either 119879119860is integrated with

119879119861minus119903119887

(119903119887is out of integration) or 119879

119861is integrated with 119879

119860minus119903119886

(119903119886is out of integration) Otherwise we will have a merged

tree 119879119860119861

with two roots 119903119886and 119903

119887 a contradiction to the

fact that 119879119860119861

is a tree Therefore the cohesion score is either119892(119879119860 119879119861minus119903119887) or 119892(119879

119860minus119903119886 119879119861)

Combining cases (1) and (2) and according to Criterion(2) we have

119892 (119879119860 119879119861) = max (119872

119860119861(119903119886 119903119887) + 119892 (119879

119860minus119903119886 119879119861minus119903119887)

119892 (119879119860 119879119861minus119903119887) 119892 (119879

119860minus119903119886 119879119861))

(1)

Lemma 2 Let 119879119860minus119903119886

and 119879119861minus119903119887

represent two sets of treesobtained by removing the root vertices 119903

119886from 119879

119860minus119903119886and 119903119887

from 119879119861minus119903119887

One has

119892 (119879119860minus119903119886 119879119861minus119903119887) = max( sum

(119879119909 119879119910)isin119877

(119892 (119879119909 119879119910))) (2)

Here 119879119909isin 119879119860minus119903119886

119879119910isin 119879119861minus119903119887

and 119877 is a matching of trees in119879119860minus119903119886

with trees in 119879119861minus119903119887

Proof To prove this lemma we first prove that for any tree119879119909isin 119879119860minus119903119886

it can be integrated with no more than one treein 119879119861minus119903119887

We will prove this claim by contradiction Assumethere are two trees 119879

1199101isin 119879119861minus119903119887

and 1198791199102isin 119879119861minus119903119887

and theyintegrate with a tree119879

119909isin 119879119860minus119903119886

into an integrated tree11987911990911991011199102

There are three cases for the root 119903 of 119879

11990911991011199102 as illustrated in

Figure 3

(1) 119903 contains only the root of 119879119909

(2) 119903 contains only the root of 1198791199101or 1198791199102

(3) 119903 contains the root of 119879119909and the root of 119879

1199101or 1198791199102

For case (1) the lowest common ancestor of the roots of 1198791199101

and 1198791199102in the integrated tree 119879

11990911991011199102will no longer contain

their lowest common ancestor in 119879119861 a contradiction to

6 BioMed Research International

push (0 0119873119880119871119871) into queue 119902while |119902| gt 0 do(119886 119887 119888) = 119901119900119901(119902)if 119886 = minus1 thenmake 119879

119861(119887) as a subtree of 119879

119860119861rooted at 119887 119879

119861(119887) is a sub tree of 119879

119861rooted at 119887

else if 119887 = minus1 thenmake 119879

119860(119886) as a subtree of 119879

119860119861rooted at 119886 119879

119860(119886) is a sub tree of 119879

119860rooted at 119886

elselet 119889 = (119886 119887) and make 119889 a child of 119888 on 119879

119860119861

119904119888119900119903119890 = 119872119860119861(119886 119887) + 119872119886119909119872119886119905119888ℎ(119879

119860minus119886 119879119861minus119887)

119904119888119900119903119890119886= 119872119886119909119872119886119905119888ℎ(119879

119860minus119886 119879119887)

119904119888119900119903119890119887= 119872119886119909119872119886119905119888ℎ(119879

119886 119879119861minus119887)

if 119904119888119900119903119890 gt= 119904119888119900119903119890119886ampamp 119904119888119900119903119890 gt= 119904119888119900119903119890

119887then

push matching results of 119879119860minus119886 119879119861minus119887

with 119889 into 119902else if 119904119888119900119903119890

119886gt= 119904119888119900119903119890

119887then

push matching results of 119879119860minus119886 119879119887with 119889 into 119902

elsepush matching results of 119879

119886 119879119861minus119887

with 119889 into 119902end if

end ifend whilereturn 119879

119860119861

Algorithm 3 BuildNewOnto(cohesion matrix 119879119860 119879119861)

Criterion (1) For cases (2) and (3) the root of 1198791199102(or the root

of 1198791199101) will be the descendant of the root of 119879

1199101(or the

root of 1198791199102) in the integrated tree 119879

1199091199101 1199102 a contradiction to

Criterion (1) For integration involving more than two treesfrom 119879

119861minus119903119887 we can still follow the above procedure to reach

contradictions Thus the claim is provenWithout loss of generality we can see that for any

tree 119879119910isin 119879119861minus119903119887

it can be integrated with no more thanone tree in 119879

119860minus119903119886 Therefore the integration between 119879

119860minus119903119886

and 119879119861minus119903119887

corresponds to a matching in a weighted bipar-tite graph in which two sets of nodes represent treesfrom 119879

119860minus119903119886and 119879

119861minus119903119887 respectively and edges represent

corresponding cohesion scores According to Criterion (2)119892(119879119860minus119903119886 119879119861minus119903119887) = max(sum

(119879119909 119879119910)isin119877(119892(119879119909 119879119910))) and we con-

clude that 119892(119879119860minus119903119886 119879119861minus119903119887) corresponds to the weight of amax-

imum weighted matching in the above bipartite graph

Given Lemma 2 we can see that the following corollary iscorrect

Corollary 3 Define MaxMatch(119883 119884) = sum(119879119909 119879119910)isin119877

119892(119879119909 119879119910)

where 119877 is a maximum weighted matching of trees in forests119883 and 119884 given 119892(119879

119909 119879119910) for any tree pair (119879

119909 119879119910) isin 119883 times 119884

One concludes that for any two forests 119879119860and 119879

119861 119892(119879119860 119879119861) =

MaxMatch(119879119860 119879119861)

With Lemma 1 and Corollary 3 we are able to designan efficient dynamic programming algorithm achieving theglobal optimum for the ontology integration problem Thepseudocode for calculating the maximum cohesion score isdescribed in Algorithm 2 which visits ontology vertices inreverse topological order when filling up the cohesionmatrix

At the end of Algorithm 2 the cohesion matrix is filledup with optimal cohesion scores and the maximum cohesionscore is saved at entry (0 0) as described byTheorem 4

Theorem 4 It holds that cohesion matrix(119894 119895) = 119892(119879119860minus119894

119879119861minus119895) and cohesion matrix(0 0) = 119892(119879

119860 119879119861)

Proof We will prove this theorem by mathematical induc-tion

Let |119881(119879119860)| = 119899 and |119881(119879

119861)| = 119898 it is easy to see

that cohesion matrix(119899119898) is optimal because 119899 and 119898correspond to leaf nodes in the topological order Thus119879119860minus119899

and 119879119861minus119898

are empty sets and cohesion matrix(119899119898) =119872119860119861(119899119898) = 119892(119879

119899 119879119898)

When 119894 = 119899 and 119895 lt 119898 according to Lemma 1 to integrate119879119899with 119879

119895 either 119879

119899(the leaf vertex) is merged with 119879

119895or 119879119899

is integrated with 119879119861minus119895

The maximum score of integrating119879119899with 119879

119861minus119895is available at the time cohesion matrix(119899 119895) is

being calculated because of the reverse topological visitThusaccording to both Lemma 1 and Corollary 3 we concludethat 119892(119879

119899 119879119895) = max(119872

119860119861(119899 119895)MaxMatch(119879

119899 119879119861minus119895)) =

cohesion matrix(119899 119895) Similarly we can conclude that119892(119879119894 119879119898) = max(119872

119860119861(119894 119898)MaxMatch(119879

119860minus119894 119879119898)) =

cohesion matrix(119894 119898)When 119894 lt 119899 and 119895 lt 119898 again according to

Lemma 1 and Corollary 3 we have 119892(119879119894 119879119895) = max(score

score119886 score

119887) = cohesion matrix(119894 119895) Due to the reserve

topological visit MaxMatch(119879119860minus119894 119879119861minus119895) MaxMatch(119879

119860minus119894

119879119895) and MaxMatch(119879

119894 119879119861minus119895) are available at the time

cohesion matrix(119894 119895) is being calculated

Given the definition of 119892(119879119860 119879119861) Theorem 4 in fact

proves that the proposed approach achieves the global opti-mum However the global optimal solution is built upon

BioMed Research International 7

themaximumweightedmatching (recall Corollary 3) As dis-cussed in Section 213 the maximum weighted matching istime-consuming and therefore we propose an approximationsolution in that section

Although Algorithm 2 builds the cohesion matrix withoptimal cohesion scores it does not construct the integratedontological knowledge structure We may save the ontologyintegration details along with the cohesion scores Howeverthat will cost 119874(1198993) memory space (assuming each ontologyhas 119874(119899) vertices) and significantly reduce the capacity ofthe algorithm in handling large ontology integrations Quiteinterestingly we find that it is not necessary to save the inte-gration details in order to construct the integrated ontologyThe construction can be done by a process reverse to theconstruction of cohesionmatrix as described inAlgorithm 3

Algorithm 3 uses the cohesion matrix constructed byAlgorithm 2 and builds the integrated ontology tree still byfollowing Lemma 1 and Corollary 3 but in a reverse way ofAlgorithm 2 The construction is performed in a Breadth-First fashion which uses a queue 119902 to maintain triples Eachtriple (119886 119887 119888) is an association of three elements 119886 is thematched vertex from ontology 119879

119860 119887 is the matched vertex

from ontology 119879119861 and 119888 is their parent on the merged

ontology 119879119860119861 By following the basic idea of the proof of

Theorem 4 we can show that Algorithm 3 builds an optimalintegrated ontology with the cohesion matrix provided fromAlgorithm 2 We omit the proof for succinctness

213 Time Complexity Analysis and an Approximation Solu-tion Assume an ontology size is 119874(119899) The cohesion matrixhas 119874(1198992) entries to fill up The computation for each entryis a matching whose time complexity depends on the imple-mentation The maximum weighted matching takes 119874(1198993)using the famous Hungarian algorithm [21] and althoughit achieves optimum it is too costly for large ontologiesThe maximal weighted matching however takes 119874(1198992 log 119899)time and more importantly results in an overall 119874(1198992 log 119899)time complexity for Algorithm 2 The analysis is given inthe following For each matching the algorithm will accesspreviously filled entries and each entry will be accessed onlyonce and be involved only once in a sorting of 119874(log 119899)time This is because each entry corresponds to two verticeswhose cohesion score will be accessed when calculating thecohesion score of their parents Thus we conclude that thetotal time complexity of calculating the cohesionmatrix usingmaximal weightedmatching is119874(1198992+1198992 log 119899) = 119874(1198992 log 119899)Since building the new ontology has the same time com-plexity as building the cohesion matrix this is also the totaltime complexity for integrating two ontologies by maximalweightedmatchingThemaximal weightedmatching also hasa guaranteed lower bound on the results It achieves a (12)-approximation solution (ie the overall cohesion score willbe at least 12 of the optimal cohesion score) as pointed out in[22]

Since themaximalweightedmatching results in an overallgood performance on time complexity and approximationrate we used the maximal weighted matching in our empir-ical study for Algorithms 2 and 3 Readers may also choose

othermatching algorithms (such as the one described in [22])to achieve slightly better approximation rates However theweightedmatching is a replaceablemodule for our algorithmsand it is not the focus of this work to build a fast and close-to-optimal weighted matching algorithm

Compared to the time complexity of integrating ontolo-gies by the dynamic programming approach as describedin Algorithms 2 and 3 the heuristic approach described atthe beginning of this section also has 119874(1198992) in the worstcase However we conjecture that the heuristic approach hasa much smaller average time complexity because in eachstep the heuristic approach may exclude a large number ofmatching opportunities

22 Integrating Multiple Ontologies In the previous sectionwe proposedmethods for integrating two ontologies In somebiomedical applications [23 24] we are interested in theassociations involving more than two objects Integrationof multiple ontologies of these objects will generate aninnovative view on these complex relationships Similar tothe basic problem formulation we can formulate themultipleontology integration as follows

Given 119896 ontology trees 1198791 1198792 119879

119896and a closeness

matrix119872119894119895for any two trees 119879

119894and 119879

119895 how can we efficiently

generate an integrated ontology tree 11987912119896

meeting thefollowing criteria

(1) For any two vertices 119909 and 119910 in a tree 119879119894 their

lowest common ancestor LCA119879119894(119909 119910) is contained by

LCA11987912119896(119909 119910)

(2) It holds that argmax11987912119896119891(11987912119896

) =

sum119883isin119881(119879119860119861)

sum119906isin119883Visin119883120590(119906)lt120590(V)119872120590(119906)120590(V)(119906 V) Here

119872120590(119906)120590(V)(119906 V) is the entry value in the closeness

matrix for two vertices 119906 and V (one from tree 119879120590(119906)

and the other from tree 119879120590(V)) contained in the node

119883 For a vertex V from an original ontology 120590(V) isdefined as its original ontology ID

Again we name the function 11989111987912119896

the cohesion of theintegrated ontology 119879

12119896 For each node 119883 in the inte-

grated ontology we define its weight as 119908119890119894119892ℎ119905(119883) =sum119906isin119883Visin119883120590(119906)lt120590(V)119872120590(119906)120590(V)(119906 V) Correspondingly we define

function119892(1198791 1198792 119879

119896) = max

11987912119896(sum119883isin119881(11987912119896)

119908119890119894119892ℎ119905(119883))

as the maximum cohesion function for integrating theontologies 119879

1 1198792 119879

119896 As we can see in the above formula-

tion the overall cohesion score of integration is the summedweight of each node which is the sum of pairwise closenessscores

The formulation of multiple ontology integration is sim-ilar to the basic version and it is not difficult to showthat optimal structures described in Lemmas 1 and 2 canbe extended to a high dimension However the extensionof algorithms described in Section 212 for integrating twoontologies is not feasible for solving the multiple ontologyintegrationThis is because if we need to extend Algorithms 2and 3 to this problem we need to build a cohesionmatrix of 119896dimensions It implies that we need at least 119874(119899119896) operationsto fill up the score matrix assuming the size of an ontology

8 BioMed Research International

A B C D

A 10 6 8

B 10 7 5

C 6 7 9

D 8 5 9

AB C D

AB

C 9

D 9

lowastlowast

lowast

lowast mdash

mdash

mdash

mdash

mdash

mdash

mdash

(lowast) Entries to update

Figure 4 An example of the InterOntology matrixrsquos change at the first iteration in Algorithm 4 for integrating four ontologies

build the 119896 times 119896 InterOntology matrixfor 119894 = 1 to 119896 minus 1 doidentify the active tree pair ⟨119879

119883 119879119884⟩ that corresponds

to the highest score in the InterOntology matrixintegrate 119879

119883and 119879

119884into 119879

119883119884

mark 119879119883and 119879

119884as inactive

update relationship matricesupdate InterOntology matrix

end forreturn 119879

12119899

Algorithm 4 GreedyMultiInt(T = 1198791 1198792 119879

119896)

is 119874(119899) This is clearly not acceptable for high dimensionalontology integration

221 Greedy Approach From the above discussion we cansee that direct extension of Algorithms 2 and 3 for integratingtwo ontologies is practically not feasible for integrating alarge number of ontologies However we can still use thesealgorithms for integrating multiple ontologies by iterativelyintegrating two ontologies and generating a new closenessmatrix Given the ontologies 119879

1 1198792 119879

119896 we can first

integrate 1198791and 119879

2into 119879

12and then build the closeness

matrix between 11987912

and 1198793using the relationship matrices

between 1198791and 119879

2and between 119879

1and 119879

3 Specifically

assume 119883 is a node on the integrated ontology 11987912 and 119883

contains a vertex 119886 from 1198791and a vertex 119887 from 119879

2 Then

the entry (119883 119888) of the closeness matrix between 11987912

and 1198793

is 11987211987912 1198793(119883 119888) = 119872

1198791 1198793(119886 119888) + 119872

1198792 1198793(119887 119888) After the new

closeness matrix is generated we can continue integrating11987912

and 1198793into 119879

123and generating another new closeness

matrix We will eventually get the integrated ontology 11987912119896

by repeating the above process To facilitate the followingdiscussions we name the above approach the basic multipleintegration approach

Although the basic multiple integration approach canfinish integrating multiple ontologies it blindly integratesontologies without using any cohesion information betweenontologies that may lead to a better integration resultTo improve the basic multiple integration we propose agreedy approach that uses the cohesion information betweenontologies to guide the integration The basic steps of thegreedy approach are outlined in Algorithm 4 To facilitate

the understanding of Algorithm 4 we use Figure 4 toillustrate an example of the InterOntology matrixrsquos change atthe first iteration of integrating four ontologies A B C andD

The key idea in Algorithm 4 is to maintain an InterOn-tology matrix which guides the integration Initially thismatrix is filled with the overall cohesion score of everypair of ontologies In each step this matrix is updated withoverall cohesion scores between newly integrated ontologyand existing active ontologies The integration will take placebetween two active ontologies which have the highest scorein the InterOntology matrix

When we used the overall cohesion score between twoontologies to update the InterOntology matrix we observedan interesting phenomenon that the integration inmost casesis a process continuously expanding an integrated ontologyConsequently the greedy approach is likely to yield a resultsimilar to the basic approach

This phenomenon can be explained by the definition ofmaximum cohesion function which takes into account allpairwise closeness between merged terms Thus the moreontologies contained in an integrated ontology are the morelikely it will have high overall cohesion scores with otherontologies As a result it creates unfairness for the integrationselection To fix this issue we use the adjusted overall cohe-sion scores in updating the InterOntology matrix as follows

Given an ontology 119879119883and an ontology 119879

119884where 119883 and

119884 are nonempty sets of ontology IDs we define the adjustedcohesion score between 119879

119883and 119879

119884as 119860119889119895119862119900ℎ(119879

119883 119879119884) =

sum119885isin119881(119879119883119884)

sum119909isin119885119910isin119885120590(119909)isin119883120590(119910)isin119884

119872120590(119909)120590(119910)

(119909 119910)|119883||119884| where119879119883119884

is the integrated ontology built by Algorithms 2 and 3The adjusted cohesion score is in fact the weight increase byintegrating 119879

119883and 119879

119884 divided by the size of 119883 times the

size of 119884 that is119860119889119895119862119900ℎ(119879119883 119879119884) = sum

119885isin119881(119879119883119884)(119908119890119894119892ℎ119905(119885))minus

sum119883isin119881(119879119883)

(119908119890119894119892ℎ119905(119883)) minus sum119884isin119881(119879119884)

119908119890119894119892ℎ119905(119884)|119883||119884| For eachnode merging closeness scores will be added to the totalweight when the merging takes place between vertices fromontology set 119883 and vertices from ontology set 119884 Thus theweight increase by integrating 119879

119883and 119879

119884is proportional to

the number of ontologies in119883 times the number of ontologiesin 119884 and consequently we averaged the weight increase by|119883||119884|

222 Fast Approximation Algorithm Although the basicmultiple integration and the greedy multiple integrationapproaches discussed above are able to integrate multipleontologies none of them provide any guarantee on the results

BioMed Research International 9

A

A

7

10 8

9

6

5

9

D

C

B D

C

B

ge 7

ge8

Figure 5 An illustration of vertexedge contraction and weight updates in an iteration of Algorithm 5 Each vertex represents an ontology

for 119894 = 1 to 119896 minus 1 dofor 119895 = 119894 + 1 to 119896 dopush ⟨119892(119879

119894 119879119895) 119894 119895⟩ into a set 119878 ordered by the first element in descending order

end forend forwhile |119878| gt 0 do⟨119911 119909 119910⟩ = 119901119900119901(119878)if adding (119909 119910) does not form a cycle inG thenadding (119909 119910) toGintegrate 119879

119862(119909)and 119879

119862(119910)into 119879

119862(119909)cup119862(119910) 119862(V) is a set of vertices including V that form a connected component inG

end ifend whilereturn 119879

12119899

Algorithm 5 FastMultiInt(T = 1198791 1198792 119879

119896)

in comparison with the optimal solutions By studyingthe maximum cohesion scores between ontologies undera graph setting we identified an approximation structureand developed an approximation algorithm for integratingmultiple ontologiesWename it fast approximation algorithmbecause it not only has a lower bound on the results but alsoruns faster than the greedy multiple integration algorithmproposed above

The fast approximation algorithm for integratingmultipleontologies is sketched in Algorithm 5 It only calculates themaximum cohesion score between every pair of ontologiesonce during the initial stage and uses this informationthroughout the integration process even after it becomesstaleMore importantly this approach not only saves the timefor recalculating the maximum cohesion scores but also pro-vides a lower bound guarantee as stated inTheorem 5 whosecorrectness is built on two important lemmas (Lemmas 6 and7) which will be described subsequently

Theorem 5 The tree weight of the integrated tree 11987912119896

obtained by FastMultiInt algorithm is at least 1(119896 minus 1) theweight of the optimal solution

Proof We will use Lemmas 6 and 7 to prove this theoremThe proofs of Lemmas 6 and 7 are provided after the proofof this theorem To facilitate the proof of this theorem we

build a fully connected weighted graph G in which eachnode corresponds to a tree for integration and the weightof each edge corresponds to the weight increase (initiallythis is the cohesion score) for integrating the correspondingtrees According to Lemma 6 119892(119879

1 1198792 119879

119896) (ie optimal

cohesion score) is no more than the summed weight of edgesinG (Claim 1)

Given G the integration by Algorithm 5 is a processof 119896 minus 1 node contractions After each contraction theadjacent edge weights (cohesion scores) will be updatedaccordingly According to Lemma 7 the weight of an updatededge will only increase over (or at least remain the same as)the maximum weight of the two contracted edges (Figure 5provides an illustration of vertexedge contraction andweightupdates) Thus the overall cohesion score of the integrationby Algorithm 5 is no less than the weight of the maximumspanning tree ofG (Claim 2)

It is easy to see that the weight of a maximum spanningtree is no less than 1(119896minus1) of the summed edge weight ofGgiven the simple observation that each edge in G is either anedge of the maximum spanning tree or adjacent to an edge ofthe maximum spanning tree with an equal or heavier weight(Claim 3)

Combining Claims 1 2 and 3 we complete the proof ofthis theorem

Lemma 6 It holds that 119892(1198791 1198792 119879

119896) le sum1le119894lt119895le119896

119892(119879119894 119879119895)

10 BioMed Research International

Proof According to the problem definition

119892 (1198791 1198792 119879

119896)

= max( sum

119883isin119881(11987912119896)

119908119890119894119892ℎ119905 (119883))

= max( sum

119883isin119881(11987912119896)

sum

119906isin119883Visin119883120590(119906)lt120590(V)119872120590(119906)120590(V) (119906 V))

= ( sum

1le119894lt119895le119896

11989111987912119896

(119879119894119895)) le sum

1le119894lt119895le119896

119892 (119879119894 119879119895)

(3)

11989111987912119896(119879119894119895) is the cohesion score of 119879

119894119895whose integration is

induced from 11987912119896

Lemma 7 It holds that 119892(119879119875119876 119879119878) minus 119891(119879

119875119876) ge max(119892(119879

119875

119879119878) 119892(119879119876 119879119878))

Proof According to the problem definition integrating 119879119875119876

with 119879119878will result in an integrated tree 119879

119880where 119880 = 119875 cup

119876 cup 119878 and 119892(119879119875119876 119879119878) = max

119879119880(sum119883isin119881(119879119880)

119908119890119894119892ℎ119905(119883)) =

max119879119880(119891(119879119875119878) + 119891(119879

119876119878) + 119891(119879

119875119876)) where 119879

119875119878 119879119876119878 and

119879119875119876

are induced from 119879119880 Since 119879

119875119876has been determined

we have 119892(119879119875119876 119879119878) = max

119879119880(119891(119879119875119878) + 119891(119879

119876119878)) + 119891(119879

119875119876)

Without loss of generally let us assume max(119891(119879119875119878)) ge

max(119891(119879119876119880)) Then by restricting the integration between

119879119875and 119879

119878in 119879119880following the integration that leads to

argmax119879119875119878119891(119879119875119878) we will get a cohesion score no less than

119892(119879119875 119879119878) Thus we complete the proof for 119892(119879

119875119876 119879119878) minus

119891(119879119875119876) ge max(119892(119879

119875 119879119878) 119892(119879119876 119879119878))

223 Time Complexity Analysis For the fast approximationalgorithm (Algorithm 5) the time complexity for generatinggraph G (calculating the overall cohesion score for everypair of ontologies) is119874(11989621198992 log 119899) assumingwe usemaximalweighted matching Each integration will take 119874(1198992 log 119899)with an update of at most 119896 closeness matrices which takes119874(119896lowast119899

2) There are at most 119896 integrations therefore the total

time complexity is still 119874(11989621198992 log 119899)Following the above analysis we conclude that the time

complexity of the greedy multiple integration algorithm isthe same as the fast approximation algorithm However itrequires updating of InterOntology matrix which takes anexcessive 119874(11989621198992 log 119899) time The empirical study shows thatthe fast approximation algorithm is much faster than thegreedy multiple integration algorithm

Finally it is easy to see that the basic multiple integrationapproach takes 119874(1198961198992 log 119899) time and is the fastest but itsoverall cohesion scores are the worst as we will see in theempirical study

224 Limitations Integrating multiple ontologies may facetwo potential problems in real applications First how canwe efficiently generate a closeness matrix for every pair ofontologies to be integrated Our current method kDLS or

onGrid is efficient for generating the closenessmatrix for onepair of ontologies in most cases but not efficient enough forgenerating closeness matrices for many pairs of ontologiesSecond not every pair of ontologies can be meaningfullyintegrated It remains a problem to efficiently identify thefeasibility of integrating a pair of ontologies Therefore themain purpose of Section 22 is to demonstrate that ourproposed approach can be extended to integrate multipleontologies and we use synesthetic datasets in Section 33 tostudy the performance of algorithms proposed in Section 22

3 Results and Discussion

We would like to study the performances of the proposedontology integration methods by experiments on both realand synthetic datasets We implemented five approaches inC++

(1) Heuristic heuristic approach for integrating twoontologies as described in Section 211

(2) Approximate approximate approach (Algorithms 2and 3) with maximal weighted matching for guaran-teeing the (12)-approximation rate

(3) Basic basic multiple integration approach as de-scribed in Section 221

(4) Greedy greedy multiple integration approach(Algorithm 4)

(5) FastApproximate fast approximation multipleintegration approach (Algorithm 5)

In the following we report our study on the performancesof (1) and (2) for integrating two ontologies on real datasetsand (3) (4) and (5) for integrating multiple ontologies onsynthetic datasets All the experiments are carried out on aLinux cluster with 24GHz AMD Opteron processors

31 Integrating a Pair of Ontologies The knowledge of drug-gene relationships is desirable in many pharmacology appli-cations [25 26] By integrating the gene ontology and thedrug ontology we will be able to obtain rich information onthe associations between drugs and genes under the ontologystructures Thus in this set of experiments we simulate realworld knowledge discovery applications by integrating tworeal ontologies gene ontology (GO) and National Drug FileReference Terminology (NDFRT) Both were obtained fromthe Unified Medical Language System (version 2012AA)The closeness matrices between GO terms and drug termswere generated using onGrid [27] with a 4-neighborhoodbroadcast range (ie 119896 = 4 with regard to [7]) onGridfollows the kDLS approach [7] and measures the closenessbetween two concepts based on the discovered paths (withlength greater than one) between them However unlikekDLS onGrid takes into consideration of concept semantictypes in the closeness measurement In the study performedin [27] the advantages of onGrid over kDLS are wellillustrated

The overall cohesion scores of Heuristic andApproximate on integrating GO and NDFRT are listed

BioMed Research International 11

Table 1 Cohesion scores of integrating real datasets

Depth GO term number NDFRT term number Cohesion scoresHeuristic 120573 = 1 Heuristic 120573 = 6 Heuristic 120573 = 100 Approximate

3 66 6004 00505331 0229958 021999 1246964 710 6972 0290392 0284363 146585 938355 5355 14582 0285923 137714 0528289 3390566 16231 32841 0307941 0341588 0673406 740293

Table 2 Top 5 matched terms by the Approximate algorithm (depth = 6)

Rank GO terms NDFRT terms Closeness score1 C1135918 smooth muscle contractile fiber C0282606 muscle neoplasms 1632052 C0010813 cytokinesis C0086376 GTP-binding proteins 1159673 C0027747 axon terminus C0030584 parovarian cyst 1133524 C1155065 T cell activation C0007082 carcinoembryonic antigen 1008795 C0007595 cell growth C0294028 BRCA2 protein 0945284

Table 3 Top 5 matched terms by the Heuristic algorithm (depth = 6 120573 = 100)

Rank GO terms NDFRT terms Closeness score1 C0031845 biological process C0042890 VITAMINS 02448622 C1166607 cellular component C1657248 apoptosome 01428573 C0027540 tissue death C0065932 MENADIOL 004342194 C0025519 metabolic process C0042849 VITAMIN B 003702485 C0030012 oxidation-reduction process C0027996 NICOTINIC ACID 00327109

in Table 1 To observe the integration over the ontologysize change in each experiment we use the ontology treestructure from the root to the specified depth (first columnin Table 1) for integration The sizes of the ontology termsinvolved in the integration are listed in the second and thirdcolumns of Table 1

Recall in Section 211 120573rdepth(119887) is used to regulate theselection of vertices from high depths Thus we tested theHeuristic under 120573 = 1 (depth information is nullified)and 120573 = 100 (the vertex depth plays a critical role inthe selection) Since nonleaf vertices in these datasets havearound 6 children on average we heuristically add a set ofexperiments by setting 120573 = 6 so that 120573rdepth(119887) will be close tothe number of vertices excluded from the future integration

From Table 1 we can see that Heuristic performs betterwhen using the depth information to regulate the selectionof vertices However Approximate is much better thanHeuristic at all settings Compared to the best cohesionscores of Heuristic in each row of Table 1 Approximateconstructs an integrated ontology with the overall cohe-sion score ranging from 54 times to 1099 times that ofHeuristicThis clearly demonstrates the effectiveness of theproposed Approximate approach Nevertheless the heuris-tic approach has a much faster average running time as aresult of excluding a large number of matching opportunitiesin each step

Although the running time of Approximate is longerthan Heuristic it takes less than two hours to finishintegrating two ontologies with about 16119896 and 33119896 vertices

Most of the biomedical ontologies are smaller than or similarto these sizes and Approximate approach will benefit theassociation study of these ontologies For extremely largeontology pairs in which Approximate is unable to finish theintegrationwithin a reasonable time Heuristicmay providea quick view on their integration

32 Understanding the Merged Ontology Terms To under-stand what terms are merged in integrating real ontologieswe use the integration of GO and NDFRT at depth 6 asan example Tables 2 and 3 list the top 5 pairs of mergedterms (sorted by their closeness scores) by Heuristicand Approximate respectively As mentioned abovethese scores are from the closeness matrix generatedby onGrid based on the discovered paths betweenthem For example ldquoC1155065T-Cell Activation minus minusis physiologic effect of chemical or drug minus minus gt C0393002Carcinoembryonic Antigen Peptide 1 minus minus has target minus minus gtC0007082Carcinoembryonic Antigenrdquo is such a path

From Tables 2 and 3 we can observe that theApproximate algorithm merges terms with much highersimilar scores than the Heuristic algorithm Quiteinterestingly we observed that the top ranked merging inTable 3 is between ldquobiological processrdquo and ldquoVITAMINSrdquoThe ldquobiological processrdquo is an abstract term which is veryclose to the root of the GO ontology Such a fact suggests thatthe top level terms will likely preempt the merging choicesover their descendants As a result of this greedy approach

12 BioMed Research International

0

2000

4000

6000

8000

10000

12000

14000

16000

18000

100 200 300 400 500 600 700 800 900 1000

Ove

rall

cohe

sion

scor

e of i

nteg

ratio

n

Individual ontology size

FastApproximateBasic

Greedy

Figure 6 The change of overall cohesion score over the increase ofthe size of each ontology The number of ontologies is fixed at 10

0

100000

200000

300000

400000

500000

600000

700000

800000

10 20 40 70 80 90 100

Ove

rall

cohe

sion

scor

e of i

nteg

ratio

n

Number of ontologies 30 50 60

FastApproximateBasic

Greedy

Figure 7 The change of overall cohesion score over the increase ofthe number of ontologies The size of each ontology is fixed at 100

the Heuristic algorithm will end at a local optimum whichis far from being optimal

A snapshot of ontology integration by Approximate asshown in Figure 10 provides a good insight on the algorithmwork In each bracket of two merged terms the left partis the closeness score and the right part is the cohesionscore We can observe that most closeness scores are zero orclose to zero while the corresponding cohesion scores aremuch higher This is understandable because the snapshot isprimarily on the top level terms of both ontologies For theseterms they have a large number of subclass (descendant)terms and optimizing the integration of their subclass termsfar outweighs integrating of themselves The result of such

0

200

400

600

800

1000

1200

1400

1600

100 200 300 400 500 600 700 800 900 1000

Inte

grat

ion

time (

s)

Individual ontology size

FastApproximateBasic

Greedy

Figure 8 The change of integration time over the increase of thesize of each ontology The number of ontologies is fixed at 10

0

500

1000

1500

2000

2500In

tegr

atio

n tim

e (s)

10 20 40 70 80 90 100Number of ontologies

30 50 60

FastApproximateBasic

Greedy

Figure 9 The change of integration time over the increase of thenumber of ontologies The size of each ontology is fixed at 100

integration provides novel knowledge of association betweenontology terms That is even if two terms are not that closeaccording to some closeness measurement they can be struc-turally associated under their ontology context For examplethe GO term ldquobiological processrdquo is merged with the NDFRTterm ldquochemical ingredientsrdquo even their closeness score iszero from the onGrid output However such integrationis interesting because it shows that the merging is tryingto link the chemical compounds with the biologicalcellularprocesses so that corresponding associations between thecellular processes and chemical structures can be establishedThis demonstrates the purpose of integrating two ontologiesthat is identifying associations with respect to both termsimilarities and structural contexts

BioMed Research International 13

component

Biologicalprocess Cellular

SignalingNeoplasms

National Drug File- Reference Terminology

Chemical ingredients

Diseases manifestations or physiologic states

CarbohydratesResponse to

stimulusOrganic chemicals

Cell part Extracellular region part

Skin and connective

tissue diseases

[0 7403]

[0 4833] [0 2315]

[00042 262] [00016 215][014 899] [0 307]

Amino acids peptides and

proteins

Cellular process Proteins

Cell adhesion involved in heart morphogenesis

Zyxin

Synapse Nervous system diseases

Asymmetric synapse Brain

neoplasmsSymmetric

synapse Granulomatousdisease chronic

[0036 2009]

[0023 0023]

[0034 0034] [0023 0023] [0006 0006]

[0098 062]

ontology

middot middot middot

middot middot middot

middot middot middot middot middot middot middot middot middot

middot middot middot middot middot middot middot middot middot

middot middot middotmiddot middot middot

Gene

NeuronalRNA

granule

Traumanervoussystem

Figure 10 A snapshot of ontology integration between GO and NDFRT by Approximate (depth = 6) Blue nodes are terms from GO andred nodes are terms from NDFRT Solid lines are edges and dashed lines are paths consisting of 1 or more edges In each bracket the left partis the closeness score and the right part is the cohesion score

In fact there are multiple studies to justify the struc-tural associations seen in Figure 10 such as the associationbetween ldquosignalingrdquo and ldquocarbohydratesrdquo [28] and the asso-ciation between ldquoextracellular region partrdquo and ldquoskin andconnective tissue diseasesrdquo [29]

In addition we have noticed a number of meaningfulintegrations betweenGO terms and neurological terms in theNDFRT For example synapse is a brain related structure andthe term ldquosymmetric synapserdquo is associated with ldquotraumardquoand the term ldquoasymmetric synapserdquo is associated with ldquobrainneoplasmsrdquo Similarly it is reasonable to see that ldquoneuronalRNA granulerdquo is integrated with ldquogranulomatous diseaserdquo agranule associated disease As another example it is veryinteresting to notice that ldquozyxinrdquo is associated with ldquocelladhesion involved in heart morphogenesisrdquo and that providesa link with the formation of heart

The above observations suggest a novel way of using ourontology integration method to perform association studiesbetween biomedical concepts

33 Integrating Multiple Ontologies In the following exper-iments we will study the performances of Basic Greedyand FastApproximate in integrating multiple ontologiesAll the three approaches are built upon Approximate whichperforms very well in the previous study for integrating tworeal ontology datasets

We use two sets of synthetic datasets in this study In thefirst set of datasets we fix the number of ontologies to be 10and vary the size of each ontology from 100 to 1000 In thesecond set of datasets we fix the size of each ontology to be100 and vary the number of ontologies from 10 to 100 All theontologies are randomly generated by constructing aminimalspanning tree from a randommatrixThe relationshipmatrixbetween every pair of ontologies is also randomly generatedwith entry values ranging from 0 to 1 For each experiment

we generate 10 randomdatasets and the results reported in thefollowing are the average results over the 10 random datasets

The overall cohesion scores of the three approachesover different ontology sizes and over different numbers ofontologies were reported in Figures 6 and 7 respectivelyFastApproximate outperforms all the other approachesin Figure 6 which is consistent with the analysis of itsapproximation rate However Greedy slightly outperformsFastApproximate in Figure 7 especially when the ontologynumber is large This is understandable because when thenumber of ontology (119896) increases the approximation rate (asstated in Theorem 5) decreases and becomes less significantThis result also justifies the choice of adjusted cohesion scorefor Greedy as described at the end of Section 221

The integration time of the three approaches over differ-ent ontology sizes and over different numbers of ontologieswas reported in Figures 8 and 9 respectively These figuresare consistent with the time complexity analysis given inSection 223 In particular we noticed that the integrationtime of Greedy deteriorates sharply over the ontologynumber increase In contrast FastApproximate is muchmore scalable and has a time curve similar to Basic

These results suggest that FastApproximate has the bestoverall performance in integrating multiple ontologies

4 Conclusions

In this work we started with a basic problem on integrating apair of ontology tree structures with a given closeness matrixand later we advanced the basic problem to the problem ofintegrating large number of ontologies We proved optimalstructures in the basic problem and developed both optimaland efficient approximation solutions Although the multipleontology integration problem has similar optimal structuresit is not feasible to extend the optimal and efficient approx-imation solutions for the basic problem to efficiently handle

14 BioMed Research International

multiple ontology integration To tackle the challenge of inte-grating a large number of ontologies we developed both aneffective greedy approach and a fast approximation approachThe empirical study not only confirms our analysis on theefficiency of the proposedmethod but also demonstrates thatourmethod can be used effectively for biomedical associationstudies

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

References

[1] D L McGuinness F van Harmelen andM K Smith Owl webontology language overview W3C recommendation 10(2004-03)10 2004

[2] O Bodenreider ldquoThe unifiedmedical language system (UMLS)integrating biomedical terminologyrdquo Nucleic Acids Researchvol 32 supplement 1 pp D267ndashD270 2004

[3] X Wu R Jiang M Q Zhang and S Li ldquoNetwork-based globalinference of human disease genesrdquo Molecular Systems Biologyvol 4 no 1 2008

[4] B Linghu E S Snitkin Z Hu Y Xia and C DeLisi ldquoGenome-wide prioritization of disease genes and identification ofdisease-disease associations from an integrated human func-tional linkage networkrdquo Genome Biology vol 10 no 9 articleR91 2009

[5] D Botstein and N Risch ldquoDiscovering genotypes underlyinghumanphenotypes past successes formendelian disease futureapproaches for complex diseaserdquo Nature Genetics vol 33 pp228ndash237 2003

[6] H S Pinto and J P Martins ldquoA methodology for ontologyintegrationrdquo in Proceedings of the 1st International Conferenceon Knowledge Capture (K-CAP rsquo01) pp 131ndash138 ACM 2001

[7] Y Xiang K Lu S L James T B Borlawsky K Huang and P RO Payne ldquoK-Neighborhood decentralization a comprehensivesolution to index the UMLS for large scale knowledge discov-eryrdquo Journal of Biomedical Informatics vol 45 no 2 pp 323ndash336 2012

[8] J Dutkowski M Kramer M A Surma et al ldquoA gene ontologyinferred from molecular networksrdquo Nature Biotechnology vol31 no 1 pp 38ndash45 2013

[9] R Navigli P Velardi and A Gangemi ldquoOntology learningand its application to automated terminology translationrdquo IEEEIntelligent Systems vol 18 no 1 pp 22ndash31 2003

[10] J Jannink andGWiederhold ldquoThesaurus entry extraction froman on-line dictionaryrdquo in Proceedings of the Fusion ConferenceSunnyvale Calif USA 1999

[11] C Papatheodorou A Vassiliou and B Simon ldquoDiscovery ofontologies for learning resources using word-based clusteringrdquoin World Conference on Educational Multimedia Hypermediaand Telecommunications pp 1523ndash1528 2002

[12] D L Rubin M Hewett D E Oliver T E Klein and RB Altman ldquoAutomating data acquisition into ontologies frompharmacogenetics relational data sources using declarativeobject definitions and xmlrdquoPacific SymposiumonBiocomputingpp 88ndash99 2001

[13] N F Noy ldquoSemantic integration a survey of ontology-basedapproachesrdquo ACM SIGMOD Record vol 33 no 4 pp 65ndash702004

[14] S Abels L Haak and A Hahn ldquoIdentification of commonmethods used for ontology integration tasksrdquo in Proceedingsof the 1st International Workshop on Interoperability of Het-erogeneous Information Systems (IHIS rsquo05) pp 75ndash78 ACMNovember 2005

[15] A Gangemi D Pisanelli and G Steve ldquoOntology integrationexperiences withmedical terminologiesrdquo in Formal Ontology inInformation Systems vol 46 pp 98ndash94 IOS Press AmsterdamThe Netherlands 1998

[16] N F Noy and M A Musen ldquoSMART automated support forontology merging and alignmentrdquo in Proceedings of the 12thWorkshop on Knowledge Acquisition Modelling and Manage-ment (KAW 99) Banff Canada 1999

[17] A Doan J Madhavan P Domingos and A Halevy ldquoLearningtomap between ontologies on the semantic webrdquo in Proceedingsof the 11th International Conference onWorldWideWeb (WWWrsquo02) pp 662ndash673 ACM May 2002

[18] J Xie F Liu and S-U Guan ldquoTree-structure based ontologyintegrationrdquo Journal of Information Science vol 37 no 6 pp594ndash613 2011

[19] D Calvanese G De Giacomo andM Lenzerini ldquoA frameworkfor ontology integrationrdquo inTheEmerging SemanticWebSelectedPapers from the First Semantic Web Working Symposium pp201ndash214 2002

[20] O Udrea L Getoor and R J Miller ldquoLeveraging data andstructure in ontology integrationrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo07) pp 449ndash460 ACM June 2007

[21] H W Kuhn ldquoThe Hungarian method for the assignmentproblemrdquoNaval Research Logistics Quarterly vol 1-2 pp 83ndash971955

[22] D E Drake Vinkemeier and S Hougardy ldquoA linear-timeapproximation algorithm for weighted matchings in graphsrdquoACMTransactions on Algorithms vol 1 no 1 pp 107ndash122 2005

[23] A J J Wood W E Evans and H L McLeod ldquoPharmacoge-nomicsmdashdrug disposition drug targets and side effectsrdquo TheNew England Journal of Medicine vol 348 no 6 pp 538ndash5492003

[24] R B Altman ldquoPharmgkb a logical home for knowledge relatinggenotype to drug response phenotyperdquoNature Genetics vol 39no 4 article 426 2007

[25] R B Kim D OrsquoShea and G R Wilkinson ldquoInterindivid-ual variability of chlorzoxazone 6-hydroxylation in men andwomen and its relationship to CYP2E1 genetic polymorphismsrdquoClinical Pharmacology amp Therapeutics vol 57 no 6 pp 645ndash655 1995

[26] T N Ferraro and R J Buono ldquoThe relationship between thepharmacology of antiepileptic drugs and human gene variationan overviewrdquo Epilepsy and Behavior vol 7 no 1 pp 18ndash36 2005

[27] A Albin X Ji T B Borlawsky et al ldquoEnabling online studiesof conceptual relationships between medical terms developingan efficient web platformrdquo JMIR Medical Informatics vol 2 no2 article e23 2014

[28] S Chandrasekaran J W Dean III M S Giniger and M LTanzer ldquoLaminin carbohydrates are implicated in cell signal-ingrdquo Journal of Cellular Biochemistry vol 46 no 2 pp 115ndash1241991

[29] J Uitto and D Kouba ldquoCytokine modulation of extracellularmatrix gene expression relevance to fibrotic skin diseasesrdquoJournal of Dermatological Science vol 24 pp S60ndashS69 2000

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 3: Research Article Building Integrated Ontological Knowledge

BioMed Research International 3

Disease

Infectious disease

Cancer

Skin cancer

Flu

Hepatitis B

Medicine

Radiation Vaccination

Brachy-therapy

Radio-isotape

Influenza vaccine

HBV vaccine

Thyroid cancer middot middot middot

Figure 2 Another example of integrating the two ontologies in Figure 1 This integration violates Criterion (1)

optimal as well as efficient approximation solutionsfor this problem

(ii) We extended the basic problem to handle the integra-tion of large number of ontologies and we developedboth greedy and fast approximation algorithms forthe extended problem

(iii) We studied the proposed algorithms on both real andsynthetic datasets and confirmed their effectiveness inintegrating large volume of ontology datasets

13 Related Work Automatic ontology generation and inte-gration are desirable in many applications and have beenstudied in the past decade Although available methods forautomatic ontology generation produce ontologies from agiven type of data such as gene networks [8] textual data [9]dictionary [10] and schemata [11 12] they do not contributeto the integration of different types of ontologies which willbring innovative results on annotationknowledge reuse andassociation studies To address this issue a number of studieshave been focused on ontology integration [13 14] and itsmedical domain applications [15] The ontology integrationmethods available and used in these works can be generallyclassified into three categories

Manually or Semiautomatic Setups In [6] the authors pre-sented a methodology for ontology integration by custom-tailored integration operations which include algebraic-specified 39 basic operations and 12 nonbasic operationsderived from them The authors identified a set of criteriasuch as modularize specialize and diversify each hierarchyfor guiding the knowledge integration In [16] the authorsdesigned a semiautomatic approach for ontology mergingThe ontology developers will be assisted by the system andguided to tasks needing their interventions

Using Machine Learning Methods Reference [17] describesan ontology mapping system GLUE that uses machinelearning techniques for building ontology mappings Specif-ically GLUE uses multiple learning strategies Each type ofinformation from the ontologies is handled by a different

learning strategy The authors demonstrate that GLUE workseffectively on taxonomy ontologies Similarly [18] also usedmultistrategy learning in matching pair identification How-ever the ontologies used in the experiments of [18] containless than 10 nodes Although [17] studied the integrationof larger ontologies in the experiments those ontologiescontain only around 34 to 176 nodesmuch smaller thanmanyontologies used in the biomedical field

Using Heuristic Approaches Many automatic ontology inte-gration methods [19 20] fall into this categoryThey performontology integrations by using heuristic approaches from dif-ferent perspectives For example [20] uses heuristic policiesfor selecting axioms and candidate merging pairs From aquite different angle [19] uses view-based query for guidingthe ontology integrations

Thesemethods have a fewmajorweaknesses including (1)lack of a systematicmeasurement to quantify the goodness forthe ontology integration (2) being generally heuristic withno theoretical results to show that the proposed integrationapproach is globally optimal or close to optimal (3) being notfor integrating large volume of ontologies These weaknessesmotivated us to develop efficient and near optimal solutionsfor integrating large ontology datasets

2 Methods

21 Integrating a Pair of Ontologies In this section wefocus on the basic problem of integrating two ontologies asformulated in Section 11 We will prove optimal structuresin the problem and propose an optimal and an efficientapproximation solution for this problem These solutions arealso the basis for solving the problem of integrating a largenumber of ontologies as described in Section 22

211 Brutal-Force and Heuristic Solutions Given Criteria(1) and (2) a brutal-force approach will pick up a best solu-tion fromall the solutions that startwith integration involvingat least one of the roots of the two ontology trees and iter-atively integrate their descendants Considering an extremecase where each ontology tree is a path of 119899 vertices we

4 BioMed Research International

push (0 0) into queue 119902 Integrating virtual roots of the two ontologieswhile |119902| gt 0 do(119886 119887) = 119901119900119901(119902)for 119894 isin children of 119886 on 119879

119860do

119898119886119909 119904119888119900119903119890 = minus1119905119900 119898119890119903119892119890 = 119873119880119871119871119905119900 119898119890119903119892119890 119903119900119900119905 = 119873119880119871119871for 119895 isin children of 119887 on 119879

119861do

if 119895 is chosen thencontinue The subtree rooted at 119895 can only be chosen once for integration as illustrated in Figure 3

elseidentify vertex 119896 in the subtree rooted at 119895 such that argmax

119896(119872119860119861(119894 119896)120573

distance(119896119895))

if 119872119860119861(119894 119896) gt 119898119886119909 119904119888119900119903119890 then

119898119886119909 119904119888119900119903119890 = 119872119860119861(119894 119896)

119905119900 119898119890119903119892119890 = 119896119905119900 119898119890119903119892119890 119903119900119900119905 = 119895

end ifend if

end forif 119905119900 119898119890119903119892119890 = 119873119880119871119871 thenbreak

elsesave merge pair (119894 119905119900 119898119890119903119892119890) in 119879

119860119861

mark 119905119900 119898119890119903119892119890 119903119900119900119905 as chosenpush (119894 119905119900 119898119890119903119892119890) into queue 119902

end ifend for

end whilereturn 119879

119860119861

Algorithm 1 HeuristicMerge(119879119860 119879119861119872119860119861)

conclude that such a brutal-force approach needs to pick upa best solution from an exponential number of solutionsThebrutal-force approach is clearly not acceptable for integratinglarge ontologies and may not even work for ontologies withonly a few dozens of vertices

A heuristic solution can be developed by following anidea similar to the above brutal-force approach Howeverinstead of trying all possibilities the heuristic solution willgreedilymerge vertices following the topological orderWhenselecting a matching vertex for vertex 119886 from ontology119879119860 the heuristic approach will greedily select a vertex 119887

from allowable candidates in 119879119861and iteratively apply such

selections to descendants of 119886 According to Criterion (1) if 119886is associated with 119887 then none of 119886rsquos descendants are allowedto be associated with vertices other than 119887rsquos descendants Inaddition if 119886rsquos child 1198861015840 is associated with 119887rsquos child 1198871015840 or itsdescendants then none of 119886rsquos other children are allowed tobe associated with 1198871015840 or its descendants any more Given thisa greedy choice may very easily end up in a local optimum bychoosing a best matching vertex at one step while denyingintegrating opportunities of other vertices that may lead to abetter final solution

It is easy to see that the deeper a vertex being chosen forintegration is the more integration opportunities are lost Toalleviate such a situation we propose a greedy approach byconsidering the relative depth (rdepth) of a chosen vertexwith regard to an allowable vertex closest to the root That is

given a vertex 119886 from 119879119860 a vertex 119887 from 119879

119861is chosen when

119872119860119861(119886 119887)120573

rdepth(119887) is maximized When 120573 = 1 the depthinformation does not take effective and when 120573 = infin eachvertex will only be associated with an allowable vertex closestto the root

Algorithm 1 describes the pseudocode of the heuristicintegration It starts by integrating virtual roots of the twoontologies After that the integration will be carried outiteratively from top to bottom by following Criterion (1) andthe heuristic strategy described above In the empirical studywe will see that the heuristic algorithm works better whenthe depth information is considered However in terms ofthe overall cohesion score it is no match for our optimal andapproximation solutions as described below

212 Optimal and Approximation Solutions By dividing theintegration of two trees into node merging and subtreeintegrations we have identified optimal structures in thebasic problem as stated by Lemmas 1 and 2 These optimalstructures make it possible for us to develop efficient algo-rithms (Algorithms 2 and 3) that achieve optimal or approx-imation solutions In the following we first describe thetwo important lemmas suggesting the optimal structures andtheir proofs before describing our proposed algorithms

Lemma 1 Let 119903119886be the root of tree119879

119860and 119903119887the root of tree119879

119861

Let119879119860minus119903119886

and119879119861minus119903119887

represent two sets of sub trees rooted at 119903119886of

BioMed Research International 5

Case (2a) Case (2b) Case (3a) Case (3b)

Case (1) Case (2) Case (3)

y2

y2

y2 y2 y2

y2

y1

y1

y1 y1y1

y1

Ty1

Ty2

x

xx

xx

x

ra rb

Txy1y2

tx

Figure 3 An illustration of three cases in the proof of Lemma 2

Sort vertices in 119879119860and 119879

119861in the topological order

for 119894 = |119881(119879119860)| to 0 do

for 119895 = |119881(119879119861)| to 0 do

119904119888119900119903119890 = 119872119860119861(119894 119895) + 119872119886119909119872119886119905119888ℎ(119879

119860minus119894 119879119861minus119895)

119904119888119900119903119890119886= 119872119886119909119872119886119905119888ℎ(119879

119860minus119894 119879119895)

119904119888119900119903119890119887= 119872119886119909119872119886119905119888ℎ(119879

119894 119879119861minus119895)

119888119900ℎ119890119904119894119900119899 119898119886119905119903119894119909(119894 119895) = max(119904119888119900119903119890 119904119888119900119903119890119886 119904119888119900119903119890

119887)

end forend forreturn 119888119900ℎ119890119904119894119900119899 119898119886119905119903119894119909

Algorithm 2 BuildCohesionMatrix(119879119860 119879119861)

119879119860and 119903119887of119879119861 respectively119879

119860minus119903119886does not include 119903

119886and119879119861minus119903119887

does not include 119903119887 One has 119892(119879

119860 119879119861) = max(119872

119860119861(119903119886 119903119887) +

119892(119879119860minus119903119886 119879119861minus119903119887) 119892(119879119860 119879119861minus119903119887) 119892(119879119860minus119903119886 119879119861))

Proof We can divide the integration of tree 119879119860with tree 119879

119861

into two cases according to the merging of their roots

(1) The roots of 119879119860and 119879

119861are merged together

(2) The roots of 119879119860and 119879

119861are not merged together

For case (1) it is clear that the cohesion score is119872119860119861(119903119886 119903119887)+

119892(119879119860minus119903119886 119879119861minus119903119887)

For case (2) we conclude that either 119879119860is integrated with

119879119861minus119903119887

(119903119887is out of integration) or 119879

119861is integrated with 119879

119860minus119903119886

(119903119886is out of integration) Otherwise we will have a merged

tree 119879119860119861

with two roots 119903119886and 119903

119887 a contradiction to the

fact that 119879119860119861

is a tree Therefore the cohesion score is either119892(119879119860 119879119861minus119903119887) or 119892(119879

119860minus119903119886 119879119861)

Combining cases (1) and (2) and according to Criterion(2) we have

119892 (119879119860 119879119861) = max (119872

119860119861(119903119886 119903119887) + 119892 (119879

119860minus119903119886 119879119861minus119903119887)

119892 (119879119860 119879119861minus119903119887) 119892 (119879

119860minus119903119886 119879119861))

(1)

Lemma 2 Let 119879119860minus119903119886

and 119879119861minus119903119887

represent two sets of treesobtained by removing the root vertices 119903

119886from 119879

119860minus119903119886and 119903119887

from 119879119861minus119903119887

One has

119892 (119879119860minus119903119886 119879119861minus119903119887) = max( sum

(119879119909 119879119910)isin119877

(119892 (119879119909 119879119910))) (2)

Here 119879119909isin 119879119860minus119903119886

119879119910isin 119879119861minus119903119887

and 119877 is a matching of trees in119879119860minus119903119886

with trees in 119879119861minus119903119887

Proof To prove this lemma we first prove that for any tree119879119909isin 119879119860minus119903119886

it can be integrated with no more than one treein 119879119861minus119903119887

We will prove this claim by contradiction Assumethere are two trees 119879

1199101isin 119879119861minus119903119887

and 1198791199102isin 119879119861minus119903119887

and theyintegrate with a tree119879

119909isin 119879119860minus119903119886

into an integrated tree11987911990911991011199102

There are three cases for the root 119903 of 119879

11990911991011199102 as illustrated in

Figure 3

(1) 119903 contains only the root of 119879119909

(2) 119903 contains only the root of 1198791199101or 1198791199102

(3) 119903 contains the root of 119879119909and the root of 119879

1199101or 1198791199102

For case (1) the lowest common ancestor of the roots of 1198791199101

and 1198791199102in the integrated tree 119879

11990911991011199102will no longer contain

their lowest common ancestor in 119879119861 a contradiction to

6 BioMed Research International

push (0 0119873119880119871119871) into queue 119902while |119902| gt 0 do(119886 119887 119888) = 119901119900119901(119902)if 119886 = minus1 thenmake 119879

119861(119887) as a subtree of 119879

119860119861rooted at 119887 119879

119861(119887) is a sub tree of 119879

119861rooted at 119887

else if 119887 = minus1 thenmake 119879

119860(119886) as a subtree of 119879

119860119861rooted at 119886 119879

119860(119886) is a sub tree of 119879

119860rooted at 119886

elselet 119889 = (119886 119887) and make 119889 a child of 119888 on 119879

119860119861

119904119888119900119903119890 = 119872119860119861(119886 119887) + 119872119886119909119872119886119905119888ℎ(119879

119860minus119886 119879119861minus119887)

119904119888119900119903119890119886= 119872119886119909119872119886119905119888ℎ(119879

119860minus119886 119879119887)

119904119888119900119903119890119887= 119872119886119909119872119886119905119888ℎ(119879

119886 119879119861minus119887)

if 119904119888119900119903119890 gt= 119904119888119900119903119890119886ampamp 119904119888119900119903119890 gt= 119904119888119900119903119890

119887then

push matching results of 119879119860minus119886 119879119861minus119887

with 119889 into 119902else if 119904119888119900119903119890

119886gt= 119904119888119900119903119890

119887then

push matching results of 119879119860minus119886 119879119887with 119889 into 119902

elsepush matching results of 119879

119886 119879119861minus119887

with 119889 into 119902end if

end ifend whilereturn 119879

119860119861

Algorithm 3 BuildNewOnto(cohesion matrix 119879119860 119879119861)

Criterion (1) For cases (2) and (3) the root of 1198791199102(or the root

of 1198791199101) will be the descendant of the root of 119879

1199101(or the

root of 1198791199102) in the integrated tree 119879

1199091199101 1199102 a contradiction to

Criterion (1) For integration involving more than two treesfrom 119879

119861minus119903119887 we can still follow the above procedure to reach

contradictions Thus the claim is provenWithout loss of generality we can see that for any

tree 119879119910isin 119879119861minus119903119887

it can be integrated with no more thanone tree in 119879

119860minus119903119886 Therefore the integration between 119879

119860minus119903119886

and 119879119861minus119903119887

corresponds to a matching in a weighted bipar-tite graph in which two sets of nodes represent treesfrom 119879

119860minus119903119886and 119879

119861minus119903119887 respectively and edges represent

corresponding cohesion scores According to Criterion (2)119892(119879119860minus119903119886 119879119861minus119903119887) = max(sum

(119879119909 119879119910)isin119877(119892(119879119909 119879119910))) and we con-

clude that 119892(119879119860minus119903119886 119879119861minus119903119887) corresponds to the weight of amax-

imum weighted matching in the above bipartite graph

Given Lemma 2 we can see that the following corollary iscorrect

Corollary 3 Define MaxMatch(119883 119884) = sum(119879119909 119879119910)isin119877

119892(119879119909 119879119910)

where 119877 is a maximum weighted matching of trees in forests119883 and 119884 given 119892(119879

119909 119879119910) for any tree pair (119879

119909 119879119910) isin 119883 times 119884

One concludes that for any two forests 119879119860and 119879

119861 119892(119879119860 119879119861) =

MaxMatch(119879119860 119879119861)

With Lemma 1 and Corollary 3 we are able to designan efficient dynamic programming algorithm achieving theglobal optimum for the ontology integration problem Thepseudocode for calculating the maximum cohesion score isdescribed in Algorithm 2 which visits ontology vertices inreverse topological order when filling up the cohesionmatrix

At the end of Algorithm 2 the cohesion matrix is filledup with optimal cohesion scores and the maximum cohesionscore is saved at entry (0 0) as described byTheorem 4

Theorem 4 It holds that cohesion matrix(119894 119895) = 119892(119879119860minus119894

119879119861minus119895) and cohesion matrix(0 0) = 119892(119879

119860 119879119861)

Proof We will prove this theorem by mathematical induc-tion

Let |119881(119879119860)| = 119899 and |119881(119879

119861)| = 119898 it is easy to see

that cohesion matrix(119899119898) is optimal because 119899 and 119898correspond to leaf nodes in the topological order Thus119879119860minus119899

and 119879119861minus119898

are empty sets and cohesion matrix(119899119898) =119872119860119861(119899119898) = 119892(119879

119899 119879119898)

When 119894 = 119899 and 119895 lt 119898 according to Lemma 1 to integrate119879119899with 119879

119895 either 119879

119899(the leaf vertex) is merged with 119879

119895or 119879119899

is integrated with 119879119861minus119895

The maximum score of integrating119879119899with 119879

119861minus119895is available at the time cohesion matrix(119899 119895) is

being calculated because of the reverse topological visitThusaccording to both Lemma 1 and Corollary 3 we concludethat 119892(119879

119899 119879119895) = max(119872

119860119861(119899 119895)MaxMatch(119879

119899 119879119861minus119895)) =

cohesion matrix(119899 119895) Similarly we can conclude that119892(119879119894 119879119898) = max(119872

119860119861(119894 119898)MaxMatch(119879

119860minus119894 119879119898)) =

cohesion matrix(119894 119898)When 119894 lt 119899 and 119895 lt 119898 again according to

Lemma 1 and Corollary 3 we have 119892(119879119894 119879119895) = max(score

score119886 score

119887) = cohesion matrix(119894 119895) Due to the reserve

topological visit MaxMatch(119879119860minus119894 119879119861minus119895) MaxMatch(119879

119860minus119894

119879119895) and MaxMatch(119879

119894 119879119861minus119895) are available at the time

cohesion matrix(119894 119895) is being calculated

Given the definition of 119892(119879119860 119879119861) Theorem 4 in fact

proves that the proposed approach achieves the global opti-mum However the global optimal solution is built upon

BioMed Research International 7

themaximumweightedmatching (recall Corollary 3) As dis-cussed in Section 213 the maximum weighted matching istime-consuming and therefore we propose an approximationsolution in that section

Although Algorithm 2 builds the cohesion matrix withoptimal cohesion scores it does not construct the integratedontological knowledge structure We may save the ontologyintegration details along with the cohesion scores Howeverthat will cost 119874(1198993) memory space (assuming each ontologyhas 119874(119899) vertices) and significantly reduce the capacity ofthe algorithm in handling large ontology integrations Quiteinterestingly we find that it is not necessary to save the inte-gration details in order to construct the integrated ontologyThe construction can be done by a process reverse to theconstruction of cohesionmatrix as described inAlgorithm 3

Algorithm 3 uses the cohesion matrix constructed byAlgorithm 2 and builds the integrated ontology tree still byfollowing Lemma 1 and Corollary 3 but in a reverse way ofAlgorithm 2 The construction is performed in a Breadth-First fashion which uses a queue 119902 to maintain triples Eachtriple (119886 119887 119888) is an association of three elements 119886 is thematched vertex from ontology 119879

119860 119887 is the matched vertex

from ontology 119879119861 and 119888 is their parent on the merged

ontology 119879119860119861 By following the basic idea of the proof of

Theorem 4 we can show that Algorithm 3 builds an optimalintegrated ontology with the cohesion matrix provided fromAlgorithm 2 We omit the proof for succinctness

213 Time Complexity Analysis and an Approximation Solu-tion Assume an ontology size is 119874(119899) The cohesion matrixhas 119874(1198992) entries to fill up The computation for each entryis a matching whose time complexity depends on the imple-mentation The maximum weighted matching takes 119874(1198993)using the famous Hungarian algorithm [21] and althoughit achieves optimum it is too costly for large ontologiesThe maximal weighted matching however takes 119874(1198992 log 119899)time and more importantly results in an overall 119874(1198992 log 119899)time complexity for Algorithm 2 The analysis is given inthe following For each matching the algorithm will accesspreviously filled entries and each entry will be accessed onlyonce and be involved only once in a sorting of 119874(log 119899)time This is because each entry corresponds to two verticeswhose cohesion score will be accessed when calculating thecohesion score of their parents Thus we conclude that thetotal time complexity of calculating the cohesionmatrix usingmaximal weightedmatching is119874(1198992+1198992 log 119899) = 119874(1198992 log 119899)Since building the new ontology has the same time com-plexity as building the cohesion matrix this is also the totaltime complexity for integrating two ontologies by maximalweightedmatchingThemaximal weightedmatching also hasa guaranteed lower bound on the results It achieves a (12)-approximation solution (ie the overall cohesion score willbe at least 12 of the optimal cohesion score) as pointed out in[22]

Since themaximalweightedmatching results in an overallgood performance on time complexity and approximationrate we used the maximal weighted matching in our empir-ical study for Algorithms 2 and 3 Readers may also choose

othermatching algorithms (such as the one described in [22])to achieve slightly better approximation rates However theweightedmatching is a replaceablemodule for our algorithmsand it is not the focus of this work to build a fast and close-to-optimal weighted matching algorithm

Compared to the time complexity of integrating ontolo-gies by the dynamic programming approach as describedin Algorithms 2 and 3 the heuristic approach described atthe beginning of this section also has 119874(1198992) in the worstcase However we conjecture that the heuristic approach hasa much smaller average time complexity because in eachstep the heuristic approach may exclude a large number ofmatching opportunities

22 Integrating Multiple Ontologies In the previous sectionwe proposedmethods for integrating two ontologies In somebiomedical applications [23 24] we are interested in theassociations involving more than two objects Integrationof multiple ontologies of these objects will generate aninnovative view on these complex relationships Similar tothe basic problem formulation we can formulate themultipleontology integration as follows

Given 119896 ontology trees 1198791 1198792 119879

119896and a closeness

matrix119872119894119895for any two trees 119879

119894and 119879

119895 how can we efficiently

generate an integrated ontology tree 11987912119896

meeting thefollowing criteria

(1) For any two vertices 119909 and 119910 in a tree 119879119894 their

lowest common ancestor LCA119879119894(119909 119910) is contained by

LCA11987912119896(119909 119910)

(2) It holds that argmax11987912119896119891(11987912119896

) =

sum119883isin119881(119879119860119861)

sum119906isin119883Visin119883120590(119906)lt120590(V)119872120590(119906)120590(V)(119906 V) Here

119872120590(119906)120590(V)(119906 V) is the entry value in the closeness

matrix for two vertices 119906 and V (one from tree 119879120590(119906)

and the other from tree 119879120590(V)) contained in the node

119883 For a vertex V from an original ontology 120590(V) isdefined as its original ontology ID

Again we name the function 11989111987912119896

the cohesion of theintegrated ontology 119879

12119896 For each node 119883 in the inte-

grated ontology we define its weight as 119908119890119894119892ℎ119905(119883) =sum119906isin119883Visin119883120590(119906)lt120590(V)119872120590(119906)120590(V)(119906 V) Correspondingly we define

function119892(1198791 1198792 119879

119896) = max

11987912119896(sum119883isin119881(11987912119896)

119908119890119894119892ℎ119905(119883))

as the maximum cohesion function for integrating theontologies 119879

1 1198792 119879

119896 As we can see in the above formula-

tion the overall cohesion score of integration is the summedweight of each node which is the sum of pairwise closenessscores

The formulation of multiple ontology integration is sim-ilar to the basic version and it is not difficult to showthat optimal structures described in Lemmas 1 and 2 canbe extended to a high dimension However the extensionof algorithms described in Section 212 for integrating twoontologies is not feasible for solving the multiple ontologyintegrationThis is because if we need to extend Algorithms 2and 3 to this problem we need to build a cohesionmatrix of 119896dimensions It implies that we need at least 119874(119899119896) operationsto fill up the score matrix assuming the size of an ontology

8 BioMed Research International

A B C D

A 10 6 8

B 10 7 5

C 6 7 9

D 8 5 9

AB C D

AB

C 9

D 9

lowastlowast

lowast

lowast mdash

mdash

mdash

mdash

mdash

mdash

mdash

(lowast) Entries to update

Figure 4 An example of the InterOntology matrixrsquos change at the first iteration in Algorithm 4 for integrating four ontologies

build the 119896 times 119896 InterOntology matrixfor 119894 = 1 to 119896 minus 1 doidentify the active tree pair ⟨119879

119883 119879119884⟩ that corresponds

to the highest score in the InterOntology matrixintegrate 119879

119883and 119879

119884into 119879

119883119884

mark 119879119883and 119879

119884as inactive

update relationship matricesupdate InterOntology matrix

end forreturn 119879

12119899

Algorithm 4 GreedyMultiInt(T = 1198791 1198792 119879

119896)

is 119874(119899) This is clearly not acceptable for high dimensionalontology integration

221 Greedy Approach From the above discussion we cansee that direct extension of Algorithms 2 and 3 for integratingtwo ontologies is practically not feasible for integrating alarge number of ontologies However we can still use thesealgorithms for integrating multiple ontologies by iterativelyintegrating two ontologies and generating a new closenessmatrix Given the ontologies 119879

1 1198792 119879

119896 we can first

integrate 1198791and 119879

2into 119879

12and then build the closeness

matrix between 11987912

and 1198793using the relationship matrices

between 1198791and 119879

2and between 119879

1and 119879

3 Specifically

assume 119883 is a node on the integrated ontology 11987912 and 119883

contains a vertex 119886 from 1198791and a vertex 119887 from 119879

2 Then

the entry (119883 119888) of the closeness matrix between 11987912

and 1198793

is 11987211987912 1198793(119883 119888) = 119872

1198791 1198793(119886 119888) + 119872

1198792 1198793(119887 119888) After the new

closeness matrix is generated we can continue integrating11987912

and 1198793into 119879

123and generating another new closeness

matrix We will eventually get the integrated ontology 11987912119896

by repeating the above process To facilitate the followingdiscussions we name the above approach the basic multipleintegration approach

Although the basic multiple integration approach canfinish integrating multiple ontologies it blindly integratesontologies without using any cohesion information betweenontologies that may lead to a better integration resultTo improve the basic multiple integration we propose agreedy approach that uses the cohesion information betweenontologies to guide the integration The basic steps of thegreedy approach are outlined in Algorithm 4 To facilitate

the understanding of Algorithm 4 we use Figure 4 toillustrate an example of the InterOntology matrixrsquos change atthe first iteration of integrating four ontologies A B C andD

The key idea in Algorithm 4 is to maintain an InterOn-tology matrix which guides the integration Initially thismatrix is filled with the overall cohesion score of everypair of ontologies In each step this matrix is updated withoverall cohesion scores between newly integrated ontologyand existing active ontologies The integration will take placebetween two active ontologies which have the highest scorein the InterOntology matrix

When we used the overall cohesion score between twoontologies to update the InterOntology matrix we observedan interesting phenomenon that the integration inmost casesis a process continuously expanding an integrated ontologyConsequently the greedy approach is likely to yield a resultsimilar to the basic approach

This phenomenon can be explained by the definition ofmaximum cohesion function which takes into account allpairwise closeness between merged terms Thus the moreontologies contained in an integrated ontology are the morelikely it will have high overall cohesion scores with otherontologies As a result it creates unfairness for the integrationselection To fix this issue we use the adjusted overall cohe-sion scores in updating the InterOntology matrix as follows

Given an ontology 119879119883and an ontology 119879

119884where 119883 and

119884 are nonempty sets of ontology IDs we define the adjustedcohesion score between 119879

119883and 119879

119884as 119860119889119895119862119900ℎ(119879

119883 119879119884) =

sum119885isin119881(119879119883119884)

sum119909isin119885119910isin119885120590(119909)isin119883120590(119910)isin119884

119872120590(119909)120590(119910)

(119909 119910)|119883||119884| where119879119883119884

is the integrated ontology built by Algorithms 2 and 3The adjusted cohesion score is in fact the weight increase byintegrating 119879

119883and 119879

119884 divided by the size of 119883 times the

size of 119884 that is119860119889119895119862119900ℎ(119879119883 119879119884) = sum

119885isin119881(119879119883119884)(119908119890119894119892ℎ119905(119885))minus

sum119883isin119881(119879119883)

(119908119890119894119892ℎ119905(119883)) minus sum119884isin119881(119879119884)

119908119890119894119892ℎ119905(119884)|119883||119884| For eachnode merging closeness scores will be added to the totalweight when the merging takes place between vertices fromontology set 119883 and vertices from ontology set 119884 Thus theweight increase by integrating 119879

119883and 119879

119884is proportional to

the number of ontologies in119883 times the number of ontologiesin 119884 and consequently we averaged the weight increase by|119883||119884|

222 Fast Approximation Algorithm Although the basicmultiple integration and the greedy multiple integrationapproaches discussed above are able to integrate multipleontologies none of them provide any guarantee on the results

BioMed Research International 9

A

A

7

10 8

9

6

5

9

D

C

B D

C

B

ge 7

ge8

Figure 5 An illustration of vertexedge contraction and weight updates in an iteration of Algorithm 5 Each vertex represents an ontology

for 119894 = 1 to 119896 minus 1 dofor 119895 = 119894 + 1 to 119896 dopush ⟨119892(119879

119894 119879119895) 119894 119895⟩ into a set 119878 ordered by the first element in descending order

end forend forwhile |119878| gt 0 do⟨119911 119909 119910⟩ = 119901119900119901(119878)if adding (119909 119910) does not form a cycle inG thenadding (119909 119910) toGintegrate 119879

119862(119909)and 119879

119862(119910)into 119879

119862(119909)cup119862(119910) 119862(V) is a set of vertices including V that form a connected component inG

end ifend whilereturn 119879

12119899

Algorithm 5 FastMultiInt(T = 1198791 1198792 119879

119896)

in comparison with the optimal solutions By studyingthe maximum cohesion scores between ontologies undera graph setting we identified an approximation structureand developed an approximation algorithm for integratingmultiple ontologiesWename it fast approximation algorithmbecause it not only has a lower bound on the results but alsoruns faster than the greedy multiple integration algorithmproposed above

The fast approximation algorithm for integratingmultipleontologies is sketched in Algorithm 5 It only calculates themaximum cohesion score between every pair of ontologiesonce during the initial stage and uses this informationthroughout the integration process even after it becomesstaleMore importantly this approach not only saves the timefor recalculating the maximum cohesion scores but also pro-vides a lower bound guarantee as stated inTheorem 5 whosecorrectness is built on two important lemmas (Lemmas 6 and7) which will be described subsequently

Theorem 5 The tree weight of the integrated tree 11987912119896

obtained by FastMultiInt algorithm is at least 1(119896 minus 1) theweight of the optimal solution

Proof We will use Lemmas 6 and 7 to prove this theoremThe proofs of Lemmas 6 and 7 are provided after the proofof this theorem To facilitate the proof of this theorem we

build a fully connected weighted graph G in which eachnode corresponds to a tree for integration and the weightof each edge corresponds to the weight increase (initiallythis is the cohesion score) for integrating the correspondingtrees According to Lemma 6 119892(119879

1 1198792 119879

119896) (ie optimal

cohesion score) is no more than the summed weight of edgesinG (Claim 1)

Given G the integration by Algorithm 5 is a processof 119896 minus 1 node contractions After each contraction theadjacent edge weights (cohesion scores) will be updatedaccordingly According to Lemma 7 the weight of an updatededge will only increase over (or at least remain the same as)the maximum weight of the two contracted edges (Figure 5provides an illustration of vertexedge contraction andweightupdates) Thus the overall cohesion score of the integrationby Algorithm 5 is no less than the weight of the maximumspanning tree ofG (Claim 2)

It is easy to see that the weight of a maximum spanningtree is no less than 1(119896minus1) of the summed edge weight ofGgiven the simple observation that each edge in G is either anedge of the maximum spanning tree or adjacent to an edge ofthe maximum spanning tree with an equal or heavier weight(Claim 3)

Combining Claims 1 2 and 3 we complete the proof ofthis theorem

Lemma 6 It holds that 119892(1198791 1198792 119879

119896) le sum1le119894lt119895le119896

119892(119879119894 119879119895)

10 BioMed Research International

Proof According to the problem definition

119892 (1198791 1198792 119879

119896)

= max( sum

119883isin119881(11987912119896)

119908119890119894119892ℎ119905 (119883))

= max( sum

119883isin119881(11987912119896)

sum

119906isin119883Visin119883120590(119906)lt120590(V)119872120590(119906)120590(V) (119906 V))

= ( sum

1le119894lt119895le119896

11989111987912119896

(119879119894119895)) le sum

1le119894lt119895le119896

119892 (119879119894 119879119895)

(3)

11989111987912119896(119879119894119895) is the cohesion score of 119879

119894119895whose integration is

induced from 11987912119896

Lemma 7 It holds that 119892(119879119875119876 119879119878) minus 119891(119879

119875119876) ge max(119892(119879

119875

119879119878) 119892(119879119876 119879119878))

Proof According to the problem definition integrating 119879119875119876

with 119879119878will result in an integrated tree 119879

119880where 119880 = 119875 cup

119876 cup 119878 and 119892(119879119875119876 119879119878) = max

119879119880(sum119883isin119881(119879119880)

119908119890119894119892ℎ119905(119883)) =

max119879119880(119891(119879119875119878) + 119891(119879

119876119878) + 119891(119879

119875119876)) where 119879

119875119878 119879119876119878 and

119879119875119876

are induced from 119879119880 Since 119879

119875119876has been determined

we have 119892(119879119875119876 119879119878) = max

119879119880(119891(119879119875119878) + 119891(119879

119876119878)) + 119891(119879

119875119876)

Without loss of generally let us assume max(119891(119879119875119878)) ge

max(119891(119879119876119880)) Then by restricting the integration between

119879119875and 119879

119878in 119879119880following the integration that leads to

argmax119879119875119878119891(119879119875119878) we will get a cohesion score no less than

119892(119879119875 119879119878) Thus we complete the proof for 119892(119879

119875119876 119879119878) minus

119891(119879119875119876) ge max(119892(119879

119875 119879119878) 119892(119879119876 119879119878))

223 Time Complexity Analysis For the fast approximationalgorithm (Algorithm 5) the time complexity for generatinggraph G (calculating the overall cohesion score for everypair of ontologies) is119874(11989621198992 log 119899) assumingwe usemaximalweighted matching Each integration will take 119874(1198992 log 119899)with an update of at most 119896 closeness matrices which takes119874(119896lowast119899

2) There are at most 119896 integrations therefore the total

time complexity is still 119874(11989621198992 log 119899)Following the above analysis we conclude that the time

complexity of the greedy multiple integration algorithm isthe same as the fast approximation algorithm However itrequires updating of InterOntology matrix which takes anexcessive 119874(11989621198992 log 119899) time The empirical study shows thatthe fast approximation algorithm is much faster than thegreedy multiple integration algorithm

Finally it is easy to see that the basic multiple integrationapproach takes 119874(1198961198992 log 119899) time and is the fastest but itsoverall cohesion scores are the worst as we will see in theempirical study

224 Limitations Integrating multiple ontologies may facetwo potential problems in real applications First how canwe efficiently generate a closeness matrix for every pair ofontologies to be integrated Our current method kDLS or

onGrid is efficient for generating the closenessmatrix for onepair of ontologies in most cases but not efficient enough forgenerating closeness matrices for many pairs of ontologiesSecond not every pair of ontologies can be meaningfullyintegrated It remains a problem to efficiently identify thefeasibility of integrating a pair of ontologies Therefore themain purpose of Section 22 is to demonstrate that ourproposed approach can be extended to integrate multipleontologies and we use synesthetic datasets in Section 33 tostudy the performance of algorithms proposed in Section 22

3 Results and Discussion

We would like to study the performances of the proposedontology integration methods by experiments on both realand synthetic datasets We implemented five approaches inC++

(1) Heuristic heuristic approach for integrating twoontologies as described in Section 211

(2) Approximate approximate approach (Algorithms 2and 3) with maximal weighted matching for guaran-teeing the (12)-approximation rate

(3) Basic basic multiple integration approach as de-scribed in Section 221

(4) Greedy greedy multiple integration approach(Algorithm 4)

(5) FastApproximate fast approximation multipleintegration approach (Algorithm 5)

In the following we report our study on the performancesof (1) and (2) for integrating two ontologies on real datasetsand (3) (4) and (5) for integrating multiple ontologies onsynthetic datasets All the experiments are carried out on aLinux cluster with 24GHz AMD Opteron processors

31 Integrating a Pair of Ontologies The knowledge of drug-gene relationships is desirable in many pharmacology appli-cations [25 26] By integrating the gene ontology and thedrug ontology we will be able to obtain rich information onthe associations between drugs and genes under the ontologystructures Thus in this set of experiments we simulate realworld knowledge discovery applications by integrating tworeal ontologies gene ontology (GO) and National Drug FileReference Terminology (NDFRT) Both were obtained fromthe Unified Medical Language System (version 2012AA)The closeness matrices between GO terms and drug termswere generated using onGrid [27] with a 4-neighborhoodbroadcast range (ie 119896 = 4 with regard to [7]) onGridfollows the kDLS approach [7] and measures the closenessbetween two concepts based on the discovered paths (withlength greater than one) between them However unlikekDLS onGrid takes into consideration of concept semantictypes in the closeness measurement In the study performedin [27] the advantages of onGrid over kDLS are wellillustrated

The overall cohesion scores of Heuristic andApproximate on integrating GO and NDFRT are listed

BioMed Research International 11

Table 1 Cohesion scores of integrating real datasets

Depth GO term number NDFRT term number Cohesion scoresHeuristic 120573 = 1 Heuristic 120573 = 6 Heuristic 120573 = 100 Approximate

3 66 6004 00505331 0229958 021999 1246964 710 6972 0290392 0284363 146585 938355 5355 14582 0285923 137714 0528289 3390566 16231 32841 0307941 0341588 0673406 740293

Table 2 Top 5 matched terms by the Approximate algorithm (depth = 6)

Rank GO terms NDFRT terms Closeness score1 C1135918 smooth muscle contractile fiber C0282606 muscle neoplasms 1632052 C0010813 cytokinesis C0086376 GTP-binding proteins 1159673 C0027747 axon terminus C0030584 parovarian cyst 1133524 C1155065 T cell activation C0007082 carcinoembryonic antigen 1008795 C0007595 cell growth C0294028 BRCA2 protein 0945284

Table 3 Top 5 matched terms by the Heuristic algorithm (depth = 6 120573 = 100)

Rank GO terms NDFRT terms Closeness score1 C0031845 biological process C0042890 VITAMINS 02448622 C1166607 cellular component C1657248 apoptosome 01428573 C0027540 tissue death C0065932 MENADIOL 004342194 C0025519 metabolic process C0042849 VITAMIN B 003702485 C0030012 oxidation-reduction process C0027996 NICOTINIC ACID 00327109

in Table 1 To observe the integration over the ontologysize change in each experiment we use the ontology treestructure from the root to the specified depth (first columnin Table 1) for integration The sizes of the ontology termsinvolved in the integration are listed in the second and thirdcolumns of Table 1

Recall in Section 211 120573rdepth(119887) is used to regulate theselection of vertices from high depths Thus we tested theHeuristic under 120573 = 1 (depth information is nullified)and 120573 = 100 (the vertex depth plays a critical role inthe selection) Since nonleaf vertices in these datasets havearound 6 children on average we heuristically add a set ofexperiments by setting 120573 = 6 so that 120573rdepth(119887) will be close tothe number of vertices excluded from the future integration

From Table 1 we can see that Heuristic performs betterwhen using the depth information to regulate the selectionof vertices However Approximate is much better thanHeuristic at all settings Compared to the best cohesionscores of Heuristic in each row of Table 1 Approximateconstructs an integrated ontology with the overall cohe-sion score ranging from 54 times to 1099 times that ofHeuristicThis clearly demonstrates the effectiveness of theproposed Approximate approach Nevertheless the heuris-tic approach has a much faster average running time as aresult of excluding a large number of matching opportunitiesin each step

Although the running time of Approximate is longerthan Heuristic it takes less than two hours to finishintegrating two ontologies with about 16119896 and 33119896 vertices

Most of the biomedical ontologies are smaller than or similarto these sizes and Approximate approach will benefit theassociation study of these ontologies For extremely largeontology pairs in which Approximate is unable to finish theintegrationwithin a reasonable time Heuristicmay providea quick view on their integration

32 Understanding the Merged Ontology Terms To under-stand what terms are merged in integrating real ontologieswe use the integration of GO and NDFRT at depth 6 asan example Tables 2 and 3 list the top 5 pairs of mergedterms (sorted by their closeness scores) by Heuristicand Approximate respectively As mentioned abovethese scores are from the closeness matrix generatedby onGrid based on the discovered paths betweenthem For example ldquoC1155065T-Cell Activation minus minusis physiologic effect of chemical or drug minus minus gt C0393002Carcinoembryonic Antigen Peptide 1 minus minus has target minus minus gtC0007082Carcinoembryonic Antigenrdquo is such a path

From Tables 2 and 3 we can observe that theApproximate algorithm merges terms with much highersimilar scores than the Heuristic algorithm Quiteinterestingly we observed that the top ranked merging inTable 3 is between ldquobiological processrdquo and ldquoVITAMINSrdquoThe ldquobiological processrdquo is an abstract term which is veryclose to the root of the GO ontology Such a fact suggests thatthe top level terms will likely preempt the merging choicesover their descendants As a result of this greedy approach

12 BioMed Research International

0

2000

4000

6000

8000

10000

12000

14000

16000

18000

100 200 300 400 500 600 700 800 900 1000

Ove

rall

cohe

sion

scor

e of i

nteg

ratio

n

Individual ontology size

FastApproximateBasic

Greedy

Figure 6 The change of overall cohesion score over the increase ofthe size of each ontology The number of ontologies is fixed at 10

0

100000

200000

300000

400000

500000

600000

700000

800000

10 20 40 70 80 90 100

Ove

rall

cohe

sion

scor

e of i

nteg

ratio

n

Number of ontologies 30 50 60

FastApproximateBasic

Greedy

Figure 7 The change of overall cohesion score over the increase ofthe number of ontologies The size of each ontology is fixed at 100

the Heuristic algorithm will end at a local optimum whichis far from being optimal

A snapshot of ontology integration by Approximate asshown in Figure 10 provides a good insight on the algorithmwork In each bracket of two merged terms the left partis the closeness score and the right part is the cohesionscore We can observe that most closeness scores are zero orclose to zero while the corresponding cohesion scores aremuch higher This is understandable because the snapshot isprimarily on the top level terms of both ontologies For theseterms they have a large number of subclass (descendant)terms and optimizing the integration of their subclass termsfar outweighs integrating of themselves The result of such

0

200

400

600

800

1000

1200

1400

1600

100 200 300 400 500 600 700 800 900 1000

Inte

grat

ion

time (

s)

Individual ontology size

FastApproximateBasic

Greedy

Figure 8 The change of integration time over the increase of thesize of each ontology The number of ontologies is fixed at 10

0

500

1000

1500

2000

2500In

tegr

atio

n tim

e (s)

10 20 40 70 80 90 100Number of ontologies

30 50 60

FastApproximateBasic

Greedy

Figure 9 The change of integration time over the increase of thenumber of ontologies The size of each ontology is fixed at 100

integration provides novel knowledge of association betweenontology terms That is even if two terms are not that closeaccording to some closeness measurement they can be struc-turally associated under their ontology context For examplethe GO term ldquobiological processrdquo is merged with the NDFRTterm ldquochemical ingredientsrdquo even their closeness score iszero from the onGrid output However such integrationis interesting because it shows that the merging is tryingto link the chemical compounds with the biologicalcellularprocesses so that corresponding associations between thecellular processes and chemical structures can be establishedThis demonstrates the purpose of integrating two ontologiesthat is identifying associations with respect to both termsimilarities and structural contexts

BioMed Research International 13

component

Biologicalprocess Cellular

SignalingNeoplasms

National Drug File- Reference Terminology

Chemical ingredients

Diseases manifestations or physiologic states

CarbohydratesResponse to

stimulusOrganic chemicals

Cell part Extracellular region part

Skin and connective

tissue diseases

[0 7403]

[0 4833] [0 2315]

[00042 262] [00016 215][014 899] [0 307]

Amino acids peptides and

proteins

Cellular process Proteins

Cell adhesion involved in heart morphogenesis

Zyxin

Synapse Nervous system diseases

Asymmetric synapse Brain

neoplasmsSymmetric

synapse Granulomatousdisease chronic

[0036 2009]

[0023 0023]

[0034 0034] [0023 0023] [0006 0006]

[0098 062]

ontology

middot middot middot

middot middot middot

middot middot middot middot middot middot middot middot middot

middot middot middot middot middot middot middot middot middot

middot middot middotmiddot middot middot

Gene

NeuronalRNA

granule

Traumanervoussystem

Figure 10 A snapshot of ontology integration between GO and NDFRT by Approximate (depth = 6) Blue nodes are terms from GO andred nodes are terms from NDFRT Solid lines are edges and dashed lines are paths consisting of 1 or more edges In each bracket the left partis the closeness score and the right part is the cohesion score

In fact there are multiple studies to justify the struc-tural associations seen in Figure 10 such as the associationbetween ldquosignalingrdquo and ldquocarbohydratesrdquo [28] and the asso-ciation between ldquoextracellular region partrdquo and ldquoskin andconnective tissue diseasesrdquo [29]

In addition we have noticed a number of meaningfulintegrations betweenGO terms and neurological terms in theNDFRT For example synapse is a brain related structure andthe term ldquosymmetric synapserdquo is associated with ldquotraumardquoand the term ldquoasymmetric synapserdquo is associated with ldquobrainneoplasmsrdquo Similarly it is reasonable to see that ldquoneuronalRNA granulerdquo is integrated with ldquogranulomatous diseaserdquo agranule associated disease As another example it is veryinteresting to notice that ldquozyxinrdquo is associated with ldquocelladhesion involved in heart morphogenesisrdquo and that providesa link with the formation of heart

The above observations suggest a novel way of using ourontology integration method to perform association studiesbetween biomedical concepts

33 Integrating Multiple Ontologies In the following exper-iments we will study the performances of Basic Greedyand FastApproximate in integrating multiple ontologiesAll the three approaches are built upon Approximate whichperforms very well in the previous study for integrating tworeal ontology datasets

We use two sets of synthetic datasets in this study In thefirst set of datasets we fix the number of ontologies to be 10and vary the size of each ontology from 100 to 1000 In thesecond set of datasets we fix the size of each ontology to be100 and vary the number of ontologies from 10 to 100 All theontologies are randomly generated by constructing aminimalspanning tree from a randommatrixThe relationshipmatrixbetween every pair of ontologies is also randomly generatedwith entry values ranging from 0 to 1 For each experiment

we generate 10 randomdatasets and the results reported in thefollowing are the average results over the 10 random datasets

The overall cohesion scores of the three approachesover different ontology sizes and over different numbers ofontologies were reported in Figures 6 and 7 respectivelyFastApproximate outperforms all the other approachesin Figure 6 which is consistent with the analysis of itsapproximation rate However Greedy slightly outperformsFastApproximate in Figure 7 especially when the ontologynumber is large This is understandable because when thenumber of ontology (119896) increases the approximation rate (asstated in Theorem 5) decreases and becomes less significantThis result also justifies the choice of adjusted cohesion scorefor Greedy as described at the end of Section 221

The integration time of the three approaches over differ-ent ontology sizes and over different numbers of ontologieswas reported in Figures 8 and 9 respectively These figuresare consistent with the time complexity analysis given inSection 223 In particular we noticed that the integrationtime of Greedy deteriorates sharply over the ontologynumber increase In contrast FastApproximate is muchmore scalable and has a time curve similar to Basic

These results suggest that FastApproximate has the bestoverall performance in integrating multiple ontologies

4 Conclusions

In this work we started with a basic problem on integrating apair of ontology tree structures with a given closeness matrixand later we advanced the basic problem to the problem ofintegrating large number of ontologies We proved optimalstructures in the basic problem and developed both optimaland efficient approximation solutions Although the multipleontology integration problem has similar optimal structuresit is not feasible to extend the optimal and efficient approx-imation solutions for the basic problem to efficiently handle

14 BioMed Research International

multiple ontology integration To tackle the challenge of inte-grating a large number of ontologies we developed both aneffective greedy approach and a fast approximation approachThe empirical study not only confirms our analysis on theefficiency of the proposedmethod but also demonstrates thatourmethod can be used effectively for biomedical associationstudies

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

References

[1] D L McGuinness F van Harmelen andM K Smith Owl webontology language overview W3C recommendation 10(2004-03)10 2004

[2] O Bodenreider ldquoThe unifiedmedical language system (UMLS)integrating biomedical terminologyrdquo Nucleic Acids Researchvol 32 supplement 1 pp D267ndashD270 2004

[3] X Wu R Jiang M Q Zhang and S Li ldquoNetwork-based globalinference of human disease genesrdquo Molecular Systems Biologyvol 4 no 1 2008

[4] B Linghu E S Snitkin Z Hu Y Xia and C DeLisi ldquoGenome-wide prioritization of disease genes and identification ofdisease-disease associations from an integrated human func-tional linkage networkrdquo Genome Biology vol 10 no 9 articleR91 2009

[5] D Botstein and N Risch ldquoDiscovering genotypes underlyinghumanphenotypes past successes formendelian disease futureapproaches for complex diseaserdquo Nature Genetics vol 33 pp228ndash237 2003

[6] H S Pinto and J P Martins ldquoA methodology for ontologyintegrationrdquo in Proceedings of the 1st International Conferenceon Knowledge Capture (K-CAP rsquo01) pp 131ndash138 ACM 2001

[7] Y Xiang K Lu S L James T B Borlawsky K Huang and P RO Payne ldquoK-Neighborhood decentralization a comprehensivesolution to index the UMLS for large scale knowledge discov-eryrdquo Journal of Biomedical Informatics vol 45 no 2 pp 323ndash336 2012

[8] J Dutkowski M Kramer M A Surma et al ldquoA gene ontologyinferred from molecular networksrdquo Nature Biotechnology vol31 no 1 pp 38ndash45 2013

[9] R Navigli P Velardi and A Gangemi ldquoOntology learningand its application to automated terminology translationrdquo IEEEIntelligent Systems vol 18 no 1 pp 22ndash31 2003

[10] J Jannink andGWiederhold ldquoThesaurus entry extraction froman on-line dictionaryrdquo in Proceedings of the Fusion ConferenceSunnyvale Calif USA 1999

[11] C Papatheodorou A Vassiliou and B Simon ldquoDiscovery ofontologies for learning resources using word-based clusteringrdquoin World Conference on Educational Multimedia Hypermediaand Telecommunications pp 1523ndash1528 2002

[12] D L Rubin M Hewett D E Oliver T E Klein and RB Altman ldquoAutomating data acquisition into ontologies frompharmacogenetics relational data sources using declarativeobject definitions and xmlrdquoPacific SymposiumonBiocomputingpp 88ndash99 2001

[13] N F Noy ldquoSemantic integration a survey of ontology-basedapproachesrdquo ACM SIGMOD Record vol 33 no 4 pp 65ndash702004

[14] S Abels L Haak and A Hahn ldquoIdentification of commonmethods used for ontology integration tasksrdquo in Proceedingsof the 1st International Workshop on Interoperability of Het-erogeneous Information Systems (IHIS rsquo05) pp 75ndash78 ACMNovember 2005

[15] A Gangemi D Pisanelli and G Steve ldquoOntology integrationexperiences withmedical terminologiesrdquo in Formal Ontology inInformation Systems vol 46 pp 98ndash94 IOS Press AmsterdamThe Netherlands 1998

[16] N F Noy and M A Musen ldquoSMART automated support forontology merging and alignmentrdquo in Proceedings of the 12thWorkshop on Knowledge Acquisition Modelling and Manage-ment (KAW 99) Banff Canada 1999

[17] A Doan J Madhavan P Domingos and A Halevy ldquoLearningtomap between ontologies on the semantic webrdquo in Proceedingsof the 11th International Conference onWorldWideWeb (WWWrsquo02) pp 662ndash673 ACM May 2002

[18] J Xie F Liu and S-U Guan ldquoTree-structure based ontologyintegrationrdquo Journal of Information Science vol 37 no 6 pp594ndash613 2011

[19] D Calvanese G De Giacomo andM Lenzerini ldquoA frameworkfor ontology integrationrdquo inTheEmerging SemanticWebSelectedPapers from the First Semantic Web Working Symposium pp201ndash214 2002

[20] O Udrea L Getoor and R J Miller ldquoLeveraging data andstructure in ontology integrationrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo07) pp 449ndash460 ACM June 2007

[21] H W Kuhn ldquoThe Hungarian method for the assignmentproblemrdquoNaval Research Logistics Quarterly vol 1-2 pp 83ndash971955

[22] D E Drake Vinkemeier and S Hougardy ldquoA linear-timeapproximation algorithm for weighted matchings in graphsrdquoACMTransactions on Algorithms vol 1 no 1 pp 107ndash122 2005

[23] A J J Wood W E Evans and H L McLeod ldquoPharmacoge-nomicsmdashdrug disposition drug targets and side effectsrdquo TheNew England Journal of Medicine vol 348 no 6 pp 538ndash5492003

[24] R B Altman ldquoPharmgkb a logical home for knowledge relatinggenotype to drug response phenotyperdquoNature Genetics vol 39no 4 article 426 2007

[25] R B Kim D OrsquoShea and G R Wilkinson ldquoInterindivid-ual variability of chlorzoxazone 6-hydroxylation in men andwomen and its relationship to CYP2E1 genetic polymorphismsrdquoClinical Pharmacology amp Therapeutics vol 57 no 6 pp 645ndash655 1995

[26] T N Ferraro and R J Buono ldquoThe relationship between thepharmacology of antiepileptic drugs and human gene variationan overviewrdquo Epilepsy and Behavior vol 7 no 1 pp 18ndash36 2005

[27] A Albin X Ji T B Borlawsky et al ldquoEnabling online studiesof conceptual relationships between medical terms developingan efficient web platformrdquo JMIR Medical Informatics vol 2 no2 article e23 2014

[28] S Chandrasekaran J W Dean III M S Giniger and M LTanzer ldquoLaminin carbohydrates are implicated in cell signal-ingrdquo Journal of Cellular Biochemistry vol 46 no 2 pp 115ndash1241991

[29] J Uitto and D Kouba ldquoCytokine modulation of extracellularmatrix gene expression relevance to fibrotic skin diseasesrdquoJournal of Dermatological Science vol 24 pp S60ndashS69 2000

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 4: Research Article Building Integrated Ontological Knowledge

4 BioMed Research International

push (0 0) into queue 119902 Integrating virtual roots of the two ontologieswhile |119902| gt 0 do(119886 119887) = 119901119900119901(119902)for 119894 isin children of 119886 on 119879

119860do

119898119886119909 119904119888119900119903119890 = minus1119905119900 119898119890119903119892119890 = 119873119880119871119871119905119900 119898119890119903119892119890 119903119900119900119905 = 119873119880119871119871for 119895 isin children of 119887 on 119879

119861do

if 119895 is chosen thencontinue The subtree rooted at 119895 can only be chosen once for integration as illustrated in Figure 3

elseidentify vertex 119896 in the subtree rooted at 119895 such that argmax

119896(119872119860119861(119894 119896)120573

distance(119896119895))

if 119872119860119861(119894 119896) gt 119898119886119909 119904119888119900119903119890 then

119898119886119909 119904119888119900119903119890 = 119872119860119861(119894 119896)

119905119900 119898119890119903119892119890 = 119896119905119900 119898119890119903119892119890 119903119900119900119905 = 119895

end ifend if

end forif 119905119900 119898119890119903119892119890 = 119873119880119871119871 thenbreak

elsesave merge pair (119894 119905119900 119898119890119903119892119890) in 119879

119860119861

mark 119905119900 119898119890119903119892119890 119903119900119900119905 as chosenpush (119894 119905119900 119898119890119903119892119890) into queue 119902

end ifend for

end whilereturn 119879

119860119861

Algorithm 1 HeuristicMerge(119879119860 119879119861119872119860119861)

conclude that such a brutal-force approach needs to pick upa best solution from an exponential number of solutionsThebrutal-force approach is clearly not acceptable for integratinglarge ontologies and may not even work for ontologies withonly a few dozens of vertices

A heuristic solution can be developed by following anidea similar to the above brutal-force approach Howeverinstead of trying all possibilities the heuristic solution willgreedilymerge vertices following the topological orderWhenselecting a matching vertex for vertex 119886 from ontology119879119860 the heuristic approach will greedily select a vertex 119887

from allowable candidates in 119879119861and iteratively apply such

selections to descendants of 119886 According to Criterion (1) if 119886is associated with 119887 then none of 119886rsquos descendants are allowedto be associated with vertices other than 119887rsquos descendants Inaddition if 119886rsquos child 1198861015840 is associated with 119887rsquos child 1198871015840 or itsdescendants then none of 119886rsquos other children are allowed tobe associated with 1198871015840 or its descendants any more Given thisa greedy choice may very easily end up in a local optimum bychoosing a best matching vertex at one step while denyingintegrating opportunities of other vertices that may lead to abetter final solution

It is easy to see that the deeper a vertex being chosen forintegration is the more integration opportunities are lost Toalleviate such a situation we propose a greedy approach byconsidering the relative depth (rdepth) of a chosen vertexwith regard to an allowable vertex closest to the root That is

given a vertex 119886 from 119879119860 a vertex 119887 from 119879

119861is chosen when

119872119860119861(119886 119887)120573

rdepth(119887) is maximized When 120573 = 1 the depthinformation does not take effective and when 120573 = infin eachvertex will only be associated with an allowable vertex closestto the root

Algorithm 1 describes the pseudocode of the heuristicintegration It starts by integrating virtual roots of the twoontologies After that the integration will be carried outiteratively from top to bottom by following Criterion (1) andthe heuristic strategy described above In the empirical studywe will see that the heuristic algorithm works better whenthe depth information is considered However in terms ofthe overall cohesion score it is no match for our optimal andapproximation solutions as described below

212 Optimal and Approximation Solutions By dividing theintegration of two trees into node merging and subtreeintegrations we have identified optimal structures in thebasic problem as stated by Lemmas 1 and 2 These optimalstructures make it possible for us to develop efficient algo-rithms (Algorithms 2 and 3) that achieve optimal or approx-imation solutions In the following we first describe thetwo important lemmas suggesting the optimal structures andtheir proofs before describing our proposed algorithms

Lemma 1 Let 119903119886be the root of tree119879

119860and 119903119887the root of tree119879

119861

Let119879119860minus119903119886

and119879119861minus119903119887

represent two sets of sub trees rooted at 119903119886of

BioMed Research International 5

Case (2a) Case (2b) Case (3a) Case (3b)

Case (1) Case (2) Case (3)

y2

y2

y2 y2 y2

y2

y1

y1

y1 y1y1

y1

Ty1

Ty2

x

xx

xx

x

ra rb

Txy1y2

tx

Figure 3 An illustration of three cases in the proof of Lemma 2

Sort vertices in 119879119860and 119879

119861in the topological order

for 119894 = |119881(119879119860)| to 0 do

for 119895 = |119881(119879119861)| to 0 do

119904119888119900119903119890 = 119872119860119861(119894 119895) + 119872119886119909119872119886119905119888ℎ(119879

119860minus119894 119879119861minus119895)

119904119888119900119903119890119886= 119872119886119909119872119886119905119888ℎ(119879

119860minus119894 119879119895)

119904119888119900119903119890119887= 119872119886119909119872119886119905119888ℎ(119879

119894 119879119861minus119895)

119888119900ℎ119890119904119894119900119899 119898119886119905119903119894119909(119894 119895) = max(119904119888119900119903119890 119904119888119900119903119890119886 119904119888119900119903119890

119887)

end forend forreturn 119888119900ℎ119890119904119894119900119899 119898119886119905119903119894119909

Algorithm 2 BuildCohesionMatrix(119879119860 119879119861)

119879119860and 119903119887of119879119861 respectively119879

119860minus119903119886does not include 119903

119886and119879119861minus119903119887

does not include 119903119887 One has 119892(119879

119860 119879119861) = max(119872

119860119861(119903119886 119903119887) +

119892(119879119860minus119903119886 119879119861minus119903119887) 119892(119879119860 119879119861minus119903119887) 119892(119879119860minus119903119886 119879119861))

Proof We can divide the integration of tree 119879119860with tree 119879

119861

into two cases according to the merging of their roots

(1) The roots of 119879119860and 119879

119861are merged together

(2) The roots of 119879119860and 119879

119861are not merged together

For case (1) it is clear that the cohesion score is119872119860119861(119903119886 119903119887)+

119892(119879119860minus119903119886 119879119861minus119903119887)

For case (2) we conclude that either 119879119860is integrated with

119879119861minus119903119887

(119903119887is out of integration) or 119879

119861is integrated with 119879

119860minus119903119886

(119903119886is out of integration) Otherwise we will have a merged

tree 119879119860119861

with two roots 119903119886and 119903

119887 a contradiction to the

fact that 119879119860119861

is a tree Therefore the cohesion score is either119892(119879119860 119879119861minus119903119887) or 119892(119879

119860minus119903119886 119879119861)

Combining cases (1) and (2) and according to Criterion(2) we have

119892 (119879119860 119879119861) = max (119872

119860119861(119903119886 119903119887) + 119892 (119879

119860minus119903119886 119879119861minus119903119887)

119892 (119879119860 119879119861minus119903119887) 119892 (119879

119860minus119903119886 119879119861))

(1)

Lemma 2 Let 119879119860minus119903119886

and 119879119861minus119903119887

represent two sets of treesobtained by removing the root vertices 119903

119886from 119879

119860minus119903119886and 119903119887

from 119879119861minus119903119887

One has

119892 (119879119860minus119903119886 119879119861minus119903119887) = max( sum

(119879119909 119879119910)isin119877

(119892 (119879119909 119879119910))) (2)

Here 119879119909isin 119879119860minus119903119886

119879119910isin 119879119861minus119903119887

and 119877 is a matching of trees in119879119860minus119903119886

with trees in 119879119861minus119903119887

Proof To prove this lemma we first prove that for any tree119879119909isin 119879119860minus119903119886

it can be integrated with no more than one treein 119879119861minus119903119887

We will prove this claim by contradiction Assumethere are two trees 119879

1199101isin 119879119861minus119903119887

and 1198791199102isin 119879119861minus119903119887

and theyintegrate with a tree119879

119909isin 119879119860minus119903119886

into an integrated tree11987911990911991011199102

There are three cases for the root 119903 of 119879

11990911991011199102 as illustrated in

Figure 3

(1) 119903 contains only the root of 119879119909

(2) 119903 contains only the root of 1198791199101or 1198791199102

(3) 119903 contains the root of 119879119909and the root of 119879

1199101or 1198791199102

For case (1) the lowest common ancestor of the roots of 1198791199101

and 1198791199102in the integrated tree 119879

11990911991011199102will no longer contain

their lowest common ancestor in 119879119861 a contradiction to

6 BioMed Research International

push (0 0119873119880119871119871) into queue 119902while |119902| gt 0 do(119886 119887 119888) = 119901119900119901(119902)if 119886 = minus1 thenmake 119879

119861(119887) as a subtree of 119879

119860119861rooted at 119887 119879

119861(119887) is a sub tree of 119879

119861rooted at 119887

else if 119887 = minus1 thenmake 119879

119860(119886) as a subtree of 119879

119860119861rooted at 119886 119879

119860(119886) is a sub tree of 119879

119860rooted at 119886

elselet 119889 = (119886 119887) and make 119889 a child of 119888 on 119879

119860119861

119904119888119900119903119890 = 119872119860119861(119886 119887) + 119872119886119909119872119886119905119888ℎ(119879

119860minus119886 119879119861minus119887)

119904119888119900119903119890119886= 119872119886119909119872119886119905119888ℎ(119879

119860minus119886 119879119887)

119904119888119900119903119890119887= 119872119886119909119872119886119905119888ℎ(119879

119886 119879119861minus119887)

if 119904119888119900119903119890 gt= 119904119888119900119903119890119886ampamp 119904119888119900119903119890 gt= 119904119888119900119903119890

119887then

push matching results of 119879119860minus119886 119879119861minus119887

with 119889 into 119902else if 119904119888119900119903119890

119886gt= 119904119888119900119903119890

119887then

push matching results of 119879119860minus119886 119879119887with 119889 into 119902

elsepush matching results of 119879

119886 119879119861minus119887

with 119889 into 119902end if

end ifend whilereturn 119879

119860119861

Algorithm 3 BuildNewOnto(cohesion matrix 119879119860 119879119861)

Criterion (1) For cases (2) and (3) the root of 1198791199102(or the root

of 1198791199101) will be the descendant of the root of 119879

1199101(or the

root of 1198791199102) in the integrated tree 119879

1199091199101 1199102 a contradiction to

Criterion (1) For integration involving more than two treesfrom 119879

119861minus119903119887 we can still follow the above procedure to reach

contradictions Thus the claim is provenWithout loss of generality we can see that for any

tree 119879119910isin 119879119861minus119903119887

it can be integrated with no more thanone tree in 119879

119860minus119903119886 Therefore the integration between 119879

119860minus119903119886

and 119879119861minus119903119887

corresponds to a matching in a weighted bipar-tite graph in which two sets of nodes represent treesfrom 119879

119860minus119903119886and 119879

119861minus119903119887 respectively and edges represent

corresponding cohesion scores According to Criterion (2)119892(119879119860minus119903119886 119879119861minus119903119887) = max(sum

(119879119909 119879119910)isin119877(119892(119879119909 119879119910))) and we con-

clude that 119892(119879119860minus119903119886 119879119861minus119903119887) corresponds to the weight of amax-

imum weighted matching in the above bipartite graph

Given Lemma 2 we can see that the following corollary iscorrect

Corollary 3 Define MaxMatch(119883 119884) = sum(119879119909 119879119910)isin119877

119892(119879119909 119879119910)

where 119877 is a maximum weighted matching of trees in forests119883 and 119884 given 119892(119879

119909 119879119910) for any tree pair (119879

119909 119879119910) isin 119883 times 119884

One concludes that for any two forests 119879119860and 119879

119861 119892(119879119860 119879119861) =

MaxMatch(119879119860 119879119861)

With Lemma 1 and Corollary 3 we are able to designan efficient dynamic programming algorithm achieving theglobal optimum for the ontology integration problem Thepseudocode for calculating the maximum cohesion score isdescribed in Algorithm 2 which visits ontology vertices inreverse topological order when filling up the cohesionmatrix

At the end of Algorithm 2 the cohesion matrix is filledup with optimal cohesion scores and the maximum cohesionscore is saved at entry (0 0) as described byTheorem 4

Theorem 4 It holds that cohesion matrix(119894 119895) = 119892(119879119860minus119894

119879119861minus119895) and cohesion matrix(0 0) = 119892(119879

119860 119879119861)

Proof We will prove this theorem by mathematical induc-tion

Let |119881(119879119860)| = 119899 and |119881(119879

119861)| = 119898 it is easy to see

that cohesion matrix(119899119898) is optimal because 119899 and 119898correspond to leaf nodes in the topological order Thus119879119860minus119899

and 119879119861minus119898

are empty sets and cohesion matrix(119899119898) =119872119860119861(119899119898) = 119892(119879

119899 119879119898)

When 119894 = 119899 and 119895 lt 119898 according to Lemma 1 to integrate119879119899with 119879

119895 either 119879

119899(the leaf vertex) is merged with 119879

119895or 119879119899

is integrated with 119879119861minus119895

The maximum score of integrating119879119899with 119879

119861minus119895is available at the time cohesion matrix(119899 119895) is

being calculated because of the reverse topological visitThusaccording to both Lemma 1 and Corollary 3 we concludethat 119892(119879

119899 119879119895) = max(119872

119860119861(119899 119895)MaxMatch(119879

119899 119879119861minus119895)) =

cohesion matrix(119899 119895) Similarly we can conclude that119892(119879119894 119879119898) = max(119872

119860119861(119894 119898)MaxMatch(119879

119860minus119894 119879119898)) =

cohesion matrix(119894 119898)When 119894 lt 119899 and 119895 lt 119898 again according to

Lemma 1 and Corollary 3 we have 119892(119879119894 119879119895) = max(score

score119886 score

119887) = cohesion matrix(119894 119895) Due to the reserve

topological visit MaxMatch(119879119860minus119894 119879119861minus119895) MaxMatch(119879

119860minus119894

119879119895) and MaxMatch(119879

119894 119879119861minus119895) are available at the time

cohesion matrix(119894 119895) is being calculated

Given the definition of 119892(119879119860 119879119861) Theorem 4 in fact

proves that the proposed approach achieves the global opti-mum However the global optimal solution is built upon

BioMed Research International 7

themaximumweightedmatching (recall Corollary 3) As dis-cussed in Section 213 the maximum weighted matching istime-consuming and therefore we propose an approximationsolution in that section

Although Algorithm 2 builds the cohesion matrix withoptimal cohesion scores it does not construct the integratedontological knowledge structure We may save the ontologyintegration details along with the cohesion scores Howeverthat will cost 119874(1198993) memory space (assuming each ontologyhas 119874(119899) vertices) and significantly reduce the capacity ofthe algorithm in handling large ontology integrations Quiteinterestingly we find that it is not necessary to save the inte-gration details in order to construct the integrated ontologyThe construction can be done by a process reverse to theconstruction of cohesionmatrix as described inAlgorithm 3

Algorithm 3 uses the cohesion matrix constructed byAlgorithm 2 and builds the integrated ontology tree still byfollowing Lemma 1 and Corollary 3 but in a reverse way ofAlgorithm 2 The construction is performed in a Breadth-First fashion which uses a queue 119902 to maintain triples Eachtriple (119886 119887 119888) is an association of three elements 119886 is thematched vertex from ontology 119879

119860 119887 is the matched vertex

from ontology 119879119861 and 119888 is their parent on the merged

ontology 119879119860119861 By following the basic idea of the proof of

Theorem 4 we can show that Algorithm 3 builds an optimalintegrated ontology with the cohesion matrix provided fromAlgorithm 2 We omit the proof for succinctness

213 Time Complexity Analysis and an Approximation Solu-tion Assume an ontology size is 119874(119899) The cohesion matrixhas 119874(1198992) entries to fill up The computation for each entryis a matching whose time complexity depends on the imple-mentation The maximum weighted matching takes 119874(1198993)using the famous Hungarian algorithm [21] and althoughit achieves optimum it is too costly for large ontologiesThe maximal weighted matching however takes 119874(1198992 log 119899)time and more importantly results in an overall 119874(1198992 log 119899)time complexity for Algorithm 2 The analysis is given inthe following For each matching the algorithm will accesspreviously filled entries and each entry will be accessed onlyonce and be involved only once in a sorting of 119874(log 119899)time This is because each entry corresponds to two verticeswhose cohesion score will be accessed when calculating thecohesion score of their parents Thus we conclude that thetotal time complexity of calculating the cohesionmatrix usingmaximal weightedmatching is119874(1198992+1198992 log 119899) = 119874(1198992 log 119899)Since building the new ontology has the same time com-plexity as building the cohesion matrix this is also the totaltime complexity for integrating two ontologies by maximalweightedmatchingThemaximal weightedmatching also hasa guaranteed lower bound on the results It achieves a (12)-approximation solution (ie the overall cohesion score willbe at least 12 of the optimal cohesion score) as pointed out in[22]

Since themaximalweightedmatching results in an overallgood performance on time complexity and approximationrate we used the maximal weighted matching in our empir-ical study for Algorithms 2 and 3 Readers may also choose

othermatching algorithms (such as the one described in [22])to achieve slightly better approximation rates However theweightedmatching is a replaceablemodule for our algorithmsand it is not the focus of this work to build a fast and close-to-optimal weighted matching algorithm

Compared to the time complexity of integrating ontolo-gies by the dynamic programming approach as describedin Algorithms 2 and 3 the heuristic approach described atthe beginning of this section also has 119874(1198992) in the worstcase However we conjecture that the heuristic approach hasa much smaller average time complexity because in eachstep the heuristic approach may exclude a large number ofmatching opportunities

22 Integrating Multiple Ontologies In the previous sectionwe proposedmethods for integrating two ontologies In somebiomedical applications [23 24] we are interested in theassociations involving more than two objects Integrationof multiple ontologies of these objects will generate aninnovative view on these complex relationships Similar tothe basic problem formulation we can formulate themultipleontology integration as follows

Given 119896 ontology trees 1198791 1198792 119879

119896and a closeness

matrix119872119894119895for any two trees 119879

119894and 119879

119895 how can we efficiently

generate an integrated ontology tree 11987912119896

meeting thefollowing criteria

(1) For any two vertices 119909 and 119910 in a tree 119879119894 their

lowest common ancestor LCA119879119894(119909 119910) is contained by

LCA11987912119896(119909 119910)

(2) It holds that argmax11987912119896119891(11987912119896

) =

sum119883isin119881(119879119860119861)

sum119906isin119883Visin119883120590(119906)lt120590(V)119872120590(119906)120590(V)(119906 V) Here

119872120590(119906)120590(V)(119906 V) is the entry value in the closeness

matrix for two vertices 119906 and V (one from tree 119879120590(119906)

and the other from tree 119879120590(V)) contained in the node

119883 For a vertex V from an original ontology 120590(V) isdefined as its original ontology ID

Again we name the function 11989111987912119896

the cohesion of theintegrated ontology 119879

12119896 For each node 119883 in the inte-

grated ontology we define its weight as 119908119890119894119892ℎ119905(119883) =sum119906isin119883Visin119883120590(119906)lt120590(V)119872120590(119906)120590(V)(119906 V) Correspondingly we define

function119892(1198791 1198792 119879

119896) = max

11987912119896(sum119883isin119881(11987912119896)

119908119890119894119892ℎ119905(119883))

as the maximum cohesion function for integrating theontologies 119879

1 1198792 119879

119896 As we can see in the above formula-

tion the overall cohesion score of integration is the summedweight of each node which is the sum of pairwise closenessscores

The formulation of multiple ontology integration is sim-ilar to the basic version and it is not difficult to showthat optimal structures described in Lemmas 1 and 2 canbe extended to a high dimension However the extensionof algorithms described in Section 212 for integrating twoontologies is not feasible for solving the multiple ontologyintegrationThis is because if we need to extend Algorithms 2and 3 to this problem we need to build a cohesionmatrix of 119896dimensions It implies that we need at least 119874(119899119896) operationsto fill up the score matrix assuming the size of an ontology

8 BioMed Research International

A B C D

A 10 6 8

B 10 7 5

C 6 7 9

D 8 5 9

AB C D

AB

C 9

D 9

lowastlowast

lowast

lowast mdash

mdash

mdash

mdash

mdash

mdash

mdash

(lowast) Entries to update

Figure 4 An example of the InterOntology matrixrsquos change at the first iteration in Algorithm 4 for integrating four ontologies

build the 119896 times 119896 InterOntology matrixfor 119894 = 1 to 119896 minus 1 doidentify the active tree pair ⟨119879

119883 119879119884⟩ that corresponds

to the highest score in the InterOntology matrixintegrate 119879

119883and 119879

119884into 119879

119883119884

mark 119879119883and 119879

119884as inactive

update relationship matricesupdate InterOntology matrix

end forreturn 119879

12119899

Algorithm 4 GreedyMultiInt(T = 1198791 1198792 119879

119896)

is 119874(119899) This is clearly not acceptable for high dimensionalontology integration

221 Greedy Approach From the above discussion we cansee that direct extension of Algorithms 2 and 3 for integratingtwo ontologies is practically not feasible for integrating alarge number of ontologies However we can still use thesealgorithms for integrating multiple ontologies by iterativelyintegrating two ontologies and generating a new closenessmatrix Given the ontologies 119879

1 1198792 119879

119896 we can first

integrate 1198791and 119879

2into 119879

12and then build the closeness

matrix between 11987912

and 1198793using the relationship matrices

between 1198791and 119879

2and between 119879

1and 119879

3 Specifically

assume 119883 is a node on the integrated ontology 11987912 and 119883

contains a vertex 119886 from 1198791and a vertex 119887 from 119879

2 Then

the entry (119883 119888) of the closeness matrix between 11987912

and 1198793

is 11987211987912 1198793(119883 119888) = 119872

1198791 1198793(119886 119888) + 119872

1198792 1198793(119887 119888) After the new

closeness matrix is generated we can continue integrating11987912

and 1198793into 119879

123and generating another new closeness

matrix We will eventually get the integrated ontology 11987912119896

by repeating the above process To facilitate the followingdiscussions we name the above approach the basic multipleintegration approach

Although the basic multiple integration approach canfinish integrating multiple ontologies it blindly integratesontologies without using any cohesion information betweenontologies that may lead to a better integration resultTo improve the basic multiple integration we propose agreedy approach that uses the cohesion information betweenontologies to guide the integration The basic steps of thegreedy approach are outlined in Algorithm 4 To facilitate

the understanding of Algorithm 4 we use Figure 4 toillustrate an example of the InterOntology matrixrsquos change atthe first iteration of integrating four ontologies A B C andD

The key idea in Algorithm 4 is to maintain an InterOn-tology matrix which guides the integration Initially thismatrix is filled with the overall cohesion score of everypair of ontologies In each step this matrix is updated withoverall cohesion scores between newly integrated ontologyand existing active ontologies The integration will take placebetween two active ontologies which have the highest scorein the InterOntology matrix

When we used the overall cohesion score between twoontologies to update the InterOntology matrix we observedan interesting phenomenon that the integration inmost casesis a process continuously expanding an integrated ontologyConsequently the greedy approach is likely to yield a resultsimilar to the basic approach

This phenomenon can be explained by the definition ofmaximum cohesion function which takes into account allpairwise closeness between merged terms Thus the moreontologies contained in an integrated ontology are the morelikely it will have high overall cohesion scores with otherontologies As a result it creates unfairness for the integrationselection To fix this issue we use the adjusted overall cohe-sion scores in updating the InterOntology matrix as follows

Given an ontology 119879119883and an ontology 119879

119884where 119883 and

119884 are nonempty sets of ontology IDs we define the adjustedcohesion score between 119879

119883and 119879

119884as 119860119889119895119862119900ℎ(119879

119883 119879119884) =

sum119885isin119881(119879119883119884)

sum119909isin119885119910isin119885120590(119909)isin119883120590(119910)isin119884

119872120590(119909)120590(119910)

(119909 119910)|119883||119884| where119879119883119884

is the integrated ontology built by Algorithms 2 and 3The adjusted cohesion score is in fact the weight increase byintegrating 119879

119883and 119879

119884 divided by the size of 119883 times the

size of 119884 that is119860119889119895119862119900ℎ(119879119883 119879119884) = sum

119885isin119881(119879119883119884)(119908119890119894119892ℎ119905(119885))minus

sum119883isin119881(119879119883)

(119908119890119894119892ℎ119905(119883)) minus sum119884isin119881(119879119884)

119908119890119894119892ℎ119905(119884)|119883||119884| For eachnode merging closeness scores will be added to the totalweight when the merging takes place between vertices fromontology set 119883 and vertices from ontology set 119884 Thus theweight increase by integrating 119879

119883and 119879

119884is proportional to

the number of ontologies in119883 times the number of ontologiesin 119884 and consequently we averaged the weight increase by|119883||119884|

222 Fast Approximation Algorithm Although the basicmultiple integration and the greedy multiple integrationapproaches discussed above are able to integrate multipleontologies none of them provide any guarantee on the results

BioMed Research International 9

A

A

7

10 8

9

6

5

9

D

C

B D

C

B

ge 7

ge8

Figure 5 An illustration of vertexedge contraction and weight updates in an iteration of Algorithm 5 Each vertex represents an ontology

for 119894 = 1 to 119896 minus 1 dofor 119895 = 119894 + 1 to 119896 dopush ⟨119892(119879

119894 119879119895) 119894 119895⟩ into a set 119878 ordered by the first element in descending order

end forend forwhile |119878| gt 0 do⟨119911 119909 119910⟩ = 119901119900119901(119878)if adding (119909 119910) does not form a cycle inG thenadding (119909 119910) toGintegrate 119879

119862(119909)and 119879

119862(119910)into 119879

119862(119909)cup119862(119910) 119862(V) is a set of vertices including V that form a connected component inG

end ifend whilereturn 119879

12119899

Algorithm 5 FastMultiInt(T = 1198791 1198792 119879

119896)

in comparison with the optimal solutions By studyingthe maximum cohesion scores between ontologies undera graph setting we identified an approximation structureand developed an approximation algorithm for integratingmultiple ontologiesWename it fast approximation algorithmbecause it not only has a lower bound on the results but alsoruns faster than the greedy multiple integration algorithmproposed above

The fast approximation algorithm for integratingmultipleontologies is sketched in Algorithm 5 It only calculates themaximum cohesion score between every pair of ontologiesonce during the initial stage and uses this informationthroughout the integration process even after it becomesstaleMore importantly this approach not only saves the timefor recalculating the maximum cohesion scores but also pro-vides a lower bound guarantee as stated inTheorem 5 whosecorrectness is built on two important lemmas (Lemmas 6 and7) which will be described subsequently

Theorem 5 The tree weight of the integrated tree 11987912119896

obtained by FastMultiInt algorithm is at least 1(119896 minus 1) theweight of the optimal solution

Proof We will use Lemmas 6 and 7 to prove this theoremThe proofs of Lemmas 6 and 7 are provided after the proofof this theorem To facilitate the proof of this theorem we

build a fully connected weighted graph G in which eachnode corresponds to a tree for integration and the weightof each edge corresponds to the weight increase (initiallythis is the cohesion score) for integrating the correspondingtrees According to Lemma 6 119892(119879

1 1198792 119879

119896) (ie optimal

cohesion score) is no more than the summed weight of edgesinG (Claim 1)

Given G the integration by Algorithm 5 is a processof 119896 minus 1 node contractions After each contraction theadjacent edge weights (cohesion scores) will be updatedaccordingly According to Lemma 7 the weight of an updatededge will only increase over (or at least remain the same as)the maximum weight of the two contracted edges (Figure 5provides an illustration of vertexedge contraction andweightupdates) Thus the overall cohesion score of the integrationby Algorithm 5 is no less than the weight of the maximumspanning tree ofG (Claim 2)

It is easy to see that the weight of a maximum spanningtree is no less than 1(119896minus1) of the summed edge weight ofGgiven the simple observation that each edge in G is either anedge of the maximum spanning tree or adjacent to an edge ofthe maximum spanning tree with an equal or heavier weight(Claim 3)

Combining Claims 1 2 and 3 we complete the proof ofthis theorem

Lemma 6 It holds that 119892(1198791 1198792 119879

119896) le sum1le119894lt119895le119896

119892(119879119894 119879119895)

10 BioMed Research International

Proof According to the problem definition

119892 (1198791 1198792 119879

119896)

= max( sum

119883isin119881(11987912119896)

119908119890119894119892ℎ119905 (119883))

= max( sum

119883isin119881(11987912119896)

sum

119906isin119883Visin119883120590(119906)lt120590(V)119872120590(119906)120590(V) (119906 V))

= ( sum

1le119894lt119895le119896

11989111987912119896

(119879119894119895)) le sum

1le119894lt119895le119896

119892 (119879119894 119879119895)

(3)

11989111987912119896(119879119894119895) is the cohesion score of 119879

119894119895whose integration is

induced from 11987912119896

Lemma 7 It holds that 119892(119879119875119876 119879119878) minus 119891(119879

119875119876) ge max(119892(119879

119875

119879119878) 119892(119879119876 119879119878))

Proof According to the problem definition integrating 119879119875119876

with 119879119878will result in an integrated tree 119879

119880where 119880 = 119875 cup

119876 cup 119878 and 119892(119879119875119876 119879119878) = max

119879119880(sum119883isin119881(119879119880)

119908119890119894119892ℎ119905(119883)) =

max119879119880(119891(119879119875119878) + 119891(119879

119876119878) + 119891(119879

119875119876)) where 119879

119875119878 119879119876119878 and

119879119875119876

are induced from 119879119880 Since 119879

119875119876has been determined

we have 119892(119879119875119876 119879119878) = max

119879119880(119891(119879119875119878) + 119891(119879

119876119878)) + 119891(119879

119875119876)

Without loss of generally let us assume max(119891(119879119875119878)) ge

max(119891(119879119876119880)) Then by restricting the integration between

119879119875and 119879

119878in 119879119880following the integration that leads to

argmax119879119875119878119891(119879119875119878) we will get a cohesion score no less than

119892(119879119875 119879119878) Thus we complete the proof for 119892(119879

119875119876 119879119878) minus

119891(119879119875119876) ge max(119892(119879

119875 119879119878) 119892(119879119876 119879119878))

223 Time Complexity Analysis For the fast approximationalgorithm (Algorithm 5) the time complexity for generatinggraph G (calculating the overall cohesion score for everypair of ontologies) is119874(11989621198992 log 119899) assumingwe usemaximalweighted matching Each integration will take 119874(1198992 log 119899)with an update of at most 119896 closeness matrices which takes119874(119896lowast119899

2) There are at most 119896 integrations therefore the total

time complexity is still 119874(11989621198992 log 119899)Following the above analysis we conclude that the time

complexity of the greedy multiple integration algorithm isthe same as the fast approximation algorithm However itrequires updating of InterOntology matrix which takes anexcessive 119874(11989621198992 log 119899) time The empirical study shows thatthe fast approximation algorithm is much faster than thegreedy multiple integration algorithm

Finally it is easy to see that the basic multiple integrationapproach takes 119874(1198961198992 log 119899) time and is the fastest but itsoverall cohesion scores are the worst as we will see in theempirical study

224 Limitations Integrating multiple ontologies may facetwo potential problems in real applications First how canwe efficiently generate a closeness matrix for every pair ofontologies to be integrated Our current method kDLS or

onGrid is efficient for generating the closenessmatrix for onepair of ontologies in most cases but not efficient enough forgenerating closeness matrices for many pairs of ontologiesSecond not every pair of ontologies can be meaningfullyintegrated It remains a problem to efficiently identify thefeasibility of integrating a pair of ontologies Therefore themain purpose of Section 22 is to demonstrate that ourproposed approach can be extended to integrate multipleontologies and we use synesthetic datasets in Section 33 tostudy the performance of algorithms proposed in Section 22

3 Results and Discussion

We would like to study the performances of the proposedontology integration methods by experiments on both realand synthetic datasets We implemented five approaches inC++

(1) Heuristic heuristic approach for integrating twoontologies as described in Section 211

(2) Approximate approximate approach (Algorithms 2and 3) with maximal weighted matching for guaran-teeing the (12)-approximation rate

(3) Basic basic multiple integration approach as de-scribed in Section 221

(4) Greedy greedy multiple integration approach(Algorithm 4)

(5) FastApproximate fast approximation multipleintegration approach (Algorithm 5)

In the following we report our study on the performancesof (1) and (2) for integrating two ontologies on real datasetsand (3) (4) and (5) for integrating multiple ontologies onsynthetic datasets All the experiments are carried out on aLinux cluster with 24GHz AMD Opteron processors

31 Integrating a Pair of Ontologies The knowledge of drug-gene relationships is desirable in many pharmacology appli-cations [25 26] By integrating the gene ontology and thedrug ontology we will be able to obtain rich information onthe associations between drugs and genes under the ontologystructures Thus in this set of experiments we simulate realworld knowledge discovery applications by integrating tworeal ontologies gene ontology (GO) and National Drug FileReference Terminology (NDFRT) Both were obtained fromthe Unified Medical Language System (version 2012AA)The closeness matrices between GO terms and drug termswere generated using onGrid [27] with a 4-neighborhoodbroadcast range (ie 119896 = 4 with regard to [7]) onGridfollows the kDLS approach [7] and measures the closenessbetween two concepts based on the discovered paths (withlength greater than one) between them However unlikekDLS onGrid takes into consideration of concept semantictypes in the closeness measurement In the study performedin [27] the advantages of onGrid over kDLS are wellillustrated

The overall cohesion scores of Heuristic andApproximate on integrating GO and NDFRT are listed

BioMed Research International 11

Table 1 Cohesion scores of integrating real datasets

Depth GO term number NDFRT term number Cohesion scoresHeuristic 120573 = 1 Heuristic 120573 = 6 Heuristic 120573 = 100 Approximate

3 66 6004 00505331 0229958 021999 1246964 710 6972 0290392 0284363 146585 938355 5355 14582 0285923 137714 0528289 3390566 16231 32841 0307941 0341588 0673406 740293

Table 2 Top 5 matched terms by the Approximate algorithm (depth = 6)

Rank GO terms NDFRT terms Closeness score1 C1135918 smooth muscle contractile fiber C0282606 muscle neoplasms 1632052 C0010813 cytokinesis C0086376 GTP-binding proteins 1159673 C0027747 axon terminus C0030584 parovarian cyst 1133524 C1155065 T cell activation C0007082 carcinoembryonic antigen 1008795 C0007595 cell growth C0294028 BRCA2 protein 0945284

Table 3 Top 5 matched terms by the Heuristic algorithm (depth = 6 120573 = 100)

Rank GO terms NDFRT terms Closeness score1 C0031845 biological process C0042890 VITAMINS 02448622 C1166607 cellular component C1657248 apoptosome 01428573 C0027540 tissue death C0065932 MENADIOL 004342194 C0025519 metabolic process C0042849 VITAMIN B 003702485 C0030012 oxidation-reduction process C0027996 NICOTINIC ACID 00327109

in Table 1 To observe the integration over the ontologysize change in each experiment we use the ontology treestructure from the root to the specified depth (first columnin Table 1) for integration The sizes of the ontology termsinvolved in the integration are listed in the second and thirdcolumns of Table 1

Recall in Section 211 120573rdepth(119887) is used to regulate theselection of vertices from high depths Thus we tested theHeuristic under 120573 = 1 (depth information is nullified)and 120573 = 100 (the vertex depth plays a critical role inthe selection) Since nonleaf vertices in these datasets havearound 6 children on average we heuristically add a set ofexperiments by setting 120573 = 6 so that 120573rdepth(119887) will be close tothe number of vertices excluded from the future integration

From Table 1 we can see that Heuristic performs betterwhen using the depth information to regulate the selectionof vertices However Approximate is much better thanHeuristic at all settings Compared to the best cohesionscores of Heuristic in each row of Table 1 Approximateconstructs an integrated ontology with the overall cohe-sion score ranging from 54 times to 1099 times that ofHeuristicThis clearly demonstrates the effectiveness of theproposed Approximate approach Nevertheless the heuris-tic approach has a much faster average running time as aresult of excluding a large number of matching opportunitiesin each step

Although the running time of Approximate is longerthan Heuristic it takes less than two hours to finishintegrating two ontologies with about 16119896 and 33119896 vertices

Most of the biomedical ontologies are smaller than or similarto these sizes and Approximate approach will benefit theassociation study of these ontologies For extremely largeontology pairs in which Approximate is unable to finish theintegrationwithin a reasonable time Heuristicmay providea quick view on their integration

32 Understanding the Merged Ontology Terms To under-stand what terms are merged in integrating real ontologieswe use the integration of GO and NDFRT at depth 6 asan example Tables 2 and 3 list the top 5 pairs of mergedterms (sorted by their closeness scores) by Heuristicand Approximate respectively As mentioned abovethese scores are from the closeness matrix generatedby onGrid based on the discovered paths betweenthem For example ldquoC1155065T-Cell Activation minus minusis physiologic effect of chemical or drug minus minus gt C0393002Carcinoembryonic Antigen Peptide 1 minus minus has target minus minus gtC0007082Carcinoembryonic Antigenrdquo is such a path

From Tables 2 and 3 we can observe that theApproximate algorithm merges terms with much highersimilar scores than the Heuristic algorithm Quiteinterestingly we observed that the top ranked merging inTable 3 is between ldquobiological processrdquo and ldquoVITAMINSrdquoThe ldquobiological processrdquo is an abstract term which is veryclose to the root of the GO ontology Such a fact suggests thatthe top level terms will likely preempt the merging choicesover their descendants As a result of this greedy approach

12 BioMed Research International

0

2000

4000

6000

8000

10000

12000

14000

16000

18000

100 200 300 400 500 600 700 800 900 1000

Ove

rall

cohe

sion

scor

e of i

nteg

ratio

n

Individual ontology size

FastApproximateBasic

Greedy

Figure 6 The change of overall cohesion score over the increase ofthe size of each ontology The number of ontologies is fixed at 10

0

100000

200000

300000

400000

500000

600000

700000

800000

10 20 40 70 80 90 100

Ove

rall

cohe

sion

scor

e of i

nteg

ratio

n

Number of ontologies 30 50 60

FastApproximateBasic

Greedy

Figure 7 The change of overall cohesion score over the increase ofthe number of ontologies The size of each ontology is fixed at 100

the Heuristic algorithm will end at a local optimum whichis far from being optimal

A snapshot of ontology integration by Approximate asshown in Figure 10 provides a good insight on the algorithmwork In each bracket of two merged terms the left partis the closeness score and the right part is the cohesionscore We can observe that most closeness scores are zero orclose to zero while the corresponding cohesion scores aremuch higher This is understandable because the snapshot isprimarily on the top level terms of both ontologies For theseterms they have a large number of subclass (descendant)terms and optimizing the integration of their subclass termsfar outweighs integrating of themselves The result of such

0

200

400

600

800

1000

1200

1400

1600

100 200 300 400 500 600 700 800 900 1000

Inte

grat

ion

time (

s)

Individual ontology size

FastApproximateBasic

Greedy

Figure 8 The change of integration time over the increase of thesize of each ontology The number of ontologies is fixed at 10

0

500

1000

1500

2000

2500In

tegr

atio

n tim

e (s)

10 20 40 70 80 90 100Number of ontologies

30 50 60

FastApproximateBasic

Greedy

Figure 9 The change of integration time over the increase of thenumber of ontologies The size of each ontology is fixed at 100

integration provides novel knowledge of association betweenontology terms That is even if two terms are not that closeaccording to some closeness measurement they can be struc-turally associated under their ontology context For examplethe GO term ldquobiological processrdquo is merged with the NDFRTterm ldquochemical ingredientsrdquo even their closeness score iszero from the onGrid output However such integrationis interesting because it shows that the merging is tryingto link the chemical compounds with the biologicalcellularprocesses so that corresponding associations between thecellular processes and chemical structures can be establishedThis demonstrates the purpose of integrating two ontologiesthat is identifying associations with respect to both termsimilarities and structural contexts

BioMed Research International 13

component

Biologicalprocess Cellular

SignalingNeoplasms

National Drug File- Reference Terminology

Chemical ingredients

Diseases manifestations or physiologic states

CarbohydratesResponse to

stimulusOrganic chemicals

Cell part Extracellular region part

Skin and connective

tissue diseases

[0 7403]

[0 4833] [0 2315]

[00042 262] [00016 215][014 899] [0 307]

Amino acids peptides and

proteins

Cellular process Proteins

Cell adhesion involved in heart morphogenesis

Zyxin

Synapse Nervous system diseases

Asymmetric synapse Brain

neoplasmsSymmetric

synapse Granulomatousdisease chronic

[0036 2009]

[0023 0023]

[0034 0034] [0023 0023] [0006 0006]

[0098 062]

ontology

middot middot middot

middot middot middot

middot middot middot middot middot middot middot middot middot

middot middot middot middot middot middot middot middot middot

middot middot middotmiddot middot middot

Gene

NeuronalRNA

granule

Traumanervoussystem

Figure 10 A snapshot of ontology integration between GO and NDFRT by Approximate (depth = 6) Blue nodes are terms from GO andred nodes are terms from NDFRT Solid lines are edges and dashed lines are paths consisting of 1 or more edges In each bracket the left partis the closeness score and the right part is the cohesion score

In fact there are multiple studies to justify the struc-tural associations seen in Figure 10 such as the associationbetween ldquosignalingrdquo and ldquocarbohydratesrdquo [28] and the asso-ciation between ldquoextracellular region partrdquo and ldquoskin andconnective tissue diseasesrdquo [29]

In addition we have noticed a number of meaningfulintegrations betweenGO terms and neurological terms in theNDFRT For example synapse is a brain related structure andthe term ldquosymmetric synapserdquo is associated with ldquotraumardquoand the term ldquoasymmetric synapserdquo is associated with ldquobrainneoplasmsrdquo Similarly it is reasonable to see that ldquoneuronalRNA granulerdquo is integrated with ldquogranulomatous diseaserdquo agranule associated disease As another example it is veryinteresting to notice that ldquozyxinrdquo is associated with ldquocelladhesion involved in heart morphogenesisrdquo and that providesa link with the formation of heart

The above observations suggest a novel way of using ourontology integration method to perform association studiesbetween biomedical concepts

33 Integrating Multiple Ontologies In the following exper-iments we will study the performances of Basic Greedyand FastApproximate in integrating multiple ontologiesAll the three approaches are built upon Approximate whichperforms very well in the previous study for integrating tworeal ontology datasets

We use two sets of synthetic datasets in this study In thefirst set of datasets we fix the number of ontologies to be 10and vary the size of each ontology from 100 to 1000 In thesecond set of datasets we fix the size of each ontology to be100 and vary the number of ontologies from 10 to 100 All theontologies are randomly generated by constructing aminimalspanning tree from a randommatrixThe relationshipmatrixbetween every pair of ontologies is also randomly generatedwith entry values ranging from 0 to 1 For each experiment

we generate 10 randomdatasets and the results reported in thefollowing are the average results over the 10 random datasets

The overall cohesion scores of the three approachesover different ontology sizes and over different numbers ofontologies were reported in Figures 6 and 7 respectivelyFastApproximate outperforms all the other approachesin Figure 6 which is consistent with the analysis of itsapproximation rate However Greedy slightly outperformsFastApproximate in Figure 7 especially when the ontologynumber is large This is understandable because when thenumber of ontology (119896) increases the approximation rate (asstated in Theorem 5) decreases and becomes less significantThis result also justifies the choice of adjusted cohesion scorefor Greedy as described at the end of Section 221

The integration time of the three approaches over differ-ent ontology sizes and over different numbers of ontologieswas reported in Figures 8 and 9 respectively These figuresare consistent with the time complexity analysis given inSection 223 In particular we noticed that the integrationtime of Greedy deteriorates sharply over the ontologynumber increase In contrast FastApproximate is muchmore scalable and has a time curve similar to Basic

These results suggest that FastApproximate has the bestoverall performance in integrating multiple ontologies

4 Conclusions

In this work we started with a basic problem on integrating apair of ontology tree structures with a given closeness matrixand later we advanced the basic problem to the problem ofintegrating large number of ontologies We proved optimalstructures in the basic problem and developed both optimaland efficient approximation solutions Although the multipleontology integration problem has similar optimal structuresit is not feasible to extend the optimal and efficient approx-imation solutions for the basic problem to efficiently handle

14 BioMed Research International

multiple ontology integration To tackle the challenge of inte-grating a large number of ontologies we developed both aneffective greedy approach and a fast approximation approachThe empirical study not only confirms our analysis on theefficiency of the proposedmethod but also demonstrates thatourmethod can be used effectively for biomedical associationstudies

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

References

[1] D L McGuinness F van Harmelen andM K Smith Owl webontology language overview W3C recommendation 10(2004-03)10 2004

[2] O Bodenreider ldquoThe unifiedmedical language system (UMLS)integrating biomedical terminologyrdquo Nucleic Acids Researchvol 32 supplement 1 pp D267ndashD270 2004

[3] X Wu R Jiang M Q Zhang and S Li ldquoNetwork-based globalinference of human disease genesrdquo Molecular Systems Biologyvol 4 no 1 2008

[4] B Linghu E S Snitkin Z Hu Y Xia and C DeLisi ldquoGenome-wide prioritization of disease genes and identification ofdisease-disease associations from an integrated human func-tional linkage networkrdquo Genome Biology vol 10 no 9 articleR91 2009

[5] D Botstein and N Risch ldquoDiscovering genotypes underlyinghumanphenotypes past successes formendelian disease futureapproaches for complex diseaserdquo Nature Genetics vol 33 pp228ndash237 2003

[6] H S Pinto and J P Martins ldquoA methodology for ontologyintegrationrdquo in Proceedings of the 1st International Conferenceon Knowledge Capture (K-CAP rsquo01) pp 131ndash138 ACM 2001

[7] Y Xiang K Lu S L James T B Borlawsky K Huang and P RO Payne ldquoK-Neighborhood decentralization a comprehensivesolution to index the UMLS for large scale knowledge discov-eryrdquo Journal of Biomedical Informatics vol 45 no 2 pp 323ndash336 2012

[8] J Dutkowski M Kramer M A Surma et al ldquoA gene ontologyinferred from molecular networksrdquo Nature Biotechnology vol31 no 1 pp 38ndash45 2013

[9] R Navigli P Velardi and A Gangemi ldquoOntology learningand its application to automated terminology translationrdquo IEEEIntelligent Systems vol 18 no 1 pp 22ndash31 2003

[10] J Jannink andGWiederhold ldquoThesaurus entry extraction froman on-line dictionaryrdquo in Proceedings of the Fusion ConferenceSunnyvale Calif USA 1999

[11] C Papatheodorou A Vassiliou and B Simon ldquoDiscovery ofontologies for learning resources using word-based clusteringrdquoin World Conference on Educational Multimedia Hypermediaand Telecommunications pp 1523ndash1528 2002

[12] D L Rubin M Hewett D E Oliver T E Klein and RB Altman ldquoAutomating data acquisition into ontologies frompharmacogenetics relational data sources using declarativeobject definitions and xmlrdquoPacific SymposiumonBiocomputingpp 88ndash99 2001

[13] N F Noy ldquoSemantic integration a survey of ontology-basedapproachesrdquo ACM SIGMOD Record vol 33 no 4 pp 65ndash702004

[14] S Abels L Haak and A Hahn ldquoIdentification of commonmethods used for ontology integration tasksrdquo in Proceedingsof the 1st International Workshop on Interoperability of Het-erogeneous Information Systems (IHIS rsquo05) pp 75ndash78 ACMNovember 2005

[15] A Gangemi D Pisanelli and G Steve ldquoOntology integrationexperiences withmedical terminologiesrdquo in Formal Ontology inInformation Systems vol 46 pp 98ndash94 IOS Press AmsterdamThe Netherlands 1998

[16] N F Noy and M A Musen ldquoSMART automated support forontology merging and alignmentrdquo in Proceedings of the 12thWorkshop on Knowledge Acquisition Modelling and Manage-ment (KAW 99) Banff Canada 1999

[17] A Doan J Madhavan P Domingos and A Halevy ldquoLearningtomap between ontologies on the semantic webrdquo in Proceedingsof the 11th International Conference onWorldWideWeb (WWWrsquo02) pp 662ndash673 ACM May 2002

[18] J Xie F Liu and S-U Guan ldquoTree-structure based ontologyintegrationrdquo Journal of Information Science vol 37 no 6 pp594ndash613 2011

[19] D Calvanese G De Giacomo andM Lenzerini ldquoA frameworkfor ontology integrationrdquo inTheEmerging SemanticWebSelectedPapers from the First Semantic Web Working Symposium pp201ndash214 2002

[20] O Udrea L Getoor and R J Miller ldquoLeveraging data andstructure in ontology integrationrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo07) pp 449ndash460 ACM June 2007

[21] H W Kuhn ldquoThe Hungarian method for the assignmentproblemrdquoNaval Research Logistics Quarterly vol 1-2 pp 83ndash971955

[22] D E Drake Vinkemeier and S Hougardy ldquoA linear-timeapproximation algorithm for weighted matchings in graphsrdquoACMTransactions on Algorithms vol 1 no 1 pp 107ndash122 2005

[23] A J J Wood W E Evans and H L McLeod ldquoPharmacoge-nomicsmdashdrug disposition drug targets and side effectsrdquo TheNew England Journal of Medicine vol 348 no 6 pp 538ndash5492003

[24] R B Altman ldquoPharmgkb a logical home for knowledge relatinggenotype to drug response phenotyperdquoNature Genetics vol 39no 4 article 426 2007

[25] R B Kim D OrsquoShea and G R Wilkinson ldquoInterindivid-ual variability of chlorzoxazone 6-hydroxylation in men andwomen and its relationship to CYP2E1 genetic polymorphismsrdquoClinical Pharmacology amp Therapeutics vol 57 no 6 pp 645ndash655 1995

[26] T N Ferraro and R J Buono ldquoThe relationship between thepharmacology of antiepileptic drugs and human gene variationan overviewrdquo Epilepsy and Behavior vol 7 no 1 pp 18ndash36 2005

[27] A Albin X Ji T B Borlawsky et al ldquoEnabling online studiesof conceptual relationships between medical terms developingan efficient web platformrdquo JMIR Medical Informatics vol 2 no2 article e23 2014

[28] S Chandrasekaran J W Dean III M S Giniger and M LTanzer ldquoLaminin carbohydrates are implicated in cell signal-ingrdquo Journal of Cellular Biochemistry vol 46 no 2 pp 115ndash1241991

[29] J Uitto and D Kouba ldquoCytokine modulation of extracellularmatrix gene expression relevance to fibrotic skin diseasesrdquoJournal of Dermatological Science vol 24 pp S60ndashS69 2000

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 5: Research Article Building Integrated Ontological Knowledge

BioMed Research International 5

Case (2a) Case (2b) Case (3a) Case (3b)

Case (1) Case (2) Case (3)

y2

y2

y2 y2 y2

y2

y1

y1

y1 y1y1

y1

Ty1

Ty2

x

xx

xx

x

ra rb

Txy1y2

tx

Figure 3 An illustration of three cases in the proof of Lemma 2

Sort vertices in 119879119860and 119879

119861in the topological order

for 119894 = |119881(119879119860)| to 0 do

for 119895 = |119881(119879119861)| to 0 do

119904119888119900119903119890 = 119872119860119861(119894 119895) + 119872119886119909119872119886119905119888ℎ(119879

119860minus119894 119879119861minus119895)

119904119888119900119903119890119886= 119872119886119909119872119886119905119888ℎ(119879

119860minus119894 119879119895)

119904119888119900119903119890119887= 119872119886119909119872119886119905119888ℎ(119879

119894 119879119861minus119895)

119888119900ℎ119890119904119894119900119899 119898119886119905119903119894119909(119894 119895) = max(119904119888119900119903119890 119904119888119900119903119890119886 119904119888119900119903119890

119887)

end forend forreturn 119888119900ℎ119890119904119894119900119899 119898119886119905119903119894119909

Algorithm 2 BuildCohesionMatrix(119879119860 119879119861)

119879119860and 119903119887of119879119861 respectively119879

119860minus119903119886does not include 119903

119886and119879119861minus119903119887

does not include 119903119887 One has 119892(119879

119860 119879119861) = max(119872

119860119861(119903119886 119903119887) +

119892(119879119860minus119903119886 119879119861minus119903119887) 119892(119879119860 119879119861minus119903119887) 119892(119879119860minus119903119886 119879119861))

Proof We can divide the integration of tree 119879119860with tree 119879

119861

into two cases according to the merging of their roots

(1) The roots of 119879119860and 119879

119861are merged together

(2) The roots of 119879119860and 119879

119861are not merged together

For case (1) it is clear that the cohesion score is119872119860119861(119903119886 119903119887)+

119892(119879119860minus119903119886 119879119861minus119903119887)

For case (2) we conclude that either 119879119860is integrated with

119879119861minus119903119887

(119903119887is out of integration) or 119879

119861is integrated with 119879

119860minus119903119886

(119903119886is out of integration) Otherwise we will have a merged

tree 119879119860119861

with two roots 119903119886and 119903

119887 a contradiction to the

fact that 119879119860119861

is a tree Therefore the cohesion score is either119892(119879119860 119879119861minus119903119887) or 119892(119879

119860minus119903119886 119879119861)

Combining cases (1) and (2) and according to Criterion(2) we have

119892 (119879119860 119879119861) = max (119872

119860119861(119903119886 119903119887) + 119892 (119879

119860minus119903119886 119879119861minus119903119887)

119892 (119879119860 119879119861minus119903119887) 119892 (119879

119860minus119903119886 119879119861))

(1)

Lemma 2 Let 119879119860minus119903119886

and 119879119861minus119903119887

represent two sets of treesobtained by removing the root vertices 119903

119886from 119879

119860minus119903119886and 119903119887

from 119879119861minus119903119887

One has

119892 (119879119860minus119903119886 119879119861minus119903119887) = max( sum

(119879119909 119879119910)isin119877

(119892 (119879119909 119879119910))) (2)

Here 119879119909isin 119879119860minus119903119886

119879119910isin 119879119861minus119903119887

and 119877 is a matching of trees in119879119860minus119903119886

with trees in 119879119861minus119903119887

Proof To prove this lemma we first prove that for any tree119879119909isin 119879119860minus119903119886

it can be integrated with no more than one treein 119879119861minus119903119887

We will prove this claim by contradiction Assumethere are two trees 119879

1199101isin 119879119861minus119903119887

and 1198791199102isin 119879119861minus119903119887

and theyintegrate with a tree119879

119909isin 119879119860minus119903119886

into an integrated tree11987911990911991011199102

There are three cases for the root 119903 of 119879

11990911991011199102 as illustrated in

Figure 3

(1) 119903 contains only the root of 119879119909

(2) 119903 contains only the root of 1198791199101or 1198791199102

(3) 119903 contains the root of 119879119909and the root of 119879

1199101or 1198791199102

For case (1) the lowest common ancestor of the roots of 1198791199101

and 1198791199102in the integrated tree 119879

11990911991011199102will no longer contain

their lowest common ancestor in 119879119861 a contradiction to

6 BioMed Research International

push (0 0119873119880119871119871) into queue 119902while |119902| gt 0 do(119886 119887 119888) = 119901119900119901(119902)if 119886 = minus1 thenmake 119879

119861(119887) as a subtree of 119879

119860119861rooted at 119887 119879

119861(119887) is a sub tree of 119879

119861rooted at 119887

else if 119887 = minus1 thenmake 119879

119860(119886) as a subtree of 119879

119860119861rooted at 119886 119879

119860(119886) is a sub tree of 119879

119860rooted at 119886

elselet 119889 = (119886 119887) and make 119889 a child of 119888 on 119879

119860119861

119904119888119900119903119890 = 119872119860119861(119886 119887) + 119872119886119909119872119886119905119888ℎ(119879

119860minus119886 119879119861minus119887)

119904119888119900119903119890119886= 119872119886119909119872119886119905119888ℎ(119879

119860minus119886 119879119887)

119904119888119900119903119890119887= 119872119886119909119872119886119905119888ℎ(119879

119886 119879119861minus119887)

if 119904119888119900119903119890 gt= 119904119888119900119903119890119886ampamp 119904119888119900119903119890 gt= 119904119888119900119903119890

119887then

push matching results of 119879119860minus119886 119879119861minus119887

with 119889 into 119902else if 119904119888119900119903119890

119886gt= 119904119888119900119903119890

119887then

push matching results of 119879119860minus119886 119879119887with 119889 into 119902

elsepush matching results of 119879

119886 119879119861minus119887

with 119889 into 119902end if

end ifend whilereturn 119879

119860119861

Algorithm 3 BuildNewOnto(cohesion matrix 119879119860 119879119861)

Criterion (1) For cases (2) and (3) the root of 1198791199102(or the root

of 1198791199101) will be the descendant of the root of 119879

1199101(or the

root of 1198791199102) in the integrated tree 119879

1199091199101 1199102 a contradiction to

Criterion (1) For integration involving more than two treesfrom 119879

119861minus119903119887 we can still follow the above procedure to reach

contradictions Thus the claim is provenWithout loss of generality we can see that for any

tree 119879119910isin 119879119861minus119903119887

it can be integrated with no more thanone tree in 119879

119860minus119903119886 Therefore the integration between 119879

119860minus119903119886

and 119879119861minus119903119887

corresponds to a matching in a weighted bipar-tite graph in which two sets of nodes represent treesfrom 119879

119860minus119903119886and 119879

119861minus119903119887 respectively and edges represent

corresponding cohesion scores According to Criterion (2)119892(119879119860minus119903119886 119879119861minus119903119887) = max(sum

(119879119909 119879119910)isin119877(119892(119879119909 119879119910))) and we con-

clude that 119892(119879119860minus119903119886 119879119861minus119903119887) corresponds to the weight of amax-

imum weighted matching in the above bipartite graph

Given Lemma 2 we can see that the following corollary iscorrect

Corollary 3 Define MaxMatch(119883 119884) = sum(119879119909 119879119910)isin119877

119892(119879119909 119879119910)

where 119877 is a maximum weighted matching of trees in forests119883 and 119884 given 119892(119879

119909 119879119910) for any tree pair (119879

119909 119879119910) isin 119883 times 119884

One concludes that for any two forests 119879119860and 119879

119861 119892(119879119860 119879119861) =

MaxMatch(119879119860 119879119861)

With Lemma 1 and Corollary 3 we are able to designan efficient dynamic programming algorithm achieving theglobal optimum for the ontology integration problem Thepseudocode for calculating the maximum cohesion score isdescribed in Algorithm 2 which visits ontology vertices inreverse topological order when filling up the cohesionmatrix

At the end of Algorithm 2 the cohesion matrix is filledup with optimal cohesion scores and the maximum cohesionscore is saved at entry (0 0) as described byTheorem 4

Theorem 4 It holds that cohesion matrix(119894 119895) = 119892(119879119860minus119894

119879119861minus119895) and cohesion matrix(0 0) = 119892(119879

119860 119879119861)

Proof We will prove this theorem by mathematical induc-tion

Let |119881(119879119860)| = 119899 and |119881(119879

119861)| = 119898 it is easy to see

that cohesion matrix(119899119898) is optimal because 119899 and 119898correspond to leaf nodes in the topological order Thus119879119860minus119899

and 119879119861minus119898

are empty sets and cohesion matrix(119899119898) =119872119860119861(119899119898) = 119892(119879

119899 119879119898)

When 119894 = 119899 and 119895 lt 119898 according to Lemma 1 to integrate119879119899with 119879

119895 either 119879

119899(the leaf vertex) is merged with 119879

119895or 119879119899

is integrated with 119879119861minus119895

The maximum score of integrating119879119899with 119879

119861minus119895is available at the time cohesion matrix(119899 119895) is

being calculated because of the reverse topological visitThusaccording to both Lemma 1 and Corollary 3 we concludethat 119892(119879

119899 119879119895) = max(119872

119860119861(119899 119895)MaxMatch(119879

119899 119879119861minus119895)) =

cohesion matrix(119899 119895) Similarly we can conclude that119892(119879119894 119879119898) = max(119872

119860119861(119894 119898)MaxMatch(119879

119860minus119894 119879119898)) =

cohesion matrix(119894 119898)When 119894 lt 119899 and 119895 lt 119898 again according to

Lemma 1 and Corollary 3 we have 119892(119879119894 119879119895) = max(score

score119886 score

119887) = cohesion matrix(119894 119895) Due to the reserve

topological visit MaxMatch(119879119860minus119894 119879119861minus119895) MaxMatch(119879

119860minus119894

119879119895) and MaxMatch(119879

119894 119879119861minus119895) are available at the time

cohesion matrix(119894 119895) is being calculated

Given the definition of 119892(119879119860 119879119861) Theorem 4 in fact

proves that the proposed approach achieves the global opti-mum However the global optimal solution is built upon

BioMed Research International 7

themaximumweightedmatching (recall Corollary 3) As dis-cussed in Section 213 the maximum weighted matching istime-consuming and therefore we propose an approximationsolution in that section

Although Algorithm 2 builds the cohesion matrix withoptimal cohesion scores it does not construct the integratedontological knowledge structure We may save the ontologyintegration details along with the cohesion scores Howeverthat will cost 119874(1198993) memory space (assuming each ontologyhas 119874(119899) vertices) and significantly reduce the capacity ofthe algorithm in handling large ontology integrations Quiteinterestingly we find that it is not necessary to save the inte-gration details in order to construct the integrated ontologyThe construction can be done by a process reverse to theconstruction of cohesionmatrix as described inAlgorithm 3

Algorithm 3 uses the cohesion matrix constructed byAlgorithm 2 and builds the integrated ontology tree still byfollowing Lemma 1 and Corollary 3 but in a reverse way ofAlgorithm 2 The construction is performed in a Breadth-First fashion which uses a queue 119902 to maintain triples Eachtriple (119886 119887 119888) is an association of three elements 119886 is thematched vertex from ontology 119879

119860 119887 is the matched vertex

from ontology 119879119861 and 119888 is their parent on the merged

ontology 119879119860119861 By following the basic idea of the proof of

Theorem 4 we can show that Algorithm 3 builds an optimalintegrated ontology with the cohesion matrix provided fromAlgorithm 2 We omit the proof for succinctness

213 Time Complexity Analysis and an Approximation Solu-tion Assume an ontology size is 119874(119899) The cohesion matrixhas 119874(1198992) entries to fill up The computation for each entryis a matching whose time complexity depends on the imple-mentation The maximum weighted matching takes 119874(1198993)using the famous Hungarian algorithm [21] and althoughit achieves optimum it is too costly for large ontologiesThe maximal weighted matching however takes 119874(1198992 log 119899)time and more importantly results in an overall 119874(1198992 log 119899)time complexity for Algorithm 2 The analysis is given inthe following For each matching the algorithm will accesspreviously filled entries and each entry will be accessed onlyonce and be involved only once in a sorting of 119874(log 119899)time This is because each entry corresponds to two verticeswhose cohesion score will be accessed when calculating thecohesion score of their parents Thus we conclude that thetotal time complexity of calculating the cohesionmatrix usingmaximal weightedmatching is119874(1198992+1198992 log 119899) = 119874(1198992 log 119899)Since building the new ontology has the same time com-plexity as building the cohesion matrix this is also the totaltime complexity for integrating two ontologies by maximalweightedmatchingThemaximal weightedmatching also hasa guaranteed lower bound on the results It achieves a (12)-approximation solution (ie the overall cohesion score willbe at least 12 of the optimal cohesion score) as pointed out in[22]

Since themaximalweightedmatching results in an overallgood performance on time complexity and approximationrate we used the maximal weighted matching in our empir-ical study for Algorithms 2 and 3 Readers may also choose

othermatching algorithms (such as the one described in [22])to achieve slightly better approximation rates However theweightedmatching is a replaceablemodule for our algorithmsand it is not the focus of this work to build a fast and close-to-optimal weighted matching algorithm

Compared to the time complexity of integrating ontolo-gies by the dynamic programming approach as describedin Algorithms 2 and 3 the heuristic approach described atthe beginning of this section also has 119874(1198992) in the worstcase However we conjecture that the heuristic approach hasa much smaller average time complexity because in eachstep the heuristic approach may exclude a large number ofmatching opportunities

22 Integrating Multiple Ontologies In the previous sectionwe proposedmethods for integrating two ontologies In somebiomedical applications [23 24] we are interested in theassociations involving more than two objects Integrationof multiple ontologies of these objects will generate aninnovative view on these complex relationships Similar tothe basic problem formulation we can formulate themultipleontology integration as follows

Given 119896 ontology trees 1198791 1198792 119879

119896and a closeness

matrix119872119894119895for any two trees 119879

119894and 119879

119895 how can we efficiently

generate an integrated ontology tree 11987912119896

meeting thefollowing criteria

(1) For any two vertices 119909 and 119910 in a tree 119879119894 their

lowest common ancestor LCA119879119894(119909 119910) is contained by

LCA11987912119896(119909 119910)

(2) It holds that argmax11987912119896119891(11987912119896

) =

sum119883isin119881(119879119860119861)

sum119906isin119883Visin119883120590(119906)lt120590(V)119872120590(119906)120590(V)(119906 V) Here

119872120590(119906)120590(V)(119906 V) is the entry value in the closeness

matrix for two vertices 119906 and V (one from tree 119879120590(119906)

and the other from tree 119879120590(V)) contained in the node

119883 For a vertex V from an original ontology 120590(V) isdefined as its original ontology ID

Again we name the function 11989111987912119896

the cohesion of theintegrated ontology 119879

12119896 For each node 119883 in the inte-

grated ontology we define its weight as 119908119890119894119892ℎ119905(119883) =sum119906isin119883Visin119883120590(119906)lt120590(V)119872120590(119906)120590(V)(119906 V) Correspondingly we define

function119892(1198791 1198792 119879

119896) = max

11987912119896(sum119883isin119881(11987912119896)

119908119890119894119892ℎ119905(119883))

as the maximum cohesion function for integrating theontologies 119879

1 1198792 119879

119896 As we can see in the above formula-

tion the overall cohesion score of integration is the summedweight of each node which is the sum of pairwise closenessscores

The formulation of multiple ontology integration is sim-ilar to the basic version and it is not difficult to showthat optimal structures described in Lemmas 1 and 2 canbe extended to a high dimension However the extensionof algorithms described in Section 212 for integrating twoontologies is not feasible for solving the multiple ontologyintegrationThis is because if we need to extend Algorithms 2and 3 to this problem we need to build a cohesionmatrix of 119896dimensions It implies that we need at least 119874(119899119896) operationsto fill up the score matrix assuming the size of an ontology

8 BioMed Research International

A B C D

A 10 6 8

B 10 7 5

C 6 7 9

D 8 5 9

AB C D

AB

C 9

D 9

lowastlowast

lowast

lowast mdash

mdash

mdash

mdash

mdash

mdash

mdash

(lowast) Entries to update

Figure 4 An example of the InterOntology matrixrsquos change at the first iteration in Algorithm 4 for integrating four ontologies

build the 119896 times 119896 InterOntology matrixfor 119894 = 1 to 119896 minus 1 doidentify the active tree pair ⟨119879

119883 119879119884⟩ that corresponds

to the highest score in the InterOntology matrixintegrate 119879

119883and 119879

119884into 119879

119883119884

mark 119879119883and 119879

119884as inactive

update relationship matricesupdate InterOntology matrix

end forreturn 119879

12119899

Algorithm 4 GreedyMultiInt(T = 1198791 1198792 119879

119896)

is 119874(119899) This is clearly not acceptable for high dimensionalontology integration

221 Greedy Approach From the above discussion we cansee that direct extension of Algorithms 2 and 3 for integratingtwo ontologies is practically not feasible for integrating alarge number of ontologies However we can still use thesealgorithms for integrating multiple ontologies by iterativelyintegrating two ontologies and generating a new closenessmatrix Given the ontologies 119879

1 1198792 119879

119896 we can first

integrate 1198791and 119879

2into 119879

12and then build the closeness

matrix between 11987912

and 1198793using the relationship matrices

between 1198791and 119879

2and between 119879

1and 119879

3 Specifically

assume 119883 is a node on the integrated ontology 11987912 and 119883

contains a vertex 119886 from 1198791and a vertex 119887 from 119879

2 Then

the entry (119883 119888) of the closeness matrix between 11987912

and 1198793

is 11987211987912 1198793(119883 119888) = 119872

1198791 1198793(119886 119888) + 119872

1198792 1198793(119887 119888) After the new

closeness matrix is generated we can continue integrating11987912

and 1198793into 119879

123and generating another new closeness

matrix We will eventually get the integrated ontology 11987912119896

by repeating the above process To facilitate the followingdiscussions we name the above approach the basic multipleintegration approach

Although the basic multiple integration approach canfinish integrating multiple ontologies it blindly integratesontologies without using any cohesion information betweenontologies that may lead to a better integration resultTo improve the basic multiple integration we propose agreedy approach that uses the cohesion information betweenontologies to guide the integration The basic steps of thegreedy approach are outlined in Algorithm 4 To facilitate

the understanding of Algorithm 4 we use Figure 4 toillustrate an example of the InterOntology matrixrsquos change atthe first iteration of integrating four ontologies A B C andD

The key idea in Algorithm 4 is to maintain an InterOn-tology matrix which guides the integration Initially thismatrix is filled with the overall cohesion score of everypair of ontologies In each step this matrix is updated withoverall cohesion scores between newly integrated ontologyand existing active ontologies The integration will take placebetween two active ontologies which have the highest scorein the InterOntology matrix

When we used the overall cohesion score between twoontologies to update the InterOntology matrix we observedan interesting phenomenon that the integration inmost casesis a process continuously expanding an integrated ontologyConsequently the greedy approach is likely to yield a resultsimilar to the basic approach

This phenomenon can be explained by the definition ofmaximum cohesion function which takes into account allpairwise closeness between merged terms Thus the moreontologies contained in an integrated ontology are the morelikely it will have high overall cohesion scores with otherontologies As a result it creates unfairness for the integrationselection To fix this issue we use the adjusted overall cohe-sion scores in updating the InterOntology matrix as follows

Given an ontology 119879119883and an ontology 119879

119884where 119883 and

119884 are nonempty sets of ontology IDs we define the adjustedcohesion score between 119879

119883and 119879

119884as 119860119889119895119862119900ℎ(119879

119883 119879119884) =

sum119885isin119881(119879119883119884)

sum119909isin119885119910isin119885120590(119909)isin119883120590(119910)isin119884

119872120590(119909)120590(119910)

(119909 119910)|119883||119884| where119879119883119884

is the integrated ontology built by Algorithms 2 and 3The adjusted cohesion score is in fact the weight increase byintegrating 119879

119883and 119879

119884 divided by the size of 119883 times the

size of 119884 that is119860119889119895119862119900ℎ(119879119883 119879119884) = sum

119885isin119881(119879119883119884)(119908119890119894119892ℎ119905(119885))minus

sum119883isin119881(119879119883)

(119908119890119894119892ℎ119905(119883)) minus sum119884isin119881(119879119884)

119908119890119894119892ℎ119905(119884)|119883||119884| For eachnode merging closeness scores will be added to the totalweight when the merging takes place between vertices fromontology set 119883 and vertices from ontology set 119884 Thus theweight increase by integrating 119879

119883and 119879

119884is proportional to

the number of ontologies in119883 times the number of ontologiesin 119884 and consequently we averaged the weight increase by|119883||119884|

222 Fast Approximation Algorithm Although the basicmultiple integration and the greedy multiple integrationapproaches discussed above are able to integrate multipleontologies none of them provide any guarantee on the results

BioMed Research International 9

A

A

7

10 8

9

6

5

9

D

C

B D

C

B

ge 7

ge8

Figure 5 An illustration of vertexedge contraction and weight updates in an iteration of Algorithm 5 Each vertex represents an ontology

for 119894 = 1 to 119896 minus 1 dofor 119895 = 119894 + 1 to 119896 dopush ⟨119892(119879

119894 119879119895) 119894 119895⟩ into a set 119878 ordered by the first element in descending order

end forend forwhile |119878| gt 0 do⟨119911 119909 119910⟩ = 119901119900119901(119878)if adding (119909 119910) does not form a cycle inG thenadding (119909 119910) toGintegrate 119879

119862(119909)and 119879

119862(119910)into 119879

119862(119909)cup119862(119910) 119862(V) is a set of vertices including V that form a connected component inG

end ifend whilereturn 119879

12119899

Algorithm 5 FastMultiInt(T = 1198791 1198792 119879

119896)

in comparison with the optimal solutions By studyingthe maximum cohesion scores between ontologies undera graph setting we identified an approximation structureand developed an approximation algorithm for integratingmultiple ontologiesWename it fast approximation algorithmbecause it not only has a lower bound on the results but alsoruns faster than the greedy multiple integration algorithmproposed above

The fast approximation algorithm for integratingmultipleontologies is sketched in Algorithm 5 It only calculates themaximum cohesion score between every pair of ontologiesonce during the initial stage and uses this informationthroughout the integration process even after it becomesstaleMore importantly this approach not only saves the timefor recalculating the maximum cohesion scores but also pro-vides a lower bound guarantee as stated inTheorem 5 whosecorrectness is built on two important lemmas (Lemmas 6 and7) which will be described subsequently

Theorem 5 The tree weight of the integrated tree 11987912119896

obtained by FastMultiInt algorithm is at least 1(119896 minus 1) theweight of the optimal solution

Proof We will use Lemmas 6 and 7 to prove this theoremThe proofs of Lemmas 6 and 7 are provided after the proofof this theorem To facilitate the proof of this theorem we

build a fully connected weighted graph G in which eachnode corresponds to a tree for integration and the weightof each edge corresponds to the weight increase (initiallythis is the cohesion score) for integrating the correspondingtrees According to Lemma 6 119892(119879

1 1198792 119879

119896) (ie optimal

cohesion score) is no more than the summed weight of edgesinG (Claim 1)

Given G the integration by Algorithm 5 is a processof 119896 minus 1 node contractions After each contraction theadjacent edge weights (cohesion scores) will be updatedaccordingly According to Lemma 7 the weight of an updatededge will only increase over (or at least remain the same as)the maximum weight of the two contracted edges (Figure 5provides an illustration of vertexedge contraction andweightupdates) Thus the overall cohesion score of the integrationby Algorithm 5 is no less than the weight of the maximumspanning tree ofG (Claim 2)

It is easy to see that the weight of a maximum spanningtree is no less than 1(119896minus1) of the summed edge weight ofGgiven the simple observation that each edge in G is either anedge of the maximum spanning tree or adjacent to an edge ofthe maximum spanning tree with an equal or heavier weight(Claim 3)

Combining Claims 1 2 and 3 we complete the proof ofthis theorem

Lemma 6 It holds that 119892(1198791 1198792 119879

119896) le sum1le119894lt119895le119896

119892(119879119894 119879119895)

10 BioMed Research International

Proof According to the problem definition

119892 (1198791 1198792 119879

119896)

= max( sum

119883isin119881(11987912119896)

119908119890119894119892ℎ119905 (119883))

= max( sum

119883isin119881(11987912119896)

sum

119906isin119883Visin119883120590(119906)lt120590(V)119872120590(119906)120590(V) (119906 V))

= ( sum

1le119894lt119895le119896

11989111987912119896

(119879119894119895)) le sum

1le119894lt119895le119896

119892 (119879119894 119879119895)

(3)

11989111987912119896(119879119894119895) is the cohesion score of 119879

119894119895whose integration is

induced from 11987912119896

Lemma 7 It holds that 119892(119879119875119876 119879119878) minus 119891(119879

119875119876) ge max(119892(119879

119875

119879119878) 119892(119879119876 119879119878))

Proof According to the problem definition integrating 119879119875119876

with 119879119878will result in an integrated tree 119879

119880where 119880 = 119875 cup

119876 cup 119878 and 119892(119879119875119876 119879119878) = max

119879119880(sum119883isin119881(119879119880)

119908119890119894119892ℎ119905(119883)) =

max119879119880(119891(119879119875119878) + 119891(119879

119876119878) + 119891(119879

119875119876)) where 119879

119875119878 119879119876119878 and

119879119875119876

are induced from 119879119880 Since 119879

119875119876has been determined

we have 119892(119879119875119876 119879119878) = max

119879119880(119891(119879119875119878) + 119891(119879

119876119878)) + 119891(119879

119875119876)

Without loss of generally let us assume max(119891(119879119875119878)) ge

max(119891(119879119876119880)) Then by restricting the integration between

119879119875and 119879

119878in 119879119880following the integration that leads to

argmax119879119875119878119891(119879119875119878) we will get a cohesion score no less than

119892(119879119875 119879119878) Thus we complete the proof for 119892(119879

119875119876 119879119878) minus

119891(119879119875119876) ge max(119892(119879

119875 119879119878) 119892(119879119876 119879119878))

223 Time Complexity Analysis For the fast approximationalgorithm (Algorithm 5) the time complexity for generatinggraph G (calculating the overall cohesion score for everypair of ontologies) is119874(11989621198992 log 119899) assumingwe usemaximalweighted matching Each integration will take 119874(1198992 log 119899)with an update of at most 119896 closeness matrices which takes119874(119896lowast119899

2) There are at most 119896 integrations therefore the total

time complexity is still 119874(11989621198992 log 119899)Following the above analysis we conclude that the time

complexity of the greedy multiple integration algorithm isthe same as the fast approximation algorithm However itrequires updating of InterOntology matrix which takes anexcessive 119874(11989621198992 log 119899) time The empirical study shows thatthe fast approximation algorithm is much faster than thegreedy multiple integration algorithm

Finally it is easy to see that the basic multiple integrationapproach takes 119874(1198961198992 log 119899) time and is the fastest but itsoverall cohesion scores are the worst as we will see in theempirical study

224 Limitations Integrating multiple ontologies may facetwo potential problems in real applications First how canwe efficiently generate a closeness matrix for every pair ofontologies to be integrated Our current method kDLS or

onGrid is efficient for generating the closenessmatrix for onepair of ontologies in most cases but not efficient enough forgenerating closeness matrices for many pairs of ontologiesSecond not every pair of ontologies can be meaningfullyintegrated It remains a problem to efficiently identify thefeasibility of integrating a pair of ontologies Therefore themain purpose of Section 22 is to demonstrate that ourproposed approach can be extended to integrate multipleontologies and we use synesthetic datasets in Section 33 tostudy the performance of algorithms proposed in Section 22

3 Results and Discussion

We would like to study the performances of the proposedontology integration methods by experiments on both realand synthetic datasets We implemented five approaches inC++

(1) Heuristic heuristic approach for integrating twoontologies as described in Section 211

(2) Approximate approximate approach (Algorithms 2and 3) with maximal weighted matching for guaran-teeing the (12)-approximation rate

(3) Basic basic multiple integration approach as de-scribed in Section 221

(4) Greedy greedy multiple integration approach(Algorithm 4)

(5) FastApproximate fast approximation multipleintegration approach (Algorithm 5)

In the following we report our study on the performancesof (1) and (2) for integrating two ontologies on real datasetsand (3) (4) and (5) for integrating multiple ontologies onsynthetic datasets All the experiments are carried out on aLinux cluster with 24GHz AMD Opteron processors

31 Integrating a Pair of Ontologies The knowledge of drug-gene relationships is desirable in many pharmacology appli-cations [25 26] By integrating the gene ontology and thedrug ontology we will be able to obtain rich information onthe associations between drugs and genes under the ontologystructures Thus in this set of experiments we simulate realworld knowledge discovery applications by integrating tworeal ontologies gene ontology (GO) and National Drug FileReference Terminology (NDFRT) Both were obtained fromthe Unified Medical Language System (version 2012AA)The closeness matrices between GO terms and drug termswere generated using onGrid [27] with a 4-neighborhoodbroadcast range (ie 119896 = 4 with regard to [7]) onGridfollows the kDLS approach [7] and measures the closenessbetween two concepts based on the discovered paths (withlength greater than one) between them However unlikekDLS onGrid takes into consideration of concept semantictypes in the closeness measurement In the study performedin [27] the advantages of onGrid over kDLS are wellillustrated

The overall cohesion scores of Heuristic andApproximate on integrating GO and NDFRT are listed

BioMed Research International 11

Table 1 Cohesion scores of integrating real datasets

Depth GO term number NDFRT term number Cohesion scoresHeuristic 120573 = 1 Heuristic 120573 = 6 Heuristic 120573 = 100 Approximate

3 66 6004 00505331 0229958 021999 1246964 710 6972 0290392 0284363 146585 938355 5355 14582 0285923 137714 0528289 3390566 16231 32841 0307941 0341588 0673406 740293

Table 2 Top 5 matched terms by the Approximate algorithm (depth = 6)

Rank GO terms NDFRT terms Closeness score1 C1135918 smooth muscle contractile fiber C0282606 muscle neoplasms 1632052 C0010813 cytokinesis C0086376 GTP-binding proteins 1159673 C0027747 axon terminus C0030584 parovarian cyst 1133524 C1155065 T cell activation C0007082 carcinoembryonic antigen 1008795 C0007595 cell growth C0294028 BRCA2 protein 0945284

Table 3 Top 5 matched terms by the Heuristic algorithm (depth = 6 120573 = 100)

Rank GO terms NDFRT terms Closeness score1 C0031845 biological process C0042890 VITAMINS 02448622 C1166607 cellular component C1657248 apoptosome 01428573 C0027540 tissue death C0065932 MENADIOL 004342194 C0025519 metabolic process C0042849 VITAMIN B 003702485 C0030012 oxidation-reduction process C0027996 NICOTINIC ACID 00327109

in Table 1 To observe the integration over the ontologysize change in each experiment we use the ontology treestructure from the root to the specified depth (first columnin Table 1) for integration The sizes of the ontology termsinvolved in the integration are listed in the second and thirdcolumns of Table 1

Recall in Section 211 120573rdepth(119887) is used to regulate theselection of vertices from high depths Thus we tested theHeuristic under 120573 = 1 (depth information is nullified)and 120573 = 100 (the vertex depth plays a critical role inthe selection) Since nonleaf vertices in these datasets havearound 6 children on average we heuristically add a set ofexperiments by setting 120573 = 6 so that 120573rdepth(119887) will be close tothe number of vertices excluded from the future integration

From Table 1 we can see that Heuristic performs betterwhen using the depth information to regulate the selectionof vertices However Approximate is much better thanHeuristic at all settings Compared to the best cohesionscores of Heuristic in each row of Table 1 Approximateconstructs an integrated ontology with the overall cohe-sion score ranging from 54 times to 1099 times that ofHeuristicThis clearly demonstrates the effectiveness of theproposed Approximate approach Nevertheless the heuris-tic approach has a much faster average running time as aresult of excluding a large number of matching opportunitiesin each step

Although the running time of Approximate is longerthan Heuristic it takes less than two hours to finishintegrating two ontologies with about 16119896 and 33119896 vertices

Most of the biomedical ontologies are smaller than or similarto these sizes and Approximate approach will benefit theassociation study of these ontologies For extremely largeontology pairs in which Approximate is unable to finish theintegrationwithin a reasonable time Heuristicmay providea quick view on their integration

32 Understanding the Merged Ontology Terms To under-stand what terms are merged in integrating real ontologieswe use the integration of GO and NDFRT at depth 6 asan example Tables 2 and 3 list the top 5 pairs of mergedterms (sorted by their closeness scores) by Heuristicand Approximate respectively As mentioned abovethese scores are from the closeness matrix generatedby onGrid based on the discovered paths betweenthem For example ldquoC1155065T-Cell Activation minus minusis physiologic effect of chemical or drug minus minus gt C0393002Carcinoembryonic Antigen Peptide 1 minus minus has target minus minus gtC0007082Carcinoembryonic Antigenrdquo is such a path

From Tables 2 and 3 we can observe that theApproximate algorithm merges terms with much highersimilar scores than the Heuristic algorithm Quiteinterestingly we observed that the top ranked merging inTable 3 is between ldquobiological processrdquo and ldquoVITAMINSrdquoThe ldquobiological processrdquo is an abstract term which is veryclose to the root of the GO ontology Such a fact suggests thatthe top level terms will likely preempt the merging choicesover their descendants As a result of this greedy approach

12 BioMed Research International

0

2000

4000

6000

8000

10000

12000

14000

16000

18000

100 200 300 400 500 600 700 800 900 1000

Ove

rall

cohe

sion

scor

e of i

nteg

ratio

n

Individual ontology size

FastApproximateBasic

Greedy

Figure 6 The change of overall cohesion score over the increase ofthe size of each ontology The number of ontologies is fixed at 10

0

100000

200000

300000

400000

500000

600000

700000

800000

10 20 40 70 80 90 100

Ove

rall

cohe

sion

scor

e of i

nteg

ratio

n

Number of ontologies 30 50 60

FastApproximateBasic

Greedy

Figure 7 The change of overall cohesion score over the increase ofthe number of ontologies The size of each ontology is fixed at 100

the Heuristic algorithm will end at a local optimum whichis far from being optimal

A snapshot of ontology integration by Approximate asshown in Figure 10 provides a good insight on the algorithmwork In each bracket of two merged terms the left partis the closeness score and the right part is the cohesionscore We can observe that most closeness scores are zero orclose to zero while the corresponding cohesion scores aremuch higher This is understandable because the snapshot isprimarily on the top level terms of both ontologies For theseterms they have a large number of subclass (descendant)terms and optimizing the integration of their subclass termsfar outweighs integrating of themselves The result of such

0

200

400

600

800

1000

1200

1400

1600

100 200 300 400 500 600 700 800 900 1000

Inte

grat

ion

time (

s)

Individual ontology size

FastApproximateBasic

Greedy

Figure 8 The change of integration time over the increase of thesize of each ontology The number of ontologies is fixed at 10

0

500

1000

1500

2000

2500In

tegr

atio

n tim

e (s)

10 20 40 70 80 90 100Number of ontologies

30 50 60

FastApproximateBasic

Greedy

Figure 9 The change of integration time over the increase of thenumber of ontologies The size of each ontology is fixed at 100

integration provides novel knowledge of association betweenontology terms That is even if two terms are not that closeaccording to some closeness measurement they can be struc-turally associated under their ontology context For examplethe GO term ldquobiological processrdquo is merged with the NDFRTterm ldquochemical ingredientsrdquo even their closeness score iszero from the onGrid output However such integrationis interesting because it shows that the merging is tryingto link the chemical compounds with the biologicalcellularprocesses so that corresponding associations between thecellular processes and chemical structures can be establishedThis demonstrates the purpose of integrating two ontologiesthat is identifying associations with respect to both termsimilarities and structural contexts

BioMed Research International 13

component

Biologicalprocess Cellular

SignalingNeoplasms

National Drug File- Reference Terminology

Chemical ingredients

Diseases manifestations or physiologic states

CarbohydratesResponse to

stimulusOrganic chemicals

Cell part Extracellular region part

Skin and connective

tissue diseases

[0 7403]

[0 4833] [0 2315]

[00042 262] [00016 215][014 899] [0 307]

Amino acids peptides and

proteins

Cellular process Proteins

Cell adhesion involved in heart morphogenesis

Zyxin

Synapse Nervous system diseases

Asymmetric synapse Brain

neoplasmsSymmetric

synapse Granulomatousdisease chronic

[0036 2009]

[0023 0023]

[0034 0034] [0023 0023] [0006 0006]

[0098 062]

ontology

middot middot middot

middot middot middot

middot middot middot middot middot middot middot middot middot

middot middot middot middot middot middot middot middot middot

middot middot middotmiddot middot middot

Gene

NeuronalRNA

granule

Traumanervoussystem

Figure 10 A snapshot of ontology integration between GO and NDFRT by Approximate (depth = 6) Blue nodes are terms from GO andred nodes are terms from NDFRT Solid lines are edges and dashed lines are paths consisting of 1 or more edges In each bracket the left partis the closeness score and the right part is the cohesion score

In fact there are multiple studies to justify the struc-tural associations seen in Figure 10 such as the associationbetween ldquosignalingrdquo and ldquocarbohydratesrdquo [28] and the asso-ciation between ldquoextracellular region partrdquo and ldquoskin andconnective tissue diseasesrdquo [29]

In addition we have noticed a number of meaningfulintegrations betweenGO terms and neurological terms in theNDFRT For example synapse is a brain related structure andthe term ldquosymmetric synapserdquo is associated with ldquotraumardquoand the term ldquoasymmetric synapserdquo is associated with ldquobrainneoplasmsrdquo Similarly it is reasonable to see that ldquoneuronalRNA granulerdquo is integrated with ldquogranulomatous diseaserdquo agranule associated disease As another example it is veryinteresting to notice that ldquozyxinrdquo is associated with ldquocelladhesion involved in heart morphogenesisrdquo and that providesa link with the formation of heart

The above observations suggest a novel way of using ourontology integration method to perform association studiesbetween biomedical concepts

33 Integrating Multiple Ontologies In the following exper-iments we will study the performances of Basic Greedyand FastApproximate in integrating multiple ontologiesAll the three approaches are built upon Approximate whichperforms very well in the previous study for integrating tworeal ontology datasets

We use two sets of synthetic datasets in this study In thefirst set of datasets we fix the number of ontologies to be 10and vary the size of each ontology from 100 to 1000 In thesecond set of datasets we fix the size of each ontology to be100 and vary the number of ontologies from 10 to 100 All theontologies are randomly generated by constructing aminimalspanning tree from a randommatrixThe relationshipmatrixbetween every pair of ontologies is also randomly generatedwith entry values ranging from 0 to 1 For each experiment

we generate 10 randomdatasets and the results reported in thefollowing are the average results over the 10 random datasets

The overall cohesion scores of the three approachesover different ontology sizes and over different numbers ofontologies were reported in Figures 6 and 7 respectivelyFastApproximate outperforms all the other approachesin Figure 6 which is consistent with the analysis of itsapproximation rate However Greedy slightly outperformsFastApproximate in Figure 7 especially when the ontologynumber is large This is understandable because when thenumber of ontology (119896) increases the approximation rate (asstated in Theorem 5) decreases and becomes less significantThis result also justifies the choice of adjusted cohesion scorefor Greedy as described at the end of Section 221

The integration time of the three approaches over differ-ent ontology sizes and over different numbers of ontologieswas reported in Figures 8 and 9 respectively These figuresare consistent with the time complexity analysis given inSection 223 In particular we noticed that the integrationtime of Greedy deteriorates sharply over the ontologynumber increase In contrast FastApproximate is muchmore scalable and has a time curve similar to Basic

These results suggest that FastApproximate has the bestoverall performance in integrating multiple ontologies

4 Conclusions

In this work we started with a basic problem on integrating apair of ontology tree structures with a given closeness matrixand later we advanced the basic problem to the problem ofintegrating large number of ontologies We proved optimalstructures in the basic problem and developed both optimaland efficient approximation solutions Although the multipleontology integration problem has similar optimal structuresit is not feasible to extend the optimal and efficient approx-imation solutions for the basic problem to efficiently handle

14 BioMed Research International

multiple ontology integration To tackle the challenge of inte-grating a large number of ontologies we developed both aneffective greedy approach and a fast approximation approachThe empirical study not only confirms our analysis on theefficiency of the proposedmethod but also demonstrates thatourmethod can be used effectively for biomedical associationstudies

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

References

[1] D L McGuinness F van Harmelen andM K Smith Owl webontology language overview W3C recommendation 10(2004-03)10 2004

[2] O Bodenreider ldquoThe unifiedmedical language system (UMLS)integrating biomedical terminologyrdquo Nucleic Acids Researchvol 32 supplement 1 pp D267ndashD270 2004

[3] X Wu R Jiang M Q Zhang and S Li ldquoNetwork-based globalinference of human disease genesrdquo Molecular Systems Biologyvol 4 no 1 2008

[4] B Linghu E S Snitkin Z Hu Y Xia and C DeLisi ldquoGenome-wide prioritization of disease genes and identification ofdisease-disease associations from an integrated human func-tional linkage networkrdquo Genome Biology vol 10 no 9 articleR91 2009

[5] D Botstein and N Risch ldquoDiscovering genotypes underlyinghumanphenotypes past successes formendelian disease futureapproaches for complex diseaserdquo Nature Genetics vol 33 pp228ndash237 2003

[6] H S Pinto and J P Martins ldquoA methodology for ontologyintegrationrdquo in Proceedings of the 1st International Conferenceon Knowledge Capture (K-CAP rsquo01) pp 131ndash138 ACM 2001

[7] Y Xiang K Lu S L James T B Borlawsky K Huang and P RO Payne ldquoK-Neighborhood decentralization a comprehensivesolution to index the UMLS for large scale knowledge discov-eryrdquo Journal of Biomedical Informatics vol 45 no 2 pp 323ndash336 2012

[8] J Dutkowski M Kramer M A Surma et al ldquoA gene ontologyinferred from molecular networksrdquo Nature Biotechnology vol31 no 1 pp 38ndash45 2013

[9] R Navigli P Velardi and A Gangemi ldquoOntology learningand its application to automated terminology translationrdquo IEEEIntelligent Systems vol 18 no 1 pp 22ndash31 2003

[10] J Jannink andGWiederhold ldquoThesaurus entry extraction froman on-line dictionaryrdquo in Proceedings of the Fusion ConferenceSunnyvale Calif USA 1999

[11] C Papatheodorou A Vassiliou and B Simon ldquoDiscovery ofontologies for learning resources using word-based clusteringrdquoin World Conference on Educational Multimedia Hypermediaand Telecommunications pp 1523ndash1528 2002

[12] D L Rubin M Hewett D E Oliver T E Klein and RB Altman ldquoAutomating data acquisition into ontologies frompharmacogenetics relational data sources using declarativeobject definitions and xmlrdquoPacific SymposiumonBiocomputingpp 88ndash99 2001

[13] N F Noy ldquoSemantic integration a survey of ontology-basedapproachesrdquo ACM SIGMOD Record vol 33 no 4 pp 65ndash702004

[14] S Abels L Haak and A Hahn ldquoIdentification of commonmethods used for ontology integration tasksrdquo in Proceedingsof the 1st International Workshop on Interoperability of Het-erogeneous Information Systems (IHIS rsquo05) pp 75ndash78 ACMNovember 2005

[15] A Gangemi D Pisanelli and G Steve ldquoOntology integrationexperiences withmedical terminologiesrdquo in Formal Ontology inInformation Systems vol 46 pp 98ndash94 IOS Press AmsterdamThe Netherlands 1998

[16] N F Noy and M A Musen ldquoSMART automated support forontology merging and alignmentrdquo in Proceedings of the 12thWorkshop on Knowledge Acquisition Modelling and Manage-ment (KAW 99) Banff Canada 1999

[17] A Doan J Madhavan P Domingos and A Halevy ldquoLearningtomap between ontologies on the semantic webrdquo in Proceedingsof the 11th International Conference onWorldWideWeb (WWWrsquo02) pp 662ndash673 ACM May 2002

[18] J Xie F Liu and S-U Guan ldquoTree-structure based ontologyintegrationrdquo Journal of Information Science vol 37 no 6 pp594ndash613 2011

[19] D Calvanese G De Giacomo andM Lenzerini ldquoA frameworkfor ontology integrationrdquo inTheEmerging SemanticWebSelectedPapers from the First Semantic Web Working Symposium pp201ndash214 2002

[20] O Udrea L Getoor and R J Miller ldquoLeveraging data andstructure in ontology integrationrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo07) pp 449ndash460 ACM June 2007

[21] H W Kuhn ldquoThe Hungarian method for the assignmentproblemrdquoNaval Research Logistics Quarterly vol 1-2 pp 83ndash971955

[22] D E Drake Vinkemeier and S Hougardy ldquoA linear-timeapproximation algorithm for weighted matchings in graphsrdquoACMTransactions on Algorithms vol 1 no 1 pp 107ndash122 2005

[23] A J J Wood W E Evans and H L McLeod ldquoPharmacoge-nomicsmdashdrug disposition drug targets and side effectsrdquo TheNew England Journal of Medicine vol 348 no 6 pp 538ndash5492003

[24] R B Altman ldquoPharmgkb a logical home for knowledge relatinggenotype to drug response phenotyperdquoNature Genetics vol 39no 4 article 426 2007

[25] R B Kim D OrsquoShea and G R Wilkinson ldquoInterindivid-ual variability of chlorzoxazone 6-hydroxylation in men andwomen and its relationship to CYP2E1 genetic polymorphismsrdquoClinical Pharmacology amp Therapeutics vol 57 no 6 pp 645ndash655 1995

[26] T N Ferraro and R J Buono ldquoThe relationship between thepharmacology of antiepileptic drugs and human gene variationan overviewrdquo Epilepsy and Behavior vol 7 no 1 pp 18ndash36 2005

[27] A Albin X Ji T B Borlawsky et al ldquoEnabling online studiesof conceptual relationships between medical terms developingan efficient web platformrdquo JMIR Medical Informatics vol 2 no2 article e23 2014

[28] S Chandrasekaran J W Dean III M S Giniger and M LTanzer ldquoLaminin carbohydrates are implicated in cell signal-ingrdquo Journal of Cellular Biochemistry vol 46 no 2 pp 115ndash1241991

[29] J Uitto and D Kouba ldquoCytokine modulation of extracellularmatrix gene expression relevance to fibrotic skin diseasesrdquoJournal of Dermatological Science vol 24 pp S60ndashS69 2000

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 6: Research Article Building Integrated Ontological Knowledge

6 BioMed Research International

push (0 0119873119880119871119871) into queue 119902while |119902| gt 0 do(119886 119887 119888) = 119901119900119901(119902)if 119886 = minus1 thenmake 119879

119861(119887) as a subtree of 119879

119860119861rooted at 119887 119879

119861(119887) is a sub tree of 119879

119861rooted at 119887

else if 119887 = minus1 thenmake 119879

119860(119886) as a subtree of 119879

119860119861rooted at 119886 119879

119860(119886) is a sub tree of 119879

119860rooted at 119886

elselet 119889 = (119886 119887) and make 119889 a child of 119888 on 119879

119860119861

119904119888119900119903119890 = 119872119860119861(119886 119887) + 119872119886119909119872119886119905119888ℎ(119879

119860minus119886 119879119861minus119887)

119904119888119900119903119890119886= 119872119886119909119872119886119905119888ℎ(119879

119860minus119886 119879119887)

119904119888119900119903119890119887= 119872119886119909119872119886119905119888ℎ(119879

119886 119879119861minus119887)

if 119904119888119900119903119890 gt= 119904119888119900119903119890119886ampamp 119904119888119900119903119890 gt= 119904119888119900119903119890

119887then

push matching results of 119879119860minus119886 119879119861minus119887

with 119889 into 119902else if 119904119888119900119903119890

119886gt= 119904119888119900119903119890

119887then

push matching results of 119879119860minus119886 119879119887with 119889 into 119902

elsepush matching results of 119879

119886 119879119861minus119887

with 119889 into 119902end if

end ifend whilereturn 119879

119860119861

Algorithm 3 BuildNewOnto(cohesion matrix 119879119860 119879119861)

Criterion (1) For cases (2) and (3) the root of 1198791199102(or the root

of 1198791199101) will be the descendant of the root of 119879

1199101(or the

root of 1198791199102) in the integrated tree 119879

1199091199101 1199102 a contradiction to

Criterion (1) For integration involving more than two treesfrom 119879

119861minus119903119887 we can still follow the above procedure to reach

contradictions Thus the claim is provenWithout loss of generality we can see that for any

tree 119879119910isin 119879119861minus119903119887

it can be integrated with no more thanone tree in 119879

119860minus119903119886 Therefore the integration between 119879

119860minus119903119886

and 119879119861minus119903119887

corresponds to a matching in a weighted bipar-tite graph in which two sets of nodes represent treesfrom 119879

119860minus119903119886and 119879

119861minus119903119887 respectively and edges represent

corresponding cohesion scores According to Criterion (2)119892(119879119860minus119903119886 119879119861minus119903119887) = max(sum

(119879119909 119879119910)isin119877(119892(119879119909 119879119910))) and we con-

clude that 119892(119879119860minus119903119886 119879119861minus119903119887) corresponds to the weight of amax-

imum weighted matching in the above bipartite graph

Given Lemma 2 we can see that the following corollary iscorrect

Corollary 3 Define MaxMatch(119883 119884) = sum(119879119909 119879119910)isin119877

119892(119879119909 119879119910)

where 119877 is a maximum weighted matching of trees in forests119883 and 119884 given 119892(119879

119909 119879119910) for any tree pair (119879

119909 119879119910) isin 119883 times 119884

One concludes that for any two forests 119879119860and 119879

119861 119892(119879119860 119879119861) =

MaxMatch(119879119860 119879119861)

With Lemma 1 and Corollary 3 we are able to designan efficient dynamic programming algorithm achieving theglobal optimum for the ontology integration problem Thepseudocode for calculating the maximum cohesion score isdescribed in Algorithm 2 which visits ontology vertices inreverse topological order when filling up the cohesionmatrix

At the end of Algorithm 2 the cohesion matrix is filledup with optimal cohesion scores and the maximum cohesionscore is saved at entry (0 0) as described byTheorem 4

Theorem 4 It holds that cohesion matrix(119894 119895) = 119892(119879119860minus119894

119879119861minus119895) and cohesion matrix(0 0) = 119892(119879

119860 119879119861)

Proof We will prove this theorem by mathematical induc-tion

Let |119881(119879119860)| = 119899 and |119881(119879

119861)| = 119898 it is easy to see

that cohesion matrix(119899119898) is optimal because 119899 and 119898correspond to leaf nodes in the topological order Thus119879119860minus119899

and 119879119861minus119898

are empty sets and cohesion matrix(119899119898) =119872119860119861(119899119898) = 119892(119879

119899 119879119898)

When 119894 = 119899 and 119895 lt 119898 according to Lemma 1 to integrate119879119899with 119879

119895 either 119879

119899(the leaf vertex) is merged with 119879

119895or 119879119899

is integrated with 119879119861minus119895

The maximum score of integrating119879119899with 119879

119861minus119895is available at the time cohesion matrix(119899 119895) is

being calculated because of the reverse topological visitThusaccording to both Lemma 1 and Corollary 3 we concludethat 119892(119879

119899 119879119895) = max(119872

119860119861(119899 119895)MaxMatch(119879

119899 119879119861minus119895)) =

cohesion matrix(119899 119895) Similarly we can conclude that119892(119879119894 119879119898) = max(119872

119860119861(119894 119898)MaxMatch(119879

119860minus119894 119879119898)) =

cohesion matrix(119894 119898)When 119894 lt 119899 and 119895 lt 119898 again according to

Lemma 1 and Corollary 3 we have 119892(119879119894 119879119895) = max(score

score119886 score

119887) = cohesion matrix(119894 119895) Due to the reserve

topological visit MaxMatch(119879119860minus119894 119879119861minus119895) MaxMatch(119879

119860minus119894

119879119895) and MaxMatch(119879

119894 119879119861minus119895) are available at the time

cohesion matrix(119894 119895) is being calculated

Given the definition of 119892(119879119860 119879119861) Theorem 4 in fact

proves that the proposed approach achieves the global opti-mum However the global optimal solution is built upon

BioMed Research International 7

themaximumweightedmatching (recall Corollary 3) As dis-cussed in Section 213 the maximum weighted matching istime-consuming and therefore we propose an approximationsolution in that section

Although Algorithm 2 builds the cohesion matrix withoptimal cohesion scores it does not construct the integratedontological knowledge structure We may save the ontologyintegration details along with the cohesion scores Howeverthat will cost 119874(1198993) memory space (assuming each ontologyhas 119874(119899) vertices) and significantly reduce the capacity ofthe algorithm in handling large ontology integrations Quiteinterestingly we find that it is not necessary to save the inte-gration details in order to construct the integrated ontologyThe construction can be done by a process reverse to theconstruction of cohesionmatrix as described inAlgorithm 3

Algorithm 3 uses the cohesion matrix constructed byAlgorithm 2 and builds the integrated ontology tree still byfollowing Lemma 1 and Corollary 3 but in a reverse way ofAlgorithm 2 The construction is performed in a Breadth-First fashion which uses a queue 119902 to maintain triples Eachtriple (119886 119887 119888) is an association of three elements 119886 is thematched vertex from ontology 119879

119860 119887 is the matched vertex

from ontology 119879119861 and 119888 is their parent on the merged

ontology 119879119860119861 By following the basic idea of the proof of

Theorem 4 we can show that Algorithm 3 builds an optimalintegrated ontology with the cohesion matrix provided fromAlgorithm 2 We omit the proof for succinctness

213 Time Complexity Analysis and an Approximation Solu-tion Assume an ontology size is 119874(119899) The cohesion matrixhas 119874(1198992) entries to fill up The computation for each entryis a matching whose time complexity depends on the imple-mentation The maximum weighted matching takes 119874(1198993)using the famous Hungarian algorithm [21] and althoughit achieves optimum it is too costly for large ontologiesThe maximal weighted matching however takes 119874(1198992 log 119899)time and more importantly results in an overall 119874(1198992 log 119899)time complexity for Algorithm 2 The analysis is given inthe following For each matching the algorithm will accesspreviously filled entries and each entry will be accessed onlyonce and be involved only once in a sorting of 119874(log 119899)time This is because each entry corresponds to two verticeswhose cohesion score will be accessed when calculating thecohesion score of their parents Thus we conclude that thetotal time complexity of calculating the cohesionmatrix usingmaximal weightedmatching is119874(1198992+1198992 log 119899) = 119874(1198992 log 119899)Since building the new ontology has the same time com-plexity as building the cohesion matrix this is also the totaltime complexity for integrating two ontologies by maximalweightedmatchingThemaximal weightedmatching also hasa guaranteed lower bound on the results It achieves a (12)-approximation solution (ie the overall cohesion score willbe at least 12 of the optimal cohesion score) as pointed out in[22]

Since themaximalweightedmatching results in an overallgood performance on time complexity and approximationrate we used the maximal weighted matching in our empir-ical study for Algorithms 2 and 3 Readers may also choose

othermatching algorithms (such as the one described in [22])to achieve slightly better approximation rates However theweightedmatching is a replaceablemodule for our algorithmsand it is not the focus of this work to build a fast and close-to-optimal weighted matching algorithm

Compared to the time complexity of integrating ontolo-gies by the dynamic programming approach as describedin Algorithms 2 and 3 the heuristic approach described atthe beginning of this section also has 119874(1198992) in the worstcase However we conjecture that the heuristic approach hasa much smaller average time complexity because in eachstep the heuristic approach may exclude a large number ofmatching opportunities

22 Integrating Multiple Ontologies In the previous sectionwe proposedmethods for integrating two ontologies In somebiomedical applications [23 24] we are interested in theassociations involving more than two objects Integrationof multiple ontologies of these objects will generate aninnovative view on these complex relationships Similar tothe basic problem formulation we can formulate themultipleontology integration as follows

Given 119896 ontology trees 1198791 1198792 119879

119896and a closeness

matrix119872119894119895for any two trees 119879

119894and 119879

119895 how can we efficiently

generate an integrated ontology tree 11987912119896

meeting thefollowing criteria

(1) For any two vertices 119909 and 119910 in a tree 119879119894 their

lowest common ancestor LCA119879119894(119909 119910) is contained by

LCA11987912119896(119909 119910)

(2) It holds that argmax11987912119896119891(11987912119896

) =

sum119883isin119881(119879119860119861)

sum119906isin119883Visin119883120590(119906)lt120590(V)119872120590(119906)120590(V)(119906 V) Here

119872120590(119906)120590(V)(119906 V) is the entry value in the closeness

matrix for two vertices 119906 and V (one from tree 119879120590(119906)

and the other from tree 119879120590(V)) contained in the node

119883 For a vertex V from an original ontology 120590(V) isdefined as its original ontology ID

Again we name the function 11989111987912119896

the cohesion of theintegrated ontology 119879

12119896 For each node 119883 in the inte-

grated ontology we define its weight as 119908119890119894119892ℎ119905(119883) =sum119906isin119883Visin119883120590(119906)lt120590(V)119872120590(119906)120590(V)(119906 V) Correspondingly we define

function119892(1198791 1198792 119879

119896) = max

11987912119896(sum119883isin119881(11987912119896)

119908119890119894119892ℎ119905(119883))

as the maximum cohesion function for integrating theontologies 119879

1 1198792 119879

119896 As we can see in the above formula-

tion the overall cohesion score of integration is the summedweight of each node which is the sum of pairwise closenessscores

The formulation of multiple ontology integration is sim-ilar to the basic version and it is not difficult to showthat optimal structures described in Lemmas 1 and 2 canbe extended to a high dimension However the extensionof algorithms described in Section 212 for integrating twoontologies is not feasible for solving the multiple ontologyintegrationThis is because if we need to extend Algorithms 2and 3 to this problem we need to build a cohesionmatrix of 119896dimensions It implies that we need at least 119874(119899119896) operationsto fill up the score matrix assuming the size of an ontology

8 BioMed Research International

A B C D

A 10 6 8

B 10 7 5

C 6 7 9

D 8 5 9

AB C D

AB

C 9

D 9

lowastlowast

lowast

lowast mdash

mdash

mdash

mdash

mdash

mdash

mdash

(lowast) Entries to update

Figure 4 An example of the InterOntology matrixrsquos change at the first iteration in Algorithm 4 for integrating four ontologies

build the 119896 times 119896 InterOntology matrixfor 119894 = 1 to 119896 minus 1 doidentify the active tree pair ⟨119879

119883 119879119884⟩ that corresponds

to the highest score in the InterOntology matrixintegrate 119879

119883and 119879

119884into 119879

119883119884

mark 119879119883and 119879

119884as inactive

update relationship matricesupdate InterOntology matrix

end forreturn 119879

12119899

Algorithm 4 GreedyMultiInt(T = 1198791 1198792 119879

119896)

is 119874(119899) This is clearly not acceptable for high dimensionalontology integration

221 Greedy Approach From the above discussion we cansee that direct extension of Algorithms 2 and 3 for integratingtwo ontologies is practically not feasible for integrating alarge number of ontologies However we can still use thesealgorithms for integrating multiple ontologies by iterativelyintegrating two ontologies and generating a new closenessmatrix Given the ontologies 119879

1 1198792 119879

119896 we can first

integrate 1198791and 119879

2into 119879

12and then build the closeness

matrix between 11987912

and 1198793using the relationship matrices

between 1198791and 119879

2and between 119879

1and 119879

3 Specifically

assume 119883 is a node on the integrated ontology 11987912 and 119883

contains a vertex 119886 from 1198791and a vertex 119887 from 119879

2 Then

the entry (119883 119888) of the closeness matrix between 11987912

and 1198793

is 11987211987912 1198793(119883 119888) = 119872

1198791 1198793(119886 119888) + 119872

1198792 1198793(119887 119888) After the new

closeness matrix is generated we can continue integrating11987912

and 1198793into 119879

123and generating another new closeness

matrix We will eventually get the integrated ontology 11987912119896

by repeating the above process To facilitate the followingdiscussions we name the above approach the basic multipleintegration approach

Although the basic multiple integration approach canfinish integrating multiple ontologies it blindly integratesontologies without using any cohesion information betweenontologies that may lead to a better integration resultTo improve the basic multiple integration we propose agreedy approach that uses the cohesion information betweenontologies to guide the integration The basic steps of thegreedy approach are outlined in Algorithm 4 To facilitate

the understanding of Algorithm 4 we use Figure 4 toillustrate an example of the InterOntology matrixrsquos change atthe first iteration of integrating four ontologies A B C andD

The key idea in Algorithm 4 is to maintain an InterOn-tology matrix which guides the integration Initially thismatrix is filled with the overall cohesion score of everypair of ontologies In each step this matrix is updated withoverall cohesion scores between newly integrated ontologyand existing active ontologies The integration will take placebetween two active ontologies which have the highest scorein the InterOntology matrix

When we used the overall cohesion score between twoontologies to update the InterOntology matrix we observedan interesting phenomenon that the integration inmost casesis a process continuously expanding an integrated ontologyConsequently the greedy approach is likely to yield a resultsimilar to the basic approach

This phenomenon can be explained by the definition ofmaximum cohesion function which takes into account allpairwise closeness between merged terms Thus the moreontologies contained in an integrated ontology are the morelikely it will have high overall cohesion scores with otherontologies As a result it creates unfairness for the integrationselection To fix this issue we use the adjusted overall cohe-sion scores in updating the InterOntology matrix as follows

Given an ontology 119879119883and an ontology 119879

119884where 119883 and

119884 are nonempty sets of ontology IDs we define the adjustedcohesion score between 119879

119883and 119879

119884as 119860119889119895119862119900ℎ(119879

119883 119879119884) =

sum119885isin119881(119879119883119884)

sum119909isin119885119910isin119885120590(119909)isin119883120590(119910)isin119884

119872120590(119909)120590(119910)

(119909 119910)|119883||119884| where119879119883119884

is the integrated ontology built by Algorithms 2 and 3The adjusted cohesion score is in fact the weight increase byintegrating 119879

119883and 119879

119884 divided by the size of 119883 times the

size of 119884 that is119860119889119895119862119900ℎ(119879119883 119879119884) = sum

119885isin119881(119879119883119884)(119908119890119894119892ℎ119905(119885))minus

sum119883isin119881(119879119883)

(119908119890119894119892ℎ119905(119883)) minus sum119884isin119881(119879119884)

119908119890119894119892ℎ119905(119884)|119883||119884| For eachnode merging closeness scores will be added to the totalweight when the merging takes place between vertices fromontology set 119883 and vertices from ontology set 119884 Thus theweight increase by integrating 119879

119883and 119879

119884is proportional to

the number of ontologies in119883 times the number of ontologiesin 119884 and consequently we averaged the weight increase by|119883||119884|

222 Fast Approximation Algorithm Although the basicmultiple integration and the greedy multiple integrationapproaches discussed above are able to integrate multipleontologies none of them provide any guarantee on the results

BioMed Research International 9

A

A

7

10 8

9

6

5

9

D

C

B D

C

B

ge 7

ge8

Figure 5 An illustration of vertexedge contraction and weight updates in an iteration of Algorithm 5 Each vertex represents an ontology

for 119894 = 1 to 119896 minus 1 dofor 119895 = 119894 + 1 to 119896 dopush ⟨119892(119879

119894 119879119895) 119894 119895⟩ into a set 119878 ordered by the first element in descending order

end forend forwhile |119878| gt 0 do⟨119911 119909 119910⟩ = 119901119900119901(119878)if adding (119909 119910) does not form a cycle inG thenadding (119909 119910) toGintegrate 119879

119862(119909)and 119879

119862(119910)into 119879

119862(119909)cup119862(119910) 119862(V) is a set of vertices including V that form a connected component inG

end ifend whilereturn 119879

12119899

Algorithm 5 FastMultiInt(T = 1198791 1198792 119879

119896)

in comparison with the optimal solutions By studyingthe maximum cohesion scores between ontologies undera graph setting we identified an approximation structureand developed an approximation algorithm for integratingmultiple ontologiesWename it fast approximation algorithmbecause it not only has a lower bound on the results but alsoruns faster than the greedy multiple integration algorithmproposed above

The fast approximation algorithm for integratingmultipleontologies is sketched in Algorithm 5 It only calculates themaximum cohesion score between every pair of ontologiesonce during the initial stage and uses this informationthroughout the integration process even after it becomesstaleMore importantly this approach not only saves the timefor recalculating the maximum cohesion scores but also pro-vides a lower bound guarantee as stated inTheorem 5 whosecorrectness is built on two important lemmas (Lemmas 6 and7) which will be described subsequently

Theorem 5 The tree weight of the integrated tree 11987912119896

obtained by FastMultiInt algorithm is at least 1(119896 minus 1) theweight of the optimal solution

Proof We will use Lemmas 6 and 7 to prove this theoremThe proofs of Lemmas 6 and 7 are provided after the proofof this theorem To facilitate the proof of this theorem we

build a fully connected weighted graph G in which eachnode corresponds to a tree for integration and the weightof each edge corresponds to the weight increase (initiallythis is the cohesion score) for integrating the correspondingtrees According to Lemma 6 119892(119879

1 1198792 119879

119896) (ie optimal

cohesion score) is no more than the summed weight of edgesinG (Claim 1)

Given G the integration by Algorithm 5 is a processof 119896 minus 1 node contractions After each contraction theadjacent edge weights (cohesion scores) will be updatedaccordingly According to Lemma 7 the weight of an updatededge will only increase over (or at least remain the same as)the maximum weight of the two contracted edges (Figure 5provides an illustration of vertexedge contraction andweightupdates) Thus the overall cohesion score of the integrationby Algorithm 5 is no less than the weight of the maximumspanning tree ofG (Claim 2)

It is easy to see that the weight of a maximum spanningtree is no less than 1(119896minus1) of the summed edge weight ofGgiven the simple observation that each edge in G is either anedge of the maximum spanning tree or adjacent to an edge ofthe maximum spanning tree with an equal or heavier weight(Claim 3)

Combining Claims 1 2 and 3 we complete the proof ofthis theorem

Lemma 6 It holds that 119892(1198791 1198792 119879

119896) le sum1le119894lt119895le119896

119892(119879119894 119879119895)

10 BioMed Research International

Proof According to the problem definition

119892 (1198791 1198792 119879

119896)

= max( sum

119883isin119881(11987912119896)

119908119890119894119892ℎ119905 (119883))

= max( sum

119883isin119881(11987912119896)

sum

119906isin119883Visin119883120590(119906)lt120590(V)119872120590(119906)120590(V) (119906 V))

= ( sum

1le119894lt119895le119896

11989111987912119896

(119879119894119895)) le sum

1le119894lt119895le119896

119892 (119879119894 119879119895)

(3)

11989111987912119896(119879119894119895) is the cohesion score of 119879

119894119895whose integration is

induced from 11987912119896

Lemma 7 It holds that 119892(119879119875119876 119879119878) minus 119891(119879

119875119876) ge max(119892(119879

119875

119879119878) 119892(119879119876 119879119878))

Proof According to the problem definition integrating 119879119875119876

with 119879119878will result in an integrated tree 119879

119880where 119880 = 119875 cup

119876 cup 119878 and 119892(119879119875119876 119879119878) = max

119879119880(sum119883isin119881(119879119880)

119908119890119894119892ℎ119905(119883)) =

max119879119880(119891(119879119875119878) + 119891(119879

119876119878) + 119891(119879

119875119876)) where 119879

119875119878 119879119876119878 and

119879119875119876

are induced from 119879119880 Since 119879

119875119876has been determined

we have 119892(119879119875119876 119879119878) = max

119879119880(119891(119879119875119878) + 119891(119879

119876119878)) + 119891(119879

119875119876)

Without loss of generally let us assume max(119891(119879119875119878)) ge

max(119891(119879119876119880)) Then by restricting the integration between

119879119875and 119879

119878in 119879119880following the integration that leads to

argmax119879119875119878119891(119879119875119878) we will get a cohesion score no less than

119892(119879119875 119879119878) Thus we complete the proof for 119892(119879

119875119876 119879119878) minus

119891(119879119875119876) ge max(119892(119879

119875 119879119878) 119892(119879119876 119879119878))

223 Time Complexity Analysis For the fast approximationalgorithm (Algorithm 5) the time complexity for generatinggraph G (calculating the overall cohesion score for everypair of ontologies) is119874(11989621198992 log 119899) assumingwe usemaximalweighted matching Each integration will take 119874(1198992 log 119899)with an update of at most 119896 closeness matrices which takes119874(119896lowast119899

2) There are at most 119896 integrations therefore the total

time complexity is still 119874(11989621198992 log 119899)Following the above analysis we conclude that the time

complexity of the greedy multiple integration algorithm isthe same as the fast approximation algorithm However itrequires updating of InterOntology matrix which takes anexcessive 119874(11989621198992 log 119899) time The empirical study shows thatthe fast approximation algorithm is much faster than thegreedy multiple integration algorithm

Finally it is easy to see that the basic multiple integrationapproach takes 119874(1198961198992 log 119899) time and is the fastest but itsoverall cohesion scores are the worst as we will see in theempirical study

224 Limitations Integrating multiple ontologies may facetwo potential problems in real applications First how canwe efficiently generate a closeness matrix for every pair ofontologies to be integrated Our current method kDLS or

onGrid is efficient for generating the closenessmatrix for onepair of ontologies in most cases but not efficient enough forgenerating closeness matrices for many pairs of ontologiesSecond not every pair of ontologies can be meaningfullyintegrated It remains a problem to efficiently identify thefeasibility of integrating a pair of ontologies Therefore themain purpose of Section 22 is to demonstrate that ourproposed approach can be extended to integrate multipleontologies and we use synesthetic datasets in Section 33 tostudy the performance of algorithms proposed in Section 22

3 Results and Discussion

We would like to study the performances of the proposedontology integration methods by experiments on both realand synthetic datasets We implemented five approaches inC++

(1) Heuristic heuristic approach for integrating twoontologies as described in Section 211

(2) Approximate approximate approach (Algorithms 2and 3) with maximal weighted matching for guaran-teeing the (12)-approximation rate

(3) Basic basic multiple integration approach as de-scribed in Section 221

(4) Greedy greedy multiple integration approach(Algorithm 4)

(5) FastApproximate fast approximation multipleintegration approach (Algorithm 5)

In the following we report our study on the performancesof (1) and (2) for integrating two ontologies on real datasetsand (3) (4) and (5) for integrating multiple ontologies onsynthetic datasets All the experiments are carried out on aLinux cluster with 24GHz AMD Opteron processors

31 Integrating a Pair of Ontologies The knowledge of drug-gene relationships is desirable in many pharmacology appli-cations [25 26] By integrating the gene ontology and thedrug ontology we will be able to obtain rich information onthe associations between drugs and genes under the ontologystructures Thus in this set of experiments we simulate realworld knowledge discovery applications by integrating tworeal ontologies gene ontology (GO) and National Drug FileReference Terminology (NDFRT) Both were obtained fromthe Unified Medical Language System (version 2012AA)The closeness matrices between GO terms and drug termswere generated using onGrid [27] with a 4-neighborhoodbroadcast range (ie 119896 = 4 with regard to [7]) onGridfollows the kDLS approach [7] and measures the closenessbetween two concepts based on the discovered paths (withlength greater than one) between them However unlikekDLS onGrid takes into consideration of concept semantictypes in the closeness measurement In the study performedin [27] the advantages of onGrid over kDLS are wellillustrated

The overall cohesion scores of Heuristic andApproximate on integrating GO and NDFRT are listed

BioMed Research International 11

Table 1 Cohesion scores of integrating real datasets

Depth GO term number NDFRT term number Cohesion scoresHeuristic 120573 = 1 Heuristic 120573 = 6 Heuristic 120573 = 100 Approximate

3 66 6004 00505331 0229958 021999 1246964 710 6972 0290392 0284363 146585 938355 5355 14582 0285923 137714 0528289 3390566 16231 32841 0307941 0341588 0673406 740293

Table 2 Top 5 matched terms by the Approximate algorithm (depth = 6)

Rank GO terms NDFRT terms Closeness score1 C1135918 smooth muscle contractile fiber C0282606 muscle neoplasms 1632052 C0010813 cytokinesis C0086376 GTP-binding proteins 1159673 C0027747 axon terminus C0030584 parovarian cyst 1133524 C1155065 T cell activation C0007082 carcinoembryonic antigen 1008795 C0007595 cell growth C0294028 BRCA2 protein 0945284

Table 3 Top 5 matched terms by the Heuristic algorithm (depth = 6 120573 = 100)

Rank GO terms NDFRT terms Closeness score1 C0031845 biological process C0042890 VITAMINS 02448622 C1166607 cellular component C1657248 apoptosome 01428573 C0027540 tissue death C0065932 MENADIOL 004342194 C0025519 metabolic process C0042849 VITAMIN B 003702485 C0030012 oxidation-reduction process C0027996 NICOTINIC ACID 00327109

in Table 1 To observe the integration over the ontologysize change in each experiment we use the ontology treestructure from the root to the specified depth (first columnin Table 1) for integration The sizes of the ontology termsinvolved in the integration are listed in the second and thirdcolumns of Table 1

Recall in Section 211 120573rdepth(119887) is used to regulate theselection of vertices from high depths Thus we tested theHeuristic under 120573 = 1 (depth information is nullified)and 120573 = 100 (the vertex depth plays a critical role inthe selection) Since nonleaf vertices in these datasets havearound 6 children on average we heuristically add a set ofexperiments by setting 120573 = 6 so that 120573rdepth(119887) will be close tothe number of vertices excluded from the future integration

From Table 1 we can see that Heuristic performs betterwhen using the depth information to regulate the selectionof vertices However Approximate is much better thanHeuristic at all settings Compared to the best cohesionscores of Heuristic in each row of Table 1 Approximateconstructs an integrated ontology with the overall cohe-sion score ranging from 54 times to 1099 times that ofHeuristicThis clearly demonstrates the effectiveness of theproposed Approximate approach Nevertheless the heuris-tic approach has a much faster average running time as aresult of excluding a large number of matching opportunitiesin each step

Although the running time of Approximate is longerthan Heuristic it takes less than two hours to finishintegrating two ontologies with about 16119896 and 33119896 vertices

Most of the biomedical ontologies are smaller than or similarto these sizes and Approximate approach will benefit theassociation study of these ontologies For extremely largeontology pairs in which Approximate is unable to finish theintegrationwithin a reasonable time Heuristicmay providea quick view on their integration

32 Understanding the Merged Ontology Terms To under-stand what terms are merged in integrating real ontologieswe use the integration of GO and NDFRT at depth 6 asan example Tables 2 and 3 list the top 5 pairs of mergedterms (sorted by their closeness scores) by Heuristicand Approximate respectively As mentioned abovethese scores are from the closeness matrix generatedby onGrid based on the discovered paths betweenthem For example ldquoC1155065T-Cell Activation minus minusis physiologic effect of chemical or drug minus minus gt C0393002Carcinoembryonic Antigen Peptide 1 minus minus has target minus minus gtC0007082Carcinoembryonic Antigenrdquo is such a path

From Tables 2 and 3 we can observe that theApproximate algorithm merges terms with much highersimilar scores than the Heuristic algorithm Quiteinterestingly we observed that the top ranked merging inTable 3 is between ldquobiological processrdquo and ldquoVITAMINSrdquoThe ldquobiological processrdquo is an abstract term which is veryclose to the root of the GO ontology Such a fact suggests thatthe top level terms will likely preempt the merging choicesover their descendants As a result of this greedy approach

12 BioMed Research International

0

2000

4000

6000

8000

10000

12000

14000

16000

18000

100 200 300 400 500 600 700 800 900 1000

Ove

rall

cohe

sion

scor

e of i

nteg

ratio

n

Individual ontology size

FastApproximateBasic

Greedy

Figure 6 The change of overall cohesion score over the increase ofthe size of each ontology The number of ontologies is fixed at 10

0

100000

200000

300000

400000

500000

600000

700000

800000

10 20 40 70 80 90 100

Ove

rall

cohe

sion

scor

e of i

nteg

ratio

n

Number of ontologies 30 50 60

FastApproximateBasic

Greedy

Figure 7 The change of overall cohesion score over the increase ofthe number of ontologies The size of each ontology is fixed at 100

the Heuristic algorithm will end at a local optimum whichis far from being optimal

A snapshot of ontology integration by Approximate asshown in Figure 10 provides a good insight on the algorithmwork In each bracket of two merged terms the left partis the closeness score and the right part is the cohesionscore We can observe that most closeness scores are zero orclose to zero while the corresponding cohesion scores aremuch higher This is understandable because the snapshot isprimarily on the top level terms of both ontologies For theseterms they have a large number of subclass (descendant)terms and optimizing the integration of their subclass termsfar outweighs integrating of themselves The result of such

0

200

400

600

800

1000

1200

1400

1600

100 200 300 400 500 600 700 800 900 1000

Inte

grat

ion

time (

s)

Individual ontology size

FastApproximateBasic

Greedy

Figure 8 The change of integration time over the increase of thesize of each ontology The number of ontologies is fixed at 10

0

500

1000

1500

2000

2500In

tegr

atio

n tim

e (s)

10 20 40 70 80 90 100Number of ontologies

30 50 60

FastApproximateBasic

Greedy

Figure 9 The change of integration time over the increase of thenumber of ontologies The size of each ontology is fixed at 100

integration provides novel knowledge of association betweenontology terms That is even if two terms are not that closeaccording to some closeness measurement they can be struc-turally associated under their ontology context For examplethe GO term ldquobiological processrdquo is merged with the NDFRTterm ldquochemical ingredientsrdquo even their closeness score iszero from the onGrid output However such integrationis interesting because it shows that the merging is tryingto link the chemical compounds with the biologicalcellularprocesses so that corresponding associations between thecellular processes and chemical structures can be establishedThis demonstrates the purpose of integrating two ontologiesthat is identifying associations with respect to both termsimilarities and structural contexts

BioMed Research International 13

component

Biologicalprocess Cellular

SignalingNeoplasms

National Drug File- Reference Terminology

Chemical ingredients

Diseases manifestations or physiologic states

CarbohydratesResponse to

stimulusOrganic chemicals

Cell part Extracellular region part

Skin and connective

tissue diseases

[0 7403]

[0 4833] [0 2315]

[00042 262] [00016 215][014 899] [0 307]

Amino acids peptides and

proteins

Cellular process Proteins

Cell adhesion involved in heart morphogenesis

Zyxin

Synapse Nervous system diseases

Asymmetric synapse Brain

neoplasmsSymmetric

synapse Granulomatousdisease chronic

[0036 2009]

[0023 0023]

[0034 0034] [0023 0023] [0006 0006]

[0098 062]

ontology

middot middot middot

middot middot middot

middot middot middot middot middot middot middot middot middot

middot middot middot middot middot middot middot middot middot

middot middot middotmiddot middot middot

Gene

NeuronalRNA

granule

Traumanervoussystem

Figure 10 A snapshot of ontology integration between GO and NDFRT by Approximate (depth = 6) Blue nodes are terms from GO andred nodes are terms from NDFRT Solid lines are edges and dashed lines are paths consisting of 1 or more edges In each bracket the left partis the closeness score and the right part is the cohesion score

In fact there are multiple studies to justify the struc-tural associations seen in Figure 10 such as the associationbetween ldquosignalingrdquo and ldquocarbohydratesrdquo [28] and the asso-ciation between ldquoextracellular region partrdquo and ldquoskin andconnective tissue diseasesrdquo [29]

In addition we have noticed a number of meaningfulintegrations betweenGO terms and neurological terms in theNDFRT For example synapse is a brain related structure andthe term ldquosymmetric synapserdquo is associated with ldquotraumardquoand the term ldquoasymmetric synapserdquo is associated with ldquobrainneoplasmsrdquo Similarly it is reasonable to see that ldquoneuronalRNA granulerdquo is integrated with ldquogranulomatous diseaserdquo agranule associated disease As another example it is veryinteresting to notice that ldquozyxinrdquo is associated with ldquocelladhesion involved in heart morphogenesisrdquo and that providesa link with the formation of heart

The above observations suggest a novel way of using ourontology integration method to perform association studiesbetween biomedical concepts

33 Integrating Multiple Ontologies In the following exper-iments we will study the performances of Basic Greedyand FastApproximate in integrating multiple ontologiesAll the three approaches are built upon Approximate whichperforms very well in the previous study for integrating tworeal ontology datasets

We use two sets of synthetic datasets in this study In thefirst set of datasets we fix the number of ontologies to be 10and vary the size of each ontology from 100 to 1000 In thesecond set of datasets we fix the size of each ontology to be100 and vary the number of ontologies from 10 to 100 All theontologies are randomly generated by constructing aminimalspanning tree from a randommatrixThe relationshipmatrixbetween every pair of ontologies is also randomly generatedwith entry values ranging from 0 to 1 For each experiment

we generate 10 randomdatasets and the results reported in thefollowing are the average results over the 10 random datasets

The overall cohesion scores of the three approachesover different ontology sizes and over different numbers ofontologies were reported in Figures 6 and 7 respectivelyFastApproximate outperforms all the other approachesin Figure 6 which is consistent with the analysis of itsapproximation rate However Greedy slightly outperformsFastApproximate in Figure 7 especially when the ontologynumber is large This is understandable because when thenumber of ontology (119896) increases the approximation rate (asstated in Theorem 5) decreases and becomes less significantThis result also justifies the choice of adjusted cohesion scorefor Greedy as described at the end of Section 221

The integration time of the three approaches over differ-ent ontology sizes and over different numbers of ontologieswas reported in Figures 8 and 9 respectively These figuresare consistent with the time complexity analysis given inSection 223 In particular we noticed that the integrationtime of Greedy deteriorates sharply over the ontologynumber increase In contrast FastApproximate is muchmore scalable and has a time curve similar to Basic

These results suggest that FastApproximate has the bestoverall performance in integrating multiple ontologies

4 Conclusions

In this work we started with a basic problem on integrating apair of ontology tree structures with a given closeness matrixand later we advanced the basic problem to the problem ofintegrating large number of ontologies We proved optimalstructures in the basic problem and developed both optimaland efficient approximation solutions Although the multipleontology integration problem has similar optimal structuresit is not feasible to extend the optimal and efficient approx-imation solutions for the basic problem to efficiently handle

14 BioMed Research International

multiple ontology integration To tackle the challenge of inte-grating a large number of ontologies we developed both aneffective greedy approach and a fast approximation approachThe empirical study not only confirms our analysis on theefficiency of the proposedmethod but also demonstrates thatourmethod can be used effectively for biomedical associationstudies

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

References

[1] D L McGuinness F van Harmelen andM K Smith Owl webontology language overview W3C recommendation 10(2004-03)10 2004

[2] O Bodenreider ldquoThe unifiedmedical language system (UMLS)integrating biomedical terminologyrdquo Nucleic Acids Researchvol 32 supplement 1 pp D267ndashD270 2004

[3] X Wu R Jiang M Q Zhang and S Li ldquoNetwork-based globalinference of human disease genesrdquo Molecular Systems Biologyvol 4 no 1 2008

[4] B Linghu E S Snitkin Z Hu Y Xia and C DeLisi ldquoGenome-wide prioritization of disease genes and identification ofdisease-disease associations from an integrated human func-tional linkage networkrdquo Genome Biology vol 10 no 9 articleR91 2009

[5] D Botstein and N Risch ldquoDiscovering genotypes underlyinghumanphenotypes past successes formendelian disease futureapproaches for complex diseaserdquo Nature Genetics vol 33 pp228ndash237 2003

[6] H S Pinto and J P Martins ldquoA methodology for ontologyintegrationrdquo in Proceedings of the 1st International Conferenceon Knowledge Capture (K-CAP rsquo01) pp 131ndash138 ACM 2001

[7] Y Xiang K Lu S L James T B Borlawsky K Huang and P RO Payne ldquoK-Neighborhood decentralization a comprehensivesolution to index the UMLS for large scale knowledge discov-eryrdquo Journal of Biomedical Informatics vol 45 no 2 pp 323ndash336 2012

[8] J Dutkowski M Kramer M A Surma et al ldquoA gene ontologyinferred from molecular networksrdquo Nature Biotechnology vol31 no 1 pp 38ndash45 2013

[9] R Navigli P Velardi and A Gangemi ldquoOntology learningand its application to automated terminology translationrdquo IEEEIntelligent Systems vol 18 no 1 pp 22ndash31 2003

[10] J Jannink andGWiederhold ldquoThesaurus entry extraction froman on-line dictionaryrdquo in Proceedings of the Fusion ConferenceSunnyvale Calif USA 1999

[11] C Papatheodorou A Vassiliou and B Simon ldquoDiscovery ofontologies for learning resources using word-based clusteringrdquoin World Conference on Educational Multimedia Hypermediaand Telecommunications pp 1523ndash1528 2002

[12] D L Rubin M Hewett D E Oliver T E Klein and RB Altman ldquoAutomating data acquisition into ontologies frompharmacogenetics relational data sources using declarativeobject definitions and xmlrdquoPacific SymposiumonBiocomputingpp 88ndash99 2001

[13] N F Noy ldquoSemantic integration a survey of ontology-basedapproachesrdquo ACM SIGMOD Record vol 33 no 4 pp 65ndash702004

[14] S Abels L Haak and A Hahn ldquoIdentification of commonmethods used for ontology integration tasksrdquo in Proceedingsof the 1st International Workshop on Interoperability of Het-erogeneous Information Systems (IHIS rsquo05) pp 75ndash78 ACMNovember 2005

[15] A Gangemi D Pisanelli and G Steve ldquoOntology integrationexperiences withmedical terminologiesrdquo in Formal Ontology inInformation Systems vol 46 pp 98ndash94 IOS Press AmsterdamThe Netherlands 1998

[16] N F Noy and M A Musen ldquoSMART automated support forontology merging and alignmentrdquo in Proceedings of the 12thWorkshop on Knowledge Acquisition Modelling and Manage-ment (KAW 99) Banff Canada 1999

[17] A Doan J Madhavan P Domingos and A Halevy ldquoLearningtomap between ontologies on the semantic webrdquo in Proceedingsof the 11th International Conference onWorldWideWeb (WWWrsquo02) pp 662ndash673 ACM May 2002

[18] J Xie F Liu and S-U Guan ldquoTree-structure based ontologyintegrationrdquo Journal of Information Science vol 37 no 6 pp594ndash613 2011

[19] D Calvanese G De Giacomo andM Lenzerini ldquoA frameworkfor ontology integrationrdquo inTheEmerging SemanticWebSelectedPapers from the First Semantic Web Working Symposium pp201ndash214 2002

[20] O Udrea L Getoor and R J Miller ldquoLeveraging data andstructure in ontology integrationrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo07) pp 449ndash460 ACM June 2007

[21] H W Kuhn ldquoThe Hungarian method for the assignmentproblemrdquoNaval Research Logistics Quarterly vol 1-2 pp 83ndash971955

[22] D E Drake Vinkemeier and S Hougardy ldquoA linear-timeapproximation algorithm for weighted matchings in graphsrdquoACMTransactions on Algorithms vol 1 no 1 pp 107ndash122 2005

[23] A J J Wood W E Evans and H L McLeod ldquoPharmacoge-nomicsmdashdrug disposition drug targets and side effectsrdquo TheNew England Journal of Medicine vol 348 no 6 pp 538ndash5492003

[24] R B Altman ldquoPharmgkb a logical home for knowledge relatinggenotype to drug response phenotyperdquoNature Genetics vol 39no 4 article 426 2007

[25] R B Kim D OrsquoShea and G R Wilkinson ldquoInterindivid-ual variability of chlorzoxazone 6-hydroxylation in men andwomen and its relationship to CYP2E1 genetic polymorphismsrdquoClinical Pharmacology amp Therapeutics vol 57 no 6 pp 645ndash655 1995

[26] T N Ferraro and R J Buono ldquoThe relationship between thepharmacology of antiepileptic drugs and human gene variationan overviewrdquo Epilepsy and Behavior vol 7 no 1 pp 18ndash36 2005

[27] A Albin X Ji T B Borlawsky et al ldquoEnabling online studiesof conceptual relationships between medical terms developingan efficient web platformrdquo JMIR Medical Informatics vol 2 no2 article e23 2014

[28] S Chandrasekaran J W Dean III M S Giniger and M LTanzer ldquoLaminin carbohydrates are implicated in cell signal-ingrdquo Journal of Cellular Biochemistry vol 46 no 2 pp 115ndash1241991

[29] J Uitto and D Kouba ldquoCytokine modulation of extracellularmatrix gene expression relevance to fibrotic skin diseasesrdquoJournal of Dermatological Science vol 24 pp S60ndashS69 2000

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 7: Research Article Building Integrated Ontological Knowledge

BioMed Research International 7

themaximumweightedmatching (recall Corollary 3) As dis-cussed in Section 213 the maximum weighted matching istime-consuming and therefore we propose an approximationsolution in that section

Although Algorithm 2 builds the cohesion matrix withoptimal cohesion scores it does not construct the integratedontological knowledge structure We may save the ontologyintegration details along with the cohesion scores Howeverthat will cost 119874(1198993) memory space (assuming each ontologyhas 119874(119899) vertices) and significantly reduce the capacity ofthe algorithm in handling large ontology integrations Quiteinterestingly we find that it is not necessary to save the inte-gration details in order to construct the integrated ontologyThe construction can be done by a process reverse to theconstruction of cohesionmatrix as described inAlgorithm 3

Algorithm 3 uses the cohesion matrix constructed byAlgorithm 2 and builds the integrated ontology tree still byfollowing Lemma 1 and Corollary 3 but in a reverse way ofAlgorithm 2 The construction is performed in a Breadth-First fashion which uses a queue 119902 to maintain triples Eachtriple (119886 119887 119888) is an association of three elements 119886 is thematched vertex from ontology 119879

119860 119887 is the matched vertex

from ontology 119879119861 and 119888 is their parent on the merged

ontology 119879119860119861 By following the basic idea of the proof of

Theorem 4 we can show that Algorithm 3 builds an optimalintegrated ontology with the cohesion matrix provided fromAlgorithm 2 We omit the proof for succinctness

213 Time Complexity Analysis and an Approximation Solu-tion Assume an ontology size is 119874(119899) The cohesion matrixhas 119874(1198992) entries to fill up The computation for each entryis a matching whose time complexity depends on the imple-mentation The maximum weighted matching takes 119874(1198993)using the famous Hungarian algorithm [21] and althoughit achieves optimum it is too costly for large ontologiesThe maximal weighted matching however takes 119874(1198992 log 119899)time and more importantly results in an overall 119874(1198992 log 119899)time complexity for Algorithm 2 The analysis is given inthe following For each matching the algorithm will accesspreviously filled entries and each entry will be accessed onlyonce and be involved only once in a sorting of 119874(log 119899)time This is because each entry corresponds to two verticeswhose cohesion score will be accessed when calculating thecohesion score of their parents Thus we conclude that thetotal time complexity of calculating the cohesionmatrix usingmaximal weightedmatching is119874(1198992+1198992 log 119899) = 119874(1198992 log 119899)Since building the new ontology has the same time com-plexity as building the cohesion matrix this is also the totaltime complexity for integrating two ontologies by maximalweightedmatchingThemaximal weightedmatching also hasa guaranteed lower bound on the results It achieves a (12)-approximation solution (ie the overall cohesion score willbe at least 12 of the optimal cohesion score) as pointed out in[22]

Since themaximalweightedmatching results in an overallgood performance on time complexity and approximationrate we used the maximal weighted matching in our empir-ical study for Algorithms 2 and 3 Readers may also choose

othermatching algorithms (such as the one described in [22])to achieve slightly better approximation rates However theweightedmatching is a replaceablemodule for our algorithmsand it is not the focus of this work to build a fast and close-to-optimal weighted matching algorithm

Compared to the time complexity of integrating ontolo-gies by the dynamic programming approach as describedin Algorithms 2 and 3 the heuristic approach described atthe beginning of this section also has 119874(1198992) in the worstcase However we conjecture that the heuristic approach hasa much smaller average time complexity because in eachstep the heuristic approach may exclude a large number ofmatching opportunities

22 Integrating Multiple Ontologies In the previous sectionwe proposedmethods for integrating two ontologies In somebiomedical applications [23 24] we are interested in theassociations involving more than two objects Integrationof multiple ontologies of these objects will generate aninnovative view on these complex relationships Similar tothe basic problem formulation we can formulate themultipleontology integration as follows

Given 119896 ontology trees 1198791 1198792 119879

119896and a closeness

matrix119872119894119895for any two trees 119879

119894and 119879

119895 how can we efficiently

generate an integrated ontology tree 11987912119896

meeting thefollowing criteria

(1) For any two vertices 119909 and 119910 in a tree 119879119894 their

lowest common ancestor LCA119879119894(119909 119910) is contained by

LCA11987912119896(119909 119910)

(2) It holds that argmax11987912119896119891(11987912119896

) =

sum119883isin119881(119879119860119861)

sum119906isin119883Visin119883120590(119906)lt120590(V)119872120590(119906)120590(V)(119906 V) Here

119872120590(119906)120590(V)(119906 V) is the entry value in the closeness

matrix for two vertices 119906 and V (one from tree 119879120590(119906)

and the other from tree 119879120590(V)) contained in the node

119883 For a vertex V from an original ontology 120590(V) isdefined as its original ontology ID

Again we name the function 11989111987912119896

the cohesion of theintegrated ontology 119879

12119896 For each node 119883 in the inte-

grated ontology we define its weight as 119908119890119894119892ℎ119905(119883) =sum119906isin119883Visin119883120590(119906)lt120590(V)119872120590(119906)120590(V)(119906 V) Correspondingly we define

function119892(1198791 1198792 119879

119896) = max

11987912119896(sum119883isin119881(11987912119896)

119908119890119894119892ℎ119905(119883))

as the maximum cohesion function for integrating theontologies 119879

1 1198792 119879

119896 As we can see in the above formula-

tion the overall cohesion score of integration is the summedweight of each node which is the sum of pairwise closenessscores

The formulation of multiple ontology integration is sim-ilar to the basic version and it is not difficult to showthat optimal structures described in Lemmas 1 and 2 canbe extended to a high dimension However the extensionof algorithms described in Section 212 for integrating twoontologies is not feasible for solving the multiple ontologyintegrationThis is because if we need to extend Algorithms 2and 3 to this problem we need to build a cohesionmatrix of 119896dimensions It implies that we need at least 119874(119899119896) operationsto fill up the score matrix assuming the size of an ontology

8 BioMed Research International

A B C D

A 10 6 8

B 10 7 5

C 6 7 9

D 8 5 9

AB C D

AB

C 9

D 9

lowastlowast

lowast

lowast mdash

mdash

mdash

mdash

mdash

mdash

mdash

(lowast) Entries to update

Figure 4 An example of the InterOntology matrixrsquos change at the first iteration in Algorithm 4 for integrating four ontologies

build the 119896 times 119896 InterOntology matrixfor 119894 = 1 to 119896 minus 1 doidentify the active tree pair ⟨119879

119883 119879119884⟩ that corresponds

to the highest score in the InterOntology matrixintegrate 119879

119883and 119879

119884into 119879

119883119884

mark 119879119883and 119879

119884as inactive

update relationship matricesupdate InterOntology matrix

end forreturn 119879

12119899

Algorithm 4 GreedyMultiInt(T = 1198791 1198792 119879

119896)

is 119874(119899) This is clearly not acceptable for high dimensionalontology integration

221 Greedy Approach From the above discussion we cansee that direct extension of Algorithms 2 and 3 for integratingtwo ontologies is practically not feasible for integrating alarge number of ontologies However we can still use thesealgorithms for integrating multiple ontologies by iterativelyintegrating two ontologies and generating a new closenessmatrix Given the ontologies 119879

1 1198792 119879

119896 we can first

integrate 1198791and 119879

2into 119879

12and then build the closeness

matrix between 11987912

and 1198793using the relationship matrices

between 1198791and 119879

2and between 119879

1and 119879

3 Specifically

assume 119883 is a node on the integrated ontology 11987912 and 119883

contains a vertex 119886 from 1198791and a vertex 119887 from 119879

2 Then

the entry (119883 119888) of the closeness matrix between 11987912

and 1198793

is 11987211987912 1198793(119883 119888) = 119872

1198791 1198793(119886 119888) + 119872

1198792 1198793(119887 119888) After the new

closeness matrix is generated we can continue integrating11987912

and 1198793into 119879

123and generating another new closeness

matrix We will eventually get the integrated ontology 11987912119896

by repeating the above process To facilitate the followingdiscussions we name the above approach the basic multipleintegration approach

Although the basic multiple integration approach canfinish integrating multiple ontologies it blindly integratesontologies without using any cohesion information betweenontologies that may lead to a better integration resultTo improve the basic multiple integration we propose agreedy approach that uses the cohesion information betweenontologies to guide the integration The basic steps of thegreedy approach are outlined in Algorithm 4 To facilitate

the understanding of Algorithm 4 we use Figure 4 toillustrate an example of the InterOntology matrixrsquos change atthe first iteration of integrating four ontologies A B C andD

The key idea in Algorithm 4 is to maintain an InterOn-tology matrix which guides the integration Initially thismatrix is filled with the overall cohesion score of everypair of ontologies In each step this matrix is updated withoverall cohesion scores between newly integrated ontologyand existing active ontologies The integration will take placebetween two active ontologies which have the highest scorein the InterOntology matrix

When we used the overall cohesion score between twoontologies to update the InterOntology matrix we observedan interesting phenomenon that the integration inmost casesis a process continuously expanding an integrated ontologyConsequently the greedy approach is likely to yield a resultsimilar to the basic approach

This phenomenon can be explained by the definition ofmaximum cohesion function which takes into account allpairwise closeness between merged terms Thus the moreontologies contained in an integrated ontology are the morelikely it will have high overall cohesion scores with otherontologies As a result it creates unfairness for the integrationselection To fix this issue we use the adjusted overall cohe-sion scores in updating the InterOntology matrix as follows

Given an ontology 119879119883and an ontology 119879

119884where 119883 and

119884 are nonempty sets of ontology IDs we define the adjustedcohesion score between 119879

119883and 119879

119884as 119860119889119895119862119900ℎ(119879

119883 119879119884) =

sum119885isin119881(119879119883119884)

sum119909isin119885119910isin119885120590(119909)isin119883120590(119910)isin119884

119872120590(119909)120590(119910)

(119909 119910)|119883||119884| where119879119883119884

is the integrated ontology built by Algorithms 2 and 3The adjusted cohesion score is in fact the weight increase byintegrating 119879

119883and 119879

119884 divided by the size of 119883 times the

size of 119884 that is119860119889119895119862119900ℎ(119879119883 119879119884) = sum

119885isin119881(119879119883119884)(119908119890119894119892ℎ119905(119885))minus

sum119883isin119881(119879119883)

(119908119890119894119892ℎ119905(119883)) minus sum119884isin119881(119879119884)

119908119890119894119892ℎ119905(119884)|119883||119884| For eachnode merging closeness scores will be added to the totalweight when the merging takes place between vertices fromontology set 119883 and vertices from ontology set 119884 Thus theweight increase by integrating 119879

119883and 119879

119884is proportional to

the number of ontologies in119883 times the number of ontologiesin 119884 and consequently we averaged the weight increase by|119883||119884|

222 Fast Approximation Algorithm Although the basicmultiple integration and the greedy multiple integrationapproaches discussed above are able to integrate multipleontologies none of them provide any guarantee on the results

BioMed Research International 9

A

A

7

10 8

9

6

5

9

D

C

B D

C

B

ge 7

ge8

Figure 5 An illustration of vertexedge contraction and weight updates in an iteration of Algorithm 5 Each vertex represents an ontology

for 119894 = 1 to 119896 minus 1 dofor 119895 = 119894 + 1 to 119896 dopush ⟨119892(119879

119894 119879119895) 119894 119895⟩ into a set 119878 ordered by the first element in descending order

end forend forwhile |119878| gt 0 do⟨119911 119909 119910⟩ = 119901119900119901(119878)if adding (119909 119910) does not form a cycle inG thenadding (119909 119910) toGintegrate 119879

119862(119909)and 119879

119862(119910)into 119879

119862(119909)cup119862(119910) 119862(V) is a set of vertices including V that form a connected component inG

end ifend whilereturn 119879

12119899

Algorithm 5 FastMultiInt(T = 1198791 1198792 119879

119896)

in comparison with the optimal solutions By studyingthe maximum cohesion scores between ontologies undera graph setting we identified an approximation structureand developed an approximation algorithm for integratingmultiple ontologiesWename it fast approximation algorithmbecause it not only has a lower bound on the results but alsoruns faster than the greedy multiple integration algorithmproposed above

The fast approximation algorithm for integratingmultipleontologies is sketched in Algorithm 5 It only calculates themaximum cohesion score between every pair of ontologiesonce during the initial stage and uses this informationthroughout the integration process even after it becomesstaleMore importantly this approach not only saves the timefor recalculating the maximum cohesion scores but also pro-vides a lower bound guarantee as stated inTheorem 5 whosecorrectness is built on two important lemmas (Lemmas 6 and7) which will be described subsequently

Theorem 5 The tree weight of the integrated tree 11987912119896

obtained by FastMultiInt algorithm is at least 1(119896 minus 1) theweight of the optimal solution

Proof We will use Lemmas 6 and 7 to prove this theoremThe proofs of Lemmas 6 and 7 are provided after the proofof this theorem To facilitate the proof of this theorem we

build a fully connected weighted graph G in which eachnode corresponds to a tree for integration and the weightof each edge corresponds to the weight increase (initiallythis is the cohesion score) for integrating the correspondingtrees According to Lemma 6 119892(119879

1 1198792 119879

119896) (ie optimal

cohesion score) is no more than the summed weight of edgesinG (Claim 1)

Given G the integration by Algorithm 5 is a processof 119896 minus 1 node contractions After each contraction theadjacent edge weights (cohesion scores) will be updatedaccordingly According to Lemma 7 the weight of an updatededge will only increase over (or at least remain the same as)the maximum weight of the two contracted edges (Figure 5provides an illustration of vertexedge contraction andweightupdates) Thus the overall cohesion score of the integrationby Algorithm 5 is no less than the weight of the maximumspanning tree ofG (Claim 2)

It is easy to see that the weight of a maximum spanningtree is no less than 1(119896minus1) of the summed edge weight ofGgiven the simple observation that each edge in G is either anedge of the maximum spanning tree or adjacent to an edge ofthe maximum spanning tree with an equal or heavier weight(Claim 3)

Combining Claims 1 2 and 3 we complete the proof ofthis theorem

Lemma 6 It holds that 119892(1198791 1198792 119879

119896) le sum1le119894lt119895le119896

119892(119879119894 119879119895)

10 BioMed Research International

Proof According to the problem definition

119892 (1198791 1198792 119879

119896)

= max( sum

119883isin119881(11987912119896)

119908119890119894119892ℎ119905 (119883))

= max( sum

119883isin119881(11987912119896)

sum

119906isin119883Visin119883120590(119906)lt120590(V)119872120590(119906)120590(V) (119906 V))

= ( sum

1le119894lt119895le119896

11989111987912119896

(119879119894119895)) le sum

1le119894lt119895le119896

119892 (119879119894 119879119895)

(3)

11989111987912119896(119879119894119895) is the cohesion score of 119879

119894119895whose integration is

induced from 11987912119896

Lemma 7 It holds that 119892(119879119875119876 119879119878) minus 119891(119879

119875119876) ge max(119892(119879

119875

119879119878) 119892(119879119876 119879119878))

Proof According to the problem definition integrating 119879119875119876

with 119879119878will result in an integrated tree 119879

119880where 119880 = 119875 cup

119876 cup 119878 and 119892(119879119875119876 119879119878) = max

119879119880(sum119883isin119881(119879119880)

119908119890119894119892ℎ119905(119883)) =

max119879119880(119891(119879119875119878) + 119891(119879

119876119878) + 119891(119879

119875119876)) where 119879

119875119878 119879119876119878 and

119879119875119876

are induced from 119879119880 Since 119879

119875119876has been determined

we have 119892(119879119875119876 119879119878) = max

119879119880(119891(119879119875119878) + 119891(119879

119876119878)) + 119891(119879

119875119876)

Without loss of generally let us assume max(119891(119879119875119878)) ge

max(119891(119879119876119880)) Then by restricting the integration between

119879119875and 119879

119878in 119879119880following the integration that leads to

argmax119879119875119878119891(119879119875119878) we will get a cohesion score no less than

119892(119879119875 119879119878) Thus we complete the proof for 119892(119879

119875119876 119879119878) minus

119891(119879119875119876) ge max(119892(119879

119875 119879119878) 119892(119879119876 119879119878))

223 Time Complexity Analysis For the fast approximationalgorithm (Algorithm 5) the time complexity for generatinggraph G (calculating the overall cohesion score for everypair of ontologies) is119874(11989621198992 log 119899) assumingwe usemaximalweighted matching Each integration will take 119874(1198992 log 119899)with an update of at most 119896 closeness matrices which takes119874(119896lowast119899

2) There are at most 119896 integrations therefore the total

time complexity is still 119874(11989621198992 log 119899)Following the above analysis we conclude that the time

complexity of the greedy multiple integration algorithm isthe same as the fast approximation algorithm However itrequires updating of InterOntology matrix which takes anexcessive 119874(11989621198992 log 119899) time The empirical study shows thatthe fast approximation algorithm is much faster than thegreedy multiple integration algorithm

Finally it is easy to see that the basic multiple integrationapproach takes 119874(1198961198992 log 119899) time and is the fastest but itsoverall cohesion scores are the worst as we will see in theempirical study

224 Limitations Integrating multiple ontologies may facetwo potential problems in real applications First how canwe efficiently generate a closeness matrix for every pair ofontologies to be integrated Our current method kDLS or

onGrid is efficient for generating the closenessmatrix for onepair of ontologies in most cases but not efficient enough forgenerating closeness matrices for many pairs of ontologiesSecond not every pair of ontologies can be meaningfullyintegrated It remains a problem to efficiently identify thefeasibility of integrating a pair of ontologies Therefore themain purpose of Section 22 is to demonstrate that ourproposed approach can be extended to integrate multipleontologies and we use synesthetic datasets in Section 33 tostudy the performance of algorithms proposed in Section 22

3 Results and Discussion

We would like to study the performances of the proposedontology integration methods by experiments on both realand synthetic datasets We implemented five approaches inC++

(1) Heuristic heuristic approach for integrating twoontologies as described in Section 211

(2) Approximate approximate approach (Algorithms 2and 3) with maximal weighted matching for guaran-teeing the (12)-approximation rate

(3) Basic basic multiple integration approach as de-scribed in Section 221

(4) Greedy greedy multiple integration approach(Algorithm 4)

(5) FastApproximate fast approximation multipleintegration approach (Algorithm 5)

In the following we report our study on the performancesof (1) and (2) for integrating two ontologies on real datasetsand (3) (4) and (5) for integrating multiple ontologies onsynthetic datasets All the experiments are carried out on aLinux cluster with 24GHz AMD Opteron processors

31 Integrating a Pair of Ontologies The knowledge of drug-gene relationships is desirable in many pharmacology appli-cations [25 26] By integrating the gene ontology and thedrug ontology we will be able to obtain rich information onthe associations between drugs and genes under the ontologystructures Thus in this set of experiments we simulate realworld knowledge discovery applications by integrating tworeal ontologies gene ontology (GO) and National Drug FileReference Terminology (NDFRT) Both were obtained fromthe Unified Medical Language System (version 2012AA)The closeness matrices between GO terms and drug termswere generated using onGrid [27] with a 4-neighborhoodbroadcast range (ie 119896 = 4 with regard to [7]) onGridfollows the kDLS approach [7] and measures the closenessbetween two concepts based on the discovered paths (withlength greater than one) between them However unlikekDLS onGrid takes into consideration of concept semantictypes in the closeness measurement In the study performedin [27] the advantages of onGrid over kDLS are wellillustrated

The overall cohesion scores of Heuristic andApproximate on integrating GO and NDFRT are listed

BioMed Research International 11

Table 1 Cohesion scores of integrating real datasets

Depth GO term number NDFRT term number Cohesion scoresHeuristic 120573 = 1 Heuristic 120573 = 6 Heuristic 120573 = 100 Approximate

3 66 6004 00505331 0229958 021999 1246964 710 6972 0290392 0284363 146585 938355 5355 14582 0285923 137714 0528289 3390566 16231 32841 0307941 0341588 0673406 740293

Table 2 Top 5 matched terms by the Approximate algorithm (depth = 6)

Rank GO terms NDFRT terms Closeness score1 C1135918 smooth muscle contractile fiber C0282606 muscle neoplasms 1632052 C0010813 cytokinesis C0086376 GTP-binding proteins 1159673 C0027747 axon terminus C0030584 parovarian cyst 1133524 C1155065 T cell activation C0007082 carcinoembryonic antigen 1008795 C0007595 cell growth C0294028 BRCA2 protein 0945284

Table 3 Top 5 matched terms by the Heuristic algorithm (depth = 6 120573 = 100)

Rank GO terms NDFRT terms Closeness score1 C0031845 biological process C0042890 VITAMINS 02448622 C1166607 cellular component C1657248 apoptosome 01428573 C0027540 tissue death C0065932 MENADIOL 004342194 C0025519 metabolic process C0042849 VITAMIN B 003702485 C0030012 oxidation-reduction process C0027996 NICOTINIC ACID 00327109

in Table 1 To observe the integration over the ontologysize change in each experiment we use the ontology treestructure from the root to the specified depth (first columnin Table 1) for integration The sizes of the ontology termsinvolved in the integration are listed in the second and thirdcolumns of Table 1

Recall in Section 211 120573rdepth(119887) is used to regulate theselection of vertices from high depths Thus we tested theHeuristic under 120573 = 1 (depth information is nullified)and 120573 = 100 (the vertex depth plays a critical role inthe selection) Since nonleaf vertices in these datasets havearound 6 children on average we heuristically add a set ofexperiments by setting 120573 = 6 so that 120573rdepth(119887) will be close tothe number of vertices excluded from the future integration

From Table 1 we can see that Heuristic performs betterwhen using the depth information to regulate the selectionof vertices However Approximate is much better thanHeuristic at all settings Compared to the best cohesionscores of Heuristic in each row of Table 1 Approximateconstructs an integrated ontology with the overall cohe-sion score ranging from 54 times to 1099 times that ofHeuristicThis clearly demonstrates the effectiveness of theproposed Approximate approach Nevertheless the heuris-tic approach has a much faster average running time as aresult of excluding a large number of matching opportunitiesin each step

Although the running time of Approximate is longerthan Heuristic it takes less than two hours to finishintegrating two ontologies with about 16119896 and 33119896 vertices

Most of the biomedical ontologies are smaller than or similarto these sizes and Approximate approach will benefit theassociation study of these ontologies For extremely largeontology pairs in which Approximate is unable to finish theintegrationwithin a reasonable time Heuristicmay providea quick view on their integration

32 Understanding the Merged Ontology Terms To under-stand what terms are merged in integrating real ontologieswe use the integration of GO and NDFRT at depth 6 asan example Tables 2 and 3 list the top 5 pairs of mergedterms (sorted by their closeness scores) by Heuristicand Approximate respectively As mentioned abovethese scores are from the closeness matrix generatedby onGrid based on the discovered paths betweenthem For example ldquoC1155065T-Cell Activation minus minusis physiologic effect of chemical or drug minus minus gt C0393002Carcinoembryonic Antigen Peptide 1 minus minus has target minus minus gtC0007082Carcinoembryonic Antigenrdquo is such a path

From Tables 2 and 3 we can observe that theApproximate algorithm merges terms with much highersimilar scores than the Heuristic algorithm Quiteinterestingly we observed that the top ranked merging inTable 3 is between ldquobiological processrdquo and ldquoVITAMINSrdquoThe ldquobiological processrdquo is an abstract term which is veryclose to the root of the GO ontology Such a fact suggests thatthe top level terms will likely preempt the merging choicesover their descendants As a result of this greedy approach

12 BioMed Research International

0

2000

4000

6000

8000

10000

12000

14000

16000

18000

100 200 300 400 500 600 700 800 900 1000

Ove

rall

cohe

sion

scor

e of i

nteg

ratio

n

Individual ontology size

FastApproximateBasic

Greedy

Figure 6 The change of overall cohesion score over the increase ofthe size of each ontology The number of ontologies is fixed at 10

0

100000

200000

300000

400000

500000

600000

700000

800000

10 20 40 70 80 90 100

Ove

rall

cohe

sion

scor

e of i

nteg

ratio

n

Number of ontologies 30 50 60

FastApproximateBasic

Greedy

Figure 7 The change of overall cohesion score over the increase ofthe number of ontologies The size of each ontology is fixed at 100

the Heuristic algorithm will end at a local optimum whichis far from being optimal

A snapshot of ontology integration by Approximate asshown in Figure 10 provides a good insight on the algorithmwork In each bracket of two merged terms the left partis the closeness score and the right part is the cohesionscore We can observe that most closeness scores are zero orclose to zero while the corresponding cohesion scores aremuch higher This is understandable because the snapshot isprimarily on the top level terms of both ontologies For theseterms they have a large number of subclass (descendant)terms and optimizing the integration of their subclass termsfar outweighs integrating of themselves The result of such

0

200

400

600

800

1000

1200

1400

1600

100 200 300 400 500 600 700 800 900 1000

Inte

grat

ion

time (

s)

Individual ontology size

FastApproximateBasic

Greedy

Figure 8 The change of integration time over the increase of thesize of each ontology The number of ontologies is fixed at 10

0

500

1000

1500

2000

2500In

tegr

atio

n tim

e (s)

10 20 40 70 80 90 100Number of ontologies

30 50 60

FastApproximateBasic

Greedy

Figure 9 The change of integration time over the increase of thenumber of ontologies The size of each ontology is fixed at 100

integration provides novel knowledge of association betweenontology terms That is even if two terms are not that closeaccording to some closeness measurement they can be struc-turally associated under their ontology context For examplethe GO term ldquobiological processrdquo is merged with the NDFRTterm ldquochemical ingredientsrdquo even their closeness score iszero from the onGrid output However such integrationis interesting because it shows that the merging is tryingto link the chemical compounds with the biologicalcellularprocesses so that corresponding associations between thecellular processes and chemical structures can be establishedThis demonstrates the purpose of integrating two ontologiesthat is identifying associations with respect to both termsimilarities and structural contexts

BioMed Research International 13

component

Biologicalprocess Cellular

SignalingNeoplasms

National Drug File- Reference Terminology

Chemical ingredients

Diseases manifestations or physiologic states

CarbohydratesResponse to

stimulusOrganic chemicals

Cell part Extracellular region part

Skin and connective

tissue diseases

[0 7403]

[0 4833] [0 2315]

[00042 262] [00016 215][014 899] [0 307]

Amino acids peptides and

proteins

Cellular process Proteins

Cell adhesion involved in heart morphogenesis

Zyxin

Synapse Nervous system diseases

Asymmetric synapse Brain

neoplasmsSymmetric

synapse Granulomatousdisease chronic

[0036 2009]

[0023 0023]

[0034 0034] [0023 0023] [0006 0006]

[0098 062]

ontology

middot middot middot

middot middot middot

middot middot middot middot middot middot middot middot middot

middot middot middot middot middot middot middot middot middot

middot middot middotmiddot middot middot

Gene

NeuronalRNA

granule

Traumanervoussystem

Figure 10 A snapshot of ontology integration between GO and NDFRT by Approximate (depth = 6) Blue nodes are terms from GO andred nodes are terms from NDFRT Solid lines are edges and dashed lines are paths consisting of 1 or more edges In each bracket the left partis the closeness score and the right part is the cohesion score

In fact there are multiple studies to justify the struc-tural associations seen in Figure 10 such as the associationbetween ldquosignalingrdquo and ldquocarbohydratesrdquo [28] and the asso-ciation between ldquoextracellular region partrdquo and ldquoskin andconnective tissue diseasesrdquo [29]

In addition we have noticed a number of meaningfulintegrations betweenGO terms and neurological terms in theNDFRT For example synapse is a brain related structure andthe term ldquosymmetric synapserdquo is associated with ldquotraumardquoand the term ldquoasymmetric synapserdquo is associated with ldquobrainneoplasmsrdquo Similarly it is reasonable to see that ldquoneuronalRNA granulerdquo is integrated with ldquogranulomatous diseaserdquo agranule associated disease As another example it is veryinteresting to notice that ldquozyxinrdquo is associated with ldquocelladhesion involved in heart morphogenesisrdquo and that providesa link with the formation of heart

The above observations suggest a novel way of using ourontology integration method to perform association studiesbetween biomedical concepts

33 Integrating Multiple Ontologies In the following exper-iments we will study the performances of Basic Greedyand FastApproximate in integrating multiple ontologiesAll the three approaches are built upon Approximate whichperforms very well in the previous study for integrating tworeal ontology datasets

We use two sets of synthetic datasets in this study In thefirst set of datasets we fix the number of ontologies to be 10and vary the size of each ontology from 100 to 1000 In thesecond set of datasets we fix the size of each ontology to be100 and vary the number of ontologies from 10 to 100 All theontologies are randomly generated by constructing aminimalspanning tree from a randommatrixThe relationshipmatrixbetween every pair of ontologies is also randomly generatedwith entry values ranging from 0 to 1 For each experiment

we generate 10 randomdatasets and the results reported in thefollowing are the average results over the 10 random datasets

The overall cohesion scores of the three approachesover different ontology sizes and over different numbers ofontologies were reported in Figures 6 and 7 respectivelyFastApproximate outperforms all the other approachesin Figure 6 which is consistent with the analysis of itsapproximation rate However Greedy slightly outperformsFastApproximate in Figure 7 especially when the ontologynumber is large This is understandable because when thenumber of ontology (119896) increases the approximation rate (asstated in Theorem 5) decreases and becomes less significantThis result also justifies the choice of adjusted cohesion scorefor Greedy as described at the end of Section 221

The integration time of the three approaches over differ-ent ontology sizes and over different numbers of ontologieswas reported in Figures 8 and 9 respectively These figuresare consistent with the time complexity analysis given inSection 223 In particular we noticed that the integrationtime of Greedy deteriorates sharply over the ontologynumber increase In contrast FastApproximate is muchmore scalable and has a time curve similar to Basic

These results suggest that FastApproximate has the bestoverall performance in integrating multiple ontologies

4 Conclusions

In this work we started with a basic problem on integrating apair of ontology tree structures with a given closeness matrixand later we advanced the basic problem to the problem ofintegrating large number of ontologies We proved optimalstructures in the basic problem and developed both optimaland efficient approximation solutions Although the multipleontology integration problem has similar optimal structuresit is not feasible to extend the optimal and efficient approx-imation solutions for the basic problem to efficiently handle

14 BioMed Research International

multiple ontology integration To tackle the challenge of inte-grating a large number of ontologies we developed both aneffective greedy approach and a fast approximation approachThe empirical study not only confirms our analysis on theefficiency of the proposedmethod but also demonstrates thatourmethod can be used effectively for biomedical associationstudies

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

References

[1] D L McGuinness F van Harmelen andM K Smith Owl webontology language overview W3C recommendation 10(2004-03)10 2004

[2] O Bodenreider ldquoThe unifiedmedical language system (UMLS)integrating biomedical terminologyrdquo Nucleic Acids Researchvol 32 supplement 1 pp D267ndashD270 2004

[3] X Wu R Jiang M Q Zhang and S Li ldquoNetwork-based globalinference of human disease genesrdquo Molecular Systems Biologyvol 4 no 1 2008

[4] B Linghu E S Snitkin Z Hu Y Xia and C DeLisi ldquoGenome-wide prioritization of disease genes and identification ofdisease-disease associations from an integrated human func-tional linkage networkrdquo Genome Biology vol 10 no 9 articleR91 2009

[5] D Botstein and N Risch ldquoDiscovering genotypes underlyinghumanphenotypes past successes formendelian disease futureapproaches for complex diseaserdquo Nature Genetics vol 33 pp228ndash237 2003

[6] H S Pinto and J P Martins ldquoA methodology for ontologyintegrationrdquo in Proceedings of the 1st International Conferenceon Knowledge Capture (K-CAP rsquo01) pp 131ndash138 ACM 2001

[7] Y Xiang K Lu S L James T B Borlawsky K Huang and P RO Payne ldquoK-Neighborhood decentralization a comprehensivesolution to index the UMLS for large scale knowledge discov-eryrdquo Journal of Biomedical Informatics vol 45 no 2 pp 323ndash336 2012

[8] J Dutkowski M Kramer M A Surma et al ldquoA gene ontologyinferred from molecular networksrdquo Nature Biotechnology vol31 no 1 pp 38ndash45 2013

[9] R Navigli P Velardi and A Gangemi ldquoOntology learningand its application to automated terminology translationrdquo IEEEIntelligent Systems vol 18 no 1 pp 22ndash31 2003

[10] J Jannink andGWiederhold ldquoThesaurus entry extraction froman on-line dictionaryrdquo in Proceedings of the Fusion ConferenceSunnyvale Calif USA 1999

[11] C Papatheodorou A Vassiliou and B Simon ldquoDiscovery ofontologies for learning resources using word-based clusteringrdquoin World Conference on Educational Multimedia Hypermediaand Telecommunications pp 1523ndash1528 2002

[12] D L Rubin M Hewett D E Oliver T E Klein and RB Altman ldquoAutomating data acquisition into ontologies frompharmacogenetics relational data sources using declarativeobject definitions and xmlrdquoPacific SymposiumonBiocomputingpp 88ndash99 2001

[13] N F Noy ldquoSemantic integration a survey of ontology-basedapproachesrdquo ACM SIGMOD Record vol 33 no 4 pp 65ndash702004

[14] S Abels L Haak and A Hahn ldquoIdentification of commonmethods used for ontology integration tasksrdquo in Proceedingsof the 1st International Workshop on Interoperability of Het-erogeneous Information Systems (IHIS rsquo05) pp 75ndash78 ACMNovember 2005

[15] A Gangemi D Pisanelli and G Steve ldquoOntology integrationexperiences withmedical terminologiesrdquo in Formal Ontology inInformation Systems vol 46 pp 98ndash94 IOS Press AmsterdamThe Netherlands 1998

[16] N F Noy and M A Musen ldquoSMART automated support forontology merging and alignmentrdquo in Proceedings of the 12thWorkshop on Knowledge Acquisition Modelling and Manage-ment (KAW 99) Banff Canada 1999

[17] A Doan J Madhavan P Domingos and A Halevy ldquoLearningtomap between ontologies on the semantic webrdquo in Proceedingsof the 11th International Conference onWorldWideWeb (WWWrsquo02) pp 662ndash673 ACM May 2002

[18] J Xie F Liu and S-U Guan ldquoTree-structure based ontologyintegrationrdquo Journal of Information Science vol 37 no 6 pp594ndash613 2011

[19] D Calvanese G De Giacomo andM Lenzerini ldquoA frameworkfor ontology integrationrdquo inTheEmerging SemanticWebSelectedPapers from the First Semantic Web Working Symposium pp201ndash214 2002

[20] O Udrea L Getoor and R J Miller ldquoLeveraging data andstructure in ontology integrationrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo07) pp 449ndash460 ACM June 2007

[21] H W Kuhn ldquoThe Hungarian method for the assignmentproblemrdquoNaval Research Logistics Quarterly vol 1-2 pp 83ndash971955

[22] D E Drake Vinkemeier and S Hougardy ldquoA linear-timeapproximation algorithm for weighted matchings in graphsrdquoACMTransactions on Algorithms vol 1 no 1 pp 107ndash122 2005

[23] A J J Wood W E Evans and H L McLeod ldquoPharmacoge-nomicsmdashdrug disposition drug targets and side effectsrdquo TheNew England Journal of Medicine vol 348 no 6 pp 538ndash5492003

[24] R B Altman ldquoPharmgkb a logical home for knowledge relatinggenotype to drug response phenotyperdquoNature Genetics vol 39no 4 article 426 2007

[25] R B Kim D OrsquoShea and G R Wilkinson ldquoInterindivid-ual variability of chlorzoxazone 6-hydroxylation in men andwomen and its relationship to CYP2E1 genetic polymorphismsrdquoClinical Pharmacology amp Therapeutics vol 57 no 6 pp 645ndash655 1995

[26] T N Ferraro and R J Buono ldquoThe relationship between thepharmacology of antiepileptic drugs and human gene variationan overviewrdquo Epilepsy and Behavior vol 7 no 1 pp 18ndash36 2005

[27] A Albin X Ji T B Borlawsky et al ldquoEnabling online studiesof conceptual relationships between medical terms developingan efficient web platformrdquo JMIR Medical Informatics vol 2 no2 article e23 2014

[28] S Chandrasekaran J W Dean III M S Giniger and M LTanzer ldquoLaminin carbohydrates are implicated in cell signal-ingrdquo Journal of Cellular Biochemistry vol 46 no 2 pp 115ndash1241991

[29] J Uitto and D Kouba ldquoCytokine modulation of extracellularmatrix gene expression relevance to fibrotic skin diseasesrdquoJournal of Dermatological Science vol 24 pp S60ndashS69 2000

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 8: Research Article Building Integrated Ontological Knowledge

8 BioMed Research International

A B C D

A 10 6 8

B 10 7 5

C 6 7 9

D 8 5 9

AB C D

AB

C 9

D 9

lowastlowast

lowast

lowast mdash

mdash

mdash

mdash

mdash

mdash

mdash

(lowast) Entries to update

Figure 4 An example of the InterOntology matrixrsquos change at the first iteration in Algorithm 4 for integrating four ontologies

build the 119896 times 119896 InterOntology matrixfor 119894 = 1 to 119896 minus 1 doidentify the active tree pair ⟨119879

119883 119879119884⟩ that corresponds

to the highest score in the InterOntology matrixintegrate 119879

119883and 119879

119884into 119879

119883119884

mark 119879119883and 119879

119884as inactive

update relationship matricesupdate InterOntology matrix

end forreturn 119879

12119899

Algorithm 4 GreedyMultiInt(T = 1198791 1198792 119879

119896)

is 119874(119899) This is clearly not acceptable for high dimensionalontology integration

221 Greedy Approach From the above discussion we cansee that direct extension of Algorithms 2 and 3 for integratingtwo ontologies is practically not feasible for integrating alarge number of ontologies However we can still use thesealgorithms for integrating multiple ontologies by iterativelyintegrating two ontologies and generating a new closenessmatrix Given the ontologies 119879

1 1198792 119879

119896 we can first

integrate 1198791and 119879

2into 119879

12and then build the closeness

matrix between 11987912

and 1198793using the relationship matrices

between 1198791and 119879

2and between 119879

1and 119879

3 Specifically

assume 119883 is a node on the integrated ontology 11987912 and 119883

contains a vertex 119886 from 1198791and a vertex 119887 from 119879

2 Then

the entry (119883 119888) of the closeness matrix between 11987912

and 1198793

is 11987211987912 1198793(119883 119888) = 119872

1198791 1198793(119886 119888) + 119872

1198792 1198793(119887 119888) After the new

closeness matrix is generated we can continue integrating11987912

and 1198793into 119879

123and generating another new closeness

matrix We will eventually get the integrated ontology 11987912119896

by repeating the above process To facilitate the followingdiscussions we name the above approach the basic multipleintegration approach

Although the basic multiple integration approach canfinish integrating multiple ontologies it blindly integratesontologies without using any cohesion information betweenontologies that may lead to a better integration resultTo improve the basic multiple integration we propose agreedy approach that uses the cohesion information betweenontologies to guide the integration The basic steps of thegreedy approach are outlined in Algorithm 4 To facilitate

the understanding of Algorithm 4 we use Figure 4 toillustrate an example of the InterOntology matrixrsquos change atthe first iteration of integrating four ontologies A B C andD

The key idea in Algorithm 4 is to maintain an InterOn-tology matrix which guides the integration Initially thismatrix is filled with the overall cohesion score of everypair of ontologies In each step this matrix is updated withoverall cohesion scores between newly integrated ontologyand existing active ontologies The integration will take placebetween two active ontologies which have the highest scorein the InterOntology matrix

When we used the overall cohesion score between twoontologies to update the InterOntology matrix we observedan interesting phenomenon that the integration inmost casesis a process continuously expanding an integrated ontologyConsequently the greedy approach is likely to yield a resultsimilar to the basic approach

This phenomenon can be explained by the definition ofmaximum cohesion function which takes into account allpairwise closeness between merged terms Thus the moreontologies contained in an integrated ontology are the morelikely it will have high overall cohesion scores with otherontologies As a result it creates unfairness for the integrationselection To fix this issue we use the adjusted overall cohe-sion scores in updating the InterOntology matrix as follows

Given an ontology 119879119883and an ontology 119879

119884where 119883 and

119884 are nonempty sets of ontology IDs we define the adjustedcohesion score between 119879

119883and 119879

119884as 119860119889119895119862119900ℎ(119879

119883 119879119884) =

sum119885isin119881(119879119883119884)

sum119909isin119885119910isin119885120590(119909)isin119883120590(119910)isin119884

119872120590(119909)120590(119910)

(119909 119910)|119883||119884| where119879119883119884

is the integrated ontology built by Algorithms 2 and 3The adjusted cohesion score is in fact the weight increase byintegrating 119879

119883and 119879

119884 divided by the size of 119883 times the

size of 119884 that is119860119889119895119862119900ℎ(119879119883 119879119884) = sum

119885isin119881(119879119883119884)(119908119890119894119892ℎ119905(119885))minus

sum119883isin119881(119879119883)

(119908119890119894119892ℎ119905(119883)) minus sum119884isin119881(119879119884)

119908119890119894119892ℎ119905(119884)|119883||119884| For eachnode merging closeness scores will be added to the totalweight when the merging takes place between vertices fromontology set 119883 and vertices from ontology set 119884 Thus theweight increase by integrating 119879

119883and 119879

119884is proportional to

the number of ontologies in119883 times the number of ontologiesin 119884 and consequently we averaged the weight increase by|119883||119884|

222 Fast Approximation Algorithm Although the basicmultiple integration and the greedy multiple integrationapproaches discussed above are able to integrate multipleontologies none of them provide any guarantee on the results

BioMed Research International 9

A

A

7

10 8

9

6

5

9

D

C

B D

C

B

ge 7

ge8

Figure 5 An illustration of vertexedge contraction and weight updates in an iteration of Algorithm 5 Each vertex represents an ontology

for 119894 = 1 to 119896 minus 1 dofor 119895 = 119894 + 1 to 119896 dopush ⟨119892(119879

119894 119879119895) 119894 119895⟩ into a set 119878 ordered by the first element in descending order

end forend forwhile |119878| gt 0 do⟨119911 119909 119910⟩ = 119901119900119901(119878)if adding (119909 119910) does not form a cycle inG thenadding (119909 119910) toGintegrate 119879

119862(119909)and 119879

119862(119910)into 119879

119862(119909)cup119862(119910) 119862(V) is a set of vertices including V that form a connected component inG

end ifend whilereturn 119879

12119899

Algorithm 5 FastMultiInt(T = 1198791 1198792 119879

119896)

in comparison with the optimal solutions By studyingthe maximum cohesion scores between ontologies undera graph setting we identified an approximation structureand developed an approximation algorithm for integratingmultiple ontologiesWename it fast approximation algorithmbecause it not only has a lower bound on the results but alsoruns faster than the greedy multiple integration algorithmproposed above

The fast approximation algorithm for integratingmultipleontologies is sketched in Algorithm 5 It only calculates themaximum cohesion score between every pair of ontologiesonce during the initial stage and uses this informationthroughout the integration process even after it becomesstaleMore importantly this approach not only saves the timefor recalculating the maximum cohesion scores but also pro-vides a lower bound guarantee as stated inTheorem 5 whosecorrectness is built on two important lemmas (Lemmas 6 and7) which will be described subsequently

Theorem 5 The tree weight of the integrated tree 11987912119896

obtained by FastMultiInt algorithm is at least 1(119896 minus 1) theweight of the optimal solution

Proof We will use Lemmas 6 and 7 to prove this theoremThe proofs of Lemmas 6 and 7 are provided after the proofof this theorem To facilitate the proof of this theorem we

build a fully connected weighted graph G in which eachnode corresponds to a tree for integration and the weightof each edge corresponds to the weight increase (initiallythis is the cohesion score) for integrating the correspondingtrees According to Lemma 6 119892(119879

1 1198792 119879

119896) (ie optimal

cohesion score) is no more than the summed weight of edgesinG (Claim 1)

Given G the integration by Algorithm 5 is a processof 119896 minus 1 node contractions After each contraction theadjacent edge weights (cohesion scores) will be updatedaccordingly According to Lemma 7 the weight of an updatededge will only increase over (or at least remain the same as)the maximum weight of the two contracted edges (Figure 5provides an illustration of vertexedge contraction andweightupdates) Thus the overall cohesion score of the integrationby Algorithm 5 is no less than the weight of the maximumspanning tree ofG (Claim 2)

It is easy to see that the weight of a maximum spanningtree is no less than 1(119896minus1) of the summed edge weight ofGgiven the simple observation that each edge in G is either anedge of the maximum spanning tree or adjacent to an edge ofthe maximum spanning tree with an equal or heavier weight(Claim 3)

Combining Claims 1 2 and 3 we complete the proof ofthis theorem

Lemma 6 It holds that 119892(1198791 1198792 119879

119896) le sum1le119894lt119895le119896

119892(119879119894 119879119895)

10 BioMed Research International

Proof According to the problem definition

119892 (1198791 1198792 119879

119896)

= max( sum

119883isin119881(11987912119896)

119908119890119894119892ℎ119905 (119883))

= max( sum

119883isin119881(11987912119896)

sum

119906isin119883Visin119883120590(119906)lt120590(V)119872120590(119906)120590(V) (119906 V))

= ( sum

1le119894lt119895le119896

11989111987912119896

(119879119894119895)) le sum

1le119894lt119895le119896

119892 (119879119894 119879119895)

(3)

11989111987912119896(119879119894119895) is the cohesion score of 119879

119894119895whose integration is

induced from 11987912119896

Lemma 7 It holds that 119892(119879119875119876 119879119878) minus 119891(119879

119875119876) ge max(119892(119879

119875

119879119878) 119892(119879119876 119879119878))

Proof According to the problem definition integrating 119879119875119876

with 119879119878will result in an integrated tree 119879

119880where 119880 = 119875 cup

119876 cup 119878 and 119892(119879119875119876 119879119878) = max

119879119880(sum119883isin119881(119879119880)

119908119890119894119892ℎ119905(119883)) =

max119879119880(119891(119879119875119878) + 119891(119879

119876119878) + 119891(119879

119875119876)) where 119879

119875119878 119879119876119878 and

119879119875119876

are induced from 119879119880 Since 119879

119875119876has been determined

we have 119892(119879119875119876 119879119878) = max

119879119880(119891(119879119875119878) + 119891(119879

119876119878)) + 119891(119879

119875119876)

Without loss of generally let us assume max(119891(119879119875119878)) ge

max(119891(119879119876119880)) Then by restricting the integration between

119879119875and 119879

119878in 119879119880following the integration that leads to

argmax119879119875119878119891(119879119875119878) we will get a cohesion score no less than

119892(119879119875 119879119878) Thus we complete the proof for 119892(119879

119875119876 119879119878) minus

119891(119879119875119876) ge max(119892(119879

119875 119879119878) 119892(119879119876 119879119878))

223 Time Complexity Analysis For the fast approximationalgorithm (Algorithm 5) the time complexity for generatinggraph G (calculating the overall cohesion score for everypair of ontologies) is119874(11989621198992 log 119899) assumingwe usemaximalweighted matching Each integration will take 119874(1198992 log 119899)with an update of at most 119896 closeness matrices which takes119874(119896lowast119899

2) There are at most 119896 integrations therefore the total

time complexity is still 119874(11989621198992 log 119899)Following the above analysis we conclude that the time

complexity of the greedy multiple integration algorithm isthe same as the fast approximation algorithm However itrequires updating of InterOntology matrix which takes anexcessive 119874(11989621198992 log 119899) time The empirical study shows thatthe fast approximation algorithm is much faster than thegreedy multiple integration algorithm

Finally it is easy to see that the basic multiple integrationapproach takes 119874(1198961198992 log 119899) time and is the fastest but itsoverall cohesion scores are the worst as we will see in theempirical study

224 Limitations Integrating multiple ontologies may facetwo potential problems in real applications First how canwe efficiently generate a closeness matrix for every pair ofontologies to be integrated Our current method kDLS or

onGrid is efficient for generating the closenessmatrix for onepair of ontologies in most cases but not efficient enough forgenerating closeness matrices for many pairs of ontologiesSecond not every pair of ontologies can be meaningfullyintegrated It remains a problem to efficiently identify thefeasibility of integrating a pair of ontologies Therefore themain purpose of Section 22 is to demonstrate that ourproposed approach can be extended to integrate multipleontologies and we use synesthetic datasets in Section 33 tostudy the performance of algorithms proposed in Section 22

3 Results and Discussion

We would like to study the performances of the proposedontology integration methods by experiments on both realand synthetic datasets We implemented five approaches inC++

(1) Heuristic heuristic approach for integrating twoontologies as described in Section 211

(2) Approximate approximate approach (Algorithms 2and 3) with maximal weighted matching for guaran-teeing the (12)-approximation rate

(3) Basic basic multiple integration approach as de-scribed in Section 221

(4) Greedy greedy multiple integration approach(Algorithm 4)

(5) FastApproximate fast approximation multipleintegration approach (Algorithm 5)

In the following we report our study on the performancesof (1) and (2) for integrating two ontologies on real datasetsand (3) (4) and (5) for integrating multiple ontologies onsynthetic datasets All the experiments are carried out on aLinux cluster with 24GHz AMD Opteron processors

31 Integrating a Pair of Ontologies The knowledge of drug-gene relationships is desirable in many pharmacology appli-cations [25 26] By integrating the gene ontology and thedrug ontology we will be able to obtain rich information onthe associations between drugs and genes under the ontologystructures Thus in this set of experiments we simulate realworld knowledge discovery applications by integrating tworeal ontologies gene ontology (GO) and National Drug FileReference Terminology (NDFRT) Both were obtained fromthe Unified Medical Language System (version 2012AA)The closeness matrices between GO terms and drug termswere generated using onGrid [27] with a 4-neighborhoodbroadcast range (ie 119896 = 4 with regard to [7]) onGridfollows the kDLS approach [7] and measures the closenessbetween two concepts based on the discovered paths (withlength greater than one) between them However unlikekDLS onGrid takes into consideration of concept semantictypes in the closeness measurement In the study performedin [27] the advantages of onGrid over kDLS are wellillustrated

The overall cohesion scores of Heuristic andApproximate on integrating GO and NDFRT are listed

BioMed Research International 11

Table 1 Cohesion scores of integrating real datasets

Depth GO term number NDFRT term number Cohesion scoresHeuristic 120573 = 1 Heuristic 120573 = 6 Heuristic 120573 = 100 Approximate

3 66 6004 00505331 0229958 021999 1246964 710 6972 0290392 0284363 146585 938355 5355 14582 0285923 137714 0528289 3390566 16231 32841 0307941 0341588 0673406 740293

Table 2 Top 5 matched terms by the Approximate algorithm (depth = 6)

Rank GO terms NDFRT terms Closeness score1 C1135918 smooth muscle contractile fiber C0282606 muscle neoplasms 1632052 C0010813 cytokinesis C0086376 GTP-binding proteins 1159673 C0027747 axon terminus C0030584 parovarian cyst 1133524 C1155065 T cell activation C0007082 carcinoembryonic antigen 1008795 C0007595 cell growth C0294028 BRCA2 protein 0945284

Table 3 Top 5 matched terms by the Heuristic algorithm (depth = 6 120573 = 100)

Rank GO terms NDFRT terms Closeness score1 C0031845 biological process C0042890 VITAMINS 02448622 C1166607 cellular component C1657248 apoptosome 01428573 C0027540 tissue death C0065932 MENADIOL 004342194 C0025519 metabolic process C0042849 VITAMIN B 003702485 C0030012 oxidation-reduction process C0027996 NICOTINIC ACID 00327109

in Table 1 To observe the integration over the ontologysize change in each experiment we use the ontology treestructure from the root to the specified depth (first columnin Table 1) for integration The sizes of the ontology termsinvolved in the integration are listed in the second and thirdcolumns of Table 1

Recall in Section 211 120573rdepth(119887) is used to regulate theselection of vertices from high depths Thus we tested theHeuristic under 120573 = 1 (depth information is nullified)and 120573 = 100 (the vertex depth plays a critical role inthe selection) Since nonleaf vertices in these datasets havearound 6 children on average we heuristically add a set ofexperiments by setting 120573 = 6 so that 120573rdepth(119887) will be close tothe number of vertices excluded from the future integration

From Table 1 we can see that Heuristic performs betterwhen using the depth information to regulate the selectionof vertices However Approximate is much better thanHeuristic at all settings Compared to the best cohesionscores of Heuristic in each row of Table 1 Approximateconstructs an integrated ontology with the overall cohe-sion score ranging from 54 times to 1099 times that ofHeuristicThis clearly demonstrates the effectiveness of theproposed Approximate approach Nevertheless the heuris-tic approach has a much faster average running time as aresult of excluding a large number of matching opportunitiesin each step

Although the running time of Approximate is longerthan Heuristic it takes less than two hours to finishintegrating two ontologies with about 16119896 and 33119896 vertices

Most of the biomedical ontologies are smaller than or similarto these sizes and Approximate approach will benefit theassociation study of these ontologies For extremely largeontology pairs in which Approximate is unable to finish theintegrationwithin a reasonable time Heuristicmay providea quick view on their integration

32 Understanding the Merged Ontology Terms To under-stand what terms are merged in integrating real ontologieswe use the integration of GO and NDFRT at depth 6 asan example Tables 2 and 3 list the top 5 pairs of mergedterms (sorted by their closeness scores) by Heuristicand Approximate respectively As mentioned abovethese scores are from the closeness matrix generatedby onGrid based on the discovered paths betweenthem For example ldquoC1155065T-Cell Activation minus minusis physiologic effect of chemical or drug minus minus gt C0393002Carcinoembryonic Antigen Peptide 1 minus minus has target minus minus gtC0007082Carcinoembryonic Antigenrdquo is such a path

From Tables 2 and 3 we can observe that theApproximate algorithm merges terms with much highersimilar scores than the Heuristic algorithm Quiteinterestingly we observed that the top ranked merging inTable 3 is between ldquobiological processrdquo and ldquoVITAMINSrdquoThe ldquobiological processrdquo is an abstract term which is veryclose to the root of the GO ontology Such a fact suggests thatthe top level terms will likely preempt the merging choicesover their descendants As a result of this greedy approach

12 BioMed Research International

0

2000

4000

6000

8000

10000

12000

14000

16000

18000

100 200 300 400 500 600 700 800 900 1000

Ove

rall

cohe

sion

scor

e of i

nteg

ratio

n

Individual ontology size

FastApproximateBasic

Greedy

Figure 6 The change of overall cohesion score over the increase ofthe size of each ontology The number of ontologies is fixed at 10

0

100000

200000

300000

400000

500000

600000

700000

800000

10 20 40 70 80 90 100

Ove

rall

cohe

sion

scor

e of i

nteg

ratio

n

Number of ontologies 30 50 60

FastApproximateBasic

Greedy

Figure 7 The change of overall cohesion score over the increase ofthe number of ontologies The size of each ontology is fixed at 100

the Heuristic algorithm will end at a local optimum whichis far from being optimal

A snapshot of ontology integration by Approximate asshown in Figure 10 provides a good insight on the algorithmwork In each bracket of two merged terms the left partis the closeness score and the right part is the cohesionscore We can observe that most closeness scores are zero orclose to zero while the corresponding cohesion scores aremuch higher This is understandable because the snapshot isprimarily on the top level terms of both ontologies For theseterms they have a large number of subclass (descendant)terms and optimizing the integration of their subclass termsfar outweighs integrating of themselves The result of such

0

200

400

600

800

1000

1200

1400

1600

100 200 300 400 500 600 700 800 900 1000

Inte

grat

ion

time (

s)

Individual ontology size

FastApproximateBasic

Greedy

Figure 8 The change of integration time over the increase of thesize of each ontology The number of ontologies is fixed at 10

0

500

1000

1500

2000

2500In

tegr

atio

n tim

e (s)

10 20 40 70 80 90 100Number of ontologies

30 50 60

FastApproximateBasic

Greedy

Figure 9 The change of integration time over the increase of thenumber of ontologies The size of each ontology is fixed at 100

integration provides novel knowledge of association betweenontology terms That is even if two terms are not that closeaccording to some closeness measurement they can be struc-turally associated under their ontology context For examplethe GO term ldquobiological processrdquo is merged with the NDFRTterm ldquochemical ingredientsrdquo even their closeness score iszero from the onGrid output However such integrationis interesting because it shows that the merging is tryingto link the chemical compounds with the biologicalcellularprocesses so that corresponding associations between thecellular processes and chemical structures can be establishedThis demonstrates the purpose of integrating two ontologiesthat is identifying associations with respect to both termsimilarities and structural contexts

BioMed Research International 13

component

Biologicalprocess Cellular

SignalingNeoplasms

National Drug File- Reference Terminology

Chemical ingredients

Diseases manifestations or physiologic states

CarbohydratesResponse to

stimulusOrganic chemicals

Cell part Extracellular region part

Skin and connective

tissue diseases

[0 7403]

[0 4833] [0 2315]

[00042 262] [00016 215][014 899] [0 307]

Amino acids peptides and

proteins

Cellular process Proteins

Cell adhesion involved in heart morphogenesis

Zyxin

Synapse Nervous system diseases

Asymmetric synapse Brain

neoplasmsSymmetric

synapse Granulomatousdisease chronic

[0036 2009]

[0023 0023]

[0034 0034] [0023 0023] [0006 0006]

[0098 062]

ontology

middot middot middot

middot middot middot

middot middot middot middot middot middot middot middot middot

middot middot middot middot middot middot middot middot middot

middot middot middotmiddot middot middot

Gene

NeuronalRNA

granule

Traumanervoussystem

Figure 10 A snapshot of ontology integration between GO and NDFRT by Approximate (depth = 6) Blue nodes are terms from GO andred nodes are terms from NDFRT Solid lines are edges and dashed lines are paths consisting of 1 or more edges In each bracket the left partis the closeness score and the right part is the cohesion score

In fact there are multiple studies to justify the struc-tural associations seen in Figure 10 such as the associationbetween ldquosignalingrdquo and ldquocarbohydratesrdquo [28] and the asso-ciation between ldquoextracellular region partrdquo and ldquoskin andconnective tissue diseasesrdquo [29]

In addition we have noticed a number of meaningfulintegrations betweenGO terms and neurological terms in theNDFRT For example synapse is a brain related structure andthe term ldquosymmetric synapserdquo is associated with ldquotraumardquoand the term ldquoasymmetric synapserdquo is associated with ldquobrainneoplasmsrdquo Similarly it is reasonable to see that ldquoneuronalRNA granulerdquo is integrated with ldquogranulomatous diseaserdquo agranule associated disease As another example it is veryinteresting to notice that ldquozyxinrdquo is associated with ldquocelladhesion involved in heart morphogenesisrdquo and that providesa link with the formation of heart

The above observations suggest a novel way of using ourontology integration method to perform association studiesbetween biomedical concepts

33 Integrating Multiple Ontologies In the following exper-iments we will study the performances of Basic Greedyand FastApproximate in integrating multiple ontologiesAll the three approaches are built upon Approximate whichperforms very well in the previous study for integrating tworeal ontology datasets

We use two sets of synthetic datasets in this study In thefirst set of datasets we fix the number of ontologies to be 10and vary the size of each ontology from 100 to 1000 In thesecond set of datasets we fix the size of each ontology to be100 and vary the number of ontologies from 10 to 100 All theontologies are randomly generated by constructing aminimalspanning tree from a randommatrixThe relationshipmatrixbetween every pair of ontologies is also randomly generatedwith entry values ranging from 0 to 1 For each experiment

we generate 10 randomdatasets and the results reported in thefollowing are the average results over the 10 random datasets

The overall cohesion scores of the three approachesover different ontology sizes and over different numbers ofontologies were reported in Figures 6 and 7 respectivelyFastApproximate outperforms all the other approachesin Figure 6 which is consistent with the analysis of itsapproximation rate However Greedy slightly outperformsFastApproximate in Figure 7 especially when the ontologynumber is large This is understandable because when thenumber of ontology (119896) increases the approximation rate (asstated in Theorem 5) decreases and becomes less significantThis result also justifies the choice of adjusted cohesion scorefor Greedy as described at the end of Section 221

The integration time of the three approaches over differ-ent ontology sizes and over different numbers of ontologieswas reported in Figures 8 and 9 respectively These figuresare consistent with the time complexity analysis given inSection 223 In particular we noticed that the integrationtime of Greedy deteriorates sharply over the ontologynumber increase In contrast FastApproximate is muchmore scalable and has a time curve similar to Basic

These results suggest that FastApproximate has the bestoverall performance in integrating multiple ontologies

4 Conclusions

In this work we started with a basic problem on integrating apair of ontology tree structures with a given closeness matrixand later we advanced the basic problem to the problem ofintegrating large number of ontologies We proved optimalstructures in the basic problem and developed both optimaland efficient approximation solutions Although the multipleontology integration problem has similar optimal structuresit is not feasible to extend the optimal and efficient approx-imation solutions for the basic problem to efficiently handle

14 BioMed Research International

multiple ontology integration To tackle the challenge of inte-grating a large number of ontologies we developed both aneffective greedy approach and a fast approximation approachThe empirical study not only confirms our analysis on theefficiency of the proposedmethod but also demonstrates thatourmethod can be used effectively for biomedical associationstudies

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

References

[1] D L McGuinness F van Harmelen andM K Smith Owl webontology language overview W3C recommendation 10(2004-03)10 2004

[2] O Bodenreider ldquoThe unifiedmedical language system (UMLS)integrating biomedical terminologyrdquo Nucleic Acids Researchvol 32 supplement 1 pp D267ndashD270 2004

[3] X Wu R Jiang M Q Zhang and S Li ldquoNetwork-based globalinference of human disease genesrdquo Molecular Systems Biologyvol 4 no 1 2008

[4] B Linghu E S Snitkin Z Hu Y Xia and C DeLisi ldquoGenome-wide prioritization of disease genes and identification ofdisease-disease associations from an integrated human func-tional linkage networkrdquo Genome Biology vol 10 no 9 articleR91 2009

[5] D Botstein and N Risch ldquoDiscovering genotypes underlyinghumanphenotypes past successes formendelian disease futureapproaches for complex diseaserdquo Nature Genetics vol 33 pp228ndash237 2003

[6] H S Pinto and J P Martins ldquoA methodology for ontologyintegrationrdquo in Proceedings of the 1st International Conferenceon Knowledge Capture (K-CAP rsquo01) pp 131ndash138 ACM 2001

[7] Y Xiang K Lu S L James T B Borlawsky K Huang and P RO Payne ldquoK-Neighborhood decentralization a comprehensivesolution to index the UMLS for large scale knowledge discov-eryrdquo Journal of Biomedical Informatics vol 45 no 2 pp 323ndash336 2012

[8] J Dutkowski M Kramer M A Surma et al ldquoA gene ontologyinferred from molecular networksrdquo Nature Biotechnology vol31 no 1 pp 38ndash45 2013

[9] R Navigli P Velardi and A Gangemi ldquoOntology learningand its application to automated terminology translationrdquo IEEEIntelligent Systems vol 18 no 1 pp 22ndash31 2003

[10] J Jannink andGWiederhold ldquoThesaurus entry extraction froman on-line dictionaryrdquo in Proceedings of the Fusion ConferenceSunnyvale Calif USA 1999

[11] C Papatheodorou A Vassiliou and B Simon ldquoDiscovery ofontologies for learning resources using word-based clusteringrdquoin World Conference on Educational Multimedia Hypermediaand Telecommunications pp 1523ndash1528 2002

[12] D L Rubin M Hewett D E Oliver T E Klein and RB Altman ldquoAutomating data acquisition into ontologies frompharmacogenetics relational data sources using declarativeobject definitions and xmlrdquoPacific SymposiumonBiocomputingpp 88ndash99 2001

[13] N F Noy ldquoSemantic integration a survey of ontology-basedapproachesrdquo ACM SIGMOD Record vol 33 no 4 pp 65ndash702004

[14] S Abels L Haak and A Hahn ldquoIdentification of commonmethods used for ontology integration tasksrdquo in Proceedingsof the 1st International Workshop on Interoperability of Het-erogeneous Information Systems (IHIS rsquo05) pp 75ndash78 ACMNovember 2005

[15] A Gangemi D Pisanelli and G Steve ldquoOntology integrationexperiences withmedical terminologiesrdquo in Formal Ontology inInformation Systems vol 46 pp 98ndash94 IOS Press AmsterdamThe Netherlands 1998

[16] N F Noy and M A Musen ldquoSMART automated support forontology merging and alignmentrdquo in Proceedings of the 12thWorkshop on Knowledge Acquisition Modelling and Manage-ment (KAW 99) Banff Canada 1999

[17] A Doan J Madhavan P Domingos and A Halevy ldquoLearningtomap between ontologies on the semantic webrdquo in Proceedingsof the 11th International Conference onWorldWideWeb (WWWrsquo02) pp 662ndash673 ACM May 2002

[18] J Xie F Liu and S-U Guan ldquoTree-structure based ontologyintegrationrdquo Journal of Information Science vol 37 no 6 pp594ndash613 2011

[19] D Calvanese G De Giacomo andM Lenzerini ldquoA frameworkfor ontology integrationrdquo inTheEmerging SemanticWebSelectedPapers from the First Semantic Web Working Symposium pp201ndash214 2002

[20] O Udrea L Getoor and R J Miller ldquoLeveraging data andstructure in ontology integrationrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo07) pp 449ndash460 ACM June 2007

[21] H W Kuhn ldquoThe Hungarian method for the assignmentproblemrdquoNaval Research Logistics Quarterly vol 1-2 pp 83ndash971955

[22] D E Drake Vinkemeier and S Hougardy ldquoA linear-timeapproximation algorithm for weighted matchings in graphsrdquoACMTransactions on Algorithms vol 1 no 1 pp 107ndash122 2005

[23] A J J Wood W E Evans and H L McLeod ldquoPharmacoge-nomicsmdashdrug disposition drug targets and side effectsrdquo TheNew England Journal of Medicine vol 348 no 6 pp 538ndash5492003

[24] R B Altman ldquoPharmgkb a logical home for knowledge relatinggenotype to drug response phenotyperdquoNature Genetics vol 39no 4 article 426 2007

[25] R B Kim D OrsquoShea and G R Wilkinson ldquoInterindivid-ual variability of chlorzoxazone 6-hydroxylation in men andwomen and its relationship to CYP2E1 genetic polymorphismsrdquoClinical Pharmacology amp Therapeutics vol 57 no 6 pp 645ndash655 1995

[26] T N Ferraro and R J Buono ldquoThe relationship between thepharmacology of antiepileptic drugs and human gene variationan overviewrdquo Epilepsy and Behavior vol 7 no 1 pp 18ndash36 2005

[27] A Albin X Ji T B Borlawsky et al ldquoEnabling online studiesof conceptual relationships between medical terms developingan efficient web platformrdquo JMIR Medical Informatics vol 2 no2 article e23 2014

[28] S Chandrasekaran J W Dean III M S Giniger and M LTanzer ldquoLaminin carbohydrates are implicated in cell signal-ingrdquo Journal of Cellular Biochemistry vol 46 no 2 pp 115ndash1241991

[29] J Uitto and D Kouba ldquoCytokine modulation of extracellularmatrix gene expression relevance to fibrotic skin diseasesrdquoJournal of Dermatological Science vol 24 pp S60ndashS69 2000

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 9: Research Article Building Integrated Ontological Knowledge

BioMed Research International 9

A

A

7

10 8

9

6

5

9

D

C

B D

C

B

ge 7

ge8

Figure 5 An illustration of vertexedge contraction and weight updates in an iteration of Algorithm 5 Each vertex represents an ontology

for 119894 = 1 to 119896 minus 1 dofor 119895 = 119894 + 1 to 119896 dopush ⟨119892(119879

119894 119879119895) 119894 119895⟩ into a set 119878 ordered by the first element in descending order

end forend forwhile |119878| gt 0 do⟨119911 119909 119910⟩ = 119901119900119901(119878)if adding (119909 119910) does not form a cycle inG thenadding (119909 119910) toGintegrate 119879

119862(119909)and 119879

119862(119910)into 119879

119862(119909)cup119862(119910) 119862(V) is a set of vertices including V that form a connected component inG

end ifend whilereturn 119879

12119899

Algorithm 5 FastMultiInt(T = 1198791 1198792 119879

119896)

in comparison with the optimal solutions By studyingthe maximum cohesion scores between ontologies undera graph setting we identified an approximation structureand developed an approximation algorithm for integratingmultiple ontologiesWename it fast approximation algorithmbecause it not only has a lower bound on the results but alsoruns faster than the greedy multiple integration algorithmproposed above

The fast approximation algorithm for integratingmultipleontologies is sketched in Algorithm 5 It only calculates themaximum cohesion score between every pair of ontologiesonce during the initial stage and uses this informationthroughout the integration process even after it becomesstaleMore importantly this approach not only saves the timefor recalculating the maximum cohesion scores but also pro-vides a lower bound guarantee as stated inTheorem 5 whosecorrectness is built on two important lemmas (Lemmas 6 and7) which will be described subsequently

Theorem 5 The tree weight of the integrated tree 11987912119896

obtained by FastMultiInt algorithm is at least 1(119896 minus 1) theweight of the optimal solution

Proof We will use Lemmas 6 and 7 to prove this theoremThe proofs of Lemmas 6 and 7 are provided after the proofof this theorem To facilitate the proof of this theorem we

build a fully connected weighted graph G in which eachnode corresponds to a tree for integration and the weightof each edge corresponds to the weight increase (initiallythis is the cohesion score) for integrating the correspondingtrees According to Lemma 6 119892(119879

1 1198792 119879

119896) (ie optimal

cohesion score) is no more than the summed weight of edgesinG (Claim 1)

Given G the integration by Algorithm 5 is a processof 119896 minus 1 node contractions After each contraction theadjacent edge weights (cohesion scores) will be updatedaccordingly According to Lemma 7 the weight of an updatededge will only increase over (or at least remain the same as)the maximum weight of the two contracted edges (Figure 5provides an illustration of vertexedge contraction andweightupdates) Thus the overall cohesion score of the integrationby Algorithm 5 is no less than the weight of the maximumspanning tree ofG (Claim 2)

It is easy to see that the weight of a maximum spanningtree is no less than 1(119896minus1) of the summed edge weight ofGgiven the simple observation that each edge in G is either anedge of the maximum spanning tree or adjacent to an edge ofthe maximum spanning tree with an equal or heavier weight(Claim 3)

Combining Claims 1 2 and 3 we complete the proof ofthis theorem

Lemma 6 It holds that 119892(1198791 1198792 119879

119896) le sum1le119894lt119895le119896

119892(119879119894 119879119895)

10 BioMed Research International

Proof According to the problem definition

119892 (1198791 1198792 119879

119896)

= max( sum

119883isin119881(11987912119896)

119908119890119894119892ℎ119905 (119883))

= max( sum

119883isin119881(11987912119896)

sum

119906isin119883Visin119883120590(119906)lt120590(V)119872120590(119906)120590(V) (119906 V))

= ( sum

1le119894lt119895le119896

11989111987912119896

(119879119894119895)) le sum

1le119894lt119895le119896

119892 (119879119894 119879119895)

(3)

11989111987912119896(119879119894119895) is the cohesion score of 119879

119894119895whose integration is

induced from 11987912119896

Lemma 7 It holds that 119892(119879119875119876 119879119878) minus 119891(119879

119875119876) ge max(119892(119879

119875

119879119878) 119892(119879119876 119879119878))

Proof According to the problem definition integrating 119879119875119876

with 119879119878will result in an integrated tree 119879

119880where 119880 = 119875 cup

119876 cup 119878 and 119892(119879119875119876 119879119878) = max

119879119880(sum119883isin119881(119879119880)

119908119890119894119892ℎ119905(119883)) =

max119879119880(119891(119879119875119878) + 119891(119879

119876119878) + 119891(119879

119875119876)) where 119879

119875119878 119879119876119878 and

119879119875119876

are induced from 119879119880 Since 119879

119875119876has been determined

we have 119892(119879119875119876 119879119878) = max

119879119880(119891(119879119875119878) + 119891(119879

119876119878)) + 119891(119879

119875119876)

Without loss of generally let us assume max(119891(119879119875119878)) ge

max(119891(119879119876119880)) Then by restricting the integration between

119879119875and 119879

119878in 119879119880following the integration that leads to

argmax119879119875119878119891(119879119875119878) we will get a cohesion score no less than

119892(119879119875 119879119878) Thus we complete the proof for 119892(119879

119875119876 119879119878) minus

119891(119879119875119876) ge max(119892(119879

119875 119879119878) 119892(119879119876 119879119878))

223 Time Complexity Analysis For the fast approximationalgorithm (Algorithm 5) the time complexity for generatinggraph G (calculating the overall cohesion score for everypair of ontologies) is119874(11989621198992 log 119899) assumingwe usemaximalweighted matching Each integration will take 119874(1198992 log 119899)with an update of at most 119896 closeness matrices which takes119874(119896lowast119899

2) There are at most 119896 integrations therefore the total

time complexity is still 119874(11989621198992 log 119899)Following the above analysis we conclude that the time

complexity of the greedy multiple integration algorithm isthe same as the fast approximation algorithm However itrequires updating of InterOntology matrix which takes anexcessive 119874(11989621198992 log 119899) time The empirical study shows thatthe fast approximation algorithm is much faster than thegreedy multiple integration algorithm

Finally it is easy to see that the basic multiple integrationapproach takes 119874(1198961198992 log 119899) time and is the fastest but itsoverall cohesion scores are the worst as we will see in theempirical study

224 Limitations Integrating multiple ontologies may facetwo potential problems in real applications First how canwe efficiently generate a closeness matrix for every pair ofontologies to be integrated Our current method kDLS or

onGrid is efficient for generating the closenessmatrix for onepair of ontologies in most cases but not efficient enough forgenerating closeness matrices for many pairs of ontologiesSecond not every pair of ontologies can be meaningfullyintegrated It remains a problem to efficiently identify thefeasibility of integrating a pair of ontologies Therefore themain purpose of Section 22 is to demonstrate that ourproposed approach can be extended to integrate multipleontologies and we use synesthetic datasets in Section 33 tostudy the performance of algorithms proposed in Section 22

3 Results and Discussion

We would like to study the performances of the proposedontology integration methods by experiments on both realand synthetic datasets We implemented five approaches inC++

(1) Heuristic heuristic approach for integrating twoontologies as described in Section 211

(2) Approximate approximate approach (Algorithms 2and 3) with maximal weighted matching for guaran-teeing the (12)-approximation rate

(3) Basic basic multiple integration approach as de-scribed in Section 221

(4) Greedy greedy multiple integration approach(Algorithm 4)

(5) FastApproximate fast approximation multipleintegration approach (Algorithm 5)

In the following we report our study on the performancesof (1) and (2) for integrating two ontologies on real datasetsand (3) (4) and (5) for integrating multiple ontologies onsynthetic datasets All the experiments are carried out on aLinux cluster with 24GHz AMD Opteron processors

31 Integrating a Pair of Ontologies The knowledge of drug-gene relationships is desirable in many pharmacology appli-cations [25 26] By integrating the gene ontology and thedrug ontology we will be able to obtain rich information onthe associations between drugs and genes under the ontologystructures Thus in this set of experiments we simulate realworld knowledge discovery applications by integrating tworeal ontologies gene ontology (GO) and National Drug FileReference Terminology (NDFRT) Both were obtained fromthe Unified Medical Language System (version 2012AA)The closeness matrices between GO terms and drug termswere generated using onGrid [27] with a 4-neighborhoodbroadcast range (ie 119896 = 4 with regard to [7]) onGridfollows the kDLS approach [7] and measures the closenessbetween two concepts based on the discovered paths (withlength greater than one) between them However unlikekDLS onGrid takes into consideration of concept semantictypes in the closeness measurement In the study performedin [27] the advantages of onGrid over kDLS are wellillustrated

The overall cohesion scores of Heuristic andApproximate on integrating GO and NDFRT are listed

BioMed Research International 11

Table 1 Cohesion scores of integrating real datasets

Depth GO term number NDFRT term number Cohesion scoresHeuristic 120573 = 1 Heuristic 120573 = 6 Heuristic 120573 = 100 Approximate

3 66 6004 00505331 0229958 021999 1246964 710 6972 0290392 0284363 146585 938355 5355 14582 0285923 137714 0528289 3390566 16231 32841 0307941 0341588 0673406 740293

Table 2 Top 5 matched terms by the Approximate algorithm (depth = 6)

Rank GO terms NDFRT terms Closeness score1 C1135918 smooth muscle contractile fiber C0282606 muscle neoplasms 1632052 C0010813 cytokinesis C0086376 GTP-binding proteins 1159673 C0027747 axon terminus C0030584 parovarian cyst 1133524 C1155065 T cell activation C0007082 carcinoembryonic antigen 1008795 C0007595 cell growth C0294028 BRCA2 protein 0945284

Table 3 Top 5 matched terms by the Heuristic algorithm (depth = 6 120573 = 100)

Rank GO terms NDFRT terms Closeness score1 C0031845 biological process C0042890 VITAMINS 02448622 C1166607 cellular component C1657248 apoptosome 01428573 C0027540 tissue death C0065932 MENADIOL 004342194 C0025519 metabolic process C0042849 VITAMIN B 003702485 C0030012 oxidation-reduction process C0027996 NICOTINIC ACID 00327109

in Table 1 To observe the integration over the ontologysize change in each experiment we use the ontology treestructure from the root to the specified depth (first columnin Table 1) for integration The sizes of the ontology termsinvolved in the integration are listed in the second and thirdcolumns of Table 1

Recall in Section 211 120573rdepth(119887) is used to regulate theselection of vertices from high depths Thus we tested theHeuristic under 120573 = 1 (depth information is nullified)and 120573 = 100 (the vertex depth plays a critical role inthe selection) Since nonleaf vertices in these datasets havearound 6 children on average we heuristically add a set ofexperiments by setting 120573 = 6 so that 120573rdepth(119887) will be close tothe number of vertices excluded from the future integration

From Table 1 we can see that Heuristic performs betterwhen using the depth information to regulate the selectionof vertices However Approximate is much better thanHeuristic at all settings Compared to the best cohesionscores of Heuristic in each row of Table 1 Approximateconstructs an integrated ontology with the overall cohe-sion score ranging from 54 times to 1099 times that ofHeuristicThis clearly demonstrates the effectiveness of theproposed Approximate approach Nevertheless the heuris-tic approach has a much faster average running time as aresult of excluding a large number of matching opportunitiesin each step

Although the running time of Approximate is longerthan Heuristic it takes less than two hours to finishintegrating two ontologies with about 16119896 and 33119896 vertices

Most of the biomedical ontologies are smaller than or similarto these sizes and Approximate approach will benefit theassociation study of these ontologies For extremely largeontology pairs in which Approximate is unable to finish theintegrationwithin a reasonable time Heuristicmay providea quick view on their integration

32 Understanding the Merged Ontology Terms To under-stand what terms are merged in integrating real ontologieswe use the integration of GO and NDFRT at depth 6 asan example Tables 2 and 3 list the top 5 pairs of mergedterms (sorted by their closeness scores) by Heuristicand Approximate respectively As mentioned abovethese scores are from the closeness matrix generatedby onGrid based on the discovered paths betweenthem For example ldquoC1155065T-Cell Activation minus minusis physiologic effect of chemical or drug minus minus gt C0393002Carcinoembryonic Antigen Peptide 1 minus minus has target minus minus gtC0007082Carcinoembryonic Antigenrdquo is such a path

From Tables 2 and 3 we can observe that theApproximate algorithm merges terms with much highersimilar scores than the Heuristic algorithm Quiteinterestingly we observed that the top ranked merging inTable 3 is between ldquobiological processrdquo and ldquoVITAMINSrdquoThe ldquobiological processrdquo is an abstract term which is veryclose to the root of the GO ontology Such a fact suggests thatthe top level terms will likely preempt the merging choicesover their descendants As a result of this greedy approach

12 BioMed Research International

0

2000

4000

6000

8000

10000

12000

14000

16000

18000

100 200 300 400 500 600 700 800 900 1000

Ove

rall

cohe

sion

scor

e of i

nteg

ratio

n

Individual ontology size

FastApproximateBasic

Greedy

Figure 6 The change of overall cohesion score over the increase ofthe size of each ontology The number of ontologies is fixed at 10

0

100000

200000

300000

400000

500000

600000

700000

800000

10 20 40 70 80 90 100

Ove

rall

cohe

sion

scor

e of i

nteg

ratio

n

Number of ontologies 30 50 60

FastApproximateBasic

Greedy

Figure 7 The change of overall cohesion score over the increase ofthe number of ontologies The size of each ontology is fixed at 100

the Heuristic algorithm will end at a local optimum whichis far from being optimal

A snapshot of ontology integration by Approximate asshown in Figure 10 provides a good insight on the algorithmwork In each bracket of two merged terms the left partis the closeness score and the right part is the cohesionscore We can observe that most closeness scores are zero orclose to zero while the corresponding cohesion scores aremuch higher This is understandable because the snapshot isprimarily on the top level terms of both ontologies For theseterms they have a large number of subclass (descendant)terms and optimizing the integration of their subclass termsfar outweighs integrating of themselves The result of such

0

200

400

600

800

1000

1200

1400

1600

100 200 300 400 500 600 700 800 900 1000

Inte

grat

ion

time (

s)

Individual ontology size

FastApproximateBasic

Greedy

Figure 8 The change of integration time over the increase of thesize of each ontology The number of ontologies is fixed at 10

0

500

1000

1500

2000

2500In

tegr

atio

n tim

e (s)

10 20 40 70 80 90 100Number of ontologies

30 50 60

FastApproximateBasic

Greedy

Figure 9 The change of integration time over the increase of thenumber of ontologies The size of each ontology is fixed at 100

integration provides novel knowledge of association betweenontology terms That is even if two terms are not that closeaccording to some closeness measurement they can be struc-turally associated under their ontology context For examplethe GO term ldquobiological processrdquo is merged with the NDFRTterm ldquochemical ingredientsrdquo even their closeness score iszero from the onGrid output However such integrationis interesting because it shows that the merging is tryingto link the chemical compounds with the biologicalcellularprocesses so that corresponding associations between thecellular processes and chemical structures can be establishedThis demonstrates the purpose of integrating two ontologiesthat is identifying associations with respect to both termsimilarities and structural contexts

BioMed Research International 13

component

Biologicalprocess Cellular

SignalingNeoplasms

National Drug File- Reference Terminology

Chemical ingredients

Diseases manifestations or physiologic states

CarbohydratesResponse to

stimulusOrganic chemicals

Cell part Extracellular region part

Skin and connective

tissue diseases

[0 7403]

[0 4833] [0 2315]

[00042 262] [00016 215][014 899] [0 307]

Amino acids peptides and

proteins

Cellular process Proteins

Cell adhesion involved in heart morphogenesis

Zyxin

Synapse Nervous system diseases

Asymmetric synapse Brain

neoplasmsSymmetric

synapse Granulomatousdisease chronic

[0036 2009]

[0023 0023]

[0034 0034] [0023 0023] [0006 0006]

[0098 062]

ontology

middot middot middot

middot middot middot

middot middot middot middot middot middot middot middot middot

middot middot middot middot middot middot middot middot middot

middot middot middotmiddot middot middot

Gene

NeuronalRNA

granule

Traumanervoussystem

Figure 10 A snapshot of ontology integration between GO and NDFRT by Approximate (depth = 6) Blue nodes are terms from GO andred nodes are terms from NDFRT Solid lines are edges and dashed lines are paths consisting of 1 or more edges In each bracket the left partis the closeness score and the right part is the cohesion score

In fact there are multiple studies to justify the struc-tural associations seen in Figure 10 such as the associationbetween ldquosignalingrdquo and ldquocarbohydratesrdquo [28] and the asso-ciation between ldquoextracellular region partrdquo and ldquoskin andconnective tissue diseasesrdquo [29]

In addition we have noticed a number of meaningfulintegrations betweenGO terms and neurological terms in theNDFRT For example synapse is a brain related structure andthe term ldquosymmetric synapserdquo is associated with ldquotraumardquoand the term ldquoasymmetric synapserdquo is associated with ldquobrainneoplasmsrdquo Similarly it is reasonable to see that ldquoneuronalRNA granulerdquo is integrated with ldquogranulomatous diseaserdquo agranule associated disease As another example it is veryinteresting to notice that ldquozyxinrdquo is associated with ldquocelladhesion involved in heart morphogenesisrdquo and that providesa link with the formation of heart

The above observations suggest a novel way of using ourontology integration method to perform association studiesbetween biomedical concepts

33 Integrating Multiple Ontologies In the following exper-iments we will study the performances of Basic Greedyand FastApproximate in integrating multiple ontologiesAll the three approaches are built upon Approximate whichperforms very well in the previous study for integrating tworeal ontology datasets

We use two sets of synthetic datasets in this study In thefirst set of datasets we fix the number of ontologies to be 10and vary the size of each ontology from 100 to 1000 In thesecond set of datasets we fix the size of each ontology to be100 and vary the number of ontologies from 10 to 100 All theontologies are randomly generated by constructing aminimalspanning tree from a randommatrixThe relationshipmatrixbetween every pair of ontologies is also randomly generatedwith entry values ranging from 0 to 1 For each experiment

we generate 10 randomdatasets and the results reported in thefollowing are the average results over the 10 random datasets

The overall cohesion scores of the three approachesover different ontology sizes and over different numbers ofontologies were reported in Figures 6 and 7 respectivelyFastApproximate outperforms all the other approachesin Figure 6 which is consistent with the analysis of itsapproximation rate However Greedy slightly outperformsFastApproximate in Figure 7 especially when the ontologynumber is large This is understandable because when thenumber of ontology (119896) increases the approximation rate (asstated in Theorem 5) decreases and becomes less significantThis result also justifies the choice of adjusted cohesion scorefor Greedy as described at the end of Section 221

The integration time of the three approaches over differ-ent ontology sizes and over different numbers of ontologieswas reported in Figures 8 and 9 respectively These figuresare consistent with the time complexity analysis given inSection 223 In particular we noticed that the integrationtime of Greedy deteriorates sharply over the ontologynumber increase In contrast FastApproximate is muchmore scalable and has a time curve similar to Basic

These results suggest that FastApproximate has the bestoverall performance in integrating multiple ontologies

4 Conclusions

In this work we started with a basic problem on integrating apair of ontology tree structures with a given closeness matrixand later we advanced the basic problem to the problem ofintegrating large number of ontologies We proved optimalstructures in the basic problem and developed both optimaland efficient approximation solutions Although the multipleontology integration problem has similar optimal structuresit is not feasible to extend the optimal and efficient approx-imation solutions for the basic problem to efficiently handle

14 BioMed Research International

multiple ontology integration To tackle the challenge of inte-grating a large number of ontologies we developed both aneffective greedy approach and a fast approximation approachThe empirical study not only confirms our analysis on theefficiency of the proposedmethod but also demonstrates thatourmethod can be used effectively for biomedical associationstudies

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

References

[1] D L McGuinness F van Harmelen andM K Smith Owl webontology language overview W3C recommendation 10(2004-03)10 2004

[2] O Bodenreider ldquoThe unifiedmedical language system (UMLS)integrating biomedical terminologyrdquo Nucleic Acids Researchvol 32 supplement 1 pp D267ndashD270 2004

[3] X Wu R Jiang M Q Zhang and S Li ldquoNetwork-based globalinference of human disease genesrdquo Molecular Systems Biologyvol 4 no 1 2008

[4] B Linghu E S Snitkin Z Hu Y Xia and C DeLisi ldquoGenome-wide prioritization of disease genes and identification ofdisease-disease associations from an integrated human func-tional linkage networkrdquo Genome Biology vol 10 no 9 articleR91 2009

[5] D Botstein and N Risch ldquoDiscovering genotypes underlyinghumanphenotypes past successes formendelian disease futureapproaches for complex diseaserdquo Nature Genetics vol 33 pp228ndash237 2003

[6] H S Pinto and J P Martins ldquoA methodology for ontologyintegrationrdquo in Proceedings of the 1st International Conferenceon Knowledge Capture (K-CAP rsquo01) pp 131ndash138 ACM 2001

[7] Y Xiang K Lu S L James T B Borlawsky K Huang and P RO Payne ldquoK-Neighborhood decentralization a comprehensivesolution to index the UMLS for large scale knowledge discov-eryrdquo Journal of Biomedical Informatics vol 45 no 2 pp 323ndash336 2012

[8] J Dutkowski M Kramer M A Surma et al ldquoA gene ontologyinferred from molecular networksrdquo Nature Biotechnology vol31 no 1 pp 38ndash45 2013

[9] R Navigli P Velardi and A Gangemi ldquoOntology learningand its application to automated terminology translationrdquo IEEEIntelligent Systems vol 18 no 1 pp 22ndash31 2003

[10] J Jannink andGWiederhold ldquoThesaurus entry extraction froman on-line dictionaryrdquo in Proceedings of the Fusion ConferenceSunnyvale Calif USA 1999

[11] C Papatheodorou A Vassiliou and B Simon ldquoDiscovery ofontologies for learning resources using word-based clusteringrdquoin World Conference on Educational Multimedia Hypermediaand Telecommunications pp 1523ndash1528 2002

[12] D L Rubin M Hewett D E Oliver T E Klein and RB Altman ldquoAutomating data acquisition into ontologies frompharmacogenetics relational data sources using declarativeobject definitions and xmlrdquoPacific SymposiumonBiocomputingpp 88ndash99 2001

[13] N F Noy ldquoSemantic integration a survey of ontology-basedapproachesrdquo ACM SIGMOD Record vol 33 no 4 pp 65ndash702004

[14] S Abels L Haak and A Hahn ldquoIdentification of commonmethods used for ontology integration tasksrdquo in Proceedingsof the 1st International Workshop on Interoperability of Het-erogeneous Information Systems (IHIS rsquo05) pp 75ndash78 ACMNovember 2005

[15] A Gangemi D Pisanelli and G Steve ldquoOntology integrationexperiences withmedical terminologiesrdquo in Formal Ontology inInformation Systems vol 46 pp 98ndash94 IOS Press AmsterdamThe Netherlands 1998

[16] N F Noy and M A Musen ldquoSMART automated support forontology merging and alignmentrdquo in Proceedings of the 12thWorkshop on Knowledge Acquisition Modelling and Manage-ment (KAW 99) Banff Canada 1999

[17] A Doan J Madhavan P Domingos and A Halevy ldquoLearningtomap between ontologies on the semantic webrdquo in Proceedingsof the 11th International Conference onWorldWideWeb (WWWrsquo02) pp 662ndash673 ACM May 2002

[18] J Xie F Liu and S-U Guan ldquoTree-structure based ontologyintegrationrdquo Journal of Information Science vol 37 no 6 pp594ndash613 2011

[19] D Calvanese G De Giacomo andM Lenzerini ldquoA frameworkfor ontology integrationrdquo inTheEmerging SemanticWebSelectedPapers from the First Semantic Web Working Symposium pp201ndash214 2002

[20] O Udrea L Getoor and R J Miller ldquoLeveraging data andstructure in ontology integrationrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo07) pp 449ndash460 ACM June 2007

[21] H W Kuhn ldquoThe Hungarian method for the assignmentproblemrdquoNaval Research Logistics Quarterly vol 1-2 pp 83ndash971955

[22] D E Drake Vinkemeier and S Hougardy ldquoA linear-timeapproximation algorithm for weighted matchings in graphsrdquoACMTransactions on Algorithms vol 1 no 1 pp 107ndash122 2005

[23] A J J Wood W E Evans and H L McLeod ldquoPharmacoge-nomicsmdashdrug disposition drug targets and side effectsrdquo TheNew England Journal of Medicine vol 348 no 6 pp 538ndash5492003

[24] R B Altman ldquoPharmgkb a logical home for knowledge relatinggenotype to drug response phenotyperdquoNature Genetics vol 39no 4 article 426 2007

[25] R B Kim D OrsquoShea and G R Wilkinson ldquoInterindivid-ual variability of chlorzoxazone 6-hydroxylation in men andwomen and its relationship to CYP2E1 genetic polymorphismsrdquoClinical Pharmacology amp Therapeutics vol 57 no 6 pp 645ndash655 1995

[26] T N Ferraro and R J Buono ldquoThe relationship between thepharmacology of antiepileptic drugs and human gene variationan overviewrdquo Epilepsy and Behavior vol 7 no 1 pp 18ndash36 2005

[27] A Albin X Ji T B Borlawsky et al ldquoEnabling online studiesof conceptual relationships between medical terms developingan efficient web platformrdquo JMIR Medical Informatics vol 2 no2 article e23 2014

[28] S Chandrasekaran J W Dean III M S Giniger and M LTanzer ldquoLaminin carbohydrates are implicated in cell signal-ingrdquo Journal of Cellular Biochemistry vol 46 no 2 pp 115ndash1241991

[29] J Uitto and D Kouba ldquoCytokine modulation of extracellularmatrix gene expression relevance to fibrotic skin diseasesrdquoJournal of Dermatological Science vol 24 pp S60ndashS69 2000

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 10: Research Article Building Integrated Ontological Knowledge

10 BioMed Research International

Proof According to the problem definition

119892 (1198791 1198792 119879

119896)

= max( sum

119883isin119881(11987912119896)

119908119890119894119892ℎ119905 (119883))

= max( sum

119883isin119881(11987912119896)

sum

119906isin119883Visin119883120590(119906)lt120590(V)119872120590(119906)120590(V) (119906 V))

= ( sum

1le119894lt119895le119896

11989111987912119896

(119879119894119895)) le sum

1le119894lt119895le119896

119892 (119879119894 119879119895)

(3)

11989111987912119896(119879119894119895) is the cohesion score of 119879

119894119895whose integration is

induced from 11987912119896

Lemma 7 It holds that 119892(119879119875119876 119879119878) minus 119891(119879

119875119876) ge max(119892(119879

119875

119879119878) 119892(119879119876 119879119878))

Proof According to the problem definition integrating 119879119875119876

with 119879119878will result in an integrated tree 119879

119880where 119880 = 119875 cup

119876 cup 119878 and 119892(119879119875119876 119879119878) = max

119879119880(sum119883isin119881(119879119880)

119908119890119894119892ℎ119905(119883)) =

max119879119880(119891(119879119875119878) + 119891(119879

119876119878) + 119891(119879

119875119876)) where 119879

119875119878 119879119876119878 and

119879119875119876

are induced from 119879119880 Since 119879

119875119876has been determined

we have 119892(119879119875119876 119879119878) = max

119879119880(119891(119879119875119878) + 119891(119879

119876119878)) + 119891(119879

119875119876)

Without loss of generally let us assume max(119891(119879119875119878)) ge

max(119891(119879119876119880)) Then by restricting the integration between

119879119875and 119879

119878in 119879119880following the integration that leads to

argmax119879119875119878119891(119879119875119878) we will get a cohesion score no less than

119892(119879119875 119879119878) Thus we complete the proof for 119892(119879

119875119876 119879119878) minus

119891(119879119875119876) ge max(119892(119879

119875 119879119878) 119892(119879119876 119879119878))

223 Time Complexity Analysis For the fast approximationalgorithm (Algorithm 5) the time complexity for generatinggraph G (calculating the overall cohesion score for everypair of ontologies) is119874(11989621198992 log 119899) assumingwe usemaximalweighted matching Each integration will take 119874(1198992 log 119899)with an update of at most 119896 closeness matrices which takes119874(119896lowast119899

2) There are at most 119896 integrations therefore the total

time complexity is still 119874(11989621198992 log 119899)Following the above analysis we conclude that the time

complexity of the greedy multiple integration algorithm isthe same as the fast approximation algorithm However itrequires updating of InterOntology matrix which takes anexcessive 119874(11989621198992 log 119899) time The empirical study shows thatthe fast approximation algorithm is much faster than thegreedy multiple integration algorithm

Finally it is easy to see that the basic multiple integrationapproach takes 119874(1198961198992 log 119899) time and is the fastest but itsoverall cohesion scores are the worst as we will see in theempirical study

224 Limitations Integrating multiple ontologies may facetwo potential problems in real applications First how canwe efficiently generate a closeness matrix for every pair ofontologies to be integrated Our current method kDLS or

onGrid is efficient for generating the closenessmatrix for onepair of ontologies in most cases but not efficient enough forgenerating closeness matrices for many pairs of ontologiesSecond not every pair of ontologies can be meaningfullyintegrated It remains a problem to efficiently identify thefeasibility of integrating a pair of ontologies Therefore themain purpose of Section 22 is to demonstrate that ourproposed approach can be extended to integrate multipleontologies and we use synesthetic datasets in Section 33 tostudy the performance of algorithms proposed in Section 22

3 Results and Discussion

We would like to study the performances of the proposedontology integration methods by experiments on both realand synthetic datasets We implemented five approaches inC++

(1) Heuristic heuristic approach for integrating twoontologies as described in Section 211

(2) Approximate approximate approach (Algorithms 2and 3) with maximal weighted matching for guaran-teeing the (12)-approximation rate

(3) Basic basic multiple integration approach as de-scribed in Section 221

(4) Greedy greedy multiple integration approach(Algorithm 4)

(5) FastApproximate fast approximation multipleintegration approach (Algorithm 5)

In the following we report our study on the performancesof (1) and (2) for integrating two ontologies on real datasetsand (3) (4) and (5) for integrating multiple ontologies onsynthetic datasets All the experiments are carried out on aLinux cluster with 24GHz AMD Opteron processors

31 Integrating a Pair of Ontologies The knowledge of drug-gene relationships is desirable in many pharmacology appli-cations [25 26] By integrating the gene ontology and thedrug ontology we will be able to obtain rich information onthe associations between drugs and genes under the ontologystructures Thus in this set of experiments we simulate realworld knowledge discovery applications by integrating tworeal ontologies gene ontology (GO) and National Drug FileReference Terminology (NDFRT) Both were obtained fromthe Unified Medical Language System (version 2012AA)The closeness matrices between GO terms and drug termswere generated using onGrid [27] with a 4-neighborhoodbroadcast range (ie 119896 = 4 with regard to [7]) onGridfollows the kDLS approach [7] and measures the closenessbetween two concepts based on the discovered paths (withlength greater than one) between them However unlikekDLS onGrid takes into consideration of concept semantictypes in the closeness measurement In the study performedin [27] the advantages of onGrid over kDLS are wellillustrated

The overall cohesion scores of Heuristic andApproximate on integrating GO and NDFRT are listed

BioMed Research International 11

Table 1 Cohesion scores of integrating real datasets

Depth GO term number NDFRT term number Cohesion scoresHeuristic 120573 = 1 Heuristic 120573 = 6 Heuristic 120573 = 100 Approximate

3 66 6004 00505331 0229958 021999 1246964 710 6972 0290392 0284363 146585 938355 5355 14582 0285923 137714 0528289 3390566 16231 32841 0307941 0341588 0673406 740293

Table 2 Top 5 matched terms by the Approximate algorithm (depth = 6)

Rank GO terms NDFRT terms Closeness score1 C1135918 smooth muscle contractile fiber C0282606 muscle neoplasms 1632052 C0010813 cytokinesis C0086376 GTP-binding proteins 1159673 C0027747 axon terminus C0030584 parovarian cyst 1133524 C1155065 T cell activation C0007082 carcinoembryonic antigen 1008795 C0007595 cell growth C0294028 BRCA2 protein 0945284

Table 3 Top 5 matched terms by the Heuristic algorithm (depth = 6 120573 = 100)

Rank GO terms NDFRT terms Closeness score1 C0031845 biological process C0042890 VITAMINS 02448622 C1166607 cellular component C1657248 apoptosome 01428573 C0027540 tissue death C0065932 MENADIOL 004342194 C0025519 metabolic process C0042849 VITAMIN B 003702485 C0030012 oxidation-reduction process C0027996 NICOTINIC ACID 00327109

in Table 1 To observe the integration over the ontologysize change in each experiment we use the ontology treestructure from the root to the specified depth (first columnin Table 1) for integration The sizes of the ontology termsinvolved in the integration are listed in the second and thirdcolumns of Table 1

Recall in Section 211 120573rdepth(119887) is used to regulate theselection of vertices from high depths Thus we tested theHeuristic under 120573 = 1 (depth information is nullified)and 120573 = 100 (the vertex depth plays a critical role inthe selection) Since nonleaf vertices in these datasets havearound 6 children on average we heuristically add a set ofexperiments by setting 120573 = 6 so that 120573rdepth(119887) will be close tothe number of vertices excluded from the future integration

From Table 1 we can see that Heuristic performs betterwhen using the depth information to regulate the selectionof vertices However Approximate is much better thanHeuristic at all settings Compared to the best cohesionscores of Heuristic in each row of Table 1 Approximateconstructs an integrated ontology with the overall cohe-sion score ranging from 54 times to 1099 times that ofHeuristicThis clearly demonstrates the effectiveness of theproposed Approximate approach Nevertheless the heuris-tic approach has a much faster average running time as aresult of excluding a large number of matching opportunitiesin each step

Although the running time of Approximate is longerthan Heuristic it takes less than two hours to finishintegrating two ontologies with about 16119896 and 33119896 vertices

Most of the biomedical ontologies are smaller than or similarto these sizes and Approximate approach will benefit theassociation study of these ontologies For extremely largeontology pairs in which Approximate is unable to finish theintegrationwithin a reasonable time Heuristicmay providea quick view on their integration

32 Understanding the Merged Ontology Terms To under-stand what terms are merged in integrating real ontologieswe use the integration of GO and NDFRT at depth 6 asan example Tables 2 and 3 list the top 5 pairs of mergedterms (sorted by their closeness scores) by Heuristicand Approximate respectively As mentioned abovethese scores are from the closeness matrix generatedby onGrid based on the discovered paths betweenthem For example ldquoC1155065T-Cell Activation minus minusis physiologic effect of chemical or drug minus minus gt C0393002Carcinoembryonic Antigen Peptide 1 minus minus has target minus minus gtC0007082Carcinoembryonic Antigenrdquo is such a path

From Tables 2 and 3 we can observe that theApproximate algorithm merges terms with much highersimilar scores than the Heuristic algorithm Quiteinterestingly we observed that the top ranked merging inTable 3 is between ldquobiological processrdquo and ldquoVITAMINSrdquoThe ldquobiological processrdquo is an abstract term which is veryclose to the root of the GO ontology Such a fact suggests thatthe top level terms will likely preempt the merging choicesover their descendants As a result of this greedy approach

12 BioMed Research International

0

2000

4000

6000

8000

10000

12000

14000

16000

18000

100 200 300 400 500 600 700 800 900 1000

Ove

rall

cohe

sion

scor

e of i

nteg

ratio

n

Individual ontology size

FastApproximateBasic

Greedy

Figure 6 The change of overall cohesion score over the increase ofthe size of each ontology The number of ontologies is fixed at 10

0

100000

200000

300000

400000

500000

600000

700000

800000

10 20 40 70 80 90 100

Ove

rall

cohe

sion

scor

e of i

nteg

ratio

n

Number of ontologies 30 50 60

FastApproximateBasic

Greedy

Figure 7 The change of overall cohesion score over the increase ofthe number of ontologies The size of each ontology is fixed at 100

the Heuristic algorithm will end at a local optimum whichis far from being optimal

A snapshot of ontology integration by Approximate asshown in Figure 10 provides a good insight on the algorithmwork In each bracket of two merged terms the left partis the closeness score and the right part is the cohesionscore We can observe that most closeness scores are zero orclose to zero while the corresponding cohesion scores aremuch higher This is understandable because the snapshot isprimarily on the top level terms of both ontologies For theseterms they have a large number of subclass (descendant)terms and optimizing the integration of their subclass termsfar outweighs integrating of themselves The result of such

0

200

400

600

800

1000

1200

1400

1600

100 200 300 400 500 600 700 800 900 1000

Inte

grat

ion

time (

s)

Individual ontology size

FastApproximateBasic

Greedy

Figure 8 The change of integration time over the increase of thesize of each ontology The number of ontologies is fixed at 10

0

500

1000

1500

2000

2500In

tegr

atio

n tim

e (s)

10 20 40 70 80 90 100Number of ontologies

30 50 60

FastApproximateBasic

Greedy

Figure 9 The change of integration time over the increase of thenumber of ontologies The size of each ontology is fixed at 100

integration provides novel knowledge of association betweenontology terms That is even if two terms are not that closeaccording to some closeness measurement they can be struc-turally associated under their ontology context For examplethe GO term ldquobiological processrdquo is merged with the NDFRTterm ldquochemical ingredientsrdquo even their closeness score iszero from the onGrid output However such integrationis interesting because it shows that the merging is tryingto link the chemical compounds with the biologicalcellularprocesses so that corresponding associations between thecellular processes and chemical structures can be establishedThis demonstrates the purpose of integrating two ontologiesthat is identifying associations with respect to both termsimilarities and structural contexts

BioMed Research International 13

component

Biologicalprocess Cellular

SignalingNeoplasms

National Drug File- Reference Terminology

Chemical ingredients

Diseases manifestations or physiologic states

CarbohydratesResponse to

stimulusOrganic chemicals

Cell part Extracellular region part

Skin and connective

tissue diseases

[0 7403]

[0 4833] [0 2315]

[00042 262] [00016 215][014 899] [0 307]

Amino acids peptides and

proteins

Cellular process Proteins

Cell adhesion involved in heart morphogenesis

Zyxin

Synapse Nervous system diseases

Asymmetric synapse Brain

neoplasmsSymmetric

synapse Granulomatousdisease chronic

[0036 2009]

[0023 0023]

[0034 0034] [0023 0023] [0006 0006]

[0098 062]

ontology

middot middot middot

middot middot middot

middot middot middot middot middot middot middot middot middot

middot middot middot middot middot middot middot middot middot

middot middot middotmiddot middot middot

Gene

NeuronalRNA

granule

Traumanervoussystem

Figure 10 A snapshot of ontology integration between GO and NDFRT by Approximate (depth = 6) Blue nodes are terms from GO andred nodes are terms from NDFRT Solid lines are edges and dashed lines are paths consisting of 1 or more edges In each bracket the left partis the closeness score and the right part is the cohesion score

In fact there are multiple studies to justify the struc-tural associations seen in Figure 10 such as the associationbetween ldquosignalingrdquo and ldquocarbohydratesrdquo [28] and the asso-ciation between ldquoextracellular region partrdquo and ldquoskin andconnective tissue diseasesrdquo [29]

In addition we have noticed a number of meaningfulintegrations betweenGO terms and neurological terms in theNDFRT For example synapse is a brain related structure andthe term ldquosymmetric synapserdquo is associated with ldquotraumardquoand the term ldquoasymmetric synapserdquo is associated with ldquobrainneoplasmsrdquo Similarly it is reasonable to see that ldquoneuronalRNA granulerdquo is integrated with ldquogranulomatous diseaserdquo agranule associated disease As another example it is veryinteresting to notice that ldquozyxinrdquo is associated with ldquocelladhesion involved in heart morphogenesisrdquo and that providesa link with the formation of heart

The above observations suggest a novel way of using ourontology integration method to perform association studiesbetween biomedical concepts

33 Integrating Multiple Ontologies In the following exper-iments we will study the performances of Basic Greedyand FastApproximate in integrating multiple ontologiesAll the three approaches are built upon Approximate whichperforms very well in the previous study for integrating tworeal ontology datasets

We use two sets of synthetic datasets in this study In thefirst set of datasets we fix the number of ontologies to be 10and vary the size of each ontology from 100 to 1000 In thesecond set of datasets we fix the size of each ontology to be100 and vary the number of ontologies from 10 to 100 All theontologies are randomly generated by constructing aminimalspanning tree from a randommatrixThe relationshipmatrixbetween every pair of ontologies is also randomly generatedwith entry values ranging from 0 to 1 For each experiment

we generate 10 randomdatasets and the results reported in thefollowing are the average results over the 10 random datasets

The overall cohesion scores of the three approachesover different ontology sizes and over different numbers ofontologies were reported in Figures 6 and 7 respectivelyFastApproximate outperforms all the other approachesin Figure 6 which is consistent with the analysis of itsapproximation rate However Greedy slightly outperformsFastApproximate in Figure 7 especially when the ontologynumber is large This is understandable because when thenumber of ontology (119896) increases the approximation rate (asstated in Theorem 5) decreases and becomes less significantThis result also justifies the choice of adjusted cohesion scorefor Greedy as described at the end of Section 221

The integration time of the three approaches over differ-ent ontology sizes and over different numbers of ontologieswas reported in Figures 8 and 9 respectively These figuresare consistent with the time complexity analysis given inSection 223 In particular we noticed that the integrationtime of Greedy deteriorates sharply over the ontologynumber increase In contrast FastApproximate is muchmore scalable and has a time curve similar to Basic

These results suggest that FastApproximate has the bestoverall performance in integrating multiple ontologies

4 Conclusions

In this work we started with a basic problem on integrating apair of ontology tree structures with a given closeness matrixand later we advanced the basic problem to the problem ofintegrating large number of ontologies We proved optimalstructures in the basic problem and developed both optimaland efficient approximation solutions Although the multipleontology integration problem has similar optimal structuresit is not feasible to extend the optimal and efficient approx-imation solutions for the basic problem to efficiently handle

14 BioMed Research International

multiple ontology integration To tackle the challenge of inte-grating a large number of ontologies we developed both aneffective greedy approach and a fast approximation approachThe empirical study not only confirms our analysis on theefficiency of the proposedmethod but also demonstrates thatourmethod can be used effectively for biomedical associationstudies

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

References

[1] D L McGuinness F van Harmelen andM K Smith Owl webontology language overview W3C recommendation 10(2004-03)10 2004

[2] O Bodenreider ldquoThe unifiedmedical language system (UMLS)integrating biomedical terminologyrdquo Nucleic Acids Researchvol 32 supplement 1 pp D267ndashD270 2004

[3] X Wu R Jiang M Q Zhang and S Li ldquoNetwork-based globalinference of human disease genesrdquo Molecular Systems Biologyvol 4 no 1 2008

[4] B Linghu E S Snitkin Z Hu Y Xia and C DeLisi ldquoGenome-wide prioritization of disease genes and identification ofdisease-disease associations from an integrated human func-tional linkage networkrdquo Genome Biology vol 10 no 9 articleR91 2009

[5] D Botstein and N Risch ldquoDiscovering genotypes underlyinghumanphenotypes past successes formendelian disease futureapproaches for complex diseaserdquo Nature Genetics vol 33 pp228ndash237 2003

[6] H S Pinto and J P Martins ldquoA methodology for ontologyintegrationrdquo in Proceedings of the 1st International Conferenceon Knowledge Capture (K-CAP rsquo01) pp 131ndash138 ACM 2001

[7] Y Xiang K Lu S L James T B Borlawsky K Huang and P RO Payne ldquoK-Neighborhood decentralization a comprehensivesolution to index the UMLS for large scale knowledge discov-eryrdquo Journal of Biomedical Informatics vol 45 no 2 pp 323ndash336 2012

[8] J Dutkowski M Kramer M A Surma et al ldquoA gene ontologyinferred from molecular networksrdquo Nature Biotechnology vol31 no 1 pp 38ndash45 2013

[9] R Navigli P Velardi and A Gangemi ldquoOntology learningand its application to automated terminology translationrdquo IEEEIntelligent Systems vol 18 no 1 pp 22ndash31 2003

[10] J Jannink andGWiederhold ldquoThesaurus entry extraction froman on-line dictionaryrdquo in Proceedings of the Fusion ConferenceSunnyvale Calif USA 1999

[11] C Papatheodorou A Vassiliou and B Simon ldquoDiscovery ofontologies for learning resources using word-based clusteringrdquoin World Conference on Educational Multimedia Hypermediaand Telecommunications pp 1523ndash1528 2002

[12] D L Rubin M Hewett D E Oliver T E Klein and RB Altman ldquoAutomating data acquisition into ontologies frompharmacogenetics relational data sources using declarativeobject definitions and xmlrdquoPacific SymposiumonBiocomputingpp 88ndash99 2001

[13] N F Noy ldquoSemantic integration a survey of ontology-basedapproachesrdquo ACM SIGMOD Record vol 33 no 4 pp 65ndash702004

[14] S Abels L Haak and A Hahn ldquoIdentification of commonmethods used for ontology integration tasksrdquo in Proceedingsof the 1st International Workshop on Interoperability of Het-erogeneous Information Systems (IHIS rsquo05) pp 75ndash78 ACMNovember 2005

[15] A Gangemi D Pisanelli and G Steve ldquoOntology integrationexperiences withmedical terminologiesrdquo in Formal Ontology inInformation Systems vol 46 pp 98ndash94 IOS Press AmsterdamThe Netherlands 1998

[16] N F Noy and M A Musen ldquoSMART automated support forontology merging and alignmentrdquo in Proceedings of the 12thWorkshop on Knowledge Acquisition Modelling and Manage-ment (KAW 99) Banff Canada 1999

[17] A Doan J Madhavan P Domingos and A Halevy ldquoLearningtomap between ontologies on the semantic webrdquo in Proceedingsof the 11th International Conference onWorldWideWeb (WWWrsquo02) pp 662ndash673 ACM May 2002

[18] J Xie F Liu and S-U Guan ldquoTree-structure based ontologyintegrationrdquo Journal of Information Science vol 37 no 6 pp594ndash613 2011

[19] D Calvanese G De Giacomo andM Lenzerini ldquoA frameworkfor ontology integrationrdquo inTheEmerging SemanticWebSelectedPapers from the First Semantic Web Working Symposium pp201ndash214 2002

[20] O Udrea L Getoor and R J Miller ldquoLeveraging data andstructure in ontology integrationrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo07) pp 449ndash460 ACM June 2007

[21] H W Kuhn ldquoThe Hungarian method for the assignmentproblemrdquoNaval Research Logistics Quarterly vol 1-2 pp 83ndash971955

[22] D E Drake Vinkemeier and S Hougardy ldquoA linear-timeapproximation algorithm for weighted matchings in graphsrdquoACMTransactions on Algorithms vol 1 no 1 pp 107ndash122 2005

[23] A J J Wood W E Evans and H L McLeod ldquoPharmacoge-nomicsmdashdrug disposition drug targets and side effectsrdquo TheNew England Journal of Medicine vol 348 no 6 pp 538ndash5492003

[24] R B Altman ldquoPharmgkb a logical home for knowledge relatinggenotype to drug response phenotyperdquoNature Genetics vol 39no 4 article 426 2007

[25] R B Kim D OrsquoShea and G R Wilkinson ldquoInterindivid-ual variability of chlorzoxazone 6-hydroxylation in men andwomen and its relationship to CYP2E1 genetic polymorphismsrdquoClinical Pharmacology amp Therapeutics vol 57 no 6 pp 645ndash655 1995

[26] T N Ferraro and R J Buono ldquoThe relationship between thepharmacology of antiepileptic drugs and human gene variationan overviewrdquo Epilepsy and Behavior vol 7 no 1 pp 18ndash36 2005

[27] A Albin X Ji T B Borlawsky et al ldquoEnabling online studiesof conceptual relationships between medical terms developingan efficient web platformrdquo JMIR Medical Informatics vol 2 no2 article e23 2014

[28] S Chandrasekaran J W Dean III M S Giniger and M LTanzer ldquoLaminin carbohydrates are implicated in cell signal-ingrdquo Journal of Cellular Biochemistry vol 46 no 2 pp 115ndash1241991

[29] J Uitto and D Kouba ldquoCytokine modulation of extracellularmatrix gene expression relevance to fibrotic skin diseasesrdquoJournal of Dermatological Science vol 24 pp S60ndashS69 2000

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 11: Research Article Building Integrated Ontological Knowledge

BioMed Research International 11

Table 1 Cohesion scores of integrating real datasets

Depth GO term number NDFRT term number Cohesion scoresHeuristic 120573 = 1 Heuristic 120573 = 6 Heuristic 120573 = 100 Approximate

3 66 6004 00505331 0229958 021999 1246964 710 6972 0290392 0284363 146585 938355 5355 14582 0285923 137714 0528289 3390566 16231 32841 0307941 0341588 0673406 740293

Table 2 Top 5 matched terms by the Approximate algorithm (depth = 6)

Rank GO terms NDFRT terms Closeness score1 C1135918 smooth muscle contractile fiber C0282606 muscle neoplasms 1632052 C0010813 cytokinesis C0086376 GTP-binding proteins 1159673 C0027747 axon terminus C0030584 parovarian cyst 1133524 C1155065 T cell activation C0007082 carcinoembryonic antigen 1008795 C0007595 cell growth C0294028 BRCA2 protein 0945284

Table 3 Top 5 matched terms by the Heuristic algorithm (depth = 6 120573 = 100)

Rank GO terms NDFRT terms Closeness score1 C0031845 biological process C0042890 VITAMINS 02448622 C1166607 cellular component C1657248 apoptosome 01428573 C0027540 tissue death C0065932 MENADIOL 004342194 C0025519 metabolic process C0042849 VITAMIN B 003702485 C0030012 oxidation-reduction process C0027996 NICOTINIC ACID 00327109

in Table 1 To observe the integration over the ontologysize change in each experiment we use the ontology treestructure from the root to the specified depth (first columnin Table 1) for integration The sizes of the ontology termsinvolved in the integration are listed in the second and thirdcolumns of Table 1

Recall in Section 211 120573rdepth(119887) is used to regulate theselection of vertices from high depths Thus we tested theHeuristic under 120573 = 1 (depth information is nullified)and 120573 = 100 (the vertex depth plays a critical role inthe selection) Since nonleaf vertices in these datasets havearound 6 children on average we heuristically add a set ofexperiments by setting 120573 = 6 so that 120573rdepth(119887) will be close tothe number of vertices excluded from the future integration

From Table 1 we can see that Heuristic performs betterwhen using the depth information to regulate the selectionof vertices However Approximate is much better thanHeuristic at all settings Compared to the best cohesionscores of Heuristic in each row of Table 1 Approximateconstructs an integrated ontology with the overall cohe-sion score ranging from 54 times to 1099 times that ofHeuristicThis clearly demonstrates the effectiveness of theproposed Approximate approach Nevertheless the heuris-tic approach has a much faster average running time as aresult of excluding a large number of matching opportunitiesin each step

Although the running time of Approximate is longerthan Heuristic it takes less than two hours to finishintegrating two ontologies with about 16119896 and 33119896 vertices

Most of the biomedical ontologies are smaller than or similarto these sizes and Approximate approach will benefit theassociation study of these ontologies For extremely largeontology pairs in which Approximate is unable to finish theintegrationwithin a reasonable time Heuristicmay providea quick view on their integration

32 Understanding the Merged Ontology Terms To under-stand what terms are merged in integrating real ontologieswe use the integration of GO and NDFRT at depth 6 asan example Tables 2 and 3 list the top 5 pairs of mergedterms (sorted by their closeness scores) by Heuristicand Approximate respectively As mentioned abovethese scores are from the closeness matrix generatedby onGrid based on the discovered paths betweenthem For example ldquoC1155065T-Cell Activation minus minusis physiologic effect of chemical or drug minus minus gt C0393002Carcinoembryonic Antigen Peptide 1 minus minus has target minus minus gtC0007082Carcinoembryonic Antigenrdquo is such a path

From Tables 2 and 3 we can observe that theApproximate algorithm merges terms with much highersimilar scores than the Heuristic algorithm Quiteinterestingly we observed that the top ranked merging inTable 3 is between ldquobiological processrdquo and ldquoVITAMINSrdquoThe ldquobiological processrdquo is an abstract term which is veryclose to the root of the GO ontology Such a fact suggests thatthe top level terms will likely preempt the merging choicesover their descendants As a result of this greedy approach

12 BioMed Research International

0

2000

4000

6000

8000

10000

12000

14000

16000

18000

100 200 300 400 500 600 700 800 900 1000

Ove

rall

cohe

sion

scor

e of i

nteg

ratio

n

Individual ontology size

FastApproximateBasic

Greedy

Figure 6 The change of overall cohesion score over the increase ofthe size of each ontology The number of ontologies is fixed at 10

0

100000

200000

300000

400000

500000

600000

700000

800000

10 20 40 70 80 90 100

Ove

rall

cohe

sion

scor

e of i

nteg

ratio

n

Number of ontologies 30 50 60

FastApproximateBasic

Greedy

Figure 7 The change of overall cohesion score over the increase ofthe number of ontologies The size of each ontology is fixed at 100

the Heuristic algorithm will end at a local optimum whichis far from being optimal

A snapshot of ontology integration by Approximate asshown in Figure 10 provides a good insight on the algorithmwork In each bracket of two merged terms the left partis the closeness score and the right part is the cohesionscore We can observe that most closeness scores are zero orclose to zero while the corresponding cohesion scores aremuch higher This is understandable because the snapshot isprimarily on the top level terms of both ontologies For theseterms they have a large number of subclass (descendant)terms and optimizing the integration of their subclass termsfar outweighs integrating of themselves The result of such

0

200

400

600

800

1000

1200

1400

1600

100 200 300 400 500 600 700 800 900 1000

Inte

grat

ion

time (

s)

Individual ontology size

FastApproximateBasic

Greedy

Figure 8 The change of integration time over the increase of thesize of each ontology The number of ontologies is fixed at 10

0

500

1000

1500

2000

2500In

tegr

atio

n tim

e (s)

10 20 40 70 80 90 100Number of ontologies

30 50 60

FastApproximateBasic

Greedy

Figure 9 The change of integration time over the increase of thenumber of ontologies The size of each ontology is fixed at 100

integration provides novel knowledge of association betweenontology terms That is even if two terms are not that closeaccording to some closeness measurement they can be struc-turally associated under their ontology context For examplethe GO term ldquobiological processrdquo is merged with the NDFRTterm ldquochemical ingredientsrdquo even their closeness score iszero from the onGrid output However such integrationis interesting because it shows that the merging is tryingto link the chemical compounds with the biologicalcellularprocesses so that corresponding associations between thecellular processes and chemical structures can be establishedThis demonstrates the purpose of integrating two ontologiesthat is identifying associations with respect to both termsimilarities and structural contexts

BioMed Research International 13

component

Biologicalprocess Cellular

SignalingNeoplasms

National Drug File- Reference Terminology

Chemical ingredients

Diseases manifestations or physiologic states

CarbohydratesResponse to

stimulusOrganic chemicals

Cell part Extracellular region part

Skin and connective

tissue diseases

[0 7403]

[0 4833] [0 2315]

[00042 262] [00016 215][014 899] [0 307]

Amino acids peptides and

proteins

Cellular process Proteins

Cell adhesion involved in heart morphogenesis

Zyxin

Synapse Nervous system diseases

Asymmetric synapse Brain

neoplasmsSymmetric

synapse Granulomatousdisease chronic

[0036 2009]

[0023 0023]

[0034 0034] [0023 0023] [0006 0006]

[0098 062]

ontology

middot middot middot

middot middot middot

middot middot middot middot middot middot middot middot middot

middot middot middot middot middot middot middot middot middot

middot middot middotmiddot middot middot

Gene

NeuronalRNA

granule

Traumanervoussystem

Figure 10 A snapshot of ontology integration between GO and NDFRT by Approximate (depth = 6) Blue nodes are terms from GO andred nodes are terms from NDFRT Solid lines are edges and dashed lines are paths consisting of 1 or more edges In each bracket the left partis the closeness score and the right part is the cohesion score

In fact there are multiple studies to justify the struc-tural associations seen in Figure 10 such as the associationbetween ldquosignalingrdquo and ldquocarbohydratesrdquo [28] and the asso-ciation between ldquoextracellular region partrdquo and ldquoskin andconnective tissue diseasesrdquo [29]

In addition we have noticed a number of meaningfulintegrations betweenGO terms and neurological terms in theNDFRT For example synapse is a brain related structure andthe term ldquosymmetric synapserdquo is associated with ldquotraumardquoand the term ldquoasymmetric synapserdquo is associated with ldquobrainneoplasmsrdquo Similarly it is reasonable to see that ldquoneuronalRNA granulerdquo is integrated with ldquogranulomatous diseaserdquo agranule associated disease As another example it is veryinteresting to notice that ldquozyxinrdquo is associated with ldquocelladhesion involved in heart morphogenesisrdquo and that providesa link with the formation of heart

The above observations suggest a novel way of using ourontology integration method to perform association studiesbetween biomedical concepts

33 Integrating Multiple Ontologies In the following exper-iments we will study the performances of Basic Greedyand FastApproximate in integrating multiple ontologiesAll the three approaches are built upon Approximate whichperforms very well in the previous study for integrating tworeal ontology datasets

We use two sets of synthetic datasets in this study In thefirst set of datasets we fix the number of ontologies to be 10and vary the size of each ontology from 100 to 1000 In thesecond set of datasets we fix the size of each ontology to be100 and vary the number of ontologies from 10 to 100 All theontologies are randomly generated by constructing aminimalspanning tree from a randommatrixThe relationshipmatrixbetween every pair of ontologies is also randomly generatedwith entry values ranging from 0 to 1 For each experiment

we generate 10 randomdatasets and the results reported in thefollowing are the average results over the 10 random datasets

The overall cohesion scores of the three approachesover different ontology sizes and over different numbers ofontologies were reported in Figures 6 and 7 respectivelyFastApproximate outperforms all the other approachesin Figure 6 which is consistent with the analysis of itsapproximation rate However Greedy slightly outperformsFastApproximate in Figure 7 especially when the ontologynumber is large This is understandable because when thenumber of ontology (119896) increases the approximation rate (asstated in Theorem 5) decreases and becomes less significantThis result also justifies the choice of adjusted cohesion scorefor Greedy as described at the end of Section 221

The integration time of the three approaches over differ-ent ontology sizes and over different numbers of ontologieswas reported in Figures 8 and 9 respectively These figuresare consistent with the time complexity analysis given inSection 223 In particular we noticed that the integrationtime of Greedy deteriorates sharply over the ontologynumber increase In contrast FastApproximate is muchmore scalable and has a time curve similar to Basic

These results suggest that FastApproximate has the bestoverall performance in integrating multiple ontologies

4 Conclusions

In this work we started with a basic problem on integrating apair of ontology tree structures with a given closeness matrixand later we advanced the basic problem to the problem ofintegrating large number of ontologies We proved optimalstructures in the basic problem and developed both optimaland efficient approximation solutions Although the multipleontology integration problem has similar optimal structuresit is not feasible to extend the optimal and efficient approx-imation solutions for the basic problem to efficiently handle

14 BioMed Research International

multiple ontology integration To tackle the challenge of inte-grating a large number of ontologies we developed both aneffective greedy approach and a fast approximation approachThe empirical study not only confirms our analysis on theefficiency of the proposedmethod but also demonstrates thatourmethod can be used effectively for biomedical associationstudies

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

References

[1] D L McGuinness F van Harmelen andM K Smith Owl webontology language overview W3C recommendation 10(2004-03)10 2004

[2] O Bodenreider ldquoThe unifiedmedical language system (UMLS)integrating biomedical terminologyrdquo Nucleic Acids Researchvol 32 supplement 1 pp D267ndashD270 2004

[3] X Wu R Jiang M Q Zhang and S Li ldquoNetwork-based globalinference of human disease genesrdquo Molecular Systems Biologyvol 4 no 1 2008

[4] B Linghu E S Snitkin Z Hu Y Xia and C DeLisi ldquoGenome-wide prioritization of disease genes and identification ofdisease-disease associations from an integrated human func-tional linkage networkrdquo Genome Biology vol 10 no 9 articleR91 2009

[5] D Botstein and N Risch ldquoDiscovering genotypes underlyinghumanphenotypes past successes formendelian disease futureapproaches for complex diseaserdquo Nature Genetics vol 33 pp228ndash237 2003

[6] H S Pinto and J P Martins ldquoA methodology for ontologyintegrationrdquo in Proceedings of the 1st International Conferenceon Knowledge Capture (K-CAP rsquo01) pp 131ndash138 ACM 2001

[7] Y Xiang K Lu S L James T B Borlawsky K Huang and P RO Payne ldquoK-Neighborhood decentralization a comprehensivesolution to index the UMLS for large scale knowledge discov-eryrdquo Journal of Biomedical Informatics vol 45 no 2 pp 323ndash336 2012

[8] J Dutkowski M Kramer M A Surma et al ldquoA gene ontologyinferred from molecular networksrdquo Nature Biotechnology vol31 no 1 pp 38ndash45 2013

[9] R Navigli P Velardi and A Gangemi ldquoOntology learningand its application to automated terminology translationrdquo IEEEIntelligent Systems vol 18 no 1 pp 22ndash31 2003

[10] J Jannink andGWiederhold ldquoThesaurus entry extraction froman on-line dictionaryrdquo in Proceedings of the Fusion ConferenceSunnyvale Calif USA 1999

[11] C Papatheodorou A Vassiliou and B Simon ldquoDiscovery ofontologies for learning resources using word-based clusteringrdquoin World Conference on Educational Multimedia Hypermediaand Telecommunications pp 1523ndash1528 2002

[12] D L Rubin M Hewett D E Oliver T E Klein and RB Altman ldquoAutomating data acquisition into ontologies frompharmacogenetics relational data sources using declarativeobject definitions and xmlrdquoPacific SymposiumonBiocomputingpp 88ndash99 2001

[13] N F Noy ldquoSemantic integration a survey of ontology-basedapproachesrdquo ACM SIGMOD Record vol 33 no 4 pp 65ndash702004

[14] S Abels L Haak and A Hahn ldquoIdentification of commonmethods used for ontology integration tasksrdquo in Proceedingsof the 1st International Workshop on Interoperability of Het-erogeneous Information Systems (IHIS rsquo05) pp 75ndash78 ACMNovember 2005

[15] A Gangemi D Pisanelli and G Steve ldquoOntology integrationexperiences withmedical terminologiesrdquo in Formal Ontology inInformation Systems vol 46 pp 98ndash94 IOS Press AmsterdamThe Netherlands 1998

[16] N F Noy and M A Musen ldquoSMART automated support forontology merging and alignmentrdquo in Proceedings of the 12thWorkshop on Knowledge Acquisition Modelling and Manage-ment (KAW 99) Banff Canada 1999

[17] A Doan J Madhavan P Domingos and A Halevy ldquoLearningtomap between ontologies on the semantic webrdquo in Proceedingsof the 11th International Conference onWorldWideWeb (WWWrsquo02) pp 662ndash673 ACM May 2002

[18] J Xie F Liu and S-U Guan ldquoTree-structure based ontologyintegrationrdquo Journal of Information Science vol 37 no 6 pp594ndash613 2011

[19] D Calvanese G De Giacomo andM Lenzerini ldquoA frameworkfor ontology integrationrdquo inTheEmerging SemanticWebSelectedPapers from the First Semantic Web Working Symposium pp201ndash214 2002

[20] O Udrea L Getoor and R J Miller ldquoLeveraging data andstructure in ontology integrationrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo07) pp 449ndash460 ACM June 2007

[21] H W Kuhn ldquoThe Hungarian method for the assignmentproblemrdquoNaval Research Logistics Quarterly vol 1-2 pp 83ndash971955

[22] D E Drake Vinkemeier and S Hougardy ldquoA linear-timeapproximation algorithm for weighted matchings in graphsrdquoACMTransactions on Algorithms vol 1 no 1 pp 107ndash122 2005

[23] A J J Wood W E Evans and H L McLeod ldquoPharmacoge-nomicsmdashdrug disposition drug targets and side effectsrdquo TheNew England Journal of Medicine vol 348 no 6 pp 538ndash5492003

[24] R B Altman ldquoPharmgkb a logical home for knowledge relatinggenotype to drug response phenotyperdquoNature Genetics vol 39no 4 article 426 2007

[25] R B Kim D OrsquoShea and G R Wilkinson ldquoInterindivid-ual variability of chlorzoxazone 6-hydroxylation in men andwomen and its relationship to CYP2E1 genetic polymorphismsrdquoClinical Pharmacology amp Therapeutics vol 57 no 6 pp 645ndash655 1995

[26] T N Ferraro and R J Buono ldquoThe relationship between thepharmacology of antiepileptic drugs and human gene variationan overviewrdquo Epilepsy and Behavior vol 7 no 1 pp 18ndash36 2005

[27] A Albin X Ji T B Borlawsky et al ldquoEnabling online studiesof conceptual relationships between medical terms developingan efficient web platformrdquo JMIR Medical Informatics vol 2 no2 article e23 2014

[28] S Chandrasekaran J W Dean III M S Giniger and M LTanzer ldquoLaminin carbohydrates are implicated in cell signal-ingrdquo Journal of Cellular Biochemistry vol 46 no 2 pp 115ndash1241991

[29] J Uitto and D Kouba ldquoCytokine modulation of extracellularmatrix gene expression relevance to fibrotic skin diseasesrdquoJournal of Dermatological Science vol 24 pp S60ndashS69 2000

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 12: Research Article Building Integrated Ontological Knowledge

12 BioMed Research International

0

2000

4000

6000

8000

10000

12000

14000

16000

18000

100 200 300 400 500 600 700 800 900 1000

Ove

rall

cohe

sion

scor

e of i

nteg

ratio

n

Individual ontology size

FastApproximateBasic

Greedy

Figure 6 The change of overall cohesion score over the increase ofthe size of each ontology The number of ontologies is fixed at 10

0

100000

200000

300000

400000

500000

600000

700000

800000

10 20 40 70 80 90 100

Ove

rall

cohe

sion

scor

e of i

nteg

ratio

n

Number of ontologies 30 50 60

FastApproximateBasic

Greedy

Figure 7 The change of overall cohesion score over the increase ofthe number of ontologies The size of each ontology is fixed at 100

the Heuristic algorithm will end at a local optimum whichis far from being optimal

A snapshot of ontology integration by Approximate asshown in Figure 10 provides a good insight on the algorithmwork In each bracket of two merged terms the left partis the closeness score and the right part is the cohesionscore We can observe that most closeness scores are zero orclose to zero while the corresponding cohesion scores aremuch higher This is understandable because the snapshot isprimarily on the top level terms of both ontologies For theseterms they have a large number of subclass (descendant)terms and optimizing the integration of their subclass termsfar outweighs integrating of themselves The result of such

0

200

400

600

800

1000

1200

1400

1600

100 200 300 400 500 600 700 800 900 1000

Inte

grat

ion

time (

s)

Individual ontology size

FastApproximateBasic

Greedy

Figure 8 The change of integration time over the increase of thesize of each ontology The number of ontologies is fixed at 10

0

500

1000

1500

2000

2500In

tegr

atio

n tim

e (s)

10 20 40 70 80 90 100Number of ontologies

30 50 60

FastApproximateBasic

Greedy

Figure 9 The change of integration time over the increase of thenumber of ontologies The size of each ontology is fixed at 100

integration provides novel knowledge of association betweenontology terms That is even if two terms are not that closeaccording to some closeness measurement they can be struc-turally associated under their ontology context For examplethe GO term ldquobiological processrdquo is merged with the NDFRTterm ldquochemical ingredientsrdquo even their closeness score iszero from the onGrid output However such integrationis interesting because it shows that the merging is tryingto link the chemical compounds with the biologicalcellularprocesses so that corresponding associations between thecellular processes and chemical structures can be establishedThis demonstrates the purpose of integrating two ontologiesthat is identifying associations with respect to both termsimilarities and structural contexts

BioMed Research International 13

component

Biologicalprocess Cellular

SignalingNeoplasms

National Drug File- Reference Terminology

Chemical ingredients

Diseases manifestations or physiologic states

CarbohydratesResponse to

stimulusOrganic chemicals

Cell part Extracellular region part

Skin and connective

tissue diseases

[0 7403]

[0 4833] [0 2315]

[00042 262] [00016 215][014 899] [0 307]

Amino acids peptides and

proteins

Cellular process Proteins

Cell adhesion involved in heart morphogenesis

Zyxin

Synapse Nervous system diseases

Asymmetric synapse Brain

neoplasmsSymmetric

synapse Granulomatousdisease chronic

[0036 2009]

[0023 0023]

[0034 0034] [0023 0023] [0006 0006]

[0098 062]

ontology

middot middot middot

middot middot middot

middot middot middot middot middot middot middot middot middot

middot middot middot middot middot middot middot middot middot

middot middot middotmiddot middot middot

Gene

NeuronalRNA

granule

Traumanervoussystem

Figure 10 A snapshot of ontology integration between GO and NDFRT by Approximate (depth = 6) Blue nodes are terms from GO andred nodes are terms from NDFRT Solid lines are edges and dashed lines are paths consisting of 1 or more edges In each bracket the left partis the closeness score and the right part is the cohesion score

In fact there are multiple studies to justify the struc-tural associations seen in Figure 10 such as the associationbetween ldquosignalingrdquo and ldquocarbohydratesrdquo [28] and the asso-ciation between ldquoextracellular region partrdquo and ldquoskin andconnective tissue diseasesrdquo [29]

In addition we have noticed a number of meaningfulintegrations betweenGO terms and neurological terms in theNDFRT For example synapse is a brain related structure andthe term ldquosymmetric synapserdquo is associated with ldquotraumardquoand the term ldquoasymmetric synapserdquo is associated with ldquobrainneoplasmsrdquo Similarly it is reasonable to see that ldquoneuronalRNA granulerdquo is integrated with ldquogranulomatous diseaserdquo agranule associated disease As another example it is veryinteresting to notice that ldquozyxinrdquo is associated with ldquocelladhesion involved in heart morphogenesisrdquo and that providesa link with the formation of heart

The above observations suggest a novel way of using ourontology integration method to perform association studiesbetween biomedical concepts

33 Integrating Multiple Ontologies In the following exper-iments we will study the performances of Basic Greedyand FastApproximate in integrating multiple ontologiesAll the three approaches are built upon Approximate whichperforms very well in the previous study for integrating tworeal ontology datasets

We use two sets of synthetic datasets in this study In thefirst set of datasets we fix the number of ontologies to be 10and vary the size of each ontology from 100 to 1000 In thesecond set of datasets we fix the size of each ontology to be100 and vary the number of ontologies from 10 to 100 All theontologies are randomly generated by constructing aminimalspanning tree from a randommatrixThe relationshipmatrixbetween every pair of ontologies is also randomly generatedwith entry values ranging from 0 to 1 For each experiment

we generate 10 randomdatasets and the results reported in thefollowing are the average results over the 10 random datasets

The overall cohesion scores of the three approachesover different ontology sizes and over different numbers ofontologies were reported in Figures 6 and 7 respectivelyFastApproximate outperforms all the other approachesin Figure 6 which is consistent with the analysis of itsapproximation rate However Greedy slightly outperformsFastApproximate in Figure 7 especially when the ontologynumber is large This is understandable because when thenumber of ontology (119896) increases the approximation rate (asstated in Theorem 5) decreases and becomes less significantThis result also justifies the choice of adjusted cohesion scorefor Greedy as described at the end of Section 221

The integration time of the three approaches over differ-ent ontology sizes and over different numbers of ontologieswas reported in Figures 8 and 9 respectively These figuresare consistent with the time complexity analysis given inSection 223 In particular we noticed that the integrationtime of Greedy deteriorates sharply over the ontologynumber increase In contrast FastApproximate is muchmore scalable and has a time curve similar to Basic

These results suggest that FastApproximate has the bestoverall performance in integrating multiple ontologies

4 Conclusions

In this work we started with a basic problem on integrating apair of ontology tree structures with a given closeness matrixand later we advanced the basic problem to the problem ofintegrating large number of ontologies We proved optimalstructures in the basic problem and developed both optimaland efficient approximation solutions Although the multipleontology integration problem has similar optimal structuresit is not feasible to extend the optimal and efficient approx-imation solutions for the basic problem to efficiently handle

14 BioMed Research International

multiple ontology integration To tackle the challenge of inte-grating a large number of ontologies we developed both aneffective greedy approach and a fast approximation approachThe empirical study not only confirms our analysis on theefficiency of the proposedmethod but also demonstrates thatourmethod can be used effectively for biomedical associationstudies

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

References

[1] D L McGuinness F van Harmelen andM K Smith Owl webontology language overview W3C recommendation 10(2004-03)10 2004

[2] O Bodenreider ldquoThe unifiedmedical language system (UMLS)integrating biomedical terminologyrdquo Nucleic Acids Researchvol 32 supplement 1 pp D267ndashD270 2004

[3] X Wu R Jiang M Q Zhang and S Li ldquoNetwork-based globalinference of human disease genesrdquo Molecular Systems Biologyvol 4 no 1 2008

[4] B Linghu E S Snitkin Z Hu Y Xia and C DeLisi ldquoGenome-wide prioritization of disease genes and identification ofdisease-disease associations from an integrated human func-tional linkage networkrdquo Genome Biology vol 10 no 9 articleR91 2009

[5] D Botstein and N Risch ldquoDiscovering genotypes underlyinghumanphenotypes past successes formendelian disease futureapproaches for complex diseaserdquo Nature Genetics vol 33 pp228ndash237 2003

[6] H S Pinto and J P Martins ldquoA methodology for ontologyintegrationrdquo in Proceedings of the 1st International Conferenceon Knowledge Capture (K-CAP rsquo01) pp 131ndash138 ACM 2001

[7] Y Xiang K Lu S L James T B Borlawsky K Huang and P RO Payne ldquoK-Neighborhood decentralization a comprehensivesolution to index the UMLS for large scale knowledge discov-eryrdquo Journal of Biomedical Informatics vol 45 no 2 pp 323ndash336 2012

[8] J Dutkowski M Kramer M A Surma et al ldquoA gene ontologyinferred from molecular networksrdquo Nature Biotechnology vol31 no 1 pp 38ndash45 2013

[9] R Navigli P Velardi and A Gangemi ldquoOntology learningand its application to automated terminology translationrdquo IEEEIntelligent Systems vol 18 no 1 pp 22ndash31 2003

[10] J Jannink andGWiederhold ldquoThesaurus entry extraction froman on-line dictionaryrdquo in Proceedings of the Fusion ConferenceSunnyvale Calif USA 1999

[11] C Papatheodorou A Vassiliou and B Simon ldquoDiscovery ofontologies for learning resources using word-based clusteringrdquoin World Conference on Educational Multimedia Hypermediaand Telecommunications pp 1523ndash1528 2002

[12] D L Rubin M Hewett D E Oliver T E Klein and RB Altman ldquoAutomating data acquisition into ontologies frompharmacogenetics relational data sources using declarativeobject definitions and xmlrdquoPacific SymposiumonBiocomputingpp 88ndash99 2001

[13] N F Noy ldquoSemantic integration a survey of ontology-basedapproachesrdquo ACM SIGMOD Record vol 33 no 4 pp 65ndash702004

[14] S Abels L Haak and A Hahn ldquoIdentification of commonmethods used for ontology integration tasksrdquo in Proceedingsof the 1st International Workshop on Interoperability of Het-erogeneous Information Systems (IHIS rsquo05) pp 75ndash78 ACMNovember 2005

[15] A Gangemi D Pisanelli and G Steve ldquoOntology integrationexperiences withmedical terminologiesrdquo in Formal Ontology inInformation Systems vol 46 pp 98ndash94 IOS Press AmsterdamThe Netherlands 1998

[16] N F Noy and M A Musen ldquoSMART automated support forontology merging and alignmentrdquo in Proceedings of the 12thWorkshop on Knowledge Acquisition Modelling and Manage-ment (KAW 99) Banff Canada 1999

[17] A Doan J Madhavan P Domingos and A Halevy ldquoLearningtomap between ontologies on the semantic webrdquo in Proceedingsof the 11th International Conference onWorldWideWeb (WWWrsquo02) pp 662ndash673 ACM May 2002

[18] J Xie F Liu and S-U Guan ldquoTree-structure based ontologyintegrationrdquo Journal of Information Science vol 37 no 6 pp594ndash613 2011

[19] D Calvanese G De Giacomo andM Lenzerini ldquoA frameworkfor ontology integrationrdquo inTheEmerging SemanticWebSelectedPapers from the First Semantic Web Working Symposium pp201ndash214 2002

[20] O Udrea L Getoor and R J Miller ldquoLeveraging data andstructure in ontology integrationrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo07) pp 449ndash460 ACM June 2007

[21] H W Kuhn ldquoThe Hungarian method for the assignmentproblemrdquoNaval Research Logistics Quarterly vol 1-2 pp 83ndash971955

[22] D E Drake Vinkemeier and S Hougardy ldquoA linear-timeapproximation algorithm for weighted matchings in graphsrdquoACMTransactions on Algorithms vol 1 no 1 pp 107ndash122 2005

[23] A J J Wood W E Evans and H L McLeod ldquoPharmacoge-nomicsmdashdrug disposition drug targets and side effectsrdquo TheNew England Journal of Medicine vol 348 no 6 pp 538ndash5492003

[24] R B Altman ldquoPharmgkb a logical home for knowledge relatinggenotype to drug response phenotyperdquoNature Genetics vol 39no 4 article 426 2007

[25] R B Kim D OrsquoShea and G R Wilkinson ldquoInterindivid-ual variability of chlorzoxazone 6-hydroxylation in men andwomen and its relationship to CYP2E1 genetic polymorphismsrdquoClinical Pharmacology amp Therapeutics vol 57 no 6 pp 645ndash655 1995

[26] T N Ferraro and R J Buono ldquoThe relationship between thepharmacology of antiepileptic drugs and human gene variationan overviewrdquo Epilepsy and Behavior vol 7 no 1 pp 18ndash36 2005

[27] A Albin X Ji T B Borlawsky et al ldquoEnabling online studiesof conceptual relationships between medical terms developingan efficient web platformrdquo JMIR Medical Informatics vol 2 no2 article e23 2014

[28] S Chandrasekaran J W Dean III M S Giniger and M LTanzer ldquoLaminin carbohydrates are implicated in cell signal-ingrdquo Journal of Cellular Biochemistry vol 46 no 2 pp 115ndash1241991

[29] J Uitto and D Kouba ldquoCytokine modulation of extracellularmatrix gene expression relevance to fibrotic skin diseasesrdquoJournal of Dermatological Science vol 24 pp S60ndashS69 2000

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 13: Research Article Building Integrated Ontological Knowledge

BioMed Research International 13

component

Biologicalprocess Cellular

SignalingNeoplasms

National Drug File- Reference Terminology

Chemical ingredients

Diseases manifestations or physiologic states

CarbohydratesResponse to

stimulusOrganic chemicals

Cell part Extracellular region part

Skin and connective

tissue diseases

[0 7403]

[0 4833] [0 2315]

[00042 262] [00016 215][014 899] [0 307]

Amino acids peptides and

proteins

Cellular process Proteins

Cell adhesion involved in heart morphogenesis

Zyxin

Synapse Nervous system diseases

Asymmetric synapse Brain

neoplasmsSymmetric

synapse Granulomatousdisease chronic

[0036 2009]

[0023 0023]

[0034 0034] [0023 0023] [0006 0006]

[0098 062]

ontology

middot middot middot

middot middot middot

middot middot middot middot middot middot middot middot middot

middot middot middot middot middot middot middot middot middot

middot middot middotmiddot middot middot

Gene

NeuronalRNA

granule

Traumanervoussystem

Figure 10 A snapshot of ontology integration between GO and NDFRT by Approximate (depth = 6) Blue nodes are terms from GO andred nodes are terms from NDFRT Solid lines are edges and dashed lines are paths consisting of 1 or more edges In each bracket the left partis the closeness score and the right part is the cohesion score

In fact there are multiple studies to justify the struc-tural associations seen in Figure 10 such as the associationbetween ldquosignalingrdquo and ldquocarbohydratesrdquo [28] and the asso-ciation between ldquoextracellular region partrdquo and ldquoskin andconnective tissue diseasesrdquo [29]

In addition we have noticed a number of meaningfulintegrations betweenGO terms and neurological terms in theNDFRT For example synapse is a brain related structure andthe term ldquosymmetric synapserdquo is associated with ldquotraumardquoand the term ldquoasymmetric synapserdquo is associated with ldquobrainneoplasmsrdquo Similarly it is reasonable to see that ldquoneuronalRNA granulerdquo is integrated with ldquogranulomatous diseaserdquo agranule associated disease As another example it is veryinteresting to notice that ldquozyxinrdquo is associated with ldquocelladhesion involved in heart morphogenesisrdquo and that providesa link with the formation of heart

The above observations suggest a novel way of using ourontology integration method to perform association studiesbetween biomedical concepts

33 Integrating Multiple Ontologies In the following exper-iments we will study the performances of Basic Greedyand FastApproximate in integrating multiple ontologiesAll the three approaches are built upon Approximate whichperforms very well in the previous study for integrating tworeal ontology datasets

We use two sets of synthetic datasets in this study In thefirst set of datasets we fix the number of ontologies to be 10and vary the size of each ontology from 100 to 1000 In thesecond set of datasets we fix the size of each ontology to be100 and vary the number of ontologies from 10 to 100 All theontologies are randomly generated by constructing aminimalspanning tree from a randommatrixThe relationshipmatrixbetween every pair of ontologies is also randomly generatedwith entry values ranging from 0 to 1 For each experiment

we generate 10 randomdatasets and the results reported in thefollowing are the average results over the 10 random datasets

The overall cohesion scores of the three approachesover different ontology sizes and over different numbers ofontologies were reported in Figures 6 and 7 respectivelyFastApproximate outperforms all the other approachesin Figure 6 which is consistent with the analysis of itsapproximation rate However Greedy slightly outperformsFastApproximate in Figure 7 especially when the ontologynumber is large This is understandable because when thenumber of ontology (119896) increases the approximation rate (asstated in Theorem 5) decreases and becomes less significantThis result also justifies the choice of adjusted cohesion scorefor Greedy as described at the end of Section 221

The integration time of the three approaches over differ-ent ontology sizes and over different numbers of ontologieswas reported in Figures 8 and 9 respectively These figuresare consistent with the time complexity analysis given inSection 223 In particular we noticed that the integrationtime of Greedy deteriorates sharply over the ontologynumber increase In contrast FastApproximate is muchmore scalable and has a time curve similar to Basic

These results suggest that FastApproximate has the bestoverall performance in integrating multiple ontologies

4 Conclusions

In this work we started with a basic problem on integrating apair of ontology tree structures with a given closeness matrixand later we advanced the basic problem to the problem ofintegrating large number of ontologies We proved optimalstructures in the basic problem and developed both optimaland efficient approximation solutions Although the multipleontology integration problem has similar optimal structuresit is not feasible to extend the optimal and efficient approx-imation solutions for the basic problem to efficiently handle

14 BioMed Research International

multiple ontology integration To tackle the challenge of inte-grating a large number of ontologies we developed both aneffective greedy approach and a fast approximation approachThe empirical study not only confirms our analysis on theefficiency of the proposedmethod but also demonstrates thatourmethod can be used effectively for biomedical associationstudies

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

References

[1] D L McGuinness F van Harmelen andM K Smith Owl webontology language overview W3C recommendation 10(2004-03)10 2004

[2] O Bodenreider ldquoThe unifiedmedical language system (UMLS)integrating biomedical terminologyrdquo Nucleic Acids Researchvol 32 supplement 1 pp D267ndashD270 2004

[3] X Wu R Jiang M Q Zhang and S Li ldquoNetwork-based globalinference of human disease genesrdquo Molecular Systems Biologyvol 4 no 1 2008

[4] B Linghu E S Snitkin Z Hu Y Xia and C DeLisi ldquoGenome-wide prioritization of disease genes and identification ofdisease-disease associations from an integrated human func-tional linkage networkrdquo Genome Biology vol 10 no 9 articleR91 2009

[5] D Botstein and N Risch ldquoDiscovering genotypes underlyinghumanphenotypes past successes formendelian disease futureapproaches for complex diseaserdquo Nature Genetics vol 33 pp228ndash237 2003

[6] H S Pinto and J P Martins ldquoA methodology for ontologyintegrationrdquo in Proceedings of the 1st International Conferenceon Knowledge Capture (K-CAP rsquo01) pp 131ndash138 ACM 2001

[7] Y Xiang K Lu S L James T B Borlawsky K Huang and P RO Payne ldquoK-Neighborhood decentralization a comprehensivesolution to index the UMLS for large scale knowledge discov-eryrdquo Journal of Biomedical Informatics vol 45 no 2 pp 323ndash336 2012

[8] J Dutkowski M Kramer M A Surma et al ldquoA gene ontologyinferred from molecular networksrdquo Nature Biotechnology vol31 no 1 pp 38ndash45 2013

[9] R Navigli P Velardi and A Gangemi ldquoOntology learningand its application to automated terminology translationrdquo IEEEIntelligent Systems vol 18 no 1 pp 22ndash31 2003

[10] J Jannink andGWiederhold ldquoThesaurus entry extraction froman on-line dictionaryrdquo in Proceedings of the Fusion ConferenceSunnyvale Calif USA 1999

[11] C Papatheodorou A Vassiliou and B Simon ldquoDiscovery ofontologies for learning resources using word-based clusteringrdquoin World Conference on Educational Multimedia Hypermediaand Telecommunications pp 1523ndash1528 2002

[12] D L Rubin M Hewett D E Oliver T E Klein and RB Altman ldquoAutomating data acquisition into ontologies frompharmacogenetics relational data sources using declarativeobject definitions and xmlrdquoPacific SymposiumonBiocomputingpp 88ndash99 2001

[13] N F Noy ldquoSemantic integration a survey of ontology-basedapproachesrdquo ACM SIGMOD Record vol 33 no 4 pp 65ndash702004

[14] S Abels L Haak and A Hahn ldquoIdentification of commonmethods used for ontology integration tasksrdquo in Proceedingsof the 1st International Workshop on Interoperability of Het-erogeneous Information Systems (IHIS rsquo05) pp 75ndash78 ACMNovember 2005

[15] A Gangemi D Pisanelli and G Steve ldquoOntology integrationexperiences withmedical terminologiesrdquo in Formal Ontology inInformation Systems vol 46 pp 98ndash94 IOS Press AmsterdamThe Netherlands 1998

[16] N F Noy and M A Musen ldquoSMART automated support forontology merging and alignmentrdquo in Proceedings of the 12thWorkshop on Knowledge Acquisition Modelling and Manage-ment (KAW 99) Banff Canada 1999

[17] A Doan J Madhavan P Domingos and A Halevy ldquoLearningtomap between ontologies on the semantic webrdquo in Proceedingsof the 11th International Conference onWorldWideWeb (WWWrsquo02) pp 662ndash673 ACM May 2002

[18] J Xie F Liu and S-U Guan ldquoTree-structure based ontologyintegrationrdquo Journal of Information Science vol 37 no 6 pp594ndash613 2011

[19] D Calvanese G De Giacomo andM Lenzerini ldquoA frameworkfor ontology integrationrdquo inTheEmerging SemanticWebSelectedPapers from the First Semantic Web Working Symposium pp201ndash214 2002

[20] O Udrea L Getoor and R J Miller ldquoLeveraging data andstructure in ontology integrationrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo07) pp 449ndash460 ACM June 2007

[21] H W Kuhn ldquoThe Hungarian method for the assignmentproblemrdquoNaval Research Logistics Quarterly vol 1-2 pp 83ndash971955

[22] D E Drake Vinkemeier and S Hougardy ldquoA linear-timeapproximation algorithm for weighted matchings in graphsrdquoACMTransactions on Algorithms vol 1 no 1 pp 107ndash122 2005

[23] A J J Wood W E Evans and H L McLeod ldquoPharmacoge-nomicsmdashdrug disposition drug targets and side effectsrdquo TheNew England Journal of Medicine vol 348 no 6 pp 538ndash5492003

[24] R B Altman ldquoPharmgkb a logical home for knowledge relatinggenotype to drug response phenotyperdquoNature Genetics vol 39no 4 article 426 2007

[25] R B Kim D OrsquoShea and G R Wilkinson ldquoInterindivid-ual variability of chlorzoxazone 6-hydroxylation in men andwomen and its relationship to CYP2E1 genetic polymorphismsrdquoClinical Pharmacology amp Therapeutics vol 57 no 6 pp 645ndash655 1995

[26] T N Ferraro and R J Buono ldquoThe relationship between thepharmacology of antiepileptic drugs and human gene variationan overviewrdquo Epilepsy and Behavior vol 7 no 1 pp 18ndash36 2005

[27] A Albin X Ji T B Borlawsky et al ldquoEnabling online studiesof conceptual relationships between medical terms developingan efficient web platformrdquo JMIR Medical Informatics vol 2 no2 article e23 2014

[28] S Chandrasekaran J W Dean III M S Giniger and M LTanzer ldquoLaminin carbohydrates are implicated in cell signal-ingrdquo Journal of Cellular Biochemistry vol 46 no 2 pp 115ndash1241991

[29] J Uitto and D Kouba ldquoCytokine modulation of extracellularmatrix gene expression relevance to fibrotic skin diseasesrdquoJournal of Dermatological Science vol 24 pp S60ndashS69 2000

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 14: Research Article Building Integrated Ontological Knowledge

14 BioMed Research International

multiple ontology integration To tackle the challenge of inte-grating a large number of ontologies we developed both aneffective greedy approach and a fast approximation approachThe empirical study not only confirms our analysis on theefficiency of the proposedmethod but also demonstrates thatourmethod can be used effectively for biomedical associationstudies

Conflict of Interests

The authors declare that there is no conflict of interestsregarding the publication of this paper

References

[1] D L McGuinness F van Harmelen andM K Smith Owl webontology language overview W3C recommendation 10(2004-03)10 2004

[2] O Bodenreider ldquoThe unifiedmedical language system (UMLS)integrating biomedical terminologyrdquo Nucleic Acids Researchvol 32 supplement 1 pp D267ndashD270 2004

[3] X Wu R Jiang M Q Zhang and S Li ldquoNetwork-based globalinference of human disease genesrdquo Molecular Systems Biologyvol 4 no 1 2008

[4] B Linghu E S Snitkin Z Hu Y Xia and C DeLisi ldquoGenome-wide prioritization of disease genes and identification ofdisease-disease associations from an integrated human func-tional linkage networkrdquo Genome Biology vol 10 no 9 articleR91 2009

[5] D Botstein and N Risch ldquoDiscovering genotypes underlyinghumanphenotypes past successes formendelian disease futureapproaches for complex diseaserdquo Nature Genetics vol 33 pp228ndash237 2003

[6] H S Pinto and J P Martins ldquoA methodology for ontologyintegrationrdquo in Proceedings of the 1st International Conferenceon Knowledge Capture (K-CAP rsquo01) pp 131ndash138 ACM 2001

[7] Y Xiang K Lu S L James T B Borlawsky K Huang and P RO Payne ldquoK-Neighborhood decentralization a comprehensivesolution to index the UMLS for large scale knowledge discov-eryrdquo Journal of Biomedical Informatics vol 45 no 2 pp 323ndash336 2012

[8] J Dutkowski M Kramer M A Surma et al ldquoA gene ontologyinferred from molecular networksrdquo Nature Biotechnology vol31 no 1 pp 38ndash45 2013

[9] R Navigli P Velardi and A Gangemi ldquoOntology learningand its application to automated terminology translationrdquo IEEEIntelligent Systems vol 18 no 1 pp 22ndash31 2003

[10] J Jannink andGWiederhold ldquoThesaurus entry extraction froman on-line dictionaryrdquo in Proceedings of the Fusion ConferenceSunnyvale Calif USA 1999

[11] C Papatheodorou A Vassiliou and B Simon ldquoDiscovery ofontologies for learning resources using word-based clusteringrdquoin World Conference on Educational Multimedia Hypermediaand Telecommunications pp 1523ndash1528 2002

[12] D L Rubin M Hewett D E Oliver T E Klein and RB Altman ldquoAutomating data acquisition into ontologies frompharmacogenetics relational data sources using declarativeobject definitions and xmlrdquoPacific SymposiumonBiocomputingpp 88ndash99 2001

[13] N F Noy ldquoSemantic integration a survey of ontology-basedapproachesrdquo ACM SIGMOD Record vol 33 no 4 pp 65ndash702004

[14] S Abels L Haak and A Hahn ldquoIdentification of commonmethods used for ontology integration tasksrdquo in Proceedingsof the 1st International Workshop on Interoperability of Het-erogeneous Information Systems (IHIS rsquo05) pp 75ndash78 ACMNovember 2005

[15] A Gangemi D Pisanelli and G Steve ldquoOntology integrationexperiences withmedical terminologiesrdquo in Formal Ontology inInformation Systems vol 46 pp 98ndash94 IOS Press AmsterdamThe Netherlands 1998

[16] N F Noy and M A Musen ldquoSMART automated support forontology merging and alignmentrdquo in Proceedings of the 12thWorkshop on Knowledge Acquisition Modelling and Manage-ment (KAW 99) Banff Canada 1999

[17] A Doan J Madhavan P Domingos and A Halevy ldquoLearningtomap between ontologies on the semantic webrdquo in Proceedingsof the 11th International Conference onWorldWideWeb (WWWrsquo02) pp 662ndash673 ACM May 2002

[18] J Xie F Liu and S-U Guan ldquoTree-structure based ontologyintegrationrdquo Journal of Information Science vol 37 no 6 pp594ndash613 2011

[19] D Calvanese G De Giacomo andM Lenzerini ldquoA frameworkfor ontology integrationrdquo inTheEmerging SemanticWebSelectedPapers from the First Semantic Web Working Symposium pp201ndash214 2002

[20] O Udrea L Getoor and R J Miller ldquoLeveraging data andstructure in ontology integrationrdquo in Proceedings of the ACMSIGMOD International Conference on Management of Data(SIGMOD rsquo07) pp 449ndash460 ACM June 2007

[21] H W Kuhn ldquoThe Hungarian method for the assignmentproblemrdquoNaval Research Logistics Quarterly vol 1-2 pp 83ndash971955

[22] D E Drake Vinkemeier and S Hougardy ldquoA linear-timeapproximation algorithm for weighted matchings in graphsrdquoACMTransactions on Algorithms vol 1 no 1 pp 107ndash122 2005

[23] A J J Wood W E Evans and H L McLeod ldquoPharmacoge-nomicsmdashdrug disposition drug targets and side effectsrdquo TheNew England Journal of Medicine vol 348 no 6 pp 538ndash5492003

[24] R B Altman ldquoPharmgkb a logical home for knowledge relatinggenotype to drug response phenotyperdquoNature Genetics vol 39no 4 article 426 2007

[25] R B Kim D OrsquoShea and G R Wilkinson ldquoInterindivid-ual variability of chlorzoxazone 6-hydroxylation in men andwomen and its relationship to CYP2E1 genetic polymorphismsrdquoClinical Pharmacology amp Therapeutics vol 57 no 6 pp 645ndash655 1995

[26] T N Ferraro and R J Buono ldquoThe relationship between thepharmacology of antiepileptic drugs and human gene variationan overviewrdquo Epilepsy and Behavior vol 7 no 1 pp 18ndash36 2005

[27] A Albin X Ji T B Borlawsky et al ldquoEnabling online studiesof conceptual relationships between medical terms developingan efficient web platformrdquo JMIR Medical Informatics vol 2 no2 article e23 2014

[28] S Chandrasekaran J W Dean III M S Giniger and M LTanzer ldquoLaminin carbohydrates are implicated in cell signal-ingrdquo Journal of Cellular Biochemistry vol 46 no 2 pp 115ndash1241991

[29] J Uitto and D Kouba ldquoCytokine modulation of extracellularmatrix gene expression relevance to fibrotic skin diseasesrdquoJournal of Dermatological Science vol 24 pp S60ndashS69 2000

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology

Page 15: Research Article Building Integrated Ontological Knowledge

Submit your manuscripts athttpwwwhindawicom

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Anatomy Research International

PeptidesInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporation httpwwwhindawicom

International Journal of

Volume 2014

Zoology

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Molecular Biology International

GenomicsInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioinformaticsAdvances in

Marine BiologyJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Signal TransductionJournal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

BioMed Research International

Evolutionary BiologyInternational Journal of

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Biochemistry Research International

ArchaeaHindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Genetics Research International

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Advances in

Virolog y

Hindawi Publishing Corporationhttpwwwhindawicom

Nucleic AcidsJournal of

Volume 2014

Stem CellsInternational

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

Enzyme Research

Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014

International Journal of

Microbiology