18
Information retrieval in schema-based P2P systems using one-dimensional semantic space Tao Gu a,b, * , Hung Keng Pung b , Daqing Zhang a a Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore b National University of Singapore, 3 Science Drive 2, Singapore Available online 30 June 2007 Abstract The widespread use of RDF-based information necessitates efficient information retrieval techniques in wide-area net- works. In this paper, we present Dynamic Semantic Space, a schema-based peer-to-peer overlay network that facilitates efficient lookup for RDF-based information in dynamic environments. Peers in this overlay are grouped based on the semantics of their data which are extracted according to a set of schemas, and self-organized as a semantic overlay net- work. To reduce overheads incurred by peer joining, leaving and content changes in a high-dimensional overlay network, peers are constructed as a one-dimensional semantic space that facilitates efficient routing for both pull and push requests. A search or a subscription request is only routed to the appropriate cluster that holds related data, thus reducing unnec- essary search cost and increasing the efficiency of locating information. Through a comprehensive simulation study, we demonstrate the effectiveness of our proposed techniques. Ó 2007 Elsevier B.V. All rights reserved. Keywords: RDF; Ontology; Schema-based peer-to-peer overlay network; Semantic peer-to-peer network 1. Introduction Resource Description Framework [1] has been widely recognized as the standard for storing and exchanging information on the World Wide Web. RDF statements which describe resources and their semantics are machine-understandable and machine-processable, and can be created by dif- ferent users and widely distributed on the Web. The distribution of RDF statements provides great flexi- bility for describing resources and building applica- tions. With the increasing use of RDF statements, there is a need to support efficient retrieval of RDF data in wide-area networks. In recent years, dynamic applications such as e-business and context-aware pervasive systems are becoming more and more pop- ular. Information to be shared in these applications typically exhibits a variety of dynamic characteris- tics, e.g., the location of a user may be changed fre- quently. It is more challenging to provide efficient information retrieval for such applications due to the dynamic changes of their data. One approach is to use centralized search engines to index RDF data. These indices can be obtained by 1389-1286/$ - see front matter Ó 2007 Elsevier B.V. All rights reserved. doi:10.1016/j.comnet.2007.06.019 * Corresponding author. Address: Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore. E-mail addresses: [email protected] (T. Gu), punghk@ comp.nus.edu.sg (H.K. Pung), [email protected] (D. Zhang). Computer Networks 51 (2007) 4543–4560 www.elsevier.com/locate/comnet

Information retrieval in schema-based P2P systems using ...zhang_da/pub/CN2007-gu.pdf · The widespread use of RDF-based information necessitates efficient information retrieval techniques

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Information retrieval in schema-based P2P systems using ...zhang_da/pub/CN2007-gu.pdf · The widespread use of RDF-based information necessitates efficient information retrieval techniques

Computer Networks 51 (2007) 4543–4560

www.elsevier.com/locate/comnet

Information retrieval in schema-based P2P systems usingone-dimensional semantic space

Tao Gu a,b,*, Hung Keng Pung b, Daqing Zhang a

a Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singaporeb National University of Singapore, 3 Science Drive 2, Singapore

Available online 30 June 2007

Abstract

The widespread use of RDF-based information necessitates efficient information retrieval techniques in wide-area net-works. In this paper, we present Dynamic Semantic Space, a schema-based peer-to-peer overlay network that facilitatesefficient lookup for RDF-based information in dynamic environments. Peers in this overlay are grouped based on thesemantics of their data which are extracted according to a set of schemas, and self-organized as a semantic overlay net-work. To reduce overheads incurred by peer joining, leaving and content changes in a high-dimensional overlay network,peers are constructed as a one-dimensional semantic space that facilitates efficient routing for both pull and push requests.A search or a subscription request is only routed to the appropriate cluster that holds related data, thus reducing unnec-essary search cost and increasing the efficiency of locating information. Through a comprehensive simulation study, wedemonstrate the effectiveness of our proposed techniques.� 2007 Elsevier B.V. All rights reserved.

Keywords: RDF; Ontology; Schema-based peer-to-peer overlay network; Semantic peer-to-peer network

1. Introduction

Resource Description Framework [1] has beenwidely recognized as the standard for storingand exchanging information on the World WideWeb. RDF statements which describe resourcesand their semantics are machine-understandableand machine-processable, and can be created by dif-ferent users and widely distributed on the Web. The

1389-1286/$ - see front matter � 2007 Elsevier B.V. All rights reserved

doi:10.1016/j.comnet.2007.06.019

* Corresponding author. Address: Institute for InfocommResearch, 21 Heng Mui Keng Terrace, Singapore.

E-mail addresses: [email protected] (T. Gu), [email protected] (H.K. Pung), [email protected] (D.Zhang).

distribution of RDF statements provides great flexi-bility for describing resources and building applica-tions. With the increasing use of RDF statements,there is a need to support efficient retrieval of RDFdata in wide-area networks. In recent years, dynamicapplications such as e-business and context-awarepervasive systems are becoming more and more pop-ular. Information to be shared in these applicationstypically exhibits a variety of dynamic characteris-tics, e.g., the location of a user may be changed fre-quently. It is more challenging to provide efficientinformation retrieval for such applications due tothe dynamic changes of their data.

One approach is to use centralized search enginesto index RDF data. These indices can be obtained by

.

Page 2: Information retrieval in schema-based P2P systems using ...zhang_da/pub/CN2007-gu.pdf · The widespread use of RDF-based information necessitates efficient information retrieval techniques

4544 T. Gu et al. / Computer Networks 51 (2007) 4543–4560

crawling Web pages such as in RDF Google.Although this approach can provide fast responseto a query, it is difficult to keep these indices up todate due to dynamism of data sources. In addition,this approach has limitations such as scalability,processing bottleneck and single point of failure.Peer-to-Peer (P2P) approaches have been proposedto overcome some of these obstacles, and providepotential solutions for building non-centralizedinformation lookup systems. P2P systems such asGnutella [2] allow nodes to interconnect freely andhave low overlay maintenance overhead, making iteasy to handle the dynamic changes of peers andtheir data. However, a query has to be flooded toall nodes in the network including those nodes thatdo not have relevant data. The blind flooding mech-anism used without any restriction on the scope offlooding can become very inefficient because ofexcessive redundant messages. Other P2P systemssuch as Chord [3], CAN [4], Pastry [5] and Tapestry[6] typically implement distributed hash tables(DHTs) and use hashed keys to direct a lookuprequest to the specific nodes by leveraging on a struc-tured overlay network among peers. However, dataplacement in these systems is tightly controlled basedon distributed hash functions. In dynamic environ-ments peers may join or leave the system frequentlyand data may be changed rapidly; thus higher over-lay maintenance overhead for updating the relevantinformation in the DHT-based overlay networks isinevitable. Moreover, for certain applications suchas those in context-aware systems [7], it is desirablebut may not be possible to place data in a particularnode (i.e., near the data source) using a hash value.

To facilitate the efficient retrieval of RDF data indynamic environments, we present Dynamic Seman-tic Space (DSS), a schema-based P2P overlay net-work in which RDF data are organized andretrieved based on their semantics to support bothpull and push services. In this overlay, data can berepresented by a collection of RDF statementsbased on a set of schemas (i.e., ontologies). RDFstatements which are semantically similar are ‘‘tied’’together so that they can be retrieved by a querywhich has the same semantics. As a result, the sys-tem is able to forward a query to nodes which arelikely to contain the relevant data.

While the basic idea appears simple, there areseveral issues that have to be considered in orderto make the system work efficiently. Firstly, overlaymaintenance cost can be high due to the frequentchanges of peers and their data. Hence, minimizing

overlay maintenance cost is important in designingthe search mechanism in DSS. Secondly, the map-ping from data and queries to semantic clustersshould not incur much overhead. Thirdly, the num-ber of semantic clusters used in real-life applicationscan be potentially large. To accommodate the heter-ogeneity of data sources, the search mechanism inDSS should operate efficiently in a high-dimensionalspace without incurring high overhead. Finally, asdata may change rapidly in dynamic environments,it is important to automatically notify consumerswhen changes occur. Hence, DSS should be ableto adapt and scale to data change and growth.

To address these issues, we propose the followingtechniques:

• Use ontology-based metadata to extract thesemantics of data and queries. This techniquecan map data and queries to the appropriatesemantic cluster(s) with minimum computationaloverhead in the presence of frequent peer joining/leaving or content changes.

• Upon joining the system, peers are grouped andarranged into a one-dimensional semantic spacewhere various semantic clusters are organizedand interconnected in a ring space. This structureenables the mapping of the clusters in a k-dimen-sional semantic space to a one-dimensionalsemantic space, and hence reduces overlay main-tenance overhead.

• A cluster encoding scheme enables the systemadapt to the number of peers by splitting ormerging clusters. This can result in a systemwhich has good scalability and load balancingcharacteristics. This scheme also enables the useof parallelism in our system when searching fordata within a semantic cluster.

• Deploy both pull and push services in DSS. Con-sumers can either submit search requests or sub-scription requests. The latter allows consumers tobe notified whenever data changes occur.

The rest of the paper is organized as follows. Wediscuss related work in Section 2. We present thedetails of DSS in Section 3, and the evaluationresults in Section 4. Finally, we conclude the paperin Section 5.

2. Related work

Centralized RDF repositories and lookup sys-tems such as RDFStore [8] and Jena [9] have been

Page 3: Information retrieval in schema-based P2P systems using ...zhang_da/pub/CN2007-gu.pdf · The widespread use of RDF-based information necessitates efficient information retrieval techniques

T. Gu et al. / Computer Networks 51 (2007) 4543–4560 4545

implemented to support the storing and querying ofRDF documents. These systems are simpler todesign and reasonably fast for low to moderatenumber of triples. However, they have the tradi-tional limitations of centralized approaches, suchas single processing bottleneck and single point offailure.

Schema-based P2P networks such as Edutella[10] are proposed to combine P2P computing andthe Semantic Web. These systems build upon peersthat use explicit schemas to describe their contents.They use super-peer based topologies, in whichpeers are organized in hypercubes for routingqueries. However, current schema-based P2P net-works still have some shortcomings; queries haveto be flooded to every node in the network, makingthe system difficult to scale. Chirita et al. [11] built apublish/subscribe system on the Edutella P2P infra-structure. This system uses content advertising,subscribing and notifying. However, content adver-tising may create additional overhead. In our sys-tem, a subscription request is first directed to a setof potential producer peers in a semantic cluster.Following that, each producer peer will map therequest against its local RDF data. Crespo et al.[12] proposed the concept of Semantic Overlay Net-works (SONs) in which peers are grouped by seman-tic relationships of documents they store. Each peerstores additional information about content classifi-cation and route queries to the appropriate SONs,increasing the chances that matching objects willbe found quickly and reducing the search load.However, the maintenance cost in SONs becomesmore expensive when the number of SONsincreases. We adopt the basic idea of semantic clus-tering, and we impose certain link structures onthese semantic clusters to facilitate both intra-clus-ter and inter-cluster routing and to reduce the over-lay maintenance cost. Cai et al. [13] proposed adistributed RDF repository that duplicates andstores each triple at three places in a multi-attributeaddressable network which extends Chord by usinga globally known hash function. Queries can then beefficiently routed to those nodes in the networkwhere the triples in question are known to be storedif they exist. However, the overlay maintenance costis high in this system. In addition, storing each RDFtriple multiple times in the network increases thedata storage and maintenance costs.

Tang et al. [14] applied classical InformationRetrieval techniques to P2P systems and built adecentralized P2P information retrieval system called

pSearch. The system makes use of a variant of Con-tent-Addressable Networks (CAN) to build thesemantic overlay and uses Latent Semantic Indexing(LSI) [15] to map documents into term vectors in thespace. Li et al. [16] built a semantic small world net-work in which peers are clustered based on term vec-tors computed using LSI. They proposed an adaptivespace linearization technique, and constructed thelink structures based on small world network theory.The small world network model was originally intro-duced by Kleinberg [17]. He proposed a two-dimen-sional grid where every node maintains four links toeach of its closest neighbors and one long distancelink to a node chosen from a probability function.He has shown that a query can be routed to any nodein O(log2n) hops where n is the total number of nodesin the network. Our work is inspired by the smallworld model. DSS not only maps a k-dimensionalsemantic space to a one-dimensional semantic space,but also allow peers to be grouped into sub-clusters ina semantic cluster (through the cluster encodingscheme) to better accomplish scalability as well asfacilitate parallel search. To route queries across clus-ters in DSS, we select two long distance links whichare located at certain positions of the ring spaceinstead of choosing one randomly as in the smallworld network model. Through our simulation, weshow how these two shortcuts improve the search effi-ciency with a variant setting of number of semanticclusters. Furthermore, we propose the use ofschema-based metadata to extract data semanticswhich has low overhead as compared to LSI. Weshow how these ideas can be applied into a seman-tic-based P2P lookup system.

3. Dynamic semantic space

In this section, we first present an overview ofDSS, followed by a description of technical details.For ease of discussion, we use the terms node andpeer interchangeably for the rest of the paper.

3.1. Overview

In DSS, a large number of nodes are self-orga-nized into a semantic overlay network, in accor-dance with their semantics. A user or anapplication can act as a producer, a consumer orboth. Producers provide various RDF data for shar-ing; whereas consumers obtain data by submittingtheir queries and receiving query results. Each peermaintains a local data repository which supports

Page 4: Information retrieval in schema-based P2P systems using ...zhang_da/pub/CN2007-gu.pdf · The widespread use of RDF-based information necessitates efficient information retrieval techniques

4546 T. Gu et al. / Computer Networks 51 (2007) 4543–4560

RDF-based semantic query using RDQL [18]. Uponits creation, each producer peer will join a semanticcluster based on the semantics of its major dataand publish its data indices to peers in other seman-tic clusters. Peers within a cluster are interconnectedusing an overlay structure. There is no restriction onthe type of overlay used within a cluster. Uponreceiving a query, a peer first pre-processes the queryto obtain the information about the semantic clusterassociated with the query, and then routes it to anappropriate semantic cluster. When the queryreaches the designated semantic cluster, it is for-warded in parallel. Peers that receive the query willdo a local search, and return results if available.There are two types of queries in DSS: search requestand subscription request. Search requests enableconsumers to pull data from the network at a one-

Legend: owl:Class rd

Dom

ain-

Spec

ific

Low

-laye

rOnt

olog

ies

Locationlongtitude

latitudelaltitude

sub Cl assO

f

ScheduledActivity

DeducedActivity

subC

lass

Of

temperaturenoiseLevel

subClassOf Party

subClassOf M

Din

Tak

name

homeAddress

role

foodPreference

status

Person

nearBysub

Ind

hasChildren

posture

ActivitystartTimeendTime

Location

CompEntity

Upp

er O

ntol

ogy

IndoorSpace

OutdoorS

locatedIn

lo

use

Device

Service

Application

Network

Agent

own

Fig. 1. An example of ontological structure i

time basis, while subscription requests enable con-sumers to subscribe RDF data and be notified whendata changes occur over a period of time.

3.2. Ontology-based semantic clustering

In this section, we describe how to use ontology-based metadata to extract the semantics of bothRDF data and queries. There are several advantagesas compared to other semantic extraction tech-niques such as Vector Space Model (VSM) [19]and LSI. The formal design of ontologies minimizesthe problems of synonyms and polysemy incurredby VSM. Based on ontologies, data and queriescan be mapped to appropriate semantic clustersdirectly without costly computation as in LSI, yetthe same precision is retained.

fs:subClassOf owl:Property

subC

lass

Of

subC

lass

Of

Device

lighting

weatherCondOutdoorSpace

humidity

subClassOfDooryard

ovie

ner

ingShower

Entry

Corridor

curtainStatus

windowStatusdoorStatus

status

foodItemFridge

modeCellPhone

volumeDVDPlayer

volume

ClassOf

Room

oorSpace

Garden

volumeTV

ContextEntity

Person

pace

catedIn

locatedIn

ScheduledActivity

DeducedActivity

parti

cipa

teIn

Activity

Time

happenAt

n the context-aware computing domain.

Page 5: Information retrieval in schema-based P2P systems using ...zhang_da/pub/CN2007-gu.pdf · The widespread use of RDF-based information necessitates efficient information retrieval techniques

T. Gu et al. / Computer Networks 51 (2007) 4543–4560 4547

We adopt a two-tier hierarchy in ontologydesign. The upper ontology defines common con-cepts and is shared by all peers. Each peer can defineits own concepts in its domain-specific low-layerontology. Different peers may store different setsof low-layer ontologies based on their applicationneeds. We illustrate the mapping process using anexample of ontology in the context-aware comput-ing domain as shown in Fig. 1. The leaf conceptsin the upper ontology are used as semantic clusters,and are denoted as a set E = {Service,Applica-

tion,Device, . . .}. Each of these pre-defined semanticclusters will be assigned with a unique Semantic ID

(described in Section 3.3) upon their presence inDSS.

The mapping computation is done locally at eachpeer. For the mapping of RDF data, a peer needs todefine a set of low-layer ontologies and store themlocally. Upon joining DSS, a peer first obtains theupper ontology and merges it with its local low-layer ontologies. Then it creates instances (i.e.,RDF data) and adds them into the merged ontologyto form its local knowledge base. A peer’s local datamay be mapped into one or more semantic clusters

Fig. 2. An example of sema

by extracting the subject, predicate and object of anRDF data triple. Let SCnsub, SCnpred, SCnobj wheren = 1,2, . . . denote the semantic clusters extractedfrom the subject, predicate and object of a data tri-ple respectively. Unknown subjects/objects (whichare not defined in the merged ontology) or variablesare mapped to E. If the predicate of a data triple isof type ObjectProperty, we obtain the semantic clus-ters using (SC1pred [ SC2pred [ � � � SCnpred) \(SC1obj [ SC2obj [ . . . SCnobj). If the predicate ofa data triple is of type DatatypeProperty, we obtainthe semantic clusters using (SC1sub [ SC2sub [ . . .SCnsub) \ (SC1pred [ SC2pred [ . . . SCnpred). Exam-ples 1 and 2 in Fig. 2a show RDF data triples aboutlocation and light level in a bedroom provided by aproducer peer. In Example 2, we first obtain thesemantic clusters from both subject and predicate,and then intersect their results to get the finalsemantic cluster – IndoorSpace.

A query follows the same procedure to obtain itssemantic cluster(s), but it needs all the sets of low-layer ontologies. In real applications, users may cre-ate duplicate properties in their low-layer ontologieswhich conflict with the ones in the upper ontology.

ntic cluster mapping.

Page 6: Information retrieval in schema-based P2P systems using ...zhang_da/pub/CN2007-gu.pdf · The widespread use of RDF-based information necessitates efficient information retrieval techniques

C0 C1C2

C3

C4

C6

C8

C9

C10

C11

C12

C14C16C17C18

C20

C22

C24

C25

C26

C28

Peer 1

Peer 4

Peer 6

Peer 3

Peer 5

Peer 2 SC0

SC1

SC2

SC3SC4

SC5

SC6

SC7

C19

Fig. 3. One-dimensional semantic space.

4548 T. Gu et al. / Computer Networks 51 (2007) 4543–4560

For example, the upper ontology defines therdfs:range of predicate locatedIn as Location

whereas the low-layer ontology defines its rdfs:range

as IndoorSpace. To resolve this issue, we create twomerged ontologies, one for clustering peers and theother for clustering queries. If such a conflict occurs,we select the affected properties defined in the low-layer ontology to generate the merged ontologyfor clustering peers’ data and select the affectedproperties defined in the upper ontology to generatethe merged ontology for clustering queries. Withthis scheme, a peer can extract the semantics of itsdata triples more precisely based on its low-layerontology without losing generality for queries. Forexample, predicate locatedIn may have therdfs:range of IndoorSpace (underlined in Fig. 2a)in the merged ontology for clustering peers’ dataand have the rdfs:range of Location (underlined inFig. 2b) in the merged ontology for clustering que-ries. Data triple ‘Someone is located in the bedroom’will be mapped to IndoorSpace; and query ‘where is

John’ will be mapped to IndoorSpace and Outdoor-

Space rather than only IndoorSpace. This is mostlikely the case in real-life applications.

3.3. One-dimensional semantic space

In DSS, peers are organized in such a way thatthose with semantically similar data are groupedtogether. To enable search across semantic clusters,an intuitive solution is to construct k-dimensionalsemantic clusters by connecting each peer to alldimensions of the corresponding clusters such asin [12,7]. However, overlay maintenance costbecomes expensive when the number of semanticclusters increases. To reduce overlay maintenancecost, we present a new approach to facilitate effi-cient search in a high-dimensional semantic space.We build an overlay network using the one-dimen-sional ring structure which enables the mappingfrom a k-dimensional semantic space into a one-dimensional semantic space.

3.3.1. Peer placementTo join a semantic cluster in the network, a peer

first obtains the semantics of its local data. This isdone by mapping each RDF triple in its repositoryto one or more semantic clusters using the techniquedescribed in Section 3.2. We then count the RDFtriples for each semantic cluster obtained. Thesemantic cluster corresponding to the most triplecount (i.e., its major data) is called its major seman-

tic cluster; and the remaining semantic clusters arecalled its minor semantic clusters.

A peer will then join its major semantic cluster.In order for a query to reach all nodes that providethe same semantics, we adopt index publishing. Apeer publishes the indices of its data to its minorsemantic clusters as follows: it selects a node in eachof its minor semantic clusters and publishes theindex (i.e., reference pointer) to these nodes. Eachindex points to a node where the data is physicallystored. For example, as shown in Fig. 3, Peer 1 pub-lishes its index to semantic cluster SC1 by putting itsindex to Peer 3 which is selected in random withinSC1 (for balancing the load of indices). As a result,a semantic cluster can be viewed as a set of intercon-nected nodes separated by clusters and a collectionof indices stored in these nodes.

The above scheme has several positive effects.For example, if a peer has homogeneous data inits local repository, most of its data will be catego-rized into one corresponding semantic cluster, there-fore reducing the cost to publish data indices. This islikely to be the case in real applications. Further-more, many applications are designed in such away that a peer is likely to query for data availablein its nearby peers. By placing a peer into one par-ticular semantic cluster based on the majority ofits local data, a query can be resolved very effi-ciently. It should be noted that while we have onlyelaborated the joining of a single semantic cluster,the same principle can be applied by any peer forjoining multiple semantic clusters.

A peer may periodically calculate the triple countfor each semantic cluster. The time interval to per-

Page 7: Information retrieval in schema-based P2P systems using ...zhang_da/pub/CN2007-gu.pdf · The widespread use of RDF-based information necessitates efficient information retrieval techniques

T. Gu et al. / Computer Networks 51 (2007) 4543–4560 4549

form such operation is application-specific anddepends on the dynamism of its data (i.e., how oftenthe data is changed or updated); hence it will not befurther studied in this paper. If its major semanticcluster is changed, a peer will need to initiate anew joining process. If its minor semantic clustersare changed, a peer will need to remove the out-dated indices or publish new indices.

3.3.2. Cluster naming scheme

In the design of DSS, one crucial issue is how todesign a naming space. We propose a cluster nam-ing scheme which allows sub-clustering within asemantic cluster. We distinguish the concepts ofcluster and semantic cluster. A cluster refers to apartition which consists of a set of nodes bundledtogether such as C0 in Fig. 3. A semantic clusterrefers to a set of clusters corresponding to the samesemantics. For example, cluster C0, C1, C2, and C3belongs to semantic cluster SC0. We propose ourcluster encoding scheme as follows. A Cluster ID

which is represented by a k-bit binary string (wherek = m + n) is a unique ID that identifies a cluster inDSS. The first m-bit binary string (called Semantic

Cluster ID) is used to identify a semantic cluster.Hence, a DSS can have a maximum of 2k clustersand 2m semantic clusters. An example of a DSSwhich assumes k = 5 and m = 3 is illustrated inFig. 3. The rationale behind this encoding schemeis that, for a given query, we need to obtain theappropriate Semantic Cluster ID to match the samesemantics of the query. Semantic clusters can beviewed as an additional semantic layer on top ofactual clusters. Partitioning peers into a set of clus-ters in a same semantic cluster also provides betterload balancing and enables parallel search withinthe same semantic cluster.

3.3.3. Ring construction

In DSS, clusters are placed in the ring spacebased on their cluster IDs. Each node maintains aset of node entries in its routing table for the pur-pose of both intra-cluster routing and inter-clusterrouting. A node, say x, first decides in which seman-tic cluster to participate. It then picks a cluster ran-domly within this semantic cluster to join byconnecting to a number of nodes in this cluster.These node entries (called x’s neighbors in its owncluster) will be maintained in x’s routing table asintra-cluster routing information. Node x also cre-ates and maintains two node entries in each of itsadjacent clusters. We call these two nodes x’s neigh-

bors in its adjacent clusters. Each node joins the net-work by performing this operation, resulting in allthe clusters being linked linearly in a ring fashion.With this ring structure, a k-dimensional semanticspace can be linearized.

Maintaining two neighbors in the adjacent clus-ters for every node in DSS also ensures that a querygenerated at any node will be able to reach anyother cluster by navigating the ring space. However,queries have to be passed around the ring space lin-early either clockwise or anticlockwise until the des-tination cluster is reached. To accelerate searchacross clusters in DSS, node x maintains a set oflinks to nodes in other semantic clusters except thetwo adjacent clusters. These nodes provide shortcuts

(similar to long contacts in Kleinberg’s small worldmodel) for x to route a query to other semantic clus-ters quickly. For example, in Fig. 3, x creates andkeeps track of two shortcuts: one points to the oppo-site semantic cluster (i.e., shortcut to Peer 5) and theother points to the semantic cluster located in aquarter of the ring space (i.e., shortcut to Peer 6).These shortcuts and neighboring nodes in adjacentclusters are used by node x to perform inter-clusterrouting. In the process of cluster splitting and merg-ing or when a new semantic cluster is inserted intothe ring space, a node needs to update its neighbor-ing nodes in both its own cluster and its adjacentclusters. However, a node only needs to update itsshortcuts upon the insertion or deletion of a seman-tic cluster as a shortcut points to an appropriatesemantic cluster rather than a cluster.

3.3.4. Cluster splitting and merging

The operations of cluster splitting and mergingenable DSS to scale to a large number of peers.Let M represent the maximum cluster size. If thesize of a cluster exceeds M, the splitting process isinvoked to split the cluster into two. A simple wayof cluster splitting is to partition a cluster into twoclusters of equal size without considering load dis-tribution in the two clusters such as in CHORD.To balance the load during splitting and merging,each node maintains a CurrentLoad which measuresits current load in terms of the number of RDF tri-ples and data indices it stores. When node x joinsthe network, it sends a join request message to anexisting node, say y. If y falls into the same semanticcluster that x wishes to join, x joins the cluster byconnecting to y if its cluster size is below M; other-wise y will direct the request to a node, say z, in thesemantic cluster that x wishes to join, and x will

Page 8: Information retrieval in schema-based P2P systems using ...zhang_da/pub/CN2007-gu.pdf · The widespread use of RDF-based information necessitates efficient information retrieval techniques

4550 T. Gu et al. / Computer Networks 51 (2007) 4543–4560

connect to z if its cluster size does not exceed M. Ifthe cluster size exceeds M, node y or z (called an ini-tial node) will initiate the splitting process. The ini-tial node first obtains a list of all the nodes in thiscluster which is sorted according to their Current-

Loads. Then it assigns these nodes in the list tothe two sub-clusters alternatively. After splitting,we obtain two clusters with relatively equal load.The initial node is also responsible for generatinga new cluster ID for each of the two sub-clusters.To obtain a new cluster ID, each node maintainsa bit split pointer which indicates the next bit to besplit in the n-bit binary string (where n = k � m).For example, in Fig. 3, we assume m = 3, n = 2and there exists a cluster C4 in the network. Ini-tially, the bit split pointer points to the most signifi-cant bit of the n-bit string. When cluster splittingoccurs, the bit pointed to by the bit split pointer issplit into 0 and 1, and therefore we obtain clusterID C4 and C6 corresponding to the same semanticcluster SC1. The pointer is then moved forward tothe next bit in the n-bit string. Cluster C4 or C6can be further split into C4 and C5 or C6 and C7,and finally the bit split pointer is set to null indicat-ing no cluster splitting is allowed. The same mecha-nism follows for the insertion of a new semanticcluster in DSS. A semantic cluster can be split intoa maximum number of 2n clusters. After splitting,a node updates its cluster ID, the bit split pointer

as well as the neighbors list in both its own clusterand its adjacent clusters.

When node x leaves the network, it first checkswhether its cluster size has fallen below a thresholdMmin. If the current size is above Mmin, x simplyleaves the network by transferring its indices to arandomly selected node in its cluster. Otherwise, thiscluster needs to be merged into one of its neighbor-ing clusters within the same semantic cluster. Theleaving node triggers cluster merging which is theinverse process of cluster splitting. To obtain thenewly merged cluster ID, the bit split pointer movesbackwards by 1 bit in the n-bit string, and the bitpointed to by the bit split pointer is set to 0. Thenodes in the merged cluster need to perform thesame updating as in the splitting process. For theselection of Mmin, a simple method is to letMmin = 1 so that cluster merging is invoked whenthe last node in a cluster leaves. However, if thereexists only one node in a cluster, this node maybecome a hot spot as all the nodes in its two adja-cent clusters have links to it. The actual value ofMmin should be determined by the statistics of node

joining and leaving within this cluster. If the lastnode in a semantic cluster leaves, it initiates twomessages to all the nodes in its two adjacent clustersinforming them to update their neighbor lists. Sub-sequently, the semantic cluster will be removed fromDSS.

3.4. The routing algorithm

In this section, we describe the routing operationin DSS. As described above, each node in DSSmaintains a routing table with a set of node entries(in the form of a pair hNodeID, ClusterIDi) in itsown cluster, two adjacent clusters and another twosemantic clusters. It also keeps state informationabout its own cluster, consisting of a k-bit ClusterID

(where k = m + n) which indicates the cluster itresides in and ClusterSize which specifies the currentsize of its cluster. The query routing processinvolves two steps: inter-cluster routing and intra-cluster routing. Upon receiving a query, node x firstobtains the destination Semantic Cluster ID

(denoted as D). Then node x will check whether D

falls into its own semantic cluster by comparing D

against the most significant m-bits of its ClusterID.If that is the case, x will flood the query to all thenodes in its own cluster and also forward the queryto the nodes in its adjacent clusters corresponding toD. The first node in a cluster receiving the query isalways responsible for flooding the query withinits cluster and forwarding the query to its adjacentcluster. The forwarding processes are recursivelycarried out until all the clusters corresponding toD have been covered and all nodes in each of theclusters are reached. Every node, upon receiving aquery, will check its local data repository and returnthe matched data and indices.

For example, as illustrated in Fig. 4a, if a query isinitiated at Peer 1 with D = SC0, Peer 1 first for-wards the query to its neighboring node in C1,and then floods the query to all the nodes in C0.The same process is repeated for cluster C1, C2and C3. If D is not the semantic cluster that nodex belongs to, say its adjacent semantic cluster, thequery will be forwarded to D and flooded to allthe clusters corresponding to D. For example, inFig. 4a, a query generated at Peer 2 with D = SC3will hop through C16 and will be flooded in C14and C12. If D neither falls into node x’s own clusternor its adjacent semantic cluster, x will rely on itsshortcuts to route the query across clusters. A querycan be routed to a semantic cluster which is closer to

Page 9: Information retrieval in schema-based P2P systems using ...zhang_da/pub/CN2007-gu.pdf · The widespread use of RDF-based information necessitates efficient information retrieval techniques

C0 C1C2

C3

C4

C6

C8

C9

C10

C11

C12

C14C16C17C18

C20

C22

C24

C25

C26

C28

Peer 1

Peer 5

SC0

SC1

SC2

SC3SC4

SC5

SC6

SC7C0 C1C2

C3

C4

C6

C8

C9

C10

C11

C12

C14C16C17C18

C20

C22

C24

C25

C26

C28

Peer 1

Peer 2

SC0

SC1

SC2

SC3SC4

SC5

SC6

SC7

C19C19

Fig. 4. Query routing.

T. Gu et al. / Computer Networks 51 (2007) 4543–4560 4551

the destination semantic cluster quickly with thehelp of these shortcuts.

In the design of these shortcuts, we have severaldesign options. We need to decide which semanticcluster a shortcut should point to and how manyshortcuts each node should maintain. An intuitivestrategy is for a semantic cluster to select a set ofother semantic clusters randomly and assign a short-

cut to a node in each of these semantic clusters.Each node can have s shortcuts (s P 1) with thetradeoff that the cost of creating and maintainingthese shortcuts is proportional to s. Upon receivinga query, if the distance between D and the semanticclusters that its shortcuts point to falls below athreshold – a preset minimum distance in terms ofnumber of hops, the query will be forwarded tothe closest semantic cluster and hop towards thedestination semantic cluster. If not, x selects a short-

cut randomly, and forwards the query to this short-

cut. The same process is invoked until the distanceto D is below the threshold. This approach is similarto Kleinberg’s Small World network model in whicha query can be routed to any node in O(log2n) hops.

Our approach is based on the observation thatthe ring space can be equally divided into severalpartitions. Each node maintains two shortcuts

(s = 2) that are used to partition the ring space.For example, we can partition a 2m semantic spacewhere m = 3 into four by creating two shortcuts:one pointing to the opposite semantic cluster andanother pointing to the semantic cluster located ina quarter of the ring space. Given the maximumcluster size M, the system can have a total of

M Æ 2m+n�1 nodes when Mmin = 1. Let Cx denotethe cluster where x resides in and SCx denote thesemantic cluster that Cx corresponds to. SCx canbe obtained by truncating Cx to m bits from themost significant bit. The two semantic clustersSChalf and SCquarter that x’s shortcuts point to aredenoted as (SCx + 2i)mod 2m, where i = m � 1,m � 2. To initiate a search, x obtains D based ona query and checks which cluster range (partitionedby x’s shortcuts) D falls into. Then node x forwardsthe query to the closer semantic cluster through itsshortcut. If D is closer to SCx, node x will forwardthe query across its adjacent cluster towards D. Aquery takes a maximum of 2 + 2m�3 hops to reachthe destination semantic cluster.

The above search algorithm is shown in Fig. 5. Toillustrate, consider Fig. 4b, Peer 1 generates a queryand computes the destination semantic cluster asSC5. Peer 1 first realizes that SC5 falls into the inter-val [SC4,SC0] and SC4 is close to SC5. Then Peer 1forwards the query to Peer 5 at C17. As SC5 falls into[SC4,SC6] and C24 is closer to SC5 as compared toC17, Peer 5 forwards the query to SC6 through itsquarter shortcuts. Finally, the query reaches SC3

and is then flooded in both C22 and C20.The more shortcuts we create to partition the ring

space, the finer the granularity we gain to locate thedestination semantic cluster. As a result, we achievebetter search performance in terms of lower routinghops. However, more shortcuts imply higher cost ofcreating, updating and maintaining these shortcuts.In DSS, we set the number of shortcuts to two forthe reason of minimizing maintenance cost. To

Page 10: Information retrieval in schema-based P2P systems using ...zhang_da/pub/CN2007-gu.pdf · The widespread use of RDF-based information necessitates efficient information retrieval techniques

Fig. 5. Pseudocode of the search algorithm.

1 socam is a namespace. Please refer to the SOCAM projectWeb site at http://www1.i2r.a-star.edu.sg/~tgu for more details.

4552 T. Gu et al. / Computer Networks 51 (2007) 4543–4560

partition the ring space in a finer granularity whenthe number of semantic clusters m increases, wecan place the longest shortcut into different pointsin DSS. The other shortcut always points to the mid-dle semantic cluster between SCx and the semanticcluster that the longest shortcut points to. For exam-ple, if we place the longest shortcut to one-quarter ofthe ring, the ring space is divided by eight, and soon. More generally, the following theorem obtainsthe search path length for DSS.

Theorem 1. Given a m-dimensional DSS of N nodes,

with maximum cluster size M, number of bits to

identify sub-cluster n and number of shortcuts s, theaverage path length for routing across semantic

clusters is O 1s log2ðN=M � 2n�2Þ1=m� �

:

Proof. We follow a process similar to that in [17] toprove the theorem. In [17], Kleinberg proved thatthe optimal setting for shortcuts is fx = 1/xm, wherem is the dimensionality. Thus, in DSS, a peerchooses another peer at distance x as one of itsshortcuts using the pdf: fx = 1/xm for x Æ 2 [r, 1]where r, the minimum distance of a shortcut, is theaverage diameter of a semantic cluster. The averagesize of a semantic cluster is M

22n�1, there are alto-

gether N/M Æ 2n�2 semantic clusters in the system,and each semantic cluster takes charge of aM Æ 2n�2/N portion of the whole semantic space onaverage. Therefore, the diameter of each partitionr is approximately (M Æ 2n�2/N)1/m.

We extend the small world network model fromtwo-dimensional space to m-dimensional space. Weuse unit data space in DSS. Since each subspace hasside length r on average, there are 1/r subspacesalong each side. The distance between two clustersalong a dimension is the range of [1, 2, . . . , 1/r].Thus, we separate the search process into phases

1,2, . . . , log(1/r). Let d be the distance from a querymessage’s current node to the destination, anddi = 1/2i. Search is at phase i if di+1 6 d < di. Phasei ends when the message is forwarded to a peer lessthan di+1 distance away from the destination. Theset of peers less than di+1 distance away from thedestination is denoted as Di+1, whose volume is dm

iþ1.The largest distance from a peer at phase i to a peerin set Di+1 is di + di+1. Since a peer has s shortcuts,the probability that a peer at phase i has contacts toset Di+1 is at least s � dm

iþ1 � fdiþdiþ1¼ s=c � logð1=rÞ

where c is a constant that depends on m. Therefore,a query message requires c Æ log(1/r)/s steps to reachthe next phase on average. Since there are in totallog(1/r) phases, the total search path length isOð1s log2ðN=M � 2n�2Þ1=mÞ. h

3.5. Subscription

In addition to search requests which pull datafrom the network, DSS enables consumers to issuesubscription requests to the network and be notifiedwhen data changes occur over a period of time.When a subscription request is generated, it willbe first mapped to a semantic cluster (say D) andthen forwarded to all nodes in D. The mappingand routing processes of a subscription request areidentical to those of a search request. When a nodein D receives a subscription request, it will check itslocal RDF data and decide whether it should acceptthe request. For example, an application may sub-scribe the event ‘‘John is in the bedroom’’ in theRDF triple form of hsocam1 :John socam:locatedIn

socam:Bedroomi to the network. As this RDF triplemay not exist in the network (i.e., John may be in

Page 11: Information retrieval in schema-based P2P systems using ...zhang_da/pub/CN2007-gu.pdf · The widespread use of RDF-based information necessitates efficient information retrieval techniques

T. Gu et al. / Computer Networks 51 (2007) 4543–4560 4553

some other place) at the time of receiving a request,the subscription request may end up with no pro-ducers. To avoid losing potential producers or end-ing up with many irrelevant producers, we employthe following subscription acceptance policy, asshown in Fig. 6, and illustrate how it works in con-text-aware computing and sensor network domains.

Based on this policy, a producer peer attempts tomatch a subscription request against its local RDFdata. This policy works for a subscription requestin the form of any RDF triple pattern whosesubject, predicate or object may take variables.Although predicates can be specified as variables,this situation seldom occurs since users or applica-tions are always in favor of more specific events inreal-life applications. We now consider the case thata predicate is specified in a subscription request. If asubscription request’s predicate is of type Datatype-

Property, a producer peer determines if its localRDF data contains triple(s) with the same sub-ject–predicate pair as the request. For example, fora given subscription request hsocam:Bedroom

socam:lightLevel ‘LOW’i, a producer peer willaccept the request if there exists an RDF triple withsubject socam:Bedroom and predicate socam:light-

Level in its local data. If a subscription request’spredicate is of type ObjectProperty, a producer peerdetermines if its local RDF data contains triple(s)with the same predicate–object pair as the request.

Fig. 6. Subscription ac

For example, for a given subscription requesthsocam:John socam:locatedIn socam:Bedroomi, aproducer peer will accept the request if there existsan RDF triple with subject socam:locatedIn andpredicate socam:Bedroom in its local data.

To understand the rationale behind this tech-nique, consider a subscription request in the formof the RDF triple hSubs,Preds,Objsi. Such a triplemay be obtained from raw data generated by asensor which could be physical or virtual. In thedomain of sensor networks, a predicate alwayscorresponds to a sensor type. For example,socam:locatedIn corresponds to a physical locationsensor and socam:participateIn corresponds to a vir-tual activity sensor. If Preds is of DatatypeProperty,Subs should correspond to the target this sensor ismonitoring, while Objs corresponds to the sensoroutput. For example, the RDF triple of hsocam:Bed-

room socam:lightLevel ‘LOW’i can be interpreted asthe output of a light level sensor monitoring thebedroom’s light level. If a producer peer’s localRDF data contains at least one triple with thisSubs-Preds pair, it can be inferred that this producerpeer has the type of sensor specified by this pair.Hence, we can conclude that this producer peercan provide triples of this same subject–predicatepair. On the other hand, if Preds is of ObjectProp-

erty, Objs should correspond to the target this sen-sor is monitoring, while Subs corresponds to the

ceptance policy.

Page 12: Information retrieval in schema-based P2P systems using ...zhang_da/pub/CN2007-gu.pdf · The widespread use of RDF-based information necessitates efficient information retrieval techniques

4554 T. Gu et al. / Computer Networks 51 (2007) 4543–4560

sensor output. In this case, the producer can providetriples with the same Subs-Preds pair as in the sub-scription request.

Once a producer peer accepts a subscriptionrequest, it keeps monitoring the request. Whenevera change (i.e., an RDF triple is added or removed)occurs, the producer peer will notify the subscribersif the RDF triple matches the subscription request.An RDF triple hSubcPredcObjci is said to matchthe subscription request if (Subc = Subs) \(Predc = Preds) \ (Objc = Objs) = 1. The routing ofnotification traces the exact path of the subscriptionrequest in the reverse direction. A subscriber canunsubscribe an event by sending an unsubscriptionrequest directly to its producers.

3.6. Peer dynamics and failure

In dynamic environments, a node may join andleave the system freely. To keep track of its neigh-boring nodes in DSS, a node maintains a numberof additional backup links for every link a nodehas. The approach is used in many other P2P sys-tems such as Pastry and CAN. However, in a highlydynamic environment, detecting link failure duringthe query routing process can introduce additionaloverhead. Moreover, in the event of failure of allits backup links, a node has to re-establish its neigh-boring links during the search operation, and henceit may affect search performance. With thisapproach, a node will need to inform its neighbor-ing nodes about its leaving and transfer its indicesto a randomly selected node in its cluster beforeleaving. Another approach is that each node period-ically sends a keep-alive message to each neighbor-ing node such as the ping message in Gnutella-likeoverlay networks. If no response is received, theneighboring node is assumed to be dead and anew link needs to be established. The failure detec-tion is done in an off-line manner to avoid affectingsearch performance, but it may increase the overalltraffic. In this approach, a node is not required toinform its neighboring nodes before its leaving. Anode leaves the system by simply transferring itsindices. In the above two approaches, when a nodeis involved in subscription, it has to transfer its backroute information to a node in its cluster or informthe subscriber about its leaving. Both the above twoapproaches have their pros and cons, and require agood study on the justification when they areapplied to a real-life application. In the followingevaluations, we rely only on the backup states to

study how well DSS performs in the presence offailure.

4. Evaluation

We use simulation to evaluate the effectiveness ofDSS and compare DSS with SONs [12]. We showthe performance results by setting various variablessuch as m, n, M and shortcut positions, and justifyour choices. We first describe our simulation modeland the performance metrics. Then we report theresults obtained from a range of experiments.

4.1. Simulation model and metrics

To simulate the performance of DSS in a morerealistic environment, we create two types of net-work topologies in our model: physical topologyand P2P overlay topology. All peer nodes are a sub-set of nodes in the physical topology. We use the ASmodel to generate these topologies as previous stud-ies have shown that P2P overlay topologies [20] fol-low the small world and power law properties.

The simulation is started by having a pre-existingnode in the network and then performing a series ofjoin operations invoked by new arriving nodes. Anode joins a semantic cluster based on its local dataand publishes its data indices. Various RDF dataare mapped into different semantic clusters and eachsemantic cluster is associated with a unique ID rang-ing from 0 to 2m. RDF data stored in each peer maybe heterogeneous or homogeneous. To evaluate thecapability of handling heterogeneous data in DSS,we introduce a parameter b, which is the ratio ofthe number of semantic clusters corresponding toall the local data stored in a peer to the maximumnumber (2m) of semantic clusters. b falls into therange of 1/2m to 1. When b = 1/2m, it implies that apeer has homogeneous RDF data in its local reposi-tory which maps to one particular semantic cluster inDSS. When b = 1, it implies that a peer has heteroge-neous RDF data which maps to all the semantic clus-ters; however, this case is unlikely to occur in real-lifeapplications. In our experiments, we set b to 1/2m,0.25 and 0.5 respectively. The semantic cluster(s)are selected in random by each peer according to b.

A peer also selects a random node in each of thesemantic clusters to publish its indices, if necessary.When the size of a semantic cluster exceeds the max-imum size M (in nodes), it will be split into two.This operation may be performed recursively untilthe number of sub-clusters reaches 2n. After the net-

Page 13: Information retrieval in schema-based P2P systems using ...zhang_da/pub/CN2007-gu.pdf · The widespread use of RDF-based information necessitates efficient information retrieval techniques

0.2

0.4

0.6

0.8

1

1 2 4 8 16 32 64 128 256

frac

tion

of n

odes

con

tact

ed p

er q

uery

# of semantic sub-spaces (2m)

GnutellaDSS with β = 1/2m

DSS with β = 0.25DSS with β = 0.5

Fig. 7. Fraction of nodes contacted per query.

T. Gu et al. / Computer Networks 51 (2007) 4543–4560 4555

work reaches a certain size, a mixture of node join-ing and leaving are invoked to simulate the dynamiccharacteristic of the overlay network. Each node isassigned a query generation rate, which is the num-ber of queries that it generates per unit time. In ourexperiments, each node generates queries at a con-stant rate. If a node receives queries at a rate thatexceeds its capacity to process them, the excess que-ries are queued in its buffer until the node is ready toread the queries from the buffer. Data are randomlyreplicated on nodes at a fraction a. Thus, queryingfor data with fraction a implies that a query hitcan be found at a fraction a of all the nodes in thesystem. A query is selected randomly among differ-ent semantic dimensions.

When a node initiates a query, it is first mapped toa particular semantic cluster, and then routed to thedestination semantic cluster and flooded to all thesub-clusters in parallel. In our simulation study, weuse a Gnutella overlay network to organize nodeswithin a cluster. The average outgoing degree of anode in its cluster is set to 4 by default, and shortcuts

are set to the half and quarter of the ring space unlessspecified. For the simplicity of generating RDF datain our simulation model, we use a set of keywords torepresent RDF data triples; different sets of key-words correspond to different semantic clusters. Inour simulation, we use the following performancemetrics to measure the effectiveness of DSS:

Fraction of nodes contacted per query is the aver-age fraction of nodes contacted for a query. It cap-tures the efficiency of a lookup system. A smallerfraction of nodes implies less overhead in thenetwork.

Search path length is the average number of hopstraversed by a query to the destination.

Search cost is the average number of query mes-sages incurred during a search operation in thenetwork.

Maintenance cost is the average number of mes-sages incurred when a node joins or leaves the net-work. It consists of the costs of node joining andleaving, cluster splitting and merging and indexpublishing. We measured these costs in terms ofnumber of messages.

Routing load is the average number of query mes-sages a node processes.

4.2. Semantic cluster mapping

For the evaluation of the semantic clusteringmapping, we have implemented a working proto-

type based on the Jena 2 toolkit [9]. We created aset of RDF-based data which corresponds to eachof the domain-specified ontologies and variousquery patterns. The system performs the mappingprocess by iterating each data triple or query. Weran the prototype on a 2.0 GHz Pentium machinewith 1 GB of memory. We used 1000 different datatriples and 1000 different queries in this experiment.On average, the mapping process takes about 3.38 sfor the mapping initialization and 0.234 ms for themapping of each data triple and each query. Themapping initialization reads the ontology filesstored locally and generates internal data structuresfor mapping. It is done only once when a peer startsand is only repeated if there is a change in the ontol-ogies. The computation cost of our mapping processis much lower as compared to the computation costof LSI (results can be found in [13]).

4.3. Search efficiency

The efficiency of executing a search request iscaptured in the fraction of nodes contacted andsearch path length during the search. For a givenquery, DSS only needs to contact N/2m nodes inthe system plus those nodes pointed to by a set ofindices. Fig. 7 plots the fraction of nodes contactedper query by setting n to 0 (i.e., parallel search in asemantic cluster is disabled) and varying the numberof semantic clusters from 20 to 28. The values areobtained by taking the average over various net-work sizes N from 28 to 213. As expected, the frac-tion of nodes contacted per query decreases inproportion to 1/2m. When b = 0.25 or 0.5, DSShas to contact about a quarter or a half of nodes.This is because that, besides contacting all the nodes

Page 14: Information retrieval in schema-based P2P systems using ...zhang_da/pub/CN2007-gu.pdf · The widespread use of RDF-based information necessitates efficient information retrieval techniques

4556 T. Gu et al. / Computer Networks 51 (2007) 4543–4560

in the destination semantic cluster, DSS has to con-tact the nodes in other semantic clusters pointed toby their indices. Due to randomness of selectingsemantic clusters and nodes to publish its indicesby a peer, the fraction of nodes contacted is almostidentical to b. Note that for a search request, Gnu-tella has to contact every node in the network. Inthe case of SONs, this fraction is equal to �C=Cmax,where �C is the average number of SONs each nodeparticipates in and Cmax is the maximum number ofSONs in the system. With less nodes contacted byDSS and SONs, the network traffic load incurredby a query will also be reduced. The result confirmsthat DSS has the same efficiency as SONs in termsof number of nodes contacted per query.

Fig. 8 shows the search path length comparingDSS, SONs and Gnutella when the network size N

is varied from 28 to 213. We disabled the clusteringeffect by setting M to 1 for DSS since Gnutella doesnot have a clustering feature. We also disabled par-allel search in DSS by setting n to 0. Hence, the net-work size N is 2m�1. Since M = 1 and n = 0, therewill be no flooding within a semantic cluster. Asshown in Fig. 8, the search path lengths for bothDSS and SONs increase slowly with the networksize as compared to Gnutella, confirming that thesearch path is bound (note that the x-axis uses alog scale). The search path length for DSS is almostidentical to the one for SONs, showing that DSShas the same search effectiveness as SONs. In thecase that a peer has heterogeneous local data (i.e.,b = 0.25 or 0.5), the search path length is almostidentical to the case that a peer has homogeneouslocal data (i.e., b = 1/2m). It shows that it does nothave any negative effect on DSS in terms of searchpath length when a peer has heterogeneous local

0

500

1000

1500

2000

2500

3000

3500

4000

256 512 1024 2048 4096 8192

sear

ch p

ath

leng

th

# of nodes

GnutellaSONs

DSS with β = 1/2m

DSS with β = 0.25DSS with β = 0.5

Fig. 8. Search path length.

data. This is because a peer can directly contactthe node(s) in other semantic clusters using a setof indices that point to them.

In DSS, we explore the parallel search mechanismwithin a semantic cluster. We evaluated the parallelsearch effect by comparing DSS and SONs. We setup a network with m = 4 and varied the network sizefrom 210 to 213. We set n to 2 and 3 respectively forDSS; as a result, a semantic cluster will be split intotwo when the size exceeds N/25 and N/26. Hence, asearch can be performed in parallel among thesesub-clusters. Fig. 9 shows that the parallelism inDSS has effectively reduced search path length ascompared to SONs. The result also shows that theparallel search effect increases (i.e., search pathlength decreases) with respect to n. The results inboth Figs. 8 and 9 show that the search path lengthin DSS is sensitive to 2m and n, but not sensitive to b.

4.4. Overhead

In this experiment, we evaluated search overheadby comparing search costs among DSS, SONs andGnutella. We set m to 5 (i.e., the number of seman-tic clusters is 32), n to 0 (parallel search is disabled),and varied the network size from 28 to 213. Asshown in Fig. 10, the search cost of Gnutellaincreases rapidly when the network size grows. Incontrast, DSS and SONs significantly reduce thesearch cost with the setting of 32 semantic clusters.We repeated the experiment by turning on the par-allel search mechanism (i.e., n = 2 and 3) whilekeeping other settings. We obtained similar resultsas in the case where n = 0. This confirms that theparallel search mechanism in DSS does not incur

0

50

100

150

200

250

300

1024 2048 4096 8192

sear

ch p

ath

leng

th

# of nodes

SONsDSS (n=2, β = 0.5)

DSS (n=2, β = 0.25)DSS (n=2, β = 1/2m)DSS (n=3, β = 1/2m)

Fig. 9. The effect of parallel search in DSS.

Page 15: Information retrieval in schema-based P2P systems using ...zhang_da/pub/CN2007-gu.pdf · The widespread use of RDF-based information necessitates efficient information retrieval techniques

0

20

40

60

80

100

120

140

160

256 512 1024 2048 4096 8192

sear

ch c

ost (

103 )

# of nodes

GnutellaSONs; 32 clusters

DSS (β = 1/2m); 32 clustersDSS (β = 0.25); 32 clusters

DSS (β = 0.5); 32 clusters

Fig. 10. Search cost.

T. Gu et al. / Computer Networks 51 (2007) 4543–4560 4557

any extra search overhead. When b = 0.25 or 0.5,search cost increases because search requests haveto reach many nodes in other semantic clusters(other than D) as well, which are pointed to by aset of indices.

In this experiment, we evaluated the averagemaintenance cost by comparing DSS and SONs.The maintenance cost of SONs only contains thecost of node joining and leaving. As shown inFig. 11, the maintenance cost of SONs increasesrapidly when the number of semantic clusters(dimensions) grows. This is because the requirednumber of outgoing degrees of a node in SONsincreases in proportional to the dimension. In thecase of DSS with M = 32 and n = 2, the averagemaintenance cost of a node consists of the costs ofnode joining and leaving, cluster splitting and merg-ing and index publishing. The maintenance cost inDSS also increases with respect to the dimension,

0

20

40

60

80

100

120

140

160

180

200

220

1 2 4 8 16 32 64 128 256

avg

mai

nten

ance

cos

t

# semantic sub-spaces (2 m)

SONsDSS with β = 1/2m

DSS with β = 0.25DSS with β = 0.5

Fig. 11. Average maintenance cost.

but in a much lower gradient. In the case of hetero-geneous data stored in peers (i.e., b = 0.25 or 0.5),the maintenance cost increases due to the increasedcost of index publishing; however, it is still muchlower than SONs as shown in Fig. 11. This confirmsour design goal of reducing maintenance overheadsincurred by high-dimensional semantic overlay net-works such as in SONs.

4.5. Clustering effects

In this section, we evaluate the effect of clusteringin DSS by varying the cluster size M from 20 to 210.We first evaluate the effect of cluster size on searchpath length by constructing a network of sizeN = 210. We turn off parallel search within a seman-tic cluster by setting n to 0, and ensure no dataduplication in DSS. Hence all clusters are semanticclusters. We also set b to 1/2m as we focus on clusteroperations in this section. Fig. 12 plots the searchpath length in DSS when M increases from 20 to210. The search path length across clusters decreaseswhile the search path length within clustersincreases with larger cluster sizes (note that thereare 210 clusters in the network when M = 1 and onlyone cluster when M = 210). This is because with afixed network size, the total number of clusters inDSS decreases with larger cluster sizes. Fig. 12 sug-gests that the search path length achieves its mini-mum when the number of semantic cluster equalsto 32, 16 and 8 corresponding to M = 32, 64 and128 respectively.

With the same settings as in the previous experi-ment, we evaluated the search cost and its break-down within clusters and across clusters withvarious cluster sizes. From Fig. 13, we observe thatthe search cost in DSS increases rapidly from a

0

10

20

30

40

50

1 2 4 8 16 32 64 128 256 512 1024

cluster size M

sear

ch p

ath

leng

th

search path length within semantic clusterssearch path length across semantic clusters

Fig. 12. Search path length vs. cluster size M.

Page 16: Information retrieval in schema-based P2P systems using ...zhang_da/pub/CN2007-gu.pdf · The widespread use of RDF-based information necessitates efficient information retrieval techniques

1

10

100

1000

10000

1 2 4 8 16 32 64 128 256 512 1024

cluster size M

sear

ch c

ost

search cost within semantic clusterssearch cost across semantic clusters

Fig. 13. Search cost vs. cluster size.

4558 T. Gu et al. / Computer Networks 51 (2007) 4543–4560

point where M = 16. This is due to the effect ofblind flooding within a cluster.

We plot the cost of node joining/leaving and clus-ter splitting/merging over different cluster sizes inFig. 14. As there are fewer clusters in DSS with lar-ger cluster sizes, a new node requires a smaller num-ber of hops to join the network. Therefore the costof joining/leaving decreases with respect to M. Witha larger cluster size, cluster splitting and mergingoccur less frequently, resulting in a lower clustersplitting/merging cost.

From the results in this section, we observe thatthe setting of 16 and 32 semantic clusters providesa good tradeoff between search efficiency and over-head. With larger cluster sizes, the search pathlength and the cost of node joining/leaving and clus-ter splitting/merging are not so sensitive to M ascompared to the search cost. One should notice thatwe set n to 0 in the experiments in this section. If theparallel search mechanism is turned on (n > 0), the

10

20

30

40

50

1 2 4 8 16 32 64 128 256 512 1024cost

of n

ode

join

ing/

leav

ing,

clus

ter

split

ting/

mer

ging

cluster size M

Fig. 14. The cost of node joining/leaving and cluster splitting/merging vs. cluster size.

search path length can be further reduced as a querycan be flooded in parallel in a semantic cluster. Tofurther reduce the search cost incurred by blindflooding within a cluster, the Cost-Aware SelectiveFlooding technique proposed in [7] can be deployed.

4.6. Selection of shortcuts

In this experiment, we evaluated the effect of dif-ferent shortcuts in DSS and compared them to therandom shortcut which is originally used in a smallworld network model. We started a network withthe size of 210 nodes. Each semantic cluster has onlyone node by setting the cluster size M to 1 and n to0. Hence, the search path length for intra-clusterrouting equals to 0. We selected two shortcuts eitherfix-points or random-points in the network and var-ied the location of the longest shortcut. The othershortcut always points to the middle semantic clus-ter between the semantic cluster where a noderesides and the semantic cluster that the longestshortcut points to. We plot the search path lengthfor inter-cluster routing with various numbers ofsemantic clusters in Fig. 15. As compared to fix-point shortcuts, random shortcuts work well in lowerdimensional semantic spaces, but perform worse inhigher dimensional semantic spaces. The locationof fix-point shortcuts depends on the number ofsemantic spaces. Among these shortcuts, the 1/8shortcut seems to provide a balance for the size ofsemantic spaces below 512.

We evaluated the effect of clustering in DSS byvarying the cluster size M. The results suggest thatthe setting of 16 and 32 semantic clusters providesa good tradeoff between search efficiency and over-head. With larger cluster sizes, the search pathlength and the cost of node joining/leaving and clus-

Fig. 15. Selection of shortcuts.

Page 17: Information retrieval in schema-based P2P systems using ...zhang_da/pub/CN2007-gu.pdf · The widespread use of RDF-based information necessitates efficient information retrieval techniques

0

0.005

0.01

0.015

0.02

0.025

0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4

frac

tion

of n

odes

routing load

Fig. 16. Routing load.

T. Gu et al. / Computer Networks 51 (2007) 4543–4560 4559

ter splitting/merging are not as sensitive to M ascompared to the search cost.

4.7. Load balancing

We study the load balance in DSS from theaspects of data load, index load and routing load.Since the data load in terms of number of data tri-ples and the index load in terms of number of indi-ces are balanced under the uniform distribution ofdata, we present only the result of routing load inthis section. We evaluate the routing load processedper node in a network setting m = 3, n = 2 andM = 64. The average outgoing degree per node isset to 4 within a semantic cluster. A query is drawnrandomly from all the semantic clusters. Each nodeinitializes a lookup uniformly at random. Fig. 16shows that the routing load distribution across var-ious nodes is relatively well balanced.

5. Conclusion and future work

In this paper, we present a schema-based P2Psystem for information retrieval in dynamic envi-ronments. We propose several techniques such asontology-based semantic clustering, a cluster nam-ing scheme and routing techniques and show howto retrieve RDF data in both pull and push modes.These techniques can be well applied to any P2Psearching systems where schemas are explicitlydefined. The promising results from a range ofexperiments show that DSS works effectively andhas a good tradeoff between search efficiency andsearch cost. The overlay maintenance cost is low,and the system adapts to peer dynamics quicklyand can scale to a large number of peers.

This paper uses a standardized definition of theupper ontology and the low-layer ontologies for over-lay construction. This assumption may restrict the useof DSS in real-life applications since users can onlycreate RDF data according to pre-defined ontologies.However, since there are many on-going efforts to cre-ate standard ontologies for various applicationdomains, e.g., in e-commerce applications [21], andubiquitous and pervasive applications [22], we believeDSS will be widely applied in many real-life dynamicapplications. If ontology interoperability mecha-nisms are put in place, that will offer a greater flexibil-ity for our scheme. While this paper assumes the use ofGnutella-like overlay networks to organize peerswithin a sub-cluster, a DHT-based overlay networkcan be used to provide more efficient routing withthe tradeoff of higher maintenance overhead. Weare studying how to apply DHT-based routing tech-niques such as Chord or CAN to a sub-cluster inDSS while keeping maintenance overhead low.Finally, we believe that DSS can have a significantpractical impact on building large-scale, schema-based P2P lookup systems in dynamic environments.

References

[1] World Wide Web Consortium: Resource Description Frame-work. <http://www.w3.org/RDF>.

[2] Gnutella. <http://gnutella.wego.com>.[3] I. Stoica, R. Morris, D. Karger, F. Kaashoek, H. Balakrish-

nan, Chord: a scalable peer-to-peer lookup service forinternet applications, in: Proceedings of ACM SIGCOMM,San Diego, US, August 2001.

[4] S. Ratnasamy, P. Francis, M. Handley, R. Karp, S. Shenker,A scalable content addressable network, in: Proceedings ofACM SIGCOMM, San Diego, US, August 2001.

[5] A. Rowstron, P. Druschel, Pastry: scalable. Distributedobject location and routing for large-scale peer-to-peersystems, Lecture Notes in Computer Science 2218 (Novem-ber) (2001) 161–172.

[6] B.Y. Zhao, L. Huang, J. Stribling, S.C. Rhea, A.D. Joseph,J.D. Kubiatowicz, Tapestry: a resilient global-scale overlayfor service deployment, IEEE Journal on Selected Areas inCommunications 22 (1) (2004) 41–53.

[7] T. Gu, E. Tan, H.K. Pung, D. Zhang, ContextPeers: scalablepeer-to-peer search for context information, in: Proceedingsof International Workshop on Innovations in Web Infra-structure, in conjunction with the 14th World Wide WebConference (WWW 2005), Japan, May 2005.

[8] RDFStore. <http://rdfstore.sourceforge.net>.[9] Jena 2– A Semantic Web Framework. <http://www.hpl.hp.

com/semweb/jena2.htm>.[10] W. Nejdl, M. Wolpers, W. Siberski, C. Schmitz, M.

Schlosser, I. Brunkhorst, A. Lser, Super-peer-based routingand clustering strategies for RDF-based peer-to-peer net-works, in: Proceedings of the12th World Wide Web Confer-ence, May 2003.

Page 18: Information retrieval in schema-based P2P systems using ...zhang_da/pub/CN2007-gu.pdf · The widespread use of RDF-based information necessitates efficient information retrieval techniques

4560 T. Gu et al. / Computer Networks 51 (2007) 4543–4560

[11] P.A. Chirita, S. Idreos, M. Koubarakis, W. Nejdl, Publish/subscribe for RDF-based P2P networks, in: Proceedings ofthe 1st European Semantic Web Symposium, Greece, May2004.

[12] A. Crespo, H. Garcia-Molina, Semantic overlay networksfor P2P systems. Technical report, Stanford University,January 2003.

[13] M. Cai, M. Frank, RDFPeers: A scalable distributed RDFrepository based on a structured peer-to-peer network, in:Proceedings of the 13th International World Wide WebConference, New York, US, May 2004.

[14] C.Q. Tang, Z.C. Xu, S. Dwarkadas, Peer-to-peer informa-tion retrieval using self-organizing semantic overlay net-works, in: Proceedings of ACM SIGCOMM, Karlsruhe,Germany, August 2003.

[15] S.C. Deerwester, S.T. Dumais, T.K. Landauer, G.W. Fur-nas, R.A. Harshman, Indexing by latent semantic analysis,Journal of the American Society of Information Science 41(6) (1990) 391–407.

[16] M. Li, W.C. Lee, Anand Sivasubramaniam, D.L. Lee, Asmall world overlay network for semantic based search inP2P, in: Proceedings of the 2nd Workshop on Semantics inPeer-to-Peer and Grid Computing, in conjunction with theWorld Wide Web Conference, May, 2004.

[17] J. Kleinberg, The Small-World Phenomenon: an AlgorithmPerspective, in: Proceedings of the 32nd ACM Symposiumon Theory of Computing, 2000.

[18] RDQL. <http://www.w3.org/Submission/2004/SUBM-RDQL- 20040109/>.

[19] M. Berry, Z. Drmac, E. Jessup, Matrices, vector spaces, andinformation retrieval, SIAM Review 41 (2) (1999) 335–362.

[20] S. Saroiu, P. Gummadi, S. Gribble, A measurement study ofpeer-to-peer file sharing systems, in: Proceedings of Multi-media Computing and Networking, San Jose, CA, US,January 2002.

[21] OntoWeb Special Interest Groups (SIGs) on Content Stan-dardization for B2B E-Commerce. <http://zeus.ics.forth.gr/forth/ics/isl/projects/ontoweb/content_standards.html>.

[22] SOUPA – Standard Ontology for Ubiquitous and PervasiveApplications. <http://pervasive.semanticweb.org>.

Tao Gu (http://www1.i2r.a-star.edu.sg/~tgu) is a Scientist at the Institute forInfocomm Research, Singapore. Hereceived his Ph.D. in computer sciencefrom National University of Singaporein 2005. His current research interestsinvolve various aspects of ubiquitousand pervasive computing includingarchitectures, tools, distributed lookupalgorithms, and data management.Other areas of interest are Peer-to-Peer

computing and Wireless Sensor Networks.

Hung Keng Pung (http://www.comp.nu-s.edu.sg/~punghk) is an associate pro-fessor in the Department of ComputerScience at the National University ofSingapore. He received his Ph.D. fromthe University of Kent at Canterbury,UK. He heads the Networks Systemsand Services Laboratory as well asholding a joint appointment as a Princi-pal Scientist at Institute of InfocommResearch, Singapore. His research focu-

ses on context-aware systems, service-oriented computing, qualityof service management, protocol design and networking.

Daqing Zhang is a Principal Investigatorat the Institute for Infocomm Research(I2R), Singapore. He received his Ph.D.from University of Rome ‘‘La Sapienza’’and University of L’Aquila, Italy in1996. He has been initiating and leadingthe efforts in Smart Home, Context-aware Middleware, Context-awareMobile and Healthcare Applications inI2R, Singapore. His research interestsinclude smart home, pervasive comput-

ing, service-oriented computing, context-aware systems andambient intelligence.