IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 8, NO. 6, 2013

Enforcing Secure and Privacy-Preserving Information Brokering in Distributed Information Sharing

Fengjun Li, Bo Luo, Peng Liu, Dongwon Lee, and Chao-Hsien Chu

Abstract—To facilitate extensive collaborations, today's organizations have increasing needs for information sharing via on-demand information access. An Information Brokering System (IBS) atop a peer-to-peer overlay has been proposed to support information sharing among loosely federated data sources. It consists of diverse data servers and brokering components, which help client queries locate the data servers. However, many existing IBSs adopt server-side access control deployment and honest assumptions on brokers, and pay little attention to the privacy of the data and metadata stored and exchanged within the IBS. In this article, we study the problem of privacy protection in the information brokering process. We first give a formal presentation of the threat models, with a focus on two attacks: the attribute-correlation attack and the inference attack. Then, we propose a broker-coordinator overlay, as well as two schemes, an automaton segmentation scheme and a query segment encryption scheme, to share the secure query routing function among a set of brokering servers. With a comprehensive analysis of privacy, end-to-end performance, and scalability, we show that the proposed system can integrate security enforcement and query routing while preserving system-wide privacy with reasonable overhead.

Index Terms—Access control, information sharing, privacy

I. INTRODUCTION

In recent years, we have observed an explosion of information shared among organizations in many realms, ranging from business to government agencies. To facilitate efficient large-scale information sharing, many efforts have been devoted to reconciling data heterogeneity and providing interoperability across geographically distributed data sources. Meanwhile, peer autonomy and system coalition become a major trade-off in designing such distributed information sharing systems. Most of the existing systems work at two extremes of the spectrum: (1) in the query-answering model for on-demand information access, peers are fully autonomous but there is no system-wide coordination, so participants create pairwise client-server connections for information sharing; (2) in traditional distributed database systems, all the participants lose autonomy and are managed by a unified DBMS. Unfortunately, neither is suitable for many newly emerged applications, such as information sharing for healthcare or law enforcement, in which organizations share information in a conservative and controlled manner, not only out of business considerations but also due to legal reasons.

F. Li and B. Luo are with the Department of EECS, The University of Kansas, Lawrence, KS 66045 USA. E-mail: {fli,bluo}@ku.edu.

P. Liu, D. Lee and C. Chu are with the College of IST, The Pennsylvania State University. E-mail: {pliu,dlee,chu}@ist.psu.edu.

As an example, healthcare information systems, such as the Regional Health Information Organization (RHIO) [1], aim to facilitate access to and retrieval of clinical data across collaborative health providers. An RHIO is formed with multiple stakeholders, including hospitals, outpatient clinics, payers, etc. As a data provider, a participant would not assume free or complete sharing with others, since its data is legally private or commercially proprietary, or both. Instead, it is required to retain full control over the data and access to the data. Meanwhile, as a consumer, a health provider requesting data from other providers expects to protect private information (e.g., the requestor's identity and interests) in the querying process.

In such scenarios, sharing a complete copy of the data with others or "pouring" data into a centralized repository becomes impractical. To address the need for autonomy, federated database technology has been proposed [2], [3] to manage locally stored data with a federated DBMS and provide unified data access. However, the centralized DBMS still introduces data heterogeneity, privacy, and trust issues. Meanwhile, the peer-to-peer information sharing framework is often considered a solution between "sharing nothing" and "sharing everything". In its basic form, every pair of peers establishes two symmetric client-server relationships, and requestors send queries to multiple databases. This approach requires n(n-1) client-server relationships among n peers and does not scale.

In the context of sensitive data and autonomous data owners, a more practical and adaptable solution is to construct a data-centric overlay (e.g., [4], [5]) consisting of the data sources and a set of brokers that help locate data sources for queries [6], [7], [8], [9]. Such an infrastructure builds semantic-aware index mechanisms to route queries based on their content, which allows users to submit queries without knowing the data or server locations. In our previous study [9], [10], such a distributed system providing data access through a set of brokers is referred to as an Information Brokering System (IBS).

As shown in Figure 1, applications atop an IBS always involve some sort of consortium (e.g., an RHIO) among a set of organizations. Databases of different organizations are connected through a set of brokers, and metadata (e.g., data summaries, server locations) are "pushed" to the local brokers, which further "advertise" (some of) the metadata to other brokers. Each query is sent to the local broker and routed according to the metadata until it reaches the right database(s). In this way, a large number of information sources in different organizations are loosely federated to provide unified, transparent, and on-demand data access.


Fig. 1. An overview of the IBS infrastructure.

While the IBS approach provides scalability and server autonomy, privacy concerns arise, as brokers are no longer assumed fully trustable – they may be abused by insiders or compromised by outsiders. In this article, we present a general solution to the privacy-preserving information sharing problem. First, to address the need for privacy protection, we propose a novel IBS, named Privacy-Preserving Information Brokering (PPIB). It is an overlay infrastructure consisting of two types of brokering components: brokers and coordinators. The brokers, acting as mix anonymizers [11], are mainly responsible for user authentication and query forwarding. The coordinators, concatenated in a tree structure, enforce access control and query routing based on embedded non-deterministic finite automata – the query brokering automata. To prevent curious or corrupted coordinators from inferring private information, we design two novel schemes: (a) to segment the query brokering automata, and (b) to encrypt the corresponding query segments. While providing full capability to enforce in-network access control and to route queries to the right data sources, these two schemes ensure that a curious or corrupted coordinator cannot collect enough information to infer private information, such as "which data is being queried", "where certain data is located", or "what the access control policies are". We show that PPIB provides comprehensive privacy protection for on-demand information brokering, with insignificant overhead and very good scalability.

The rest of the paper is organized as follows: we discuss the privacy requirements and threats in the information brokering scenario in Section II, and introduce the related work and preliminaries in Section III. In Section IV, we present the two core schemes and then the PPIB approach. We discuss construction and maintenance in Section V, analyze system privacy and security in Section VI, evaluate the performance in Section VII, and conclude our work in Section VIII.

II. THE PROBLEM

A. Vulnerabilities and the Threat Model

In a typical information brokering scenario, there are three types of stakeholders, namely data owners, data providers, and data requestors. Each stakeholder has its own privacy: (1) the privacy of a data owner (e.g., a patient in an RHIO) is the identifiable data and the information carried by this data (e.g., medical records). Data owners usually sign strict privacy agreements with data providers to protect their privacy from unauthorized disclosure or use. (2) Data providers store the collected data and create two types of metadata, namely routing metadata and access control metadata, for data brokering. Both types of metadata are considered the privacy of a data provider. (3) Data requestors disclose identifiable and private information in the querying process. For example, a query about AIDS treatment reveals the (possible) disease of the requestor.

We adopt the semi-honest (i.e., honest-but-curious [12]) assumption for the brokers, and assume two types of adversaries: outside attackers and curious or corrupted brokering components. Outside attackers passively eavesdrop on communication channels. Curious or corrupted brokering components follow the protocols properly to fulfill their functions, while trying their best to infer others' private information from the information disclosed in the querying process.

In the information brokering infrastructure, privacy concerns arise when identifiable information is disseminated with no or poor disclosure control. Data providers push routing and access control metadata to brokers [6], [9], which also handle queries from requestors. Therefore, a curious or corrupted brokering server could: (1) learn query content and query location by intercepting a local query; (2) learn routing metadata and access control metadata from local data servers and other brokers; (3) learn data location from the routing metadata it holds. On the other hand, although eavesdroppers may not obtain plaintext data over encrypted communication, they can still learn query location and data location from eavesdropping. Moreover, with one or several types of such information, the attacker could further infer the privacy of different stakeholders. We classify the attacks into two major classes: the attribute-correlation attack and the inference attack.

Attribute-correlation attack. An attacker intercepts a query (in plaintext), which typically contains several predicates. Each predicate describes a condition, which sometimes involves sensitive and private data (e.g., name, SSN, or credit card number). If a query has multiple predicates or composite predicate expressions, the attacker can "correlate" the corresponding attributes to infer sensitive information about the data owner. This attack is known as the attribute-correlation attack:

Example 1. A tourist, Anne, is sent to the emergency room at California Hospital. Doctor Bob queries for her medical records through a medicare IBS. Since Anne has symptoms of leukemia, the query has two predicates: [name="Anne"] and [symptom="leukemia"]. Any malicious broker that has helped route the query could guess "Anne has a blood cancer" by correlating the two predicates in the query. □

Unfortunately, sensitive query content, including the predicates, cannot simply be encrypted, since such information is necessary for content-based query routing. Therefore, we face a paradox between the requirement for content-based brokering and the risk of attribute-correlation attacks.

Inference attack. A more severe privacy leak occurs when an attacker obtains more than one type of sensitive information and further associates them to learn explicit and implicit knowledge about the stakeholders. By "implicit", we mean the attacker infers the fact by "guessing". For example, an attacker can guess the identity of a requestor from her query location (e.g., IP address). Meanwhile, the identity of the data owner could be explicitly learned from the query content (e.g., a name or SSN in a predicate). Attackers can also use publicly available information to aid the inference. For example, if an attacker identifies a data server located at a cancer research center, he can tag the related queries as "cancer-related". In summary, we have three distinct combinations of private information, and the corresponding inferences: (1) From query location and data location, the attacker infers who (i.e., a specific requestor) is interested in what (i.e., a specific type of data). (2) From query location and query content, the attacker infers who is interested in what, or where someone is, if the query has predicates describing one's interests (e.g., symptom or medicine) or identifying a person (e.g., name or address). (3) From query content and data location, the attacker infers which data server holds which data. Hence, the attacker could continuously create artificial queries or monitor user queries to learn the data distribution of the system, which could be used to conduct further attacks.

Fig. 2. The architecture of PPIB.

B. Solution Overview

To address the privacy vulnerabilities in the current information brokering infrastructure, we propose a new model, namely Privacy-Preserving Information Brokering (PPIB). PPIB has three types of brokering components: brokers, coordinators, and a central authority (CA). The key to preserving privacy is to divide the work among multiple components in such a way that no single node can make a meaningful inference from the information disclosed to it.

Figure 2 shows the architecture of PPIB. Data servers and requestors from different organizations connect to the system through local brokers (green nodes in Fig. 2). Brokers are interconnected through coordinators (white nodes in Fig. 2). A local broker functions as the "entrance" to the system. It authenticates requestors and hides their identities from other PPIB components. It also permutes the query sequence to defend against local traffic analysis.

Coordinators are responsible for content-based query routing and access control enforcement. With privacy-preserving considerations, we cannot let a coordinator hold any rule in its complete form. Instead, we propose a novel automaton segmentation scheme to divide (metadata) rules into segments and assign each segment to a coordinator. Coordinators operate collaboratively to enforce secure query routing. Moreover, to prevent coordinators from seeing sensitive predicates, we propose a query segment encryption scheme to protect query content. The scheme divides a query into segments and encrypts each segment in such a way that no segment, apart from the ones needed to enforce secure routing, is revealed to the coordinators en route. Last but not least, we assume a separate brokering server, namely a central authority (CA), is responsible for key management and metadata maintenance.

III. BACKGROUND

A. Related Work

Research areas such as information integration, peer-to-peer file sharing systems, and publish-subscribe systems provide partial solutions to the problem of large-scale data sharing. Information integration approaches focus on providing an integrated view over large numbers of heterogeneous data sources by exploiting the semantic relationships between the schemas of different sources [13], [14], [15]. The PPIB study assumes that a global schema exists within the consortium; therefore, information integration is out of our scope.

Peer-to-peer systems are designed to share files and data sets (e.g., in collaborative science applications). Distributed hash table technology [16], [17] is adopted to locate replicas based on keyword queries. Although such technology has recently been extended to support range queries [18], its coarse granularity (e.g., files and documents) still falls short of our expressiveness needs. Further, P2P systems may not provide a complete set of answers to a request, while we need to locate all relevant data.

Addressing a conceptually dual problem, XML publish-subscribe systems (e.g., [19], [20]) are probably the most closely related technology to the proposed research: while PPIB locates relevant data sources for a given query and routes the query to these data sources, pub/sub systems locate relevant consumers for a given document and route the document to these consumers. However, due to this duality, we have different concerns: they focus on efficiently delivering the same piece of information to a large number of consumers, while we are trying to route a large volume of small-size queries to far fewer sites. Accordingly, the multicast solution in pub/sub systems does not scale in our environment, and we need to develop new mechanisms.

One idea is to build an XML overlay architecture that supports expressive query processing and security checking atop a normal IP network. In particular, specialized data structures are maintained on overlay nodes to route XML queries. In [5], a robust mesh is built to effectively route XML packets by making use of self-describing XML tags and overlay networks. Koudas et al. also propose a decentralized architecture for ad hoc XPath query routing across a collection of XML databases [6]. To share data among a large number of autonomous nodes, [21] studies content-based routing for path queries in peer-to-peer systems. Different from these approaches, PPIB seamlessly integrates query routing with security and privacy protection.

Research on anonymous communication provides a way to protect information from unauthorized parties. Different protocols have been proposed to allow a message sender to dynamically select a set of other users and relay its request through them [22], [23]. Such approaches can be incorporated into PPIB to protect the locations of data requestors and data servers from being known by irrelevant parties in the communication. However, aiming at enforcing access control during query routing, PPIB addresses more privacy concerns than anonymity alone, and thus faces more challenges.

Finally, much research has been devoted to distributed access control. [24] gives a good overview of access control in collaborative systems. In summary, earlier approaches implement access control mechanisms at the nodes of XML trees and filter out data nodes that users do not have authorization to access [25], [26]. These approaches rely heavily on the XML engines. View-based access control approaches create and maintain a separate view (e.g., a specific portion of the XML documents) for each user [27], [28], which incurs high maintenance and storage costs. Recently, an NFA-based query re-writing access control scheme has been proposed [29], [9], which has better performance than view-based approaches [26].

B. Preliminaries

1) XML Data Model and Access Control: The eXtensible Markup Language (XML) has emerged as the de facto standard for information sharing due to its rich semantics and extensive expressiveness. We assume that all the data sources in PPIB exchange information in XML format, i.e., take XPath [30] queries and return XML data. Note that the more powerful XML query language, XQuery, still uses XPath to access XML nodes. In XPath, predicates are used to eliminate unwanted nodes, and test conditions are contained within square brackets "[ ]". In our study, we mainly focus on value-based predicates.

To specify authorization at the node level, fine-grained access control models are desired. We adopt the 5-tuple access control policy used in the literature [31], [27], [29]. The policy consists of a set of access control rules (ACRs) of the form {subject, object, action, sign, type}, where (1) subject is the role to whom the authorization is granted; (2) object is a set of XML nodes specified by an XPath expression; (3) action is an operation such as "read", "write", or "update"; (4) sign ∈ {+, −} refers to access "granted" or "denied", respectively; and (5) type ∈ {LC, RC} denotes "local check" (i.e., applying the authorization only to the attributes or textual data of the context nodes) or "recursive check" (i.e., applying the authorization to all the descendants of the context node). A set of example rules is shown below:

Example 2. Example ACRs:
R1: {role1, /site//person/name, read, +, RC}
R2: {role1, /site/regions/asia/item, read, +, RC}
R3: {role2, /site/regions/asia/item, read, +, RC}
R4: {role2, /site/regions/*/item/name, read, +, RC}
R5: {role2, /site/regions/*/item[location="USA"]/description, read, +, RC} □
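
For concreteness, such a 5-tuple rule can be captured by a simple data holder. The following Java sketch is illustrative only: the class name, field types, and constructor are our assumptions and are not taken from the PPIB implementation.

public class AccessControlRule {
    String subject;  // role to whom the authorization is granted, e.g. "role1"
    String object;   // XPath expression selecting the protected nodes
    String action;   // "read", "write", or "update"
    char sign;       // '+' = access granted, '-' = access denied
    String type;     // "LC" = local check, "RC" = recursive check

    AccessControlRule(String subject, String object, String action, char sign, String type) {
        this.subject = subject;
        this.object = object;
        this.action = action;
        this.sign = sign;
        this.type = type;
    }

    public static void main(String[] args) {
        // Rule R1 from Example 2, expressed with this hypothetical class.
        AccessControlRule r1 =
            new AccessControlRule("role1", "/site//person/name", "read", '+', "RC");
        System.out.println(r1.subject + " -> " + r1.object);
    }
}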

Existing approaches for enforcing access control policies can be classified into engine-based [32], [25], [28], [33], [34], view-based [35], [36], [31], [37], [38], preprocessing [26], [27], [39], [40], [41], and post-processing [42] approaches. In particular, we adopt the Non-deterministic Finite Automaton (NFA) based approach presented in [29], which allows access control to be enforced outside the data servers and independently from the data. The NFA-based approach constructs NFA elements for four building blocks of common XPath axes (/x, //x, /*, and //*) so that XPath expressions, as combinations of these building blocks, can be converted to an NFA, which is used to match and rewrite incoming XPath queries. Please refer to [29] for more details on the QFilter approach.

Fig. 3. Data structure of an NFA state.

2) Content-based Query Brokering: Indexing schemes have been proposed for content-based XML retrieval [43], [44], [45], [46]. The index describes the address of the data server that stores a particular data item requested in a user query. Therefore, a content-based index rule should contain the content description and the address. In [9], we presented a content-based indexing model with index rules of the form I = {object, location}, where (1) object is an XPath expression that selects a set of nodes; and (2) location is a list of IP addresses of the data servers that hold the content. When a user queries the system, the XPath query is matched against the object field of the index rules, and the query is sent to the data server(s) specified by the location field of the matched rule(s). While other techniques (e.g., Bloom filters [7], [6]) can be used to implement content-based indexing, we adopt the model in [9] in our study since it can be directly integrated with the NFA-based access control enforcement scheme. The integrated NFA that captures both access control rules and index rules is a content-based query broker (QBroker).

Example 3. Example index rules:I1:{/site/people/person/name, 130.203.189.2}I2:{/site/regions//item[@id>"100"],135.176.4.56}I3:{/site/regions/samerica/item[@id>"200"],195.228.155.9}I4:{/site//namerica/item/name, 135.207.5.126}I5:{/site/regions/namerica/item/location,74.128.5.91} 2
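
As an illustration, an index rule and its lookup can be sketched in Java as follows. The class and method names are hypothetical and simply mirror the I = {object, location} model above; matching a query against the object field is reduced to plain string equality here, whereas the real QBroker matches XPath steps with an NFA.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class IndexRule {
    String object;          // XPath expression selecting a set of nodes
    List<String> location;  // IP addresses of the data servers that hold the content

    IndexRule(String object, List<String> location) {
        this.object = object;
        this.location = location;
    }

    // Collect the destinations of all rules whose object matches the query.
    static List<String> destinations(String query, List<IndexRule> rules) {
        List<String> servers = new ArrayList<String>();
        for (IndexRule r : rules) {
            if (r.object.equals(query)) servers.addAll(r.location);
        }
        return servers;
    }

    public static void main(String[] args) {
        IndexRule i1 = new IndexRule("/site/people/person/name", Arrays.asList("130.203.189.2"));
        System.out.println(destinations("/site/people/person/name", Arrays.asList(i1)));
    }
}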

Fig. 4. The state transition graph of the QBroker that integrates index rules with ACRs.

QBroker is an NFA constructed in a similar way as QFilter [29]. Fig. 3 shows the data structure of each NFA state in QBroker, where the state transition table stores the child nodes specified by the XPath expression as the child states in eSymbol. The binary flag DSState indicates that the state is a "double-slash" state, which means it recursively accepts any input symbol, and its child state is the ε-transition state that transits directly to the next state without consuming any input symbol. Fig. 4 shows a QBroker constructed from Examples 2 and 3. Different from QFilter, QBroker can capture ACRs for different roles by adding two binary arrays to each state: the AccessList determines the roles that are allowed to access the state, and the AcceptList indicates for which roles the state is an accept state. For instance, in Fig. 4, the accept list of state 5 is [1 0], indicating the state is an accept state for role1 but not for role2, and the access list of state 6 is [1 1], indicating this state is accessible by both roles. A LocationList is attached to each accept state. In the brokering process, the QBroker first checks whether a query is allowed to access the requested nodes according to the role type. If the query can access (a subset of) the requested data, it is rewritten into a "safe" query and sent to the data servers according to the location list; otherwise, the query is rejected.
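
The per-state bookkeeping described above can be sketched as a plain Java data structure. The field names follow Fig. 3 and the text, but the concrete types and the Transition helper are our own illustrative choices rather than the original implementation.

import java.util.List;
import java.util.Map;

public class NfaState {
    // State transition table: eSymbol (XPath token) -> transition entry.
    Map<String, Transition> stateTransTable;
    boolean dsState;           // true for a "double-slash" state that recursively accepts input symbols
    boolean[] accessList;      // accessList[i] == true: role i may access this state
    boolean[] acceptList;      // acceptList[i] == true: this state is an accept state for role i
    List<String> locationList; // data server addresses, attached to accept states only

    static class Transition {
        NfaState nextState;          // child state named by the eSymbol
        List<String> predicateTable; // pSymbol entries attached to this child state (may be empty)
    }
}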

IV. PRIVACY-PRESERVING QUERY BROKERING SCHEME

While QBroker [9] seamlessly integrates the content-based indexing function into the NFA-based access control mechanism, it relies heavily on a single QBroker for the enforcement and ships all the data (i.e., the ACRs, index rules, and user queries) to it. However, if the QBroker is compromised or no longer assumed fully trusted (e.g., under the honest-but-curious assumption as in our study), the privacy of both the requestor and the data owner is at risk. To tackle the problem, we present a privacy-preserving information brokering (PPIB) infrastructure with two core schemes. The automaton segmentation scheme divides the QBroker into multiple logically independent components so that each component only needs to process a piece of a user query but can still fulfill the original brokering functions via collaboration. The query segment encryption scheme encrypts query pieces with different keys so that each automaton component can decrypt only the piece(s) it is responsible for, without hindering the original distributed indexing function.

A. Automaton Segmentation

In the context of distributed information brokering, multiple organizations join a consortium and agree to share data within the consortium. While different organizations may have different schemas, we assume a global schema exists, obtained by aligning and merging the local schemas. Thus, the access control rules and index rules of all the organizations can be crafted following the same shared schema and captured by a global automaton, the global QBroker. The key idea of the automaton segmentation scheme is to logically divide the global automaton into multiple independent yet connected segments, and physically distribute the segments onto different brokering servers.

1) Segmentation: The atomic unit in the segmentation is an NFA state of the original automaton. Each segment is allowed to hold one or several NFA states. We further define the granularity level as the greatest distance between any two NFA states contained in one segment. Given a granularity level k, for each segmentation, the next i ∈ [1, k] NFA states are put into one segment with probability 1/k. Obviously, a larger granularity level indicates that each segment contains more NFA states, resulting in a smaller number of segments and less end-to-end overhead in distributed query processing. On the other hand, a coarser partition is more likely to increase the privacy risk. The trade-off between processing complexity and privacy requirements should be considered in deciding the granularity level. As privacy protection is the primary concern of this work, we suggest a granularity level ≤ 2. To preserve the logical connections between the segments after segmentation, we define heuristic segmentation rules: (1) multiple NFA states in the same segment should be connected via parent-child links; (2) sibling NFA states should not be put in the same segment without their parent state; and (3) the accept states of the original global automaton should be put in separate segments. To ensure the segments are logically connected, we change the last state of each segment into a "dummy" accept state, which points to the segment holding the child states in the original global automaton.

Algorithm 1 The automaton segmentation algorithm: deploySegment()
Input: Automaton state S
Output: Segment address addr
1:  for each symbol k in S.StateTransTable do
2:      addr = deploySegment(S.StateTransTable(k).nextState)
3:      DS = createDummyAcceptState()
4:      DS.nextState ← addr
5:      S.StateTransTable(k).nextState ← DS
6:  end for
7:  Seg = createSegment()
8:  Seg.addSegment(S)
9:  Coordinator = getCoordinator()
10: Coordinator.assignSegment(Seg)
11: return Coordinator.address

2) Deployment: We employ physical brokering servers, called coordinators, to store the logical segments. To reduce the number of coordinators needed, several segments can be deployed on the same coordinator, using different port numbers to distinguish them. Therefore, the tuple <coordinator, port> uniquely identifies a segment. For ease of presentation, we assume each coordinator holds only one segment in the rest of the article. After the deployment, the coordinators can be linked together according to the relative positions of the segments they store, and thus form a tree structure, where the coordinator holding the root state of the global automaton is the root of the coordinator tree and the coordinators holding the accept states are the leaf nodes of the tree. A query is processed along a path of the coordinator tree in a similar way as it would be processed by the global automaton: starting from the root coordinator, the first XPath step (token) of the query is compared with the tokens held by the root coordinator. If it matches, the query is sent to the next coordinator, and so on, until it is accepted by a leaf coordinator and then forwarded to the data server(s) holding the requested content nodes. At any coordinator, if the input XPath step does not match the stored tokens, the query is denied access and dropped immediately.

Fig. 5. An example to illustrate the automaton segmentation scheme: (a) divide the global automaton with granularity level 1; (b) the segments are linked to form a tree structure.
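
A minimal sketch of this per-coordinator routing step is given below in Java. The Segment and Coordinator classes, their fields, and the forward/drop helpers are all hypothetical names used only to make the control flow concrete; they are not part of the PPIB implementation.

import java.util.List;
import java.util.Map;

public class CoordinatorSketch {
    static class Segment {
        Map<String, String> nextCoordinator; // XPath token -> address stored in the dummy accept state
        boolean holdsAcceptState;            // true if this segment holds an accept state of the global automaton
        List<String> locationList;           // data server addresses (leaf coordinators only)
    }

    Segment segment;

    // Process the next unprocessed XPath step of a query.
    void process(List<String> steps, int next, Object query) {
        if (segment.holdsAcceptState) {
            // Accepted: send the safe query to every data server in the location list.
            for (String server : segment.locationList) forward(query, server);
            return;
        }
        String addr = segment.nextCoordinator.get(steps.get(next));
        if (addr == null) { drop(query); return; } // no match: deny access and drop immediately
        forward(query, addr);                      // matched: route to the coordinator holding the child state
    }

    void forward(Object query, String address) { /* network send omitted */ }
    void drop(Object query) { /* deny access */ }
}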

3) Replication: Since all queries are supposed to be processed first by the root coordinator, it becomes a single point of failure and a performance bottleneck. For better reliability and performance, we need to replicate the root coordinator as well as the coordinators at the higher levels of the coordinator tree. Replication has been extensively studied in distributed systems. We adopt a passive path replication strategy to create replicas for the coordinators along the paths of the coordinator tree, and let a centralized authority (CA) create or revoke the replicas (more details are discussed in Section V). The CA maintains a set of replicas for each coordinator, where the number of replicas is either a preset value or dynamically adjusted based on the average number of queries passing through that coordinator.

4) Handling the predicates: The NFA in our study is constructed in the same way as in previous access control enforcement approaches (e.g., QFilter [29], QBroker [9]). As shown in Fig. 3, each NFA state has a predicate table attached to each of its child states to store the predicate symbols (i.e., pSymbol), if any, at each XPath step of the query. An empty symbol φ denotes no predicate. The strategy for handling predicates, either from the query or from the ACRs, is lookup-and-attach. That is, if an XPath step in the query matches a child state in the state transition table (i.e., an eSymbol), any predicate attached to that XPath step or any predicate stored in the predicate table will be attached to the corresponding XPath step in the safe query and further evaluated by the data server. Leaving predicate handling to the database can increase query processing speed at a particular step; however, it may significantly increase unnecessary end-to-end communication and processing overhead. We illustrate this problem with the following example.

Example 4. Consider an input query Q = /site/regions//item[@id="30"]/name and the QBroker shown in Fig. 4. This query will be accepted at state 11, rewritten into a safe query, and sent to three data servers. However, from the index rules, we can see that the data servers at 195.228.155.9 and 135.176.4.56 do not contain the requested data. Routing a query to irrelevant data servers yields unnecessary overhead. □

To address this problem, we present a new scheme to handle value-based predicates in input XML queries. Structural predicates (i.e., predicates that contain twigs) are not supported by this scheme, due to the excessive overhead caused by waiting for responses from the twigs in a distributed setting.

We first change the data structure of the original NFA state by adding new fields, condition, type, and location, to the predicate table, where pSymbol stores the predicate token, condition stores the test condition, type ∈ {R, I} indicates whether the entry comes from an ACR or an index rule, and location stores the addresses from the index rule. In processing, if an XPath step of a query does not have a predicate, the scheme works in a similar way as before: it looks up the predicate table for predicates introduced by ACRs (i.e., with type = R) and attaches "pSymbol||condition" to the safe query. If a predicate exists in a particular XPath step, the scheme retrieves the records of the same predicate from the table and sends them, together with the query predicate, to a predicate directory server, which further examines the test conditions. In testing, the query predicate is first compared with the ACR predicate to determine whether the query passes the access control test. After that, the query predicate is further compared with the predicates introduced by the index rules, to limit the scope of possible destination data servers. For a predicate that passes both tests, the location of the index rule is attached. Accordingly, in query processing, if an accepted query is assigned multiple location lists, it will be sent to the data servers in the intersection of these lists.
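
The extended predicate-table entry can be sketched as follows. The field names mirror the text, while the class itself and the single-character type flag are illustrative assumptions.

import java.util.List;

public class PredicateEntry {
    String pSymbol;        // predicate token, e.g. "@id"
    String condition;      // test condition, e.g. ">\"100\""
    char type;             // 'R' = introduced by an ACR, 'I' = introduced by an index rule
    List<String> location; // addresses from the index rule; only meaningful when type == 'I'
}

In testing at the predicate directory server, entries with type 'R' are checked first (access control), then entries with type 'I' (to narrow the destination data servers), as described above.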

B. Query Segment Encryption

Informative hints may be learned from the content of a query, so it is critical to protect the query from being intercepted by irrelevant brokering servers. However, it is difficult, if not impossible, to hide the query content from any of the brokering servers in traditional brokering approaches, since each server is responsible for matching the query with the index rules or access control rules to enforce query routing or authorization. In our study, the automaton segmentation scheme provides a new encryption opportunity: we can encrypt the query in pieces and allow each coordinator to decrypt only the piece it is about to process. The query segment encryption scheme consists of pre-encryption, post-encryption, and a special commutative encryption module for processing the descendant-or-self ("//") XPath steps in a query.

1) Level-based pre-encryption: According to the automaton segmentation scheme, each query is processed by a set of coordinators along a path of the coordinator tree. If we encrypt the query segments with the public keys of the corresponding coordinators, we guarantee that each segment will be decrypted and processed only by the coordinator that is supposed to do so. Moreover, each coordinator only sees a small portion of the query, which is not enough for inference, yet by collaborating with the others it can still fulfill the designed functions. The key problem in this approach is that the segment-coordinator association is unknown beforehand in the distributed setting, since no party other than the CA knows how the global automaton is segmented and distributed among the coordinators.

To tackle the problem, we propose to encapsulate query pieces based on static and publicly known information – the global schema. An XML schema forms a tree structure, in which the level of a node in the schema tree is defined as its distance to the root node. Since both the ACRs and the index rules are constructed following the global schema, an XPath step (token) in the XPath expression of a rule is associated with level i if the corresponding node in the schema tree is at level i. We assume the nodes of the same level share a pair of public and private level keys, {pk, sk}. After automaton segmentation, a segment (and the corresponding coordinator) is assigned the private key of level i, sk_i, if it contains a node of level i. In pre-encryption, the XPath steps (between two "/" or "//") of a query are encrypted from the root with the public level keys {pk_1, pk_2, ...}, respectively. Intuitively, the i-th XPath step of a query should be processed by a segment with a node at level i, and therefore can be decrypted by the coordinator holding that segment. Moreover, if a coordinator has a segment that contains XML nodes of k different levels, it needs to decrypt the first k unprocessed XPath steps of the query.
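
To make the level-based pre-encryption concrete, the sketch below splits a query into XPath steps and encrypts the i-th step with the level-i public key. It is illustrative only: the helper names are ours, RSA is used merely as a placeholder for the level-key cipher, and the level shift caused by "//" (handled separately by the commutative encryption) is ignored here.

import java.nio.charset.StandardCharsets;
import java.security.PublicKey;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.crypto.Cipher;

public class QueryPreEncryptor {

    // Split an XPath query into steps, keeping the leading "/" or "//" of each step.
    static List<String> splitSteps(String xpath) {
        List<String> steps = new ArrayList<String>();
        Matcher m = Pattern.compile("//?[^/]+").matcher(xpath);
        while (m.find()) {
            steps.add(m.group());
        }
        return steps;
    }

    // Encrypt the i-th step with the level-i public key.
    static List<byte[]> preEncrypt(String xpath, List<PublicKey> levelPublicKeys) throws Exception {
        List<String> steps = splitSteps(xpath);
        List<byte[]> encrypted = new ArrayList<byte[]>();
        for (int i = 0; i < steps.size(); i++) {
            Cipher cipher = Cipher.getInstance("RSA");
            cipher.init(Cipher.ENCRYPT_MODE, levelPublicKeys.get(i));
            encrypted.add(cipher.doFinal(steps.get(i).getBytes(StandardCharsets.UTF_8)));
        }
        return encrypted;
    }
}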

2) Post-encryption: The processed query segments should also be protected from all the coordinators in later processing, so post-encryption is necessary. In a simple scheme, we assume all the data servers share a pair of public and private keys, {pk_DS, sk_DS}, where pk_DS is known to all the coordinators. Each coordinator first decrypts the query segment(s) with its private level key, performs authorization and indexing, and then encrypts the processed segment(s) with pk_DS so that only the data servers can view them.

3) Commutative encryption for "//" handling: When a query has the descendant-or-self axis (i.e., "//" in XPath expressions), a so-called mismatching problem occurs at the coordinator that takes the "//" XPath step as input. This is because the "//" XPath step may recursively accept several tokens until it finds a match. Consequently, the coordinator holding the private level key may not be the one that matches the "//" token, and vice versa. We use the following example (illustrated in Fig. 6(a)) to explain this problem.

Fig. 6. (a) An example of the mismatching problem; (b) an example of commutative encryption.

Example 5. Assume the global automaton is segmented as shown in Fig. 5. Coordinators C1, C2, C5, and C8 hold the segments Seg1, Seg2, Seg5, and Seg8, respectively. According to the level-based query segment encryption scheme, a query //item/name will be encrypted as < //item >_{pk1} < /name >_{pk2}. When C1 gets "//item", it finds that the token does not match the NFA state "site", but it still accepts it due to the property of "//". C2, while still not able to process "//item", can decrypt the further, irrelevant query segment "/name" with its private level key sk2. □

To tackle this problem, we revise the level-based encryption scheme by adopting commutative encryption, a class of encryption algorithms [12], [47], [48] with the commutative property: an encryption algorithm is commutative if, for any two commutative keys e1 and e2 and any message m, << m >_{e1} >_{e2} = << m >_{e2} >_{e1}.

In this scheme, we assign a new commutative level key e_i to each level, which is also commutative with the original level keys. We further assume the commutative level key of level i is shared with the nodes at level i+1. The commutative encryption is invoked by the first coordinator encountering the "//" XPath step in a query and ended by the first coordinator whose NFA states match the "//" token. The entire process goes through four stages, and a flag ∈ {0, 1, 2, 3} is attached to the query to represent the current stage. The detailed algorithm is as follows:

1) When a coordinator C_m first encounters the "//" XPath step, it sets a pointer to the first unprocessed query segment, encrypts all the unprocessed query segments (except the "//" XPath step) with its commutative level key e_m, and sets the flag to 1.

2) With flag = 1, the next coordinator C_{m+1} first adds a second encapsulation to the unprocessed query segments with its commutative level key e_{m+1}; then it decrypts the pointed query segment with its private level key sk_{m+1} and moves the pointer to the next segment. After that, it sets the flag to 2.

3) For a following coordinator C_j, if none of its NFA states matches the "//" token, it first removes one wrapping (by decrypting with d_{j-2}) and adds one wrapping (by encrypting with e_j) to the unprocessed query segments. Due to the commutative encryption property, the unprocessed query segments change from << unprocessed >_{e_{j-2}} >_{e_{j-1}} to << unprocessed >_{e_{j-1}} >_{e_j}. Then it decrypts the pointed segment with its private level key sk_j and moves the pointer to the next segment.

4) When a coordinator C_n accepts the "//" token, it applies commutative decryption to all the unprocessed segments with key d_{n-2}, and encrypts each of them with the public level keys {pk_{n+1}, pk_{n+2}, ...}, respectively. After that, it sets the flag to 3.

5) Coordinator C_{n+1} decrypts all the unprocessed segments with d_{n-1} and resets the flag to 0.

The core idea of the commutative encryption is to wrap the unprocessed query segments after the "//" XPath step with two consecutive commutative layer keys that are never possessed by the same coordinator. The additional wrapping is kept until the commutative encryption process is stopped by a match of the "//" token. In practice, we adopt the Pohlig-Hellman exponentiation cipher with modulus p as our commutative encryption algorithm to generate the commutative keys.
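
As an illustration of the commutative primitive, the sketch below implements a Pohlig-Hellman-style exponentiation cipher over a shared prime modulus p with Java BigInteger. Parameter generation is simplified and the message is assumed to be already encoded as an integer smaller than p, so this is a sketch of the idea rather than the PPIB key-management code.

import java.math.BigInteger;
import java.security.SecureRandom;

public class CommutativeCipher {
    private final BigInteger p; // shared prime modulus
    private final BigInteger e; // encryption exponent with gcd(e, p-1) = 1
    private final BigInteger d; // decryption exponent, d = e^{-1} mod (p-1)

    public CommutativeCipher(BigInteger p, SecureRandom rnd) {
        this.p = p;
        BigInteger pMinusOne = p.subtract(BigInteger.ONE);
        BigInteger cand;
        do {
            cand = new BigInteger(pMinusOne.bitLength() - 1, rnd);
        } while (cand.compareTo(BigInteger.valueOf(2)) < 0
                 || !cand.gcd(pMinusOne).equals(BigInteger.ONE));
        this.e = cand;
        this.d = e.modInverse(pMinusOne);
    }

    public BigInteger encrypt(BigInteger m) { return m.modPow(e, p); } // m^e mod p
    public BigInteger decrypt(BigInteger c) { return c.modPow(d, p); } // c^d mod p

    public static void main(String[] args) {
        SecureRandom rnd = new SecureRandom();
        BigInteger p = BigInteger.probablePrime(512, rnd);
        CommutativeCipher c1 = new CommutativeCipher(p, rnd);
        CommutativeCipher c2 = new CommutativeCipher(p, rnd);
        BigInteger m = BigInteger.valueOf(123456789L);
        // Commutativity check: << m >_{e1} >_{e2} equals << m >_{e2} >_{e1}
        System.out.println(c2.encrypt(c1.encrypt(m)).equals(c1.encrypt(c2.encrypt(m))));
    }
}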

Example 6. Let us revisit the previous example with commutative encryption. As shown in Fig. 6(b), once C1 accepts the "//" XPath step "//item", it starts to add the commutative wrapping to the second query segment "/name" by encrypting it with e1. C2 removes the wrapping of the original level-based encryption and adds one more wrapping by encrypting with e2. C5 still does not match the "//" token, so it only changes the commutative wrapping to the ones related to e2 and e3. When C8 matches "//item", it unwraps one layer of commutative encryption with d2 and adds a level-based encryption with pk5 to yield an output of << /name >_{e3} >_{pk5}, which can be decrypted by C9. □

C. The Overall PPIB Architecture

Fig. 7. We explain the query brokering process in four phases.

The architecture of the privacy-preserving information brokering system is shown in Fig. 7, where users and data servers of multiple organizations are connected via a broker-coordinator overlay. A user requests remote data by sending a query to the local broker, which further forwards the query to the root of the coordinator tree. The query is processed along a path of the coordinator tree until it is denied by a coordinator or accepted by a leaf coordinator. The accepted query is rewritten into a safe query and sent to the relevant data server(s). In particular, the brokering process consists of four phases:

• Phase 1: To join the system, a user needs to authenticate himself to the local broker. After that, the user submits an XML query with each segment encrypted by the corresponding public level key, together with a unique session key K_Q, encrypted with the public key of the data servers, for the data server to return the data.

• Phase 2: Besides authentication, the major task of the broker is metadata preparation (see the sketch after this list): (1) it extracts the role of the authenticated user and attaches it to the encrypted query; (2) it creates a unique ID (QID) for each query, and attaches the QID and its own address (as well as < K_Q >_{pk_DS}) to the query so that the data server can directly return the data.

• Phase 3: When the root of the coordinator tree receives the query and its metadata from a local broker, it follows the automaton segmentation scheme and the query segment encryption scheme to perform access control and indexing, routing the query within the coordinator tree until it reaches a leaf coordinator, which further forwards the query to the related data servers. For any query that is denied access based on the ACRs, a failure message with the QID is returned to the broker. At the leaf coordinator, all the query segments have been processed and are encrypted with the public key of the data servers.

• Phase 4: In the final phase, the data server receives a safe query in encrypted form. After decryption, the data server evaluates the query and returns the data, encrypted by K_Q, to the broker of the query.
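
As noted in Phase 2, the metadata a broker attaches can be pictured as a small envelope around the encrypted query. The Java sketch below is purely illustrative; the class and field names are our assumptions and do not come from the PPIB implementation.

import java.util.List;

public class QueryEnvelope {
    List<byte[]> encryptedSteps; // query segments, each encrypted with its public level key (Phase 1)
    byte[] encryptedSessionKey;  // K_Q encrypted under pk_DS, so only data servers can recover K_Q
    String role;                 // role of the authenticated requestor, extracted by the broker
    String qid;                  // unique query ID created by the broker
    String brokerAddress;        // return address, so the data server can reply through this broker
}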

V. MAINTENANCE

A. Key Management

A central authority (CA) is assumed for off-line initiation and maintenance. With the highest level of trust, the CA holds a global view of all the rules and plays a critical role in automaton segmentation and key management. Four types of keys are involved in the brokering process: the query session key K_Q, the public and private level keys {pk, sk}, the commutative level keys {e, d}, and the public and private keys of the data servers {pk_DS, sk_DS}. Except for the query session keys, which are created by the users, the other three types of keys are generated and maintained by the CA. The data servers are treated as a single party and share one pair of public and private keys, while each coordinator has its own level key pair and commutative level key pair. Along with the automaton segmentation and deployment process, the CA creates key pairs for coordinators at each level and assigns the private keys with the segments. The level keys need to be revoked in a batch once a certificate expires or when a coordinator at the same level quits the system.

B. Brokering Servers Join/Leave

Brokers and coordinators, contributed by different organizations, are allowed to dynamically join or leave the PPIB system. Besides authentication, a local broker only works as an entrance to the coordinator overlay. It stores the address of the root coordinator (and its replicas) for forwarding queries. When a new broker joins the system, it registers with the CA to receive the current address list and broadcasts its own address to the local users. When leaving the system, a broker only needs to broadcast a leave message to the local users. Things are more complicated for the coordinators. Upon joining the system, a new coordinator sends a join request to the CA. The CA authenticates its identity and assigns automaton segments to it, considering both the load balance requirement and its trust level. After that, the CA issues the corresponding private level keys and broadcasts a ServerJoin(addr) message to update the location list attached to the parent coordinator with the address of the newly joined coordinator. When a coordinator leaves the system, the CA decides whether to employ an existing or a new coordinator as a replacement, based on the heuristic rules for automaton deployment and the current load at each coordinator. After that, the CA broadcasts a ServerLeave(addr1, addr2) message to replace the address of the old coordinator with the address of the new one in the location list at the dummy accept state of the parent coordinator. Finally, the CA revokes the corresponding level keys. If a failure is detected by a periodic status check by the CA or reported by a neighboring coordinator, the CA treats the failed coordinator as a leaving server.

C. Metadata Update

ACRs and index rules should be updated to reflect changes in the access control policy or the data distribution of an organization.

1) Index rules: To add or remove a (set of) data object(s), a local server needs to send an update message of the form DataUpdate(object, address, action) to the CA, where object is an XPath expression describing a set of XML nodes, address is the location of the data object, and action is either "add" or "remove". For adding a data object, the CA sends the update message to the root coordinator, from which the message traverses the coordinator network until reaching a leaf coordinator, where the address is appended to its location list. A similar process is used for data object removal: the corresponding leaf coordinators are retrieved and the address is removed from their location lists.
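
A DataUpdate message of this form could be represented as follows; the class is a hypothetical sketch that simply mirrors the (object, address, action) triple described above.

public class DataUpdate {
    String object;  // XPath expression describing the affected XML nodes
    String address; // location of the data object (data server address)
    String action;  // "add" or "remove"

    DataUpdate(String object, String address, String action) {
        this.object = object;
        this.address = address;
        this.action = action;
    }
}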

2) Access control rules: Any change in the access control policy can be described by (a set of) positive or negative access control rules. Therefore, we construct an ACRUpdate(role, object, type) message to reflect the change for a particular role and send it to the CA. The CA forwards the message to the root coordinator, from which the XPath expression in object is processed by each coordinator according to its state transition table, in the same way as when constructing an automaton from a new ACR: if the message stops at a particular NFA state, that state is changed to an accept state for the role. Then, all the child and descendant leaf coordinators are retrieved and their location lists are attached to the accept state. If the message is accepted by an existing leaf coordinator, new automaton segments are created and assigned to new coordinators. The location list at the original leaf coordinator is copied to the new leaf coordinator.

VI. PRIVACY AND SECURITY ANALYSIS

There are various types of attackers in the information brokering process. By their roles, we have abused insiders and malicious outsiders; by their capabilities, we have passive eavesdroppers and active attackers that can compromise brokering servers; by their cooperation mode, we have single and collusive attackers. In this section, we consider the three most common types of attackers: local and global eavesdroppers, malicious brokers, and malicious coordinators. We first analyze the possible privacy breaches caused by each of them, and then summarize the possible privacy exposures in Table I.

1) Eavesdroppers: A local eavesdropper is an attacker who can observe all communication to and from the user side. Once an end user initiates a query or receives the requested data, the local eavesdropper can capture the outgoing and incoming packets. However, it can only learn the location of the local broker from the captured packets, since the content is encrypted. Although local brokers are exposed to this kind of eavesdropper, a local broker, as the gateway of the system, prevents further probing of the entire system. Although the disclosed broker location can be used to launch a DoS attack against local brokers, a backup broker and some recovery mechanisms can easily defend against this type of attack. In conclusion, an outside attacker who is not powerful enough to compromise the brokering components of the system poses limited harm to system security and privacy.

A global eavesdropper is an attacker who observes the traffic in the entire network. It watches brokers and coordinators gossip, so it is capable of inferring the locations of local brokers and root coordinators, because the connections between user and broker, and between broker and root coordinator, are directly observable. However, from the subsequent communication, the eavesdropper cannot distinguish the coordinators from the data servers. Therefore, the major threat from a global eavesdropper is the disclosure of broker and root-coordinator locations, which makes them targets of further DoS attacks.

2) Single malicious broker: A malicious broker deviates from the prescribed protocol and discloses sensitive information. It is obvious that a corrupted broker endangers user location privacy but not the privacy of query content. Moreover, since the broker knows the root-coordinator locations, the remaining threat is the disclosure of root-coordinator locations and a potential DoS attack.

3) Collusive coordinators: Collusive coordinators deviate from the prescribed protocol and disclose sensitive information. Consider a set of collusive (corrupted) coordinators in the coordinator tree framework.


TABLE I
THE POSSIBLE PRIVACY EXPOSURE CAUSED BY FOUR TYPES OF ATTACKERS: LOCAL EAVESDROPPER (LE), GLOBAL EAVESDROPPER (GE), MALICIOUS BROKER (MB), AND COLLUSIVE COORDINATORS (CC).

Privacy type      | LE        | GE               | MB        | CC
User location     | Exposed   | Exposed          | Exposed   | Protected
Query content     | Protected | Exposed          | Exposed   | Exposed only with compromised root coordinator
AC policy         | Protected | Protected        | Protected | Exposed if path coordinators collude
Index rules       | Protected | Protected        | Protected | Exposed if path coordinators collude
Data distribution | Protected | Protected        | Protected | Exposed if path coordinators collude
Data location     | Protected | Beyond suspicion | Protected | Exposed with malicious leaf coordinators

Even though each coordinator can observe the traffic on a path routed through it, nothing will be exposed to a single coordinator because (1) the sender visible to it is always a brokering component; (2) the content of the query is incomplete due to query segment encryption; (3) the ACR and indexing information are also incomplete due to automaton segmentation; and (4) the receiver visible to it is likely to be another coordinator. However, a privacy vulnerability exists if a coordinator makes reasonable inferences from additional knowledge. For instance, if a leaf coordinator knows how the PPIB mechanism works, it can confirm its own role (by checking the automaton segment it holds) and infer that the destinations attached to this segment are data servers. As another example, a coordinator can compare the ACR segment it holds with the open schemas and make a reasonable inference about its position in the coordinator tree. However, inferences made by one coordinator may be vague or even misleading.

VII. PERFORMANCE ANALYSIS

In this section, we analyze the performance of the proposed PPIB system in terms of end-to-end query processing time and system scalability. In our experiments, coordinators are coded in Java (JDK 5.0) and results are collected from coordinators running on a Windows desktop (3.4 GHz CPU). We use the XMark [49] XML document and DTD, which is widely used in the research community. As a good imitation of real-world applications, XMark simulates an online auction scenario.

A. End-to-End Query Processing Time

End-to-end query processing time is defined as the time elapsed from the point when a query arrives at the broker to the point when safe answers are returned to the user. We consider the following four components: (1) average query brokering time at each broker/coordinator (T_C); (2) average network transmission latency between brokers/coordinators (T_N); (3) average query evaluation time at the data server(s) (T_E); and (4) average backward data transmission latency (T_backward).

Query evaluation time highly depends on the XML database system, the size of the XML documents, and the types of XML queries.

[Fig. 8. Estimating the overall processing time at each coordinator. (a) Average query brokering time at a coordinator; (b) average symmetric and asymmetric encryption time. X-axis: number of keywords at a query broker; Y-axis: time (ms).]

Once these parameters are set in the experiments, T_E remains the same (at the seconds level [50]). Similarly, the same query set and ACR set create the same safe query set, and the same data result is generated by the data servers. As a result, T_E and T_backward are not affected by the broker-coordinator overlay network. We therefore only need to calculate and compare the total forward query processing time (T_forward) as T_forward = T_C × N_HOP + T_N × (N_HOP + 1). It is obvious that T_forward is only affected by T_C, T_N, and the average number of hops in query brokering, N_HOP.

1) Average query processing time at the coordinator: Query processing time at each broker/coordinator (T_C) consists of: (1) access control enforcement and locating the next coordinator (query brokering); (2) generating a key and encrypting the processed query segment (symmetric encryption); and (3) encrypting the symmetric key with the public key created by the super node (asymmetric encryption).

To examine T_C, we manually generate five sets of access control rules. The access control rules of each set are partitioned into segments (keywords), which are assigned to coordinators. From set 1 to set 5, the number of keywords held by one coordinator increases from 1 to 5. We also generate 1000 synthetic XPath queries, and use Triple DES for symmetric encryption and RSA for asymmetric encryption. Figure 8(a) shows that query brokering time is at the milliseconds level and increases linearly with the number of keywords at a site. Figure 8(b) shows that symmetric and asymmetric encryption time is also at the milliseconds level, and asymmetric encryption time dominates the total query processing time at a coordinator. As a result, the average T_C is about 1.9 ms. Query processing times at brokers and leaf coordinators are shorter but still at the same level. For simplicity, we adopt the same value (i.e., 1.9 ms) for the average query processing time at brokers and coordinators.
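To make the two encryption steps concrete, the following minimal sketch reproduces the measured operations with the standard javax.crypto API: a one-time Triple DES key encrypts the processed query segment, and that key is then wrapped with an RSA public key standing in for the one issued by the super node. The key sizes, padding choices, and variable names are illustrative assumptions, not the exact PPIB configuration.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import java.nio.charset.StandardCharsets;
import java.security.KeyPair;
import java.security.KeyPairGenerator;
import java.security.PublicKey;

// Sketch of the per-segment hybrid encryption: Triple DES on the query segment,
// RSA wrapping of the Triple DES key. All parameters here are assumptions.
public class SegmentEncryptionSketch {
    public static void main(String[] args) throws Exception {
        // Stand-in for the RSA key pair distributed by the super node (assumption).
        KeyPair rsa = KeyPairGenerator.getInstance("RSA").generateKeyPair();
        PublicKey superNodeKey = rsa.getPublic();

        String querySegment = "/site/people/person";     // example processed segment

        // (1) Symmetric part: generate a one-time Triple DES key and encrypt the segment.
        SecretKey desKey = KeyGenerator.getInstance("DESede").generateKey();
        Cipher des = Cipher.getInstance("DESede/ECB/PKCS5Padding");
        des.init(Cipher.ENCRYPT_MODE, desKey);
        byte[] encryptedSegment = des.doFinal(querySegment.getBytes(StandardCharsets.UTF_8));

        // (2) Asymmetric part: wrap the Triple DES key with the super node's RSA public key.
        Cipher rsaCipher = Cipher.getInstance("RSA/ECB/PKCS1Padding");
        rsaCipher.init(Cipher.WRAP_MODE, superNodeKey);
        byte[] wrappedKey = rsaCipher.wrap(desKey);

        System.out.printf("segment: %d bytes, wrapped key: %d bytes%n",
                encryptedSegment.length, wrappedKey.length);
    }
}
```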

2) Average network transmission latency: We adopt an average Internet traffic latency of 100 ms as a reasonable estimation of T_N,¹ instead of using data collected from our gigabit Ethernet.

3) Average number of hops: We consider the case in which a query Q is accepted or rewritten by n ACRs {R_1, ..., R_n} into the union of n safe sub-queries {Q'_1, ..., Q'_n}. When an accepted/rewritten sub-query Q'_i is processed by the rule R_i, the number of hops it experiences is determined by the number of segments of R_i. In the experiment, we generate a set of 200 synthetic access control rules and 1000 synthetic XPath queries.

¹ http://www.internettrafficreport.com


[Fig. 9. System scalability: number of coordinators. (a) Using simple path rules; (b) using XPath rules with wildcards. X-axis: number of access control rules; Y-axis: number of coordinators.]

It is obvious that the more segments the global automaton is divided into, the more coordinators are needed and the less scalable the system is, due to the increased query processing cost. However, higher granularity leads to better privacy preservation. We choose the finest-granularity automaton segmentation (each XPath step of an ACR is partitioned as one segment and kept at one coordinator) for maximum privacy preservation. Our experimental results show that the average N_HOP is 5.7, and the maximum number of hops over all queries is 8.
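Under this finest-granularity segmentation, the hop count of a sub-query equals the number of XPath steps of the rule it matches. The minimal sketch below illustrates that relationship; treating wildcard steps like ordinary steps is a simplifying assumption for illustration.

```java
import java.util.Arrays;
import java.util.List;

// Minimal sketch: finest-granularity segmentation keeps one XPath step per coordinator,
// so the hop count of a sub-query equals the number of steps of the ACR it matched.
public class SegmentationSketch {
    static List<String> segment(String acrPath) {
        // e.g. "/site/people/person/profile" -> [site, people, person, profile]
        return Arrays.asList(acrPath.replaceFirst("^/+", "").split("/+"));
    }

    public static void main(String[] args) {
        List<String> segments = segment("/site/people/person/profile");
        System.out.println("segments = " + segments);
        System.out.println("hops for a matching sub-query = " + segments.size());
    }
}
```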

4) End-to-end query processing time: From the above experimental results, the total forward query processing time is calculated as T_forward ≈ 1.9 × 5.7 + 100 × (5.7 + 1) ≈ 681 ms. It is obvious that the network latency, T_N × (N_HOP + 1), dominates the total forward end-to-end query processing time, because T_C is negligible compared with T_N. Moreover, since T_N remains the same (as an estimation from Internet traffic), N_HOP becomes the deterministic factor that affects end-to-end query processing time. Note that although other information brokering systems use different query routing schemes, network latency is unavoidable for them as well. In conclusion, the proposed PPIB approach achieves privacy-preserving query brokering and access control with limited computation overhead.
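As a quick check of the arithmetic, the estimate above can be recomputed directly from the measured averages reported in this section; this snippet is only a worked verification, not part of the evaluation code.

```java
// Recomputing T_forward from the averages reported above.
public class ForwardTimeCheck {
    public static void main(String[] args) {
        double tC = 1.9;      // ms, average processing time per broker/coordinator
        double tN = 100.0;    // ms, estimated average Internet latency
        double nHop = 5.7;    // average number of hops in query brokering
        double tForward = tC * nHop + tN * (nHop + 1);
        System.out.printf("T_forward = %.1f ms%n", tForward);   // about 681 ms
    }
}
```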

B. System Scalability

We evaluate the scalability of the PPIB system against the complexity of the ACRs, the number of user queries, and the data size (number of data objects and data servers).

1) Complexity of XML schema and ACR: When the segmentation scheme is determined, the demand for coordinators is determined by the number of ACR segments, which is linear in the number of access control rules. Assuming the finest-granularity automaton segmentation is adopted, we can see that the increase in the demanded number of coordinators is linear or even better, as shown in Figure 9(a) and (b). This is because similar access control rules with the same prefix may share XPath steps, which saves coordinators. Moreover, different ACR segments (i.e., logical coordinators) may reside at the same physical site, thus reducing the actual demand for physical sites. In this framework, the number of coordinators, m, and the height of the coordinator tree, h, are highly dependent on how access control policies are segmented.

2) Number of queries: Considering n queries submitted to the system in a unit time, we use the total number of query segments processed in the system to measure the system load.

[Fig. 10. System scalability: number of query segments. (a) Using simple XPath rules and simple XPath queries; (b) using simple XPath queries and rules with wildcards; (c) using queries and rules with 5% wildcard probability at each XPath step; (d) using queries and ACRs with 10% wildcard probability at each XPath step. X-axis: number of queries in a unit time; Y-axis: total number of segments in the system.]

When a query is accepted as multiple sub-queries, all sub-queries are counted towards the system load. For a query that is rejected after i segments, the i processed segments are counted.

We generate 5 sets of synthetic ACRs and 10 sets of synthetic XML queries with different numbers and wildcard (i.e., "/*" and "//") probabilities at each XPath step in each experiment. Figure 10 shows system load versus the number of XPath queries in a unit time. More specifically, Figure 10(a) only has simple path rules (without wildcards or predicates), and Figure 10(b) has rules with wildcards. In both cases, system load increases linearly and each query is processed into fewer than 10 segments. Figures 10(c) and (d) use the same set of ACRs as in Figure 10(b), but add wildcards into queries with probability 5% and 10% at each step, respectively. In the worst case, each query is processed into no more than 50 segments. Moreover, comparing the curves in each sub-figure, we can see that larger ACR sets lead to higher system load, but the increase appears linear in all cases.

3) Data size: When the data volume increases (e.g., adding more data items into the online auction database), the number of indexing rules also increases. This results in an increase in the number of leaf coordinators. However, in PPIB, query indexing is implemented through hash tables, which is scalable. Thus, the system remains scalable as data size increases.
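The scalability claim rests on constant-time lookups in the leaf coordinators' location indexes. A minimal sketch of such a hash-based index is given below; keying the table on the data-object XPath expression is an assumption made for illustration, since the exact keying scheme is not specified here.

```java
import java.util.*;

// Minimal sketch of a hash-based location index at a leaf coordinator:
// adding data objects grows the table, but lookups stay O(1) on average.
public class LocationIndexSketch {
    private final Map<String, Set<String>> index = new HashMap<>(); // object XPath -> server addresses

    void add(String objectXPath, String serverAddress) {
        index.computeIfAbsent(objectXPath, k -> new HashSet<>()).add(serverAddress);
    }

    Set<String> lookup(String objectXPath) {
        return index.getOrDefault(objectXPath, Collections.emptySet());
    }

    public static void main(String[] args) {
        LocationIndexSketch idx = new LocationIndexSketch();
        idx.add("/site/regions/namerica/item", "server-17.example.org"); // hypothetical server
        System.out.println(idx.lookup("/site/regions/namerica/item"));
    }
}
```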

VIII. CONCLUSION

With little attention drawn to the privacy of users, data, and metadata during the design stage, existing information brokering systems suffer from a spectrum of vulnerabilities associated with user privacy, data privacy, and metadata privacy. In this paper, we propose PPIB, a new approach to preserving privacy in XML information brokering. Through an innovative automaton segmentation scheme, in-network access control, and query segment encryption, PPIB integrates security enforcement and query forwarding while providing comprehensive privacy protection. Our analysis shows that it is highly resistant to privacy attacks. End-to-end query processing performance and system scalability are also evaluated, and the results show that PPIB is efficient and scalable.

Many directions lie ahead for future research. First, at present, site distribution and load balancing in PPIB are conducted in an ad hoc manner. Our next step is to design an automatic scheme for dynamic site distribution. Several factors can be considered in such a scheme, including the workload at each peer, the trust level of each peer, and privacy conflicts between automaton segments. Designing a scheme that strikes a balance among these factors is a challenge. Second, we would like to quantify the level of privacy protection achieved by PPIB. Finally, we plan to minimize (or even eliminate) the participation of the administrator node, which decides such issues as automaton segmentation granularity. A main goal is to make PPIB self-reconfigurable.

REFERENCES

[1] W. Bartschat, J. Burrington-Brown, S. Carey, J. Chen, S. Deming, and S. Durkin, "Surveying the RHIO landscape: A description of current RHIO models, with a focus on patient identification," Journal of AHIMA, vol. 77, pp. 64A–D, January 2006.
[2] A. P. Sheth and J. A. Larson, "Federated database systems for managing distributed, heterogeneous, and autonomous databases," ACM Computing Surveys (CSUR), vol. 22, no. 3, pp. 183–236, 1990.
[3] L. M. Haas, E. T. Lin, and M. A. Roth, "Data integration through database federation," IBM Syst. J., vol. 41, no. 4, pp. 578–596, 2002.
[4] X. Zhang, J. Liu, B. Li, and T.-S. P. Yum, "CoolStreaming/DONet: A data-driven overlay network for efficient live media streaming," in Proceedings of IEEE INFOCOM, 2005.
[5] A. C. Snoeren, K. Conley, and D. K. Gifford, "Mesh-based content routing using XML," in SOSP, pp. 160–173, 2001.
[6] N. Koudas, M. Rabinovich, D. Srivastava, and T. Yu, "Routing XML queries," in ICDE '04, p. 844, 2004.
[7] G. Koloniari and E. Pitoura, "Peer-to-peer management of XML data: issues and research challenges," SIGMOD Rec., vol. 34, no. 2, 2005.
[8] M. Franklin, A. Halevy, and D. Maier, "From databases to dataspaces: a new abstraction for information management," SIGMOD Rec., vol. 34, no. 4, pp. 27–33, 2005.
[9] F. Li, B. Luo, P. Liu, D. Lee, P. Mitra, W. Lee, and C. Chu, "In-broker access control: Towards efficient end-to-end performance of information brokerage systems," in Proc. IEEE SUTC, 2006.
[10] F. Li, B. Luo, P. Liu, D. Lee, and C.-H. Chu, "Automaton segmentation: A new approach to preserve privacy in XML information brokering," in ACM CCS '07, pp. 508–518, 2007.
[11] D. L. Chaum, "Untraceable electronic mail, return addresses, and digital pseudonyms," Communications of the ACM, vol. 24, no. 2, 1981.
[12] R. Agrawal, A. Evfimievski, and R. Srikant, "Information sharing across private databases," in Proceedings of the 2003 ACM SIGMOD, 2003.
[13] M. Genesereth, A. Keller, and O. Duschka, "Infomaster: An information integration system," in SIGMOD, Tucson, 1997.
[14] I. Manolescu, D. Florescu, and D. Kossmann, "Answering XML queries on heterogeneous data sources," in VLDB, pp. 241–250, 2001.
[15] J. Kang and J. F. Naughton, "On schema matching with opaque column names and data values," in SIGMOD, pp. 205–216, 2003.
[16] I. Stoica, R. Morris, D. Liben-Nowell, D. Karger, M. Kaashoek, F. Dabek, and H. Balakrishnan, "Chord: A scalable peer-to-peer lookup protocol for Internet applications," IEEE/ACM Transactions on Networking, vol. 11, no. 1, 2003.
[17] R. Huebsch, B. Chun, J. Hellerstein, B. Loo, P. Maniatis, T. Roscoe, S. Shenker, I. Stoica, and A. Yumerefendi, "The architecture of PIER: an Internet-scale query processor," in CIDR, pp. 28–43, 2005.
[18] O. Sahin, A. Gupta, D. Agrawal, and A. E. Abbadi, "A peer-to-peer framework for caching range queries," in ICDE, 2004.
[19] A. Carzaniga, M. J. Rutherford, and A. L. Wolf, "A routing scheme for content-based networking," in Proc. of INFOCOM, 2004.
[20] Y. Diao, S. Rizvi, and M. J. Franklin, "Towards an Internet-scale XML dissemination service," in VLDB Conference, Toronto, August 2004.
[21] G. Koloniari and E. Pitoura, "Content-based routing of path queries in peer-to-peer systems," in EDBT, pp. 29–47, 2004.
[22] M. K. Reiter and A. D. Rubin, "Crowds: anonymity for Web transactions," ACM TISSEC, vol. 1, no. 1, pp. 66–92, 1998.
[23] P. F. Syverson, D. M. Goldschlag, and M. G. Reed, "Anonymous connections and onion routing," in IEEE Symposium on Security and Privacy, Oakland, California, pp. 44–54, 1997.
[24] W. Tolone, G.-J. Ahn, T. Pai, and S.-P. Hong, "Access control in collaborative systems," ACM Comput. Surv., vol. 37, no. 1, 2005.
[25] S. Cho, S. Amer-Yahia, L. V. S. Lakshmanan, and D. Srivastava, "Optimizing the secure evaluation of twig queries," in VLDB, 2002.
[26] M. Murata, A. Tozawa, and M. Kudo, "XML access control using static analysis," in ACM CCS, 2003.
[27] S. Rizvi, A. Mendelzon, S. Sudarshan, and P. Roy, "Extending query rewriting techniques for fine-grained access control," in SIGMOD '04, Paris, France, pp. 551–562, 2004.
[28] T. Yu, D. Srivastava, L. V. S. Lakshmanan, and H. V. Jagadish, "Compressed accessibility map: Efficient access control for XML," in VLDB, China, pp. 478–489, 2002.
[29] B. Luo, D. Lee, W. C. Lee, and P. Liu, "QFilter: Fine-grained run-time XML access control via NFA-based query rewriting enforcement mechanisms," in CIKM, 2004.
[30] A. Berglund, S. Boag, D. Chamberlin, M. F. Fernández, M. Kay, J. Robie, and J. Siméon, "XML path language (XPath) version 2.0," http://www.w3.org/TR/xpath20/, 2003.
[31] E. Damiani, S. Vimercati, S. Paraboschi, and P. Samarati, "A fine-grained access control system for XML documents," ACM TISSEC, vol. 5, no. 2, pp. 169–202, 2002.
[32] E. Damiani, S. di Vimercati, S. Paraboschi, and P. Samarati, "Securing XML documents," in EDBT 2000, pp. 121–135, 2000.
[33] H. Zhang, N. Zhang, K. Salem, and D. Zhuo, "Compact access control labeling for efficient secure XML query evaluation," Data & Knowledge Engineering, vol. 60, no. 2, pp. 326–344, 2007.
[34] Y. Xiao, B. Luo, and D. Lee, "Security-conscious XML indexing," in Advances in Databases: Concepts, Systems and Applications, 2007.
[35] E. Bertino, S. Castano, and E. Ferrari, "Securing XML documents with Author-X," IEEE Internet Computing, vol. 5, no. 3, pp. 21–31, 2001.
[36] E. Damiani, S. Vimercati, S. Paraboschi, and P. Samarati, "Design and implementation of an access control processor for XML documents," Computer Networks, vol. 33, no. 1–6, pp. 59–75, 2000.
[37] A. Gabillon and E. Bruno, "Regulating access to XML documents," in Proc. DAS, pp. 299–314, 2002.
[38] W. Fan, C.-Y. Chan, and M. Garofalakis, "Secure XML querying with security views," in ACM SIGMOD, pp. 587–598, 2004.
[39] M. Kudo, "Access-condition-table-driven access control for XML databases," in ESORICS 2004, pp. 17–32, 2004.
[40] S. Mohan, A. Sengupta, and Y. Wu, "Access control for XML: a dynamic query rewriting approach," in Proc. IKM, pp. 251–252, 2005.
[41] N. Qi and M. Kudo, "XML access control with policy matching tree," in ESORICS 2005, pp. 3–23, 2005.
[42] L. Bouganim, F. D. Ngoc, and P. Pucheral, "Client-based access control management for XML documents," in VLDB, pp. 84–95, 2004.
[43] S. Abiteboul, A. Bonifati, G. Cobena, I. Manolescu, and T. Milo, "Dynamic XML documents with distribution and replication," in ACM SIGMOD, pp. 527–538, 2003.
[44] P. Skyvalidas, E. Pitoura, and V. Dimakopoulos, "Replication routing indexes for XML documents," in DBISP2P Workshop, 2007.
[45] G. Skobeltsyn, Query-driven indexing in large-scale distributed systems. PhD thesis, EPFL, 2009.
[46] P. Rao and B. Moon, "Locating XML documents in a peer-to-peer network using distributed hash tables," TKDE, vol. 21, no. 12, 2009.
[47] C. Clifton, M. Kantarcioglu, J. Vaidya, X. Lin, and M. Zhu, "Tools for privacy preserving distributed data mining," ACM SIGKDD Explorations, vol. 4, no. 2, 2003.
[48] H. Y. S. Lu, "Commutative cipher based en-route filtering in wireless sensor networks," in VTC, vol. 2, pp. 1223–1227, Sept. 2004.
[49] A. Schmidt, F. Waas, M. Kersten, M. J. Carey, I. Manolescu, and R. Busse, "XMark: a benchmark for XML data management," in VLDB, pp. 974–985, 2002.
[50] H. Lu, J. X. Yu, G. Wang, S. Zheng, H. Jiang, G. Yu, and A. Zhou, "What makes the differences: benchmarking XML database implementations," ACM Trans. Inter. Tech., vol. 5, no. 1, pp. 154–194, 2005.