
Privacy-Preserving Selective Aggregation of Online User Behavior Data

Jianwei Qian, Fudong Qiu, Student Member, IEEE, Fan Wu, Member, IEEE, Na Ruan, Member, IEEE, Guihai Chen, Member, IEEE, and Shaojie Tang, Member, IEEE

Abstract—Tons of online user behavior data are being generated every day on the booming and ubiquitous Internet. Growing efforts have been devoted to mining the abundant behavior data to extract valuable information for research purposes or business interests. However, online users' privacy is thus at risk of being exposed to third parties. The last decade has witnessed a body of research on performing data aggregation in a privacy-preserving way. Most existing methods guarantee strong privacy protection yet at the cost of very limited aggregation operations, such as allowing only summation, which hardly satisfies the needs of behavior analysis. In this paper, we propose a scheme, PPSA, which encrypts users' sensitive data to prevent privacy disclosure to both outside analysts and the aggregation service provider, and fully supports selective aggregate functions for online user behavior analysis while guaranteeing differential privacy. We have implemented our method and evaluated its performance using a trace-driven evaluation based on a real online behavior dataset. Experiment results show that our scheme effectively supports both overall aggregate queries and various selective aggregate queries with acceptable computation and communication overheads.

Index Terms—Selective Aggregation; Differential Privacy; Homomorphic Encryption.


1 INTRODUCTION

Online user behavior analysis studies how and why users of e-commerce platforms and web applications behave. It has been widely applied in practice, especially in commercial environments, political campaigns, and web application development [1]–[3]. Data aggregation is one of the most critical operations in behavior analysis. Nowadays, the aggregation tasks for user data are outsourced to third-party data aggregators including Google Analytics, comScore, Quantcast, and StatCounter. While this tracking scheme brings great benefits to analysts and aggregators, it also raises serious concerns about the disclosure of users' privacy [4]. Aggregators hold detailed data of users' online behaviors, from which demographics can be easily inferred [5]. To protect users' privacy, government and industry regulations were established, e.g., the EU Cookie Law [6] and W3C Do-Not-Track [7], which significantly restrict the analysis of users' online behaviors [4].

This work was supported in part by the State Key Development Program for Basic Research of China (973 project 2012CB316201), in part by China NSF grants 61422208, 61472252, 61272443, and 61133006, in part by the CCF-Intel Young Faculty Researcher Program and CCF-Tencent Open Fund, in part by the Scientific Research Foundation for the Returned Overseas Chinese Scholars, and in part by Jiangsu Future Network Research Project No. BY2013095-1-10. The opinions, findings, conclusions, and recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the funding agencies or the government.
J. Qian was with the Shanghai Key Laboratory of Scalable Computing and Systems, Department of Computer Science and Engineering, Shanghai Jiao Tong University, China, when doing this work. He is currently with the Department of Computer Science, Illinois Institute of Technology, USA. E-mail: [email protected].
F. Qiu, F. Wu, N. Ruan, and G. Chen are with the Shanghai Key Laboratory of Scalable Computing and Systems, Department of Computer Science and Engineering, Shanghai Jiao Tong University, China. E-mails: [email protected], {fwu, naruan, gchen}@cs.sjtu.edu.cn.
S. Tang is with the Department of Information Systems, Naveen Jindal School of Management, the University of Texas at Dallas, USA. E-mail: [email protected].
F. Wu is the corresponding author.


Fig. 1: System overview: Clients are installed on the user side. The intermediary collects data from clients, computes aggregate statistics, and answers queries issued by the analyst. The intermediary should also ensure that users' privacy is not leaked.

To address the conflict between the utility of analysis results and users' privacy, much effort has been devoted to designing protocols that allow operations on user data while still protecting users' privacy (e.g., [4], [8]–[14]). Unfortunately, existing schemes guarantee strong privacy at the expense of limitations on analysis. Most of them can only compute the summation and mean of data over all users, without filtering or selection, i.e., overall aggregation. Some previous methods allow more complex computations [13]–[15]. For instance, Jung et al. [13] proposed a system that can perform multivariate polynomial evaluation; unfortunately, it still does not support selection. However, selective aggregation is one of the most important operations for queries on databases. It can be used to tell the difference among different user groups in a certain aspect, for instance, "select avg(income) from database group by gender".

As shown in Fig. 1, a typical privacy-preserving data aggregation system is composed of three parts: clients, intermediary (i.e., aggregation service provider), and analyst. The intermediary collects data from clients (users' devices), does some calculations, and evaluates aggregate queries issued by the analyst. A common assumption made in many existing systems is that the intermediary is not trusted.

Adding noise to the aggregate result is a common method to achieve a stronger privacy guarantee, differential privacy [16], without which individuals' privacy can easily be inferred from aggregate results by adversaries who have computational power and auxiliary information. There are two ways to add noise: either each client adds noise to its own data (e.g., [8], [12], [17]), or the intermediary obliviously adds noise to the aggregate result (e.g., [4], [10], [11]). PPSA adopts the latter way to achieve differential privacy; the noise needs to be added obliviously so as to prevent the intermediary from determining the noise-free result when the noisy result is publicly released.

The main goal of this paper is to design a practical protocol that is able to compute selective aggregations of user data while still preserving users' privacy. There are mainly three challenges. First, the untrusted intermediary needs to evaluate selective aggregation obliviously: it cannot access user data for privacy reasons, but we want it to do the computations that achieve selection and aggregation on user data. We exploit a homomorphic cryptosystem to address this challenge, but by itself it does not directly support data selection. Second, our scheme PPSA needs to achieve differential privacy within a homomorphic cryptosystem. To protect individuals' privacy, we need to obliviously add noise to aggregate results in addition to encrypting user data. Existing differential privacy mechanisms generate noise from real numbers, but homomorphic cryptosystems require plaintexts to be integers; simply scaling real numbers to integers would cause inaccuracy and inconvenience. Thus, we need to resolve this conflict. Third, PPSA should be resistant to client churn, the situation where clients switch between online and offline frequently. When an analyst issues a query, there could be few users connected, which means little data can be collected to evaluate the query, yet the analyst wants the intermediary to respond as soon as possible. Thus, our protocol needs to tolerate client churn and evaluate queries both timely and accurately.

To address these challenges, we design a scheme called PPSA (Privacy-Preserving Selective Aggregation). In general, our contributions can be summarized as follows:

• We present the first scheme, PPSA, that allows privacy-preserving selective aggregation on user data, which plays a critical role in online user behavior analysis.

• We combine homomorphic encryption and a differential privacy mechanism to protect users' sensitive information from both analysts and aggregation service providers, and to protect individuals' privacy from being inferred. We prove that differential privacy can be achieved by adding the difference of two geometric variables, which is computed via homomorphic encryption. Furthermore, we present a privacy analysis of PPSA.

• We extend PPSA to two more scenarios to fully support more complex selective aggregation of user data. We utilize a calculation to evaluate aggregation selected by multiple boolean attributes. We design a way of obliviously comparing two integers, and utilize it to evaluate aggregation selected by a numeric attribute.

• We implement PPSA and conduct a trace-driven evaluation based on an online behavior dataset. Evaluation results show that our scheme effectively supports various selective aggregate queries with high accuracy and acceptable computation and communication overheads.

The rest of this paper is organized as follows. Section 2 presents an overview of PPSA and related background knowledge. Section 3 gives details on the design rationale and protocols. Then, Section 4 presents two extensions of PPSA. Section 5 analyzes the evaluation results, followed by related work in Section 6. Finally, Section 7 concludes this paper.

TABLE 1: Symbol Table

T           The centralized data table
att         An attribute in T
a           A numeric attribute in T
b           A boolean attribute in T
c           A numeric attribute for selection in T
A           Attribute set
U           User set
u           A user in U
x^u_att     The answer of user u to attribute att
c^u_att     The ciphertext of x^u_att
Q           A query from the analyst
ε           Privacy parameter in differential privacy
p           A parameter indicating how much noise should be added
DL(p)       Discrete Laplace distribution on integers
Geo(1 − p)  Geometric distribution
Z1, Z2      Two IID geometric variables used as half-noises
pk          Public key of the BGN cryptosystem
sk          Private key of the BGN cryptosystem
min         Minimum sample size set by the authority
s           Sample size the analyst requests
T(u, att)   The entry in T corresponding to user u and attribute att
E(pt)       The ciphertext of any plaintext pt
S           A randomly chosen set of s users who have valid data
S_att       The sum over attribute att
Ŝ           Noisy result obtained by adding noise to S
S^b_att     The sum over att selected by b
flag_u      An indicator of whether u satisfies the selection conditions

2 PRELIMINARIES

In this section, we first present the PPSA model. Then, we introduce differential privacy and review a useful homomorphic cryptosystem.

2.1 System Model

First of all, we describe the terms used in our study. In behavior analytics, overall aggregation and selective aggregation are the two basic types of queries over data from a group of users.

Overall aggregation means computing the sum and mean of a certain value over every user, e.g., "the total amount of time online of all the users yesterday".


Selective aggregation refers to selecting the users who satisfy some conditions before aggregating their values, e.g., "the average amount of time online of all the male users". Herein, "male" is a condition to pick out target users.

We suppose there is a centralized table T that contains attributes and collects users' answers to them. Attributes (denoted by att) can only be numeric, because non-numeric attributes cannot be directly aggregated.

Boolean attributes are a special type of numeric attributes, to which users' answers are boolean values (0 or 1), e.g., "gender is male"; a male user's answer would be 1. Most categorical attributes can easily be transformed into boolean attributes. For example, education level can be decomposed into several boolean attributes: "education level is bachelor", "education level is master", etc.

Numeric attributes. Since boolean attributes are very important in PPSA, we refer to non-boolean numeric attributes when we use "numeric attributes" later on. Users' answers to numeric attributes are non-negative integers, e.g., "age = 25".

To formalize, the relation schema of T is

T (id, a1, a2, . . . , ai, b1, b2, . . . , bj).

id is the user ID, a represents numeric attributes, and b represents boolean attributes. The set of attributes is:

A = {a1, a2, . . . , ai, b1, b2, . . . , bj}.

Important symbols of this paper are listed in Table 1.

Fig. 1 gives an overview of PPSA, which is comprised of a set of n users, the intermediary, and an analyst.

Clients are installed on the user side. They can be bundled with users' software that requires private analytics; thus, it is reasonable to assume clients are trusted. A client collects a user's data, and detects and removes outliers. Once the user gets online, the client sends the encrypted data to the intermediary. Clients are not involved in the process of statistical aggregation. We denote the set of users as U = {1, 2, . . . , n}. The plaintext answer of user u to attribute att is denoted as x^u_att, and the ciphertext of x^u_att is denoted as c^u_att, for all att ∈ A and all u ∈ U.

Analysts are individuals or institutions that want to query user data. An analyst sends a query Q to the intermediary and then receives a noisy answer from it. Analysts are assumed to be semi-honest, trying to learn individual users' private data. An analyst may collude with other analysts or make one single query multiple times.

The intermediary, comprised of an aggregator and an authority, bridges clients and analysts. They are in charge of aggregating user data from clients and responding to analysts' queries, providing both functionality and security. Table T is stored here; details are presented in Section 3. We assume both are semi-honest (a.k.a. honest but curious), i.e., they faithfully run the specified protocol but may try to learn additional information on their own. Furthermore, they do not collude with each other. This assumption is common in the literature [4], [10], [11], [17], and appropriate in realistic scenarios. For example, we assume that it is explicitly stated in the privacy policies of the aggregator and the authority, and enforced by law. Aggregators who do not provide such privacy statements would be blocked by the client software, e.g., browsers. Likewise, analysts would turn only to honest authorities who obey such policies. Though there is a possibility that the aggregator colludes with the authority, the possibility is very small, because the party who initiates collusion takes a high risk of being reported and exposed by the other, honest party, and then suffering legal punishment and the loss of customers and profits. As a result, we believe our assumption is reasonable in reality. In addition, this assumption can be relaxed by using trusted hardware [11].

Our privacy goal is to prevent user data leakage to analysts and the intermediary.

2.2 Differential Privacy

Differential privacy, proposed by Dwork et al. [16], [18], [19], is a privacy mechanism that protects any individual's privacy by making it very hard to determine whether or not her record is in the queried table.

Definition 1 ((ε, δ)-differential privacy [20]). A computation C achieves (ε, δ)-differential privacy if, for any two tables T1 and T2 that differ in at most one record, and all O ⊆ range(C),

  Pr(C(T1) ∈ O) ≤ exp(ε) · Pr(C(T2) ∈ O) + δ,   (1)

where Pr is a probability distribution over the randomness of the computation.

That means the probability that a computation generates a given output is almost independent of the presence of any individual record in the dataset. Differential privacy is independent of the adversary's computational power and auxiliary information, so it is a very strong guarantee. Specifically, there are two privacy parameters, ε and δ, in the expression. The former, ε, mainly controls the tradeoff between the accuracy of a computation and the strength of its privacy guarantee: a higher ε means higher accuracy but a weaker privacy guarantee, and vice versa. The latter, δ, relaxes the strict relative shift of probability in cases where Inequality (1) cannot be satisfied without a non-zero δ [20].

A standard technique is adding random noise to the answers, where the noise distribution is carefully calibrated to the sensitivity of the query.

Definition 2 (Sensitivity [16]). For a single snapshot query Q : T → ℝ^k, the sensitivity of Q is

  ∆Q = max ‖Q(T1) − Q(T2)‖₁   (2)

for all T1, T2 differing in at most one element.

Informally, the sensitivity ∆Q is the maximum amount the result Q(T) can change, given any change to a single user's data in table T. It is a property of the query Q alone and is independent of the table. The sensitivity decides how much noise should be added to the answer of a query.

To achieve differential privacy, we prove that the noise Z can be generated according to the discrete Laplace distribution DL [21], which is defined on the integers. The probability mass function of DL(p) is

  Pr(Z = k) = ((1 − p)/(1 + p)) · p^|k|,  k ∈ ℤ.   (3)


Theorem 1. Given a query Q and a random integer Z ∼ DL(p), adding Z to the result of Q achieves ε-differential privacy, where p = exp(−ε/∆Q).

Proof: For any r ∈ range(C),

  Pr(C(T1) = r) = Pr(Q(T1) + Z = r) = Pr(Z = r − Q(T1)) = ((1 − p)/(1 + p)) · p^|r − Q(T1)|.   (4)

Similarly,

  Pr(C(T2) = r) = ((1 − p)/(1 + p)) · p^|r − Q(T2)|.   (5)

Then, we have

  Pr(C(T1) = r) / Pr(C(T2) = r) = p^(|r − Q(T1)| − |r − Q(T2)|)
                                = exp(ε(|r − Q(T2)| − |r − Q(T1)|)/∆Q)
                                ≤ exp(ε|Q(T1) − Q(T2)|/∆Q)
                                ≤ exp(ε).   (6)

It is straightforward that Inequality (1) is thus satisfied.

Inusah et al. [21] proved that the difference of two independent and identically distributed geometric variables (denoted as Geo(·)) has the same distribution as a discrete Laplace variable.

Proposition 1. Let Z1, Z2 ∼ Geo(1 − p) be two independent variables and let Z = Z1 − Z2. Then Z ∼ DL(p).

PPSA utilizes this method to obliviously add noise to the true result. Because the noise can be decomposed into two parts, we call each of them a half-noise. If the aggregator generates a Z1 and the authority generates Z2, we obtain a discrete Laplacian noise by subtracting Z2 from Z1. Neither of the two components knows the whole noise, so neither can remove it. Blind noise addition prevents anyone in the system from determining the noise-free result when the noisy result is publicly released.
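As a quick plausibility check of Proposition 1, the following self-contained Python sketch (our illustration, not part of PPSA's C implementation) samples the two half-noises and compares the empirical distribution of their difference against the DL(p) mass function of Equation (3):

# Empirical check of Proposition 1: Z1 - Z2 with Z1, Z2 ~ Geo(1 - p)
# should match the discrete Laplace pmf of Eq. (3).
import math
import random

def sample_geo(p):
    # Geo(1 - p) on {0, 1, 2, ...}: Pr(Z = k) = (1 - p) * p**k.
    # Inverse-transform sampling: floor(log(U) / log(p)) with U in (0, 1].
    u = 1.0 - random.random()
    return int(math.log(u) / math.log(p))

def dl_pmf(k, p):
    # Discrete Laplace DL(p), Eq. (3).
    return (1.0 - p) / (1.0 + p) * p ** abs(k)

p = math.exp(-1.0)  # e.g., eps = 1 and sensitivity 1, so p = exp(-eps/dQ)
n = 200_000
diffs = [sample_geo(p) - sample_geo(p) for _ in range(n)]
for k in range(-3, 4):
    empirical = diffs.count(k) / n
    print(f"k={k:+d}  empirical={empirical:.4f}  DL(p)={dl_pmf(k, p):.4f}")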

Unfortunately, differential privacy has an intrinsic drawback: the privacy guarantee degrades as the number of answered queries grows, as shown by [22]–[24]. For example, if an attacker issues the same query and gets results an unlimited number of times, she can estimate the true result by averaging the answers and thus breach differential privacy.

Theorem 2 (Composition Theorem [22], [24]). For any ε > 0, δ, δ′ > 0, and k ∈ ℕ, a k-fold adaptive composition of (ε, δ)-differentially private mechanisms is (ε′, kδ + δ′)-differentially private, where

  ε′ = √(2k ln(1/δ′)) · ε + kε(e^ε − 1).   (7)

Later on, a stricter quantification of the privacy degradation level was given by Oh et al. [23]. As researchers keep tightening the bound, we denote ε′ = f(k, ε) for simplicity and in case of inaccuracy. Given the number of queries k and the objective privacy level ε′, we can solve this equation and calculate the ε for answering every single query in turn.
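For concreteness, a small numerical sketch of this budgeting step is shown below. It takes Equation (7) itself as f(k, ε) and inverts it by bisection; the function names and the choice of δ′ are our own illustrative assumptions:

# Given an overall budget eps_total and a query limit k, find the
# per-query eps such that the composition bound, Eq. (7),
# equals eps_total. Bisection works because the bound increases in eps.
import math

def composed_eps(eps, k, delta_prime):
    return math.sqrt(2 * k * math.log(1 / delta_prime)) * eps \
        + k * eps * (math.exp(eps) - 1)

def per_query_eps(eps_total, k, delta_prime=1e-6, iters=100):
    lo, hi = 0.0, eps_total
    for _ in range(iters):
        mid = (lo + hi) / 2
        if composed_eps(mid, k, delta_prime) < eps_total:
            lo = mid
        else:
            hi = mid
    return lo

# e.g., the authority could answer k = 100 queries under eps_total = 5:
print(per_query_eps(eps_total=5.0, k=100))  # roughly 0.08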

2.3 Boneh-Goh-Nissim Cryptosystem

The Boneh-Goh-Nissim (BGN) cryptosystem [25] is a homomorphic cryptosystem (please refer to [26] for a detailed survey) that utilizes a bilinear pairing [27] to allow an unlimited number of homomorphic additions and a single homomorphic multiplication of two ciphertexts. Bilinear maps and bilinear groups are the basis of the BGN cryptosystem.

Definition 3 (Bilinear map [25]). Let G and G1 be two cyclic groups of order n with g as a generator of G. A map e : G × G → G1 is said to be bilinear if e(g, g) is a generator of G1, and

  e(u^a, v^b) = e(u, v)^(ab)   (8)

for all u, v ∈ G and all a, b ∈ ℤ.

Definition 4 (Bilinear group [25]). G is a bilinear group if there exists a group G1 and a bilinear map e such that:

1) G and G1 are two multiplicative cyclic groups of finite order n;
2) g is a generator of G;
3) e : G × G → G1 is a bilinear map and g1 = e(g, g) is a generator of G1.

Boneh et al. [25] presented an approach to construct a bilinear group of order n; we do not detail it here.

Next, we introduce the BGN cryptosystem as follows. Let n = pq for distinct odd primes p and q, let G, G1 be two multiplicative groups of order n with a bilinear pairing e : G × G → G1, let g be a random generator of G, and let h be a random generator of the subgroup of G of order p. Let T < q; then the plaintext space is P = Z_T, the ciphertext space is C = G, and the randomness space is R = Z_n.

KeyGen(τ): Given a security parameter τ ∈ ℤ⁺, we generate two random τ-bit primes p, q, set n = pq, and select a positive integer T < q. Then, we choose two multiplicative groups G, G1 of order n that support a bilinear pairing e : G × G → G1, as well as random generators g, u ∈ G, and set h = u^q so that h is a generator of the subgroup of order p. The public key is pk = (n, g, h, G, G1, e), and the private key is sk = p.

Encrypt(pk, m): Given a message m ∈ P and a public key pk, Encrypt(pk, m) chooses a random r ∈ R and calculates the ciphertext

  c = g^m h^r mod n ∈ G.   (9)

In this paper, we use c or E(·) to denote a ciphertext.

Decrypt(sk, c): Given a ciphertext c ∈ C and a private key sk, Decrypt(sk, c) calculates

  c′ = c^p = (g^p)^m mod n,   (10)

and uses Pollard's lambda method [28] to take the discrete logarithm of c′ in base g^p, recovering the plaintext m in time O(√T). We can actually precompute a (polynomial-sized) table of powers of g^p so that decryption runs in constant time. In the rest of this paper, "mod n" is omitted and understood to have been applied.

The security of the BGN cryptosystem is summarized in the following theorem [25].

Theorem 3. Let n = pq for distinct odd primes p and q. The BGN cryptosystem is semantically secure if deciding whether or not a random element of G has order p is hard when p and q are unknown.

Lastly, we show the homomorphic properties of the BGN cryptosystem. Given c1 = g^(m1) h^(r1) and c2 = g^(m2) h^(r2),

  c1 ⊕ c2 = c1 · c2 = g^(m1) h^(r1) · g^(m2) h^(r2) = g^(m1+m2) h^(r1+r2)   (11)

is a valid encryption of m1 + m2 (⊕ means homomorphic addition),

  c1 · g^k = g^(m1) h^(r1) · g^k = g^(m1+k) h^(r1)   (12)

is a valid encryption of m1 + k, and

  c1^k = (g^(m1) h^(r1))^k = g^(k·m1) h^(r1·k)   (13)

is a valid encryption of k·m1. Subtraction of encrypted messages and constants can be accomplished by computing c1 · c2^(−1) and c1 · g^(−k), respectively. Given c1, c2, and a random r ∈ R, a ciphertext representing the product m1·m2 can be calculated (writing h = g^α):

  c1 ⊗ c2 = e(c1, c2) · h1^r
          = e(g^(m1) h^(r1), g^(m2) h^(r2)) · h1^r
          = e(g^(m1+α·r1), g^(m2+α·r2)) · h1^r
          = e(g, g)^((m1+α·r1)(m2+α·r2)) · h1^r
          = g1^(m1·m2 + m1·α·r2 + m2·α·r1 + α²·r1·r2) · h1^r
          = g1^(m1·m2) · h1^(m1·r2 + m2·r1 + α·r1·r2 + r)
          = g1^(m1·m2) · h1^(r′) ∈ G1.   (14)

Here, ⊗ means homomorphic multiplication, e is the bilinear map, g1 = e(g, g), h1 = e(g, h) = g1^α, and r′ is a uniformly random element of R. All further homomorphic operations except multiplication remain possible in G1.

3 SYSTEM DESIGN

In this section, we first explain the design choices of PPSA. Then, the protocols for overall aggregation and selective aggregation are described, respectively. Finally, we present a privacy analysis.

3.1 Design Rationale

In this subsection, we introduce three important decisions in the system design.

3.1.1 Storing data on aggregator

Most recent proposals are distributed systems, where each client stores its private data locally. Such systems provide privacy but are susceptible to client churn. To avoid this weakness, PPSA stores all the user data on a server, the aggregator. Clients are only required to send their data when they are online; the aggregator can evaluate queries without their participation, so the system operates well even if no client is online. In our mechanism, each client encrypts its data before sending it to the aggregator, which can only do calculations obliviously on the encrypted data and output the results.

3.1.2 Introducing authority

Due to the use of encryption in PPSA, there must be a component that manages keys. First, the aggregator cannot take this responsibility, because it holds all the private data. Second, if clients managed keys, they would have to participate in the process of evaluating queries to decrypt the results; the system could then go out of service when clients frequently shift between online and offline, or when few clients are connected, which is contrary to our goal of resisting client churn. Last, analysts cannot manage keys either, because any analyst can be an adversary. As a result, we introduce the authority into the system to generate keys and keep the private key. The aggregator and the authority constitute the intermediary of the PPSA model.

3.1.3 Using homomorphic encryption and differential privacy mechanisms

On one hand, the aggregator needs to do calculations obliviously on encrypted data, and homomorphic encryption supports such operations; we therefore adopt the BGN cryptosystem for this task. On the other hand, when an analyst obtains aggregate results from the system, she may infer individual privacy with her computational power and auxiliary information. To prevent such privacy disclosure, we exploit the differential privacy mechanism proposed in [18]. The combination of homomorphic encryption and differential privacy guarantees the strong security of PPSA.

3.2 Protocol for Overall Aggregation

The aggregation mechanism comprises three phases.

Phase 1: Setup

The authority first decides a set of attributes:

A = {a1, a2, . . . , ai, b1, b2, . . . , bj}.

Then, it runs KeyGen(τ) and gets sk = p and pk = (n, g, h, G, G1, e). The authority also decides the objective overall privacy level by setting the parameter ε′. It also limits the number of queries to be answered to k and maintains a query counter cnt. Accordingly, the privacy budget ε for every single query can be calculated as mentioned in Section 2.2. Besides, the authority sets a minimum and a maximum value min, max of acceptable sample sizes for analysts. Then, it publishes the following tuple:

〈A, pk, min, max〉.

Based on A, the aggregator creates a table

T(id, a1, a2, . . . , ai, b1, b2, . . . , bj),

with all values initially set to null.

Phase 2: Data Collection

After setup, the system begins to collect user data. Every client refers to A and collects the user's answer to each of the given attributes. (How the client obtains user data is outside the scope of this paper.) Once client u obtains an answer x^u_att to an attribute att, it encrypts the answer with the pk published by the authority:

  c^u_att = Encrypt(pk, x^u_att) = g^(x^u_att) h^(r^u_att).   (15)



Fig. 2: Basic protocol for Query Evaluation: The intermediary consists of two non-colluding entities, the aggregator and the authority. They cooperate to calculate the noisy aggregate result following the given protocol.

Then, client u sends the following tuple to the aggregator if they are connected:

〈u, att, c^u_att〉, ∀att ∈ A.

After receiving the tuple from client u, the aggregator updates the value of the corresponding entry in table T:

T(u, att) = c^u_att.

T(u, att) refers to the entry in the row corresponding to user u and the column corresponding to attribute att. So, the content of T is dynamic.

Instead of sending all the data at once, the client sends every single answer shortly after it is obtained and encrypted. Data collection still proceeds even when the client is disconnected from the aggregator; in that case, the client sends the obtained data to the aggregator once it gets online. We do not need any other interactions between clients and servers. Thus, PPSA is resistant to client churn.

Phase 3: Query Evaluation

Query evaluation can be executed before data collection is finished, because evaluating a query does not involve all the data in table T. The key to achieving selective aggregation is counting the data items of target users by multiplying them by 1, and skipping the rest by multiplying them by 0. These calculations are done on ciphertexts, and thus no privacy disclosure occurs. Query evaluation consists of four steps (Fig. 2).

Step 1. An analyst sends the authority a query

Q = 〈“OA”, s, att〉,

where "OA" is short for overall aggregation, s is the sample size the analyst wants, and att is any attribute in A.

Step 2. The authority rejects the query request if cnt = k, to limit the degradation of the differential privacy level. Otherwise, it performs a sanity check once it receives Q. If the arguments do not match, s < min, s > max, or att ∉ A, the query is also rejected. Otherwise, it proceeds as follows. It computes the privacy parameter (as described in Section 2.2):

  p = exp(−ε/∆Q) = exp(−ε / max ‖Q(T1) − Q(T2)‖₁).   (16)

Then, the authority generates a Geo(1 − p) random variable Z1 as a half-noise and encrypts it with pk:

  E(Z1) = Encrypt(pk, Z1) = g^(Z1) h^(r1).   (17)

Here E(pt) refers to the ciphertext of any plaintext pt. At last, the authority sends the following tuple to the aggregator:

〈Q, p, E(Z1)〉.
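A plain sketch of the authority's side of Step 2 is given below. The function and variable names are ours, the sensitivity is assumed to be 1 for illustration, and the encryption of Z1 (Equation (17)) is elided:

# Sketch of Step 2 at the authority: budget check, sanity check,
# noise parameter p = exp(-eps / dQ) (Eq. (16)), and half-noise Z1.
import math
import random

def geo(p_param):
    # Z ~ Geo(1 - p_param) on {0, 1, 2, ...}
    u = 1.0 - random.random()
    return int(math.log(u) / math.log(p_param))

def authority_step2(query, cnt, k, eps, attributes, s_min, s_max):
    kind, s, att = query                     # e.g., ("OA", 500, "age")
    if cnt >= k:
        raise RuntimeError("query limit reached: privacy budget spent")
    if not (s_min <= s <= s_max) or att not in attributes:
        raise ValueError("rejected by sanity check")
    sensitivity = 1                          # assumed Delta-Q for this query
    p_param = math.exp(-eps / sensitivity)   # Eq. (16)
    z1 = geo(p_param)                        # half-noise Z1
    return p_param, z1                       # Z1 would be sent as E(Z1), Eq. (17)

print(authority_step2(("OA", 500, "age"), cnt=0, k=100,
                      eps=0.08, attributes={"age"}, s_min=100, s_max=1000))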

Step 3. The aggregator runs Algorithm 1 (PPOAA). It first counts the tuples that have valid values for the requested attribute. If that number m is less than the requested sample size, it returns an error indicating data deficiency. Otherwise, it randomly chooses a sample S from the valid tuples and sums up their values on att:

  E(S_att) = ∏_{u∈S} c^u_att = E(∑_{u∈S} x^u_att).   (18)

It also generates a Geo(1 − p) random variable Z2 as the second half-noise. Then, the aggregator combines the two half-noises into the whole Laplacian noise in encrypted form and obliviously adds it to the sum:

  E(Ŝ_att) = E(S_att) × E(Z1) × E(Z2)^(−1) = E(S_att + Z1 − Z2).   (19)

At last, the algorithm outputs an encrypted noisy aggregate result E(Ŝ_att). Here S_att refers to the sum over attribute att, and the hat (ˆ) denotes noisy data. Then, the aggregator sends the result to the authority.

Algorithm 1 Privacy-Preserving Overall Aggregation Algorithm (PPOAA)

Input: Table T, attribute set A, query Q, noise parameter p, encrypted half-noise E(Z1), public key pk.
Output: Encrypted noisy sum E(Ŝ_att), or Error.
1:  m ← 0.
2:  for each u ∈ U do
3:      if T(u, att) is not null then
4:          m ← m + 1.
5:          Mark user u.
6:      end if
7:  end for
8:  if m < s then
9:      return Error.
10: else
11:     Let S be a set of s random users that are marked.
12:     E(S_att) ← ∏_{u∈S} c^u_att.
13:     Generate a Geo(1 − p) random variable Z2.
14:     E(Z2) ← g^(Z2) h^(r2).
15:     E(Ŝ_att) ← E(S_att) × E(Z1) × E(Z2)^(−1).
16:     return E(Ŝ_att).
17: end if
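In plaintext form, the core of Algorithm 1 (sampling, summing, and blending the two half-noises) is only a few lines. The Python sketch below is our own illustration, with encryption stripped away and hypothetical names, mirroring lines 11-15:

# Plaintext analogue of Step 3 / Eqs. (18)-(19): sample s users, sum the
# attribute, and blend the half-noises Z1 (authority) and Z2 (aggregator).
import math
import random

def geo(p_param):
    # Z ~ Geo(1 - p_param) on {0, 1, 2, ...}
    u = 1.0 - random.random()
    return int(math.log(u) / math.log(p_param))

def aggregate(values, s, p_param, z1):
    if len(values) < s:
        raise RuntimeError("Error: data deficiency")
    sample = random.sample(values, s)       # the random user set S
    s_att = sum(sample)                     # Eq. (18), encrypted in PPSA
    z2 = geo(p_param)                       # aggregator's half-noise
    return s_att + z1 - z2                  # Eq. (19); Z1 - Z2 ~ DL(p)

p_param = math.exp(-1.0)                    # eps = 1, sensitivity 1
print(aggregate(list(range(100)), s=50, p_param=p_param, z1=geo(p_param)))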

Step 4. If the result is Error, the authority sends an error message to inform the analyst of the data deficiency. Otherwise, it decrypts the encrypted noisy result with its private key sk and gets the noisy result

  Ŝ_att = Decrypt(sk, E(Ŝ_att)),   (20)


and sends it to the analyst, who can calculate the noisy mean by himself. In addition, the authority increases the query counter cnt. This concludes Query Evaluation.

3.3 Protocol for Selective Aggregation

We consider aggregation selected by one single boolean attribute for simplicity; more complex scenarios will be discussed in Section 4. To support selective aggregation, we need to make some changes to the third phase, Query Evaluation.

Step 1. An analyst sends the authority a query

Q = 〈“SA”, s, att, b〉,

where "SA" is short for selective aggregation, att is the attribute to be aggregated, and b is any boolean attribute used for selection.

Step 2. The authority checks cnt and the query validity, similarly to the previous protocol. Then, it generates two half-noises and sends the following tuple to the aggregator:

〈Q, p, E(Z1), E(Z2)〉.

Step 3. The aggregator runs Algorithm 2 (PPSAA) to evaluate the query Q. PPSAA first computes a sum on attribute b, i.e., E(S_b). Then, it computes E(S^b_att) such that

  E(S^b_att) = E(∑_{u∈S} x^u_att · x^u_b) = ∏_{u∈S} e(c^u_att, c^u_b),   (21)

where S^b_att represents the sum over att selected by b. If user u meets the predicate of b, i.e., x^u_b = 1, then his answer x^u_att is added to the sum; otherwise, it is skipped, since it is multiplied by 0. At last, the aggregator obliviously adds two noises to the two sums respectively, and sends the outputs to the authority.

Step 4. If the results are not Error, the authority decrypts them and sends the noisy results to the analyst, who calculates the noisy mean value by dividing Ŝ^b_att by Ŝ_b. Note that the query also makes sense if att is a boolean attribute. For instance, if att is "education level is master" and b is "gender is female", we get the ratio of female users holding a master's degree.
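The selection trick behind Equation (21) is easiest to see in the clear. The sketch below is a plaintext analogue only; in PPSA both factors stay encrypted and each product is computed with a pairing:

# Plaintext analogue of Eq. (21) and the Step 4 division: selecting by a
# boolean attribute b is multiplying each value by the 0/1 answer to b.
rows = [
    {"is_female": 1, "time_online": 40},
    {"is_female": 0, "time_online": 25},
    {"is_female": 1, "time_online": 10},
]
s_b = sum(r["is_female"] for r in rows)                         # selected count
s_b_att = sum(r["time_online"] * r["is_female"] for r in rows)  # selected sum
print(s_b_att / s_b)  # average time online of female users: 25.0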

3.4 Privacy Analysis

This subsection demonstrates the privacy-preserving property of PPSA. As mentioned before, we have assumed that both the aggregator and the authority are semi-honest. They run the specified protocol faithfully even if either of them colludes with others, so differential privacy is always preserved.

3.4.1 Analysts
An analyst sends queries to and receives results from the authority. She normally does not communicate with the aggregator or any user. She can only get noisy results, which are differentially private, so she cannot infer individual privacy even with auxiliary information. If she colludes with a user, they still cannot acquire other users' data. If she colludes with the aggregator, they have all the encrypted data but no private key to decrypt it. If she colludes with the authority, they get the private key but no user data. If analysts collude with each other, they gain no additional information.

Algorithm 2 Privacy-Preserving Selective Aggregation Algorithm (PPSAA)

Input: Table T, attribute set A, query Q, noise parameter p, encrypted half-noises E(Z1), E(Z2), public key pk.
Output: Tuple 〈E(Ŝ_b), E(Ŝ^b_att)〉, or Error.
1:  m ← 0.
2:  for each u ∈ U do
3:      if neither T(u, att) nor T(u, b) is null then
4:          m ← m + 1.
5:          Mark user u.
6:      end if
7:  end for
8:  if m < s then
9:      return Error.
10: else
11:     Let S be a set of s random users that are marked.
12:     E(S_b) ← ∏_{u∈S} c^u_b.
13:     E(S^b_att) ← ∏_{u∈S} e(c^u_att, c^u_b).
14:     Generate two Geo(1 − p) random variables Z3, Z4.
15:     E(Z3) ← g^(Z3) h^(r3).
16:     E(Z4) ← g^(Z4) h^(r4).
17:     E(Ŝ_b) ← E(S_b) × E(Z1) × E(Z3)^(−1).
18:     E(Ŝ^b_att) ← E(S^b_att) × E(Z2) × E(Z4)^(−1).
19:     return 〈E(Ŝ_b), E(Ŝ^b_att)〉.
20: end if

3.4.2 Aggregator
The aggregator holds encrypted user data, but only the authority knows the private key. Since they are assumed not to collude with each other, the aggregator cannot decrypt those data. All of its computations are done obliviously. If it colludes with a user, they still cannot get other users' data. Half of the differentially private noise added to a result comes from the aggregator, but it does not know the whole noise, so it cannot obtain a noise-free result by removing the noise from the noisy one (which it may get from an analyst).

3.4.3 Authority
The authority holds the private key but has no access to user data. It can access noisy results, but like the aggregator, it cannot remove the noise.

3.4.4 Clients
A client is installed on the user side and only knows its own user's data. If some users collude with each other, they only know their own data. Normally a client is not manipulated by its user; even if a user makes her client send considerable duplicate values, the aggregator will only keep the latest value, as each user takes up exactly one row in table T.

3.4.5 Differential Privacy
In our scheme, Laplacian noise is added to the answer of every query. The noise is combined from two half-noises added by two parties, the aggregator and the authority, which are assumed not to collude with each other. Since nobody knows how much noise is added, nobody can remove the noise from the released noisy query answer and then infer a person's data from it. As a result, differential privacy for single queries is preserved. The parameter ε balances the trade-off between privacy and accuracy; a smaller ε leads to stronger privacy protection but greater answer distortion.

However, differential privacy has its own limitation that privacy degrades with the growing number of queries answered. Even though the mechanism for answering a single query is ε-differentially private, we cannot achieve this when many queries are answered (unless they are evaluated on disjoint subsets of the data set). In our setting, queries are evaluated on random samples of the original data and the data keeps changing, but the data subsets on which queries are computed are somewhat, if not greatly, overlapped and correlated with each other. This scenario can be modeled as the composition of many different privacy mechanisms on correlated data releases. According to the composition theorem [22], [24], answering k adaptive queries, each of which is ε-differentially private, still achieves ε′-differential privacy, though this is not as strong as the privacy of answering a single query. As mentioned in Section 2.2, ε′ is a function of ε and the query number limit k. Given ε′ and k, the authority calculates the privacy budget ε and then applies it to answering single queries. It rejects further query requests once the query number limit has been reached, so as to ensure the overall privacy guarantee. Therefore, ε′-differential privacy is achieved in our scheme.

To answer more queries, we could choose to degrade either the overall privacy level or the accuracy of answering single queries. Alternatively, when the query number limit is reached, the intermediary can pause the service and wait for the update of all the data to finish. By then, the data set is completely new and disjoint from the previous one, so the query counter can be reset and the service opened again, which is referred to as parallel composition [29].

4 EXTENSIONS

In this section, we first present the steps to evaluate aggregation selected by multiple different boolean attributes. Then, we describe the protocol for aggregation selected by a single numeric attribute, based on a method of comparing two encrypted integers.

4.1 Aggregation Selected by Multiple Boolean Attributes

In reality, an analyst may want to know the aggregation of an attribute selected by multiple different boolean attributes, e.g., the average number of searches made by females who have master's degrees, which involves two boolean attributes: "gender is female" and "education level is master". To evaluate such queries, we present the Query Evaluation phase as described in Fig. 3. For simplicity, we assume there are only two boolean attributes restricting the aggregation; the extension to scenarios with more boolean attributes is straightforward.

Step 1. An analyst sends the authority a query

Q = 〈“SA”, s, att, b1, b2〉,

where b1, b2 represent two different boolean attributes. (We note that the aggregation also makes sense if att is a boolean attribute not equal to b1 or b2.)


Fig. 3: Protocol for Query Evaluation in Extension 1: The analyst queries the aggregation of an attribute selected by multiple different boolean attributes. The aggregator and the authority communicate for two rounds and calculate the noisy result.

Step 2. The authority does a sanity check and sends the following tuple to the aggregator:

〈Q, p, E(Z1), E(Z2)〉.

Step 3. The aggregator does a calculation:

  E(flag_u) = c^u_b1 · c^u_b2 = E(x^u_b1 + x^u_b2).   (22)

We use flag as an indicator. The ciphertext E(flag_u) is an encryption of 2 if and only if user u satisfies both conditions b1 and b2. Then, the aggregator permutes all these E(flag_u) and sends them to the authority.

Step 4. Now the authority wants to know which of the encrypted flags are E(2). Noting that full decryption is unnecessary, it just computes E(flag_u)^p (recall p is the private key) and compares the results with (g^p)^0, (g^p)^1, (g^p)^2 (Equation (10)). Then, each E(flag_u) is reset to

  E(flag_u) = E(1) if E(flag_u)^p = (g^p)^2, and E(flag_u) = E(0) otherwise.   (23)

Next, it sends these E(flag_u) back to the aggregator.

Step 5. Suppose S_b1b2 represents the number of users who satisfy both b1 and b2, and S^b1b2_att represents the sum over att selected by both b1 and b2. The aggregator computes:

  E(S_b1b2) = ∏_{u∈S} E(flag_u),   (24)

  E(S^b1b2_att) = ∏_{u∈S} e(c^u_att, E(flag_u)).   (25)

A user is counted only if she meets both conditions b1 and b2. The aggregator then adds noise to the two values and sends them to the authority.

Step 6. The authority decrypts E(Ŝ_b1b2), E(Ŝ^b1b2_att) and sends the noisy results to the analyst.

4.2 Aggregation Selected by a Numeric Attribute

To be more practical, the analyst may want the selective aggregation of an attribute selected by a numeric attribute, e.g., the average time online of 13-to-19-year-old teenagers, where age is a numeric attribute for selection and [13, 20) is an interval. Here, the attribute table is T(id, a1, a2, . . . , ai, b1, b2, . . . , bj, c1, c2, . . . , ck). a and c are both numeric, but the former is used for aggregation and the latter for selection. We tell them apart because they are encrypted differently: client u encrypts each bit of x^u_c, while encrypting x^u_a as a whole.

Before processing such queries, we introduce a way of comparing two encrypted integers, which is motivated by Katti et al. [30]. Suppose two l-bit non-negative integers x and y in binary form: x = (x_l, x_{l−1}, . . . , x_1), y = (y_l, y_{l−1}, . . . , y_1). For each i = 1, 2, . . . , l, let

  d_i = x_i − y_i   (26)

(note d_i ∈ {−1, 0, 1}) and

  e_i = x_i − y_i + 1 + ∑_{j=i+1}^{l} d_j · 2^(l−j+1).   (27)

Proposition 2 (See proof in [30]). For any two l-bit non-negative integers x and y, we have x < y if and only if there exists an i such that e_i = 0.
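A plaintext sanity check of this encoding is given below (our own sketch; PPSA evaluates the same formulas under encryption via Equations (28) and (29)). Exhaustively testing all pairs of 4-bit integers confirms that some e_i = 0 exactly when x < y:

# Plaintext check of Proposition 2: x < y iff some e_i = 0, with
# d_i = x_i - y_i and e_i = x_i - y_i + 1 + sum_{j>i} d_j * 2^(l-j+1).
def less_than_witness(x, y, l):
    # bit lists, most significant first: list index k holds bit (l - k)
    xb = [(x >> (l - k - 1)) & 1 for k in range(l)]
    yb = [(y >> (l - k - 1)) & 1 for k in range(l)]
    d = [a - b for a, b in zip(xb, yb)]
    # for bit index j at list position m, the weight 2^(l-j+1) equals 2^(m+1)
    es = [d[k] + 1 + sum(d[m] * 2 ** (m + 1) for m in range(k))
          for k in range(l)]
    return any(e == 0 for e in es)

l = 4
assert all(less_than_witness(x, y, l) == (x < y)
           for x in range(2 ** l) for y in range(2 ** l))
print("Proposition 2 verified for all 4-bit pairs")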

If we encrypt each bit of x and y, we get c1 = (c^l_1, c^{l−1}_1, . . . , c^1_1) and c2 = (c^l_2, c^{l−1}_2, . . . , c^1_2). To obliviously compare x and y using c1, c2, the aggregator can utilize the BGN cryptosystem to compute:

  E(d_i) = c^i_1 · (c^i_2)^(−1),   (28)

  E(e_i) = c^i_1 · (c^i_2)^(−1) · g · ∏_{j=i+1}^{l} E(d_j)^(2^(l−j+1)).   (29)

Then, it sends each E(e_i) to the authority, which checks whether there is any i such that E(e_i) is an encryption of 0 (by checking whether E(e_i)^p = 1).

We now present the process of Query Evaluation. As before, we consider one single attribute for selection.

Step 1. An analyst sends the authority a query:

Q = 〈“numSA”, s, att, c, z, w〉,

where c is any numeric attribute for selection, and z, w are the bounds of the interval [z, w).

Step 2. Same as before.

Step 3. For each user datum x^u_c (u ∈ S), the aggregator computes the values E(e^i_{u,z}) and E(e^i_{u,w}) mentioned above. Then, it permutes the two tables of E(e) values and sends them to the authority.

Step 4. The authority checks them and learns the relationships between x^u_c and z, w. If x^u_c < w and x^u_c ≥ z, it sets flag_u = 1; otherwise, flag_u = 0. The authority finally encrypts these flags and sends them to the aggregator.

Steps 5, 6. Same as before.

5 EVALUATION

In this section, we implement PPSA and evaluate its performance.

5.1 Methodology

We first implemented the BGN cryptosystem using the Pairing-Based Cryptography (PBC) library [31], which is built on the GMP library to perform the mathematical operations in pairing-based cryptosystems. The security parameter τ was set to 80. Based on the cryptosystem, we implemented PPSA in C.

Then, we did a trace-driven simulation based on a dataset of 1000 nationwide users' demographics and online behaviors over four weeks, extracted from a dataset of the China Internet Network Information Center (CNNIC) [32]. The behaviors include webpages browsed, time spent on each webpage, and browsers used. The demographic profiles include gender, age, education level, income, occupation, etc.

We performed all simulations on a PC with an Intel® Core™ i3-2310M 2.10 GHz CPU and 8 GB of memory under Ubuntu 12.04.

5.2 Utility & Accuracy & Client Churn Resistance

We set the differential privacy parameter ε = 1 for every single query in our tests. To evaluate utility, we consider the following six queries. They all aggregate over a sample of 1000 users across the whole four weeks.

• Q1: Average number of times of using Internet Explorer.
• Q2: Ratio of male users.
• Q3: Average number of webpages browsed by users, grouped by gender.
• Q4: Average number of shopping webpages browsed by users, grouped by gender.
• Q5: Average number of times of using Internet Explorer by users who are female and have a bachelor's degree.
• Q6: Average number of webpages browsed by users, grouped by age interval.

After running the simulations, we get all the noisy results. The results of Q1, Q2, Q5 are 1765, 0.773, and 1994, respectively. The results of Q3, Q4 are shown in Fig. 4(a); analysts can use them to compare the online interests of males and females. The result of Q6 is shown in Fig. 4(b), which can be used to tell the differences among people of different ages. Thus, PPSA supports multiple types of queries with acceptable accuracy.

Relative errors of the six results are 0.2%, 0.4%, 2.5%, 4.3%, 6.3%, and 11.3%, respectively. Here, the relative error is caused by the added noise and depends on the privacy parameter ε and the true result of the query: the smaller the true result, the larger the relative error of the noisy result, which is a common issue of differential privacy mechanisms. Fig. 5 depicts the impact of s and ε on the accuracy of evaluating Q1. The relative error decreases as the sample size increases, since the true result rises with the sample size. It also shows that the smaller ε is, the greater the inaccuracy.

To prove PPSA is resistant to client churn, we compare it with a recent system called SplitX [10] in terms of accuracy. We focus on Q2 because SplitX can only evaluate such queries. We vary the online user ratio and the sample size s, and use the average relative error as a metric.


[Fig. 4 bar charts: (a) average number of webpages browsed (all vs. shopping), for males and females; (b) average number of webpages browsed, by age interval]

Fig. 4: Average number of webpages browsed: The statistics can be used to compare the online interests of males and females, and the online activeness of different age groups.

[Fig. 5 plot: relative error vs. sample size (100–1000) for ε = 1.0, 0.5, 0.1]

Fig. 5: Relative error vs. s, ε: The relative error decreases when the sample size s or the privacy budget ε in differential privacy increases.

[Fig. 6 plot: relative error vs. online user ratio for SplitX and PPSA with s = 100, 500, 1000]

Fig. 6: Comparison with SplitX: PPSA has higher accuracy than SplitX, especially when fewer than 10 percent of the users are online.

As presented in Fig. 6, PPSA has higher accuracy than SplitX, especially when fewer than 10 percent of the users are online. When few users are online, SplitX cannot operate at all.

5.3 Overhead

5.3.1 Computation Overhead
We analyze the run time of every phase and step of PPSA. The setup time is constant and less than 10 ms. The data collection time depends on the computational ability of the clients and the end-to-end time between clients and the aggregator, and the encryption time is almost negligible because one BGN encryption costs less than 1.5 ms. The decryption time of the authority is also a small constant. As for analysts, they only send their queries and do a few simple further calculations, which is not time-consuming either. Consequently, we do not detail these costs.

[Fig. 7 plots: run time vs. sample size for (a) the two algorithms PPOAA and PPSAA, and (b) Steps 3–5 of the two extensions]

Fig. 7: Computation overhead vs. sample size: The computation overheads are proportional to the sample size in all four cases.

Fig. 7(a) presents the computation overheads of the two algorithms, which are run by the aggregator. PPOAA costs very little time. PPSAA has greater overhead due to pairing operations, and it increases linearly with the sample size, i.e., the number of users whose data are involved in query evaluation. Both are still very efficient in computation, as the run time is less than 10 s even when the sample size is 10,000. (The test data is generated from the real dataset.)

In Extensions 1 and 2, most of the overhead lies in Steps 3, 4, and 5, which we plot in Fig. 7(b). Their computation overheads are also proportional to the sample size. Among them, Step 4 of Extension 2 (E2.S4) is the most time-consuming, with a run time of around 45 s when the sample size is 10,000, which is still acceptable. In addition, as depicted in Fig. 8, the overheads of Steps 3 and 4 in Extension 2 are also linearly related to l, the bit length of the data for attribute c (in Fig. 7(b), l is set to 7). These tests show that PPSA has acceptable computation overhead.


[Fig. 8 plot: run time vs. number of bits l (7–12) for Steps 3–5 of Extension 2]

Fig. 8: Computation overhead vs. number of bits: The computation overheads of Steps 3 and 4 in Extension 2 are linearly related to l, the bit length of the attribute of interest.


5.3.2 Communication Overhead

Component    OA    Basic SA    Extension 1    Extension 2
Authority    50    87          30s + 107      30s + 114
Aggregator   30    60          30s + 60       60ls + 60

TABLE 2: Communication overheads (bytes): For overall aggregation and basic selective aggregation, the communication overheads are small and constant. For the two extensions, they are linearly related to the sample size s (and to the bit length l, for the aggregator in Extension 2).

In PPSA, clients encrypt their data and send the ciphertexts to the aggregator. Each ciphertext has a size of 30 bytes, so the communication overhead for clients is determined by the number of ciphertexts. An analyst only sends a query whose size is at most 30 bytes, which is negligible. Therefore, we focus on the aggregator and the authority. Table 2 summarizes their communication overheads in bytes (recall that s is the sample size and l is the number of bits of the data for attribute c). For overall aggregation and basic selective aggregation, the communication overhead is very small and constant. For Extensions 1 and 2, it is linearly related to s and l. In the evaluation of query Q6, the overheads of the authority and the aggregator are about 30 KB and 420 KB, respectively. These results confirm that PPSA has acceptable communication overhead.
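The reported Q6 numbers are consistent with the Table 2 formulas under the assumption, ours and not stated above, that Q6 was evaluated with Extension 2 at sample size s = 1000 and l = 7 bits:

```python
# Plugging the assumed Q6 parameters into the Table 2 formulas (Extension 2).
s, l = 1000, 7                        # assumed sample size and bit number
authority_bytes = 30 * s + 114        # authority row, Extension 2
aggregator_bytes = 60 * l * s + 60    # aggregator row, Extension 2

print(f"authority:  {authority_bytes / 1e3:.1f} KB")   # ~30.1 KB
print(f"aggregator: {aggregator_bytes / 1e3:.1f} KB")  # ~420.1 KB
```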

6 RELATED WORK

Privacy-preserving aggregation of sensitive user data has attracted much attention recently, including health care data ([33], [34]), time-series data ([8], [9]), wireless sensor network data ([35], [36]), and online behavior data for analysis and advertising ([10], [17]). In general, there are two types of systems in previous work.

6.1 Centralized Systems

In a centralized system, all the user data are stored on the server, so it is important that users encrypt or encode their data before sending them to the server. The server holds the encrypted data, but it can only compute answers to queries obliviously, e.g., [37]–[39]. However, these proposals have different goals than our system and do not support selective aggregation. Moreover, they do not guarantee differential privacy. Homomorphic encryption is a common method to aggregate encrypted data without decryption, as in [14], [40], [41]. Chen et al. [42] instead used an order-preserving hash-based function to encode both data and queries, but they do not share our goal and cannot evaluate selective aggregation. Li et al. [43] proposed a system that processes range queries, but it does not compute aggregates and assumes analysts to be trusted. In contrast, PPSA combines differential privacy and homomorphic encryption, and is able to selectively aggregate encrypted user data.
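To make the idea concrete, the sketch below shows additively homomorphic aggregation with a toy Paillier cryptosystem. This is only an illustration of the general technique, not the BGN scheme PPSA builds on (BGN additionally supports one multiplication on ciphertexts):

```python
import math, random

# Toy Paillier parameters; real deployments use 2048-bit moduli.
p, q = 293, 433
n, n2 = p * q, (p * q) ** 2
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)                  # valid because we pick g = n + 1

def encrypt(m):
    r = random.randrange(2, n)
    while math.gcd(r, n) != 1:        # r must be invertible mod n
        r = random.randrange(2, n)
    return pow(n + 1, m, n2) * pow(r, n, n2) % n2

def decrypt(c):
    return (pow(c, lam, n2) - 1) // n * mu % n

# The server multiplies ciphertexts; the plaintexts add up underneath,
# so the server never sees any individual value in the clear.
data = [12, 7, 30, 1]
agg = 1
for m in data:
    agg = agg * encrypt(m) % n2
assert decrypt(agg) == sum(data)      # 50
```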

6.2 Distributed Systems

In a distributed system, clients need to proactively (e.g., [8], [10], [12], [44]) or passively (e.g., [4], [11], [45]) send the required data to the aggregator in a private way. Both approaches rely on the participation of clients: these systems all require online users, so analysis cannot proceed when most of the users are offline. Homomorphic encryption is also applicable in distributed systems. For instance, PASTE [8] exploits differential privacy and homomorphic cryptography, but it allows only summation of user data and the aggregator knows the private key. Castelluccia et al. [36] use symmetric homomorphic encryption, so they need a trusted aggregator, and they also allow only additive aggregation. DJoin [45] aims to support a distributed and differentially private query answering service, but it applies to privacy-preserving data joins between two parties, which is a different scenario from ours. Secure Multi-Party Computation (SMC) [46] requires all participants to be simultaneously online and to interact with each other periodically, which is infeasible in practical scenarios. Some previous studies have noticed the problem of client churn. For instance, Shi et al. [12] proposed a system that can tolerate some offline clients, but a trusted dealer and a trusted initial setup phase between all participants and the aggregator are still needed. Rottondi et al. [47] also discussed node failures, but they addressed a different issue from the problem in this paper.

When participating in query evaluation, clients need to preprocess their data first to prevent individuals from being identified. There are basically two approaches.

6.2.1 Each Client Anonymizes or Adds Noise to Its Own Data

This approach has obvious weak points: clients might still be de-anonymized with auxiliary information, or the utility of the user data is significantly degraded. Anonygator [48] assumes that the content of the collected data does not leak privacy and that clients can hide their identities by adopting an anonymous network. Compared with it, PPSA makes weaker assumptions. Dwork et al. [20] used expensive secret-sharing protocols that lead to complex computations for each client, which runs counter to our goal of client churn resistance, i.e., providing timely service even when clients frequently shift between online and offline. Duchi et al. [49] proposed local differential privacy, which does not assume a trusted data curator and is stronger than standard differential privacy (a randomized-response sketch follows this paragraph). In our scheme, the aggregator is not trusted with plaintext data, and thus local differential privacy is achieved. In other related work [8], [12], [17], duplicate answers sent by a single malicious client can substantially pollute the query result.
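The canonical local-DP mechanism is randomized response, sketched below for a single boolean attribute. This is our illustration of the setting of Duchi et al. [49], not the mechanism PPSA itself uses:

```python
import math, random

EPS = math.log(3)                               # local privacy budget
P_KEEP = math.exp(EPS) / (math.exp(EPS) + 1)    # = 0.75 here

def randomize(bit):
    """Each client flips its private bit with probability 1 - P_KEEP."""
    return bit if random.random() < P_KEEP else 1 - bit

def estimate_count(reports):
    """Unbias the aggregate, using E[report] = (2p - 1) * bit + (1 - p)."""
    n = len(reports)
    return (sum(reports) - n * (1 - P_KEEP)) / (2 * P_KEEP - 1)

truth = [random.random() < 0.3 for _ in range(100_000)]
reports = [randomize(b) for b in truth]
print(sum(truth), round(estimate_count(reports)))   # the two are close
```

The curator never sees a true bit, yet the unbiased estimator recovers the population count with error that shrinks as the number of clients grows.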

6.2.2 Server Obliviously Adds Noise to Aggregate Results

Akkus et al. [4] presented the first system that provides web analytics without tracking under a differential privacy guarantee, but it requires a public-key operation per single bit of user data, causing high overheads. PDDP [11] has the same limitation. SplitX [10] uses XOR operations to achieve low-cost encryption and thus scales much better. All three designs define buckets corresponding to different intervals of numeric values, which degrades data quality because the aggregator can only obtain very coarse answers from each client.
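The data-quality loss from bucketing is easy to see in a small sketch (the bucket edges are hypothetical): each client reports only an interval index, so the aggregator recovers a coarse histogram rather than exact values.

```python
# Hypothetical buckets over a numeric attribute: [0,20), [20,40), ...
edges = [0, 20, 40, 60, 80, 100]

def bucket_of(value):
    return next(i for i in range(len(edges) - 1) if value < edges[i + 1])

values = [13, 27, 55, 58, 91]
hist = [0] * (len(edges) - 1)
for v in values:
    hist[bucket_of(v)] += 1

print(hist)  # [1, 1, 2, 0, 1]; the exact mean (48.8) is no longer recoverable
```

PPSA, in contrast, aggregates encrypted exact values and so avoids this loss.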

7 CONCLUSION

In this paper, we have described the challenges of aggregating online user data while preserving users' privacy. Based on the BGN homomorphic cryptosystem, we have designed the first system that is able to securely and selectively aggregate user data, making it practical for realistic data analytics. It guarantees strong privacy preservation by utilizing a differential privacy mechanism to protect individuals' privacy. We have presented PPSA to evaluate aggregation selected by one boolean attribute, and extended it to aggregation selected by multiple boolean attributes and by one numeric attribute. Extensive experiments have shown that PPSA supports various selective aggregate queries with acceptable overhead and high accuracy.

REFERENCES

[1] R. E. Bucklin and C. Sismeiro, "Click here for internet insight: Advances in clickstream data analysis in marketing," Journal of Interactive Marketing, vol. 23, no. 1, pp. 35–48, 2009.

[2] R. Bose, "Advanced analytics: opportunities and challenges," Industrial Management & Data Systems, vol. 109, no. 2, pp. 155–172, 2009.

[3] H. Chen, R. H. Chiang, and V. C. Storey, "Business intelligence and analytics: From big data to big impact," MIS Quarterly, vol. 36, no. 4, pp. 1165–1188, 2012.

[4] I. E. Akkus, R. Chen, M. Hardt, P. Francis, and J. Gehrke, "Non-tracking web analytics," in Proceedings of the ACM Conference on Computer and Communications Security (CCS), 2012, pp. 687–698.

[5] F. Roesner, T. Kohno, and D. Wetherall, "Detecting and defending against third-party tracking on the web," in Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2012.

[6] Directive 2009/136/EC of the European Parliament and of the Council. [Online]. Available: http://eur-lex.europa.eu/LexUriServ/LexUriServ.do?uri=OJ:L:2009:337:0011:0036:en:PDF

[7] Web Tracking Protection. [Online]. Available: http://www.w3.org/Submission/web-tracking-protection/

[8] V. Rastogi and S. Nath, "Differentially private aggregation of distributed time-series with transformation and encryption," in Proceedings of the ACM International Conference on Management of Data (SIGMOD), 2010, pp. 735–746.

[9] B. Applebaum, H. Ringberg, M. J. Freedman, M. Caesar, and J. Rexford, "Collaborative, privacy-preserving data aggregation at scale," in Proceedings of the 10th Privacy Enhancing Technologies Symposium (PETS), 2010, pp. 56–74.

[10] R. Chen, I. E. Akkus, and P. Francis, "SplitX: high-performance private analytics," in Proceedings of the ACM Special Interest Group on Data Communication (SIGCOMM), 2013, pp. 315–326.

[11] R. Chen, A. Reznichenko, P. Francis, and J. Gehrke, "Towards statistical queries over distributed private user data," in Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2012.

[12] E. Shi, T.-H. H. Chan, E. G. Rieffel, R. Chow, and D. Song, "Privacy-preserving aggregation of time-series data," in Proceedings of the Network and Distributed System Security Symposium (NDSS), 2011.

[13] T. Jung, X. Mao, X.-Y. Li, S.-J. Tang, W. Gong, and L. Zhang, "Privacy-preserving data aggregation without secure channel: multivariate polynomial evaluation," in Proceedings of the 32nd IEEE International Conference on Computer Communications (INFOCOM), 2013, pp. 2634–2642.

[14] D. Fiore, R. Gennaro, and V. Pastro, "Efficiently verifiable computation on encrypted data," in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS), 2014, pp. 844–855.

[15] T. Jung, X.-Y. Li, and M. Wan, "Collusion-tolerable privacy-preserving sum and product calculation without secure channel," IEEE Transactions on Dependable and Secure Computing (TDSC), vol. 12, no. 1, pp. 45–57, 2015.

[16] C. Dwork, "Differential privacy: A survey of results," in Proceedings of the 5th International Conference on Theory and Applications of Models of Computation (TAMC), 2008, pp. 1–19.

[17] M. Hardt and S. Nath, "Privacy-aware personalization for mobile advertising," in Proceedings of the ACM Conference on Computer and Communications Security (CCS), 2012, pp. 662–673.

[18] C. Dwork, "Differential privacy," in Proceedings of the 33rd International Colloquium on Automata, Languages and Programming (ICALP), 2006, pp. 1–12.

[19] C. Dwork, F. McSherry, K. Nissim, and A. Smith, "Calibrating noise to sensitivity in private data analysis," in Theory of Cryptography, 2006, pp. 265–284.

[20] C. Dwork, K. Kenthapadi, F. McSherry, I. Mironov, and M. Naor, "Our data, ourselves: Privacy via distributed noise generation," in Proceedings of the 25th Annual International Conference on the Theory and Applications of Cryptographic Techniques (EUROCRYPT), 2006, pp. 486–503.

[21] S. Inusah and T. J. Kozubowski, "A discrete analogue of the Laplace distribution," Journal of Statistical Planning and Inference, vol. 136, no. 3, pp. 1090–1102, 2006.

[22] C. Dwork, G. N. Rothblum, and S. Vadhan, "Boosting and differential privacy," in Proceedings of the 51st Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2010, pp. 51–60.

[23] S. Oh and P. Viswanath, "The composition theorem for differential privacy," technical report, 2013.

[24] C. Dwork and A. Roth, "The algorithmic foundations of differential privacy," Foundations and Trends in Theoretical Computer Science, vol. 9, no. 3–4, pp. 211–407, 2014.

[25] D. Boneh, E.-J. Goh, and K. Nissim, "Evaluating 2-DNF formulas on ciphertexts," in Theory of Cryptography, 2005, pp. 325–341.

[26] K. Henry, "The theory and applications of homomorphic cryptography," Ph.D. dissertation, University of Waterloo, Canada, 2008.

[27] V. S. Miller, "The Weil pairing, and its efficient calculation," Journal of Cryptology, vol. 17, no. 4, pp. 235–261, 2004.

[28] A. J. Menezes, P. C. Van Oorschot, and S. A. Vanstone, Handbook of Applied Cryptography. CRC Press, 1996.

[29] F. D. McSherry, "Privacy integrated queries: an extensible platform for privacy-preserving data analysis," in Proceedings of the ACM SIGMOD International Conference on Management of Data, 2009, pp. 19–30.

[30] R. S. Katti and C. Ababei, "Secure comparison without explicit XOR," arXiv preprint arXiv:1204.2854, 2012.

[31] The Pairing-Based Cryptography Library. [Online]. Available: https://crypto.stanford.edu/pbc/

[32] An online user behavior dataset. [Online]. Available: http://www.datatang.com/data/43910

[33] M. P. Armstrong, G. Rushton, D. L. Zimmerman et al., "Geographically masking health data to preserve confidentiality," Statistics in Medicine, vol. 18, no. 5, pp. 497–525, 1999.

[34] HealthVault. [Online]. Available: https://www.healthvault.com

[35] W. He, X. Liu, H. Nguyen, K. Nahrstedt, and T. Abdelzaher, "PDA: Privacy-preserving data aggregation in wireless sensor networks," in Proceedings of the 26th IEEE International Conference on Computer Communications (INFOCOM), 2007, pp. 2045–2053.


[36] C. Castelluccia, A. C. Chan, E. Mykletun, and G. Tsudik, "Efficient and provably secure aggregation of encrypted data in wireless sensor networks," ACM Transactions on Sensor Networks (TOSN), vol. 5, no. 3, p. 20, 2009.

[37] R. A. Popa, C. Redfield, N. Zeldovich, and H. Balakrishnan, "CryptDB: protecting confidentiality with encrypted query processing," in Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP), 2011, pp. 85–100.

[38] E. Shi, J. Bethencourt, T.-H. Chan, D. Song, and A. Perrig, "Multi-dimensional range query over encrypted data," in Proceedings of the IEEE Symposium on Security and Privacy (S&P), 2007, pp. 350–364.

[39] D. X. Song, D. Wagner, and A. Perrig, "Practical techniques for searches on encrypted data," in Proceedings of the IEEE Symposium on Security and Privacy (S&P), 2000, pp. 44–55.

[40] X. Yi, M. G. Kaosar, R. Paulet, and E. Bertino, "Single-database private information retrieval from fully homomorphic encryption," IEEE Transactions on Knowledge and Data Engineering (TKDE), vol. 25, no. 5, pp. 1125–1134, 2013.

[41] A. Lopez-Alt, E. Tromer, and V. Vaikuntanathan, "On-the-fly multiparty computation on the cloud via multikey fully homomorphic encryption," in Proceedings of the 44th ACM Symposium on Theory of Computing (STOC), 2012, pp. 1219–1234.

[42] F. Chen and A. X. Liu, "Privacy and integrity preserving multi-dimensional range queries for cloud computing," in IFIP Networking, 2014, pp. 1–9.

[43] R. Li, A. X. Liu, A. L. Wang, and B. Bruhadeshwar, "Fast range query processing with strong privacy protection for cloud computing," in Proceedings of the VLDB Endowment, vol. 7, no. 14, 2014.

[44] B. Mood, D. Gupta, K. Butler, and J. Feigenbaum, "Reuse it or lose it: More efficient secure computation through reuse of encrypted values," in Proceedings of the ACM SIGSAC Conference on Computer and Communications Security (CCS), 2014, pp. 582–596.

[45] A. Narayan and A. Haeberlen, "DJoin: differentially private join queries over distributed databases," in Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2012, pp. 149–162.

[46] O. Goldreich, Secure Multi-Party Computation. [Online]. Available: http://www.wisdom.weizmann.ac.il/~oded/PS/prot.ps

[47] C. Rottondi, G. Verticale, and A. Capone, "A security framework for smart metering with multiple data consumers," in IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), 2012, pp. 103–108.

[48] K. P. Puttaswamy, R. Bhagwan, and V. N. Padmanabhan, "Anonygator: privacy and integrity preserving data aggregation," in Proceedings of the ACM/IFIP/USENIX 11th International Middleware Conference (Middleware), 2010, pp. 85–106.

[49] J. C. Duchi, M. I. Jordan, and M. J. Wainwright, "Local privacy and statistical minimax rates," in Proceedings of the 54th Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2013, pp. 429–438.

Jianwei Qian is a Ph.D. student in the Department of Computer Science, Illinois Institute of Technology, USA. He received his B.E. degree in Computer Science and Technology from Shanghai Jiao Tong University, P.R. China, in 2015. His research interests include privacy and security issues in big data and social networking.

Fudong Qiu is a Ph.D. candidate in the Department of Computer Science and Engineering, Shanghai Jiao Tong University, P.R. China. He received his B.S. degree in Computer Science and Technology from Xi'an Jiao Tong University in 2012. His research interests include security and privacy preservation in wireless networks. He is a student member of ACM, CCF, and IEEE.

Fan Wu is an associate professor in the Department of Computer Science and Engineering, Shanghai Jiao Tong University. He received his B.S. in Computer Science from Nanjing University in 2004, and his Ph.D. in Computer Science and Engineering from the State University of New York at Buffalo in 2009. He has visited the University of Illinois at Urbana-Champaign (UIUC) as a postdoctoral research associate. His research interests include wireless networking and mobile computing, algorithmic game theory and its applications, and privacy preservation. He has published more than 100 peer-reviewed papers in leading technical journals and conference proceedings. He is a recipient of the China National Natural Science Fund for Outstanding Young Scientists, the CCF-Intel Young Faculty Researcher Program Award, the CCF-Tencent "Rhinoceros Bird" Open Fund, and Pujiang Scholar. He has served as the chair of CCF YOCSEF Shanghai, on the editorial board of Elsevier Computer Communications, and as a member of the technical program committees of more than 40 academic conferences. For more information, please visit http://www.cs.sjtu.edu.cn/~fwu/.

Na Ruan received the B.S. degree in Information Engineering and the M.S. degree in Communication and Information Systems from China University of Mining and Technology in 2007 and 2009, respectively. She received the D.E. degree from the Faculty of Engineering, Kyushu University, Japan, in 2012. In 2013, she joined the Department of Computer Science and Engineering of Shanghai Jiao Tong University as an assistant professor. Her current research interests are in wireless network security and game theory. Dr. Ruan is a member of the Information Processing Society of Japan (IPSJ), China Computer Federation (CCF), ACM, and IEEE.

Guihai Chen earned his B.S. degree from Nanjing University in 1984, M.E. degree from Southeast University in 1987, and Ph.D. degree from the University of Hong Kong in 1997. He is a distinguished professor at Shanghai Jiao Tong University, China. He has been invited as a visiting professor by many universities, including Kyushu Institute of Technology, Japan, in 1998, the University of Queensland, Australia, in 2000, and Wayne State University, USA, from September 2001 to August 2003. He has a wide range of research interests with a focus on sensor networks, peer-to-peer computing, high-performance computer architecture, and combinatorics. He has published more than 250 peer-reviewed papers, more than 170 of which are in well-archived international journals such as IEEE Transactions on Parallel and Distributed Systems, Journal of Parallel and Distributed Computing, Wireless Networks, The Computer Journal, International Journal of Foundations of Computer Science, and Performance Evaluation, and also in well-known conference proceedings such as HPCA, MOBIHOC, INFOCOM, ICNP, ICPP, IPDPS, and ICDCS.


Shaojie Tang is currently an assistant professor in the Naveen Jindal School of Management at the University of Texas at Dallas. He received his Ph.D. in computer science from Illinois Institute of Technology in 2012. His research interests include social networks, mobile commerce, game theory, e-business, and optimization. He received the Best Paper Awards at ACM MobiHoc 2014 and IEEE MASS 2013. He also received the ACM SIGMOBILE service award in 2014. Tang has served in various positions (as chair and TPC member) at numerous conferences, including ACM MobiHoc and IEEE ICNP. He is an editor of Elsevier Information Processing in Agriculture and the International Journal of Distributed Sensor Networks.