PRiSM Lab. - UMR 8144. Privacy Preserving SQL Query Execution on Distributed Data. Quoc-Cuong To, Benjamin Nguyen, Philippe Pucheral (SMIS Project). LaHDAK Seminar, Orsay, 4th March 2014. Université de Versailles et St-Quentin, INRIA Rocquencourt, CNRS.

Talk Benjamin NGUYEN



Page 2: Talk  Benjamin NGUYEN

PART I: The New Oil

Outline:
I. The New Oil
II. Trusted Cells
III. Global SQL Queries
IV. Cost Model and Experiments
V. Conclusion

Page 3: Talk  Benjamin NGUYEN

Mass-generation of (personal) data

Data sources have mostly turned digital:

• Analog processes, e.g., photography, films
• Paper-based interactions, e.g., banking, e-administration
• Communications, e.g., email, SMS, MMS, Skype

Where is your personal data? … In data centers:

• 112 new emails per day → mail servers
• 65 SMS sent per day → telcos
• 800 pages of social data → social networks
• Web searches, list of purchases → Google, Amazon

[Photo: St. Peter's Square, Rome. People recording, people listening.]

WHY? Is this a problem?

Everything is free…

Information extraction

Page 4: Talk  Benjamin NGUYEN

Personal data is the new oil

Is this good news?

• $2 billion a year spent by US companies on third-party data about individuals (Forrester Report)
• $44.25 is the estimated return on $1 invested in email marketing (oil is up to $0.5/yr)

High market value companies:

• Facebook: value / #accounts = $50
• Google: $38 billion business, sells ads based on how people search the Web
• Amazon (knows purchase intent), mail order systems companies (Gmail), loyalty programs (supermarkets), banks & insurance, employment market (LinkedIn, Viadeo), travel & transportation (voyages-sncf), the "love" market (Meetic), etc.

Page 5: Talk  Benjamin NGUYEN

Personal data is the new oil

… or bad news?

How would oil companies behave?

• Exploit your oil field for free → know all about you
• Offer "extra" services → refine their knowledge
• Provide real services to their paying customers (e.g., advertisement and profiling, location tracking and spying, …)

In other words: your personal data would be processed by sophisticated data refineries… REGARDLESS OF YOUR PRIVACY!

It's the business model…

Their choice / Your choice

Page 6: Talk  Benjamin NGUYEN

Is the current centralised model good with respect to privacy protection?

Intrinsic problem #1: personal data is exposed to sophisticated attacks
– High benefits from a successful hack
– One person's negligence may affect millions

Intrinsic problem #2: personal data is hostage to sudden privacy changes
– Centralised administration of data means delegation of control
– This leads to regular changes, with application (and business) evolution, with mergers and acquisitions, etc. (e.g., Facebook 2012)

Increasing security is only a partial solution, since it does not solve these intrinsic limitations. E.g., TrustedDB [BS12] proposes tamper-resistant hardware to secure outsourced centralized databases.

Page 7: Talk  Benjamin NGUYEN

A New Hope

A Personal Data Ecosystem… built around user-centricity and trust, achieved through a decentralized architecture: THE TRUSTED CELL!

"I want my privacy back!!"

Our goals:
• Preserve current USER functionalities
• Hinder uncontrolled data exploitation & privacy violations

Our targets:
• General data management applications: SQL
• "Low cost" solutions (i.e., acceptable by the general public)

Page 8: Talk  Benjamin NGUYEN

PART II: Trusted Cells

Page 9: Talk  Benjamin NGUYEN

The Secure (Trusted) Personal Data Server Approach [AAB+10]

Personal database is:
• Well-organized
• Tamper resistant
• Controlled by the owner (sharing, retention, audit)
• Accessible in disconnected mode

Approach characteristics:
• Based on tamper-resistant HW
• Well Structured World (R-DB, limited apps)
• Uniform equipment

TRUSTED CELL

Page 10: Talk  Benjamin NGUYEN

Why trust personal secure HW solutions?

1. Users store their own data → minimizes abusive usage
2. Auto-administered platform → no DBA attack (even by the user)
3. Enforce privacy principles for externalized (shared) data → best if the recipient of the data is another TC
4. Tamper-resistance + certified code/secure execution + single user + physical access needed → the cost/benefit ratio of an attack is very high

[Figure: devices placed along a "Tamper resistance" axis]
• Gemalto secure token
• SMIS token (ZED)
• Trust Zone architecture
• Dedicated HW device
• PC? (social trust / open source)

Page 11: Talk  Benjamin NGUYEN

The Trusted Cell Asymmetric Architecture

TC asymmetric architecture: built using Secure Portable Tokens as Trusted Cells (called here Trusted Data Servers, or TDSs) and the Cloud as Supporting Server Infrastructure (SSI).

ASYMMETRIC:
• SSI: HIGH POWER / AVAILABILITY, LOW / NO TRUST
• TDS: LOW POWER / AVAILABILITY, HIGH TRUST

[Figure labels: Durability, Availability; Secure Computation; Export Data; Encrypted; Private Data Generated (e.g., sensor)]

Challenges:
• Local (embedded) data management (not my work: Anciaux, Bouganim, Pucheral et al.)
• Global querying (Part III)
• Data export management (MinExp Project with CG78 & LIX)

Page 12: Talk  Benjamin NGUYEN

PART III: Global SQL Queries on the Asymmetric Architecture

Page 13: Talk  Benjamin NGUYEN

Example Trusted Cell: a Trusted Data Server (TDS)

Token characteristics:
• High security:
  – High cost/benefit ratio of an attack;
  – Secure against its owner;
• Modest computing resources (~10Kb of RAM, 50 MHz CPU);
• Low availability: physically controlled by its owner; connects and disconnects at will

How to compute global queries over decentralized personal data stores while respecting users' privacy?

[Figure labels: Authorized Querier; "Average Salary in Orsay"; Unauthorized Querier]

Page 14: Talk  Benjamin NGUYEN

Secure Global Computation on TCs

PROBLEM: How to perform global queries on the asymmetric architecture (i.e., using data from many/all cells)?

The "classical" problem of Secure Global Computation (e.g., SMC) is more general and makes no trust assumption.

THREAT MODEL:

TC can be:
• Unbreakable (honest)
• Broken (weakly malicious)

Infrastructure (SSI) can be:
• Honest-but-curious (HBC, semi-honest)
• Weakly malicious (WM, covert adversary = does not want to be detected)

Combinations:
• HBC + Unbreakable → "simple protocols" presented here (EDBT'14 [TNP14])
• WM + Broken → must be prevented! (via security primitives), see [ANP13]

Page 15: Talk  Benjamin NGUYEN

Is this a new problem?

Several approaches are possible to securely perform global computations:

1. Use only an untrusted server/cloud/P2P and use generic (and costly) algorithms (e.g., Secure Multi-Party Computation [Yao82, GMW87, CKL06], fully homomorphic encryption [Gent09]). Problem = COST
2. Use only an untrusted server/cloud/P2P and develop a specific algorithm for each specific class of queries or applications (e.g., Data Mining Toolkit [CKV+02]). Problem = GENERICITY
3. Introduce a tangible element of trust, through the use of a trusted component, and develop a generic methodology to execute any centralized algorithm in this context ([Katz07, GIS+10, AAB+10]). Problem = TRUST

Page 16: Talk  Benjamin NGUYEN

Hypothesis on Querier and SSI

Querier:
• Shares the secret key with TDSs (to encrypt the query & decrypt the result).
• Classical access control policy (e.g., RBAC):
  – Cannot get the raw data stored in TDSs (gets only the final result)
  – Can obtain only authorized views of the dataset (inference attacks are not considered)

Supporting Server Infrastructure:
• Does not know the query (and thus the attributes in the GROUP BY clause), because the query is encrypted by the Querier before being sent to the SSI.
• Has prior knowledge about the data distribution.
• Honest-but-curious attacker: frequency-based attack
  – The SSI matches plaintext and ciphertext values of the same frequency, e.g., by investigating remarkable (very high/low) frequencies in the dataset distribution (e.g., X is the only person with a given (high) age who is still working and earning money → if I find a group with only one member, I can deduce that X participates in the dataset).
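The frequency-based attack can be illustrated with a minimal Python sketch. It assumes the grouping attribute is encrypted deterministically (equal plaintexts give equal ciphertexts) and that the SSI knows the expected count of each plaintext value. The function name `frequency_match`, the opaque tokens and the counts are hypothetical illustration data, not taken from the talk.

```python
from collections import Counter

def frequency_match(ciphertexts, known_distribution):
    """Guess ciphertext -> plaintext by pairing values of equal frequency rank.

    ciphertexts: deterministic ciphertexts observed by the SSI.
    known_distribution: plaintext -> expected count (the SSI's prior knowledge).
    """
    observed = Counter(ciphertexts)
    # Rank both sides from most to least frequent (ties broken by value).
    cipher_ranked = sorted(observed, key=lambda c: (-observed[c], c))
    plain_ranked = sorted(known_distribution,
                          key=lambda p: (-known_distribution[p], str(p)))
    return dict(zip(cipher_ranked, plain_ranked))

# Hypothetical example: ages encrypted deterministically into opaque tokens.
ciphertexts = ["#x3Z"] * 5 + ["$f2&"] * 3 + ["T?f2"] * 1
prior = {25: 5, 45: 3, 92: 1}
guess = frequency_match(ciphertexts, prior)
print(guess)
```

The group with a single member is the most exposed one: its ciphertext maps to the only plaintext value with frequency 1, which is the situation described for person X.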

Page 17: Talk  Benjamin NGUYEN

Solution Overview

1) Query (sent to the Supporting Server Infrastructure, SSI). General form:

SELECT <attribute(s) and/or aggregate function(s)>
FROM <Table(s) / SPTs>
[WHERE <condition(s)>]
[GROUP BY <grouping attribute(s)>]
[HAVING <grouping condition(s)>]
[SIZE <size condition(s)>];

Example (sample tuples: John, 35K; Mary, 43K; Paul, 100K):

SELECT age, AVG(salary)
FROM user
WHERE town = "Orsay"
GROUP BY age
HAVING MIN(salary) > 0
SIZE

2) Collection and Filtering phase (stop condition: max #tuples or max time)
3) Aggregation phase
4) Aggregate Filtering phase
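As a rough illustration of the collection and filtering phase and its stop condition (max #tuples or max time), here is a plaintext Python sketch. Encryption and the TDS/SSI split are deliberately left out. The function `collect`, the record layout, and the pairing of names with towns and ages are assumptions made for the example.

```python
import time

def collect(tds_stream, where, max_tuples, max_seconds):
    """Gather tuples satisfying WHERE until one of the SIZE bounds is reached."""
    collected = []
    deadline = time.monotonic() + max_seconds
    for record in tds_stream:
        # Stop condition: max #tuples or max time.
        if len(collected) >= max_tuples or time.monotonic() > deadline:
            break
        if where(record):
            collected.append(record)
    return collected

# Hypothetical records (one per TDS).
records = [
    {"name": "John", "town": "Orsay", "age": 25, "salary": 35_000},
    {"name": "Mary", "town": "Orsay", "age": 45, "salary": 43_000},
    {"name": "Paul", "town": "Paris", "age": 53, "salary": 100_000},
]
in_orsay = lambda r: r["town"] == "Orsay"

kept = collect(iter(records), in_orsay, max_tuples=10, max_seconds=5.0)
capped = collect(iter(records), in_orsay, max_tuples=1, max_seconds=5.0)
print([r["name"] for r in kept], [r["name"] for r in capped])
```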

Page 18: Talk  Benjamin NGUYEN

Proposed Solutions

The main difficulty is with AGGREGATE QUERIES!

Solutions vary depending on which kind of encryption is used, how the SSI constructs the partitions, and what information is revealed to the SSI.

• Secure aggregation solution
• Noise-based solutions
  – random (white) noise
  – noise controlled by the complementary domain
• Histogram-based solutions

We investigate these solutions along the directions of performance and security.

Page 19: Talk  Benjamin NGUYEN

Secure Aggregation

Q: SELECT Age, AVG(Salary) WHERE city = Orsay GROUP BY Age HAVING Min(Salary) > 0

[Figure: message flow between the Querier, the Supporting Server Infrastructure (SSI) and the TDSs. Recoverable steps:]

1. The Querier sends Qi = <EK1(Q), Cred, Size>.
2. Each TDS decrypts Qi, checks the AC rules, and encrypts its data using non-deterministic encryption, e.g., (25y, Orsay, 35K) → (#x3Z, aW4r) and (45y, Orsay, 43K) → ($f2&, bG?3). For (53y, Paris, 100K): no answer?
3. The SSI forms partitions of encrypted tuples (sized to fit the resources of a TDS) and holds the partial aggregations (Gij, AGGk).
4. Partial aggregation on decrypted partitions, e.g., (25, 35K), (45, 43K), (45, 37K) → (25, [35K, 1]), (45, [40K, 2]), re-encrypted as (F!d2, s7@z), (ZL5=, w2^Z).
5. Final aggregation, e.g., (25, 29.5K), (45, 43.7K), …
6. The HAVING clause is evaluated and the encrypted final result is returned to the Querier.
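The partial aggregation idea can be sketched in Python as follows. This is a plaintext sketch under stated assumptions: encryption and decryption are omitted, each partition is assumed to fit in one TDS, and the helper names `partial_aggregate`, `merge` and `finalize` are hypothetical. The tuples reuse the example values (25, 35K), (45, 43K), (45, 37K).

```python
def partial_aggregate(partition):
    """One TDS: build group -> (sum, count, min) for its partition."""
    agg = {}
    for age, salary in partition:
        s, c, m = agg.get(age, (0, 0, float("inf")))
        agg[age] = (s + salary, c + 1, min(m, salary))
    return agg

def merge(partials):
    """Combine partial aggregations; (sum, count, min) merge associatively."""
    merged = {}
    for partial in partials:
        for age, (s, c, m) in partial.items():
            ms, mc, mm = merged.get(age, (0, 0, float("inf")))
            merged[age] = (ms + s, mc + c, min(mm, m))
    return merged

def finalize(merged, having_min_above=0):
    """Evaluate HAVING MIN(salary) > bound, then compute AVG(salary)."""
    return {age: s / c for age, (s, c, m) in merged.items()
            if m > having_min_above}

# Two partitions, as the SSI might form them.
partitions = [[(25, 35_000), (45, 43_000)], [(45, 37_000)]]
result = finalize(merge(partial_aggregate(p) for p in partitions))
print(result)
```

AVG is carried as a (sum, count) pair rather than as a running average, so that partial results from different TDSs can be merged in any order.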

Page 20: Talk  Benjamin NGUYEN

Noise Based Protocols

Efficiency problem of Secure Aggregation: nDet_Enc on AG → the SSI cannot gather tuples belonging to the same group into the same partition.

But: Det_Enc on AG → frequency-based attack.

Idea: add noise (fake tuples) to hide the distribution of AG.

How many fake tuples (nf) are needed? It depends on the disparity in frequencies among AG values:
– small nf: random noise
– big nf: white noise
– nf = n-1: controlled noise (n: AG domain cardinality)

Efficiency:
– Each TDS handles tuples belonging to one group (instead of a large partial aggregation as in SAgg)
– However, high cost of generating and processing the very large number of fake tuples
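A minimal Python sketch of the controlled-noise case (nf = n-1) follows. It assumes each TDS holds one true tuple and emits one fake tuple for every other value of the AG domain. The function `with_controlled_noise` and the toy domain are hypothetical. In a real protocol the true/fake flag would be hidden inside the encrypted payload, and only the deterministic encryption of the grouping value would be visible to the SSI.

```python
from collections import Counter

def with_controlled_noise(true_value, domain):
    """One TDS: its true tuple plus one fake per other domain value (nf = n - 1)."""
    out = [(true_value, True)]
    out += [(v, False) for v in domain if v != true_value]
    return out

domain = [25, 45, 53]                # AG domain, cardinality n = 3
true_values = [25, 25, 25, 45, 53]   # skewed true distribution over 5 TDSs
emitted = [t for v in true_values for t in with_controlled_noise(v, domain)]

# What the SSI can count: occurrences of each (encrypted) grouping value.
observed = Counter(value for value, _ in emitted)
print(observed, len(emitted))
```

Every grouping value ends up with the same observed frequency, so the frequency-based attack learns nothing. The price is n tuples per TDS instead of one.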

Page 21: Talk  Benjamin NGUYEN

Nearly Equi-Depth Histogram Solution

1. The distribution of AG is discovered and distributed to all TDSs.
2. Each TDS allocates its tuple to the corresponding bucket.
3. Each TDS sends to the SSI: {h(bucketId), nDet_Enc(tuple)}

[Figure: true distribution vs. nearly equi-depth histogram]

Consequences:
• We do not generate & process too many fake tuples
• We do not handle too large partial aggregations

Problem: the distribution must be discovered. This can be done "offline" using secure aggregation!
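One way to build a nearly equi-depth histogram from a known distribution is a greedy assignment, sketched below in Python. The talk does not say how buckets are built, so this heuristic is an assumption. The function `equi_depth_buckets` and the sample distribution are hypothetical.

```python
def equi_depth_buckets(distribution, num_buckets):
    """Greedy heuristic: give each value (most frequent first) to the lightest bucket."""
    buckets = [{"values": [], "depth": 0} for _ in range(num_buckets)]
    ordered = sorted(distribution.items(), key=lambda kv: (-kv[1], kv[0]))
    for value, count in ordered:
        lightest = min(buckets, key=lambda b: b["depth"])
        lightest["values"].append(value)
        lightest["depth"] += count
    return buckets

# Hypothetical distribution of the grouping attribute: value -> frequency.
distribution = {20: 50, 30: 30, 40: 20, 50: 10, 60: 10}
buckets = equi_depth_buckets(distribution, 2)

# Lookup used by a TDS to find the bucket of its own tuple.
bucket_of = {v: i for i, b in enumerate(buckets) for v in b["values"]}
print([b["depth"] for b in buckets], bucket_of)
```

Since all buckets have (nearly) the same depth, the hashed bucket identifiers observed by the SSI have a nearly uniform frequency, which is what defeats the frequency-based attack without adding fake tuples.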

Page 22: Talk  Benjamin NGUYEN

Information Exposure Analysis [DCJP+03]

To measure Information Exposure, we consider the probability that an attacker (here the honest-but-curious SSI) can reconstruct the plaintext table (or part of the table) using the encrypted table and their prior knowledge about the global distributions of plaintext attributes.

Information Exposure is denoted ɛ:

$$\epsilon = \frac{1}{n} \sum_{i=1}^{n} \prod_{j=1}^{k} IC_{i,j}$$

where:
• n is the number of tuples
• k is the number of attributes
• IC_{i,j} is the value in row i and column j of the inverse cardinality table (= 1 / number of plaintext values that could correspond)
• N_j is the number of distinct plaintext values in the global distribution of the attribute in column j (i.e., N_j ≤ n).
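The exposure coefficient, read here as the average over the n rows of the product of the k inverse cardinalities in each row, can be computed with a short Python sketch. The function name `exposure` and the sample values of N_j are assumptions chosen for illustration.

```python
from math import prod

def exposure(ic_table):
    """Average over rows of the product of inverse cardinalities in the row."""
    n = len(ic_table)
    return sum(prod(row) for row in ic_table) / n

# Hypothetical setting: n = 5 tuples, k = 2 attributes with N_1 = 4, N_2 = 10.
n, N = 5, [4, 10]

# Every plaintext value is equally plausible for every cell: IC = 1/N_j.
hidden = exposure([[1 / Nj for Nj in N] for _ in range(n)])

# Plaintext table: exactly one candidate value per cell, IC = 1.
plain = exposure([[1, 1] for _ in range(n)])
print(hidden, plain)
```

The two calls bracket the range of the metric: 1 means the table is fully exposed, and 1/(N_1 · N_2) is what remains when the attacker can only guess uniformly among the known plaintext values.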

Page 23: Talk  Benjamin NGUYEN

Information Exposure Analysis (Damiani et al., CCS 2003)

• n: the number of tuples
• k: the number of attributes
• IC_{i,j}: IC for row i and column j
• N_j: the number of distinct plaintext values in the global distribution of the attribute in column j (i.e., N_j ≤ n)

SAgg: IC_{i,j} = 1/N_j for all i, j, hence

$$\epsilon_{S\_Agg} = \frac{1}{n} \sum_{i=1}^{n} \prod_{j=1}^{k} \frac{1}{N_j} = 1 \Big/ \prod_{j=1}^{k} N_j$$

EDHist: requires finding all possible partitions of the plaintext values such that the sum of their occurrences is the cardinality of the hashed value: the NP-hard multiple subset sum problem.

$$\min(\epsilon_{ED\_Hist}) = 1 \Big/ \prod_{j=1}^{k} N_j$$

Noise_based & ED_Hist have a uniform distribution of the AG: ɛED_Hist = ɛNoise_based

Plaintext:

$$\epsilon_{P\_Text} = \frac{1}{n} \sum_{i=1}^{n} \prod_{j=1}^{k} 1 = 1$$

ɛS_Agg ≤ ɛED_Hist = ɛNoise_based < 1

Page 24: Talk  Benjamin NGUYEN

PART IV: Cost Model and Experiments

Page 25: Talk  Benjamin NGUYEN

Unit Test Calibration

[Figure: internal time consumption]

Eval board:
• 32-bit RISC CPU: 120 MHz
• Crypto-coprocessor: AES, SHA
• 64 KB RAM, 1 GB NAND-Flash
• USB full speed: 12 Mbps

SMIS-developed token (ZED electronics):
• Same technical characteristics
• Price = 50 EUR (small series)

Page 26: Talk  Benjamin NGUYEN

Parameters for cost model

• Dataset size Ttuple: varies from 5 to 65 million
• Number of groups G: varies from 1 to 10^6
• Number of TDSs participating in the computation, as a percentage of all TDSs connected at a given time, Ttds: varies from 1% to 100%

We fix two parameters and vary the other, measuring: execution time, parallelism of the protocol, total load, maximum load on one TDS.

When the parameters are fixed: Ttuple = 10^6, G = 10^3, % of TDSs connected = 10% of Ttuple.

We also compute and use the optimal value for all reduction factors as well as for.

In the figures, we plot two curves for the Rnf_Noise protocols, RN (nf = 2) and WN (nf = 1000), to capture the impact of the ratio of fake tuples.

Page 27: Talk  Benjamin NGUYEN

EXECUTION TIME

[Figure, two panels: Ttuple = 10^6, G = 1 to 10^6; and Ttuple = 5·10^6 to 35·10^6, G = 1000]

Varying G:

Naïve, noise-based, ED & EW:
• G increases, Ttuple fixed → the number of tuples in each group decreases
• Execution time depends only on the total number of tuples in each group (because all groups are processed in parallel) → exeTime decreases when G increases

Secure Count:
• G increases → the time for processing the big partial aggregation increases accordingly
• Cannot fully exploit parallel computation (a group cannot be divided among TDSs in parallel; each TDS has to handle all G groups) → exeTime increases

Varying Ttuple:

Naïve, RN, ED & EW:
• Ttuple increases, Ttds increases accordingly → not much change

Secure Count:
• The number of recursive steps increases when Ttuple increases → exeTime increases

WN, CN:
• The number of fake tuples increases linearly with the number of true tuples → exeTime also increases linearly, to handle both fake and true tuples

Page 28: Talk  Benjamin NGUYEN

NUMBER OF PARTICIPATING TDSs

[Figure, two panels: Ttuple = 10^6, G = 1 to 10^6; and Ttuple = 5·10^6 to 35·10^6, G = 1000]

Varying G:

Secure Count:
• G increases → the level of convergence is low & the size of each aggregation is big → fewer participating TDSs are needed to build the aggregations and reach a high convergence level

Other solutions:
• Since each group is processed in parallel and independently, when G increases the level of parallelism increases → more TDSs are needed to participate in the parallel computation

Varying Ttuple:

WN, CN:
• When the number of true tuples (Ttuple) increases, the number of fake tuples increases as well → more TDSs are needed to process the fake tuples

Secure Count:
• The level of parallelism is lower than in the other solutions → needs the fewest TDSs

Page 29: Talk  Benjamin NGUYEN

TOTAL LOAD (NETWORK OVERHEAD)

[Figure, two panels: Ttuple = 10^6, G = 1 to 10^6; and Ttuple = 5·10^6 to 35·10^6, G = 1000]

Varying G:

Noise-based:
• Highest load because of the fake tuples
• When G increases but Tpds does not change → the number of tuples (both true and fake) does not change → total load is the same

Others:
• Lower load, since they handle only true data

Varying Ttuple:

Noise-based:
• When the number of true tuples (Ttuple) increases, the number of fake tuples increases linearly → total load is highest and increases

Page 30: Talk  Benjamin NGUYEN

MAXIMUM LOAD

[Figure, two panels: Ttuple = 10^6, G = 1 to 10^6; and Ttuple = 5·10^6 to 35·10^6, G = 1000]

Varying G:

Secure Count:
• When G increases, the size of each aggregation is big → each PDS processes a bigger aggregation
• When G increases, the number of participating PDSs decreases → each participating PDS incurs a higher load

Others:
• When G increases, the number of participating PDSs increases & the number of tuples in each group decreases → each PDS processes fewer tuples → maxLoad decreases

Varying Ttuple:

WN, CN:
• Use all available PDSs → maxLoad increases linearly when Ttuple increases

Others:
• When Ttuple increases, the number of participating PDSs also increases accordingly → in general, maxLoad does not increase much

Page 31: Talk  Benjamin NGUYEN

AVERAGE LOAD

[Figure, two panels: Ttuple = 10^6, G = 1 to 10^6; and Ttuple = 5·10^6 to 35·10^6, G = 1000]

Secure Count:
• Total load is unchanged but the number of participating TDSs is reduced when G increases → the average load increases

WN, CN:
• The (high) total load is the same & all PTpds = 10^5 participate in the computation → every PDS incurs the same amount of load

Others:
• G increases → more participating PDSs & total load unchanged → AvgLoad decreases

Although TotalLoad(CN) > TotalLoad(SC), PTpds(CN) >> PTpds(SC) → AvgLoad(CN) < AvgLoad(SC)
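The comparison between CN and SC can be written out explicitly. This assumes, as the names suggest but the slides do not state, that the average load is the total load divided by the number of participating PDSs:

$$\mathrm{AvgLoad} = \frac{\mathrm{TotalLoad}}{PT_{pds}}
\quad\Longrightarrow\quad
\mathrm{AvgLoad}(CN) < \mathrm{AvgLoad}(SC)
\iff
\frac{\mathrm{TotalLoad}(CN)}{\mathrm{TotalLoad}(SC)} < \frac{PT_{pds}(CN)}{PT_{pds}(SC)}$$

Under that assumption, CN has the lower average load whenever its number of participants exceeds that of SC by a larger factor than its total load does.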

Page 32: Talk  Benjamin NGUYEN

CONSUMED MEMORY

[Figure: consumed memory per protocol, compared with the actual RAM size of a TDS]

Noise-based:
• Needs to store only 1 group regardless of G → requires the least RAM

Histogram-based:
• Each PDS stores h groups (h > 1) regardless of G → requires more RAM

SC:
• Each PDS stores all G groups
• When G increases, the RAM needed increases → requires the most RAM
• Exceeds the actual RAM size → future work

Page 33: Talk  Benjamin NGUYEN

AVERAGE TIME FOR PDS TO CONNECT

[Figure, two panels: Ttuple = 10^6, G = 1 to 10^6; and Ttuple = 5·10^6 to 35·10^6, G = 1000]

Secure Count:
• The number of participating PDSs is reduced when G increases → the average time increases

WN, CN:
• The (high) total load is unchanged & all PTpds = 10^5 participate in the computation → every PDS takes the same amount of time to process data

Others:
• G increases → more participating PDSs → AvgTime decreases

High AvgTime:
• WN, CN: because of too many fake tuples
• SC: because of very few participating PDSs

Page 34: Talk  Benjamin NGUYEN

Theoretical Scalability

[Figure, three panels: Tpds = 1% Ttuple; Tpds = 10% Ttuple; Tpds = 100% Ttuple]

• Secure Count: has a (low) maximum number of participants.
• Others: WN has higher scalability than the others (in the sense that adding participants counts)

Page 35: Talk  Benjamin NGUYEN


Experimental Scalability

Page 36: Talk  Benjamin NGUYEN

COMPARISON WITH OTHER STATE-OF-THE-ART METHODS

Hardware:
• Linux workstation
• AMD Athlon-64 2 GHz processor
• 512 MB memory

• SC: depends mostly on G (slightly on Ttuple)
• Others: do not depend on G, but mostly on Ttuple

Answering aggregation queries in a secure system model (Ge & Zdonik, VLDB 2007):

DES: each value is decrypted and the computation is performed on the plaintext. The server must have access to the secret key & plaintext (violates security requirements).

Paillier: performs computation directly on the ciphertext using a secure homomorphic encryption scheme: enc(a + b) = enc(a) + enc(b). The server performs the computation without having access to the secret key or plaintext. In the end, ciphertexts are passed back to the trusted agent (i.e., Key Holder) to perform a final decryption and a simple calculation of the final result.
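The additive homomorphism of Paillier can be demonstrated with a toy Python sketch. It uses tiny, insecure primes and the common generator choice g = n + 1; the helper names and the sample values are hypothetical. In Paillier, the "+" on ciphertexts in the slide notation is concretely a multiplication modulo n².

```python
import math
import random

def keygen(p, q):
    """Toy Paillier keys with g = n + 1 (tiny primes, illustration only)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # modular inverse, Python 3.8+
    return n, (lam, mu)

def encrypt(n, m, rng):
    """Non-deterministic encryption: c = (n+1)^m * r^n mod n^2."""
    n2 = n * n
    r = rng.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = rng.randrange(1, n)
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2

def decrypt(n, key, c):
    lam, mu = key
    x = pow(c, lam, n * n)
    return ((x - 1) // n) * mu % n

def add_encrypted(n, c1, c2):
    """Server side: multiply ciphertexts to add the hidden plaintexts."""
    return (c1 * c2) % (n * n)

rng = random.Random(0)
n, key = keygen(47, 59)            # n = 2773, far too small for real use
salaries_k = [35, 43, 37]          # hypothetical values, in thousands
ciphertexts = [encrypt(n, s, rng) for s in salaries_k]

total = ciphertexts[0]
for c in ciphertexts[1:]:
    total = add_encrypted(n, total, c)

decrypted_sum = decrypt(n, key, total)   # done by the key holder only
print(decrypted_sum)
```

The server only ever runs `add_encrypted`. The key holder decrypts the sum and finishes the calculation, for instance dividing by the count to obtain an average.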

Page 37: Talk  Benjamin NGUYEN

Metrics for the evaluation of the proposed solutions

• Total Load
• Average Time/Load
• Query Response Time
• Information Exposure
• Throughput
• Resource Variation

Page 38: Talk  Benjamin NGUYEN

Trade-off between criteria

Query: SELECT .. FROM .. WHERE .. GROUP BY AG, with G = card(AG)

Security: S_Agg > ED_Hist

Performance:
• G > 10: ED_Hist > S_Agg
• G <= 10: ED_Hist < S_Agg

Page 39: Talk  Benjamin NGUYEN

PART V: Conclusion and perspectives

Page 40: Talk  Benjamin NGUYEN

Short/middle-term research: data-intensive computing on an asymmetric architecture

SQL
• Queries here do not have joins!
• Take into account malicious SSI / broken tokens
• Field experiment on usability (with ISN)

Private/Secure MapReduce
• Investigate compatibility of our protocols.
• Develop new protocols.
• Check performance!

XML management
• Adapt the work on XQ2P (Butnaru, Gardarin, Nguyen) to the Trusted Cells context.
• Distributed window queries.

Page 41: Talk  Benjamin NGUYEN

Promoting the Trusted Cells vision

Trusted Cells "Core"
• Open hardware and software bundle: basic functionalities (Local DB, Distributed DB, NoSQL DB) needed to develop PbD personal data management applications!
• Promote an open source community around Trusted Cells.

UVSQ FabLab
• Bring secure data management to the Versailles FabLab

Beyond tamper-resistant HW
• Results are usable even with lower-trust elements.
• Include social trust / reputation.

Page 42: Talk  Benjamin NGUYEN

QUESTIONS?
