Privacy Preserving Data Mining – Secure multiparty computation and random response techniques

Li Xiong

CS573 Data Privacy and Security

Outline

• Privacy preserving two-party decision tree mining using SMC protocols (Lindell & Pinkas ’00)

• Primitive SMC protocols

– Secure sum

– Secure union (encryption based)

– Secure max (probabilistic random response based)

– Secure union (probabilistic and randomization based)

• Secure data mining using sub protocols

• Random response for privacy preserving data mining or data sanitization

Random response protocols

• Multi-round probabilistic protocols

• Randomization probability associated with each round

• Random response with randomization probability

Max Protocol – multi-round random response

• Multiple rounds

• Randomization probability at round r:

– $P_r(r) = P_0 \cdot d^{r-1}$

• Local algorithm at round r and node i:

– If $g_{i-1}(r) \ge v_i$: $g_i(r) = g_{i-1}(r)$

– If $g_{i-1}(r) < v_i$: $g_i(r) = \mathrm{rand}\,[g_{i-1}(r), v_i)$ w/ prob $P_r$, and $g_i(r) = v_i$ w/ prob $1 - P_r$

[Figure: node i receives $g_{i-1}(r)$, compares it with its own value $v_i$, and passes on $g_i(r)$]
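The local algorithm above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the starting global value of 0, the non-negative inputs and the default parameters are choices made here, not part of the slides.

```python
import random

def randomization_probability(p0, d, r):
    """P_r(r) = p0 * d**(r-1); rounds are numbered from 1."""
    return p0 * d ** (r - 1)

def local_step(g_prev, v_i, p_r, rng):
    """Update of the global value at one node in one round."""
    if g_prev >= v_i:
        return g_prev  # own value is not larger: pass the global value on
    if rng.random() < p_r:
        # Hide v_i behind a random value drawn from [g_prev, v_i).
        return g_prev + (v_i - g_prev) * rng.random()
    return v_i  # reveal the true value

def secure_max(values, p0=1.0, d=0.5, rounds=10, seed=0):
    """Pass the global value around the ring for a fixed number of rounds."""
    rng = random.Random(seed)
    g = 0.0  # assumed starting value; inputs are assumed non-negative
    for r in range(1, rounds + 1):
        p_r = randomization_probability(p0, d, r)
        for v in values:
            g = local_step(g, v, p_r, rng)
    return g
```

With `p0=0.0` no value is ever randomized and a single round returns the exact maximum. With `p0=1.0` every node hides its value in the first round, and later rounds reveal it with increasing probability.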

Max Protocol - Illustration

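[Figure: example run of the max protocol around a ring of nodes (labels D2, D3, D4 and values such as 30, 10, 20, 40 appear in the diagram), showing the global value passed from node to node in successive rounds]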

Min/Max Protocol - Correctness

• Precision bound:

– precision $\ge 1 - \prod_{j=1}^{r} P_r(j) = 1 - P_0^{r} \cdot d^{r(r-1)/2}$

– Converges with r

– Smaller P0 and d provide faster convergence
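The product form of the precision bound and its closed form can be compared numerically. The sketch below assumes the per-round probability $P_r(j) = P_0 \cdot d^{j-1}$ from the max protocol; the function names are illustrative.

```python
def precision_bound_product(p0, d, r):
    """1 - prod_{j=1..r} p0 * d**(j-1), computed term by term."""
    product = 1.0
    for j in range(1, r + 1):
        product *= p0 * d ** (j - 1)
    return 1.0 - product

def precision_bound_closed_form(p0, d, r):
    """1 - p0**r * d**(r*(r-1)/2), the same bound in closed form."""
    return 1.0 - p0 ** r * d ** (r * (r - 1) / 2)
```

For example, with `p0=1.0`, `d=0.5` and `r=3` both forms give 1 - 1 · 0.5 · 0.25 = 0.875.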

Min/Max Protocol - Cost

• Communication cost

– single round: O(n)

– Minimum number of rounds for a given precision guarantee (1-e)
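The minimum number of rounds can be found by direct search. This sketch assumes the precision bound $1 - P_0^{r} \cdot d^{r(r-1)/2}$ from the correctness slide. It is a numeric search written for illustration, not the slide's own closed-form expression, and the function name and `max_rounds` cap are choices made here.

```python
def min_rounds(p0, d, e, max_rounds=10_000):
    """Smallest r such that p0**r * d**(r*(r-1)/2) <= e."""
    failure = 1.0  # running product of the per-round randomization probabilities
    for r in range(1, max_rounds + 1):
        failure *= p0 * d ** (r - 1)
        if failure <= e:
            return r
    raise ValueError("precision target not reached within max_rounds")
```

With `p0=1.0`, `d=0.5` and `e=0.001` the running product is 1, 0.5, 0.125, 0.015625, 0.0009765625, so five rounds are enough. Each round costs O(n) messages, so the total communication is that round count times n.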


Min/Max Protocol - Security

• Probability/confidence based metric: P(C|IR,R)

[Figure: scale for P(C|IR,R) running from 0 (absolute privacy) through 0.5 to 1 (provable exposure)]

– Different types of exposures based on claim

• Data value: vi = a

• Data ownership: Vi contains a

– Change of beliefs

• P(C|IR,R) – P(C|R)

• P(C|IR, R) / P(C|R)

• Relationship to privacy in anonymization

– Change of beliefs P(C|D*, BR) – P(C|BR)

Min/Max Protocol – Security (Analysis)

• Upper bound for average expected change of beliefs:

$\max_r \frac{1}{2^{r-1}} \cdot (1 - P_0 \cdot d^{r-1})$

• Larger P0 and d provide better privacy

Min/Max Protocol – Security (Experiments)


• Loss of privacy decreases with increasing number of nodes

• Probabilistic protocol achieves better privacy (close to 0)

• When n is large, anonymous protocol is actually okay!

Union

• Commutative encryption based approach

– Number of rounds: 2 rounds

– Each round: encryption and decryption

• Multi-round random-response approach?

Vector

• Each database has a boolean vector of the data items

• Union vector is a logical OR of all vectors

[Figure: bit vectors with positions b1 … bL held by nodes p1, p2, …, pc are combined by a position-wise OR into the group vector VG]

Privacy Preserving Indexing of Documents on the Network, Bawa, 2003
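As a point of reference, the union vector itself is just a position-wise OR. The sketch below computes it directly with no privacy protection at all, so it only shows the target result that a secure protocol has to reproduce. The function name is illustrative.

```python
def union_vector(vectors):
    """Position-wise logical OR of equal-length 0/1 vectors."""
    length = len(vectors[0])
    if any(len(v) != length for v in vectors):
        raise ValueError("all vectors must have the same length")
    return [int(any(v[i] for v in vectors)) for i in range(length)]
```

For example, `union_vector([[1, 0, 0], [0, 1, 0], [0, 0, 0]])` gives `[1, 1, 0]`.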

Group Vector Protocol

Processing of VG’ at ps of round r:

    Pex = 1/2^r, Pin = 1 - Pex
    for (i = 1; i <= L; i++)
        if (Vs[i] = 1 and VG’[i] = 0)
            Set VG’[i] = 1 with prob. Pin
        …
            Set VG’[i] = 0 with prob. Pex

[Figure: example run over nodes p1, p2, …, pc with local vectors v1, v2, …, vc, showing VG’ after each node in round 1 (Pex = 1/2, Pin = 1/2) and round 2 (Pex = 1/4, Pin = 3/4)]
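The per-node processing step can be sketched in Python as below. One point is an assumption: the slide text does not preserve the condition guarding the "set to 0" step, so this sketch assumes it applies when the node does not hold the item but the bit is already set (Vs[i] = 0 and VG’[i] = 1). Function names and the all-zero starting vector are also illustrative choices.

```python
import random

def process_at_node(vg, vs, r, rng):
    """Update the group vector vg at a node with local vector vs in round r."""
    p_ex = 1.0 / 2 ** r
    p_in = 1.0 - p_ex
    out = list(vg)
    for i in range(len(out)):
        if vs[i] == 1 and out[i] == 0:
            if rng.random() < p_in:
                out[i] = 1
        elif vs[i] == 0 and out[i] == 1:
            # ASSUMPTION: guard condition is not preserved in the slide text.
            if rng.random() < p_ex:
                out[i] = 0
    return out

def group_vector(vectors, rounds=20, seed=0):
    """Pass the group vector around all nodes for a number of rounds."""
    rng = random.Random(seed)
    vg = [0] * len(vectors[0])  # assumed initial group vector
    for r in range(1, rounds + 1):
        for vs in vectors:
            vg = process_at_node(vg, vs, r, rng)
    return vg
```

Because Pex halves every round, late rounds almost never clear a bit and almost always set a missing one, so the vector settles on the true union. Under this reading, a bit that no node holds is never set.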

Random Shares based Secure Union

• Phase 1: random item addition

– Multiple rounds with permuted ring

– Each node sends a random share of its item set and a random share of a random item set

• Phase 2: random item removal

– Each node subtracts its random item set

Random Shares based Secure Union - Analysis

• Item exposure attack

– An adversary makes a claim C on a particular item a node i contributes to the final result (C: vi in xi)

• Set exposure attack

– An adversary makes a claim C on the whole set of items a node i contributes to the final union result X (C: xi = ai).

• Change of beliefs (posterior probability and prior probability)

– P(C|IR,X) - P(C|X)

– P(C|IR,X)/P(C|X)


Exposure Risk – Set Exposure

• Disclosure decreases with increasing number of generated random items and increasing number of participating nodes

• Set exposure risk is at or close to 0 for the probabilistic and crypto approaches

Exposure Risk – Item Exposure

• Item exposure risk decreases with increasing number of generated random items and participating nodes

• Item exposure risk for the probabilistic approach is quite high

Cost Comparison

• Commutative protocol and anonymous communication protocol are efficient but sensitive to union size

• Probabilistic protocol is efficient but sensitive to domain size

• Estimated runtime for the general circuit-based protocol implemented with the FairplayMP framework is 15 days, 127 days and 1.4 years for the domain sizes tested

Open issues

• Tradeoff between accuracy, efficiency, and security

• How to quantify security

• How to design adjustable protocols

• Can we generalize the random-response algorithms and randomization algorithms for operators based on their properties?

– Operators: sum, union, max, min …

– Properties: commutative, associative, invertible, randomizable

Specific Secure Tools

• Secure Sum

• Secure Comparison

• Secure Union

• Secure Logarithm

• Secure Poly. Evaluation

Data Mining on Horizontally Partitioned Data

• Association Rule Mining

• Decision Trees

• EM Clustering

• Naïve Bayes Classifier

Specific Secure Tools

• Secure Comparison

• Secure Set Intersection

• Secure Dot Product

• Secure Logarithm

• Secure Poly. Evaluation

Data Mining on Vertically Partitioned Data

• Association Rule Mining

• Decision Trees

• K-means Clustering

• Naïve Bayes Classifier

• Outlier Detection

Summary of SMC Based PPDDM

• Mainly used for distributed data mining.

• Efficient/specific cryptographic solutions have been developed for many distributed data mining problems.

• Random response or randomization based protocols offer a tradeoff between accuracy, efficiency, and security

• Mainly semi-honest assumption (i.e., parties follow the protocols)

Ongoing research

• New models that offer a better trade-off between efficiency and security

• Game theoretic / incentive issues in PPDM

Outline

• Privacy preserving two-party decision tree mining using SMC protocols (Lindell & Pinkas ’00)

• Primitive SMC protocols

– Secure sum

– Secure union (encryption based)

– Secure max (probabilistic random response based)

– Secure union (probabilistic and randomization based)

• Secure data mining using sub protocols

• Random response for privacy preserving data mining or data collection

Data Collection Model

Data cannot be shared directly because of privacy concerns

Randomized Response

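• Question: Do you smoke?

• Biased coin: $P(\text{Head}) = \theta$, $\theta \ne 0.5$

• The true answer is “Yes”:

– Head: answer “Yes”

– Tail: answer “No”

• A true “Yes” is therefore reported as “Yes” with probability $\theta$

$P'(\text{Yes}) = P(\text{Yes}) \cdot \theta + P(\text{No}) \cdot (1 - \theta)$

$P'(\text{No}) = P(\text{Yes}) \cdot (1 - \theta) + P(\text{No}) \cdot \theta$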
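Since P(No) = 1 - P(Yes), the first equation can be solved for the true proportion: P(Yes) = (P'(Yes) - (1 - θ)) / (2θ - 1), which is why θ = 0.5 is excluded. The sketch below simulates the coin-flip scheme and applies this inversion. The function names, sample size and seed are illustrative assumptions.

```python
import random

def randomize(true_answer, theta, rng):
    """Report the truth on Head (prob. theta), the opposite on Tail."""
    return true_answer if rng.random() < theta else not true_answer

def estimate_true_yes(observed_yes, theta):
    """Invert P'(Yes) = P(Yes)*theta + (1 - P(Yes))*(1 - theta)."""
    if theta == 0.5:
        raise ValueError("theta = 0.5 makes the responses uninformative")
    return (observed_yes - (1 - theta)) / (2 * theta - 1)

def simulate(true_rate, theta, n, seed=0):
    """Collect n randomized answers and estimate the true 'Yes' rate."""
    rng = random.Random(seed)
    yes = 0
    for _ in range(n):
        truth = rng.random() < true_rate
        if randomize(truth, theta, rng):
            yes += 1
    return estimate_true_yes(yes / n, theta)
```

For a true rate of 0.3 and θ = 0.8, the expected observed rate is 0.3 · 0.8 + 0.7 · 0.2 = 0.38, and the inversion gives (0.38 - 0.2) / 0.6 = 0.3. No individual answer reveals the respondent's true value with certainty.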

Randomized Response

• Multiple attributes encoded in bits

• Biased coin: $P(\text{Head}) = \theta$, $\theta \ne 0.5$

– Head: true answer E: 110

– Tail: false answer !E: 001

Using Randomized Response Techniques for Privacy-Preserving Data Mining, Du, 2003
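The multi-attribute variant flips the whole bit string at once rather than each bit separately. A minimal sketch, with the function name and the string encoding chosen here for illustration:

```python
import random

def respond(bits, theta, rng):
    """Return the true bit string E on Head (prob. theta), its complement !E on Tail."""
    if rng.random() < theta:
        return bits
    return "".join("1" if b == "0" else "0" for b in bits)
```

With the slide's example, `respond("110", theta, rng)` returns either `"110"` or `"001"`, so the relationship between the attributes is preserved up to a global complement.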

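Generalization for Multi-Valued Categorical Data

• True value $S_i$ is reported as $S_i$, $S_{i+1}$, $S_{i+2}$, $S_{i+3}$ with probabilities $q_1$, $q_2$, $q_3$, $q_4$ respectively (indices wrap around)

$$
\begin{pmatrix} P'(s_1) \\ P'(s_2) \\ P'(s_3) \\ P'(s_4) \end{pmatrix}
=
\underbrace{\begin{pmatrix}
q_1 & q_4 & q_3 & q_2 \\
q_2 & q_1 & q_4 & q_3 \\
q_3 & q_2 & q_1 & q_4 \\
q_4 & q_3 & q_2 & q_1
\end{pmatrix}}_{M}
\begin{pmatrix} P(s_1) \\ P(s_2) \\ P(s_3) \\ P(s_4) \end{pmatrix}
$$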
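The matrix relation P' = M · P can be used in both directions: forward to predict the distribution of reported values, and backward (by solving the linear system) to estimate the true distribution from the reported one. The sketch below uses plain Python lists. The helper names and the small Gaussian-elimination solver are illustrative, and it assumes M is invertible.

```python
def rr_matrix(q):
    """Circulant matrix with M[i][j] = q[(i - j) mod n], as in the slide."""
    n = len(q)
    return [[q[(i - j) % n] for j in range(n)] for i in range(n)]

def apply_matrix(m, p):
    """Forward direction: reported distribution P' = M * P."""
    return [sum(m[i][j] * p[j] for j in range(len(p))) for i in range(len(m))]

def solve(m, b):
    """Backward direction: solve M * x = b by Gauss-Jordan elimination."""
    n = len(b)
    a = [row[:] + [b[i]] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular; P cannot be recovered")
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]
```

For instance, with `q = [0.7, 0.1, 0.1, 0.1]` and true distribution `[0.4, 0.3, 0.2, 0.1]`, `apply_matrix` gives the reported distribution and `solve` recovers the original values from it.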

A Generalization

• RR Matrices [Warner 65], [R.Agrawal 05], [S. Agrawal 05]

• RR Matrix can be arbitrary

• Can we find optimal RR matrices?

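$$
M = \begin{pmatrix}
a_{11} & a_{12} & a_{13} & a_{14} \\
a_{21} & a_{22} & a_{23} & a_{24} \\
a_{31} & a_{32} & a_{33} & a_{34} \\
a_{41} & a_{42} & a_{43} & a_{44}
\end{pmatrix}
$$

OptRR: Optimizing Randomized Response Schemes for Privacy-Preserving Data Mining, Huang, 2008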

What is an optimal matrix?

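• Which of the following is better?

$$
M_1 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\qquad
M_2 = \begin{pmatrix} \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\ \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \\ \frac{1}{3} & \frac{1}{3} & \frac{1}{3} \end{pmatrix}
$$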


Privacy: M2 is better

Utility: M1 is better

So, what is an optimal matrix?

Optimal RR Matrix

• An RR matrix M is optimal if no other RR matrix’s privacy and utility are both better than M’s (i.e., no other matrix dominates M).

– Privacy Quantification

– Utility Quantification

• A number of privacy and utility metrics have been proposed.

– Privacy: how accurately one can estimate individual info.

– Utility: how accurately we can estimate aggregate info.
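The dominance relation behind this definition can be sketched as follows. Two assumptions are made here: each matrix has already been scored as a (privacy, utility) pair where higher is better for both, and dominance is taken in the usual Pareto sense (at least as good in both, strictly better in one), which is slightly broader than "both better". The scores in the usage note are made-up values for illustration only.

```python
def dominates(a, b):
    """True if score pair a is at least as good as b in both objectives and differs from b."""
    return a[0] >= b[0] and a[1] >= b[1] and a != b

def pareto_front(scores):
    """Keep the entries whose (privacy, utility) pair is dominated by no other entry."""
    return {
        name: s
        for name, s in scores.items()
        if not any(dominates(other, s) for other in scores.values())
    }
```

With hypothetical scores `{"M1": (0.0, 1.0), "M2": (1.0, 0.0), "M3": (0.5, 0.5), "M4": (0.4, 0.4)}`, the front keeps M1, M2 and M3. M4 is dropped because M3 is better on both counts.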

Optimization Methods

• Approach 1: Weighted sum: w1 · Privacy + w2 · Utility

• Approach 2

– Fix Privacy, find M with the optimal Utility.

– Fix Utility, find M with the optimal Privacy.

– Challenge: Difficult to generate M with a fixed privacy or utility.

• Proposed Approach: Multi-Objective Optimization

Optimization algorithm

• Evolutionary Multi-Objective Optimization (EMOO)

• The algorithm

– Start with a set of initial RR matrices

– Repeat the following steps in each iteration

• Mating: selecting two RR matrices in the pool

• Crossover: exchanging several columns between the two RR matrices

• Mutation: changing some values in an RR matrix

• Meet the privacy bound: filtering the resultant matrices

• Evaluate the fitness value for the new RR matrices.

Note: the fitness value is defined in terms of privacy and utility metrics
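The crossover and mutation steps can be sketched as operators on column-stochastic matrices. This is not the OptRR implementation: the column-wise storage, the 50% swap chance, the mutation rate and the renormalization step are all assumptions made here. The privacy filter and fitness evaluation are left out because the slides do not fix the metrics.

```python
import random

def random_rr_matrix(n, rng):
    """Random RR matrix stored column-wise: m[j][i] = P(report i | true value j)."""
    cols = []
    for _ in range(n):
        raw = [rng.random() + 1e-9 for _ in range(n)]
        total = sum(raw)
        cols.append([x / total for x in raw])
    return cols

def crossover(a, b, rng):
    """Exchange a random subset of columns between two parent matrices."""
    swap = [rng.random() < 0.5 for _ in range(len(a))]
    child1 = [list(b[j]) if swap[j] else list(a[j]) for j in range(len(a))]
    child2 = [list(a[j]) if swap[j] else list(b[j]) for j in range(len(a))]
    return child1, child2

def mutate(m, rng, rate=0.1):
    """Perturb some entries, then renormalize each column to sum to 1."""
    out = []
    for col in m:
        new = [x * (1 + rng.uniform(-0.5, 0.5)) if rng.random() < rate else x
               for x in col]
        total = sum(new)
        out.append([x / total for x in new])
    return out
```

Swapping whole columns keeps every column a valid probability distribution, so the children are again RR matrices. Mutation needs the explicit renormalization to preserve that property.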

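Illustration

Output of Optimization

• The optimal set is often plotted in the objective space as a Pareto front.

[Figure: candidate matrices M1 to M8 plotted in the objective space with Privacy and Utility axes, running from “Worse” to “Better”; results for the first attribute of the Adult data]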

Summary

• Privacy preserving data mining

– Secure multi-party computation protocols

– Random response techniques for computation and data collection

• Knowledge sensitive data mining