Privacy Preserving K-means Clustering on Vertically Partitioned Data

Presented by: Jaideep Vaidya

Joint work: Prof. Chris Clifton



Overview

• Global Problem
  – Privacy Preserving Distributed Data Mining

• Specific Problem
  – Clustering (K-Means)

• For
  – Vertically Partitioned Data

• Using
  – Cryptographic Tools

Medical Records

TID   Brain Tumor?   Diabetes?
RPJ   Yes            Diabetic
CAC   No Tumor       No
PTR   No Tumor       Diabetic

Cell Phone Data

TID   Model   Battery
RPJ   5210    Li/Ion
CAC   none    none
PTR   3650    NiCd

Global Database View

TID   Brain Tumor?   Diabetes?   Model   Battery

Vertical Partitioning of Data
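To make the partitioning concrete, here is a minimal Python sketch, assuming each site stores its attributes in a dictionary keyed by TID. The names `medical_site`, `phone_site` and `global_view` are hypothetical and not part of the talk. The join it performs is the global database view that the privacy-preserving protocol is meant to avoid building in any one place.

```python
# Hypothetical illustration: two sites hold different attributes
# for the same entities, linked only by the shared key TID.
medical_site = {
    "RPJ": {"brain_tumor": "Yes", "diabetes": "Diabetic"},
    "CAC": {"brain_tumor": "No Tumor", "diabetes": "No"},
    "PTR": {"brain_tumor": "No Tumor", "diabetes": "Diabetic"},
}
phone_site = {
    "RPJ": {"model": "5210", "battery": "Li/Ion"},
    "CAC": {"model": "none", "battery": "none"},
    "PTR": {"model": "3650", "battery": "NiCd"},
}


def global_view(*sites):
    """Join vertically partitioned tables on their shared key (TID)."""
    tids = set(sites[0])
    for site in sites[1:]:
        tids &= set(site)  # keep only entities known to every site
    view = {}
    for tid in sorted(tids):
        row = {}
        for site in sites:
            row.update(site[tid])  # each site contributes its own columns
        view[tid] = row
    return view


if __name__ == "__main__":
    for tid, row in global_view(medical_site, phone_site).items():
        print(tid, row)
```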

Is the problem trivial?

Privacy Preserving Data Mining

• Perturbation
  – Agrawal & Srikant, Agrawal & Aggarwal
  – Rizvi & Haritsa, Evfimievski et al.

• Cryptographic
  – Lindell & Pinkas, Du & Zhan
  – Vaidya & Clifton, Kantarcioglu & Clifton

Secure Multiparty Computation (SMC)

• Given a function $f$ and $n$ inputs $x_1, x_2, \ldots, x_n$, distributed at $n$ sites, compute the result

$$y = f(x_1, x_2, \ldots, x_n)$$

while revealing nothing to any site except its own input(s) and the result.

Results

• Cluster assignment for entities
  – Not private

• Cluster centers
  – Semi-private

[Figure: example attribute values 2.3, 34, 19, 15.5, 5210, Li/Ion, Piezo]

Secure K-means clustering

Arbitrarily select k starting points $\mu'_1, \mu'_2, \ldots, \mu'_k$

Repeat
  – Assign $\mu'_1, \mu'_2, \ldots, \mu'_k$ to $\mu_1, \mu_2, \ldots, \mu_k$ respectively
  – (Re)assign each object to the closest cluster based on distance from the mean
  – Re-compute the cluster means $\mu'_1, \mu'_2, \ldots, \mu'_k$

Until no change
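For reference, the loop above can be sketched as ordinary, non-secure k-means in Python. This is only a baseline under stated assumptions: squared Euclidean distance, starting points sampled from the data, and an exact "no change" test on the means. The function names are hypothetical, and none of the privacy machinery from the talk appears here.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, seed=0, max_iter=100):
    """Plain (non-private) k-means following the select/assign/recompute loop."""
    rng = random.Random(seed)
    # Arbitrarily select k starting points.
    new_means = [list(p) for p in rng.sample(points, k)]
    labels = []
    for _ in range(max_iter):
        means = new_means  # the new means become the current means
        # (Re)assign each object to the closest cluster.
        labels = [
            min(range(k), key=lambda i: squared_distance(p, means[i]))
            for p in points
        ]
        # Re-compute the cluster means.
        new_means = []
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                new_means.append([sum(c) / len(members) for c in zip(*members)])
            else:
                new_means.append(means[i])  # assumption: keep an empty cluster's mean
        if new_means == means:  # until no change
            break
    return new_means, labels


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
    print(kmeans(data, 2))
```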

K-means clustering

Assigning objects to closest cluster

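For every object/entity $O$, with its attributes held across sites $P_1, P_2, \ldots, P_r$:

Compute $\arg\min_{i = 1, \ldots, k} \sum_{j = 1}^{r} x_{ij}$

where $x_{ij}$ is site $P_j$'s local component of the distance from $O$ to cluster $i$.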
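The assignment step can be sketched in Python as follows, assuming squared Euclidean distance so that the total distance splits into one additive component per site. The metric and the names `local_components` and `closest_cluster` are assumptions for illustration. This version sums the components in the clear, which is exactly what the secure protocol avoids.

```python
def local_components(local_attrs, local_means):
    """One site's x_ij for every cluster i, using only that site's attributes."""
    return [
        sum((a - m) ** 2 for a, m in zip(local_attrs, mean))
        for mean in local_means
    ]


def closest_cluster(site_components):
    """argmin over clusters i of the sum over sites j of x_ij (non-private)."""
    k = len(site_components[0])
    totals = [sum(x_j[i] for x_j in site_components) for i in range(k)]
    return min(range(k), key=totals.__getitem__)


if __name__ == "__main__":
    # One object split over two sites; two clusters.
    x_site1 = local_components((1.0, 2.0), [(0.0, 0.0), (1.0, 2.0)])
    x_site2 = local_components((3.0,), [(0.0,), (4.0,)])
    print(x_site1, x_site2, closest_cluster([x_site1, x_site2]))
```

In the example, the two sites compute their components independently, and only the sum per cluster determines the result.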

Key Idea

• Disguise site components with random values

• Compare distances while revealing only comparison result

• Permute order of clusters to conceal meaning of comparison results

Closest Cluster Computation

• 3 special sites: P1, P2 and Pr

• P1 generates
  – r random vectors $V_i$ such that $\sum_{i=1}^{r} V_i = 0$
  – Permutation $\pi$ (over 1 .. k)

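Permutation Protocol (Du and Atallah ’01)

[Figure: A holds $X$; B holds $\pi$ and $V$. A sends $(E(X), E)$ to B. B returns $\pi(E(X + V))$. A decrypts to obtain $\pi(X + V)$.]

Homomorphic encryption: $E_k(x) \cdot E_k(y) = E_k(x + y)$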
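A toy Python sketch of this exchange follows. It assumes a Paillier-style cryptosystem, which has the additive property stated above; the talk does not name a specific scheme, so that choice is an assumption. The primes are far too small to be secure, both parties run in one process, and all function names are hypothetical.

```python
import math
import random


def keygen(p=1789, q=1861):
    """Toy Paillier-style keys (g = n + 1). Insecure demo primes."""
    n = p * q
    phi = (p - 1) * (q - 1)
    mu = pow(phi, -1, n)  # modular inverse, needs Python 3.8+
    return n, (phi, mu)


def encrypt(n, m, rng):
    n2 = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n2) * pow(r, n, n2)) % n2


def decrypt(n, priv, c):
    phi, mu = priv
    return ((pow(c, phi, n * n) - 1) // n * mu) % n


def permutation_protocol(x, v, perm, rng):
    """A holds x; B holds v and perm. A ends up with the permuted x + v."""
    # A: generate keys and send (E(X), E) to B.
    n, priv = keygen()
    enc_x = [encrypt(n, xi, rng) for xi in x]
    # B: multiply ciphertexts to add V under encryption, then permute.
    n2 = n * n
    enc_sum = [(c * encrypt(n, vi, rng)) % n2 for c, vi in zip(enc_x, v)]
    permuted = [enc_sum[p] for p in perm]  # position i takes entry perm[i]
    # A: decrypt; A sees only the permuted, disguised values.
    return [decrypt(n, priv, c) for c in permuted]


if __name__ == "__main__":
    rng = random.Random(0)
    print(permutation_protocol([5, 9, 2], [100, 200, 300], [2, 0, 1], rng))
```

The sketch keeps all values non-negative and below `n`; handling negative disguise values would need a modular encoding that is not shown here.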

Closest Cluster Computation

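Stage 1

[Figure: P1 holds $\pi$ and the vectors $V_i$. Each other site, from P2 (holding $X_2$) to Pr (holding $X_r$), sends $(E_i(X_i), E_i)$ to P1 and receives $\pi(E_i(X_i + V_i))$.]

Stage 2

[Figure: P1, P3, ..., Pr-1 send $\pi(X_1 + V_1)$, $\pi(X_3 + V_3)$, ..., $\pi(X_{r-1} + V_{r-1})$ to Pr, which forms $\sum_{i \neq 2} \pi(X_i + V_i)$.]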

Closest Cluster Computation

• Stage 3
  – P2 and Pr determine i, the index of the cluster with minimum distance

• Stage 4
  – P1 computes and broadcasts $\pi^{-1}(i)$
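The four stages can be walked through with a plain Python simulation. This is a sketch under several assumptions: integer distance components, at least three sites, and no encryption or secure comparison, so every step that would be protected in the real protocol is done in the clear. Function names are hypothetical.

```python
import random


def zero_sum_vectors(r, k, rng, bound=10**6):
    """r random length-k vectors whose component-wise sum is zero."""
    vs = [[rng.randrange(-bound, bound) for _ in range(k)] for _ in range(r - 1)]
    vs.append([-sum(col) for col in zip(*vs)])
    return vs


def permute(perm, vec):
    return [vec[p] for p in perm]  # position i takes entry perm[i]


def closest_cluster(xs, rng):
    """xs[j] is site j's vector of distance components, one entry per cluster."""
    r, k = len(xs), len(xs[0])
    assert r >= 3, "sketch assumes distinct P1, P2 and Pr"
    # P1 generates the disguise vectors and the permutation.
    vs = zero_sum_vectors(r, k, rng)
    perm = list(range(k))
    rng.shuffle(perm)
    # Stage 1: every site ends up with its permuted, disguised vector
    # (done via the permutation protocol in the real scheme).
    t = [permute(perm, [x + v for x, v in zip(xs[j], vs[j])]) for j in range(r)]
    # Stage 2: all sites except P2 (index 1) contribute to the sum held by Pr.
    y = [sum(t[j][i] for j in range(r) if j != 1) for i in range(k)]
    # Stage 3: P2 and Pr find the minimum of t[1] + y, which equals the
    # permuted true totals because the disguise vectors cancel.
    i_min = min(range(k), key=lambda i: t[1][i] + y[i])
    # Stage 4: P1 undoes the permutation; with this list convention the
    # original cluster behind permuted position i is perm[i].
    return perm[i_min]


if __name__ == "__main__":
    components = [[5, 1, 9], [4, 2, 7], [6, 3, 8]]
    print(closest_cluster(components, random.Random(1)))
```

Because the disguise vectors sum to zero, the result does not depend on the random choices, only on the true totals per cluster.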

When to stop?

• Locally compute difference in means

• Globally known threshold

• Use simple random-adding technique to disguise actual values
  – First party adds random value to its distance and sends to next party
  – Each party adds its value to total and sends on
  – Last party compares with first party’s random + threshold
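The random-adding check can be sketched in Python as below. It assumes integer (for example, scaled) local differences and a stop condition of "total at most threshold"; both are assumptions, as is the function name `should_stop`. The final comparison is done in the clear here, whereas the protocol would compare without revealing the operands.

```python
import random


def should_stop(local_diffs, threshold, rng, bound=10**9):
    """Simulate the ring of parties adding their local differences to a masked total."""
    mask = rng.randrange(bound)       # first party's random value
    running = mask + local_diffs[0]   # first party disguises its own distance
    for diff in local_diffs[1:]:
        running += diff               # each party adds its value and sends on
    # Last party compares the masked total with (first party's random + threshold).
    return running <= mask + threshold


if __name__ == "__main__":
    rng = random.Random(0)
    print(should_stop([1, 2, 3], 10, rng))  # total 6 is within the threshold
    print(should_stop([5, 6, 7], 10, rng))  # total 18 exceeds the threshold
```

Intermediate parties only ever see the running total offset by the first party's random value, so no individual difference is exposed along the way.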

Communication Cost

• r parties, n data elements, m bit distances

Method                Bits         Rounds
Basic Algorithm       O(knr)       O(r+k)
Optimized Algorithm   O(kmr)       O(r)
Generic Method        O(kmnr^3)    1
Non-Secure Method     O(n)         1

Conclusion

• Presented a solution for Privacy Preserving K-Means Clustering problem

• How to use clusters?

• Will parties share required information for the possible benefits?

• Improve Efficiency

• Working on EM-Clustering, implementations