A Secure Clustering Algorithm for Distributed Data Streams Geetha Jagannathan Rutgers University Joint work with Krishnan Pillaipakkamnatt and D. Umano

Page 1: A Secure Clustering Algorithm for Distributed Data Streams

A Secure Clustering Algorithm for Distributed Data Streams

Geetha Jagannathan

Rutgers University

Joint work with Krishnan Pillaipakkamnatt and D. Umano

Page 2: A Secure Clustering Algorithm for Distributed Data Streams

Outline

The problem
Prior results
Clustering data streams
Experimental results and comparison
A privacy-preserving protocol
Conclusion

Page 3: A Secure Clustering Algorithm for Distributed Data Streams

The problem

Alice and Bob each have a data stream, defined on the same attributes (a horizontal partition).

They wish to compute a clustering on the combined data.

Page 4: A Secure Clustering Algorithm for Distributed Data Streams

Alice's input: data stream $D_1$

Bob's input: data stream $D_2$

Output: a k-clustering of $D_1^m \cup D_2^n$

$D_1^m$: the first m elements of $D_1$

$D_2^n$: the first n elements of $D_2$

Page 5: A Secure Clustering Algorithm for Distributed Data Streams

Clustering on joint data: Alice’s data

k = 4

Page 6: A Secure Clustering Algorithm for Distributed Data Streams

Clustering on joint data: Bob’s data

k = 4

Page 7: A Secure Clustering Algorithm for Distributed Data Streams

Clustering on joint data: combined data

k = 4

Page 8: A Secure Clustering Algorithm for Distributed Data Streams

Trusted third party

[Diagram: Alice sends $D_1^m$ and Bob sends $D_2^n$ to a trusted third party, which returns the k-clustering to each of them.]

Page 9: A Secure Clustering Algorithm for Distributed Data Streams

Privacy requirements

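Parties are semi-honest (they follow the protocol but may try to learn more from the messages they see).

The protocol should give the same guarantee as a trusted third party: it reveals nothing but the final output.

In this case, the output is the k cluster centers.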

Page 10: A Secure Clustering Algorithm for Distributed Data Streams

Prior results

Privacy-preserving data mining (PPDM) protocols convert distributed data mining algorithms into private ones.

The k-means algorithm is the basis for many clustering protocols [VC03, JKM05, JW05, BO07], but these protocols “leak” intermediate information.

[JPW05] presents a leak-free clustering protocol based on a new clustering algorithm.

Page 11: A Secure Clustering Algorithm for Distributed Data Streams

Our Contributions

A leak-free privacy-preserving protocol for distributed data streams.

A data stream clustering algorithm:
Better than k-means (on average).
Comparable performance with BIRCH on many data sets, but with lower memory needs.

Page 12: A Secure Clustering Algorithm for Distributed Data Streams

Data Stream Algorithms

Data arrives in “stream” fashion: d1, d2, …, dn, … (the “end” of the stream is not known ahead of time).

Data is too large to fit entirely in memory.

Data can be accessed only in the order that it arrives.

Each data item can only be “read” once.

Page 13: A Secure Clustering Algorithm for Distributed Data Streams

The clustering algorithm

“Incrementally agglomerative”: It merges intermediate clusters without waiting for all the data to be available.

Runs in time linear in n.

Page 14: A Secure Clustering Algorithm for Distributed Data Streams

Overview of clustering algorithm

[Figure: example with k = 5. Level-0 clusterings are merged into level-1 clusterings, and those into level-2 clusterings; the output shown is the one expected after n = 25 data points.]

Page 15: A Secure Clustering Algorithm for Distributed Data Streams

Clustering Algorithm Outline

The algorithm maintains a list of k-clusterings (each clustering is on some partial data).

In each iteration:
Input the next k data points as a level-0 clustering.
If two clusterings at level i are in the list, “merge” them into a level-(i + 1) k-clustering.

Page 16: A Secure Clustering Algorithm for Distributed Data Streams

Clustering algorithm outline

If output is needed after some n points have been read, all k-clusterings are “merged” into a single k-clustering.
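The list of clusterings in the outline above behaves much like a binary counter: each new level-0 clustering can trigger a chain of merges into higher levels. The following is a minimal Python sketch of that bookkeeping only, assuming every level-0 clustering holds exactly k points and ignoring the geometry of the merge itself. The function name `levels_after` is hypothetical and does not come from the talk.

```python
def levels_after(n_points, k):
    """Return the levels of the clusterings held after reading n_points items.

    Sketch of the bookkeeping only: a clustering is represented by its level,
    and "merging" two level-i clusterings just yields one at level i + 1.
    """
    held = []  # levels of the clusterings currently in the list
    for _ in range(n_points // k):
        level = 0  # the next k data points form a level-0 clustering
        while level in held:
            # Two clusterings at the same level: merge them one level up.
            held.remove(level)
            level += 1
        held.append(level)
    return sorted(held)


if __name__ == "__main__":
    print(levels_after(25, 5))  # levels held when k = 5 and n = 25
```

Under these assumptions the list never holds two clusterings at the same level, so its length grows only logarithmically in the number of level-0 clusterings read so far.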

Page 17: A Secure Clustering Algorithm for Distributed Data Streams

“Merging” clusterings

We have a set S of clusters, with |S| > k. We need a set S' of k clusters.

S' = S
Repeat
  Compute the merge error for every pair of clusters in S'
  Replace the pair with the lowest error by its union
Until |S'| = k
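The merge loop above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: a cluster is represented as a `(weight, center)` pair, the distance is Euclidean, the merged center is taken to be the weight-averaged center, and the pair error is the product of the two weights and the squared distance, as given on the following slide. The names `merge_error`, `union`, and `merge_to_k` are hypothetical.

```python
from itertools import combinations


def sq_dist(a, b):
    """Squared Euclidean distance between two centers (tuples of floats)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def merge_error(c1, c2):
    """Error of merging two clusters: weight1 * weight2 * dist^2."""
    (w1, p1), (w2, p2) = c1, c2
    return w1 * w2 * sq_dist(p1, p2)


def union(c1, c2):
    """Merge two clusters. Assumption: the new center is the weighted mean."""
    (w1, p1), (w2, p2) = c1, c2
    w = w1 + w2
    return (w, tuple((w1 * x + w2 * y) / w for x, y in zip(p1, p2)))


def merge_to_k(clusters, k):
    """Repeatedly union the lowest-error pair until only k clusters remain."""
    s = list(clusters)
    while len(s) > k:
        i, j = min(combinations(range(len(s)), 2),
                   key=lambda ij: merge_error(s[ij[0]], s[ij[1]]))
        merged = union(s[i], s[j])
        s = [c for idx, c in enumerate(s) if idx not in (i, j)]
        s.append(merged)
    return s


if __name__ == "__main__":
    points = [(1, (0.0,)), (1, (1.0,)), (1, (10.0,)), (1, (11.0,))]
    print(sorted(merge_to_k(points, 2)))
```

Each pass of this sketch recomputes all pair errors, which keeps the code short; caching the errors between passes would be a natural refinement.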

Page 18: A Secure Clustering Algorithm for Distributed Data Streams

$\mathrm{Error}(C_1 \cup C_2) = C_1.\mathrm{weight} \cdot C_2.\mathrm{weight} \cdot \mathrm{dist}(C_1, C_2)^2$
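As a worked instance of the merge error, take two clusters with weights 3 and 2 whose centers are a distance 2 apart (values chosen purely for illustration):

```latex
\mathrm{Error}(C_1 \cup C_2) = 3 \cdot 2 \cdot 2^2 = 24
```

With weights 4 and 1 at the same distance the error would be 16, so for a fixed total weight the formula penalizes merging two similarly heavy clusters more than absorbing a light cluster into a heavy one.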

Page 19: A Secure Clustering Algorithm for Distributed Data Streams

Sample results (offset grid)

Page 20: A Secure Clustering Algorithm for Distributed Data Streams

Sample results (vs k-means)

Page 21: A Secure Clustering Algorithm for Distributed Data Streams

Sample result (vs. BIRCH)

Page 22: A Secure Clustering Algorithm for Distributed Data Streams

Realistic Data (Network Intrusion)

Algorithm       Mem. allowed (× 24000 bytes)   ESS
StreamCluster   1                              4.1E14
BIRCH           1                              *
BIRCH           2                              *
BIRCH           4                              *
BIRCH           32                             *
BIRCH           64                             4.8E17
BIRCH           128                            4.8E17

Page 23: A Secure Clustering Algorithm for Distributed Data Streams

The Secure Protocol

Input: Alice owns data stream $D_1$; Bob owns data stream $D_2$

Output: k clusters on $D_1^m \cup D_2^n$

1. Alice computes $O(k \log(m/k))$ cluster centers and Bob computes $O(k \log(n/k))$ cluster centers

2. Alice and Bob securely share their cluster centers

3. They securely merge clusters
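One way to see the bound in step 1, assuming Alice runs the list-of-clusterings algorithm described earlier and holds at most one clustering per level: a level-i clustering summarizes $2^i k$ points, so after m points the highest level is at most $\lfloor \log_2(m/k) \rfloor$, and each held clustering contributes k centers.

```latex
\#\{\text{centers held by Alice}\}
  \;\le\; k \left( \lfloor \log_2 (m/k) \rfloor + 1 \right)
  \;=\; O\!\left( k \log (m/k) \right)
```

The same argument with n in place of m gives Bob's bound.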

Page 24: A Secure Clustering Algorithm for Distributed Data Streams

Sample run (distributed non-private protocol)

Page 25: A Secure Clustering Algorithm for Distributed Data Streams

Complexity

Communication complexity: $O\big((k \log(mn/k^2))^2\big)$

Non-private setting (one party sends the intermediate clusters to the other): communication complexity $O(k \log(m/k))$
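The non-private setting can be sketched end to end in Python. This is a rough illustration with no security whatsoever, and it rests on several assumptions not stated in the talk: data points are one-dimensional, a cluster is a `(weight, center)` pair, a merged center is the weighted mean, and leftover points (fewer than k) are passed along as singleton clusters. All function names are hypothetical.

```python
from itertools import combinations


def merge_to_k(clusters, k):
    """Greedily union the lowest-error pair until k clusters remain."""
    s = list(clusters)
    while len(s) > k:
        # Pair error: weight1 * weight2 * squared distance between centers.
        i, j = min(combinations(range(len(s)), 2),
                   key=lambda p: s[p[0]][0] * s[p[1]][0]
                   * (s[p[0]][1] - s[p[1]][1]) ** 2)
        (w1, c1), (w2, c2) = s[i], s[j]
        merged = (w1 + w2, (w1 * c1 + w2 * c2) / (w1 + w2))  # assumed: weighted mean
        s = [c for idx, c in enumerate(s) if idx not in (i, j)] + [merged]
    return s


def intermediate_clusters(stream, k):
    """One pass over a party's stream; returns all clusters it currently holds."""
    held = {}   # level -> k-clustering (at most one clustering per level)
    batch = []  # points waiting to form the next level-0 clustering
    for x in stream:
        batch.append((1, float(x)))
        if len(batch) == k:
            level, clustering, batch = 0, batch, []
            while level in held:
                # Two clusterings at the same level: merge into the next level.
                clustering = merge_to_k(held.pop(level) + clustering, k)
                level += 1
            held[level] = clustering
    # Assumption: leftover points are kept as singleton clusters.
    return [c for clustering in held.values() for c in clustering] + batch


def non_private_protocol(alice_stream, bob_stream, k):
    """Alice sends her intermediate clusters to Bob in the clear; Bob merges."""
    sent = intermediate_clusters(alice_stream, k)
    mine = intermediate_clusters(bob_stream, k)
    return merge_to_k(sent + mine, k), len(sent)


if __name__ == "__main__":
    result, n_sent = non_private_protocol([0, 2, 10, 12], [1, 11], k=2)
    print(sorted(result), n_sent)
```

The second return value counts the clusters Alice transmits, which is the quantity the non-private communication bound refers to.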
