Privacy Preserving Mining of Association Rules Alexandre Evfimievski, Ramakrishnan Srikant, Rakesh Agrawal, Johannes Gehrke IBM Almaden Research Center Cornell University


Privacy Preserving Mining of Association Rules

Alexandre Evfimievski,

Ramakrishnan Srikant,

Rakesh Agrawal,

Johannes Gehrke

IBM Almaden Research Center, Cornell University


Data Mining and Privacy

• The primary task in data mining: development of models about aggregated data.

• Can we develop accurate models without access to precise information in individual data records?

• Answer: yes, by randomization.
  – R. Agrawal, R. Srikant, “Privacy Preserving Data Mining,” SIGMOD 2000
  – for numerical attributes, classification

• How about association rules?


Randomization Overview

[Figure: clients Alice, Bob, and Chris each hold a private transaction, e.g. “J.S. Bach, painting, nasa.gov, …”, “B. Spears, baseball, cnn.com, …”, “B. Marley, camping, linux.org, …”. The transactions reach the Recommendation Service in randomized form, e.g. “Metallica, painting, nasa.gov, …”, “B. Spears, soccer, bbc.co.uk, …”, “B. Marley, camping, microsoft.com, …”. The service applies support recovery, mines associations, and returns recommendations.]


Associations Recap

• A transaction t is a set of items (e.g. books).
• All transactions form a set T of transactions.
• Any itemset A has support s in T if
  $s = \mathrm{supp}(A) = \dfrac{\#\{t \in T \mid A \subseteq t\}}{|T|}$
• Itemset A is frequent if s ≥ smin.
• If A ⊆ B, then supp(A) ≥ supp(B).
• Example:
  – 20% of transactions contain beer,
  – 5% of transactions contain beer and diapers;
  – then the confidence of “beer ⇒ diapers” is 5/20 = 0.25 = 25%.
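The definitions above can be sketched in a few lines of Python. This is only an illustration: the function names `support` and `confidence` and the toy transactions are assumptions, chosen so that the numbers match the beer/diapers example.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """supp(antecedent plus consequent) divided by supp(antecedent)."""
    both = frozenset(antecedent) | frozenset(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Toy data (hypothetical): 20 transactions, 4 with beer, 1 of those with diapers too.
transactions = (
    [frozenset({"beer", "diapers"})]
    + [frozenset({"beer"})] * 3
    + [frozenset({"milk"})] * 16
)

beer_support = support({"beer"}, transactions)                # 20%
rule_confidence = confidence({"beer"}, {"diapers"}, transactions)  # 25%
```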


The Problem

• How to randomize transactions so that
  – we can find frequent itemsets
  – while preserving privacy at transaction level?


Talk Outline

• Introduction
• Privacy Breaches
• Our Solution
• Experiments
• Conclusion


Uniform Randomization

• Given a transaction,
  – keep each item with 20% probability,
  – replace it with a new random item with 80% probability.
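A minimal sketch of this operator, under stated assumptions: the name `uniform_randomize` is hypothetical, and the replacement item is drawn uniformly from the whole item universe (so it may occasionally coincide with an item already present, in which case the set simply collapses the duplicate).

```python
import random


def uniform_randomize(t, universe, p_keep=0.2, rng=random):
    """Keep each item with probability p_keep, otherwise swap in a random item."""
    universe = list(universe)
    randomized = set()
    for item in t:
        if rng.random() < p_keep:
            randomized.add(item)                   # item survives
        else:
            randomized.add(rng.choice(universe))   # assumed: uniform replacement
    return randomized
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible.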


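Example: {x, y, z}

10 M transactions of size 10 with 10 K items. Uniform randomization: how many have {x, y, z} afterwards?

• 1% have {x, y, z}: all three items survive with probability $0.2^3$, giving 0.008% of all transactions (800 transactions), or 97.8% of the randomized transactions that contain {x, y, z}.
• 5% have {x, y}, {x, z}, or {y, z} only: probability $0.2^2 \cdot 8/10{,}000$, giving 0.00016% (16 transactions), or 1.9%.
• 94% have one or zero items of {x, y, z}: probability at most $0.2 \cdot (9/10{,}000)^2$, giving less than 0.00002% (2 transactions), or 0.3%.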
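The arithmetic on this slide can be replayed directly. The sketch below only re-evaluates the slide's own expressions; the variable names are assumptions.

```python
N = 10_000_000      # transactions
n_items = 10_000    # distinct items
p_keep = 0.2        # probability that an item survives uniform randomization

# Expected number of randomized transactions containing {x, y, z}, split by
# what the original transaction contained (expressions taken from the slide).
from_full = N * 0.01 * p_keep**3                        # originally had all three
from_two = N * 0.05 * p_keep**2 * (8 / n_items)         # originally had exactly two
from_rest_max = N * 0.94 * p_keep * (9 / n_items) ** 2  # upper bound for the rest

# Share of those randomized transactions whose original really had {x, y, z}.
posterior_full = from_full / (from_full + from_two + from_rest_max)
```

With these inputs `posterior_full` comes out just under 98%, against a prior of 1%.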


Example: {x, y, z}

• Given nothing, we have only 1% probability that {x, y, z} occurs in the original transaction.

• Given {x, y, z} in the randomized transaction, we have about 98% certainty of {x, y, z} in the original one.

• This is what we call a privacy breach.

• Uniform randomization preserves privacy “on average,” but not “in the worst case.”


Privacy Breaches

• Suppose:
  – t is an original transaction;
  – t′ is the corresponding randomized transaction;
  – A is a (frequent) itemset.
• Definition: Itemset A causes a privacy breach of level ρ (e.g. 50%) if, for some item z ∈ A,
  $\Pr[z \in t \mid A \subseteq t'] \ge \rho$
  – Assumption: no external information besides t′.


Our Solution

• Insert many false items into each transaction
• Hide true itemsets among false ones

Can we still find frequent itemsets while having sufficient privacy?

“Where does a wise man hide a leaf? In the forest. But what does he do if there is no forest?”

“He grows a forest to hide it in.”

G.K. Chesterton


Definition of cut-and-paste

• Given transaction t of size m, construct t′:
  – Choose a number j between 0 and Km (cutoff);
  – Include j items of t into t′;
  – Each other item is included into t′ with probability pm.
• The choice of Km and pm is based on the desired level of privacy.
• Example with j = 4:
  t = a, b, c, u, v, w, x, y, z
  t′ = b, v, x, z, œ, å, ß, ξ, ψ, €, א, ъ, ђ, …
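The three steps can be sketched as follows. Several details are assumptions not stated on the slide: j is drawn uniformly and capped at the transaction size, the j kept items are sampled without replacement, and "each other item" ranges over the whole item universe (including the items of t that were not picked). The name `cut_and_paste` and its parameters are hypothetical.

```python
import random


def cut_and_paste(t, universe, K_m, p_m, rng=random):
    """Randomize transaction t with cutoff K_m and insertion probability p_m."""
    t = list(t)
    # Step 1: choose j between 0 and K_m (assumed uniform, capped at |t|).
    j = min(rng.randint(0, K_m), len(t))
    # Step 2: copy j randomly chosen items of t into t'.
    kept = set(rng.sample(t, j))
    randomized = set(kept)
    # Step 3: every other item of the universe enters t' with probability p_m.
    for item in universe:
        if item not in kept and rng.random() < p_m:
            randomized.add(item)
    return randomized
```

With `p_m = 0` the output is just a random subset of t of size at most K_m; raising `p_m` grows the "forest" of false items around it.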


Partial Supports

To recover original support of an itemset, we need randomized supports of its subsets.

• Given an itemset A of size k and transaction size m,
• A vector of partial supports of A is
  $\vec{s} = (s_0, s_1, \ldots, s_k)^T$, where $s_l = \dfrac{1}{|T|} \cdot \#\{t \in T \mid \#(t \cap A) = l\}$
  – Here $s_k$ is the same as the support of A.
  – Randomized partial supports are denoted by $\vec{s}\,'$.
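As an illustration of the definition, the sketch below counts, for each l, the fraction of transactions that share exactly l items with A. The function name `partial_supports` is an assumption.

```python
from collections import Counter


def partial_supports(A, transactions):
    """Return [s_0, ..., s_k]: s_l is the fraction of transactions
    whose intersection with A has exactly l items."""
    A = frozenset(A)
    overlap_counts = Counter(len(A & frozenset(t)) for t in transactions)
    n = len(transactions)
    return [overlap_counts[l] / n for l in range(len(A) + 1)]
```

The last entry of the returned list equals the ordinary support of A, and the entries sum to 1.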


Transition Matrix

• Let k = |A|, m = |t|.
• Transition matrix P = P(k, m) connects randomized partial supports with original ones:
  $\mathbf{E}\,\vec{s}\,' = P \cdot \vec{s}$, where $P_{l',l} = \Pr[\#(t' \cap A) = l' \mid \#(t \cap A) = l]$
• Randomized supports are distributed as a sum of multinomial distributions.
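To make the matrix concrete, here is a sketch for a deliberately simplified randomization, not the cut-and-paste operator: every item of A present in t is kept independently with probability `p_keep`, and every item of A absent from t is inserted independently with probability `p_insert`. Under that assumption the matrix no longer depends on m. All names are hypothetical.

```python
from math import comb


def binom_pmf(n, i, p):
    """Probability of i successes in n independent trials with success rate p."""
    if i < 0 or i > n:
        return 0.0
    return comb(n, i) * p**i * (1 - p) ** (n - i)


def transition_matrix(k, p_keep, p_insert):
    """P[l2][l] = Pr[#(t' ∩ A) = l2 | #(t ∩ A) = l] for the simplified model."""
    P = [[0.0] * (k + 1) for _ in range(k + 1)]
    for l in range(k + 1):
        for l2 in range(k + 1):
            # a = number of the l original items that survive;
            # the remaining l2 - a must come from the k - l inserted candidates.
            P[l2][l] = sum(
                binom_pmf(l, a, p_keep) * binom_pmf(k - l, l2 - a, p_insert)
                for a in range(l + 1)
            )
    return P
```

Each column of the result is a probability distribution over l′, so it sums to 1.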


The Unbiased Estimators

• Given randomized partial supports, we can estimate original partial supports:
  $\vec{s}_{est} = Q \cdot \vec{s}\,'$, where $Q = P^{-1}$
• Covariance matrix for this estimator:
  $\mathrm{Cov}\,\vec{s}_{est} = \dfrac{1}{|T|} \sum_{l=0}^{k} s_l \, Q \, D[l] \, Q^T$, where $D[l]_{i,j} = P_{i,l}\,\delta_{i=j} - P_{i,l}\,P_{j,l}$
• To estimate it, substitute $s_l$ with $(\vec{s}_{est})_l$.
  – Special case: estimators for support and its variance
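A small sketch of the point estimate: rather than forming Q explicitly, it solves P · s = s′ by Gauss-Jordan elimination, which yields the same vector as Q · s′. The 2×2 matrix and the randomized supports are made-up illustrative numbers, and `solve` is a hypothetical helper, not a library call.

```python
def solve(P, b):
    """Solve P x = b for a small square matrix P (Gauss-Jordan, partial pivoting)."""
    n = len(b)
    M = [list(map(float, P[i])) + [float(b[i])] for i in range(n)]
    for c in range(n):
        # Move the row with the largest entry in column c into pivot position.
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        # Eliminate column c from every other row.
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]


# Illustrative transition matrix for k = 1 (columns sum to 1).
P = [[0.9, 0.2],
     [0.1, 0.8]]
s_randomized = [0.83, 0.17]      # observed partial supports after randomization
s_est = solve(P, s_randomized)   # estimate of the original partial supports
```

Here the estimate recovers original partial supports of roughly 0.9 and 0.1, up to floating-point error.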


Class of Randomizations

• Our analysis works for any randomization that satisfies two properties:
  – A per-transaction randomization applies the same procedure to each transaction, using no information about other transactions;
  – An item-invariant randomization does not depend on any ordering or naming of items.

• Both uniform and cut-and-paste randomizations satisfy these two properties.


Apriori

Let k = 1, candidate sets = all 1-itemsets.

Repeat:
1. Count support for all candidate sets
2. Output the candidate sets with support ≥ smin
3. New candidate sets = all (k + 1)-itemsets s.t. all their k-subsets are candidate sets with support ≥ smin
4. Let k = k + 1

Stop when there are no more candidate sets.
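The loop above can be sketched in Python as follows. This is a plain, unoptimized rendering for illustration; the function name `apriori` and the toy transactions in the usage note are assumptions.

```python
from itertools import combinations


def apriori(transactions, s_min):
    """Return {itemset: support} for all itemsets with support >= s_min."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]   # all 1-itemsets
    frequent = {}
    k = 1
    while candidates:
        # 1. Count support for all candidate sets.
        supp = {c: sum(c <= t for t in transactions) / n for c in candidates}
        # 2. Output the candidate sets with support >= s_min.
        level = {c for c, s in supp.items() if s >= s_min}
        frequent.update({c: supp[c] for c in level})
        # 3. New candidates: (k+1)-itemsets whose k-subsets are all frequent.
        pool = sorted({i for c in level for i in c})
        candidates = [
            frozenset(c)
            for c in combinations(pool, k + 1)
            if all(frozenset(sub) in level for sub in combinations(c, k))
        ]
        # 4. Move to the next itemset size.
        k += 1
    return frequent
```

For example, `apriori([{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}], 0.6)` returns the three single items and the three pairs, but not the triple, whose support is only 0.4.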


The Modified Apriori

Let k = 1, candidate sets = all 1-itemsets.

Repeat:
1. Estimate support and variance (σ²) for all candidate sets
2. Output the candidate sets with support ≥ smin
3. New candidate sets = all (k + 1)-itemsets s.t. all their k-subsets are candidate sets with support ≥ smin − σ
4. Let k = k + 1

Stop when there are no more candidate sets, or the estimator’s precision becomes unsatisfactory.
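A sketch of the modified loop, under stated assumptions: `estimate` is a hypothetical callback that returns an estimated support and its standard deviation σ for an itemset (for instance from the unbiased estimators), and "precision becomes unsatisfactory" is modelled by a hypothetical `max_sigma` threshold. Neither detail comes from the slides.

```python
from itertools import combinations


def modified_apriori(items, estimate, s_min, max_sigma=float("inf")):
    """Apriori over estimated supports; estimate(itemset) -> (support, sigma)."""
    candidates = [frozenset([i]) for i in sorted(items)]
    frequent = {}
    k = 1
    while candidates:
        # 1. Estimate support and sigma for all candidate sets.
        est = {c: estimate(c) for c in candidates}
        # Assumed stopping rule: give up once any estimate is too noisy.
        if any(sigma > max_sigma for _, sigma in est.values()):
            break
        # 2. Output the candidate sets with estimated support >= s_min.
        frequent.update({c: s for c, (s, _) in est.items() if s >= s_min})
        # 3. Extend only sets with estimated support >= s_min - sigma.
        keep = {c for c, (s, sigma) in est.items() if s >= s_min - sigma}
        pool = sorted({i for c in keep for i in c})
        candidates = [
            frozenset(c)
            for c in combinations(pool, k + 1)
            if all(frozenset(sub) in keep for sub in combinations(c, k))
        ]
        # 4. Move to the next itemset size.
        k += 1
    return frequent
```

The relaxed threshold in step 3 keeps an itemset alive as a building block when its estimate falls slightly below smin, since the shortfall may be estimation noise.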


Privacy Breach Analysis

• How many added items are enough to protect privacy?
  – Have to satisfy Pr[z ∈ t | A ⊆ t′] < ρ (i.e. no privacy breaches)
  – Select parameters so that it holds for all itemsets.
  – Use formula (with $s_l^{+} = \Pr[\#(t \cap A) = l,\ z \in t]$, $s_0^{+} = 0$):
    $\Pr[z \in t \mid A \subseteq t'] = \sum_{l=0}^{k} s_l^{+} P_{k,l} \Big/ \sum_{l=0}^{k} s_l P_{k,l}$
• Parameters are to be selected in advance!
  – Construct a privacy-challenging test: an itemset all of whose subsets have maximum possible support.
  – Enough to know maximal support of an itemset for each size.
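The breach formula is a ratio of two weighted sums, which the following sketch evaluates. The function name `breach_probability` and the example numbers (k = 1) are assumptions for illustration only.

```python
def breach_probability(s, s_plus, P_k):
    """Pr[z in t | A subset of t'] from partial supports.

    s[l]      = Pr[#(t ∩ A) = l]
    s_plus[l] = Pr[#(t ∩ A) = l and z in t], with s_plus[0] = 0
    P_k[l]    = P_{k,l}, the last row of the transition matrix
    """
    numerator = sum(sp * p for sp, p in zip(s_plus, P_k))
    denominator = sum(sl * p for sl, p in zip(s, P_k))
    return numerator / denominator


# Illustrative values: A = {z}, present in 10% of original transactions,
# kept with probability 0.8 and falsely inserted with probability 0.1.
level = breach_probability(s=[0.9, 0.1], s_plus=[0.0, 0.1], P_k=[0.1, 0.8])
```

In this made-up case the result is 0.08 / 0.17, about 47%, which would stay below a breach level of ρ = 50%.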


Graceful Tradeoff

• Want more precision or more privacy?
  – Adjust the privacy breach level.
  – A small relaxation of privacy restrictions results in a small increase in precision of estimators.


Talk Outline

• Introduction
• Privacy Breaches
• Our Solution
• Experiments
  – Support recovery vs. parameters
  – Real-life data
• Conclusion


Lowest Discoverable Support

• LDS is the support s.t., when predicted, it is 4σ away from zero.
• Roughly, LDS is proportional to $1/\sqrt{|T|}$.

[Figure: LDS vs. number of transactions. x-axis: number of transactions, millions (ticks at 1, 10, 100); y-axis: LDS, % (0 to 1.2); series: 1-itemsets, 2-itemsets, 3-itemsets; |t| = 5, ρ = 50%.]


LDS vs. Breach Level

[Figure: LDS vs. breach level. x-axis: privacy breach level, % (30 to 90); y-axis: LDS, % (0 to 2.5); series: 1-itemsets, 2-itemsets, 3-itemsets; |t| = 5, |T| = 5 M.]

• Reminder: breach level is the limit on Pr[z ∈ t | A ⊆ t′]


Real datasets: soccer, mailorder

• Soccer is the clickstream log of the WorldCup’98 web site, split into sessions of HTML requests.
  – 11 K items (HTMLs), 6.5 M transactions
  – Available at http://www.acm.org/sigcomm/ITA/
• Mailorder is a purchase dataset from a certain on-line store
  – Products are replaced with their categories
  – 96 items (categories), 2.9 M transactions
• A small fraction of transactions are discarded as too long.
  – longer than 10 (for soccer) or 7 (for mailorder)


Modified Apriori on Real Data

Soccer: smin = 0.2%; 0.07% for 3-itemsets

| Itemset Size | True Itemsets | True Positives | False Drops | False Positives |
| --- | --- | --- | --- | --- |
| 1 | 266 | 254 | 12 | 31 |
| 2 | 217 | 195 | 22 | 45 |
| 3 | 48 | 43 | 5 | 26 |

Mailorder: smin = 0.2%; 0.05% for 3-itemsets

| Itemset Size | True Itemsets | True Positives | False Drops | False Positives |
| --- | --- | --- | --- | --- |
| 1 | 65 | 65 | 0 | 0 |
| 2 | 228 | 212 | 16 | 28 |
| 3 | 22 | 18 | 4 | 5 |

Breach level = 50%. Inserted 20-50% items to each transaction.


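Soccer, false drops: pred. supp %, when true supp ≥ 0.2%

| Size | < 0.1 | 0.1-0.15 | 0.15-0.2 | ≥ 0.2 |
| --- | --- | --- | --- | --- |
| 1 | 0 | 2 | 10 | 254 |
| 2 | 0 | 5 | 17 | 195 |
| 3 | 0 | 1 | 4 | 43 |

Soccer, false positives: true supp %, when pred. supp ≥ 0.2%

| Size | < 0.1 | 0.1-0.15 | 0.15-0.2 | ≥ 0.2 |
| --- | --- | --- | --- | --- |
| 1 | 0 | 7 | 24 | 254 |
| 2 | 7 | 10 | 28 | 195 |
| 3 | 5 | 13 | 8 | 43 |

Mailorder, false drops: pred. supp %, when true supp ≥ 0.2%

| Size | < 0.1 | 0.1-0.15 | 0.15-0.2 | ≥ 0.2 |
| --- | --- | --- | --- | --- |
| 1 | 0 | 0 | 0 | 65 |
| 2 | 0 | 1 | 15 | 212 |
| 3 | 0 | 1 | 3 | 18 |

Mailorder, false positives: true supp %, when pred. supp ≥ 0.2%

| Size | < 0.1 | 0.1-0.15 | 0.15-0.2 | ≥ 0.2 |
| --- | --- | --- | --- | --- |
| 1 | 0 | 0 | 0 | 65 |
| 2 | 0 | 0 | 28 | 212 |
| 3 | 1 | 2 | 2 | 18 |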


Actual Privacy Breaches

• Verified actual privacy breach levels
• The breach probabilities are counted in the datasets for frequent and near-frequent itemsets.
• If maximum supports were estimated correctly, even worst-case breach levels fluctuated around 50%
  – At most 53.2% for soccer,
  – At most 55.4% for mailorder.


Summary

• Privacy breaches: identified problem and provided a solution for controlling breaches

• Derived estimators of support and variance for a class of randomization operators

• Algorithm for discovering associations in randomized data

• Validated on real-life datasets
• Can find associations while preserving privacy at the level of individual transactions


Future Work

• Control of more general privacy breaches
  – What about other properties of transactions, for example an item z ∈ t breach caused by A ∩ t′ = ∅?
  – What about external information?
• Theoretical limits of discoverability for a given privacy breach level
  – How to compute theoretical limits?
  – How to attain them by an algorithm?


Thank You!


BACK-UPS


Our Solution: Example

• Old set-up:
  – Given 10,000 items, 10 M transactions of size 10
  – 100,000 transactions (1%) contain A = {x, y, z}
• In addition to uniform randomization with p = 80%, insert 500 new random items to each transaction.
  – ~800 transactions contain {x, y, z} before and after;
  – Roughly (10 M) · (500 / 10,000)³ = 1250 transactions contain none before and full {x, y, z} after.

• Presence of {x, y, z} in a randomized transaction now says little about the original transaction.
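A quick check of these figures, as a sketch: it uses only the two counts given on the slide and ignores transactions that originally held one or two of the items, which would lower the ratio further. The variable names are assumptions.

```python
N = 10_000_000    # transactions
n_items = 10_000  # distinct items
inserted = 500    # random items added to each transaction

true_full = 800                               # {x, y, z} before and after (from the slide)
false_full = N * (inserted / n_items) ** 3    # none before, full {x, y, z} after

# Share of these randomized transactions that really contained {x, y, z}.
share_true = true_full / (true_full + false_full)
```

The share comes out near 39%, compared with about 98% under uniform randomization alone.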


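Privacy Breach Analysis

• GIVEN: itemset A, and item z ∈ A
• WANTED: $\Pr[z \in t \mid A \subseteq t']$
• Assume that partial supports are probabilities: $s_l = \mathrm{supp}_l(A) = \Pr[\#(t \cap A) = l]$
• Define: $s_l^{+} = \Pr[\#(t \cap A) = l,\ z \in t]$, $s_0^{+} = 0$
• Then we have:
  $\Pr[z \in t \mid A \subseteq t'] = \sum_{l=0}^{k} s_l^{+} P_{k,l} \Big/ \sum_{l=0}^{k} s_l P_{k,l}$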


Limiting Privacy Breaches

• We want to make sure that always
  $\sum_{l=0}^{k} s_l^{+} P_{k,l} \Big/ \sum_{l=0}^{k} s_l P_{k,l} < \rho$
• But we do not know supports in advance.
• Solution: For each itemset size k, give “privacy-challenging” test values to $s_l$, $s_l^{+}$.
  – It is an itemset whose subsets have maximum supports
  – We need to estimate maximum support values prior to randomization


LDS vs. Transaction Size

ρ = 50%, |T| = 5 M

• Too long transactions cannot be used for prediction

[Figure: LDS vs. transaction size. x-axis: transaction size (1 to 10); y-axis: LDS, % (0 to 2); series: 1-itemsets, 2-itemsets, 3-itemsets.]


Related Work

R. Agrawal, R. Srikant “Privacy Preserving Data Mining,” SIGMOD 2000:

• Each client has a numerical attribute $x_i$
• Client i sends $x_i + y_i$, where $y_i$ = random offset, with known distribution
• Server reconstructs the distribution of original attributes (~ EM algorithm)
• The distribution is then used for classification
  – Numerical attributes only


Related Work

• Y. Lindell and B. Pinkas “Privacy Preserving Data Mining,” Crypto 2000

• J. Vaidya and C. Clifton “Privacy Preserving Association Rule Mining in Vertically Partitioned Data”

• …


Privacy Concern

• Popular press:
  – “The End of Privacy”, “The Death of Privacy”
• Government directives:
  – European directive on privacy protection (Oct 98)
  – Canadian Personal Information Protection Act (Jan 2001)
• Surveys of Web users:
  – 17% fundamentalists, 56% pragmatic majority, 27% marginally concerned (April 99)
  – 82% said having privacy would matter (July 99)