Privacy Preservation Issues in Association Rule Mining in Horizontally Partitioned Databases

Association Rule Mining with Privacy Preservation

In Horizontally Distributed Databases

Group 1 – Abhra Basak, Apoorva Kumar, Sachin K. Saini, Shiv Sankar, Suraj B. Malode

Introduction

Look before you leap

The Flow

Association Rule Mining

Privacy Preservation

Horizontally Distributed Datasets

Before we start mining!

trends or patterns in large datasets

extracting useful information

useful and unexpected

insights

analyzing and predicting system behavior

Data Mining

Scalability ?

Artificial Engineering

Machine Learning

Statistics

Database Systems

Association Rule Learning

By Rakesh Agrawal, IBM Almaden Research Center

• 80% of people who buy bread + butter, buy milk

• {Bread, Butter} → {Milk}

What is an Association Rule?

Antecedent

Consequent

Definitions

• 80% of people who buy bread + butter, buy milk

• {Bread, Butter} → {Milk}

Antecedent

• Prerequisites for the rule to be applied

Consequent

• The outcome

Support

• Percentage of transactions containing the itemset

Confidence

• Fraction of the transactions containing the antecedent that also contain the consequent
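The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the deck's own code: the function names `support` and `confidence` and the ten transactions are hypothetical, chosen so that the rule {Bread, Butter} → {Milk} comes out at the 80% quoted in the example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    return both / support(transactions, antecedent)


# Hypothetical market-basket data (not from the slides).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"milk"},
    {"bread"},
    {"butter", "milk"},
    {"bread", "milk"},
    {"eggs"},
]

rule_support = support(transactions, {"bread", "butter", "milk"})
rule_confidence = confidence(transactions, {"bread", "butter"}, {"milk"})
print(rule_support, rule_confidence)
```

Here five baskets contain bread and butter, and four of those also contain milk, which is where the 80% confidence comes from.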

• Two different forms of constraints are used to generate the required association rules

• Syntactic Constraints: Restrict the attributes that may be present in a rule.

• Support Constraints: The number of transactions, out of the set of transactions, that support a rule.

Constraints

Association Rule Learning in Large Datasets

large datasets

• To find association rules

Generating Large Itemsets

• Finding the combinations of items whose support is above a minimum support threshold

Generating Association Rules

• Mining all rules that are satisfied within each large itemset
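The two steps above can be sketched as follows. This is a small level-wise search in the spirit of Apriori, written for illustration only: the function names, thresholds, and sample transactions are assumptions, and a production miner would prune candidates far more carefully.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Step 1: level-wise search for itemsets whose support meets min_support."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([i]) for i in items]
    frequent = {}
    while candidates:
        level = {}
        for c in candidates:
            sup = sum(1 for t in transactions if c <= t) / n
            if sup >= min_support:
                level[c] = sup
        frequent.update(level)
        # Join frequent k-itemsets into candidate (k+1)-itemsets.
        candidates = list({a | b for a, b in combinations(list(level), 2)
                           if len(a | b) == len(a) + 1})
    return frequent


def association_rules(frequent, min_confidence):
    """Step 2: split each large itemset into antecedent -> consequent."""
    rules = []
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                # Every subset of a frequent itemset is itself frequent,
                # so its support is already in the dictionary.
                conf = sup / frequent[antecedent]
                if conf >= min_confidence:
                    rules.append((antecedent, itemset - antecedent, sup, conf))
    return rules


# Hypothetical transactions (not from the slides).
transactions = [{"bread", "butter", "milk"}] * 4 + [{"bread", "butter"}, {"milk"}]
large = frequent_itemsets(transactions, min_support=0.5)
rules = association_rules(large, min_confidence=0.75)
for antecedent, consequent, sup, conf in rules:
    print(sorted(antecedent), "->", sorted(consequent), round(sup, 2), round(conf, 2))
```

Splitting the work this way means the expensive pass over the data happens only in step 1; step 2 works entirely from the supports already collected.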

Association Rule Learning in Distributed Datasets

And Privacy Preservation

• Most tools used for mining association rules assume that data to be analyzed can be collected at one central site.

• But issues like Privacy Preservation restrict the collection of data.

• Alternative methods have to be devised for distributed datasets to make the mining process feasible while ensuring privacy.

Preview

• Dataset: Combined data of Twitter and Facebook

• Rule: What percentage of people log in to a social networking site and post within the next 2 minutes?


• Horizontally Partitioned (Example: Insurance Companies)

• Rule Being Mined: Does a procedure have an unusual rate of complication?

• Implications:

• A company may see a high number of cases of the procedure failing, and it may change its policies to help.

• At the same time, if this rule is exposed it may be a huge problem for the company.

• The risks outweigh the gains.


[Figure: Company A, Company B, and Company C each hold their own patient table. Columns shown: Patient ID, Disease, Prescription, Effect.]

• Vertically Partitioned


Credit Card No.     Bought tablet
2365987545623526    1
3639871526589414    1
4365845698742563    1
5962845632561200    1
6621563289657412    1

Credit Card No.     Bought TCover
2365987545623526    0
7639871526589414    1
4365845698742563    1
9962845632561200    0
6621563289657412    1

Common Property

Not One We can exploit.
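The difference between the two kinds of partitioning can be sketched in code. This is an illustration only: the horizontally partitioned rows are hypothetical, while the vertically partitioned rows are copied from the two credit-card tables above.

```python
# Horizontal partitioning: every site has the same columns, different rows.
# These rows are hypothetical, not taken from the slides.
company_a = [("p1", "procedure X", "complication"), ("p2", "procedure X", "ok")]
company_b = [("p3", "procedure X", "ok")]
company_c = [("p4", "procedure X", "complication"), ("p5", "procedure Y", "ok")]


def local_count(rows, procedure, effect):
    """Count the rows at one site that match a procedure and an effect."""
    return sum(1 for _, proc, eff in rows if proc == procedure and eff == effect)


# With horizontal partitioning, a global count is the sum of local counts.
global_complications = sum(
    local_count(rows, "procedure X", "complication")
    for rows in (company_a, company_b, company_c)
)

# Vertical partitioning: every site has the same key, different columns.
bought_tablet = {
    "2365987545623526": 1,
    "3639871526589414": 1,
    "4365845698742563": 1,
    "5962845632561200": 1,
    "6621563289657412": 1,
}
bought_tcover = {
    "2365987545623526": 0,
    "7639871526589414": 1,
    "4365845698742563": 1,
    "9962845632561200": 0,
    "6621563289657412": 1,
}

# With vertical partitioning, rows must first be matched on the shared key.
bought_both = sum(
    1 for card, flag in bought_tablet.items()
    if flag == 1 and bought_tcover.get(card) == 1
)
print(global_complications, bought_both)
```

The horizontal case needs only sums of per-site counts, which is the property the rest of the deck builds on; the vertical case needs a join on the card number before anything can be counted.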

Mining of Association Rules

In Horizontally Partitioned Databases

What we want

• Computing association rules without revealing private information, and getting:

• The global support

• The global confidence

What we have

• Only the following information is available:

• Local support

• Local confidence

• Size of the DB

Fundamental Steps

Even this information may not be shared freely between sites. But we’ll get to that.

Calculating Required Values

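$$\mathrm{support}_{AB \Rightarrow C} = \frac{\sum_{i=1}^{\mathrm{sites}} \mathrm{support\_count}_{ABC}(i)}{\sum_{i=1}^{\mathrm{sites}} \mathrm{database\_size}(i)}$$

$$\mathrm{support}_{AB} = \frac{\sum_{i=1}^{\mathrm{sites}} \mathrm{support\_count}_{AB}(i)}{\sum_{i=1}^{\mathrm{sites}} \mathrm{database\_size}(i)}$$

$$\mathrm{confidence}_{AB \Rightarrow C} = \frac{\mathrm{support}_{AB \Rightarrow C}}{\mathrm{support}_{AB}}$$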
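As a worked illustration of the three formulas, the sketch below combines per-site counts into a global support and confidence. The function names and all of the numbers are assumptions made for the example, and exact fractions are used to avoid rounding noise. Note that this is the naive computation: it requires every site to reveal its local counts.

```python
from fractions import Fraction


def global_support(local_counts, db_sizes):
    """Sum of local support counts divided by the sum of database sizes."""
    return Fraction(sum(local_counts), sum(db_sizes))


def global_confidence(counts_abc, counts_ab, db_sizes):
    """support(AB => C) divided by support(AB)."""
    return global_support(counts_abc, db_sizes) / global_support(counts_ab, db_sizes)


# Hypothetical values for three sites.
db_sizes = [100, 200, 300]
counts_abc = [5, 6, 20]    # transactions containing A, B and C at each site
counts_ab = [10, 15, 25]   # transactions containing A and B at each site

support_rule = global_support(counts_abc, db_sizes)
support_ab = global_support(counts_ab, db_sizes)
confidence_rule = global_confidence(counts_abc, counts_ab, db_sizes)
print(support_rule, support_ab, confidence_rule)
```

With these numbers the rule is supported by 31 of 600 transactions, and 31 of the 50 transactions containing A and B also contain C.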

• It protects individual privacy but each site has to disclose information.

• It reveals the local support and confidence in a rule at each site.

• This information, if revealed, can be harmful to an organization.

Problems with the approach

• We will explore two algorithms that have been used.

• The first algorithm combines encryption with data distortion when data is shared between sites.

• The second algorithm uses a particular Check Sum as the method of encryption.

Introducing the two Algorithms

Algorithm Uno

Some people are honest

• Phase 1: Uses encryption for mining of the large itemsets

• Phase 2: Uses a random number to preserve the privacy of each site (assuming a system of three or more parties)

Two-phase algorithm

Phase 1: Commutative Encryption
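The property Phase 1 relies on is that encrypting with several keys gives the same result in any order, so sites can compare and merge itemsets without seeing who contributed them. The sketch below demonstrates that property only. The slides do not name a cipher, so modular exponentiation (in the style of Pohlig-Hellman) is an assumption, as are the function names, the toy modulus, and the sample itemsets. It is not secure and is not the full protocol.

```python
import hashlib
import math
import random

P = 2**61 - 1  # a Mersenne prime; toy-sized modulus, NOT secure


def make_key(rng):
    """Pick an exponent that is invertible modulo P - 1."""
    while True:
        k = rng.randrange(3, P - 1)
        if math.gcd(k, P - 1) == 1:
            return k


def encode(itemset):
    """Map an itemset to a number in [1, P - 1] by hashing its sorted items."""
    digest = hashlib.sha256(",".join(sorted(itemset)).encode()).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def encrypt(value, key):
    """Commutative: encrypting with key a then b equals b then a."""
    return pow(value, key, P)


rng = random.Random(0)
keys = {site: make_key(rng) for site in "ABC"}
order = {"A": "ABC", "B": "BCA", "C": "CAB"}  # each site encrypts first, then passes on

# Hypothetical locally large itemsets at each site.
local_itemsets = {
    "A": [{"A", "B", "C"}, {"A", "B", "D"}],
    "B": [{"A", "B", "C"}],
    "C": [{"A", "B", "C"}, {"B", "C", "D"}],
}


def encrypt_by_all(itemset, start):
    """Encrypt with every site's key, starting at the owning site."""
    value = encode(itemset)
    for site in order[start]:
        value = encrypt(value, keys[site])
    return value


# Fully encrypted itemsets can be merged; duplicates collapse to one value.
union = {encrypt_by_all(s, site) for site, sets in local_itemsets.items() for s in sets}
print(len(union))
```

Because the same itemset encrypts to the same value whichever site starts, the union contains three entries even though five were submitted.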

Phase 2: Data Distortion

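Worked example (minimum support = 5%, random number R = 17)

Site A: ABC count = 5, Size = 100
Site B: ABC count = 6, Size = 200
Site C: ABC count = 20, Size = 300

• Site A sends R + count - 5% * Size = 17 + 5 - 5% * 100 = 17 to Site B

• Site B sends 17 + 6 - 5% * 200 = 13 to Site C

• Site C sends 13 + 20 - 5% * 300 = 18 to Site A

• Site A checks 18 >= R, with R = 17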
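The three-site example above can be replayed in code. This is a simplified sketch under stated assumptions: the function name is hypothetical, plain arithmetic is used, and the final comparison is done in the clear, whereas a real protocol would also have to protect that last step. In practice R would be chosen at random by the starting site; it is fixed at 17 here to match the figure.

```python
from fractions import Fraction


def masked_threshold_test(sites, min_support, r):
    """Pass a masked running total around the sites and test it against r.

    sites: list of (support_count, database_size); the first site starts
    and adds the random number r. Each site adds its excess support,
    count - min_support * size, and forwards the running total.
    Returns (messages, is_globally_supported).
    """
    running = r
    messages = []
    for count, size in sites:
        running = running + count - min_support * size
        messages.append(running)
    # The starting site compares the returned total with its own r.
    return messages, running >= r


sites = [(5, 100), (6, 200), (20, 300)]  # Site A, Site B, Site C from the figure
messages, supported = masked_threshold_test(sites, Fraction(5, 100), r=17)
print([int(m) for m in messages], supported)
```

No site sees another site's count directly: Site B receives 17 and Site C receives 13, neither of which reveals a local count without knowing R. The test passes because the total count of 31 is at least 5% of the 600 transactions.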

• Doesn’t work for a 2-party system

• Assumes honest parties

• Assumes Boolean responses for whether a rule is supported, rather than a subjective or weighted approach.

• As the number of candidate itemsets increases, the encryption overhead increases.

• The encryption overhead is also directly proportional to the number of sites or partitions.

Problems with the Algorithm

I got ……

Algorithm Dua

Don’t trust anyone

• Primarily used to tackle semi-honest sites.

• The data of each site is broken down into segments.

• The two nodes on either side of a node can collude to work out the data of the node between them.

• The neighbors are changed for each round. Hence, they can only obtain one such segment.

CK Secure Sum

[Figure: parties P1, P2, P3, P4]

Changing Neighbors

[Figure: the order of the parties is changed between rounds. Orders shown: P1, P2, P4, P3 and P1, P4, P2, P3. Labels: Round 1, Round 2, Round 3.]
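The idea of segmenting each site's data and changing neighbors between rounds can be sketched as follows. This is an illustration only, not the CK Secure Sum protocol itself: the function names and party values are hypothetical, the three party orders are one reading of the figure, and the random masking a real secure sum would add to the running total is left out.

```python
import random


def split_into_segments(value, k, rng):
    """Split an integer into k random integer segments that add up to value."""
    parts = [rng.randint(-1000, 1000) for _ in range(k - 1)]
    parts.append(value - sum(parts))
    return parts


def segmented_secure_sum(values, ring_orders, seed=None):
    """Sum one segment per party per round, using a different order each round."""
    rng = random.Random(seed)
    k = len(ring_orders)
    segments = {p: split_into_segments(v, k, rng) for p, v in values.items()}
    total = 0
    for round_no, order in enumerate(ring_orders):
        running = 0
        for party in order:
            # Each party adds only this round's segment and passes the total on.
            running += segments[party][round_no]
        total += running
    return total


# Hypothetical private values for four parties.
values = {"P1": 5, "P2": 6, "P3": 20, "P4": 9}

# One order per round, so each party has different neighbors every round.
ring_orders = [
    ["P1", "P2", "P3", "P4"],
    ["P1", "P2", "P4", "P3"],
    ["P1", "P4", "P2", "P3"],
]

total = segmented_secure_sum(values, ring_orders, seed=1)
print(total)
```

With these three orders, no party has the same pair of neighbors twice, so a colluding pair sitting on either side of a party can isolate at most one of its three segments.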

Conclusion

The moral of the story...

Before you leave

• Association rules play a vital role in data mining.

• Through careful analysis, items that appear to be unrelated can turn out to have a logical explanation.

• This aspect of data mining can be very useful in predicting patterns and foreseeing trends in consumer behavior, choices, and preferences.

• Association rules are one of the more effective ways for a business to enjoy the harvest from data mining.

There are no dumb questions

(No questions please shhhh…)
