Privacy Preservation Issues in Association Rule Mining in Horizontally Partitioned Databases
Association Rule Mining with Privacy Preservation
In Horizontally Distributed Databases
Group 1 – Abhra Basak, Apoorva Kumar, Sachin K. Saini, Shiv Sankar, Suraj B. Malode
Introduction
Look before you leap
The Flow
Association Rule Mining
Privacy Preservation
Horizontally Distributed Datasets
Before we start mining!
trends or patterns in large datasets
extracting useful information
useful and unexpected insights
analyzing and predicting system behavior
Data Mining
Scalability ?
Artificial Intelligence
Machine Learning
Statistics
Database Systems
Association Rule Learning
By Rakesh Agrawal, IBM Almaden Research Center
What is an Association Rule?
• 80% of people who buy bread and butter also buy milk
• {Bread, Butter} → {Milk}
• Antecedent: {Bread, Butter}
• Consequent: {Milk}
Definitions
Antecedent
• Prerequisites for the rule to be applied
Consequent
• The outcome
Support
• Percentage of transactions containing the itemset
Confidence
• Fraction of the transactions containing the antecedent that also contain the consequent
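The two measures can be checked on a small example. The sketch below uses ten made-up transactions (hypothetical data, chosen so that the bread and butter rule reaches the 80% mentioned in the slides); the helper names `support` and `confidence` are illustrative, not taken from any library.

```python
from fractions import Fraction

# Hypothetical market-basket data, invented for this illustration.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk"},
    {"butter", "milk"},
    {"jam"},
    {"bread", "milk"},
]

def support(itemset, transactions):
    """Fraction of all transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def confidence(antecedent, consequent, transactions):
    """Support of antecedent plus consequent, divided by support of the antecedent."""
    return support(antecedent | consequent, transactions) / support(antecedent, transactions)

rule_support = support({"bread", "butter", "milk"}, transactions)
rule_confidence = confidence({"bread", "butter"}, {"milk"}, transactions)
print(rule_support, rule_confidence)  # 2/5 4/5
```

Here 4 of the 10 transactions contain all three items (support 40%), and 4 of the 5 transactions with bread and butter also contain milk (confidence 80%).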
Constraints
• Two different forms of constraints are used to generate the required association rules
• Syntactic Constraints: Restrict the attributes that may be present in a rule.
• Support Constraints: Set the number of transactions, out of the full set of transactions, that must support a rule.
Association Rule Learning in Large Datasets
• Finding association rules in large datasets takes two steps
Generating Large Itemsets
• Find the combinations of items whose support is above a minimum support threshold
Generating Association Rules
• Mine all rules that hold within each large itemset
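The two steps can be sketched as a small level-wise search. This is a minimal illustration, not the presenters' implementation: the transactions are hypothetical, and the `min_confidence` cut-off used to filter rules is an assumption (the slides only name a minimum support threshold).

```python
from itertools import combinations

def large_itemsets(transactions, min_support):
    """Step 1: level-wise search for itemsets whose support meets min_support."""
    n = len(transactions)
    items = sorted(set().union(*transactions))
    candidates = [frozenset([item]) for item in items]
    large, size = {}, 1
    while candidates:
        level = {}
        for cand in candidates:
            count = sum(1 for t in transactions if cand <= t)
            if count >= min_support * n:
                level[cand] = count
        large.update(level)
        size += 1
        # Next candidates: unions of surviving itemsets that have exactly `size` items.
        candidates = list({a | b for a in level for b in level if len(a | b) == size})
    return large

def rules_from(large, min_confidence):
    """Step 2: split each large itemset into antecedent -> consequent rules."""
    rules = []
    for itemset, count in large.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), r)):
                conf = count / large[antecedent]  # every subset of a large itemset is large
                if conf >= min_confidence:        # min_confidence is an assumed cut-off
                    rules.append((antecedent, itemset - antecedent, conf))
    return rules

# Hypothetical transactions, invented for this illustration.
transactions = [
    {"bread", "butter", "milk"}, {"bread", "butter", "milk"},
    {"bread", "butter", "milk"}, {"bread", "butter", "milk"},
    {"bread", "butter"}, {"bread", "jam"}, {"milk"},
    {"butter", "milk"}, {"jam"}, {"bread", "milk"},
]
large = large_itemsets(transactions, min_support=0.4)
rules = rules_from(large, min_confidence=0.8)
for antecedent, consequent, conf in rules:
    print(sorted(antecedent), "->", sorted(consequent), round(conf, 2))
```

The search relies on the fact that every subset of a large itemset is itself large, so an itemset of size k only needs to be tested if it is a union of large itemsets of size k - 1.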
Association Rule Learning in Distributed Datasets
And Privacy Preservation
• Most tools used for mining association rules assume that data to be analyzed can be collected at one central site.
• But issues like Privacy Preservation restrict the collection of data.
• Alternative mining methods have to be devised for distributed datasets to make the mining process feasible while ensuring privacy.
Preview
• Dataset: Combined data of Twitter and Facebook
• Rule: What percentage of people log in to a social networking site and post within the next 2 minutes?
Privacy Preservation
• Horizontally Partitioned (Example: Insurance Companies)
• Rule Being Mined: Does a procedure have an unusual rate of complication?
• Implications:
• A company may see a high number of cases of the procedure failing, and it may change its policies to help.
• At the same time, if this rule is exposed, it may be a huge problem for the company.
• The risks outweigh the gains.
Privacy Preservation
Company A: Patient ID, Disease, Prescription, Effect
Company B: Patient ID, Disease, Prescription, Effect
Company C: Patient ID, Disease, Prescription, Effect
Privacy Preservation
• Vertically Partitioned
Credit Card No. Bought tablet
2365987545623526 1
3639871526589414 1
4365845698742563 1
5962845632561200 1
6621563289657412 1
Credit Card No. Bought TCover
2365987545623526 0
7639871526589414 1
4365845698742563 1
9962845632561200 0
6621563289657412 1
Common property (Credit Card No.): not one we can exploit.
Mining of Association Rules
In Horizontally Partitioned Databases
What we want
• Computing association rules without revealing private information, and getting
• The global support
• The global confidence
What we have
• Only the following information is available
• Local support
• Local confidence
• Size of the DB
Fundamental Steps
Even this information may not be shared freely between sites. But we’ll get to that.
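Calculating Required Values
\[
\mathrm{support}_{AB \Rightarrow C} = \frac{\sum_{i=1}^{\mathrm{sites}} \mathrm{support\_count}_{ABC}(i)}{\sum_{i=1}^{\mathrm{sites}} \mathrm{database\_size}(i)}
\]
\[
\mathrm{support}_{AB} = \frac{\sum_{i=1}^{\mathrm{sites}} \mathrm{support\_count}_{AB}(i)}{\sum_{i=1}^{\mathrm{sites}} \mathrm{database\_size}(i)}
\]
\[
\mathrm{confidence}_{AB \Rightarrow C} = \frac{\mathrm{support}_{AB \Rightarrow C}}{\mathrm{support}_{AB}}
\]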
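As a worked instance of these formulas, the sketch below aggregates per-site counts into global values. The ABC counts and database sizes are the three-site figures that appear later in the deck (5 of 100, 6 of 200, 20 of 300); the AB counts are hypothetical numbers added only so that a confidence can be computed.

```python
from fractions import Fraction

def global_support(local_counts, db_sizes):
    """Sum of local support counts divided by the sum of local database sizes."""
    return Fraction(sum(local_counts), sum(db_sizes))

db_sizes = [100, 200, 300]   # sites A, B, C
abc_counts = [5, 6, 20]      # local support counts for itemset ABC
ab_counts = [10, 12, 40]     # hypothetical local counts for itemset AB

support_rule = global_support(abc_counts, db_sizes)   # support of AB => C
support_ab = global_support(ab_counts, db_sizes)      # support of AB
confidence_rule = support_rule / support_ab

print(support_rule, support_ab, confidence_rule)  # 31/600 31/300 1/2
```

Computing these values this way needs every site's raw count and database size in one place, which is exactly the disclosure the rest of the deck tries to avoid.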
Problems with the approach
• It protects individual privacy, but each site has to disclose information.
• It reveals each site's local support and confidence for a rule.
• This information, if revealed, can be harmful to an organization.
Introducing the two Algorithms
• We will explore two algorithms that have been used.
• The first algorithm combines encryption with data distortion while sharing data between sites.
• The second algorithm uses a particular secure sum protocol, CK Secure Sum, to protect the values being shared.
Algorithm Uno
Some people are honest
Two-phase algorithm
• Phase 1: Uses encryption to mine the large itemsets
• Phase 2: Uses a random number to preserve the privacy of each site (assuming a system of 3 or more parties)
Phase 1: Commutative Encryption
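The slides do not spell out the encryption scheme, so the following is only a sketch of what "commutative" buys in Phase 1, assuming a toy scheme based on modular exponentiation (encrypting with key k means raising to the power k modulo a prime). The modulus, the `encode` hashing step, the site keys and the local itemsets are all assumptions made for illustration, and the parameters are not vetted for real security.

```python
import hashlib
import math
import random

P = 2**127 - 1  # a Mersenne prime used as a toy modulus (assumption, not a vetted parameter)

def make_key(rng):
    """Pick an exponent coprime to P - 1, so that the encryption is invertible."""
    while True:
        key = rng.randrange(3, P - 1)
        if math.gcd(key, P - 1) == 1:
            return key

def encode(itemset):
    """Map an itemset to a number in [1, P - 1] by hashing its sorted items."""
    digest = hashlib.sha256(",".join(sorted(itemset)).encode()).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1

def encrypt(value, key):
    """Toy commutative encryption: (x^a)^b == (x^b)^a modulo P."""
    return pow(value, key, P)

rng = random.Random(42)
keys = {site: make_key(rng) for site in ("A", "B", "C")}

# Hypothetical locally large itemsets at each site.
local_large = {
    "A": [{"bread", "butter"}, {"bread", "milk"}],
    "B": [{"bread", "butter"}],
    "C": [{"bread", "milk"}, {"butter", "milk"}],
}

def encrypt_by_all(value, first_site):
    """Encrypt with the owner's key first, then with every other site's key."""
    order = [first_site] + [s for s in keys if s != first_site]
    for site in order:
        value = encrypt(value, keys[site])
    return value

fully_encrypted = {
    encrypt_by_all(encode(itemset), site)
    for site, itemsets in local_large.items()
    for itemset in itemsets
}
print(len(fully_encrypted))  # size of the union of the sites' itemsets
```

Because the order of encryption does not matter, the same itemset reported by different sites ends up as the same ciphertext once every site has encrypted it, so duplicates can be merged without anyone seeing which site contributed which itemset.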
Phase 2: Data Distortion
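Site A: ABC count = 5, Size = 100
Site B: ABC count = 6, Size = 200
Site C: ABC count = 20, Size = 300
Minimum support = 5%, random number R = 17 (chosen by Site A)
Site A sends to Site B: R + count - 5% * Size = 17 + 5 - 5% * 100 = 17
Site B sends to Site C: 17 + 6 - 5% * 200 = 13
Site C sends to Site A: 13 + 20 - 5% * 300 = 18
Site A checks: 18 >= R, so ABC meets the global support threshold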
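The arithmetic on this slide can be replayed in a few lines. The sketch below follows the slide's plain arithmetic with the same figures (counts 5, 6, 20; sizes 100, 200, 300; 5% minimum support; R = 17); the function name is illustrative, and a real deployment would presumably also need to protect the final comparison, which is outside what the slide shows.

```python
from fractions import Fraction

def distorted_support_check(sites, min_support, r):
    """Pass a running value around the ring; each site adds its excess support.

    sites: list of (local_count, db_size) in ring order, initiator first.
    Returns the value each site forwards and whether the itemset is globally large.
    """
    running = r  # the initiator hides its own excess behind the random number R
    messages = []
    for count, size in sites:
        running += count - min_support * size
        messages.append(running)
    return messages, running >= r

sites = [(5, 100), (6, 200), (20, 300)]  # sites A, B, C from the slide
messages, supported = distorted_support_check(sites, Fraction(5, 100), 17)
print([int(m) for m in messages], supported)  # [17, 13, 18] True
```

Each site only sees a running value offset by R, so it cannot read off its predecessor's count, yet the final test `18 >= 17` is equivalent to checking that the total count (31) is at least 5% of the total size (600).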
Problems with the Algorithm
• Doesn’t work for a 2-party system
• Assumes honest parties
• Assumes a Boolean (yes or no) notion of whether a transaction supports a rule, rather than a subjective or weighted approach.
• As the number of candidate itemsets increases, the encryption overhead increases.
• The encryption overhead also grows in direct proportion to the number of sites or partitions.
I got ……
Algorithm Dua
Don’t trust anyone
CK Secure Sum
• Primarily used to tackle semi-honest sites.
• The data of each site is broken down into segments.
• The two neighbors on either side of a node may be able to work out the data of the node between them.
• The neighbors are changed for each round. Hence, any such pair can only obtain one segment.
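[Figure: four parties P1, P2, P3, P4 connected in a ring]
Changing Neighbors
[Figure: the ring of P1 to P4 redrawn for Round 1, Round 2 and Round 3; orderings as extracted: P1 P2 P3 P4; P1 P2 P4 P3; P1 P4 P2 P3]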
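The segment-and-rotate idea can be illustrated with a simplified sketch. This is not the exact CK Secure Sum message flow, which the slides do not give: the party values are hypothetical, the mapping of ring orderings to rounds is an assumption based on the order in which they appear in the figure, and the per-round random mask added by the first party is also an assumption.

```python
import random

def split_into_segments(value, k, rng):
    """Break a private value into k random-looking segments that sum to it."""
    parts = [rng.randint(-1000, 1000) for _ in range(k - 1)]
    parts.append(value - sum(parts))
    return parts

def segmented_secure_sum(values, ring_orders, seed=0):
    """One segment per party per round; the ring order changes every round."""
    rng = random.Random(seed)
    rounds = len(ring_orders)
    segments = {p: split_into_segments(v, rounds, rng) for p, v in values.items()}
    total = 0
    for rnd, order in enumerate(ring_orders):
        mask = rng.randint(0, 10**6)  # assumed: the round's first party masks the start
        running = mask
        for party in order:           # each party adds only this round's segment
            running += segments[party][rnd]
        total += running - mask       # the first party removes its mask again
    return total

values = {"P1": 12, "P2": 7, "P3": 30, "P4": 1}  # hypothetical private counts
ring_orders = [
    ["P1", "P2", "P3", "P4"],  # assumed to be Round 1
    ["P1", "P2", "P4", "P3"],  # assumed to be Round 2
    ["P1", "P4", "P2", "P3"],  # assumed to be Round 3
]
total = segmented_secure_sum(values, ring_orders)
print(total)  # 50
```

With four parties there are exactly three distinct ring arrangements, so across the three rounds every party sits between a different pair of neighbors, and a colluding pair can at most isolate the single segment contributed in the round where they surround it.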
Conclusion
The moral of the story...
Before you leave
• Association rules play a vital role in data mining.
• Through them, things that appear to be unrelated can turn out to have a logical explanation under careful analysis.
• This aspect of data mining can be very useful in predicting patterns and foreseeing trends in consumer behavior, choices and preferences.
• Association rules are among the most useful ways for a business to enjoy the harvest from data mining.
There are no dumb questions
(No questions please shhhh…)