1
Anonymization of Set-Valued Data via Top-Down, Local Generalization
Yeye He, Jeffrey F. Naughton
University of Wisconsin-Madison
2
Overview
• The problem:
  – Anonymizing set-valued data presents challenges not seen in relational data
  – Previous solutions explored parts, but not all, of the problem space
• Our goals:
  – Develop a scalable algorithm for the new variant of the problem
  – Perform experiments to explore strengths and weaknesses of the approach
3
What is set-valued data?
• “Relational data”
  – One sensitive attribute for each tuple
• “Set-valued data”
  – Logically: (person-id, {item1, item2, …, itemn})
  – Multiple sensitive values are possible in one record
Zipcode | Gender | Age | … | Medical diagnosis
53705   | male   | 30  | … | flu
98072   | female | 40  | … | diabetes

Person ID | Item set
001       | {milk, sunglasses, viagra}
002       | {beer, diapers, shampoo}
003       | {beer, milk, diapers, pregnancy test, diabetes medicine}
4
An attack scenario
• Retailer publishes market basket data
• The adversary knows Alice has bought milk, beer, and diapers
• The adversary infers Alice has also bought pregnancy test and diabetes medicine
PID | Item set
001 | {milk, sunglasses, viagra}
002 | {beer, diapers, shampoo}
003 | {beer, milk, diapers, pregnancy test, diabetes medicine}
Known: {beer, milk, diapers} → inferred: {beer, milk, diapers, pregnancy test, diabetes medicine}
5
Existing work: a priori QI/SI partition
• Scenarios where a priori partitioning of set elements into Quasi-Identifier (QI) items and Sensitive (SI) items is possible
  – {beer, milk, diapers, pregnancy test, diabetes medicine}
• Substantial existing work & good algorithms
  – [Ghinita+08] [Xu+08a] [Xu+08b] [Nergiz+07]
• But what if a priori partitioning is not possible?
  – Individuals may have different privacy requirements
  – The adversary may see sensitive items and use them as QI
[Diagram: problem space of set-valued data anonymization — region where an a priori QI/SI partition is possible; the rest marked “?”]
6
Existing work: no QI/SI partition
• Prior work [Terrovitis+08] proposed the k^m-anonymity model
• k^m-anonymity
  – For any transaction (data record) T and any subset of m items in T, there are at least k-1 other transactions containing the same m items
[Diagram: problem space of set-valued data anonymization — “a priori QI/SI partition possible” vs. “no a priori QI/SI partition”]
7
The m in k^m-anonymity [Terrovitis+08]
• Attack revisited
  – The data is 10^3-anonymized; the adversary sees {beer, milk, diapers}
  – Cannot tell Alice’s transaction from the other 9
  – Effective assuming the adversary never sees more than m=3 items
• The m in k^m-anonymity requires some identified m s.t. no adversary will ever see more than m items
• What about the case where there is no such m?
  – The case we consider
[Diagram: problem space — the “no a priori QI/SI partition” region split into “has identified m” vs. “no identified m”]
8
Our model: k-anonymity for set-valued data
• A transactional database D is k-anonymous if
  – Every transaction (data record) occurs at least k times
• Different from k^m-anonymity [Terrovitis+08]
  – No limit on m, i.e., valid for any m
  – Thus a stronger privacy model
9
k-anonymity subsumes k^m-anonymity [Terrovitis+08]
• Every database D that satisfies k-anonymity also satisfies k^m-anonymity
• There exists a database D that satisfies k^m-anonymity for all m but not k-anonymity
  – Example: 2^3-anonymous but not 2-anonymous
T1 = {A, B, C}
T2 = {A, B, C}
T3 = {A, B}
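The subsumption claim can be checked mechanically. The sketch below is not the paper’s code — just two brute-force checkers, written for this example, that test both models on the three-transaction database above (m=3 suffices here because no transaction has more than 3 items).

```python
from itertools import combinations

def is_k_anonymous(db, k):
    # every full transaction must occur at least k times
    return all(db.count(t) >= k for t in db)

def is_km_anonymous(db, k, m):
    # every subset of at most m items drawn from any transaction
    # must be contained in at least k transactions
    for t in db:
        for size in range(1, min(m, len(t)) + 1):
            for sub in combinations(sorted(t), size):
                if sum(1 for u in db if set(sub) <= u) < k:
                    return False
    return True

# The counter-example above: T1 = T2 = {A, B, C}, T3 = {A, B}
db = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B"}]
print(is_km_anonymous(db, 2, 3))  # True:  every item subset appears >= 2 times
print(is_k_anonymous(db, 2))      # False: T3 = {A, B} occurs only once
```

The gap is exactly T3: all of its subsets are covered by T1 and T2, yet the record itself is unique.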
[Diagram: problem space — within “no QI/SI partition”, the k-anonymity region contains the k^m-anonymity region]
10
Problem statement
• Given a transactional database D, find a transformation D’ of D s.t.:
  – D’ satisfies k-anonymity
  – The transformation minimizes information loss between (D, D’)
11
Hierarchical generalization
All
├─ Alcohol: Beer, Wine
└─ Health care: Diaper, Pregnancy test
• Transaction generalization
  – Ti: {“Beer”, “Wine”, “Diaper”} → {“Alcohol”, “Health care”}
  – Duplicates are removed
12
Information loss metric
• Normalized Certainty Penalty (NCP) [Xu+06]
  – Also used in previous work [Terrovitis+08]
All
├─ Alcohol: Beer, Wine
└─ Health care: Diaper, Pregnancy test
• Example:
  – Generalize “Beer” to “Alcohol”: 2/4 = 0.5 info loss
  – Generalize “Beer” to “All”: 4/4 = 1 info loss
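The per-item NCP on the toy hierarchy works out as the fraction of all leaf items the generalized value could stand for. A minimal sketch (the parent-to-children encoding of the hierarchy is our own):

```python
# Toy hierarchy from the slide, as a parent -> children map.
hierarchy = {
    "All": ["Alcohol", "Health care"],
    "Alcohol": ["Beer", "Wine"],
    "Health care": ["Diaper", "Pregnancy test"],
}

def leaf_count(node):
    children = hierarchy.get(node)
    if not children:
        return 1                      # a leaf item covers only itself
    return sum(leaf_count(c) for c in children)

TOTAL_LEAVES = leaf_count("All")      # 4 leaf items in the toy hierarchy

def ncp(node):
    # NCP of one item generalized to `node`: fraction of all leaf items
    # the generalized value could stand for; 0 if not generalized at all.
    n = leaf_count(node)
    return 0.0 if n == 1 else n / TOTAL_LEAVES

print(ncp("Beer"))     # 0.0 -- kept as-is, no loss
print(ncp("Alcohol"))  # 0.5 -- "Beer" -> "Alcohol": 2/4
print(ncp("All"))      # 1.0 -- "Beer" -> "All": 4/4
```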
13
Our algorithm: Partition-based anonymization
• Top-down
  – Generalize everything to the root representation
  – This yields one initial partition
• Divide and conquer
  – Choose a node to specialize for each partition
    • Based on an information-gain heuristic
  – Recursively process the resulting sub-partitions
14
Example: 2-anonymization
TID | Original Data     | 2-anonymization
T1  | {a1}              | {A}
T2  | {a1, a2}          | {A}
T3  | {b1, b2}          | {b1, b2}
T4  | {b1, b2}          | {b1, b2}
T5  | {a1, a2, b2}      | {a1, a2, B}
T6  | {a1, a2, b2}      | {a1, a2, B}
T7  | {a1, a2, b1, b2}  | {a1, a2, B}
15
Generalize all data to root
TID | Original Data     | Current Representation
T1  | {a1}              | {ALL}
T2  | {a1, a2}          | {ALL}
T3  | {b1, b2}          | {ALL}
T4  | {b1, b2}          | {ALL}
T5  | {a1, a2, b2}      | {ALL}
T6  | {a1, a2, b2}      | {ALL}
T7  | {a1, a2, b1, b2}  | {ALL}
• One initial partition
16
Initial partition: specialize using ALL → {A, B}
TID | Original Data     | Current Representation
T1  | {a1}              | {A}
T2  | {a1, a2}          | {A}
T3  | {b1, b2}          | {B}
T4  | {b1, b2}          | {B}
T5  | {a1, a2, b2}      | {A, B}
T6  | {a1, a2, b2}      | {A, B}
T7  | {a1, a2, b1, b2}  | {A, B}
• Produces three sub-partitions
17
Green partition: specialize using A → {a1, a2}
TID | Original Data     | Current Representation
T1  | {a1}              | {a1} → rolled back to {A}
T2  | {a1, a2}          | {a1, a2} → rolled back to {A}
T3  | {b1, b2}          | {B}
T4  | {b1, b2}          | {B}
T5  | {a1, a2, b2}      | {A, B}
T6  | {a1, a2, b2}      | {A, B}
T7  | {a1, a2, b1, b2}  | {A, B}
• Specialization violates 2-anonymity; rolled back
18
Blue partition: specialize using B → {b1, b2}
TID | Original Data     | Current Representation
T1  | {a1}              | {A}
T2  | {a1, a2}          | {A}
T3  | {b1, b2}          | {b1, b2}
T4  | {b1, b2}          | {b1, b2}
T5  | {a1, a2, b2}      | {A, B}
T6  | {a1, a2, b2}      | {A, B}
T7  | {a1, a2, b1, b2}  | {A, B}
• Specialization OK; reaches leaf level, stop
19
Red partition: specialize using A → {a1, a2}
TID | Original Data     | Current Representation
T1  | {a1}              | {A}
T2  | {a1, a2}          | {A}
T3  | {b1, b2}          | {b1, b2}
T4  | {b1, b2}          | {b1, b2}
T5  | {a1, a2, b2}      | {a1, a2, B}
T6  | {a1, a2, b2}      | {a1, a2, B}
T7  | {a1, a2, b1, b2}  | {a1, a2, B}
• A is chosen over B based on the max-information-gain heuristic
20
Red partition: specialize using B → {b1, b2}
TID | Original Data     | Current Representation
T1  | {a1}              | {A}
T2  | {a1, a2}          | {A}
T3  | {b1, b2}          | {b1, b2}
T4  | {b1, b2}          | {b1, b2}
T5  | {a1, a2, b2}      | {a1, a2, b2} → rolled back to {a1, a2, B}
T6  | {a1, a2, b2}      | {a1, a2, b2} → rolled back to {a1, a2, B}
T7  | {a1, a2, b1, b2}  | {a1, a2, b1, b2} → rolled back to {a1, a2, B}
• Specializing B violates 2-anonymity; rolled back
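The walkthrough above can be reproduced end to end with a compact sketch of the top-down scheme. This is our own simplification, not the paper’s implementation: it tries candidate nodes in a fixed sorted order instead of the information-gain heuristic, which happens to make the same choices on this small example.

```python
from collections import Counter

# Item hierarchy as a parent -> children map; a1, a2, b1, b2 are leaves.
hierarchy = {"ALL": ["A", "B"], "A": ["a1", "a2"], "B": ["b1", "b2"]}

def leaves_under(node):
    children = hierarchy.get(node)
    if not children:
        return {node}
    return set().union(*(leaves_under(c) for c in children))

def represent(transaction, nodes):
    # A transaction is represented by the subset of current nodes that
    # cover at least one of its items (duplicates collapse automatically).
    return frozenset(n for n in nodes if leaves_under(n) & transaction)

def anonymize(partition, nodes, k, out):
    for n in sorted(nodes):
        if n not in hierarchy:              # leaf node: nothing to specialize
            continue
        new_nodes = (nodes - {n}) | set(hierarchy[n])
        groups = {}
        for t in partition:
            groups.setdefault(represent(t, new_nodes), []).append(t)
        if all(len(g) >= k for g in groups.values()):
            for rep, g in groups.items():   # recurse into each sub-partition
                anonymize(g, set(rep), k, out)
            return
        # specialization would violate k-anonymity: roll back, try next node
    out.extend(represent(t, nodes) for t in partition)  # publish as-is

data = [{"a1"}, {"a1", "a2"}, {"b1", "b2"}, {"b1", "b2"},
        {"a1", "a2", "b2"}, {"a1", "a2", "b2"}, {"a1", "a2", "b1", "b2"}]
out = []
anonymize(data, {"ALL"}, 2, out)
print(Counter(out))  # {A} x2, {b1, b2} x2, {a1, a2, B} x3
```

Running it yields exactly the table from the example: T1 and T2 published as {A}, T3 and T4 as {b1, b2}, and T5–T7 as {a1, a2, B}.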
21
Main advantages
• Effective (less information loss)
  – Even though we impose a stronger privacy criterion
  – Local recoding vs. global recoding
• Efficient (less execution time)
  – Divide and conquer vs. bottom-up (exhaustive) enumeration
  – Linear in the input data size and hierarchy depth, vs. worst-case exponential in previous work
22
Experimental setup: market basket data
• Real-world benchmark data
  – BMS-WebView-1, BMS-WebView-2, BMS-POS
• No accompanying hierarchy data
  – Used a synthetic hierarchy (as in the previous work)
• Comparing our partition-based algorithm (Partition) with the previous Apriori-Anonymization (AA) [Terrovitis+08]
23
An order of magnitude faster on market basket data
[Bar chart: Time Comparison on POS, WV1, WV2 — AA (m=max) vs. Partition; y-axis time (sec), log scale 1 to 100,000]
24
Less information loss on market basket data
• Why? Local recoding
[Bar chart: Quality Comparison on POS, WV1, WV2 — AA (m=max) vs. Partition; y-axis info loss, 0% to 30%]
25
Sensitivity analysis: consistently faster with varied parameters
[Line charts: execution time of AA vs. Partition while varying m (1–6, max), k (2–50), f, and |D|; y-axis time (sec)]
26
Sensitivity analysis: less information loss in most cases
[Line charts: information loss of AA vs. Partition while varying m, k, f, and |D|]
27
Experimental setup: AOL query log
• Treated from a set-valued perspective
• No accompanying hierarchy data, again
  – Used an alphabetical hierarchy
  – Used the WordNet hierarchy
• Compared with an early work [Adar07]
28
Less information loss than [Adar07] on AOL query log
[Chart: quality comparison — info loss of Partition vs. query removal [Adar07] for k = 2 to 6; y-axis 0% to 70%]
29
Reasonably efficient on AOL query log
[Charts: info loss and execution time vs. k (5 to 500) under the WordNet hierarchy and the alphabetical hierarchy]
• Efficient given the size of the query log (2.2 GB)
• Information loss not as satisfactory as on market basket data
  – Words get generalized to “event”, “process”, “thing”…
30
Conclusion
• Developed a faster, better information-preserving anonymization algorithm
  – For set-valued data with no QI/SI distinction
• Performed well on market basket data
  – Less satisfying for search log data
• Open and important question: stronger privacy models
  – What is a good privacy model, stronger than k-anonymity, for set-valued data with no QI/SI distinction?