1
Anonymization of Set-Valued Data via Top-Down, Local Generalization
Yeye He, Jeffrey F. Naughton
University of Wisconsin-Madison
2
Overview
• The problem:
  – Anonymizing set-valued data presents challenges not seen in relational data
  – Previous solutions explored parts, but not all, of the problem space
• Our goals:
  – Develop a scalable algorithm for the new variant of the problem
  – Perform experiments to explore strengths and weaknesses of the approach
3
What is set-valued data?
• “Relational data”
  – One sensitive attribute for each tuple
• “Set-valued data”
  – Logically: (person-id, {item1, item2, …, itemn})
  – Multiple sensitive values are possible in one record
Zipcode | Gender | Age | … | Medical diagnosis
53705   | male   | 30  | … | flu
98072   | female | 40  | … | diabetes

Person ID | Item set
001       | {milk, sunglasses, viagra}
002       | {beer, diapers, shampoo}
003       | {beer, milk, diapers, pregnancy test, diabetes medicine}
4
An attack scenario
• Retailer publishes market basket data
• The adversary knows Alice has bought milk, beer, and diapers
• The adversary infers Alice has also bought pregnancy test and diabetes medicine
PID | Item set
001 | {milk, sunglasses, viagra}
002 | {beer, diapers, shampoo}
003 | {beer, milk, diapers, pregnancy test, diabetes medicine}
Known: {beer, milk, diapers} → inferred: {beer, milk, diapers, pregnancy test, diabetes medicine}
5
Existing work: a priori QI/SI partition
• Scenarios where a priori partitioning of set elements into Quasi-Identifier (QI) items and Sensitive (SI) items is possible
  – {beer, milk, diapers, pregnancy test, diabetes medicine}
• Substantial existing work & good algorithms
  – [Ghinita+08] [Xu+08a] [Xu+08b] [Nergiz+07]
• But what if a priori partitioning is not possible?
  – Individuals may have different privacy requirements
  – The adversary may see sensitive items and use them as QI
[Diagram: problem space of set-valued data anonymization — region where an a priori QI/SI partition is possible; the rest marked “?”]
6
Existing work: no QI/SI partition
• Prior work [Terrovitis+08] proposed the k^m-anonymity model
• k^m-anonymity
  – For any transaction (data record) T and any subset of m items in T, there are at least k-1 other transactions containing the same m items
[Diagram: problem space of set-valued data anonymization — “a priori QI/SI partition possible” vs. “no a priori QI/SI partition”]
7
The m in k^m-anonymity [Terrovitis+08]
• Attack revisited
  – The data is 10^3-anonymized; the adversary sees {beer, milk, diapers}
  – Cannot tell Alice’s transaction from the other 9
  – Effective assuming the adversary never sees more than m=3 items
• The m in k^m-anonymity requires some identified m s.t. no adversary will ever see more than m items
• What about the case where there is no such m?
  – The case we consider
[Diagram: problem space — the “no a priori QI/SI partition” region split into “has identified m” vs. “no identified m”]
8
Our model: k-anonymity for set-valued data
• A transactional database D is k-anonymous if
  – Every transaction (data record) occurs at least k times
• Different from k^m-anonymity [Terrovitis+08]
  – No limit on m, i.e., valid for any m
  – Thus a stronger privacy model
9
k-anonymity subsumes k^m-anonymity [Terrovitis+08]
• Every database D that satisfies k-anonymity also satisfies k^m-anonymity
• There exists a database D that satisfies k^m-anonymity for all m but not k-anonymity
  – Example: 2^3-anonymous but not 2-anonymous
T1 = {A, B, C}
T2 = {A, B, C}
T3 = {A, B}
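The subsumption claim can be checked mechanically. The sketch below is not the paper’s code — just two brute-force checkers, written for this example, that test both models on the three-transaction database above (m=3 suffices here because no transaction has more than 3 items).

```python
from itertools import combinations

def is_k_anonymous(db, k):
    # every full transaction must occur at least k times
    return all(db.count(t) >= k for t in db)

def is_km_anonymous(db, k, m):
    # every subset of at most m items drawn from any transaction
    # must be contained in at least k transactions
    for t in db:
        for size in range(1, min(m, len(t)) + 1):
            for sub in combinations(sorted(t), size):
                if sum(1 for u in db if set(sub) <= u) < k:
                    return False
    return True

# The counter-example above: T1 = T2 = {A, B, C}, T3 = {A, B}
db = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B"}]
print(is_km_anonymous(db, 2, 3))  # True:  every item subset appears >= 2 times
print(is_k_anonymous(db, 2))      # False: T3 = {A, B} occurs only once
```

The gap is exactly T3: all of its subsets are covered by T1 and T2, yet the record itself is unique.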
[Diagram: problem space — within “no QI/SI partition”, the k-anonymity region contains the k^m-anonymity region]
10
Problem statement
• Given a transactional database D, find a transformation D’ of D s.t.:
  – D’ satisfies k-anonymity
  – The transformation minimizes information loss between (D, D’)
11
Hierarchical generalization
All
├─ Alcohol: Beer, Wine
└─ Health care: Diaper, Pregnancy test
• Transaction generalization
  – Ti: {“Beer”, “Wine”, “Diaper”} → {“Alcohol”, “Health care”}
  – Duplicates are removed
12
Information loss metric
• Normalized Certainty Penalty (NCP) [Xu+06]
  – Also used in previous work [Terrovitis+08]
All
├─ Alcohol: Beer, Wine
└─ Health care: Diaper, Pregnancy test
• Example:
  – Generalize “Beer” to “Alcohol”: 2/4 = 0.5 info loss
  – Generalize “Beer” to “All”: 4/4 = 1 info loss
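The per-item NCP on the toy hierarchy works out as the fraction of all leaf items the generalized value could stand for. A minimal sketch (the parent-to-children encoding of the hierarchy is our own):

```python
# Toy hierarchy from the slide, as a parent -> children map.
hierarchy = {
    "All": ["Alcohol", "Health care"],
    "Alcohol": ["Beer", "Wine"],
    "Health care": ["Diaper", "Pregnancy test"],
}

def leaf_count(node):
    children = hierarchy.get(node)
    if not children:
        return 1                      # a leaf item covers only itself
    return sum(leaf_count(c) for c in children)

TOTAL_LEAVES = leaf_count("All")      # 4 leaf items in the toy hierarchy

def ncp(node):
    # NCP of one item generalized to `node`: fraction of all leaf items
    # the generalized value could stand for; 0 if not generalized at all.
    n = leaf_count(node)
    return 0.0 if n == 1 else n / TOTAL_LEAVES

print(ncp("Beer"))     # 0.0 -- kept as-is, no loss
print(ncp("Alcohol"))  # 0.5 -- "Beer" -> "Alcohol": 2/4
print(ncp("All"))      # 1.0 -- "Beer" -> "All": 4/4
```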
13
Our algorithm: Partition-based anonymization
• Top-down
  – Generalize everything to the root representation
  – This yields one initial partition
• Divide and conquer
  – Choose a node to specialize for each partition
    • Based on an information-gain heuristic
  – Recursively process the resulting sub-partitions
14
Example: 2-anonymization
TID | Original Data     | 2-anonymization
T1  | {a1}              | {A}
T2  | {a1, a2}          | {A}
T3  | {b1, b2}          | {b1, b2}
T4  | {b1, b2}          | {b1, b2}
T5  | {a1, a2, b2}      | {a1, a2, B}
T6  | {a1, a2, b2}      | {a1, a2, B}
T7  | {a1, a2, b1, b2}  | {a1, a2, B}
15
Generalize all data to root
TID | Original Data     | Current Representation
T1  | {a1}              | {ALL}
T2  | {a1, a2}          | {ALL}
T3  | {b1, b2}          | {ALL}
T4  | {b1, b2}          | {ALL}
T5  | {a1, a2, b2}      | {ALL}
T6  | {a1, a2, b2}      | {ALL}
T7  | {a1, a2, b1, b2}  | {ALL}
• One initial partition
16
Initial partition: specialize using ALL → {A, B}
TID | Original Data     | Current Representation
T1  | {a1}              | {A}
T2  | {a1, a2}          | {A}
T3  | {b1, b2}          | {B}
T4  | {b1, b2}          | {B}
T5  | {a1, a2, b2}      | {A, B}
T6  | {a1, a2, b2}      | {A, B}
T7  | {a1, a2, b1, b2}  | {A, B}
• Produces three sub-partitions
17
Green partition: specialize using A → {a1, a2}
TID | Original Data     | Current Representation
T1  | {a1}              | {a1} → rolled back to {A}
T2  | {a1, a2}          | {a1, a2} → rolled back to {A}
T3  | {b1, b2}          | {B}
T4  | {b1, b2}          | {B}
T5  | {a1, a2, b2}      | {A, B}
T6  | {a1, a2, b2}      | {A, B}
T7  | {a1, a2, b1, b2}  | {A, B}
• Specialization violates 2-anonymity; rolled back
18
Blue partition: specialize using B → {b1, b2}
TID | Original Data     | Current Representation
T1  | {a1}              | {A}
T2  | {a1, a2}          | {A}
T3  | {b1, b2}          | {b1, b2}
T4  | {b1, b2}          | {b1, b2}
T5  | {a1, a2, b2}      | {A, B}
T6  | {a1, a2, b2}      | {A, B}
T7  | {a1, a2, b1, b2}  | {A, B}
• Specialization OK; reaches leaf level, stop
19
Red partition: specialize using A → {a1, a2}
TID | Original Data     | Current Representation
T1  | {a1}              | {A}
T2  | {a1, a2}          | {A}
T3  | {b1, b2}          | {b1, b2}
T4  | {b1, b2}          | {b1, b2}
T5  | {a1, a2, b2}      | {a1, a2, B}
T6  | {a1, a2, b2}      | {a1, a2, B}
T7  | {a1, a2, b1, b2}  | {a1, a2, B}
• A is chosen over B based on the max-information-gain heuristic
20
Red partition: specialize using B → {b1, b2}
TID | Original Data     | Current Representation
T1  | {a1}              | {A}
T2  | {a1, a2}          | {A}
T3  | {b1, b2}          | {b1, b2}
T4  | {b1, b2}          | {b1, b2}
T5  | {a1, a2, b2}      | {a1, a2, b2} → rolled back to {a1, a2, B}
T6  | {a1, a2, b2}      | {a1, a2, b2} → rolled back to {a1, a2, B}
T7  | {a1, a2, b1, b2}  | {a1, a2, b1, b2} → rolled back to {a1, a2, B}
• Specializing B violates 2-anonymity; rolled back
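The walkthrough above can be reproduced end to end with a compact sketch of the top-down scheme. This is our own simplification, not the paper’s implementation: it tries candidate nodes in a fixed sorted order instead of the information-gain heuristic, which happens to make the same choices on this small example.

```python
from collections import Counter

# Item hierarchy as a parent -> children map; a1, a2, b1, b2 are leaves.
hierarchy = {"ALL": ["A", "B"], "A": ["a1", "a2"], "B": ["b1", "b2"]}

def leaves_under(node):
    children = hierarchy.get(node)
    if not children:
        return {node}
    return set().union(*(leaves_under(c) for c in children))

def represent(transaction, nodes):
    # A transaction is represented by the subset of current nodes that
    # cover at least one of its items (duplicates collapse automatically).
    return frozenset(n for n in nodes if leaves_under(n) & transaction)

def anonymize(partition, nodes, k, out):
    for n in sorted(nodes):
        if n not in hierarchy:              # leaf node: nothing to specialize
            continue
        new_nodes = (nodes - {n}) | set(hierarchy[n])
        groups = {}
        for t in partition:
            groups.setdefault(represent(t, new_nodes), []).append(t)
        if all(len(g) >= k for g in groups.values()):
            for rep, g in groups.items():   # recurse into each sub-partition
                anonymize(g, set(rep), k, out)
            return
        # specialization would violate k-anonymity: roll back, try next node
    out.extend(represent(t, nodes) for t in partition)  # publish as-is

data = [{"a1"}, {"a1", "a2"}, {"b1", "b2"}, {"b1", "b2"},
        {"a1", "a2", "b2"}, {"a1", "a2", "b2"}, {"a1", "a2", "b1", "b2"}]
out = []
anonymize(data, {"ALL"}, 2, out)
print(Counter(out))  # {A} x2, {b1, b2} x2, {a1, a2, B} x3
```

Running it yields exactly the table from the example: T1 and T2 published as {A}, T3 and T4 as {b1, b2}, and T5–T7 as {a1, a2, B}.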
21
Main advantages
• Effective (less information loss)
  – Even though we impose a stronger privacy criterion
  – Local recoding vs. global recoding
• Efficient (less execution time)
  – Divide and conquer vs. bottom-up (exhaustive) enumeration
  – Linear in the input data size and hierarchy depth, vs. worst-case exponential in previous work
22
Experimental setup: market basket data
• Real-world benchmark data
  – BMS-WebView-1, BMS-WebView-2, BMS-POS
• No accompanying hierarchy data
  – Used a synthetic hierarchy (as in the previous work)
• Comparing our partition-based algorithm (Partition) with the previous Apriori-Anonymization (AA) [Terrovitis+08]
23
An order of magnitude faster on market basket data
[Bar chart: Time Comparison on POS, WV1, WV2 — AA (m=max) vs. Partition; y-axis time (sec), log scale 1 to 100,000]
24
Less information loss on market basket data
• Why? Local recoding
[Bar chart: Quality Comparison on POS, WV1, WV2 — AA (m=max) vs. Partition; y-axis info loss, 0% to 30%]
25
Sensitivity analysis: consistently faster with varied parameters
[Line charts: execution time of AA vs. Partition while varying m (1–6, max), k (2–50), f, and |D|; y-axis time (sec)]
26
Sensitivity analysis: less information loss in most cases
[Line charts: information loss of AA vs. Partition while varying m, k, f, and |D|]
27
Experimental setup: AOL query log
• Treated from a set-valued perspective
• No accompanying hierarchy data, again
  – Used an alphabetical hierarchy
  – Used the WordNet hierarchy
• Compared with an early work [Adar07]
28
Less information loss than [Adar07] on AOL query log
[Chart: quality comparison — info loss of Partition vs. query removal [Adar07] for k = 2 to 6; y-axis 0% to 70%]
29
Reasonably efficient on AOL query log
[Charts: info loss and execution time vs. k (5 to 500) under the WordNet hierarchy and the alphabetical hierarchy]
• Efficient given the size of the query log (2.2 GB)
• Information loss not as satisfactory as on market basket data
  – Words get generalized to “event”, “process”, “thing”…
30
Conclusion
• Developed a faster, better information-preserving anonymization algorithm
  – For set-valued data with no QI/SI distinction
• Performed well on market basket data
  – Less satisfying for search log data
• Open and important question: stronger privacy models
  – What is a good privacy model, stronger than k-anonymity, for set-valued data with no QI/SI distinction?