Anatomy: Simple and Effective Privacy Preservation

Xiaokui Xiao, Yufei Tao

Chinese University of Hong Kong

Privacy preserving data publishing

Microdata:

Name   Age  Sex  Zipcode  Disease
Bob    23   M    11000    pneumonia
Ken    27   M    13000    dyspepsia
Peter  35   M    59000    dyspepsia
Sam    59   M    12000    pneumonia
Jane   61   F    54000    flu
Linda  65   F    25000    gastritis
Alice  65   F    25000    flu
Mandy  70   F    30000    bronchitis

• Purposes:
  – Allow researchers to effectively study the correlation between various attributes
  – Protect the privacy of every patient

A naïve solution

• Publish the microdata after removing the Name column:

Age  Sex  Zipcode  Disease
23   M    11000    pneumonia
27   M    13000    dyspepsia
35   M    59000    dyspepsia
59   M    12000    pneumonia
61   F    54000    flu
65   F    25000    gastritis
65   F    25000    flu
70   F    30000    bronchitis

• It does not work. See next.

Inference attack

• An adversary knows that Bob
  – has been hospitalized before
  – is 23 years old
  – lives in an area with zipcode 11000

• In the published table (the naïve solution above), Age, Sex and Zipcode are quasi-identifier (QI) attributes. Only the tuple (23, M, 11000, pneumonia) matches what the adversary knows, so Bob's disease is revealed.
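
The attack can be sketched as a lookup on the QI attributes. This is a minimal illustration, assuming the published table is held as a list of tuples; the function name `candidate_diseases` is hypothetical and not part of the original slides.

```python
# Published table from the running example (Name column removed).
published = [
    (23, "M", 11000, "pneumonia"),
    (27, "M", 13000, "dyspepsia"),
    (35, "M", 59000, "dyspepsia"),
    (59, "M", 12000, "pneumonia"),
    (61, "F", 54000, "flu"),
    (65, "F", 25000, "gastritis"),
    (65, "F", 25000, "flu"),
    (70, "F", 30000, "bronchitis"),
]


def candidate_diseases(table, age, zipcode):
    """Return the diseases of all published tuples that match the known QI values."""
    return [disease for (a, _sex, z, disease) in table if a == age and z == zipcode]


# Background knowledge about Bob: age 23, zipcode 11000.
bob_matches = candidate_diseases(published, 23, 11000)
print(bob_matches)
```

A single match means the adversary learns Bob's disease with certainty; two or more matches with different diseases (as for age 65, zipcode 25000) leave some uncertainty.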

Generalization

• Transform each QI value into a less specific form

A generalized table:

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis

The adversary's knowledge of Bob:

Name  Age  Sex  Zipcode
Bob   23   M    11000

How much generalization do we need?

l-diversity

A generalized table with 2 QI-groups (Age, Sex and Zipcode are the quasi-identifier (QI) attributes; Disease is the sensitive attribute):

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia

[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  gastritis
[61, 70]  F    [10001, 60000]  flu
[61, 70]  F    [10001, 60000]  bronchitis

• A QI-group with m tuples is l-diverse iff each sensitive value appears no more than m / l times in the QI-group.
• A table is l-diverse iff all of its QI-groups are l-diverse.
• The above table is 2-diverse.
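
The definition translates directly into a check over the sensitive values of each QI-group. The sketch below assumes each group is given as a plain list of sensitive values; the function names are illustrative only.

```python
from collections import Counter


def is_l_diverse_group(sensitive_values, l):
    """A group with m tuples is l-diverse iff every value occurs at most m / l times."""
    m = len(sensitive_values)
    counts = Counter(sensitive_values)
    # count <= m / l, written as count * l <= m to stay in integer arithmetic
    return all(count * l <= m for count in counts.values())


def is_l_diverse(groups, l):
    """A table is l-diverse iff all of its QI-groups are l-diverse."""
    return all(is_l_diverse_group(group, l) for group in groups)


# Sensitive values of the two QI-groups in the generalized table.
groups = [
    ["pneumonia", "dyspepsia", "dyspepsia", "pneumonia"],
    ["flu", "gastritis", "flu", "bronchitis"],
]
print(is_l_diverse(groups, 2), is_l_diverse(groups, 3))
```

With l = 2 every value occurs at most 4 / 2 = 2 times, so the check passes; with l = 3 the bound drops to 4 / 3 and the values that occur twice violate it.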

What l-diversity guarantees

• From an l-diverse generalized table, an adversary (without any prior knowledge) can infer the sensitive value of each individual with confidence at most 1/l

A 2-diverse generalized table: the table shown under l-diversity above.

Name  Age  Sex  Zipcode
Bob   23   M    11000

A. Machanavajjhala et al. l-Diversity: Privacy Beyond k-Anonymity. ICDE 2006.

Defect of generalization

• Query A:
    SELECT COUNT(*) FROM Unknown-Microdata
    WHERE Disease = 'pneumonia' AND Age in [0, 30]
      AND Zipcode in [10001, 20000]

• Only the two pneumonia tuples of the generalized table can contribute to the answer:

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  pneumonia

• Estimated answer: 2 * p, where p is the probability that each of the two tuples satisfies the query conditions

Defect of generalization (cont.)

• Query A (as defined above): pneumonia patients with Age in [0, 30] and Zipcode in [10001, 20000]

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  pneumonia

[Figure: the generalized region R1 and the query rectangle Q in the Age-Zipcode plane; axes: Age (20 to 70), Zipcode (10k to 60k)]

• p = Area( R1 ∩ Q ) / Area( R1 ) = 0.05
• Estimated answer for query A: 2 * p = 0.1
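
The value p = 0.05 can be reproduced with a short calculation. The sketch assumes that R1 is the region Age [21, 60] × Zipcode [10001, 60000] from the generalized table, that Q is the region selected by query A, that both attributes are integer-valued, and that a tuple is uniformly distributed inside R1. The helper name `overlap_fraction` is hypothetical.

```python
def overlap_fraction(region, query):
    """Fraction of an axis-aligned integer region that the query covers.

    Both arguments map an attribute name to an inclusive (low, high) range.
    """
    fraction = 1.0
    for attr, (lo, hi) in region.items():
        q_lo, q_hi = query[attr]
        # Number of integer values shared by the two ranges (0 if disjoint).
        covered = max(0, min(hi, q_hi) - max(lo, q_lo) + 1)
        fraction *= covered / (hi - lo + 1)
    return fraction


r1 = {"age": (21, 60), "zipcode": (10001, 60000)}
q = {"age": (0, 30), "zipcode": (10001, 20000)}

p = overlap_fraction(r1, q)   # (10 / 40) * (10000 / 50000)
estimate = 2 * p              # two pneumonia tuples in the group
print(p, estimate)
```

Age contributes 10 of 40 values and Zipcode 10000 of 50000, which gives 0.25 * 0.2 = 0.05.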

Defect of generalization (cont.)

• Estimated answer to query A from the generalized table: 0.1
• The exact answer should be: 1 (in the microdata, only Bob has pneumonia, an Age in [0, 30] and a Zipcode in [10001, 20000])

Research Works on Generalization

1. V. S. Iyengar. Transforming data to satisfy privacy constraints. KDD 2002.
2. K. Wang, P. S. Yu and S. Chakraborty. Bottom-Up Generalization: A Data Mining Solution to Privacy Protection. ICDM 2004.
3. R. J. Bayardo Jr. and R. Agrawal. Data Privacy through Optimal k-Anonymization. ICDE 2005.
4. B. C. M. Fung, K. Wang and P. S. Yu. Top-Down Specialization for Information and Privacy Preservation. ICDE 2005.
5. K. LeFevre, D. J. DeWitt and R. Ramakrishnan. Incognito: Efficient Full-Domain K-Anonymity. SIGMOD 2005.
6. K. LeFevre, D. J. DeWitt and R. Ramakrishnan. Mondrian Multidimensional K-Anonymity. ICDE 2006.
7. D. Kifer and J. Gehrke. Injecting utility into anonymized datasets. SIGMOD 2006.
8. X. Xiao and Y. Tao. Personalized privacy preservation. SIGMOD 2006.
9. K. Wang and B. C. M. Fung. Anonymization for Sequential Releases. KDD 2006.
10. K. LeFevre, D. J. DeWitt and R. Ramakrishnan. Workload-Aware Anonymization. KDD 2006.
11. J. Xu, W. Wang, J. Pei, et al. Utility-Based Anonymization Using Local Recodings. KDD 2006.
12. …

Contributions

1. We propose an alternative to generalization, called Anatomy, which allows much more accurate data analysis while still preserving privacy.
2. We develop an algorithm for computing anatomized tables that
   • runs in linear I/Os
   • (nearly) minimizes information loss

Outline

• Basic Idea of Anatomy

• Preserving Correlation

• Algorithm for Anatomy

• Experimental Results

Basic Idea of Anatomy

• For a given microdata table, Anatomy releases a quasi-identifier table (QIT) and a sensitive table (ST)

Microdata:

Age  Sex  Zipcode  Disease
23   M    11000    pneumonia
27   M    13000    dyspepsia
35   M    59000    dyspepsia
59   M    12000    pneumonia
61   F    54000    flu
65   F    25000    gastritis
65   F    25000    flu
70   F    30000    bronchitis

Quasi-identifier Table (QIT):

Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2

Sensitive Table (ST):

Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1

Basic Idea of Anatomy (cont.)

1. Select a partition of the tuples (here, a 2-diverse partition)

QI group 1:

Age  Sex  Zipcode  Disease
23   M    11000    pneumonia
27   M    13000    dyspepsia
35   M    59000    dyspepsia
59   M    12000    pneumonia

QI group 2:

Age  Sex  Zipcode  Disease
61   F    54000    flu
65   F    25000    gastritis
65   F    25000    flu
70   F    30000    bronchitis

Basic Idea of Anatomy (cont.)

2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition

   a. Separate the QI attributes (Age, Sex, Zipcode) from the sensitive attribute (Disease), keeping the tuples of group 1 and group 2 apart.

   b. Add a Group-ID column to both tables:

      quasi-identifier table (QIT)        sensitive table (ST)

      Age  Sex  Zipcode  Group-ID         Group-ID  Disease
      23   M    11000    1                1         pneumonia
      27   M    13000    1                1         dyspepsia
      35   M    59000    1                1         dyspepsia
      59   M    12000    1                1         pneumonia
      61   F    54000    2                2         flu
      65   F    25000    2                2         gastritis
      65   F    25000    2                2         flu
      70   F    30000    2                2         bronchitis

   c. In the ST, replace the rows of each group by one row per distinct disease with its count; the QIT is unchanged:

      sensitive table (ST)

      Group-ID  Disease     Count
      1         dyspepsia   2
      1         pneumonia   2
      2         bronchitis  1
      2         flu         2
      2         gastritis   1
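
Step 2 can be sketched in a few lines once a partition has been chosen. This is an illustrative in-memory version, assuming each group is a list of (age, sex, zipcode, disease) tuples; the function name `build_qit_st` is hypothetical, and choosing the partition itself is not covered here.

```python
from collections import Counter


def build_qit_st(groups):
    """Build (QIT, ST) from a partition given as a list of groups.

    QIT rows are (age, sex, zipcode, group_id);
    ST rows are (group_id, disease, count).
    """
    qit, st = [], []
    for group_id, group in enumerate(groups, start=1):
        for age, sex, zipcode, _disease in group:
            qit.append((age, sex, zipcode, group_id))
        counts = Counter(row[3] for row in group)
        for disease in sorted(counts):
            st.append((group_id, disease, counts[disease]))
    return qit, st


# The 2-diverse partition of the running example.
partition = [
    [(23, "M", 11000, "pneumonia"), (27, "M", 13000, "dyspepsia"),
     (35, "M", 59000, "dyspepsia"), (59, "M", 12000, "pneumonia")],
    [(61, "F", 54000, "flu"), (65, "F", 25000, "gastritis"),
     (65, "F", 25000, "flu"), (70, "F", 30000, "bronchitis")],
]

qit, st = build_qit_st(partition)
for row in st:
    print(row)
```

The QIT keeps every QI value exactly as it appears in the microdata; only the link between an individual tuple and its disease is replaced by per-group counts.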

Privacy Preservation

• From a pair of QIT and ST generated from an l-diverse partition, the adversary can infer the sensitive value of each individual with confidence at most 1/l

• Example: Bob (Age 23, Sex M, Zipcode 11000) matches a QIT tuple with Group-ID 1. In the ST, group 1 holds dyspepsia (count 2) and pneumonia (count 2), so either disease can be inferred with confidence 2 / 4 = 1/2.
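
The adversary's reasoning can be sketched as a two-step lookup: find the group in the QIT, then read the disease distribution of that group from the ST. The sketch assumes that, within a group, every tuple is equally likely to carry each of the listed sensitive values; `inference_confidence` is a hypothetical helper name.

```python
from fractions import Fraction

qit = [
    (23, "M", 11000, 1), (27, "M", 13000, 1), (35, "M", 59000, 1), (59, "M", 12000, 1),
    (61, "F", 54000, 2), (65, "F", 25000, 2), (65, "F", 25000, 2), (70, "F", 30000, 2),
]
st = [
    (1, "dyspepsia", 2), (1, "pneumonia", 2),
    (2, "bronchitis", 1), (2, "flu", 2), (2, "gastritis", 1),
]


def inference_confidence(qit, st, age, sex, zipcode):
    """Map each disease to the adversary's confidence for the given QI values."""
    group_ids = {gid for (a, s, z, gid) in qit if (a, s, z) == (age, sex, zipcode)}
    if len(group_ids) != 1:
        raise ValueError("expected the QI values to identify exactly one group")
    (gid,) = group_ids
    counts = {disease: count for (g, disease, count) in st if g == gid}
    total = sum(counts.values())
    return {disease: Fraction(count, total) for disease, count in counts.items()}


bob = inference_confidence(qit, st, 23, "M", 11000)
print(bob)
```

For Bob the largest confidence is 1/2, which matches the 1/l bound for l = 2.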

Accuracy of Data Analysis

• Query A:
    SELECT COUNT(*) FROM Unknown-Microdata
    WHERE Disease = 'pneumonia' AND Age in [0, 30]
      AND Zipcode in [10001, 20000]

• In the ST, 2 patients (both in group 1) have contracted pneumonia
• In the QIT, 2 out of the 4 patients in group 1 satisfy the query conditions on Age and Zipcode
• Estimated answer for query A: 2 * 2 / 4 = 1, which is also the actual result from the original microdata

QIT tuples of group 1:

     Age  Sex  Zipcode  Group-ID
t1   23   M    11000    1
t2   27   M    13000    1
t3   35   M    59000    1
t4   59   M    12000    1

[Figure: t1 to t4 plotted as points in the plane with x = Age (20 to 70) and y = Zipcode (10k to 60k), together with the query rectangle Q]
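
The estimate 2 * 2 / 4 generalizes to a per-group sum. The sketch below is one way to write it, assuming the QIT and ST are lists of tuples as in the running example; `estimate_count` is a hypothetical name, and the per-group formula (count of the disease times the fraction of the group's QIT tuples that satisfy the QI conditions) is taken from the arithmetic on the slide.

```python
from collections import defaultdict

qit = [
    (23, "M", 11000, 1), (27, "M", 13000, 1), (35, "M", 59000, 1), (59, "M", 12000, 1),
    (61, "F", 54000, 2), (65, "F", 25000, 2), (65, "F", 25000, 2), (70, "F", 30000, 2),
]
st = [
    (1, "dyspepsia", 2), (1, "pneumonia", 2),
    (2, "bronchitis", 1), (2, "flu", 2), (2, "gastritis", 1),
]


def estimate_count(qit, st, disease, age_range, zip_range):
    """Estimate COUNT(*) for a disease plus inclusive range conditions on Age and Zipcode."""
    size = defaultdict(int)   # tuples per group
    hits = defaultdict(int)   # tuples per group that satisfy the QI conditions
    for age, _sex, zipcode, gid in qit:
        size[gid] += 1
        if age_range[0] <= age <= age_range[1] and zip_range[0] <= zipcode <= zip_range[1]:
            hits[gid] += 1
    total = 0.0
    for gid, d, count in st:
        if d == disease:
            total += count * hits[gid] / size[gid]
    return total


answer_a = estimate_count(qit, st, "pneumonia", (0, 30), (10001, 20000))
print(answer_a)
```

Because the QIT stores exact QI values, the fraction 2 / 4 is counted rather than estimated from the area of a generalized region.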

Outline

• Rationale of Anatomy

• Preserving Correlation

• Algorithm for Anatomy

• Experimental Results

Preserving Correlation

• Let us first examine the correlation between Age and Disease in our running example
• Each tuple in the microdata can be mapped to a point in the (Age, Disease) domain

     Age  Sex  Zipcode  Disease
t1   23   M    11000    pneumonia
     …    …    …        …

• The above tuple can be mapped to (23, pneumonia).

Preserving Correlation (cont.)

• We model this tuple using a probability density function (pdf):

[Figure: pdf of t1 over the (Age, Disease) domain; axes: Age (20 to 60), Disease (dyspepsia, pneumonia), probability (0 to 1)]

Preserving Correlation (cont.)

• In the generalized table, the tuple becomes:

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
…         …    …               …

• Its corresponding pdf becomes:

[Figure: pdf of the generalized tuple over the (Age, Disease) domain; axes: Age (20 to 60), Disease (dyspepsia, pneumonia), probability (0 to 1)]

Preserving Correlation (cont.)

• In the anatomized tables, the tuple becomes:

Age  Sex  Zipcode  Group-ID
23   M    11000    1
…    …    …        …

Group-ID  Disease    Count
1         dyspepsia  2
1         pneumonia  2
…         …          …

• Its corresponding pdf becomes:

[Figure: pdf of the anatomized tuple over the (Age, Disease) domain; axes: Age (20 to 60), Disease (dyspepsia, pneumonia), probability (0 to 1)]

Preserving Correlation (cont.)

[Figure: the three pdfs of the preceding slides shown together for comparison; axes: Age (20 to 60), Disease (dyspepsia, pneumonia), probability (0 to 1)]
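
The three pdfs can be written down explicitly for t1. The sketch below is a discrete illustration over the (Age, Disease) domain and rests on two assumptions that the extracted slides do not state: the generalized tuple spreads its probability uniformly over the integer ages in [21, 60], and the anatomized tuple keeps its exact age while its disease follows the ST counts of its group. All function names are hypothetical.

```python
def exact_pdf(age, disease):
    """All probability mass sits on the single point (age, disease)."""
    return {(age, disease): 1.0}


def generalized_pdf(age_range, disease):
    """Mass spread uniformly over every integer age in the generalized interval."""
    lo, hi = age_range
    width = hi - lo + 1
    return {(age, disease): 1.0 / width for age in range(lo, hi + 1)}


def anatomized_pdf(age, st_counts):
    """Exact age; disease distributed according to the ST counts of the tuple's group."""
    total = sum(st_counts.values())
    return {(age, disease): count / total for disease, count in st_counts.items()}


original = exact_pdf(23, "pneumonia")
generalized = generalized_pdf((21, 60), "pneumonia")
anatomized = anatomized_pdf(23, {"dyspepsia": 2, "pneumonia": 2})

print(generalized[(23, "pneumonia")])   # 1/40 of the mass at Bob's true age
print(anatomized)
```

Under these assumptions the generalized pdf leaves 1/40 of the mass on the true point, while the anatomized pdf leaves 1/2 there.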

Outline

• Rationale of Anatomy

• Preserving Correlation

• Algorithm for Anatomy

• Experimental Results

Quality Metric

[Figures: the original pdf and the approximated pdf over the (Age, Disease) domain; axes: Age (20 to 60), Disease (dyspepsia, pneumonia), probability (0 to 1)]

• For each approximated pdf, we measure its error from the original pdf by their “L2 distance”.
• We aim at obtaining anatomized tables that minimize the re-construction error (RCE).
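
One way to write the two quantities is given below. It assumes that the L2 distance is the integrated squared difference of the two pdfs and that the RCE sums the per-tuple errors over the microdata table; the symbols G_t, \tilde{G}_t, DS and T are notation introduced here for illustration. The worked values for t1 further assume a discrete Age domain with a uniform spread over [21, 60] for generalization.

```latex
% G_t: original pdf of tuple t;  \tilde{G}_t: its approximation;
% DS: the (Age, Disease) domain;  T: the microdata table.
\mathrm{Err}_t = \int_{x \in DS} \bigl(\tilde{G}_t(x) - G_t(x)\bigr)^2 \, dx,
\qquad
\mathrm{RCE} = \sum_{t \in T} \mathrm{Err}_t

% Worked example for t_1 = (23, \text{pneumonia}):
% anatomy puts probability 1/2 on each of the two diseases at Age 23
\mathrm{Err}_{t_1}^{\text{anatomy}}
  = \left(\tfrac{1}{2} - 1\right)^2 + \left(\tfrac{1}{2}\right)^2 = \tfrac{1}{2}

% generalization puts probability 1/40 on each of the 40 ages in [21, 60]
\mathrm{Err}_{t_1}^{\text{generalization}}
  = \left(\tfrac{1}{40} - 1\right)^2 + 39 \left(\tfrac{1}{40}\right)^2 = 0.975
```

Under these assumptions the anatomized pdf is roughly twice as close to the original as the generalized one for this tuple.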

Anatomize

• An algorithm for computing anatomized tables that
  – runs in I/O cost linear in the cardinality n of the microdata table
  – minimizes the RCE when n is a multiple of l; otherwise achieves an RCE that is higher than the lower bound by a factor of at most 1 + 1/n
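
The slides do not spell out the steps of Anatomize. As a rough illustration of how an l-diverse partition might be formed, the sketch below uses a simple greedy strategy: bucket the tuples by sensitive value, repeatedly draw one tuple from each of the l largest buckets to form a group, then place any leftover tuple into a group that lacks its value. It is an in-memory sketch under these assumptions, not the authors' I/O-efficient implementation, and it makes no claim about RCE optimality; the function name is hypothetical.

```python
from collections import defaultdict


def greedy_l_diverse_partition(tuples, l):
    """Partition tuples (sensitive value last) into groups with distinct sensitive values."""
    buckets = defaultdict(list)
    for t in tuples:
        buckets[t[-1]].append(t)

    groups = []
    while True:
        nonempty = [value for value in buckets if buckets[value]]
        if len(nonempty) < l:
            break
        # Pick the l largest buckets (ties broken by value) and take one tuple from each.
        largest = sorted(nonempty, key=lambda v: (-len(buckets[v]), v))[:l]
        groups.append([buckets[v].pop() for v in largest])

    # Leftover tuples: add each to a group that does not yet contain its sensitive value.
    for value in [v for v in buckets if buckets[v]]:
        for t in buckets[value]:
            target = next((g for g in groups if all(u[-1] != value for u in g)), None)
            if target is None:
                raise ValueError("this greedy sketch found no l-diverse partition")
            target.append(t)
    return groups


microdata = [
    (23, "M", 11000, "pneumonia"), (27, "M", 13000, "dyspepsia"),
    (35, "M", 59000, "dyspepsia"), (59, "M", 12000, "pneumonia"),
    (61, "F", 54000, "flu"), (65, "F", 25000, "gastritis"),
    (65, "F", 25000, "flu"), (70, "F", 30000, "bronchitis"),
]
groups = greedy_l_diverse_partition(microdata, 2)
for group in groups:
    print([t[-1] for t in group])
```

Every group ends up with at least l tuples and pairwise distinct sensitive values, so each value occurs once among m ≥ l tuples and the group is l-diverse. The partition it finds for the running example differs from the one shown on the slides, which is expected since many l-diverse partitions exist.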

Outline

• Rationale of Anatomy

• Preserving Correlation

• Algorithm for Anatomy

• Experimental Results

Experimental Settings

• Goal: to compare the accuracy of data analysis on the generalized / anatomized tables.
• Real dataset with 9 attributes:
  – Age, Gender, Education, Marital-status, Race, Work-class, Country,
  – Occupation, Salary-class
• OCC-d, SAL-d (d = 3, 4, 5, 6, 7)
  – OCC-3: Age, Gender, Education, Occupation
  – SAL-4: Age, Gender, Education, Marital-status, Salary-class
• Cardinality: 100k, 200k, 300k, 400k, 500k

Experimental Settings (cont.)

• competitor: multi-dimensional generalization
• l = 10
• avg. relative error for 10000 aggregate queries: |act – est| / act
• qd = 1, 2, …, d
• s = 1%, …, 5%, …, 10%
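
The error measure can be computed as follows. This is a minimal sketch, assuming that `actual` and `estimated` are parallel lists of query answers; skipping queries whose actual answer is 0 is an assumption made here to avoid division by zero, not something the slides specify.

```python
def average_relative_error(actual, estimated):
    """Mean of |act - est| / act over all queries with a non-zero actual answer."""
    pairs = [(act, est) for act, est in zip(actual, estimated) if act != 0]
    if not pairs:
        raise ValueError("no query with a non-zero actual answer")
    return sum(abs(act - est) / act for act, est in pairs) / len(pairs)


# Query A from the running example: the actual answer is 1,
# generalization estimates 0.1 and anatomy estimates 1.
error_generalization = average_relative_error([1], [0.1])
error_anatomy = average_relative_error([1], [1.0])
print(error_generalization, error_anatomy)
```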

Accuracy of Data Analysis (cont.)

C.C. Aggarwal. On k-anonymity and the curse of dimensionality. VLDB 2005

Computation Overhead

Summary

• Anatomy outperforms generalization by allowing much more accurate data analysis on the published data.

• Anatomized tables (with nearly optimal quality guarantee) can be computed in I/O cost linear to the database cardinality.

Thank you!

Datasets and implementation are available for download at

http://www.cse.cuhk.edu.hk/~taoyf

Anatomy vs. Generalization Revisit

• Sometimes the adversary is not sure whether an individual appears in the microdata or not

A 2-diverse generalized table: the table shown under l-diversity above.

A Voter Registration List:

Name   Age  Sex  Zipcode
Bob    23   M    11000
Ken    27   M    13000
Peter  35   M    59000
Mark   40   M    30000
Ric    50   M    40000
Sam    59   M    12000
…      …    …    …

Anatomy vs. Generalization Revisit

• From the adversary’s perspective:
  – Bob has 4 / 6 probability to be in the microdata
  – If Bob indeed appears in the microdata, there is 2 / 4 probability that he has contracted pneumonia
  – So Bob has 4/6 * 2/4 = 1/3 probability to have contracted pneumonia

A 2-diverse generalized table:

Age       Sex  Zipcode         Disease
[21, 60]  M    [10001, 60000]  pneumonia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  dyspepsia
[21, 60]  M    [10001, 60000]  pneumonia
…         …    …               …

A Voter Registration List: as on the previous slide.
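
The 1/3 figure can be reproduced from the two tables. The sketch assumes that the six listed voters are the only ones whose QI values fall into the first generalized QI-group (the elided rows of the list are taken not to match) and that each matching voter is equally likely to be one of the four tuples in the group; the variable names are illustrative.

```python
from fractions import Fraction

voters = [
    ("Bob", 23, "M", 11000), ("Ken", 27, "M", 13000), ("Peter", 35, "M", 59000),
    ("Mark", 40, "M", 30000), ("Ric", 50, "M", 40000), ("Sam", 59, "M", 12000),
]

# First QI-group of the 2-diverse generalized table.
group = {
    "age": (21, 60),
    "sex": "M",
    "zipcode": (10001, 60000),
    "diseases": ["pneumonia", "dyspepsia", "dyspepsia", "pneumonia"],
}


def matches(voter, group):
    """True if the voter's QI values fall inside the generalized QI-group."""
    _name, age, sex, zipcode = voter
    return (group["age"][0] <= age <= group["age"][1]
            and sex == group["sex"]
            and group["zipcode"][0] <= zipcode <= group["zipcode"][1])


candidates = [v for v in voters if matches(v, group)]
p_in_microdata = Fraction(len(group["diseases"]), len(candidates))            # 4 / 6
p_pneumonia_if_in = Fraction(group["diseases"].count("pneumonia"),
                             len(group["diseases"]))                          # 2 / 4
p_pneumonia = p_in_microdata * p_pneumonia_if_in
print(p_pneumonia)
```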

Anatomy vs. Generalization Revisit

• The adversary knows that
  – Bob must appear in the microdata
  – There is 1/2 probability that Bob has contracted pneumonia

2-diverse QIT:

Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
…    …    …        …

2-diverse ST:

Group-ID  Disease    Count
1         dyspepsia  2
1         pneumonia  2
…         …          …

A Voter Registration List: the same list as on the preceding slides.

Note (Xiaokui Xiao): This does not mean that anatomy fails to protect privacy, since ....
Anatomy vs. Generalization Revisit

• For a given value of l, l-diverse generalization may lead to higher privacy protection than l-diverse anatomy does.
• But this is not always the case, since:
  – the external database may not contain any irrelevant individuals
  – the adversary may know that some individuals indeed appear in the microdata

A Voter Registration List: the same list as on the preceding slides.