Privacy Preserving Data Publication

Yufei Tao

Department of Computer Science and Engineering

Chinese University of Hong Kong

Centralized publication

Assume that a hospital wants to publish the following table, called the microdata.

The publication must preserve the privacy of patients: prevent an adversary from knowing who contracted what.

Microdata

Centralized publication (cont.)

A simple solution: Remove column ‘Name’. It does not work. See next.

publish

Linking attacks

[Figure: an adversary joins the published table with a voter registration list on the quasi-identifier (QI) attributes.]

These are real threats

Fact: 87% of Americans can be uniquely identified by {Zipcode, gender, date-of-birth}.

A famous experiment by Sweeney [International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002] found the medical record of an ex-governor of Massachusetts.

Objectives

Publish a distorted version of the dataset so that
[Privacy] the privacy of all individuals is “adequately” protected;
[Utility] the dataset is useful for analyzing the characteristics of the microdata.

Paradox: Privacy protection ↑, utility ↓.

Issues

Privacy principle: What is adequate privacy protection?

Distortion approach: How to achieve the privacy principle?

The literature has discussed other issues as well: complexities, improving the utility of the published data, etc.

Principle 1: k-anonymity

[Figure: a 2-anonymous generalization with 4 QI groups, marking the QI attributes and the sensitive attribute, shown next to a voter registration list.]

[Sweeney, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002]
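The k-anonymity requirement can be checked mechanically: every combination of QI values must occur in at least k rows. The sketch below is only an illustration. The function name `is_k_anonymous` and the dictionary-based row format are assumptions, and the sample rows are taken from the 2-diverse table that appears later in these slides.

```python
from collections import Counter

def is_k_anonymous(rows, qi_columns, k):
    """Return True if every QI-value combination occurs in at least k rows."""
    group_sizes = Counter(tuple(row[c] for c in qi_columns) for row in rows)
    return all(size >= k for size in group_sizes.values())

QI = ("Age", "Sex", "Zipcode")

def make_row(age, sex, zipcode, disease):
    # Hypothetical helper that builds one generalized row.
    return {"Age": age, "Sex": sex, "Zipcode": zipcode, "Disease": disease}

published = [
    make_row("[1, 5]", "M", "[10001, 15000]", "gastric ulcer"),
    make_row("[1, 5]", "M", "[10001, 15000]", "dyspepsia"),
    make_row("[6, 10]", "M", "[15001, 20000]", "pneumonia"),
    make_row("[6, 10]", "M", "[15001, 20000]", "bronchitis"),
    make_row("[11, 20]", "F", "[20001, 25000]", "flu"),
    make_row("[11, 20]", "F", "[20001, 25000]", "pneumonia"),
    make_row("[21, 60]", "F", "[30001, 60000]", "gastritis"),
    make_row("[21, 60]", "F", "[30001, 60000]", "gastritis"),
    make_row("[21, 60]", "F", "[30001, 60000]", "flu"),
    make_row("[21, 60]", "F", "[30001, 60000]", "flu"),
]

print(is_k_anonymous(published, QI, 2))  # smallest QI group has 2 rows
print(is_k_anonymous(published, QI, 3))
```

The check looks only at group sizes, which is exactly why k-anonymity says nothing about how diverse the sensitive values inside a group are.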

Defects of k-anonymity

What is the disease of Joe?

No “diversity” in this QI group.

A voter registration list

Principle 2: l-diversity

Each QI group should have at least l “well-represented” sensitive values.

Different ways to interpret “well-represented”.

[Machanavajjhala et al., ICDE, 2006]

Naive interpretation

Each QI-group has l different sensitive values.

A 2-diverse table

Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[6, 10]   M    [15001, 20000]  pneumonia
[6, 10]   M    [15001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  gastritis
[21, 60]  F    [30001, 60000]  flu
[21, 60]  F    [30001, 60000]  flu

Defects of the naive interpretation

Assume that Joe is identified in the QI group. What is the probability that he contracted HIV?

Implication: The most frequent sensitive value in a QI group cannot be too frequent.

But accomplishing only this is still vulnerable to attacks with background knowledge.

[Figure: a QI group with 100 tuples, 98 of which have Disease = HIV; the remaining values include pneumonia and bronchitis.]

Background knowledge attack

Let Joe be an individual in the QI group having HIV. A friend of Joe has the background knowledge: “Joe does not have pneumonia”. How likely would this friend assume that Joe had HIV?

[Figure: a QI group with 100 tuples: 50 tuples with HIV, 49 tuples with pneumonia, and the rest with other values such as bronchitis.]

Controlling also the 2nd most frequent value

Even if an adversary can eliminate pneumonia, s/he can only assume that Joe has HIV with 40 / 70 probability.

[Figure: a QI group with 100 tuples: 40 tuples with HIV, 30 tuples with pneumonia, and 30 tuples with bronchitis.]
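The 40 / 70 figure follows from dropping the eliminated value and renormalizing over the remaining tuples. A minimal sketch of that arithmetic is given below. The function name `posterior_after_elimination` is hypothetical, and the adversary is assumed to treat all remaining tuples in the group as equally likely to be Joe's.

```python
from fractions import Fraction

def posterior_after_elimination(counts, target, eliminated):
    """Probability of `target` once the values in `eliminated` are ruled out."""
    remaining = {value: n for value, n in counts.items() if value not in eliminated}
    return Fraction(remaining[target], sum(remaining.values()))

# The QI group from this slide: 40 HIV, 30 pneumonia, 30 bronchitis.
group = {"HIV": 40, "pneumonia": 30, "bronchitis": 30}

print(posterior_after_elimination(group, "HIV", set()))          # 2/5
print(posterior_after_elimination(group, "HIV", {"pneumonia"}))  # 4/7, i.e. 40 / 70
```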

An example of 4-diversity

[Figure: a QI group whose Disease column is divided into the most frequent value, the 2nd, 3rd, and 4th most frequent values, and the other values.]

An example of 4-diversity (cont.)

[Figure: a QI group whose Disease column shows the most frequent value and the other values, the two parts having the same cardinality.]

Assume that Joe is a person in the QI group.
Property: If an adversary can eliminate only 3 diseases, s/he can correctly guess the disease of Joe with at most 50% probability.

An example of 4-diversity (cont.)

[Figure: a QI group whose Disease column contains HIV, pneumonia, bronchitis, cancer, and the other values.]

l-diversity

Consider a QI group.
m is the number of distinct sensitive values in the group.
$r_1$ is the number of tuples having the most frequent sensitive value.
$r_2$ is the number of tuples having the 2nd most frequent sensitive value.
…
$r_m$ is the number of tuples having the m-th most frequent sensitive value.

Then, $r_1 \le c\,(r_l + \dots + r_m)$, where c is a constant.

If an adversary can eliminate only l – 1 sensitive values, s/he can infer the disease of a person with probability at most 1 / (c + 1).

Called (c, l)-diversity precisely.
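The condition on a single QI group can be tested directly from the frequency counts of its sensitive values. The sketch below is an illustration under stated assumptions: the comparison is taken as non-strict (the operator is not legible in the source, and the original (c, l)-diversity paper states it with a strict inequality), the function name `satisfies_c_l_diversity` is hypothetical, and the 98/1/1 group is an assumed split used only as a counterexample.

```python
from collections import Counter

def satisfies_c_l_diversity(sensitive_values, c, l):
    """Check r_1 <= c * (r_l + ... + r_m) for one QI group (non-strict by assumption)."""
    counts = sorted(Counter(sensitive_values).values(), reverse=True)
    if len(counts) < l:
        return False  # fewer than l distinct sensitive values in the group
    r1 = counts[0]
    tail = sum(counts[l - 1:])  # r_l + ... + r_m (1-based indices)
    return r1 <= c * tail

balanced = ["HIV"] * 40 + ["pneumonia"] * 30 + ["bronchitis"] * 30
skewed = ["HIV"] * 98 + ["pneumonia", "bronchitis"]  # assumed split of the 2 non-HIV tuples

print(satisfies_c_l_diversity(balanced, c=1, l=2))  # 40 <= 1 * (30 + 30)
print(satisfies_c_l_diversity(skewed, c=1, l=2))    # 98 <= 1 * (1 + 1) fails
```

A larger c tolerates a more dominant top value, while a larger l shrinks the tail sum and makes the condition harder to meet.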

Defects of l-diversity

Andy does not want anyone to know that he had a stomach problem. Sarah does not mind at all if others find out that she had flu.

A voter registration list

Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000

A 2-diverse table

Age     Sex  Zipcode         Disease
[1, 5]  M    [10001, 15000]  gastric ulcer
[1, 5]  M    [10001, 15000]  dyspepsia



Does not work if an individual can have multiple tuples in the microdata.

Defects of l-diversity (cont.)

Microdata

Name   Age  Sex  Zipcode  Disease
Andy   4    M    12000    gastric ulcer
Andy   4    M    12000    dyspepsia
Ken    6    M    18000    pneumonia
Nash   9    M    19000    bronchitis
Alice  12   F    22000    flu
Betty  19   F    24000    pneumonia
Linda  21   F    33000    gastritis
Jane   25   F    34000    gastritis
Sarah  28   F    37000    flu
Mary   56   F    58000    flu

Defects of l-diversity (cont.)

A voter registration list

Name   Age  Sex  Zipcode
Andy   4    M    12000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000

A 2-diverse table

Age  Sex  Zipcode  Disease
4    M    12000    gastric ulcer
4    M    12000    dyspepsia



Principle 3: Personalized anonymity

Key ideas: Guarding node + sensitive attribute (SA) generalization.
Assume a publicly-known hierarchy on the sensitive attribute.

any illness
    respiratory system problem
        respiratory infection
            flu
            pneumonia
            bronchitis
    digestive system problem
        stomach disease
            gastric ulcer
            dyspepsia
            gastritis

[Xiao and Tao, SIGMOD, 2006]

Guarding node

Andy does not want anyone to know that he had a stomach problem. He can specify “stomach disease” as the guarding node for his tuple.

Protect Andy from being conjectured to have any disease in the subtree of the guarding node.

Name Age Sex Zipcode Disease guarding node

Andy 4 M 12000 gastric ulcer stomach disease

Guarding node (cont.)

Sarah is willing to disclose her exact symptom. She can specify Ø as the guarding node for her tuple.


Sarah 28 F 37000 flu Ø

Guarding node (cont.)

Bill does not have any special preference. He sets the guarding node of his tuple to be the same as his sensitive value.


Bill 5 M 14000 dyspepsia dyspepsia

A personalized approach

Name   Age  Sex  Zipcode  Disease        guarding node
Andy   4    M    12000    gastric ulcer  stomach disease
Bill   5    M    14000    dyspepsia      dyspepsia
Ken    6    M    18000    pneumonia      respiratory infection
Nash   9    M    19000    bronchitis     bronchitis
Alice  12   F    22000    flu            flu
Betty  19   F    24000    pneumonia      pneumonia
Linda  21   F    33000    gastritis      gastritis
Jane   25   F    34000    gastritis      Ø
Sarah  28   F    37000    flu            Ø
Mary   56   F    58000    flu            flu

Personalized anonymity

No adversary should be able to breach the privacy requirement of any guarding node with a probability above pbreach.

If pbreach = 0.3, then no adversary can have more than 30% probability to find out that:
Andy had a stomach disease
Bill had dyspepsia
…


Why SA generalization?

How many female patients are there with age above 30?
Estimated answer: 4 ∙ (60 – 30 + 1) / (60 – 21 + 1) = 3.1 ≈ 3
Real answer: 1

Pure QI generalization




Microdata

Name   Age  Sex  Zipcode  Disease
Andy   4    M    12000    gastric ulcer
Bill   5    M    14000    dyspepsia
Ken    6    M    18000    pneumonia
Nash   9    M    19000    bronchitis
Alice  12   F    22000    flu
Betty  19   F    24000    pneumonia
Linda  21   F    33000    gastritis
Jane   25   F    34000    gastritis
Sarah  28   F    37000    flu
Mary   56   F    58000    flu
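The estimate for the age query comes from assuming that the 4 tuples of the QI group are spread uniformly over the generalized Age interval [21, 60]. A minimal sketch of that computation follows. The function name `estimate_count` is hypothetical, and the query is read as Age in [30, 60] to match the slide's arithmetic.

```python
def estimate_count(group_size, group_lo, group_hi, query_lo, query_hi):
    """Estimate matches in one QI group, assuming ages are uniform over the interval."""
    lo, hi = max(group_lo, query_lo), min(group_hi, query_hi)
    overlap = max(0, hi - lo + 1)          # number of integer ages in both ranges
    return group_size * overlap / (group_hi - group_lo + 1)

# QI group of 4 female tuples generalized to Age [21, 60]; query range [30, 60].
estimated = estimate_count(4, 21, 60, 30, 60)

# Actual ages of those four patients in the microdata (Linda, Jane, Sarah, Mary).
actual = sum(1 for age in (21, 25, 28, 56) if age >= 30)

print(estimated)  # 3.1
print(actual)     # 1
```

The gap between 3.1 and 1 is the cost of widening the Age interval to [21, 60] just to absorb Mary's tuple.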

SA generalization (cont.)

With SA generalization

Age       Sex  Zipcode         Disease
[1, 5]    M    [10001, 15000]  gastric ulcer
[1, 5]    M    [10001, 15000]  dyspepsia
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
[21, 30]  F    [30001, 40000]  gastritis
[21, 30]  F    [30001, 40000]  gastritis
[21, 30]  F    [30001, 40000]  flu
56        F    58000           respiratory infection

Evaluation of disclosure risk

What is the probability that the adversary can find out that “Andy had a stomach disease”?

A voter registration list

Name   Age  Sex  Zipcode
Andy   4    M    12000
Bill   5    M    14000
Ken    6    M    18000
Nash   9    M    19000
Mike   7    M    17000
Alice  12   F    22000
Betty  19   F    24000
Linda  21   F    33000
Jane   25   F    34000
Sarah  28   F    37000
Mary   56   F    58000

The published data

Age       Sex  Zipcode         Disease
[1, 10]   M    [10001, 20000]  gastric ulcer
[1, 10]   M    [10001, 20000]  dyspepsia
[1, 10]   M    [10001, 20000]  pneumonia
[1, 10]   M    [10001, 20000]  bronchitis
[11, 20]  F    [20001, 25000]  flu
[11, 20]  F    [20001, 25000]  pneumonia
21        F    33000           stomach disease
25        F    34000           gastritis
28        F    37000           flu
56        F    58000           respiratory infection

Combinatorial reconstruction (cont.)

Can each individual appear more than once?
No = the primary case
Yes = the non-primary case

Some possible reconstructions:

[Figure: two panels, “The primary case” and “The non-primary case”, each linking the candidates Andy, Bill, Ken, Nash, and Mike to the sensitive values gastric ulcer, dyspepsia, pneumonia, and bronchitis.]


Breach probability (primary)

Totally 120 possible reconstructions.

If Andy is associated with a stomach disease in nb reconstructions, the probability that the adversary should associate Andy with some stomach problem is nb / 120.

Andy is associated with
gastric ulcer in 24 reconstructions
dyspepsia in 24 reconstructions
gastritis in 0 reconstructions

nb = 48

The breach probability for Andy’s tuple is 48 / 120 = 2 / 5.


Breach probability (non-primary)

Totally 625 possible reconstructions

Andy is associated with gastric ulcer or dyspepsia or gastritis in 225 reconstructions.

nb = 225

The breach probability for Andy’s tuple is 225 / 625 = 9 / 25.
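Both breach probabilities can be reproduced by brute-force enumeration of the reconstructions. The sketch below is an illustration rather than the authors' algorithm. It assumes the 4 published tuples of the first QI group are assigned to the 5 matching voters, without repetition in the primary case and with repetition in the non-primary case; the function name `breach_counts` is hypothetical.

```python
from fractions import Fraction
from itertools import permutations, product

CANDIDATES = ["Andy", "Bill", "Ken", "Nash", "Mike"]
DISEASES = ["gastric ulcer", "dyspepsia", "pneumonia", "bronchitis"]
STOMACH = {"gastric ulcer", "dyspepsia", "gastritis"}  # subtree of "stomach disease"

def breach_counts(victim, primary):
    """Return (nb, total): reconstructions linking `victim` to a stomach disease."""
    if primary:
        # Each individual owns at most one tuple.
        assignments = permutations(CANDIDATES, len(DISEASES))
    else:
        # An individual may own several tuples.
        assignments = product(CANDIDATES, repeat=len(DISEASES))
    nb = total = 0
    for owners in assignments:
        total += 1
        if any(owner == victim and disease in STOMACH
               for owner, disease in zip(owners, DISEASES)):
            nb += 1
    return nb, total

for primary in (True, False):
    nb, total = breach_counts("Andy", primary)
    print(primary, nb, total, Fraction(nb, total))
# True 48 120 2/5
# False 225 625 9/25
```

Enumeration is only feasible for tiny groups like this one; it is meant to confirm the counts on the slides, not to scale.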


A defect of personalized anonymity

Does not guard against background knowledge. Recall that l-diversity can achieve this purpose.

But it seems possible to adapt the personalized approach to tackle background knowledge. Future work?

Other privacy principles

k-gather. Due to [Aggarwal et al., PODS, 2006]

Suffers from the problems of k-anonymity.

(a, k)-anonymity Due to [Wong et al., KDD, 2006]

t-closeness. Recently proposed by [Li and Li, ICDE, 2007]

Issues

Privacy principle: What is adequate privacy protection?

Distortion approach: How to achieve the privacy principle?

Three approaches

Suppression
We do not discuss it because the utility of the resulting table is low; it can be regarded as a special case of generalization.

Generalization
Due to [Sweeney, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 2002]

Anatomy (also called “bucketization”)
Due to [Xiao and Tao, VLDB, 2006]

Each of the above approaches can be integrated with all the privacy principles discussed earlier.

A multidimensional view of generalization

[Figure: tuples 1 to 8 plotted as points in a plane with x = Age (20 to 70) and y = Zipcode (10k to 60k); tuples 6 and 7 coincide; R1 and R2 are the rectangles of the QI groups.]

Taxonomy of generalization

Local recoding
(Generalized) rectangles may overlap.
Suppression is a special case of local recoding.

Global recoding
All rectangles are disjoint.

[LeFevre et al. SIGMOD, 2005]

Taxonomy of generalization (cont.)

Global recoding can be further divided.

Single-dimension recoding
Rectangles form a grid.

Multi-dimension recoding
The opposite of single-dimension recoding.


Single-dimension recoding can be further divided.
Full-domain recoding
Full-subtree recoding

Both assume a hierarchy on each QI attribute. Example: A hierarchy on Age

[1, 90]
[1, 30]   [31, 60]   [61, 90]
[1, 10] [11, 20] [21, 30]   [31, 40] [41, 50] [51, 60]   [61, 70] [71, 80] [81, 90]
1, 2, 3, …, 10   ...


Full-domain recoding
All age values must be generalized to the same level of the hierarchy.


Full-subtree recoding
The subtrees of all generalized values must be disjoint.
Permissible generalization: [1, 30], [31, 40], [41, 50], [51, 60], [61, 90].
Illegal generalization: [1, 10], [1, 30], [31, 60], [61, 90].

Why all these generalization types?

Reason 1: If a dataset is generalized in a more restricted manner, less preprocessing is required before it can be analyzed by a standard statistical tool (such as SAS).


Reason 2: More restrictive generalization is usually faster to compute and easier to analyze.

level 3: [1, 90]
level 2: [1, 30]   [31, 60]   [61, 90]
level 1: [1, 10] [11, 20] [21, 30]   [31, 40] [41, 50] [51, 60]   [61, 70] [71, 80] [81, 90]
level 0: 1, 2, 3, …, 10   ...


Reason 3: Less restrictive generalization promises more accurate data analysis, provided that a sophisticated analytical method is used.

Generalization algorithms

Operate on a quality metric. Examples:
The generalization level (for full-domain recoding)
Total rectangle size (for local recoding)
…

Mostly heuristics-based. Finding the optimal generalization is often NP-hard.

Defect of generalization

Query A:
SELECT COUNT(*) FROM Unknown-Microdata
WHERE Disease = ‘pneumonia’ AND Age in [0, 30]
AND Zipcode in [10001, 20000]

Age Sex Zipcode Disease

[21, 60] M [10001, 60000] pneumonia

[21, 60] M [10001, 60000] dyspepsia

[21, 60] M [10001, 60000] dyspepsia

[21, 60] M [10001, 60000] pneumonia

[61, 70] F [10001, 60000] flu

[61, 70] F [10001, 60000] gastritis

[61, 70] F [10001, 60000] flu

[61, 70] F [10001, 60000] bronchitis

Estimated answer: 2p, where p is the probability that each of the two tuples satisfies the query conditions on the Age and Zipcode.

Defect of generalization (cont.)

p = Area( R1 ∩ Q ) / Area( R1 ) = 0.05, where R1 is the rectangle of the QI group holding the two pneumonia tuples (Age [21, 60], Zipcode [10001, 60000]) and Q is the rectangle of Query A.

Estimated answer for Query A: 2p = 0.1
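The value p = 0.05 is the fraction of the QI group's rectangle covered by the query rectangle, computed one attribute at a time. The sketch below reproduces the estimate under the assumption that tuples are uniformly distributed inside the rectangle and that ranges are counted over integer values. The function name `overlap_fraction` is hypothetical.

```python
def overlap_fraction(group_range, query_range):
    """Fraction of the integer values in group_range that also lie in query_range."""
    g_lo, g_hi = group_range
    q_lo, q_hi = query_range
    overlap = max(0, min(g_hi, q_hi) - max(g_lo, q_lo) + 1)
    return overlap / (g_hi - g_lo + 1)

# Rectangle of the QI group holding the two pneumonia tuples.
group_age, group_zip = (21, 60), (10001, 60000)
# Rectangle of Query A.
query_age, query_zip = (0, 30), (10001, 20000)

p = overlap_fraction(group_age, query_age) * overlap_fraction(group_zip, query_zip)
estimate = 2 * p  # two pneumonia tuples in the group

print(round(p, 4))         # 0.05 = (10 / 40) * (10000 / 50000)
print(round(estimate, 4))  # 0.1
```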

Defect of generalization (cont.)

Query A:
SELECT COUNT(*) FROM Unknown-Microdata
WHERE Disease = ‘pneumonia’ AND Age in [0, 30]
AND Zipcode in [10001, 20000]

Estimated answer = 0.1

Name   Age  Sex  Zipcode  Disease
Bob    23   M    11000    pneumonia
Ken    27   M    13000    dyspepsia
Peter  35   M    59000    dyspepsia
Sam    59   M    12000    pneumonia
Jane   61   F    54000    flu
Linda  65   F    25000    gastritis
Alice  65   F    25000    flu
Mandy  70   F    30000    bronchitis

The exact answer = 1

Defect of generalization (cont.)

Cause of inaccuracy: QI distribution inside each QI group is lost!

Anatomy

Releases a quasi-identifier table (QIT) and a sensitive table (ST).

Quasi-identifier table (QIT)

Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2

Sensitive table (ST)

Group-ID  Disease     Count
1         dyspepsia   2
1         pneumonia   2
2         bronchitis  1
2         flu         2
2         gastritis   1


Microdata

Age  Sex  Zipcode  Disease
23   M    11000    pneumonia
27   M    13000    dyspepsia
35   M    59000    dyspepsia
59   M    12000    pneumonia
61   F    54000    flu
65   F    25000    gastritis
65   F    25000    flu
70   F    30000    bronchitis

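Anatomy (cont.)

1. Decide an l-diverse partition of the tuples.

A 2-diverse partition

             Age  Sex  Zipcode  Disease
QI group 1   23   M    11000    pneumonia
             27   M    13000    dyspepsia
             35   M    59000    dyspepsia
             59   M    12000    pneumonia
QI group 2   61   F    54000    flu
             65   F    25000    gastritis
             65   F    25000    flu
             70   F    30000    bronchitis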

Anatomy (cont.)

2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition.

          quasi-identifier table (QIT)   sensitive table (ST)
          Age  Sex  Zipcode              Disease
group 1   23   M    11000                pneumonia
          27   M    13000                dyspepsia
          35   M    59000                dyspepsia
          59   M    12000                pneumonia
group 2   61   F    54000                flu
          65   F    25000                gastritis
          65   F    25000                flu
          70   F    30000                bronchitis

Anatomy (cont.)

2. Generate a quasi-identifier table (QIT) and a sensitive table (ST) based on the selected partition.

quasi-identifier table (QIT)

Age  Sex  Zipcode  Group-ID
23   M    11000    1
27   M    13000    1
35   M    59000    1
59   M    12000    1
61   F    54000    2
65   F    25000    2
65   F    25000    2
70   F    30000    2

sensitive table (ST)

Group-ID  Disease
1         pneumonia
1         dyspepsia
1         dyspepsia
1         pneumonia
2         flu
2         gastritis
2         flu
2         bronchitis

Privacy preservation

Given a pair of QIT and ST generated from an l-diverse partition, an adversary can infer the sensitive value of each individual with confidence at most 1 / l.


[Tables: the quasi-identifier table (QIT) and sensitive table (ST) of the Anatomy example, shown with the adversary's knowledge of one individual.]

Name  Age  Sex  Zipcode
Bob   23   M    11000
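Step 2 of anatomy and the 1 / l confidence bound can both be illustrated on the running example. The sketch below assumes the partition has already been chosen (step 1) and simply splits each QI group into QIT rows and per-disease counts for the ST. The names `anatomize` and `max_confidence` are hypothetical, and this is not the partitioning algorithm from the cited paper.

```python
from collections import Counter
from fractions import Fraction

def anatomize(groups):
    """Build the QIT and ST from a partition given as lists of (age, sex, zip, disease)."""
    qit, st = [], []
    for group_id, tuples in enumerate(groups, start=1):
        for age, sex, zipcode, _disease in tuples:
            qit.append((age, sex, zipcode, group_id))
        counts = Counter(disease for *_qi, disease in tuples)
        for disease in sorted(counts):
            st.append((group_id, disease, counts[disease]))
    return qit, st

def max_confidence(st):
    """Largest share any single disease has inside its own group."""
    totals = Counter()
    for group_id, _disease, count in st:
        totals[group_id] += count
    return max(Fraction(count, totals[group_id]) for group_id, _disease, count in st)

partition = [
    [(23, "M", 11000, "pneumonia"), (27, "M", 13000, "dyspepsia"),
     (35, "M", 59000, "dyspepsia"), (59, "M", 12000, "pneumonia")],
    [(61, "F", 54000, "flu"), (65, "F", 25000, "gastritis"),
     (65, "F", 25000, "flu"), (70, "F", 30000, "bronchitis")],
]

qit, st = anatomize(partition)
for row in st:
    print(row)
print(max_confidence(st))  # 1/2, matching the 1 / l bound for l = 2
```

An adversary who locates Bob (23, M, 11000) in group 1 sees pneumonia in 2 of the group's 4 tuples, so the confidence is 50%.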


Accuracy of data analysis

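Query A:
SELECT COUNT(*) FROM Unknown-Microdata
WHERE Disease = ‘pneumonia’ AND Age in [0, 30]
AND Zipcode in [10001, 20000]

2 patients contracted pneumonia.
2 out of 4 patients satisfy the query conditions on Age and Zipcode.
Estimated answer = 2 * 2 / 4 = 1.

QIT tuples of group 1

      Age  Sex  Zipcode  Group-ID
t1    23   M    11000    1
t2    27   M    13000    1
t3    35   M    59000    1
t4    59   M    12000    1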
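The anatomy estimate multiplies the pneumonia count of each group in the ST by the fraction of that group's QIT tuples that satisfy the Age and Zipcode conditions. A minimal sketch under that reading follows. The function name `estimate_count` and the tuple layouts are assumptions made for illustration.

```python
QIT = [  # (age, sex, zipcode, group_id)
    (23, "M", 11000, 1), (27, "M", 13000, 1), (35, "M", 59000, 1), (59, "M", 12000, 1),
    (61, "F", 54000, 2), (65, "F", 25000, 2), (65, "F", 25000, 2), (70, "F", 30000, 2),
]
ST = [  # (group_id, disease, count)
    (1, "dyspepsia", 2), (1, "pneumonia", 2),
    (2, "bronchitis", 1), (2, "flu", 2), (2, "gastritis", 1),
]

def estimate_count(qit, st, disease, age_range, zip_range):
    """Estimate COUNT(*) for a disease plus range conditions on Age and Zipcode."""
    estimate = 0.0
    for group_id, st_disease, count in st:
        if st_disease != disease:
            continue
        members = [row for row in qit if row[3] == group_id]
        matching = [row for row in members
                    if age_range[0] <= row[0] <= age_range[1]
                    and zip_range[0] <= row[2] <= zip_range[1]]
        # Exact QI values are published, so the matching fraction is exact.
        estimate += count * len(matching) / len(members)
    return estimate

print(estimate_count(QIT, ST, "pneumonia", (0, 30), (10001, 20000)))  # 1.0
```

Because the QIT keeps exact QI values, the only uncertainty left is which tuples in the group carry the disease, which is why the estimate matches the exact answer here.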

A defect of anatomy

Existence breach: Does an individual exist in the microdata?

Future work

Re-publication

Tackle stronger background knowledge
Recent work [Martin et al., ICDE, 2007]

Improving utility
Pioneering work [Kifer and Gehrke, SIGMOD, 2006]

Application to specific (non-trivial) applications
Location privacy
Pioneering work [Mokbel et al., VLDB, 2006]
