1 Probabilistic Inference Protection on Anonymized Data Raymond Chi-Wing Wong (the Hong Kong University of Science and Technology) Ada Wai-Chee Fu (the

1

Probabilistic Inference Protection on Anonymized Data

Raymond Chi-Wing Wong (the Hong Kong University of Science and Technology)

Ada Wai-Chee Fu (the Chinese University of Hong Kong)

Ke Wang (Simon Fraser University)

Yabo Xu (Sun Yat-sen University)

Jian Pei (Simon Fraser University)

Philip S. Yu (University of Illinois at Chicago)

Prepared by Raymond Chi-Wing Wong

Presented by Raymond Chi-Wing Wong

2

Outline

1. Introduction
   l-diversity
2. Background Knowledge
3. Proposed Model
4. Conclusion

3

1. l-diversity

Patient    Gender  Age  Disease
Alan       Male    41   Lung Cancer
Betty      Female  42   Hypertension
Catherine  Female  63   Flu
Diana      Female  64   HIV

Release the data set to public

Bucketization

Knowledge 1

QI Table

Gender  Age  GID
Male    41   L1
Female  42   L1
Female  63   L2
Female  64   L2

Sensitive Table

GID  Disease
L1   Lung Cancer
L1   Hypertension
L2   Flu
L2   HIV

Knowledge 2

I also know that Alan is (Male, 41).

Combining Knowledge 1 and Knowledge 2, we can deduce that Alan is linked to Lung Cancer with probability 1/2.

In other words, P(Alan is linked to Lung Cancer) is at most 1/2.

Simplified 2-diversity: to generate a data set such that each individual is linked to a sensitive value (e.g., Lung Cancer) with probability at most 1/2

This dataset satisfies 2-diversity.
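The simplified 2-diversity check above can be sketched in a few lines. This is only an illustration: it assumes that, without further background knowledge, the probability of linking an individual to a sensitive value is the fraction of tuples in the individual's group (same GID) that carry that value. The function names are hypothetical and not taken from the slides.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# Sensitive table from the slide: (GID, Disease)
sensitive_table = [
    ("L1", "Lung Cancer"), ("L1", "Hypertension"),
    ("L2", "Flu"), ("L2", "HIV"),
]

def max_link_probability(rows):
    """Largest fraction that any single sensitive value takes inside any group."""
    groups = defaultdict(Counter)
    for gid, disease in rows:
        groups[gid][disease] += 1
    return max(
        Fraction(max(counts.values()), sum(counts.values()))
        for counts in groups.values()
    )

def satisfies_simplified_l_diversity(rows, l):
    """True if every individual is linked to any sensitive value with probability <= 1/l."""
    return max_link_probability(rows) <= Fraction(1, l)

print(max_link_probability(sensitive_table))                 # 1/2
print(satisfies_simplified_l_diversity(sensitive_table, 2))  # True
```

Exact fractions are used so that the comparison with 1/l is not affected by floating-point rounding.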

4






Knowledge 3: QI Based Distribution

p()     Lung Cancer  Not Lung Cancer
Male    0.1          0.9
Female  0.003        0.997

This can be obtained from statistical reports from the US Department of Health and Human Services and other statistical data sources discussed in previous studies.

5





Combining Knowledge 1, 2 and 3, we can deduce that Alan is linked to Lung Cancer with very high probability (much greater than 1/2).

Why?

According to Knowledge 3, a male patient is much more likely to be linked to Lung Cancer than a female patient (0.1 versus 0.003), and Alan is the only male in the group with GID = L1.

6




We need to formulate how to calculate the probability (e.g., P(Alan is linked to Lung Cancer)) according to Knowledge 1, 2 and 3.

Objective: to make sure that the probability is bounded by a threshold (e.g., 1/2).
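One way to make this calculation concrete is a possible-worlds sketch. It is an illustration under assumptions that the slides do not state: only the Gender column of Knowledge 3 is used, each tuple is treated as independently holding Lung Cancer with its global probability, and the adversary conditions on the published number of Lung Cancer tuples in the group. The name `link_probability` and its parameters are hypothetical.

```python
from itertools import combinations
from math import prod

def link_probability(global_probs, holders, target):
    """P(tuple `target` holds the sensitive value | exactly `holders` tuples hold it)."""
    n = len(global_probs)
    hit = total = 0.0
    for world in combinations(range(n), holders):
        chosen = set(world)
        # Weight of this possible world: product of per-tuple global probabilities.
        weight = prod(
            p if i in chosen else 1.0 - p
            for i, p in enumerate(global_probs)
        )
        total += weight
        if target in chosen:
            hit += weight
    return hit / total

# Group L1: Alan (Male, 0.1) and Betty (Female, 0.003); one Lung Cancer tuple.
p_alan = link_probability([0.1, 0.003], holders=1, target=0)
print(round(p_alan, 4))  # 0.9736
```

Under these assumptions the two possible worlds have weights 0.1 x 0.997 and 0.9 x 0.003, so Alan's probability is 0.0997 / 0.1024, far above 1/2. The enumeration grows combinatorially with the group size, which is why a cheaper check is attractive.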

7

1. l-diversity



8

1. l-diversity

Challenge 1: Calculating the probability (e.g., P(Alan is linked to Lung Cancer)) is computationally expensive.



9

1. l-diversity

Challenge 1: Calculating the probability (e.g., P(Alan is linked to Lung Cancer)) is computationally expensive.

Challenge 2: The formula for this probability is not monotonic with respect to the A-group size.

Most existing privacy studies involve formulae that are monotonic, and most existing algorithms (e.g., Incognito and Mondrian) rely on this monotonicity property.


10

1. l-diversity




Objective: to make sure that P(Alan is linked to Lung Cancer) ≤ 1/2

11

1. l-diversity




Related Work: There is a closely related work [LLZ09] for this problem.

[LLZ09] T. Li, N. Li and J. Zhang, “Modeling and Integrating Background Knowledge in Data Anonymization”, ICDE 2009.

[LLZ09] approximates the formula for this probability. Thus, it gives no solid guarantee on the privacy protection.

12

1. l-diversity




Contributions: We propose a condition. If this condition is satisfied, we can guarantee the privacy requirement (i.e., P(Alan is linked to Lung Cancer) ≤ 1/2).

Besides, this condition can overcome Challenge 1 and Challenge 2. Specifically,
(1) computing the condition is computationally cheap, and
(2) the condition involves a monotonic function on the A-group size.

13

1. l-diversity

The major idea of the condition includes some simple calculations based on the statistics of an A-group:




1. The size of the A-group (N)
2. The privacy requirement (r)
3. The global probability of each tuple in the A-group being linked to a sensitive value

14

1. l-diversity

The major idea of the condition includes some simple calculations based on the statistics of an A-group:

1. The size of the A-group (N)
2. The privacy requirement (r)
3. The global probability of each tuple in the A-group being linked to a sensitive value

Condition Check
Input: N, r, global probabilities
Output: Satisfied / Not Satisfied

If it is satisfied, we deduce that the privacy requirement is satisfied (e.g., P(Alan is linked to Lung Cancer) ≤ 1/2).

15

4. Conclusion

1. Background Knowledge
   QI-based Probability Distribution
2. Two Challenges
   Challenge 1: The formula for the probability is computationally expensive
   Challenge 2: The formula is not monotonic
3. Proposed Condition
   overcomes Challenge 1 and Challenge 2

16

Q&A

17



Bucketization: a way to prevent this linkage.

Gender  Age  Disease       GID
Male    41   Lung Cancer   L1
Female  42   Hypertension  L1
Female  63   Flu           L2
Female  64   HIV           L2

The two tuples with GID = L1 form an anonymized group (A-group).

The two tuples with GID = L2 form another A-group.

There is another way to prevent this linkage called Generalization. The following principle to be discussed can also be applied to Generalization.

18





20



Knowledge 1

Gender  Age  Disease
Male    41   Lung Cancer
Female  42   Hypertension
Female  63   Flu
Female  64   HIV

Knowledge 2

I also know that Alan is (Male, 41).

Combining Knowledge 1 and Knowledge 2, we can deduce that Alan is linked to Lung Cancer.

21

1. l-diversity

Monotonicity

Consider two A-groups

Gender  Age  GID
Male    41   L1
Female  42   L1
Female  63   L2
Female  64   L2

GID  Disease
L1   Lung Cancer
L1   Hypertension
L2   Flu
L2   HIV

An A-group with GID = L1: P(an individual is linked to a sensitive value) = 0.5

Merging

An A-group “merged” from these two A-groups: P(an individual is linked to a sensitive value) ≤ 0.5

The probability is monotonically decreasing when the size of the A-group increases.
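As a worked instance, assume there is no background knowledge beyond the published tables, so every assignment of sensitive values inside an A-group is equally likely (the simplified l-diversity reading). Under that assumption, merging L1 and L2 spreads the single Lung Cancer tuple over four individuals instead of two:

```latex
\[
P(\text{Alan is linked to Lung Cancer} \mid \text{A-group } L1) = \frac{1}{2},
\qquad
P(\text{Alan is linked to Lung Cancer} \mid \text{A-group } L1 \cup L2) = \frac{1}{4} \le \frac{1}{2}.
\]
```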


22

1. l-diversity

Non-Monotonicity

Consider two A-groups

Gender  Age  GID
Male    41   L1
Female  42   L1
Female  63   L2
Female  64   L2

GID  Disease
L1   Lung Cancer
L1   Hypertension
L2   Flu
L2   HIV

Merging

An A-group “merged” from these two A-groups: it is possible that P(an individual is linked to a sensitive value) > 0.5

The probability is not monotonically decreasing when the size of the A-group increases.
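A small numeric sketch can show how merging may raise the probability once a QI-based distribution is known. It rests on assumptions that are not on the slides: a possible-worlds model in which each tuple independently holds the sensitive value with its global probability, conditioned on the published count per group, and two hypothetical A-groups (two males, two females, one Lung Cancer tuple each) chosen only for illustration.

```python
from itertools import combinations
from math import prod

def link_probability(global_probs, holders, target):
    """P(tuple `target` holds the sensitive value | exactly `holders` tuples hold it)."""
    hit = total = 0.0
    for world in combinations(range(len(global_probs)), holders):
        chosen = set(world)
        weight = prod(
            p if i in chosen else 1.0 - p
            for i, p in enumerate(global_probs)
        )
        total += weight
        if target in chosen:
            hit += weight
    return hit / total

# Hypothetical A-groups, each containing exactly one Lung Cancer tuple.
males = [0.1, 0.1]        # global probabilities of two male tuples
females = [0.003, 0.003]  # global probabilities of two female tuples

before = link_probability(males, holders=1, target=0)           # 0.5
after = link_probability(males + females, holders=2, target=0)  # about 0.95

print(round(before, 4), round(after, 4))
```

Each group alone gives 0.5 for every member, yet in the merged group the two males absorb almost all of the weight for the two Lung Cancer tuples. Under this model, a larger A-group is therefore not automatically safer.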


23

1. l-diversity

Objective: to make sure that P(Alan is linked to Lung Cancer) ≤ 1/2

Gender  Age  GID
Male    41   L1
Female  42   L1
Female  63   L2
Female  64   L2

GID  Disease
L1   Lung Cancer
L1   Hypertension
L2   Flu
L2   HIV

Knowledge 2

I also know that Alan is (Male, 41).

Knowledge 3

p()     Lung Cancer  Not Lung Cancer
Male    0.1          0.9
Female  0.003        0.997

Suppose we are interested in knowing whether P(Alan is linked to Lung Cancer) ≤ 1/2. For the sake of illustration, we focus on attribute Gender only.

Condition Check
Input: N = 2, r = 2, global probabilities = 0.1 and 0.003

If it is satisfied, we deduce that the privacy requirement is satisfied (e.g., P(Alan is linked to Lung Cancer) ≤ 1/2).

24

Condition Check
Input: N = 2, r = 2, global probabilities = 0.1 and 0.003

What is the condition check?

In the condition check, there is an expression Δceil in terms of N, r and global probabilities to compute.

25


Theorem 1: If the condition is satisfied, then the privacy requirement is satisfied.

In the condition check, there is an expression Δceil in terms of N, r and global probabilities to compute.

26

Theorem 2: Computing Δceil can be done in O(1) time.

This means that we overcome Challenge 1.
Challenge 1: Calculating the probability is computationally expensive.

Theorem 3: Δceil is a monotonically increasing function on N, where N is the A-group size.

This means that we overcome Challenge 2.
Challenge 2: The formula for the original probability is not monotonic with respect to the A-group size.

27

Condition Check
Input: N = 2, r = 2, global probabilities f1 = 0.1 and f2 = 0.003

fmax = max{f1, f2} = max{0.1, 0.003} = 0.1
(the greatest global probability)

Δ1 = fmax - f1 = 0.1 - 0.1 = 0
Δ2 = fmax - f2 = 0.1 - 0.003 = 0.097
(the difference between the greatest global probability and the “current” global probability)

The condition is whether this difference Δ1 (and Δ2) is at most an expression Δceil in terms of N, r and fmax:

$$\Delta_{ceil} = \frac{(N - r)\, f_{max}}{\dfrac{f_{max}\,(r - 1)}{1 - f_{max}} + (N - 1)}$$

28


fmax = max{f1, f2} = max{0.1, 0.003} = 0.1
(the greatest global probability)

Δ1 = fmax - f1 = 0.1 - 0.1 = 0
Δ2 = fmax - f2 = 0.1 - 0.003 = 0.097
(the difference between the greatest global probability and the “current” global probability)

The condition is whether this difference Δ1 (and Δ2) is at most an expression Δceil:

$$\Delta_{ceil} = \frac{(N - r)\, f_{max}}{\dfrac{f_{max}\,(r - 1)}{1 - f_{max}} + (N - 1)}$$

Theorem 1: If Δi ≤ Δceil is satisfied for each tuple i in the A-group, then the privacy requirement is satisfied.
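The condition check can be sketched as below. The slide's fraction is garbled in this copy, so the code assumes one particular reading: numerator (N - r) · fmax and denominator fmax(r - 1)/(1 - fmax) + (N - 1). That reading is an assumption, chosen because it keeps Δceil between 0 and fmax and makes it grow with N. The function names are hypothetical, and the sketch assumes r ≥ 2 and fmax < 1.

```python
def delta_ceil(n, r, f_max):
    """Assumed reading of the slide's expression for the threshold (needs f_max < 1)."""
    return (n - r) * f_max / (f_max * (r - 1) / (1.0 - f_max) + (n - 1))

def condition_satisfied(global_probs, r):
    """True if every difference f_max - f_i is at most delta_ceil."""
    n = len(global_probs)
    if n < r:
        # delta_ceil would be negative, so no difference (all >= 0) can satisfy it.
        return False
    f_max = max(global_probs)
    ceil = delta_ceil(n, r, f_max)
    return all(f_max - f <= ceil for f in global_probs)

# A-group L1 from the slides: Alan (0.1) and Betty (0.003), with r = 2.
print(delta_ceil(2, 2, 0.1))                 # 0.0
print(condition_satisfied([0.1, 0.003], 2))  # False
print(condition_satisfied([0.1, 0.1], 2))    # True
```

For N = r = 2 the threshold is 0, so Δ2 = 0.097 violates the condition, which matches the earlier observation that Alan's probability is much greater than 1/2. The check is a single arithmetic expression plus one pass over the group's global probabilities.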

29

Anonymization

The condition check gives hints for anonymization.

Initially, each tuple forms an A-group.

Repeat the following until each A-group satisfies the condition:
If there is an A-group violating the condition, merge this A-group with some other A-group such that the “merged” A-group satisfies the condition.
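The merging loop above can be sketched as follows. This is a minimal illustration, not the authors' algorithm: A-groups are represented only by the global probabilities of their tuples, the threshold uses an assumed reading of the Δceil expression (numerator (N - r) · fmax), and when no partner yields a satisfying merge the sketch simply merges with the first other group, which is a hypothetical fallback the slides do not specify.

```python
def condition_satisfied(global_probs, r):
    """Condition check under the assumed reading of delta_ceil (r >= 2, f_max < 1)."""
    n = len(global_probs)
    if n < r:
        return False  # the threshold would be negative
    f_max = max(global_probs)
    ceil = (n - r) * f_max / (f_max * (r - 1) / (1.0 - f_max) + (n - 1))
    return all(f_max - f <= ceil for f in global_probs)

def anonymize(global_probs, r):
    """Greedy merging: start with singleton A-groups, merge violators."""
    groups = [[f] for f in global_probs]  # initially each tuple forms an A-group
    while len(groups) > 1:
        bad = next((g for g in groups if not condition_satisfied(g, r)), None)
        if bad is None:
            break  # every A-group satisfies the condition
        others = [g for g in groups if g is not bad]
        # Prefer a partner whose merge satisfies the condition;
        # otherwise fall back to the first other group (hypothetical choice).
        partner = next(
            (g for g in others if condition_satisfied(bad + g, r)),
            others[0],
        )
        groups = [g for g in others if g is not partner] + [bad + partner]
    return groups

groups = anonymize([0.1, 0.003, 0.1, 0.003], r=2)
print(groups)  # [[0.1, 0.1], [0.003, 0.003]]
```

With these inputs the sketch pairs tuples that share the same global probability, because any group of size N = r must have identical probabilities to pass the check. Each pass removes one group, so the loop always terminates.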

30

B.1.2 K-Anonymity

Customer  Gender  District  Birthday  Cancer
Raymond   Male    Shatin    29 Jan    None
Peter     Male    Fanling   16 July   Yes
Kitty     Female  Shatin    21 Oct    None
Mary      Female  Shatin    8 Feb     None

Release the data set to public

Gender  District  Birthday  Cancer
Male    NT        *         None
Male    NT        *         Yes
Female  Shatin    *         None
Female  Shatin    *         None

Problem: to generate a data set such that each possible combination of (Gender, District, Birthday) values appears at least TWO times.

This data set is 2-anonymous.

Two Kinds of Generalisations
1. Shatin → NT
2. 16 July → *

“Shatin → NT” causes LESS distortion than “16 July → *”.

Question: how can we measure the distortion?
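The 2-anonymity claim can be checked with a short sketch. It assumes that Gender, District and Birthday are the quasi-identifier attributes and that k-anonymity means every combination of their values occurs at least k times. The function name is hypothetical.

```python
from collections import Counter

# Quasi-identifier columns (Gender, District, Birthday) of the two tables.
original = [
    ("Male", "Shatin", "29 Jan"), ("Male", "Fanling", "16 July"),
    ("Female", "Shatin", "21 Oct"), ("Female", "Shatin", "8 Feb"),
]
released = [
    ("Male", "NT", "*"), ("Male", "NT", "*"),
    ("Female", "Shatin", "*"), ("Female", "Shatin", "*"),
]

def is_k_anonymous(qi_rows, k):
    """True if every quasi-identifier combination appears at least k times."""
    return all(count >= k for count in Counter(qi_rows).values())

print(is_k_anonymous(original, 2))  # False
print(is_k_anonymous(released, 2))  # True
```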

31

B.1.2 K-Anonymity

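District hierarchy: Shatin, Fanling → NT; Mongkok, Jordon → KLN; NT, KLN → HKG
Birthday hierarchy: 29 Jan → Jan; 16 July → July; 21 Oct → Oct; 8 Feb → Feb; Jan, July, Oct, Feb → *
Gender hierarchy: Male, Female → *

Shatin → NT: Measurement = 1/2 = 0.5
16 July → *: Measurement = 2/2 = 1.0

Conclusion: We propose a measurement of distortion of the modified/anonymized data.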
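One reading of the two measurements is "levels climbed divided by the height of the hierarchy". The sketch below encodes that reading. It is an interpretation of the slide's numbers rather than a definition given there, and the helper names are hypothetical.

```python
from fractions import Fraction

# Child -> parent maps transcribed from the slide's hierarchies.
DISTRICT = {"Shatin": "NT", "Fanling": "NT", "Mongkok": "KLN", "Jordon": "KLN",
            "NT": "HKG", "KLN": "HKG"}
BIRTHDAY = {"29 Jan": "Jan", "16 July": "July", "21 Oct": "Oct", "8 Feb": "Feb",
            "Jan": "*", "July": "*", "Oct": "*", "Feb": "*"}

def levels_up(parent, value, target=None):
    """Parent steps from `value` to `target`, or to the root when target is None."""
    steps = 0
    while value != target:
        if value not in parent:          # reached the root
            if target is None:
                return steps
            raise ValueError(f"{target!r} is not an ancestor")
        value = parent[value]
        steps += 1
    return steps

def distortion(parent, original, generalized):
    """Levels climbed, as a fraction of the full height above `original`."""
    return Fraction(levels_up(parent, original, generalized),
                    levels_up(parent, original))

print(distortion(DISTRICT, "Shatin", "NT"))  # 1/2
print(distortion(BIRTHDAY, "16 July", "*"))  # 1
```

Replacing the unit step with a per-level weight would be one way to explore different weightings for each level.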

32

B.1.2 K-Anonymity

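District hierarchy: Shatin, Fanling → NT; Mongkok, Jordon → KLN; NT, KLN → HKG
Birthday hierarchy: 29 Jan → Jan; 16 July → July; 21 Oct → Oct; 8 Feb → Feb; Jan, July, Oct, Feb → *
Gender hierarchy: Male, Female → *

Shatin → NT: Measurement = 1/2 = 0.5

Can we modify the measurement? E.g., assign different weightings to each level.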
