1
Probabilistic Inference Protection on Anonymized Data
Raymond Chi-Wing Wong (The Hong Kong University of Science and Technology)
Ada Wai-Chee Fu (The Chinese University of Hong Kong)
Ke Wang (Simon Fraser University)
Yabo Xu (Sun Yat-sen University)
Jian Pei (Simon Fraser University)
Philip S. Yu (University of Illinois at Chicago)
Prepared and presented by Raymond Chi-Wing Wong
3
1. l-diversity

Original data:
Patient    Gender  Age  Disease
Alan       Male    41   Lung Cancer
Betty      Female  42   Hypertension
Catherine  Female  63   Flu
Diana      Female  64   HIV

Release the data set to the public after bucketization (Knowledge 1).

QI Table:
Gender  Age  GID
Male    41   L1
Female  42   L1
Female  63   L2
Female  64   L2

Sensitive Table:
GID  Disease
L1   Lung Cancer
L1   Hypertension
L2   Flu
L2   HIV
Knowledge 2: I also know that Alan has (Gender, Age) = (Male, 41).
Combining Knowledge 1 and Knowledge 2, we can deduce that Alan belongs to group L1, so he is linked to Lung Cancer with probability 1/2.
In other words, P(Alan is linked to Lung Cancer) is at most 1/2.
Simplified 2-diversity: generate a data set such that each individual is linked to any sensitive value (e.g., Lung Cancer) with probability at most 1/2.
This data set satisfies 2-diversity.
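The 1/2 figure above can be reproduced with a small sketch. It assumes the adversary has only Knowledge 1 and 2, so every tuple in a group is equally likely to own each disease in that group. The function names are illustrative and do not come from the slides.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# Sensitive Table from the slide: (GID, Disease) pairs.
sensitive_table = [
    ("L1", "Lung Cancer"),
    ("L1", "Hypertension"),
    ("L2", "Flu"),
    ("L2", "HIV"),
]

def max_link_probability(table):
    """Per group, the largest fraction of tuples that share one sensitive value."""
    groups = defaultdict(list)
    for gid, disease in table:
        groups[gid].append(disease)
    return {
        gid: Fraction(max(Counter(values).values()), len(values))
        for gid, values in groups.items()
    }

def satisfies_simplified_l_diversity(table, l):
    """True if no individual is linked to a sensitive value with probability above 1/l."""
    return all(p <= Fraction(1, l) for p in max_link_probability(table).values())

worst = max_link_probability(sensitive_table)
print(worst["L1"])                                           # 1/2
print(satisfies_simplified_l_diversity(sensitive_table, 2))  # True
```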
4
Knowledge 3 (QI-based distribution):
Gender  p(Lung Cancer)  p(Not Lung Cancer)
Male    0.1             0.9
Female  0.003           0.997
This distribution can be obtained from statistical reports from the US Department of Health and Human Services and from other statistical data sources discussed in previous studies.
5
Combining Knowledge 1, 2 and 3, we can deduce that Alan is linked to Lung Cancer with very high probability (much greater than 1/2).
Why? According to Knowledge 3, a male patient is far more likely to be linked to Lung Cancer than a female patient (0.1 versus 0.003), and Alan is the only male in group L1.
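One way to put a number on this is the sketch below. It assumes that patients are independent, that only Gender matters, and that each possible assignment of the two diseases in group L1 is weighted by the Knowledge 3 probabilities. This weighting is an illustrative assumption, not necessarily the exact formula used by the authors.

```python
from itertools import permutations
from math import prod

# Knowledge 3 restricted to Gender: global probability of Lung Cancer.
P_LUNG_CANCER = {"Male": 0.1, "Female": 0.003}

# Group L1: its members (Knowledge 1 + 2) and the diseases it contains.
members = [("Alan", "Male"), ("Betty", "Female")]
diseases = ["Lung Cancer", "Hypertension"]

def weight(assignment):
    """Unnormalised likelihood of one person-to-disease assignment."""
    return prod(
        P_LUNG_CANCER[gender] if disease == "Lung Cancer" else 1.0 - P_LUNG_CANCER[gender]
        for (_, gender), disease in zip(members, assignment)
    )

def link_probability(person_index, value):
    """Posterior probability that the given person owns `value`."""
    worlds = list(permutations(diseases))
    total = sum(weight(w) for w in worlds)
    hit = sum(weight(w) for w in worlds if w[person_index] == value)
    return hit / total

p_alan = link_probability(0, "Lung Cancer")
print(round(p_alan, 3))  # 0.974
```

Under these assumptions the two candidate assignments have weights 0.1 × 0.997 and 0.9 × 0.003, so the first one dominates. Enumerating every assignment like this grows quickly with the group size.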
6
We need to formulate how to calculate the probability (e.g., P(Alan is linked to Lung Cancer)) from Knowledge 1, 2 and 3.
Objective: to make sure that this probability is bounded by a threshold (e.g., 1/2).
8
1. l-diversity
Challenge 1: Calculating the probability (e.g., P(Alan is linked to Lung Cancer)) is computationally expensive.
9
1. l-diversity
Challenge 2: The formula for this probability is not monotonic with respect to the A-group size.
Most existing privacy models involve formulae that are monotonic, and most existing algorithms (e.g., Incognito and Mondrian) rely on this monotonicity property.
10
1. l-diversity
Objective: to make sure that P(Alan is linked to Lung Cancer) ≤ 1/2
11
1. l-diversity
Related work: a closely related study [LLZ09] addresses this problem.
[LLZ09] approximates the formula for this probability, so it gives no solid guarantee of privacy protection.
[LLZ09] T. Li, N. Li and J. Zhang, "Modeling and Integrating Background Knowledge in Data Anonymization", ICDE 2009
12
1. l-diversity
Contributions: We propose a condition. If this condition is satisfied, we can guarantee the privacy requirement (i.e., P(Alan is linked to Lung Cancer) ≤ 1/2).
Besides, this condition overcomes Challenge 1 and Challenge 2. Specifically,
(1) computing the condition is computationally cheap, and
(2) the condition involves a monotonic function of the A-group size.
13
1. l-diversity
The major idea of the condition is to perform some simple calculations based on the statistics of an A-group:
1. The size of the A-group (N)
2. The privacy requirement (r)
3. The global probability of linking each tuple in the A-group to a sensitive value
14
Condition Check
Inputs: N, r, global probabilities
Output: Satisfied / Not Satisfied
If it is satisfied, we deduce that the privacy requirement is satisfied (e.g., P(Alan is linked to Lung Cancer) ≤ 1/2).
15
4. Conclusion
1. Background knowledge: QI-based probability distribution
2. Two challenges
   Challenge 1: The formula for the probability is computationally expensive.
   Challenge 2: The formula is not monotonic.
3. The proposed condition overcomes Challenge 1 and Challenge 2.
17
1. l-diversity
The data set is released to the public without the Patient column, and bucketization assigns each tuple a group ID (GID):
Gender  Age  Disease       GID
Male    41   Lung Cancer   L1
Female  42   Hypertension  L1
Female  63   Flu           L2
Female  64   HIV           L2
The two tuples with GID = L1 form an anonymized group (A-group). The two tuples with GID = L2 form another A-group.
Bucketization is one way to prevent this linkage.
Another way to prevent this linkage is Generalization. The principle discussed next can also be applied to Generalization.
18
1. l-diversity
Bucketization publishes the grouped tuples as two separate tables: the QI Table (Gender, Age, GID) and the Sensitive Table (GID, Disease), both shown in full earlier.
20
1. l-diversity
Suppose the data set is released to the public with only the Patient column removed (Knowledge 1):
Gender  Age  Disease
Male    41   Lung Cancer
Female  42   Hypertension
Female  63   Flu
Female  64   HIV
Knowledge 2: I also know that Alan has (Gender, Age) = (Male, 41).
Combining Knowledge 1 and Knowledge 2, we can deduce that Alan is linked to Lung Cancer.
21
1. l-diversity: Monotonicity
Consider two A-groups, one with GID = L1 and one with GID = L2.
Suppose P(an individual is linked to a sensitive value) = 0.5 in the first A-group and 0.4 in the second.
Merging: in the A-group "merged" from these two A-groups, P(an individual is linked to a sensitive value) ≤ 0.5.
The probability is monotonically decreasing as the size of the A-group increases.
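The monotonic case can be checked with a short sketch. It assumes no QI-based distribution is used, so the probability is simply the fraction of tuples in the group that carry the sensitive value. The counts (1 of 2 and 2 of 5) are hypothetical and chosen only to match the 0.5 and 0.4 on the slide.

```python
from fractions import Fraction

def merged_fraction(count_a, size_a, count_b, size_b):
    """Fraction of tuples carrying a given sensitive value after merging two A-groups."""
    return Fraction(count_a + count_b, size_a + size_b)

p_a = Fraction(1, 2)  # hypothetical: 1 of 2 tuples carries the value (0.5)
p_b = Fraction(2, 5)  # hypothetical: 2 of 5 tuples carry the value (0.4)
p_merged = merged_fraction(1, 2, 2, 5)

print(p_merged)                   # 3/7
print(p_merged <= max(p_a, p_b))  # True
```

The merged fraction always lies between the two original fractions, so merging never raises the maximum.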
22
1. l-diversity: Non-Monotonicity
Consider again two A-groups, one with GID = L1 and one with GID = L2, this time under the probability formula that takes Knowledge 3 into account.
Suppose P(an individual is linked to a sensitive value) = 0.5 in the first A-group and 0.4 in the second.
Merging: in the A-group "merged" from these two A-groups, it is possible that P(an individual is linked to a sensitive value) > 0.5.
The probability is not monotonically decreasing as the size of the A-group increases.
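A sketch of how merging can raise the probability once a QI-based distribution is taken into account. It assumes independent tuples and weights every possible set of carriers of the sensitive value by the global probabilities. Both the weighting and the group compositions are illustrative assumptions, not taken from the slides.

```python
from itertools import combinations
from math import prod

def link_probability(global_probs, occurrences, target):
    """P(tuple `target` carries the sensitive value), given that exactly
    `occurrences` tuples in the A-group carry it."""
    def weight(carriers):
        return prod(p if i in carriers else 1.0 - p
                    for i, p in enumerate(global_probs))
    worlds = [set(c) for c in combinations(range(len(global_probs)), occurrences)]
    total = sum(weight(w) for w in worlds)
    return sum(weight(w) for w in worlds if target in w) / total

MALE, FEMALE = 0.1, 0.003  # global probabilities from Knowledge 3

group_a = [MALE, MALE]   # hypothetical group: 1 of 2 tuples has the value
group_b = [FEMALE] * 5   # hypothetical group: 2 of 5 tuples have the value

p_a = link_probability(group_a, 1, 0)                 # about 0.5
p_b = link_probability(group_b, 2, 0)                 # about 0.4
p_merged = link_probability(group_a + group_b, 3, 0)  # a male in the merged group

print(p_merged > max(p_a, p_b))  # True
```

In the merged group the two males absorb most of the probability mass, which is why the bound of 0.5 no longer holds.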
23
1. l-diversity
Suppose we are interested in knowing whether P(Alan is linked to Lung Cancer) ≤ 1/2, given Knowledge 1, 2 and 3. For the sake of illustration, we focus on attribute Gender only.
Condition Check for Alan's A-group (GID = L1):
N = 2
r = 2
Global probabilities = 0.1 (Male), 0.003 (Female)
Output: Satisfied / Not Satisfied
If it is satisfied, we deduce that the privacy requirement is satisfied (e.g., P(Alan is linked to Lung Cancer) ≤ 1/2).
24
What is the condition check?
In the condition check, there is an expression $\Delta_{ceil}$ to compute in terms of N, r and the global probabilities.
25
Theorem 1: If the condition is satisfied, then the privacy requirement is satisfied.
26
Theorem 2: Computing $\Delta_{ceil}$ can be done in O(1) time.
This means that we overcome Challenge 1 (calculating the probability is computationally expensive).
Theorem 3: $\Delta_{ceil}$ is a monotonically increasing function of N, where N is the A-group size.
This means that we overcome Challenge 2 (the formula for the original probability is not monotonic with respect to the A-group size).
27
What is the condition check?
Inputs: N = 2, r = 2, global probabilities $f_1 = 0.1$ and $f_2 = 0.003$. Output: Satisfied / Not Satisfied.
The greatest global probability:
$f_{max} = \max\{f_1, f_2\} = \max\{0.1, 0.003\} = 0.1$
The difference between the greatest global probability and the "current" global probability:
$\Delta_1 = f_{max} - f_1 = 0.1 - 0.1 = 0$
$\Delta_2 = f_{max} - f_2 = 0.1 - 0.003 = 0.097$
The condition is whether each difference $\Delta_i$ (here $\Delta_1$ and $\Delta_2$) is at most an expression $\Delta_{ceil}$ in terms of N, r and $f_{max}$:
$\Delta_{ceil} = \dfrac{(N - r)\, f_{max}}{\dfrac{f_{max}(r - 1)}{1 - f_{max}} + (N - 1)}$
Theorem 1: If $\Delta_i \le \Delta_{ceil}$ is satisfied for every tuple i in the A-group, then the privacy requirement is satisfied.
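The check can be sketched as follows for the running example. The slide's formula is flattened in this transcript, so reading the numerator as (N − r) × fmax is an assumption. The sketch also assumes r ≥ 2 and 0 < fmax < 1, and the function names are illustrative.

```python
def delta_ceil(n, r, fmax):
    """Ceiling on the allowed difference fmax - f_i.
    Assumption: the numerator is read as (N - r) * fmax."""
    return (n - r) * fmax / (fmax * (r - 1) / (1.0 - fmax) + (n - 1))

def condition_check(global_probs, r):
    """True ('Satisfied') if every tuple's difference from fmax is at most delta_ceil."""
    n = len(global_probs)
    fmax = max(global_probs)
    ceiling = delta_ceil(n, r, fmax)
    return all(fmax - f <= ceiling for f in global_probs)

# Alan's A-group: N = 2, r = 2, global probabilities 0.1 and 0.003.
print(delta_ceil(2, 2, 0.1))               # 0.0
print(condition_check([0.1, 0.003], r=2))  # False (0.097 > 0)
print(condition_check([0.1, 0.1], r=2))    # True
```

With N = r the ceiling is 0, so the group passes only when all its global probabilities are equal. Alan's group therefore does not satisfy the condition.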
29
Anonymization
The condition check gives hints for anonymization:
1. Initially, each tuple forms an A-group.
2. Repeat the following until each A-group satisfies the condition: if there is an A-group violating the condition, merge it with some other A-group such that the "merged" A-group satisfies the condition.
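The merging loop can be sketched as below. Several choices are assumptions of this sketch and are not stated on the slide: groups are scanned in order, a violating group falls back to merging with the first other group when no partner makes the merged group satisfy the condition, and the ceiling's numerator is read as (N − r) × fmax. It also assumes r ≥ 2 and global probabilities strictly between 0 and 1.

```python
def delta_ceil(n, r, fmax):
    # Assumed reading of the slide's expression: numerator (N - r) * fmax.
    return (n - r) * fmax / (fmax * (r - 1) / (1.0 - fmax) + (n - 1))

def satisfies(probs, r):
    fmax = max(probs)
    ceiling = delta_ceil(len(probs), r, fmax)
    return all(fmax - f <= ceiling for f in probs)

def anonymize(global_probs, r):
    """Greedy merging; returns A-groups as lists of tuple indices."""
    groups = [[i] for i in range(len(global_probs))]  # each tuple is an A-group

    def ok(group):
        return satisfies([global_probs[i] for i in group], r)

    while True:
        bad = next((g for g in groups if not ok(g)), None)
        if bad is None:
            return groups
        others = [g for g in groups if g is not bad]
        if not others:
            raise ValueError("no grouping satisfies the condition")
        # Prefer a partner whose merge satisfies the condition;
        # otherwise (sketch assumption) merge with the first other group.
        partner = next((g for g in others if ok(bad + g)), others[0])
        groups = [g for g in others if g is not partner] + [bad + partner]

# Hypothetical input: two males (0.1) and two females (0.003), r = 2.
result = anonymize([0.1, 0.003, 0.1, 0.003], r=2)
print(sorted(sorted(g) for g in result))  # [[0, 2], [1, 3]]
```

Every iteration reduces the number of groups by one, so the loop always terminates. Here it pairs tuples with equal global probabilities.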
30
B.1.2 K-Anonymity

Original data:
Customer  Gender  District  Birthday  Cancer
Raymond   Male    Shatin    29 Jan    None
Peter     Male    Fanling   16 July   Yes
Kitty     Female  Shatin    21 Oct    None
Mary      Female  Shatin    8 Feb     None

Release the data set to the public:
Gender  District  Birthday  Cancer
Male    NT        *         None
Male    NT        *         Yes
Female  Shatin    *         None
Female  Shatin    *         None

Problem: generate a data set such that each combination of (Gender, District, Birthday) values that appears in it appears at least TWO times.
This data set is 2-anonymous.
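A minimal sketch of the 2-anonymity check on the released table. It assumes that Gender, District and Birthday are the quasi-identifier columns, and the function name is illustrative.

```python
from collections import Counter

# Released table from the slide: (Gender, District, Birthday, Cancer).
released = [
    ("Male", "NT", "*", "None"),
    ("Male", "NT", "*", "Yes"),
    ("Female", "Shatin", "*", "None"),
    ("Female", "Shatin", "*", "None"),
]

def is_k_anonymous(rows, k, qi_columns=(0, 1, 2)):
    """True if every quasi-identifier combination occurs at least k times."""
    counts = Counter(tuple(row[c] for c in qi_columns) for row in rows)
    return all(count >= k for count in counts.values())

print(is_k_anonymous(released, 2))  # True
print(is_k_anonymous(released, 3))  # False
```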
Two kinds of generalisation:
1. Shatin → NT
2. 16 July → *
"Shatin → NT" causes LESS distortion than "16 July → *".
Question: how can we measure the distortion?
31
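B.1.2 K-Anonymity
District hierarchy (height 2): Shatin, Fanling → NT; Mongkok, Jordon → KLN; NT, KLN → HKG
  Shatin → NT: Measurement = 1/2 = 0.5
Birthday hierarchy (height 2): 29 Jan → Jan; 16 July → July; 21 Oct → Oct; 8 Feb → Feb; Jan, July, Oct, Feb → *
  16 July → *: Measurement = 2/2 = 1.0
Gender hierarchy (height 1): Male, Female → *
  Male → *: Measurement = 1/1 = 1.0
Conclusion: We propose a measurement of the distortion of the modified/anonymized data.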
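The three measurements are consistent with "levels climbed in the hierarchy divided by the hierarchy height". The sketch below assumes that reading. The child-to-parent maps are reconstructed from the slide, and the function names are illustrative.

```python
# Child -> parent maps for two of the generalisation hierarchies.
DISTRICT = {"Shatin": "NT", "Fanling": "NT", "NT": "HKG", "KLN": "HKG"}
BIRTHDAY = {"29 Jan": "Jan", "16 July": "July", "21 Oct": "Oct", "8 Feb": "Feb",
            "Jan": "*", "July": "*", "Oct": "*", "Feb": "*"}
GENDER = {"Male": "*", "Female": "*"}

def levels_between(value, ancestor, parent):
    """Number of generalisation steps from `value` up to `ancestor`."""
    steps = 0
    while value != ancestor:
        value = parent[value]  # KeyError if `ancestor` is not above `value`
        steps += 1
    return steps

def height(value, parent):
    """Number of steps from a leaf value up to the root of its hierarchy."""
    steps = 0
    while value in parent:
        value = parent[value]
        steps += 1
    return steps

def distortion(original, generalised, parent):
    """Assumed measurement: levels climbed divided by hierarchy height."""
    return levels_between(original, generalised, parent) / height(original, parent)

print(distortion("Shatin", "NT", DISTRICT))  # 0.5
print(distortion("16 July", "*", BIRTHDAY))  # 1.0
print(distortion("Male", "*", GENDER))       # 1.0
```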