1
Publishing Naive Bayesian Classifiers: Privacy without Accuracy Loss
Authors: Barzan Mozafari and Carlo Zaniolo
Speaker: Hongwei Tian
2
Outline
• Motivation
• Brief background on NBC
• Privacy breach for views
• Transformation from unsafe views to safe views
• Extension for arbitrary prior distributions
• Experiments
• Conclusion
3
Motivation
• PPDM methods seek to achieve the benefits from data mining on the data, without compromising privacy of individuals in the data.
– data collection phase
– data publishing phase
– data mining phase
4
Motivation
• Privacy breaches when publishing NBCs
– Bob knows that Alice lives on Westwood and that she is in her 40s
– Bob’s prior belief that Alice earns 70K was 5/7 = 71%
– After seeing the views, Bob infers that with a probability of 1/10 × (4/5 + 4×3/4 + 5×1) = 88% Alice earns a 70K salary
5
Motivation
• Publishing better views
– Bob’s posterior belief is 1/6 × (2/3 + 1/2 + 1/2 + 1 + 1 + 1) = 78%
– 71%-to-78% is safer than 71%-to-88%
6
Motivation
• Achieve same classification results
– Test input is <P, 30>
– The NBC built on V1 predicts the class label as 50K, because 5/7 × 1/5 × 1/5 < 2/7 × 1/2 × 1/2
– The prediction from the second classifier (built on V2) is again 50K, because 3/5 × 1/3 × 1/3 < 2/5 × 1/2 × 1/2
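These two comparisons can be checked mechanically. A minimal sketch, with the priors and likelihoods taken from the running example on the slides:

```python
from fractions import Fraction as F

# score(c) = P(c) * prod_i P(t_i | c), for test input <P, 30>
def nbc_score(prior, likelihoods):
    s = prior
    for p in likelihoods:
        s *= p
    return s

# Classifier built on V1: P(70K) = 5/7, likelihoods 1/5 and 1/5;
#                         P(50K) = 2/7, likelihoods 1/2 and 1/2
v1 = {"70K": nbc_score(F(5, 7), [F(1, 5), F(1, 5)]),
      "50K": nbc_score(F(2, 7), [F(1, 2), F(1, 2)])}

# Classifier built on V2: P(70K) = 3/5, likelihoods 1/3 and 1/3;
#                         P(50K) = 2/5, likelihoods 1/2 and 1/2
v2 = {"70K": nbc_score(F(3, 5), [F(1, 3), F(1, 3)]),
      "50K": nbc_score(F(2, 5), [F(1, 2), F(1, 2)])}

# Both classifiers predict 50K for <P, 30>.
pred_v1 = max(v1, key=v1.get)
pred_v2 = max(v2, key=v2.get)
```

Exact rational arithmetic (`Fraction`) avoids any floating-point doubt in the two inequalities.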
7
Motivation
• NBC has proved to be one of the most effective classifiers in practice and in theory.
• Given an unsafe NBC, it is possible to find an equivalent one that is safer to publish.
• The objective is to determine whether a set of NBC-enabling views is safe to publish,
• and, if not, how to find a secure database that produces the same NBC model while satisfying the privacy requirements.
8
Brief Background on NBC
• The original database T is an instance of a relation R(A1, ..., An, C)
• In order to build an NBC, the only views that need to be published are π_{A_i, C}(T) for all 1 ≤ i ≤ n, and π_C(T)
• Equivalently, one can instead publish the following counts. For 1 ≤ i ≤ n, t_i ∈ A_i, and c ∈ C:
  N_{t_i, c} = |σ_{A_i = t_i ∧ C = c}(T)|
  P_c = |σ_{C = c}(T)|
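A minimal sketch of computing these counts from a table; the toy rows here are hypothetical, only the counting scheme follows the slide:

```python
from collections import Counter

# Hypothetical instance of R(A1, A2, C): rows are (a1, a2, class).
table = [
    ("P", 30, "50K"), ("P", 40, "70K"),
    ("Q", 30, "50K"), ("Q", 40, "70K"), ("Q", 40, "70K"),
]

n = 2  # number of non-class attributes

# P_c = |sigma_{C=c}(T)|: class counts
P = Counter(row[-1] for row in table)

# N_{t_i,c} = |sigma_{A_i=t_i and C=c}(T)|: one counter per attribute,
# keyed by (attribute value, class)
N = [Counter((row[i], row[-1]) for row in table) for i in range(n)]
```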
9
Brief Background on NBC
• Using these counts, we can express the NBC’s probability estimation as follows. For all t = (t1, ..., tn) ∈ A1 × ... × An and for all c ∈ C:
  Pr(Class = c | t) ≈ (P_c / |T|) · Π_{i=1..n} (N_{t_i, c} / P_c)
• The NBC’s prediction is the class c whose score
  X_c = P_c · Π_{i=1..n} (N_{t_i, c} / P_c)
  satisfies X_c ≥ X_{c'} for all c' ∈ C.
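The scoring rule X_c can be sketched directly on top of published counts. The counts below are hypothetical; only the formula follows the slide:

```python
from fractions import Fraction as F

# Hypothetical published counts for a 2-attribute schema.
P = {"50K": 2, "70K": 3}                      # P_c = |sigma_{C=c}(T)|
N = [{("P", "50K"): 1, ("P", "70K"): 1,       # N_{t_1,c}
      ("Q", "50K"): 1, ("Q", "70K"): 2},
     {(30, "50K"): 2, (30, "70K"): 0,         # N_{t_2,c}
      (40, "50K"): 0, (40, "70K"): 3}]

def predict(t):
    """Return argmax_c X_c, where X_c = P_c * prod_i N_{t_i,c} / P_c."""
    scores = {}
    for c, pc in P.items():
        x = F(pc)
        for i, ti in enumerate(t):
            x *= F(N[i].get((ti, c), 0), pc)
        scores[c] = x
    return max(scores, key=scores.get)
```

Dividing by |T| (as in the probability estimate) does not change the argmax, so the sketch omits it.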
10
Privacy Breach for Views
• Prior and posterior knowledge
  P^1_{c,I0} = Σ_{d ∈ D} P(t.C = c | t ∈ d, t.I = I0) · P(d)
  P^2_{c,I0} = Σ_{d ∈ S} P(t.C = c | t ∈ d, t.I = I0) · P(d | V(d) = V0)
             = Σ_{d ∈ S} P(t.C = c | t ∈ d, t.I = I0) · P(d) / Σ_{d' ∈ S} P(d')
  where
  Quasi-identifier: I = (A1, A3), with value I0 = (t[A1], t[A3])
  Family of all table instances: D = {d : t ∈ d, t.I = I0}
  All instances satisfying the given views: S = {d ∈ D : V(d) = V0}
11
Privacy Breach for Views
• For a given table T, publishing V(T) = V0 causes a privacy breach with respect to a pair of given constants 0 < L1 < L2 < 1, if either of the following holds:
  P^1_{c,I0} ≤ L1 and L2 ≤ P^2_{c,I0}, or
  P^2_{c,I0} ≤ L1 and L2 ≤ P^1_{c,I0}
• For example, 0.5-to-0.8 does not satisfy the privacy requirement L1 = 0.51 and L2 = 0.8, but 0.5-to-0.78 does.
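The breach test is a direct predicate on the prior/posterior pair; a minimal sketch:

```python
def causes_breach(p1, p2, L1, L2):
    """L1-to-L2 privacy breach (0 < L1 < L2 < 1): the prior p1 and
    posterior p2 straddle the (L1, L2) interval in either direction."""
    return (p1 <= L1 and L2 <= p2) or (p2 <= L1 and L2 <= p1)
```

With L1 = 0.51 and L2 = 0.8 this reproduces the slide's example: 0.5-to-0.8 breaches, 0.5-to-0.78 does not.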
12
Privacy Breach for Views
• Assume a uniform distribution over the database instances, and a uniform distribution over the class values. Then:
  P^1_{c,I0} = Σ_{d ∈ D} P(t.C = c | t ∈ d, t.I = I0) · P(d) = 1/|C|
  P^2_{c,I0} = (1/|S|) · Σ_{d ∈ S} Δ^c_d, where Δ^c_d = P(t.C = c | t ∈ d, t.I = I0)
13
Privacy Breach for Views
• Let I0 be the value of a given quasi-identifier I, and let V0 be the value of a given view V(T). If there exist some m1, m2 > 0 such that for all c ∈ C:
  m1/|C| ≤ (1/|S|) · Σ_{d ∈ S} Δ^c_d ≤ m2/|C|
  then for any c and any pair of L1, L2 > 0, publishing V0 will not cause any privacy breaches w.r.t. L1 and L2, provided that the following amplification criterion holds:
  m2/m1 ≤ L2(1 − L1) / (L1(1 − L2))
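The amplification criterion is a one-line check on the bounds m1 and m2; a minimal sketch:

```python
def amplification_ok(m1, m2, L1, L2):
    """Sufficient condition from the slide: no L1-to-L2 breach can occur
    when m2/m1 <= L2(1 - L1) / (L1(1 - L2)), for 0 < L1 < L2 < 1."""
    return m2 / m1 <= (L2 * (1 - L1)) / (L1 * (1 - L2))
```

For example, with L1 = 0.51 and L2 = 0.8 the right-hand side is about 3.84, so m2/m1 = 3 passes but m2/m1 = 4 does not.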
14
Privacy Breach for Views
• For a given quasi-identifier I = I0, a given view V(T) = V0 is safe to publish against any L1-to-L2 privacy breaches, if there exists γ ≥ 1 such that the following conditions hold:
  γ ≤ min( (|C| − 1)·L2 / (1 − L2), (1 − L1) / ((|C| − 1)·L1) )
  and for all c, c' ∈ C:
  Σ_{d ∈ S} Δ^c_d ≤ γ · Σ_{d ∈ S} Δ^{c'}_d
• Select the largest possible γ
• For a given γ, recast the privacy goal as that of checking/enforcing the second condition
15
Privacy Breach for Views
• With respect to a given I0 as the value of a quasi-identifier I, and a given amplification ratio γ, the viewset (P, N) is safe to publish if, for all c, c' ∈ C, all 1 ≤ i ≤ n, and all t_i ∈ A_i, the following conditions hold:
  0 < N_{t_i, c} / N_{t_i, c'} ≤ γ^{1/(2|I| − 1)}
  0 < P_c / P_{c'} ≤ γ^{1/(2|I| − 1)}
16
Privacy Breach for Views
• Two observations
– all quasi-identifiers that have the same cardinality (i.e., number of attributes) can be blocked at the same time, since the conditions are functions of |I|, and not of I or I0.
– all privacy breaches for all quasi-identifiers of any cardinality can be blocked by simply blocking the one with the largest cardinality, namely n, because γ^{1/(2n − 1)} ≤ γ^{1/(2|I| − 1)} whenever |I| ≤ n.
17
Privacy Breach for Views
• With respect to a given amplification ratio γ, the viewset (P, N) is safe to publish if, for all c, c' ∈ C, all 1 ≤ i ≤ n, and all t_i ∈ A_i, the following conditions hold:
  0 < N_{t_i, c} / N_{t_i, c'} ≤ γ^{1/(2n − 1)}
  0 < P_c / P_{c'} ≤ γ^{1/(2n − 1)}
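A sketch of checking these ratio conditions, assuming the γ^{1/(2n − 1)} bound; the viewset encoding (one dict of per-class counts per attribute value) is hypothetical:

```python
def viewset_safe(N, P, gamma, n):
    """Check that every ratio N_{t_i,c}/N_{t_i,c'} and P_c/P_{c'} lies
    in (0, gamma**(1/(2n-1))] for all ordered class pairs c != c'."""
    bound = gamma ** (1 / (2 * n - 1))
    classes = list(P)
    for c in classes:
        for c2 in classes:
            if c == c2:
                continue
            if not (0 < P[c] / P[c2] <= bound):
                return False
            for counts in N:  # one dict {class: count} per (i, t_i)
                if counts[c2] == 0 or not (0 < counts[c] / counts[c2] <= bound):
                    return False
    return True
```

Checking all ordered pairs makes the lower bound redundant in principle, but the explicit `0 <` test also rejects zero counts, which the transformation's Step 1 must eliminate first.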
18
Transformation from unsafe views to safe views
• NBC-Equivalence
• Let f and f' be two functions that map each element of (∪_i A_i) × C to a non-negative real number. We call f and f' NBC-equivalent if, for every tuple t = (t1, ..., tn) and all c, c' ∈ C with c ≠ c':
  Π_i f(t_i, c) < Π_i f(t_i, c') ⟺ Π_i f'(t_i, c) < Π_i f'(t_i, c')
  Π_i f(t_i, c) = Π_i f(t_i, c') ⟺ Π_i f'(t_i, c) = Π_i f'(t_i, c')
19
Transformation from unsafe views to safe views
• Transformation algorithm
– Input: V, the given view set consisting of the counts N_{t_i,c} and P_c; amplification ratio γ
– Description:
  • Step 1: Replace all N_{t_i,c} that are 0 with non-zero values
  • Step 2: Scale down all N_{t_i,c} to new rational numbers that satisfy the given amplification ratio
  • Step 3: Adjust the numbers so that Σ_t N_{t,c} = P_c holds again
  • Step 4: Normalize the numbers, or turn them into integers
– Output: V′
• (1) Raising all the counts to the same power does not change the classification;
• (2) in other words, the set of NBC-equivalent viewsets is closed under exponentiation.
  Example: 100 > 16 and √100 = 10 > √16 = 4; taking the square root preserves the ordering while shrinking the ratio from 100/16 to 10/4.
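Step 2 can exploit this closure under exponentiation: raise all (positive) counts to a power a ≤ 1 chosen so that the largest ratio meets the target, as in the 100/16 → 10/4 example. A minimal sketch; the function name and encoding are illustrative, not the paper's algorithm:

```python
import math

def shrink_ratios(counts, target_ratio):
    """Raise all positive counts to the same power a <= 1 so the largest
    count ratio drops to at most target_ratio; raising everything to one
    common power keeps the viewset NBC-equivalent."""
    hi, lo = max(counts), min(counts)
    r = hi / lo
    if r <= target_ratio:
        return list(counts)        # already within the target ratio
    a = math.log(target_ratio) / math.log(r)   # r ** a == target_ratio
    return [c ** a for c in counts]
```

For counts [100, 4] and a target ratio of 5, the chosen power is 1/2 and the result is [10, 2], matching the square-root example above.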
20
Extension for arbitrary prior distributions
• See a tiny example
21
Experiments
• Adult dataset containing 32,561 tuples
• The attributes used were Age, Years of education, Work hours per week, and Salary
• Compared an NBC trained on the k-anonymous data vs. an NBC trained on the output of the Safety Views Transformation
22
Conclusion
• Reformulated privacy breach for view publishing
• Presented sufficient conditions that are easy to check/enforce
• Provided algorithms that guarantee the privacy of the individuals who provided the training data, while incurring zero accuracy loss in terms of building an NBC