1
Publishing Naive Bayesian Classifiers: Privacy without Accuracy Loss
Authors: Barzan Mozafari and Carlo Zaniolo
Speaker: Hongwei Tian
2
Outline
• Motivation
• Brief background on NBC
• Privacy breach for views
• Transformation from unsafe views to safe views
• Extension for arbitrary prior distributions
• Experiments
• Conclusion
3
Motivation
• PPDM methods seek to achieve the benefits from data mining on the data, without compromising privacy of individuals in the data.
– data collection phase
– data publishing phase
– data mining phase
4
Motivation
• Privacy breaches when publishing NBCs
– Bob knows that Alice lives on Westwood and that she is in her 40s
– Bob’s prior belief that Alice earns 70K was 5/7 = 71%
– After seeing the views, Bob infers that with a probability of 1/10 × (4/5 + 4×3/4 + 5×1) = 88% Alice earns a 70K salary
5
Motivation
• Publishing better views
– Bob’s posterior belief is 1/6 × (2/3 + 1/2 + 1/2 + 1 + 1 + 1) = 78%
– 71%-to-78% is safer than 71%-to-88%
6
Motivation
• Achieve same classification results
– Test input is <P, 30>
– The NBC built on V1 predicts the class label as 50K, because 5/7 × 1/5 × 1/5 < 2/7 × 1/2 × 1/2
– The prediction from the second classifier (built on V2) is again 50K, because 3/5 × 1/3 × 1/3 < 2/5 × 1/2 × 1/2
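These two comparisons can be checked mechanically. A minimal sketch, with the priors and likelihoods taken from the running example on the slides:

```python
from fractions import Fraction as F

# score(c) = P(c) * prod_i P(t_i | c), for test input <P, 30>
def nbc_score(prior, likelihoods):
    s = prior
    for p in likelihoods:
        s *= p
    return s

# Classifier built on V1: P(70K) = 5/7, likelihoods 1/5 and 1/5;
#                         P(50K) = 2/7, likelihoods 1/2 and 1/2
v1 = {"70K": nbc_score(F(5, 7), [F(1, 5), F(1, 5)]),
      "50K": nbc_score(F(2, 7), [F(1, 2), F(1, 2)])}

# Classifier built on V2: P(70K) = 3/5, likelihoods 1/3 and 1/3;
#                         P(50K) = 2/5, likelihoods 1/2 and 1/2
v2 = {"70K": nbc_score(F(3, 5), [F(1, 3), F(1, 3)]),
      "50K": nbc_score(F(2, 5), [F(1, 2), F(1, 2)])}

# Both classifiers predict 50K for <P, 30>.
pred_v1 = max(v1, key=v1.get)
pred_v2 = max(v2, key=v2.get)
```

Exact rational arithmetic (`Fraction`) avoids any floating-point doubt in the two inequalities.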
7
Motivation
• NBC has proved to be one of the most effective classifiers in practice and in theory.
• Given an unsafe NBC, it is possible to find an equivalent one that is safer to publish.
• The objective is to determine whether a set of NBC-enabling views is safe to publish,
• and, if not, how to find a secure database that produces the same NBC model while satisfying the privacy requirements.
8
Brief Background on NBC
• The original database T is an instance of a relation R(A1, ..., An, C)
• In order to build an NBC, the only views that need to be published are π_{A_i, C}(T) for all 1 ≤ i ≤ n, and π_C(T)
• Equivalently, one can instead publish the following counts. For 1 ≤ i ≤ n, t_i ∈ A_i, and c ∈ C:
  N_{t_i, c} = |σ_{A_i = t_i ∧ C = c}(T)|
  P_c = |σ_{C = c}(T)|
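A minimal sketch of computing these counts from a table; the toy rows here are hypothetical, only the counting scheme follows the slide:

```python
from collections import Counter

# Hypothetical instance of R(A1, A2, C): rows are (a1, a2, class).
table = [
    ("P", 30, "50K"), ("P", 40, "70K"),
    ("Q", 30, "50K"), ("Q", 40, "70K"), ("Q", 40, "70K"),
]

n = 2  # number of non-class attributes

# P_c = |sigma_{C=c}(T)|: class counts
P = Counter(row[-1] for row in table)

# N_{t_i,c} = |sigma_{A_i=t_i and C=c}(T)|: one counter per attribute,
# keyed by (attribute value, class)
N = [Counter((row[i], row[-1]) for row in table) for i in range(n)]
```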
9
Brief Background on NBC
• Using these counts, we can express the NBC’s probability estimation as follows. For all t = (t1, ..., tn) ∈ A1 × ... × An and for all c ∈ C:
  Pr(Class = c | t) ≈ (P_c / |T|) · Π_{i=1..n} (N_{t_i, c} / P_c)
• The NBC’s prediction is the class c whose score
  X_c = P_c · Π_{i=1..n} (N_{t_i, c} / P_c)
  satisfies X_c ≥ X_{c'} for all c' ∈ C.
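The scoring rule X_c can be sketched directly on top of published counts. The counts below are hypothetical; only the formula follows the slide:

```python
from fractions import Fraction as F

# Hypothetical published counts for a 2-attribute schema.
P = {"50K": 2, "70K": 3}                      # P_c = |sigma_{C=c}(T)|
N = [{("P", "50K"): 1, ("P", "70K"): 1,       # N_{t_1,c}
      ("Q", "50K"): 1, ("Q", "70K"): 2},
     {(30, "50K"): 2, (30, "70K"): 0,         # N_{t_2,c}
      (40, "50K"): 0, (40, "70K"): 3}]

def predict(t):
    """Return argmax_c X_c, where X_c = P_c * prod_i N_{t_i,c} / P_c."""
    scores = {}
    for c, pc in P.items():
        x = F(pc)
        for i, ti in enumerate(t):
            x *= F(N[i].get((ti, c), 0), pc)
        scores[c] = x
    return max(scores, key=scores.get)
```

Dividing by |T| (as in the probability estimate) does not change the argmax, so the sketch omits it.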
10
Privacy Breach for Views
• Prior and posterior knowledge
  P^1_{c,I0} = Σ_{d ∈ D} P(t.C = c | t ∈ d, t.I = I0) · P(d)
  P^2_{c,I0} = Σ_{d ∈ S} P(t.C = c | t ∈ d, t.I = I0) · P(d | V(d) = V0)
             = Σ_{d ∈ S} P(t.C = c | t ∈ d, t.I = I0) · P(d) / Σ_{d' ∈ S} P(d')
  where
  Quasi-identifier: I = (A1, A3), with value I0 = (t[A1], t[A3])
  Family of all table instances: D = {d : t ∈ d, t.I = I0}
  All instances satisfying the given views: S = {d ∈ D : V(d) = V0}
11
Privacy Breach for Views
• For a given table T, publishing V(T) = V0 causes a privacy breach with respect to a pair of given constants 0 < L1 < L2 < 1, if either of the following holds:
  P^1_{c,I0} ≤ L1 and L2 ≤ P^2_{c,I0}, or
  P^2_{c,I0} ≤ L1 and L2 ≤ P^1_{c,I0}
• For example, 0.5-to-0.8 does not satisfy the privacy requirement L1 = 0.51 and L2 = 0.8, but 0.5-to-0.78 does.
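The breach test is a direct predicate on the prior/posterior pair; a minimal sketch:

```python
def causes_breach(p1, p2, L1, L2):
    """L1-to-L2 privacy breach (0 < L1 < L2 < 1): the prior p1 and
    posterior p2 straddle the (L1, L2) interval in either direction."""
    return (p1 <= L1 and L2 <= p2) or (p2 <= L1 and L2 <= p1)
```

With L1 = 0.51 and L2 = 0.8 this reproduces the slide's example: 0.5-to-0.8 breaches, 0.5-to-0.78 does not.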
12
Privacy Breach for Views
• Assume a uniform distribution over the database instances, and a uniform distribution over the class values. Then:
  P^1_{c,I0} = Σ_{d ∈ D} P(t.C = c | t ∈ d, t.I = I0) · P(d) = 1/|C|
  P^2_{c,I0} = (1/|S|) · Σ_{d ∈ S} Δ^c_d, where Δ^c_d = P(t.C = c | t ∈ d, t.I = I0)
13
Privacy Breach for Views
• Let I0 be the value of a given quasi-identifier I, and let V0 be the value of a given view V(T). If there exist some m1, m2 > 0 such that for all c ∈ C:
  m1/|C| ≤ (1/|S|) · Σ_{d ∈ S} Δ^c_d ≤ m2/|C|
  then for any c and any pair of L1, L2 > 0, publishing V0 will not cause any privacy breaches w.r.t. L1 and L2, provided that the following amplification criterion holds:
  m2/m1 ≤ L2(1 − L1) / (L1(1 − L2))
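The amplification criterion is a one-line check on the bounds m1 and m2; a minimal sketch:

```python
def amplification_ok(m1, m2, L1, L2):
    """Sufficient condition from the slide: no L1-to-L2 breach can occur
    when m2/m1 <= L2(1 - L1) / (L1(1 - L2)), for 0 < L1 < L2 < 1."""
    return m2 / m1 <= (L2 * (1 - L1)) / (L1 * (1 - L2))
```

For example, with L1 = 0.51 and L2 = 0.8 the right-hand side is about 3.84, so m2/m1 = 3 passes but m2/m1 = 4 does not.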
14
Privacy Breach for Views
• For a given quasi-identifier I = I0, a given view V(T) = V0 is safe to publish against any L1-to-L2 privacy breaches, if there exists γ ≥ 1 such that the following conditions hold:
  γ ≤ min( (|C| − 1)·L2 / (1 − L2), (1 − L1) / ((|C| − 1)·L1) )
  and for all c, c' ∈ C:
  Σ_{d ∈ S} Δ^c_d ≤ γ · Σ_{d ∈ S} Δ^{c'}_d
• Select the largest possible γ
• For a given γ, recast the privacy goal as that of checking/enforcing the second condition
15
Privacy Breach for Views
• With respect to a given I0 as the value of a quasi-identifier I, and a given amplification ratio γ, the viewset (P, N) is safe to publish if, for all c, c' ∈ C, all 1 ≤ i ≤ n, and all t_i ∈ A_i, the following conditions hold:
  0 < N_{t_i, c} / N_{t_i, c'} ≤ γ^{1/(2|I| − 1)}
  0 < P_c / P_{c'} ≤ γ^{1/(2|I| − 1)}
16
Privacy Breach for Views
• Two observations
– all quasi-identifiers that have the same cardinality (i.e., number of attributes) can be blocked at the same time, since the conditions are functions of |I|, and not of I or I0.
– all privacy breaches for all quasi-identifiers of any cardinality can be blocked by simply blocking the one with the largest cardinality, namely n, because γ^{1/(2n − 1)} ≤ γ^{1/(2|I| − 1)} whenever |I| ≤ n.
17
Privacy Breach for Views
• With respect to a given amplification ratio γ, the viewset (P, N) is safe to publish if, for all c, c' ∈ C, all 1 ≤ i ≤ n, and all t_i ∈ A_i, the following conditions hold:
  0 < N_{t_i, c} / N_{t_i, c'} ≤ γ^{1/(2n − 1)}
  0 < P_c / P_{c'} ≤ γ^{1/(2n − 1)}
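A sketch of checking these ratio conditions, assuming the γ^{1/(2n − 1)} bound; the viewset encoding (one dict of per-class counts per attribute value) is hypothetical:

```python
def viewset_safe(N, P, gamma, n):
    """Check that every ratio N_{t_i,c}/N_{t_i,c'} and P_c/P_{c'} lies
    in (0, gamma**(1/(2n-1))] for all ordered class pairs c != c'."""
    bound = gamma ** (1 / (2 * n - 1))
    classes = list(P)
    for c in classes:
        for c2 in classes:
            if c == c2:
                continue
            if not (0 < P[c] / P[c2] <= bound):
                return False
            for counts in N:  # one dict {class: count} per (i, t_i)
                if counts[c2] == 0 or not (0 < counts[c] / counts[c2] <= bound):
                    return False
    return True
```

Checking all ordered pairs makes the lower bound redundant in principle, but the explicit `0 <` test also rejects zero counts, which the transformation's Step 1 must eliminate first.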
18
Transformation from unsafe views to safe views
• NBC-Equivalence
• Let f and f' be two functions that map each element of (∪_i A_i) × C to a non-negative real number. We call f and f' NBC-equivalent if, for every tuple t = (t1, ..., tn) and all c, c' ∈ C with c ≠ c':
  Π_i f(t_i, c) < Π_i f(t_i, c') ⟺ Π_i f'(t_i, c) < Π_i f'(t_i, c')
  Π_i f(t_i, c) = Π_i f(t_i, c') ⟺ Π_i f'(t_i, c) = Π_i f'(t_i, c')
19
Transformation from unsafe views to safe views
• Transformation algorithm
– Input: V, the given view set consisting of the counts N_{t_i,c} and P_c; amplification ratio γ
– Description:
  • Step 1: Replace all N_{t_i,c} that are 0 with non-zero values
  • Step 2: Scale down all N_{t_i,c} to new rational numbers that satisfy the given amplification ratio
  • Step 3: Adjust the numbers so that Σ_t N_{t,c} = P_c holds again
  • Step 4: Normalize the numbers, or turn them into integers
– Output: V′
• (1) Raising all the counts to the same power does not change the classification;
• (2) in other words, the set of NBC-equivalent viewsets is closed under exponentiation.
  Example: 100 > 16 and √100 = 10 > √16 = 4; taking the square root preserves the ordering while shrinking the ratio from 100/16 to 10/4.
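Step 2 can exploit this closure under exponentiation: raise all (positive) counts to a power a ≤ 1 chosen so that the largest ratio meets the target, as in the 100/16 → 10/4 example. A minimal sketch; the function name and encoding are illustrative, not the paper's algorithm:

```python
import math

def shrink_ratios(counts, target_ratio):
    """Raise all positive counts to the same power a <= 1 so the largest
    count ratio drops to at most target_ratio; raising everything to one
    common power keeps the viewset NBC-equivalent."""
    hi, lo = max(counts), min(counts)
    r = hi / lo
    if r <= target_ratio:
        return list(counts)        # already within the target ratio
    a = math.log(target_ratio) / math.log(r)   # r ** a == target_ratio
    return [c ** a for c in counts]
```

For counts [100, 4] and a target ratio of 5, the chosen power is 1/2 and the result is [10, 2], matching the square-root example above.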
20
Extension for arbitrary prior distributions
• See a tiny example
21
Experiments
• Adult dataset containing 32,561 tuples
• The attributes used were Age, Years of education, Work hours per week, and Salary
• Compared an NBC trained on the k-anonymous data vs. an NBC trained on the output of the Safety Views Transformation
22
Conclusion
• Reformulated privacy breach for view publishing
• Presented sufficient conditions that are easy to check/enforce
• Provided algorithms that guarantee the privacy of the individuals who provided the training data, while incurring zero accuracy loss in terms of building an NBC