Attacks on Randomization based Privacy Preserving Data Mining
Xintao Wu
University of North Carolina at Charlotte
Sept 20, 2010


Page 1

Attacks on Randomization based Privacy Preserving Data Mining

Xintao Wu

University of North Carolina at Charlotte, Sept 20, 2010

Page 2

Scope

Page 3

Outline

Part I: Attacks on Randomized Numerical Data
  • Additive noise
  • Projection

Part II: Attacks on Randomized Categorical Data
  • Randomized Response

Page 4

Additive Noise Randomization Example

    Bal   Income  ...  IntP
1   10k   85k     ...  2k
2   15k   70k     ...  18k
3   50k   120k    ...  35k
4   45k   23k     ...  134k
.   .     .       ...  .
N   80k   110k    ...  15k

Y = X + E (Perturbed = Original + Noise):

| 17.334  88.759   2.099 |   | 10  85   2 |   | 7.334  3.759  0.099 |
| 19.199  77.537  25.939 |   | 15  70  18 |   | 4.199  7.537  7.939 |
| 59.199 128.447  38.678 | = | 50 120  35 | + | 9.199  8.447  3.678 |
| 51.208  30.313 135.939 |   | 45  23 134 |   | 6.208  7.313  1.939 |
| 89.048 115.692  21.318 |   | 80 110  15 |   | 9.048  5.692  6.318 |
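A minimal sketch of how a data owner would generate such a release: the snippet below adds independent Gaussian noise to the numerical table above. The noise scale sigma is an illustrative assumption, not a value from the slides.

```python
import numpy as np

# Original numerical data X: one row per record (Bal, Income, IntP), in $k.
X = np.array([[10,  85,   2],
              [15,  70,  18],
              [50, 120,  35],
              [45,  23, 134],
              [80, 110,  15]], dtype=float)

rng = np.random.default_rng(0)
sigma = 5.0                           # illustrative noise scale (assumption)
E = rng.normal(0.0, sigma, X.shape)   # additive noise, independent per entry
Y = X + E                             # published perturbed data
```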

Page 5

Individual Value Reconstruction (Additive Noise)

• Methods:
  Spectral Filtering (Kargupta et al., ICDM03)
  PCA (Huang, Du, and Chen, SIGMOD05)
  SVD (Guo, Wu, and Li, PKDD06)

• All aim to remove the noise by projecting the perturbed data onto a lower-dimensional subspace.

Page 6

Individual Reconstruction Algorithm

Model: Up = U + V (Perturbed = Original + Noise).

1. Apply EVD to the covariance matrix of Up. Using some published information about the noise V, extract the first k components as the principal components: λ1 ≥ λ2 ≥ ··· ≥ λk ≥ λe, where e1, e2, ···, ek are the corresponding eigenvectors. Qk = [e1 e2 ··· ek] forms an orthonormal basis of a subspace X.

2. Find the orthogonal projection onto X: P = Qk QkT.

3. Get the estimated data set: Û = Up Qk QkT = Up P.
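A minimal numpy sketch of this reconstruction, assuming the attacker knows the noise variance sigma2 (the "published information" about V); function and variable names are illustrative.

```python
import numpy as np

def spectral_filter(Up, sigma2):
    """Estimate the original data U from perturbed data Up = U + V."""
    n, _ = Up.shape
    mean = Up.mean(axis=0)
    Uc = Up - mean                          # center the perturbed data
    cov = Uc.T @ Uc / (n - 1)               # sample covariance matrix of Up
    vals, vecs = np.linalg.eigh(cov)        # EVD, eigenvalues ascending
    vals, vecs = vals[::-1], vecs[:, ::-1]  # reorder to descending
    k = int(np.sum(vals > sigma2))          # keep components above the noise level
    Qk = vecs[:, :k]                        # orthonormal basis of the signal subspace
    return Uc @ Qk @ Qk.T + mean            # orthogonal projection onto span(Qk)
```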

Page 7

Why it works

• Original data are correlated.
• Noise is not correlated.

[Figure: original signal + noise = perturbed data; the perturbed points, viewed in the 2-d space spanned by the 1st and 2nd principal vectors, are projected onto the 1st principal vector, giving a 1-d estimate close to the original signal.]

Page 8

Challenging Questions

• Previous work on individual reconstruction is only empirical.

Attacker question: how close is the estimated data to the original one, i.e., how large is ||Û − U||2?

Data owner question: how much noise should be added to preserve privacy at a given tolerated level?

Page 9

Determining k

• Strategy 1 (Huang and Du, SIGMOD05): k = max{ i | λ̃i ≥ σV² }

• Strategy 2 (Guo, Wu and Li, PKDD 2006): k = min{ i | λ̃i ≤ 2σV² } − 1

  The estimated data Û = Ũp Q̃k Q̃kT using this k is approximately optimal.
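A small numpy sketch of the two selection rules as reconstructed above; this is an assumption-laden reading of the garbled slide formulas, with sigma2_v standing for the noise level σV².

```python
import numpy as np

def choose_k(eigvals, sigma2_v):
    """eigvals: eigenvalues of the perturbed covariance matrix."""
    lam = np.sort(np.asarray(eigvals))[::-1]        # descending order
    k1 = int(np.sum(lam >= sigma2_v))               # Strategy 1: max{i | lam_i >= sigma2_v}
    below = np.nonzero(lam <= 2 * sigma2_v)[0]      # Strategy 2: min{i | lam_i <= 2*sigma2_v} - 1
    k2 = int(below[0]) if below.size else len(lam)  # 0-based index == (1-based min) - 1
    return k1, k2
```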

Page 10

Additive Noise vs. Projection

• Additive perturbation is not safe:
  Spectral Filtering Technique (H. Kargupta et al., ICDM03)
  PCA Based Technique (Huang et al., SIGMOD05)
  SVD Based Technique & Bound Analysis (Guo et al., SAC06, PKDD06)

• How about the projection based perturbation?
  Projection models
  Vulnerabilities
  Potential attacks

Y = X + E (Perturbed = Original + Noise)

Y = R X (Perturbed = Transformation × Original)

Page 11

Rotation Randomization Example

    Bal   Income  ...  IntP
1   10k   85k     ...  2k
2   15k   70k     ...  18k
3   50k   120k    ...  35k
4   45k   23k     ...  134k
.   .     .       ...  .
N   80k   110k    ...  15k

Y = R X, with RRT = RTR = I (each column of X is one record):

|  61.33   63.67  110.00  119.67   63.33 |   |  0.3333   0.6667   0.6667 |   | 10  15  50  45  80 |
|  49.33   30.67   55.00  -59.33  -31.67 | = | -0.6667   0.6667  -0.3333 | × | 85  70 120  23 110 |
| -33.67  -21.33  -30.00   51.67  -51.67 |   | -0.6667  -0.3333   0.6667 |   |  2  18  35 134  15 |

Page 12

Rotation Approach (R is orthonormal)

• When R is an orthonormal matrix (RTR = RRT = I):
  Vector length: |Rx| = |x|
  Euclidean distance: |Rxi - Rxj| = |xi - xj|
  Inner product: <Rxi, Rxj> = <xi, xj>

• Many clustering and classification methods are invariant to this rotation perturbation (see the sketch below):
  Classification (Chen and Liu, ICDM05)
  Distributed data mining (Liu and Kargupta, TKDE06)
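A minimal numpy check of these invariants, using the rotation matrix and data from the previous slide; the tolerances only absorb the 4-decimal rounding of R.

```python
import numpy as np

R = np.array([[ 0.3333,  0.6667,  0.6667],
              [-0.6667,  0.6667, -0.3333],
              [-0.6667, -0.3333,  0.6667]])
X = np.array([[10, 15,  50,  45,  80],
              [85, 70, 120,  23, 110],
              [ 2, 18,  35, 134,  15]], dtype=float)  # columns are records

Y = R @ X  # rotation perturbation
# Vector lengths and all pairwise inner products (hence distances) are preserved.
print(np.allclose(np.linalg.norm(Y, axis=0), np.linalg.norm(X, axis=0), rtol=1e-3))
print(np.allclose(Y.T @ Y, X.T @ X, rtol=1e-2))
```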

Page 13

Example

Y = R X with a 2-d rotation, RRT = RTR = I:

R = | 0.866  -0.500 |
    | 0.500   0.866 |

[Figure: the point x = (0.2902, 1.3086) is rotated by R to y = Rx.]

Page 14

Weakness of Rotation

• Known sample attack: given some known original data points (known info) together with the perturbed data, the attacker can estimate R by regression and then recover the remaining original data.

[Figure: known original points and the perturbed points are matched by regression to recover the rotation R and undo it.]
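A minimal sketch of such a known-sample attack, assuming the attacker knows s ≥ d original records and their correspondence to columns of Y; the names and the least-squares formulation are illustrative.

```python
import numpy as np

def estimate_R(X_known, Y_known):
    """Least-squares estimate of R from known pairs Y_known = R @ X_known.

    X_known, Y_known: d x s arrays holding s known records as columns (s >= d).
    """
    # Solving R @ X = Y for R is equivalent to X.T @ R.T = Y.T in least squares.
    Rt, *_ = np.linalg.lstsq(X_known.T, Y_known.T, rcond=None)
    return Rt.T

# Once R is estimated, the remaining originals follow from the perturbed data Y:
# X_hat = np.linalg.solve(R_hat, Y)
```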

Page 15

General Linear Transformation

• Y = R X + E
  When R = I: Y = X + E (additive noise model)
  When RRT = RTR = I and E = 0: Y = R X (rotation model)
  In general, R can be an arbitrary matrix.

Example, Y = R X + E:

| 265.95 286.63 475.68 581.71 520.53 |   | 4.751 2.429 2.282 |   | 10  15  50  45  80 |   | 7.334 4.199 9.199 6.208 9.048 |
| 394.30 338.49 569.58 174.22 277.79 | = | 1.156 4.457 0.093 | × | 85  70 120  23 110 | + | 3.759 7.537 8.447 7.313 5.692 |
| 362.55 394.11 665.37 776.46 463.08 |   | 3.034 3.811 4.107 |   |  2  18  35 134  15 |   | 0.099 7.939 3.678 1.939 6.318 |

Page 16

Is Y = R X + E Safe?

• R can be an arbitrary matrix, hence the regression based attack won't work.

• How about a direct attack using noisy ICA?

Y = R X + E (general linear transformation model)

X = A S + N (noisy ICA model)

Page 17

ICA Revisited

• ICA motivation:
  Blind source separation: separating unobservable or latent independent source signals when only mixed signals are observed (the cocktail-party problem).

• What is ICA?
  ICA is a statistical technique which aims to represent a set of random variables as linear combinations of statistically independent component variables.
  ICA is a process for determining the structure that produced a signal.

Page 18

ICA

Linear mixing process (Observed = Mixing Matrix × Source):

| x1(t) |   | A11 ··· A1m |   | s1(t) |
|   :   | = |  :       :  | × |   :   |
| xn(t) |   | An1 ··· Anm |   | sm(t) |

Separation process (Separated = Demixing Matrix × Observed):

| y1(t) |   | W11 ··· W1n |   | x1(t) |
|   :   | = |  :       :  | × |   :   |
| ym(t) |   | Wm1 ··· Wmn |   | xn(t) |

Are the separated components independent? The demixing matrix W is obtained by optimizing a cost function that measures the independence of the outputs y.
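A minimal sketch of blind source separation with FastICA from scikit-learn; the signals, mixing matrix, and parameters are illustrative, not from the slides.

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
# Two non-Gaussian sources: a sine wave and a square wave.
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]
A = np.array([[1.0, 0.5],
              [0.5, 2.0]])              # mixing matrix
X = S @ A.T                             # observed mixtures x(t) = A s(t)

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)            # estimated sources, up to order and scale
```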

Page 19

Restriction of ICA

• Restrictions:
  All the components si should be independent.
  They must be non-Gaussian, with the possible exception of one component.

• Can we apply ICA directly to Y = R X (viewing it as X = A S)? No:
  There are correlations among the attributes of X.
  More than one attribute of X may have a Gaussian distribution.

Page 20

A-priori Knowledge based ICA (AK-ICA) Attack

Page 21

Correctness of AK-ICA

• We prove that a matrix J exists such that X̂ = Âx̃ J Ŝy.

• J represents the connection between the distributions of S̃X and ŜY.

For more details, see Guo and Wu, PAKDD 2007.

Page 22

Assumption

• Privacy can be breached when a small subset of the original data X is available to attackers.

• The assumption is reasonable. In the survey "Understanding net users' attitude about online privacy" (April 1999):

  Privacy concern: 56%
  Refuse: 17%
  No concern (willing to provide data): 27%

[Figure: a few records of the original Bal/Income/IntP table are known to the attacker.]

Page 23

Outline

Part I: Attacks on Randomized Numerical Data
  • Additive noise
  • Projection

Part II: Attacks on Randomized Categorical Data
  • Randomized Response

Page 24

Randomized Response (Stanley Warner, JASA 1965)

Purpose: get the proportion πA of population members belonging to a sensitive group A (A: cheated in the exam; Ā: didn't cheat).

Procedure: each respondent uses a randomization device that asks "Do you belong to A?" with probability p and "Do you belong to Ā?" with probability 1 − p. The interviewer sees only the "yes"/"no" answer, not which question was asked.

P("yes") = λ = πA p + (1 − πA)(1 − p)

An unbiased estimate of πA is:

π̂A = (λ̂ − (1 − p)) / (2p − 1),  p ≠ 1/2

where λ̂ is the observed proportion of "yes" answers.
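A minimal simulation of Warner's scheme, with illustrative values πA = 0.3 and p = 0.7 (assumptions, not from the slides).

```python
import numpy as np

rng = np.random.default_rng(0)
n, pi_A, p = 100_000, 0.3, 0.7        # illustrative population size and settings

member = rng.random(n) < pi_A         # true (hidden) membership in A
ask_A = rng.random(n) < p             # device picks "Do you belong to A?" w.p. p
answer_yes = np.where(ask_A, member, ~member)  # truthful answer to the chosen question

lam_hat = answer_yes.mean()                    # observed "yes" proportion
pi_hat = (lam_hat - (1 - p)) / (2 * p - 1)     # Warner's unbiased estimate of pi_A
print(round(pi_hat, 3))                        # close to 0.3
```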

Page 25

Matrix Expression

• RR can be expressed in matrix form (0: No, 1: Yes):

  λ = P π, with P = | p      1 − p |
                    | 1 − p  p     |

• An unbiased estimate of π is: π̂ = P⁻¹ λ̂

Page 26

Vector Response

π = (π1, ..., πt) is the vector of true proportions of the population.

λ = (λ1, ..., λt) is the vector of observed proportions in the survey.

P = (p(i|j)) is the randomization device set by the interviewer, so that λ = P π.

Example:

| 0.16 |   | 0.60 0.20 0.00 0.10 |   | 0.10 |
| 0.25 | = | 0.20 0.50 0.20 0.10 | × | 0.30 |
| 0.32 |   | 0.15 0.15 0.70 0.30 |   | 0.20 |
| 0.27 |   | 0.05 0.15 0.10 0.50 |   | 0.40 |
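A minimal check of the vector-response model with the numbers above; inverting P recovers the true proportions from the observed ones.

```python
import numpy as np

P = np.array([[0.60, 0.20, 0.00, 0.10],
              [0.20, 0.50, 0.20, 0.10],
              [0.15, 0.15, 0.70, 0.30],
              [0.05, 0.15, 0.10, 0.50]])
pi = np.array([0.10, 0.30, 0.20, 0.40])   # true proportions

lam = P @ pi                      # observed proportions: [0.16, 0.25, 0.32, 0.27]
pi_hat = np.linalg.solve(P, lam)  # estimate pi_hat = P^-1 @ lam, recovers pi
```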

Page 27

Extension to Multi Attributes

m sensitive attributes A1, A2, ..., Am; each Aj has tj categories Aj1, ..., Ajtj.

Let πi1,...,im denote the true proportion corresponding to the combination (A1i1, ..., Amim), and let π be the vector with elements πi1,...,im (ij = 1, ..., tj), arranged lexicographically.

E.g., if m = 2, t1 = 2 and t2 = 3: π = (π11, π12, π13, π21, π22, π23)′

Simultaneous model: consider all variables as one compounded variable and apply the regular vector response RR technique, with P = P1 ⊗ P2 ⊗ ··· ⊗ Pm, where ⊗ stands for Kronecker product (see the sketch below).

Sequential model.
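A minimal illustration of building the compounded device with a Kronecker product for m = 2; the per-attribute matrices P1 and P2 are illustrative assumptions.

```python
import numpy as np

# Illustrative per-attribute randomization matrices (each column sums to 1).
P1 = np.array([[0.8, 0.2],
               [0.2, 0.8]])          # attribute A1: t1 = 2 categories
P2 = np.array([[0.7, 0.2, 0.1],
               [0.2, 0.6, 0.2],
               [0.1, 0.2, 0.7]])     # attribute A2: t2 = 3 categories

P = np.kron(P1, P2)                  # 6 x 6 device for the compounded variable
```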

Page 28

Disclosure Analysis

R: a typical response, which is "yes" or "no".

Posterior probabilities:

P(A|R) = πA P(R|A) / (πA P(R|A) + (1 − πA) P(R|Ā))

P(Ā|R) = 1 − P(A|R)

R is regarded as jeopardizing with respect to A or Ā if:

P(A|R) > πA  or  P(Ā|R) > πĀ

P(R|A) and P(R|Ā) are conditional probabilities set by the investigators.
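A small sketch of this disclosure check for Warner's device, where P("yes"|A) = p and P("yes"|Ā) = 1 − p; the numeric values are illustrative.

```python
def jeopardizes(pi_A, p_R_given_A, p_R_given_notA):
    """True if response R pushes either posterior above its prior."""
    num = pi_A * p_R_given_A
    post_A = num / (num + (1 - pi_A) * p_R_given_notA)  # Bayes' rule
    return post_A > pi_A or (1 - post_A) > (1 - pi_A)

# Warner's device with p = 0.7: P("yes"|A) = 0.7, P("yes"|not A) = 0.3.
print(jeopardizes(0.3, 0.7, 0.3))   # True: a "yes" answer is jeopardizing
```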

Page 29

Q&A

Xintao Wu, [email protected], http://www.sis.uncc.edu/~xwu

Data Privacy Lab: http://www.dpl.sis.uncc.edu