Multiplicative Data Perturbations
Outline
- Introduction
- Multiplicative data perturbations: rotation perturbation, geometric data perturbation, random projection
- Understanding: distance preservation, perturbation-invariant models
- Attacks: privacy evaluation model, background knowledge and attack analysis, attack-resilient optimization
- Comparison
Summary of additive perturbations
Problems:
- Vulnerable to various attacks
- The noise distribution must be published, so the column distributions become known
- Data mining algorithms must be developed or revised to use the perturbed data; so far we have only seen that decision tree and naïve Bayes classifiers can utilize additive perturbation
Benefits:
- Can be applied to both the Web model and the collaborative data pooling model
- Low cost
More thoughts about perturbation
1. Preserve privacy
- Hide the original data: it should not be easy to estimate the original values from the perturbed data
- Protect against data reconstruction techniques: the attacker may have prior knowledge about the published data
2. Preserve data utility for tasks
- Single-dimensional information: column distributions, etc.
- Multi-dimensional information: covariance matrix, distances, etc.
For most PP approaches, the privacy guarantee and the data utility/model accuracy are coupled:
- Difficult to balance the two factors
- Subject to attacks
- May need new DM algorithms: randomization, cryptographic approaches
Multiplicative perturbations
- Geometric data perturbation (GDP): rotation perturbation + translation perturbation + noise addition
- Random projection perturbation (RPP): a sketch approach
Definition of Geometric Data Perturbation
G(X) = R*X + T + D, where R is a random rotation, T is a random translation, and D is random noise (e.g., Gaussian noise).
Characteristics: R and T preserve distances exactly; D perturbs distances slightly.
Example (two records, three attributes), G(X) = R*X + T + D:

Original X:                 Perturbed G(X):
  ID     001    002           ID     001    002
  age     30     25           age   1158    943
  rent  1350   1000           rent  3143   2424
  tax   4230   3320           tax  -2919  -2297

with
  R = [ 0.83  -0.40   0.40
        0.20   0.86   0.46
        0.53   0.30  -0.79 ]
  T = [12, 18, 30]^T (the same translation applied to each record)
  D = [ -0.4    0.29
        -1.7   -1.1
         0.13   1.2 ]
* Each component plays its own role in enhancing resilience to attacks. (A small NumPy sketch of the full transformation follows.)
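Below is a minimal sketch of GDP in NumPy; it is illustrative only, and the specific choices (QR-based random rotation, uniform translation range, Gaussian noise scale) are assumptions rather than the parameters used in these slides.

```python
import numpy as np

def geometric_data_perturbation(X, noise_sigma=0.1, seed=0):
    """G(X) = R*X + T + D for data X of shape (d, n): one record per column."""
    rng = np.random.default_rng(seed)
    d, n = X.shape
    # Random orthonormal matrix via QR decomposition of a Gaussian matrix;
    # an orthonormal R (rotation or reflection) preserves all pairwise distances.
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    R = Q
    # Random translation: one offset per dimension, shared by all records.
    T = rng.uniform(-1.0, 1.0, size=(d, 1))
    # Small Gaussian noise that slightly perturbs distances.
    D = rng.normal(0.0, noise_sigma, size=(d, n))
    return R @ X + T + D, (R, T)

# The two-record, three-attribute example from the slide:
X = np.array([[30.0, 25.0], [1350.0, 1000.0], [4230.0, 3320.0]])
GX, (R, T) = geometric_data_perturbation(X)
```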
Benefits of Geometric Data Perturbation
- Privacy guarantee and data utility/model accuracy are decoupled, which makes optimization and balancing easier: model accuracy is almost fully preserved, so we only need to optimize privacy
- Applicable to many DM algorithms: distance-based clustering; classification with linear models, KNN, kernel methods, SVM, …
- Resilient to attacks (the result of attack research)
Limitations
- Multiplicative perturbations are mostly used in outsourcing (cloud computing); they can be applied to multiparty collaborative computing in some cases
- The Web model does not fit: the perturbation parameters cannot be published
Definition of Random Projection Perturbation
F(X) = P*X, where X is an m*n matrix holding n records as columns with m attributes each, and P is a k*m random matrix with k <= m.
Johnson-Lindenstrauss Lemma: there is a random projection F() such that, for a small e < 1,
(1-e)||x-y|| <= ||F(x)-F(y)|| <= (1+e)||x-y||
i.e., distances are approximately preserved.
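A minimal sketch of RPP, assuming a Gaussian random projection matrix scaled by 1/sqrt(k) (one common construction consistent with the Johnson-Lindenstrauss lemma; the slides do not prescribe a particular construction):

```python
import numpy as np

def random_projection_perturbation(X, k, seed=0):
    """F(X) = P*X: project m-dimensional records (columns of X) down to k dimensions."""
    rng = np.random.default_rng(seed)
    m, _ = X.shape
    # Gaussian entries scaled by 1/sqrt(k) approximately preserve pairwise distances.
    P = rng.standard_normal((k, m)) / np.sqrt(k)
    return P @ X

# Pairwise distances change only slightly after projection:
X = np.random.default_rng(1).standard_normal((50, 200))   # 50 dimensions, 200 records
FX = random_projection_perturbation(X, k=30)
d_before = np.linalg.norm(X[:, 0] - X[:, 1])
d_after = np.linalg.norm(FX[:, 0] - FX[:, 1])
```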
Comparison between GDP and RPP
Privacy preservation:
- Both are subject to similar kinds of attacks
- RPP is more resilient to distance-based attacks
Utility preservation (model accuracy):
- GDP preserves distances well
- RPP only approximately preserves distances, so model accuracy is not guaranteed
Illustration of multiplicative data perturbation: distances are preserved while each individual dimension is perturbed.
A model "invariant" to GDP
If distance plays the important role: class/cluster membership and decision boundaries are determined by distances, not by the concrete locations of the points.
(2D example figure: the classification boundary between Class 1 and Class 2 is unchanged after rotation and translation, and only slightly changed after distance perturbation, i.e., noise addition.)
Applicable DM algorithms
Modeling methods that depend on Euclidean geometric properties are "invariant" to GDP:
- All Euclidean-distance-based clustering algorithms
- Classification algorithms: k-nearest neighbors, kernel methods, linear classifiers, support vector machines
- Most regression models, and potentially more …
A quick check of this invariance is sketched below.
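The sketch trains a k-nearest-neighbor classifier on original and on rotated-plus-translated data and compares accuracy; scikit-learn and the Iris dataset are assumptions made for the example, not something used in the slides.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
d = X.shape[1]
R, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random rotation
T = rng.uniform(-1.0, 1.0, size=d)                 # random translation

def perturb(A):
    # Records are rows here, so the rotation is applied on the right.
    return A @ R.T + T

acc_original = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr).score(X_te, y_te)
acc_perturbed = KNeighborsClassifier(n_neighbors=5).fit(perturb(X_tr), y_tr).score(perturb(X_te), y_te)
# Euclidean distances are unchanged by R and T, so the accuracies should agree
# (up to tie-breaking among equidistant neighbors).
print(acc_original, acc_perturbed)
```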
When to Use Multiplicative Data Perturbation
(Workflow figure: the data owner computes G(X) = RX + T + D and sends G(X) to the service provider/data user; the provider mines models/patterns F from G(X) and applies F to G(Xnew).)
Good for the outsourcing model. Major issue: curious service providers/data users may try to break G(X).
Major issue: attacks!
Many existing privacy-preserving methods turn out to be much less effective once attacks are taken into account. Example: the various data reconstruction algorithms against the random noise addition approach [Huang05][Guo06].
Prior knowledge: the service provider Y may have prior knowledge about X's domain, and nothing stops Y from using it to infer information from the sanitized data.
Knowledge used to attack GDP
Three levels of knowledge:
- Knows nothing: naïve estimation
- Knows the column distributions: Independent Component Analysis (ICA)
- Knows specific input-output records (original points and their images in the perturbed data): distance inference
Methodology of attack analysis
An attack is an estimate of the original data
Original O(x1, x2,…, xn) vs. estimate P(x’1, x’2,…, x’n)
How similar are these two series?
One effective method is to evaluate the MSE of the estimation, i.e., VAR(P-O) or STD(P-O).
Two multi-column privacy metrics
qi: the privacy guarantee for column i, qi = std(Pi - Oi), where Oi are the normalized original column values and Pi the estimated column values.
Min privacy guarantee (the weakest link over all columns): min { qi, i = 1..d }
Avg privacy guarantee (overall privacy guarantee): (1/d) * sum_{i=1..d} qi
Both metrics are sketched in code below.
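A small sketch of the two metrics; zero-mean, unit-variance column normalization is an assumption here (the slides only say the columns are normalized):

```python
import numpy as np

def privacy_guarantees(original, estimated):
    """Min and average privacy guarantee over columns.

    original, estimated: arrays of shape (n_records, d).
    q_i = std(P_i - O_i) computed on normalized columns.
    """
    mu = original.mean(axis=0)
    sigma = original.std(axis=0)
    O = (original - mu) / sigma
    P = (estimated - mu) / sigma      # normalize both with the same scale
    q = (P - O).std(axis=0)           # per-column std of the estimation error
    return q.min(), q.mean()
```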
Alternative metric
Based on Agrawal's information-theoretic measure, the loss of privacy: PI = 1 - 2^(-I(X; X^)), where X^ is the estimate of X and I(X; X^) = H(X) - H(X|X^) = H(X) - H(estimation error).
- Exact estimation: H(X|X^) = 0, so PI = 1 - 2^(-H(X))
- Random estimation: I(X; X^) = 0, so PI = 0
This metric is already normalized across different columns.
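A rough sketch of this metric for a single column, using a histogram-based entropy estimate; the binning approach is an assumption and only a coarse approximation of the information-theoretic quantities above:

```python
import numpy as np

def hist_entropy(values, bins=32):
    """Shannon entropy (in bits) of a histogram estimate of the distribution."""
    counts, _ = np.histogram(values, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log2(p)).sum())

def loss_of_privacy(x, x_hat, bins=32):
    """PI = 1 - 2^(-I(X; X_hat)), using I(X; X_hat) ~= H(X) - H(estimation error)."""
    mi = max(hist_entropy(x, bins) - hist_entropy(x - x_hat, bins), 0.0)
    return 1.0 - 2.0 ** (-mi)
```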
Attack 1: naïve estimation
Estimate the original points purely from the perturbed data. If only "random rotation" is used, the intensity of the perturbation matters: points near the origin are barely moved.
(Figure: the Class 1/Class 2 classification boundary under rotations of different intensity.)
Counter naïve estimation
- Maximize the perturbation intensity, based on a formal analysis of "rotation intensity"; the Fast_Opt algorithm in GDP maximizes the intensity
- The "random translation" T hides the origin and increases the difficulty of attacking: the attacker needs to estimate R first in order to find T
Attack 2: ICA-based attacks
Independent Component Analysis (ICA) tries to separate R and X from Y = R*X.
Characteristics of ICA:
1. The ordering of dimensions is not preserved.
2. The intensity (value range) is not preserved.
Conditions for an effective ICA attack:
1. Knowing the column distributions.
2. Knowing the value ranges.
A sketch of such an attack is shown below.
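One way an attacker might attempt this is with FastICA from scikit-learn (an assumption made for illustration; the slides do not name a specific ICA implementation). The recovered components come back in arbitrary order and scale, which is exactly the limitation listed above:

```python
import numpy as np
from sklearn.decomposition import FastICA

def ica_attack(Y, seed=0):
    """Attempt to separate the sources from Y = R @ X (records as columns).

    Returns the estimated source signals, up to unknown ordering and scaling.
    """
    ica = FastICA(n_components=Y.shape[0], random_state=seed)
    S = ica.fit_transform(Y.T)   # FastICA expects samples as rows
    return S.T

# The attacker must still match the recovered components to the original
# columns and rescale them, using known column distributions and value ranges.
```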
Counter ICA attack
Weaknesses of the ICA attack: it needs a certain amount of knowledge and cannot effectively handle dependent columns.
In reality, most datasets have correlated columns, so we can find an optimal rotation perturbation that maximizes the difficulty of ICA attacks.
Attack 3: distance-inference attack
With only rotation/translation perturbation, suppose the attacker knows a set of original points and their mapping (their images in the perturbed data).
How the attack is done: knowing the points and their images, the attacker finds the exact images of the known points by enumerating pairs with matched distances. This is less effective for large data; here we assume the pairs are successfully identified.
Estimation:
1. Cancel the random translation T from the pairs (x, x')
2. Calculate R from the pairs: Y = RX, so R = Y*X^(-1)
3. Calculate T using R and the known pairs
Counter distance-inference: Noise addition
The noise brings enough variance into the estimation of R and T: the attacker now has to use regression to estimate R and then use the approximate R to estimate T, which increases the uncertainty.
Regression:
1. Cancel the random translation T from the pairs (x, x')
2. Estimate R from the pairs: Y = RX, so R = (Y*X^T)(X*X^T)^(-1)
3. Use the estimated R and the known pairs to estimate T (sketched below)
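A sketch of this estimation from matched pairs; the helper name and the centering-based way of canceling T are assumptions, but the least-squares formula is the one given above:

```python
import numpy as np

def estimate_R_T(X_known, Y_known):
    """Estimate R and T from matched pairs, assuming Y ~= R @ X + T (+ noise).

    X_known, Y_known: shape (d, k) with k known originals and their images;
    the regression needs enough pairs (roughly k > d) to be well-posed.
    """
    # 1. Cancel the random translation by centering both point sets.
    Xc = X_known - X_known.mean(axis=1, keepdims=True)
    Yc = Y_known - Y_known.mean(axis=1, keepdims=True)
    # 2. Least-squares estimate: R = (Y*X^T)(X*X^T)^(-1) on the centered pairs.
    R_hat = (Yc @ Xc.T) @ np.linalg.inv(Xc @ Xc.T)
    # 3. Recover T from the estimated R and the column means.
    T_hat = Y_known.mean(axis=1, keepdims=True) - R_hat @ X_known.mean(axis=1, keepdims=True)
    return R_hat, T_hat
```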
Discussion
- Can the noise be easily filtered? Filtering would require knowing the noise distribution and the distribution of RX + T, but neither distribution is published. The attack analysis therefore differs from that for noise-addition data perturbation. Will PCA-based noise filtering [Huang05] be effective?
- What is the best estimate the attacker can get? If we treat the attack problem as a learning problem, the minimum variance of the learner's error gives an upper bound on the "loss of privacy".
Attackers with more knowledge?
What if attackers know a large number of original records? They can then accurately estimate the covariance matrix, column distributions, column ranges, etc., of the original data, and methods such as PCA and AK_ICA can be used.
What do we do? If you have already released that much original information, stop releasing data.
A randomized perturbation optimization algorithm
Start with a random rotation. Goal: passing tests on simulated attacks. Not simply random: a hill-climbing method (see the sketch below).
1. Iteratively determine R:
   - test against naïve estimation (Fast_Opt)
   - test against ICA (2nd level)
   - keep the better rotation R
2. Append a random translation component
3. Append an appropriate noise component
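A highly simplified sketch of this search; simulate_attacks is a hypothetical placeholder for the naïve-estimation and ICA tests (it should return a privacy score such as the minimum privacy guarantee, higher being better), and random restarts stand in for the actual hill-climbing moves:

```python
import numpy as np

def optimize_rotation(X, simulate_attacks, n_iters=50, seed=0):
    """Search for a rotation R that scores well against simulated attacks."""
    rng = np.random.default_rng(seed)
    d = X.shape[0]

    def random_rotation():
        Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
        return Q

    best_R = random_rotation()
    best_score = simulate_attacks(best_R, X)
    for _ in range(n_iters):
        candidate = random_rotation()   # a local tweak of best_R would be closer to true hill-climbing
        score = simulate_attacks(candidate, X)
        if score > best_score:          # keep the rotation that resists the attacks better
            best_R, best_score = candidate, score
    return best_R
```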
Comparison of methods
Privacy preservation:
- In general, RPP should be better than GDP
- For GDP, evaluate the effect of attacks; the ICA and distance-perturbation attacks need experimental evaluation
Utility preservation:
- GDP: R and T exactly preserve distances; the effect of D needs experimental evaluation
- RPP: number of projected dimensions vs. utility
Datasets: 12 datasets from the UCI Data Repository
Privacy guarantee: GDP, under naïve estimation and ICA-based attacks
Uses only the random rotation and translation components (R*X + T). Compared settings: the worst perturbation (no optimization), perturbation optimized for naïve estimation only, and perturbation optimized for both attacks.
Privacy guarantee: GDP, under distance-inference attacks
Uses all three components (R*X + T + D), with Gaussian noise D ~ N(0, sigma^2), assuming the attacker has identified pairs of (original, image) points. With no noise addition the privacy guarantee is 0; the privacy guarantee is already considerably high around a small perturbation, sigma = 0.1.
Data utility: GDP with noise addition
Noise addition vs. model accuracy, with noise N(0, 0.1^2).
Data utility: RPP
Reduced number of projected dimensions vs. model accuracy, for KNN classifiers, SVMs, and perceptrons.