Additive Data Perturbation: the Basic Problem and Techniques

Outline Motivation Definition Privacy metrics Distribution reconstruction methods Privacy-preserving data mining with

additive data perturbation Summary

Note: focus on the papers: 10 and 11

Motivation Web-based computing

Observations Only a few sensitive attributes need protection Allow individual user to perform protection with

low cost Some data mining algorithms work on

distribution instead of individual records

WebApps

user 1 user 1 user 1

Privateinfo

Definition of dataset Column by row table Each row is a record, or a vector Each column represents an attribute We also call it multidimensional data

10 1.0 100

12 2.0 20

A 3-dimensional record

2 records in the 3-attribute dataset

Additive perturbation Definition

Z = X+Y X is the original value, Y is random noise and Z is the

perturbed value Data Z and the parameters of Y are published

e.g., Y is Gaussian N(0,1)

History

Used in statistical databases to protect sensitive attributes (late 80s to 90s) [check paper14]

Benefit Allow distribution reconstruction Allow individual user to do perturbation

Publish the noise distribution

Applications in data mining Distribution reconstruction algorithms

Rakesh’s algorithm Expectation-Maximization (EM) algorithm

Column-distribution based algorithms Decision tree Naïve Bayes classifier

Major issues Privacy metrics Distribution reconstruction algorithms Metrics for loss of information

A tradeoff between loss of information and privacy

Privacy metrics for additive perturbation Variance/confidence based definition Mutual information based definition

Variance/confidence based definition

Method Based on attacker’s view: value estimation

Knowing perturbed data, and noise distribution No other prior knowledge

Estimation method

Perturbed value

Confidence interval: the range having c% prob that the real value is in

Y: zero mean, std is the important factor, i.e., var(Z-X) = 2

Given Z, X is distant from Z in the Z+/- range with c% confWe often ignore the confidence c% and use to represent thedifficulty of value estimation.

Problem with Var/conf metric No knowledge about the original data

is incorporated Knowledge about the original data

distribution which will be discovered with distribution

reconstruction, in additive perturbation can be known in prior in some applications

Other prior knowledge may introduce more types of attacks Privacy evaluation need to incorporate these

attacks

Mutual information based method incorporating the original data

distribution Concept: Uncertainty entropy

Difficulty of estimation… the amount of privacy…

Intuition: knowing the perturbed data Z and the noise Y distribution, how much uncertainty of X is reduced. Z,Y do not help in estimate X all uncertainty of

X is preserved: privacy = 1 Otherwise: 0<= privacy <1

Definition of mutual information Entropy: h(A) evaluate uncertainty of A

Not easy to estimate high entropy Distributions with the same variance uniform has

the largest entropy Conditional entropy: h(A|B)

If we know the random variable B, how much is the uncertainty of A

If B is not independent of A, the uncertainty of A can be reduced, (B helps explain A) i.e., h(A|B) <h(A)

Mutual information I(A;B) = h(A)-h(A|B) Evaluate the information brought by B in estimating

A Note: I(A;B) == I(B;A)

Inherent privacy of a random variable Using uniform variable as the reference 2^h(A) Make the definition consistent with Rakesh’s

approach

MI based privacy metricP(A|B) = 1-2^-I(A;B), the lost privacy I(A;B) =0 B does not help estimate A

Privacy is fully preserved, the lost privacy P(A|B) =0 I(A;B) >0 0<P(A|B)<1

Calculation for additive perturbation: I(X;Z) = h(Z) – h(Z|X) = h(Z) – h(Y)

Distribution reconstruction Problem: Z= X+Y

Know noise Y’s distribution Fy Know the perturbed values z1, z2,…zn Estimate the distribution Fx

Basic methods Rakesh’s method EM esitmation

Rakesh’s algorithm (paper 10) Find distribution P(X|X+Y) three key points to understand it

Bayes rule: P(X|X+Y) = P(X+Y|X) P(X)/P(X+Y)

Conditional prob fx+y(X+Y=w|X=x) = fy(w-x)

Prob at the point a uses the average of all sample estimates

Using fx(a)?

The iterative algorithm

Stop criterion: the difference between two consecutive fx estimates is small

Make it more efficient… Bintize the range of x

Discretize the previous formula

m(x) mid-point of the bin that x is inLt = length of interval t

Weakness of Rakesh’s algorithm No convergence proof Don’t know if the iteration gives the

globally optimal result

EM algorithm (paper 11) Using discretized bins to approximate the

distribution

Maximum Likelihood Estimation (MLE) method X1,x2,…, xn are Independent and identically

distributed Joint distribution

f(x1,x2,…,xn|) = f(x1|)*f(x2|)*…f(xn|) MLE principle:

Find that maximizes f(x1,x2,…,xn|) Often maximize log f(x1,x2,…,xn|) = sum log f(xi|)

Density (the height) of Bin i is notated as i

Basic idea of the EM alogrithm Q(,^) is the MLE function

is the bin densities (1, 2,… k), and ^ is the previous estimate of .

EM algorithm1. Initial ^ : uniform distribution2. In each iteration: find the current that maximize

Q(,^) based on previous estimate ^, and z

zj – upper(i)<=Y<=zj – lower(i)

EM algorithm has properties Unique global optimal solution ^ converges to the MLE solution

Evaluating loss of information The information that additive

perturbation wants to preserve Column distribution

First metric Difference between the estimate and the

original distribution

Evaluating loss of information Indirect metric

Modeling quality The accuracy of classifier, if used for classification

modeling

Evaluation method Accuracy of the classifier trained on the original

data Accuracy of the classifier trained on the

reconstructed distribution

DM with Additive Perturbation Example: decision tree A brief introduction to decision tree algorithm

There are many versions… One version working on continuous attributes

Split evaluation gini(S) = 1- sum(pj^2)

Pj is the relativ frequency of class j in S gini_split(S) = n1/n*gini(S1)+n2/n*gini(S2) The smaller the better

Procedure Get the distribution of each attribute Scan through each bin in the attribute and calculate

the gini_split index problem: how to determine pj Find the minimum one

An approximate method to determine pj The original domain is partitioned to m bins Reconstruction gives an distribution over the bins n1, n2,

…nm Sort the perturbed data by the target attribute assign the records sequentially to the bins according to the

distribution Look at the class labels associated with the records

Errors happen because we use perturbed values to determine the bin identification of each record

When to reconstruct distribution

Global – calculate once By class – calculate once per class Local – by class at each node

Empirical study shows By class and Local are more effective

Problems with paper 10,11 Privacy evaluation

Didn’t consider in-depth attacking methods Data reconstruction methods

Loss of information Negatively related to privacy Not directly related to modeling

Accuracy of distribution reconstruction vs. accuracy of classifier ?

Summary We discussed the basic methods with

additive perturbation Definition Privacy metrics Distribution reconstruction

The problem with privacy evaluation is not complete Attacks Covered by next class

Additive Data Perturbation: the Basic Problem and Techniques

Documents

Additive manufacturing (AM) in medicineare additive fabrication, additive processes, additive techniques, additive layer manufacturing, layer manufacturing, and freeform fabrication

Perturbation Techniques in ', 4A9 Conduction -Controlled ... · PDF filePerturbation Techniques in "', 4A9 Conduction -Controlled ... heat balance integral, ... Perturbation Techniques

FABRICATION ADDITIVE TECHNOLOGIES CAPACITÉS ET … · La fabrication additive : qu’est-ce que c’est? •La plupart des techniques conventionnelles ... Qu’est-ce que c’est?

Breaching Euclidean Distance-Preserving Data Perturbation Using … · 2018-10-27 · 2.1. General Data Perturbation and Transformation Methods Additive perturbation: Adding i.i.d

Lithography; Etching; Additive Techniques MP4004 2 ......Advanced Manufacturing and Nanotechnology A/P Yeo Swee Hock Content 1. Lithography; Etching; Additive S Techniques.H. r 2

Using wobble based laser scanning techniques in additive … · 2020. 7. 27. · Lasers in Manufacturing Conference 2019 Using wobble based laser scanning techniques in additive manufacturing

Advanced Feedstock for Additive Manufacturing from Molten ......Advanced Feedstock for Additive Manufacturing from Molten Regolith Electrolysis (MRE) 3 techniques. This and its higher

Additive Data Perturbation: data reconstruction attacks

The Utilisation of Additive Manufacturing Techniques in

Respiratory resistance measured by an airflow perturbation device · 2017-04-26 · 22 C G Lausted and A T Johnson Airflow perturbation techniques were proposed by several groups

Simulation Techniques Prioritization for the Additive …€¦ · Simulation Techniques Prioritization for the Additive Manufacturing Integration in Traditional Production Contexts

Perturbation Notation Perturbation Perturbation Methodsdejong/Perturbation Methods.pdf · Perturbation DND Notation Introduction to Perturbation Methods Using Perturbation to Approximate

Distance à centre additive - Operations Research...DISTANCE À CENTRE AUDITIVE 361 La perturbation par une distance à centre additive consiste à substituer à une dissimilarité

Perturbation Techniques in ', 4A9 Conduction …strained coordinates, the method of matched asymptotic expansions, and the recently developed method of extended perturbation series

Respiratory resistance measured by an airflow perturbation …artjohns/apd/apd-published... · 2007. 7. 31. · 22 C G Lausted and A T Johnson Airflow perturbation techniques were

Secure Systems 2.0Subtractive Security Techniques 10 •Additive methods add protections to thwart attacks • Verifying additive measures requires a nonexistence proof • For all

Additive Manufacturing Techniques for Fabrication of Bone

Webinar - Adding to Additive Manufacturing with …...1 Webinar - Adding to Additive Manufacturing with Particle Size and Shape Analysis Many Additive Manufacturing (3D printing) techniques

Analysis of 3D Biocompatible Additive Structures Using COMSOL … · 2014-11-05 · additive manufacturing techniques bone substitutes and 3D scaffolds for the growth of advanced

Creative Optimization with Additive Manufacturing Webinar ... · Many Additive Manufacturing (also called 3D printing) techniques such as selective laser sintering (SLS), selective