
Calibrating Noise to Sensitivity in Private Data Analysis

Kobbi Nissim

BGU

With Cynthia Dwork, Frank McSherry, Adam Smith, Enav Weinreb

The Setting

[Diagram: a database x = (x_1, …, x_n) ∈ D^n (n rows, each from a domain D) is held by the sanitizer San; users (government, researchers, marketers, …) send queries and San returns answers]

"I just want to learn a few harmless global statistics."

"Can I combine these to learn some private info?"

What is privacy?

Clearly we cannot undo the harm done by others.
Can we minimize the additional harm while providing utility?

Goal: Whether or not I contribute my data does not affect my privacy.

Output Perturbation

[Diagram: San holds x = (x_1, …, x_n) and random coins; given a query f it returns f(x) + noise]

San controls:
which functions f
the kind of perturbation

When Can I Release f(x) accurately?

Intuition: global information is “insensitive” to individual data and is safe

f(x1,…,xn) is sensitive if changing a few entries can drastically change its value

Talk Outline

A framework for output perturbation based on "sensitivity"
Formalize "sensitivity" and relate it to privacy definitions
Examples of sensitivity-based analysis
New ideas

Basic models for privacy
Local vs. global
Noninteractive vs. interactive

Related Work

Relevant work in statistics, data mining, computer security, databases
Largely: no precise definitions and analysis of privacy
Recently: a foundational approach
[DN03, EGS03, DN04, BDMN05, KMN05, CDMSW05, CDMT05, MS06, CM06, …]

This work extends [DN03, DN04, BDMN05]

Privacy as Indistinguishability

[Diagram: two runs of San with its random coins, one on x = (x_1, …, x_n) and one on x' which differs from x in a single row (x_2 replaced by x_2'); each run produces a transcript of queries and answers (query 1, answer 1, …, query T, answer T), denoted T(x) and T(x')]

Requirement: when x and x' differ in 1 row, the two transcript distributions are at "distance" < ε.

ε-Indistinguishability

A sanitizer is ε-indistinguishable if for all pairs x, x' ∈ D^n which differ on at most one entry, for all adversaries A, and for all transcripts t:

Pr[T_A(x) = t] / Pr[T_A(x') = t] ≤ e^ε

Semantically Flavored Definitions

Indistinguishability is easy to work with, but does not directly say what the adversary can do and learn.

"Ideal" semantic definition: the adversary does not change his beliefs about me.
Problem: dependencies, e.g. in the form of side information.
Say you know that I am 20 pounds heavier than the average Israeli… you will learn my weight from census results, whether or not I participate.

Ways to get around this:
Assume "independence" of X_1, …, X_n [DN03, DN04, BDMN05]
Compare "what A knows now" vs. "what A would have learned anyway" [DM]

Incremental Risk

Suppose the adversary has prior "beliefs" about x: a probability distribution, i.e. a random variable X = (X_1, …, X_n).

Given a transcript t, the adversary updates his "beliefs" according to Bayes' rule; the new distribution is X'_i | T(X) = t.

Incremental Risk

Two options:
I participate in the census (input = X)
I do not participate (input Y_i = X_1, …, X_{i-1}, *, X_{i+1}, …, X_n)

Privacy: whether I participate or not does not significantly influence the adversary's posterior beliefs:
for all transcripts t, for all i: X'_i | T(X) = t ≈ X'_i | T(Y_i) = t

[Diagram: San run on X vs. San run on Y_i]

"Proof": indistinguishability guarantees that the Bayesian updates are the same to within a factor of 1 ± ε.

"Bugger! It's the same whether you participate or not."

Recall – ε-Indistinguishability

For all pairs x, x' ∈ D^n s.t. dist(x, x') = 1, for all transcripts t:

Pr[T_A(x) = t] / Pr[T_A(x') = t] ≤ e^ε

An Example – Sum Queries

[Diagram: the user asks San "please let me know f_A(x) = Σ_{i∈A} x_i"; San returns f_A(x) + noise]

x ∈ [0,1]^n, f_A(x) = Σ_{i∈A} x_i
Can be used as a basis for other tasks: clustering, learning, classification… [BDMN05]

Answer: Σ_{i∈A} x_i + Y where Y ~ Lap(1/ε)
Laplace distribution: h(y) ∝ e^{-ε|y|}
Note: |f_A(x) - f_A(x')| ≤ 1 whenever x, x' differ in one entry

Sum Queries – Answering a Query

Property of Lap: for all y, y': h(y)/h(y') ≤ e^{ε|y - y'|}

Pr[T(x) = t] ∝ e^{-ε|f_A(x) - t|}
Pr[T(x') = t] ∝ e^{-ε|f_A(x') - t|}

Pr[T(x) = t] / Pr[T(x') = t] ≤ e^{ε|f_A(x) - f_A(x')|} ≤ e^ε

Sum Queries – Proof of ε-Indistinguishability

max_{dist(x,x')=1} |f_A(x) - f_A(x')| = 1

[Figure: the noise distributions centered at f(x) and f(x')]

We chose the noise magnitude to cover for max |f(x) - f(x')|.

Sensitivity: S_f = max_{dist(x,x')=1} ||f(x) - f(x')||_1
Local sensitivity: LS_f(x) = max_{x': dist(x,x')=1} ||f(x) - f(x')||_1

Sensitivity

[Diagram: two neighboring databases x and x' with dist(x, x') = 1 (they differ in one row); San answers f(x) + noise on the first and f(x') + noise on the second]

Calibrating Noise to Sensitivity

[Diagram: the user asks San "please let me know f(x)"; San returns f(x) + Lap(S_f / ε)]

x ∈ D^n
Noise density: h(y) ∝ e^{-(ε/S_f) ||y||_1}
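As a sketch, the general mechanism might look like the following in Python; the caller must supply an upper bound on S_f, and the function name, interface, and the mean example are illustrative rather than taken from the talk:

```python
import numpy as np

def laplace_mechanism(f, x, sensitivity, eps, rng):
    """Release f(x) + i.i.d. Lap(sensitivity/eps) noise on every coordinate.

    `sensitivity` must upper-bound S_f = max over databases differing in one
    row of ||f(x) - f(x')||_1; it is a property of f, not of this x.
    """
    value = np.atleast_1d(np.asarray(f(x), dtype=float))
    noise = rng.laplace(loc=0.0, scale=sensitivity / eps, size=value.shape)
    return value + noise

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=200)
# Example: the mean of values in [0,1]^n has sensitivity 1/n.
print(laplace_mechanism(lambda d: d.mean(), x, sensitivity=1.0 / len(x), eps=0.5))
```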

Calibrating Noise to Sensitivity - Why it Works?

Sf = max |f(x)-f(x’)|1

Property of Lap: x,y: h(x)/h(y) e||x-y||1

Pr[T(x)=t] / Pr[T(x’)=t] e / Sf ||fA(x)- fA(x’)||1 e

dist(x,x’)=1 h(y) e-/Sf ||y||1

Main Result

Theorem: If a user U is limited to T adaptive queries, each of sensitivity at most S_f, then ε-indistinguishability holds if i.i.d. noise Lap(S_f · T / ε) is added to each query answer.

The same idea works with other metrics and noise distributions.

Which useful functions are insensitive?
All useful functions should be insensitive…
Statistical conclusions should not depend on small variations in the data.
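A hedged sketch of a sanitizer enforcing this query budget; the Sanitizer class and its interface are invented for illustration, the talk only states the noise magnitude Lap(S_f·T/ε):

```python
import numpy as np

class Sanitizer:
    """Answers up to T adaptive queries, each assumed to have sensitivity
    at most s_f, adding i.i.d. Lap(s_f * T / eps) noise to every answer."""

    def __init__(self, x, eps, T, s_f, seed=0):
        self.x, self.eps, self.T, self.s_f = x, eps, T, s_f
        self.asked = 0
        self.rng = np.random.default_rng(seed)

    def answer(self, f):
        if self.asked >= self.T:
            raise RuntimeError("query budget of T queries exhausted")
        self.asked += 1
        return f(self.x) + self.rng.laplace(0.0, self.s_f * self.T / self.eps)

# Three adaptive sum queries over x in [0,1]^n; each has sensitivity 1.
x = np.random.default_rng(2).uniform(0.0, 1.0, size=1000)
san = Sanitizer(x, eps=1.0, T=3, s_f=1.0)
a1 = san.answer(lambda d: d[:500].sum())
a2 = san.answer(lambda d: d[500:].sum())   # the choice here may depend on a1
print(a1, a2, san.answer(lambda d: d.sum()))
```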

Using insensitive functions

Strategies:

Use the theorem: output f(x) + Lap(S_f / ε).
But S_f may be hard to analyze/compute, and S_f may be high for functions considered "insensitive".

Express f in terms of insensitive functions.
The resulting noise then depends on the input (in form and magnitude).

Example - Expressing f in terms of insensitive functions

x ∈ {0,1}^n, f(x) = (Σ x_i)^2
S_f = n^2 - (n-1)^2 = 2n - 1, so a_f = (Σ x_i)^2 + Lap(2n/ε); if f(x) << n the noise dominates.

However f(x) = (g(x))^2 where g(x) = Σ x_i, and S_g = 1, so it is better to query for g:
get a_g = Σ x_i + Lap(1/ε) and estimate f(x) as (a_g)^2 - (1/ε)^2.
Taking ε constant results in stddev O(Σ x_i).
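The following sketch contrasts the two strategies on synthetic data; the bias-correction term follows the slide's estimator (note that the exact variance of Lap(1/ε) is 2/ε², so the correction is only approximate):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.integers(0, 2, size=1000)       # x in {0,1}^n
n, eps = len(x), 0.5
s = int(x.sum())

# Direct query: f(x) = (sum x_i)^2 has sensitivity 2n-1, so noise of scale
# about 2n/eps is added and can dwarf f(x) whenever sum x_i << n.
a_f = s ** 2 + rng.laplace(0.0, 2 * n / eps)

# Indirect: query g(x) = sum x_i (sensitivity 1), then square the noisy
# answer; subtracting (1/eps)^2 reduces the bias introduced by squaring
# the noise (the slide's estimator).
a_g = s + rng.laplace(0.0, 1.0 / eps)
f_est = a_g ** 2 - (1.0 / eps) ** 2

print("true:", s ** 2, " direct:", round(a_f), " via g:", round(f_est))
```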

Useful Insensitive functions

Means, variances, … (with appropriate assumptions on the data)
Histograms & contingency tables
Singular value decomposition
Distance to a property
Functions with low query complexity

Histograms/Contingency Tables

x_1, …, x_n ∈ D, where D is partitioned into d disjoint bins b_1, …, b_d
h(x) = (v_1, …, v_d) where v_j = |{i : x_i ∈ b_j}|
S_h = 2: changing one value x_i changes the count vector by ≤ 2 (in L_1), irrespective of d
Add Laplace noise with std. dev. 2/ε to each count

[Figure: histogram over bins b_1, b_2, …, b_d]

(Can do that with sum queries…)
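A small illustrative sketch of the histogram release; the bin partition and data below are made up:

```python
import numpy as np

def private_histogram(x, bins, eps, rng):
    """Release all d bin counts at once, each with Lap(2/eps) noise.

    Moving one row from one bin to another changes the count vector by at
    most 2 in L_1 norm, so the noise scale does not grow with d.
    """
    counts = np.array([sum(1 for xi in x if xi in b) for b in bins], float)
    return counts + rng.laplace(0.0, 2.0 / eps, size=counts.shape)

rng = np.random.default_rng(4)
x = rng.integers(0, 10, size=500)                        # domain D = {0,...,9}
bins = [{0, 1}, {2, 3}, {4, 5}, {6, 7}, {8, 9}]          # d = 5 disjoint bins
print(np.round(private_histogram(x, bins, eps=0.5)))
```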

Distance to a Property

Say P = a set of "good" databases.
Distance to P = the minimum number of points in x that must be changed to make x ∈ P.
This always has sensitivity 1, so add Laplace noise with stddev 1/ε.

Examples:
distance to being clusterable
weight of the minimum cut in a graph

[Figure: a point x and its distance to the set P]
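As a toy instance in code, where the property "at most k ones" is a hypothetical stand-in for the slide's examples (clusterability, min-cut weight):

```python
import numpy as np

def distance_to_property(x, k):
    """Distance from x in {0,1}^n to P = {databases with at most k ones}:
    the number of entries that must be changed to land in P."""
    return max(0, int(np.sum(x)) - k)

def noisy_distance(x, k, eps, rng):
    # Changing one row changes any such distance by at most 1 (sensitivity 1),
    # so Lap(1/eps) noise suffices, whatever the property P is.
    return distance_to_property(x, k) + rng.laplace(0.0, 1.0 / eps)

rng = np.random.default_rng(5)
x = rng.integers(0, 2, size=1000)
print(noisy_distance(x, k=400, eps=0.25, rng=rng))
```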

Approximations with Low Query Complexity

Lemma: Assume an algorithm A that randomly samples εn points and satisfies Pr[A(x) ∈ f(x) ± λ] > (1+ε)/2. Then S_f ≤ 2λ.

Proof: Consider x, x' that differ on point i, and let A_i be A conditioned on not choosing point i.
Pr[A_i(x) ∈ f(x) ± λ] > 1/2 (conditioned on point i not being sampled)
Pr[A_i(x') ∈ f(x') ± λ] > 1/2 (conditioned on point i not being sampled)
A_i(x) and A_i(x') have the same distribution, so there exists a point p within λ of both f(x) and f(x'), hence S_f ≤ 2λ.

Local Sensitivity

The median is typically insensitive, yet has large (global) sensitivity.

LS_f(x) = max_{x': dist(x,x')=1} ||f(x) - f(x')||_1

Example: f(x) = min(Σ x_i, 10) where x_i ∈ {0,1}
LS_f(x) = 1 if Σ x_i ≤ 10, and 0 otherwise

[Figure: f(x) as a function of Σ x_i, increasing up to 10 and flat afterwards]
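For this capped-sum example the local sensitivity has a simple closed form; a small sketch (the cap of 10 and the two test inputs follow the slide's example):

```python
import numpy as np

CAP = 10

def f(x):
    """f(x) = min(sum x_i, 10) for x in {0,1}^n."""
    return min(int(np.sum(x)), CAP)

def local_sensitivity(x):
    """LS_f(x): flipping one bit changes the sum by 1, which changes f
    only while the sum is still at most CAP."""
    return 1 if int(np.sum(x)) <= CAP else 0

x_low = np.array([1] * 10 + [0] * 90)    # sum = 10  ->  LS_f = 1
x_high = np.array([1] * 11 + [0] * 89)   # sum = 11  ->  LS_f = 0
print(local_sensitivity(x_low), local_sensitivity(x_high))
```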

Local Sensitivity – First Attempt

Calibrate the noise to LS_f(x): answer query f by f(x) + Lap(LS_f(x)/ε).

If x_1 = … = x_10 = 1 and x_11 = … = x_n = 0: answer = 10 + Lap(1/ε).
If x_1 = … = x_11 = 1 and x_12 = … = x_n = 0: answer = 10 exactly.

The noise magnitude itself may be disclosive!

[Figure: f(x) as a function of Σ x_i]

How to Calibrate Noise to Local Sensitivity?

The noise magnitude at a point x must depend on LS_f(y) for all y ∈ D^n:

N*_f(x) = max_{y ∈ D^n} ( LS_f(y) · e^{-ε·dist(x,y)} )

[Figure: the resulting smooth noise magnitude for the median]
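For the capped-sum example above, this maximum can be evaluated in closed form. The sketch below assumes the reconstructed definition N*_f(x) = max_y LS_f(y)·e^{-ε·dist(x,y)} (the exponent's coefficient was garbled in the slide):

```python
import numpy as np

CAP = 10

def smoothed_noise_magnitude(x, eps):
    """N*_f(x) for f(x) = min(sum x_i, CAP) over x in {0,1}^n.

    LS_f(y) = 1 exactly when sum(y) <= CAP, and the nearest such y is
    reached by flipping max(0, sum(x) - CAP) ones to zeros, so the maximum
    over y collapses to a single exponential term.
    """
    s = int(np.sum(x))
    return float(np.exp(-eps * max(0, s - CAP)))

x = np.array([1] * 30 + [0] * 70)                # sum = 30, far above the cap
print(smoothed_noise_magnitude(x, eps=0.5))      # small -> little noise needed
```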

Talk Outline

A framework for output perturbation based on "sensitivity"
Formalize "sensitivity" and relate it to privacy definitions
Examples of sensitivity-based analysis
New ideas

Basic models for privacy
Local vs. global
Noninteractive vs. interactive

Models for Data Privacy

[Diagram: individuals (You, Bob, Alice) → collection and sanitization (San) → users (government, researchers, marketers, …)]

Models for Data Privacy – Local vs. Global

[Diagram: Local model – each individual (You, Bob, Alice) runs their own San before collection and sanitization. Global model – the data is collected first and sanitized centrally by a single San]

Including "SFE"

Models for Data Privacy – Interactive vs. Noninteractive

[Diagram: Interactive model – users query the collected-and-sanitized data online. Noninteractive model – collection and sanitization happen once, and the result is then published]

Models for Data Privacy - Summary

Local (vs. Global):
no central trusted party; individuals interact directly with the (untrusted) user; individuals control their own privacy.

Noninteractive (vs. Interactive):
easier distribution (web site, book, CD, …); more secure, since the data can be erased once it is processed.

Almost all work in statistics and data mining is noninteractive!

Four Basic Models

Local, noninteractive
Local, interactive (??)
Global, noninteractive
Global, interactive

[Diagram: the four models ordered by power; some pairs are incomparable]

Interactive vs. Noninteractive

[Diagram: the four models, highlighting the comparison between interactive and noninteractive]

Separating Interactive from Noninteractive

Random samples: one can compute estimates for many statistics, with (essentially) no need to decide upon the queries ahead of time. But a random sample is not private (unless the domain and sample are small [CM06]).

Interaction gives the power of random samples, with privacy! E.g. sum queries f(x) = Σ_i f_i(x_i), even chosen adaptively.

Noninteractive schemes seem weaker.
Intuition: privacy means one cannot answer all questions ahead of time (e.g. [DN03]); the sanitization must be tailored to specific functions.

Separating Interactive from Noninteractive

Theorem: If D = {0,1}^d, then for any private noninteractive scheme there are many sum queries that cannot be learned from the sanitized output, unless d = o(log n).

Weaker than interactive: a noninteractive scheme cannot emulate a random sample when the data is complex.

Local vs. Global

[Diagram: the four models, highlighting the comparison between local and global]

Separating Local from Global

D = {0,1}^d with d on the order of log n; view x as an n×d matrix.
Global: rank(x) has sensitivity 1, so it can be released with low noise.
Local: cannot distinguish whether rank(x) = k or is much larger than k, for a suitable choice of d, n, k.

To sum up

Defined privacy in terms of indistinguishability; considered semantic versions of the definitions ("crypto" with non-negligible error).

Showed how to calibrate noise to sensitivity and to the number of queries. It seems that useful statistics should be insensitive; some commonly used functions have low sensitivity, and for the others – local sensitivity?

Began to explore the relationships between the basic models.

Questions

Which useful functions are insensitive? What would you like to compute?
Can we get stronger results using: local sensitivity? computational assumptions? [MS06] entropy in the data?
How do we deal with small databases?

Privacy in a broader context:
rationalizing privacy and privacy-related decisions; which types of privacy? how to decide upon the privacy parameters? …
Handling rich data: audio, video, pictures, text, …
