Calibrating Noise to Sensitivity in Private Data Analysis
Kobbi Nissim
BGU
With Cynthia Dwork, Frank McSherry, Adam Smith, Enav Weinreb
The Setting

[Figure: a database x = (x1, …, xn) ∈ D^n (n rows, each from a domain D) sits behind a sanitizer San. Users (government, researchers, marketers, …) send queries and receive answers. User: "I just want to learn a few harmless global statistics." Adversary: "Can I combine these to learn some private info?"]
What is privacy?

Clearly we cannot undo the harm done by others. Can we minimize the additional harm while providing utility?
Goal: whether or not I contribute my data does not affect my privacy.
Output Perturbation

[Figure: the sanitizer San holds x = (x1, …, xn) and its random coins; on query f it returns f(x) + noise.]

San controls:
- which functions f
- the kind of perturbation
When Can I Release f(x) accurately?
Intuition: global information is “insensitive” to individual data and is safe
f(x1,…,xn) is sensitive if changing a few entries can drastically change its value
Talk Outline

- A framework for output perturbation based on "sensitivity"
  - Formalize "sensitivity" and relate it to privacy definitions
  - Examples of sensitivity-based analysis
  - New ideas
- Basic models for privacy
  - Local vs. global
  - Noninteractive vs. interactive
Related Work

- Relevant work in statistics, data mining, computer security, databases
  - Largely: no precise definitions and analysis of privacy
- Recently: a foundational approach [DN03, EGS03, DN04, BDMN05, KMN05, CDMSW05, CDMT05, MS06, CM06, …]
- This work extends [DN03, DN04, BDMN05]
Privacy as Indistinguishability

[Figure: the sanitizer San, with its random coins, is run on two databases x and x' that differ in a single row (x2 vs. x2'). Each run yields a transcript of queries and answers: transcript T(x) and transcript T(x'). The two transcript distributions are at "distance" < ε.]
ε-Indistinguishability

A sanitizer is ε-indistinguishable if for all pairs x, x' ∈ D^n which differ on at most one entry, for all adversaries A, and for all transcripts t:

  Pr[T_A(x) = t] / Pr[T_A(x') = t] ≤ e^ε
Semantically Flavored Definitions

- Indistinguishability: easy to work with, but does not directly say what the adversary can learn
- "Ideal" semantic definition: the adversary does not change his beliefs about me
  - Problem: dependencies, e.g. in the form of side information. Say you know that I am 20 pounds heavier than the average Israeli… You will learn my weight from census results, whether or not I participate.
- Ways to get around this:
  - Assume "independence" of X1, …, Xn [DN03, DN04, BDMN05]
  - Compare "what A knows now" vs. "what A would have learned anyway" [DM]
Incremental Risk

- Suppose the adversary has prior "beliefs" about x: a probability distribution, r.v. X = (X1, …, Xn)
- Given transcript t, the adversary updates his "beliefs" according to Bayes' rule: new distribution Xi' | T(X) = t
Incremental Risk

Two options:
- I participate in the census (input = X)
- I do not participate (input Yi = X1, …, Xi-1, *, Xi+1, …, Xn)

Privacy: whether I participate or not does not significantly influence the adversary's posterior beliefs. For all transcripts t, for all i:

  Xi' | T(X) = t  ≈  Xi' | T(Yi) = t

"Proof": indistinguishability guarantees that the posterior updates are the same within a factor of 1 ± ε.

[Figure: San run on X and on Yi; the adversary: "Bugger! It's the same whether you participate or not."]
Recall – ε-Indistinguishability

For all pairs x, x' ∈ D^n s.t. dist(x, x') = 1, for all transcripts t:

  Pr[T_A(x) = t] / Pr[T_A(x') = t] ≤ e^ε
An Example – Sum Queries

[Figure: San holds x ∈ [0,1]^n; the user asks "Pls let me know f_A(x) = Σ_{i∈A} x_i" and receives f_A(x) + noise.]

- x ∈ [0,1]^n, f_A(x) = Σ_{i∈A} x_i
- Can be used as a basis for other tasks: clustering, learning, classification… [BDMN05]
- Answer: Σ_{i∈A} x_i + Y where Y ~ Lap(1/ε)
  - Laplace distribution: h(y) ∝ e^{-ε|y|}
  - Note: |f_A(x) - f_A(x')| ≤ 1
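As a concrete illustration (a sketch, not from the slides), here is a minimal Python implementation of answering a sum query with Lap(1/ε) noise, assuming entries x_i ∈ [0,1]; the function names are my own:

```python
import math
import random

def sample_laplace(scale):
    """Draw one sample from the Laplace distribution with the given scale,
    via inverse-CDF sampling."""
    u = random.random() - 0.5          # uniform on [-0.5, 0.5)
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def answer_sum_query(x, subset, eps):
    """Answer f_A(x) = sum_{i in A} x_i with Lap(1/eps) noise.
    Each x_i lies in [0, 1], so changing one entry changes the sum by at
    most 1; Lap(1/eps) noise then gives eps-indistinguishability."""
    return sum(x[i] for i in subset) + sample_laplace(1.0 / eps)

x = [random.random() for _ in range(1000)]
noisy = answer_sum_query(x, range(0, 500), eps=0.5)
```

Note that smaller ε (stronger privacy) means a larger noise scale 1/ε, directly reflecting the calibration rule of the talk.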
Sum Queries – Answering a Query

Property of Lap: for all x, y: h(x)/h(y) ≤ e^{ε|x-y|}

  Pr[T(x) = t]  ∝ e^{-ε|f_A(x) - t|}
  Pr[T(x') = t] ∝ e^{-ε|f_A(x') - t|}

  Pr[T(x) = t] / Pr[T(x') = t] ≤ e^{ε|f_A(x) - f_A(x')|} ≤ e^ε

Sum Queries – Proof of ε-Indistinguishability

  max_{dist(x,x')=1} |f_A(x) - f_A(x')| = 1

[Figure: two Laplace densities centered at f(x) and f(x').]

We chose the noise magnitude to cover for max |f(x) - f(x')|.
  Sensitivity:        S_f = max_{dist(x,x')=1} ||f(x) - f(x')||_1
  Local sensitivity:  LS_f(x) = max_{x' : dist(x,x')=1} ||f(x) - f(x')||_1
Sensitivity

[Figure: f applied to two databases x and x' with dist(x, x') = 1 (they differ in one row); San releases f(x) + noise and f(x') + noise.]
Calibrating Noise to Sensitivity

[Figure: San holds x ∈ D^n; the user asks "Pls let me know f(x)" and receives f(x) + Lap(S_f/ε).]

Noise density: h(y) ∝ e^{-(ε/S_f) ||y||_1}
Calibrating Noise to Sensitivity – Why it Works

  S_f = max_{dist(x,x')=1} ||f(x) - f(x')||_1
  h(y) ∝ e^{-(ε/S_f) ||y||_1}

Property of Lap: for all x, y: h(x)/h(y) ≤ e^{(ε/S_f) ||x-y||_1}

  Pr[T(x) = t] / Pr[T(x') = t] ≤ e^{(ε/S_f) ||f(x) - f(x')||_1} ≤ e^ε
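The general calibration rule can be sketched generically; the function name `laplace_mechanism` and the example query are my own, not from the talk:

```python
import math
import random

def laplace_mechanism(x, f, sensitivity, eps):
    """Release f(x) with iid Lap(S_f/eps) noise in every coordinate.

    f maps the database to a list of reals; `sensitivity` must upper-bound
    the L1 distance ||f(x) - f(x')||_1 over neighboring databases x, x'.
    """
    scale = sensitivity / eps

    def lap():
        u = random.random() - 0.5
        return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

    return [v + lap() for v in f(x)]

# Example: a count query (sensitivity 1) on 0/1 data.
x = [1, 0, 1, 1, 0, 1]
noisy_count = laplace_mechanism(x, lambda db: [float(sum(db))],
                                sensitivity=1.0, eps=1.0)
```

The caller is responsible for supplying a correct sensitivity bound; an underestimate silently voids the e^ε guarantee.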
Main Result

Theorem: If a user U is limited to T adaptive queries, each of sensitivity ≤ S_f, then adding iid noise Lap(S_f T/ε) to the query answers yields ε-indistinguishability.

- The same idea works with other metrics and noise distributions.
- Which useful functions are insensitive? All useful functions should be insensitive: statistical conclusions should not depend on small variations in the data.
Using Insensitive Functions

Strategies:
- Use the theorem: output f(x) + Lap(S_f/ε)
  - S_f may be hard to analyze/compute
  - S_f may be high even for functions considered "insensitive"
- Express f in terms of insensitive functions
  - The resulting noise depends on the input (in form and magnitude)

Example – expressing f in terms of insensitive functions:
- x ∈ {0,1}^n, f(x) = (Σ x_i)^2
- S_f = n^2 - (n-1)^2 = 2n - 1, so a_f = (Σ x_i)^2 + Lap(2n/ε); if f(x) << n the noise dominates
- However f(x) = (g(x))^2 where g(x) = Σ x_i, and S_g = 1, so it is better to query for g:
  - Get a_g = Σ x_i + Lap(1/ε) and estimate f(x) as (a_g)^2 - (1/ε)^2
  - Taking ε constant results in stddev O(Σ x_i)
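A quick numerical sketch of the example above (variable names and the specific database are my own): querying the sensitivity-1 function g(x) = Σ x_i and squaring gives a far better estimate of f(x) = (Σ x_i)^2 than perturbing f directly with Lap(2n/ε):

```python
import math
import random

def lap(scale):
    """One Laplace sample with the given scale (inverse-CDF sampling)."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

n, eps = 10_000, 1.0
x = [1] * 100 + [0] * (n - 100)        # sum(x) = 100, so f(x) = 10_000
f_true = sum(x) ** 2

# Direct release: f has sensitivity 2n - 1, so the noise scale ~ 2n/eps
# = 20_000 swamps the true answer of 10_000.
a_f = f_true + lap((2 * n - 1) / eps)

# Indirect release: g(x) = sum(x) has sensitivity 1; square the noisy sum.
a_g = sum(x) + lap(1.0 / eps)
f_est = a_g ** 2
# The error |f_est - f_true| concentrates around 2 * sum(x) / eps = 200,
# orders of magnitude smaller than the direct release's noise.
```

(For simplicity this sketch skips the small bias-correction term from the slide.)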
Useful Insensitive Functions

- Means, variances, … (with appropriate assumptions on the data)
- Histograms & contingency tables
- Singular value decomposition
- Distance to a property
- Functions with low query complexity
Histograms / Contingency Tables

- x1, …, xn ∈ D where D is partitioned into d disjoint bins b1, …, bd
- h(x) = (v1, …, vd) where vj = |{i : xi ∈ bj}|
- S_h = 2: changing one value xi changes the count vector by ≤ 2 (in L1 norm), irrespective of d
- Add Laplacian noise with std. dev. 2/ε to each count
- [Figure: a bar histogram over bins b1, b2, …, bd.] Can do that with sum queries…
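A sketch of the histogram release (the bin boundaries and names here are my own): since moving one record changes at most two counts by 1 each, S_h = 2 and Lap(2/ε) noise per bin suffices, no matter how many bins there are:

```python
import math
import random

def lap(scale):
    """One Laplace sample with the given scale (inverse-CDF sampling)."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_histogram(values, bins, eps):
    """Release per-bin counts with Lap(2/eps) noise.

    `bins` is a list of disjoint (lo, hi) half-open intervals partitioning
    the domain; one record moving between bins changes two counts by 1,
    so the L1 sensitivity of the count vector is 2.
    """
    counts = [sum(1 for v in values if lo <= v < hi) for lo, hi in bins]
    return [c + lap(2.0 / eps) for c in counts]

values = [0.05, 0.15, 0.15, 0.8, 0.95]
bins = [(0.0, 0.25), (0.25, 0.5), (0.5, 0.75), (0.75, 1.0)]
noisy = private_histogram(values, bins, eps=1.0)
```

Note the per-bin noise does not grow with d, matching the "irrespective of d" point above.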
Distance to a Property

- Say P = set of "good" databases
- Distance to P = min # points in x that must be changed to make x ∈ P
- Always has sensitivity 1, so add Laplacian noise with stddev 1/ε
- Examples: distance to being clusterable; weight of the minimum cut in a graph
- [Figure: a point x at some distance from the set P.]
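As a toy instance (the choice of property is mine, not from the slides): take P = databases whose entries are all zero. The distance to P is then the number of nonzero entries; changing one entry moves it by at most 1, so sensitivity is 1 and Lap(1/ε) noise suffices:

```python
import math
import random

def lap(scale):
    """One Laplace sample with the given scale (inverse-CDF sampling)."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_distance_to_all_zero(x, eps):
    """Distance to the property 'all entries are 0' = # nonzero entries.
    Changing one entry moves this distance by at most 1, so its
    sensitivity is 1 and Lap(1/eps) noise gives eps-indistinguishability."""
    distance = sum(1 for v in x if v != 0)
    return distance + lap(1.0 / eps)

d = noisy_distance_to_all_zero([0, 3, 0, 1, 2], eps=1.0)
```

The same pattern applies to any property P: only the (possibly expensive) computation of the distance changes, never the noise calibration.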
Approximations with Low Query Complexity

Lemma: Assume an algorithm A that randomly samples some of the n points and satisfies Pr[A(x) ∈ f(x) ± λ] > (1+γ)/2 for some γ > 0. Then S_f ≤ 2λ.

Proof: Consider x, x' that differ on point i, and let A_i be A conditioned on not choosing point i.
- Pr[A_i(x) ∈ f(x) ± λ | point i not sampled] > 1/2
- Pr[A_i(x') ∈ f(x') ± λ | point i not sampled] > 1/2
- A_i(x) and A_i(x') are identically distributed, so their common support contains a point p within distance λ of both f(x) and f(x'). Hence S_f ≤ 2λ.
Local Sensitivity

- Median: typically insensitive, but has large (global) sensitivity
- LS_f(x) = max_{x' : dist(x,x')=1} ||f(x) - f(x')||_1
- Example: f(x) = min(Σ x_i, 10) where x_i ∈ {0,1}
  - LS_f(x) = 1 if Σ x_i ≤ 10 and 0 otherwise
- [Figure: f(x) plotted against Σ x_i, increasing up to 10 and flat from there to n.]
Local Sensitivity – First Attempt

Calibrate noise to LS_f(x): answer query f by f(x) + Lap(LS_f(x)/ε).
- If x1 = … = x10 = 1 and x11 = … = xn = 0: answer = 10 + Lap(1/ε)
- If x1 = … = x11 = 1 and x12 = … = xn = 0: answer = 10 exactly

The noise magnitude itself may be disclosive!
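The failure above is easy to reproduce in code; a sketch for f(x) = min(Σ x_i, 10), with names of my own choosing:

```python
import math
import random

def lap(scale):
    """One Laplace sample with the given scale (inverse-CDF sampling)."""
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def f(x):
    return min(sum(x), 10)

def local_sensitivity(x):
    # Changing one bit can move f only while the sum is at or below the cap.
    return 1.0 if sum(x) <= 10 else 0.0

def naive_local_answer(x, eps):
    """Broken scheme: noise calibrated to LS_f(x) instead of S_f."""
    ls = local_sensitivity(x)
    return f(x) + (lap(ls / eps) if ls > 0 else 0.0)

n, eps = 100, 1.0
x1 = [1] * 10 + [0] * (n - 10)   # sum = 10: answer is 10 + Lap(1/eps)
x2 = [1] * 11 + [0] * (n - 11)   # sum = 11: answer is exactly 10
a1 = naive_local_answer(x1, eps)
a2 = naive_local_answer(x2, eps)
# x1 and x2 are neighbors, yet an answer of exactly 10 reveals which one
# was used: the noise magnitude itself leaks information.
```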
How to Calibrate Noise to Local Sensitivity?

The noise magnitude at a point x must depend on LS_f(y) for all y ∈ D^n:

  N*_f(x) = max_{y ∈ D^n} ( LS_f(y) · e^{-ε · dist(x,y)} )

[Figure: for the median-style example above, a smooth envelope over the local sensitivity as a function of Σ x_i.]
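For the capped-sum example, LS_f(y) depends only on s = Σ y_i, and dist(x, y) ≥ |Σ x_i - s| with equality attainable by flipping bits, so the max over y collapses to a max over s. A brute-force sketch of this smoothed magnitude (my own simplification, not the paper's general algorithm):

```python
import math

def smooth_noise_magnitude(x, eps, n):
    """N*_f(x) = max over databases y of LS_f(y) * exp(-eps * dist(x, y)),
    for f(x) = min(sum(x), 10) on x in {0,1}^n.

    LS_f(y) here depends only on s = sum(y), and the minimum Hamming
    distance to a database with sum s is |sum(x) - s|, so the maximum
    over all y reduces to a maximum over s in {0, ..., n}.
    """
    sx = sum(x)

    def ls(s):
        return 1.0 if s <= 10 else 0.0

    return max(ls(s) * math.exp(-eps * abs(sx - s)) for s in range(n + 1))

n = 100
x = [1] * 15 + [0] * (n - 15)    # sum = 15, so LS_f(x) = 0
# The nearest database with local sensitivity 1 is 5 changes away,
# so the smoothed magnitude is exp(-eps * 5) rather than 0.
mag = smooth_noise_magnitude(x, eps=1.0, n=n)
```

Unlike the naive attempt, this magnitude varies smoothly between neighboring databases, so it no longer leaks.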
Talk Outline

- A framework for output perturbation based on "sensitivity"
  - Formalize "sensitivity" and relate it to privacy definitions
  - Examples of sensitivity-based analysis
  - New ideas
- Basic models for privacy
  - Local vs. global
  - Noninteractive vs. interactive
Models for Data Privacy

[Figure: individuals (Alice, Bob, You) send their data to a collection-and-sanitization stage San, which serves users (government, researchers, marketers, …).]
Models for Data Privacy – Local vs. Global

[Figure: in the global model, Alice, Bob, and You send raw data to a single collection-and-sanitization stage. In the local model, each individual runs their own sanitizer San before sending anything (including via "SFE").]
Models for Data Privacy – Interactive vs. Noninteractive

[Figure: in the interactive model, users engage in a query/answer dialogue with the collection-and-sanitization stage; in the noninteractive model, a sanitized output is published once.]
Models for Data Privacy – Summary

- Local (vs. global)
  - No central trusted party
  - Individuals interact directly with the (untrusted) user
  - Individuals control their own privacy
- Noninteractive (vs. interactive)
  - Easier distribution: web site, book, CD, …
  - More secure: the data can be erased once it is processed
  - Almost all work in statistics and data mining is noninteractive!
Four Basic Models

- Local, noninteractive
- Local, interactive (??)
- Global, noninteractive
- Global, interactive
Some pairs of models are incomparable.

Interactive vs. Noninteractive

[Figure: the four models arranged in a grid: local vs. global on one axis, noninteractive vs. interactive on the other.]
Separating Interactive from Noninteractive

- Random samples: one can compute estimates for many statistics, with (essentially) no need to decide upon queries ahead of time
  - But not private (unless both the domain and the sample are small [CM06])
- Interaction: get the power of random samples, with privacy!
  - E.g., sum queries f(x) = Σ_i f_i(x_i), even chosen adaptively!
- Noninteractive schemes seem weaker
  - Intuition: a private scheme cannot answer all questions ahead of time (e.g. [DN03])
  - Intuition: sanitization must be tailored to specific functions
Separating Interactive from Noninteractive

Theorem: If D = {0,1}^d, then under any private noninteractive scheme the answers to many sum queries cannot be learned, unless d = o(log n).

Weaker than interactive: cannot emulate a random sample if the data is complex.
Local vs. Global

[Figure: the four-model grid again, highlighting local vs. global.]

Separating Local from Global

- D = {0,1}^d for d = Θ(log n); view x as an n × d matrix
- Global: rank(x) has sensitivity 1, so it can be released with low noise
- Local: cannot distinguish whether rank(x) = k or much larger than k (for a suitable choice of d, n, k)
To Sum Up

- Defined privacy in terms of indistinguishability
  - Considered semantic versions of the definitions
  - "Crypto" with non-negligible error
- Showed how to calibrate noise to sensitivity and to the number of queries
  - It seems that useful statistics should be insensitive
  - Some commonly used functions have low sensitivity; for others – local sensitivity?
- Began to explore the relationships between the basic models
Questions

- Which useful functions are insensitive? What would you like to compute?
- Can we get stronger results using:
  - Local sensitivity?
  - Computational assumptions? [MS06]
  - Entropy in the data?
- How to deal with small databases?
- Privacy in a broader context
  - Rationalizing privacy and privacy-related decisions: which types of privacy? How to decide upon privacy parameters? …
- Handling rich data: audio, video, pictures, text, …