
Calibrating Noise to Sensitivity in Private Data Analysis

Kobbi Nissim

BGU

With Cynthia Dwork, Frank McSherry, Adam Smith, Enav Weinreb

The Setting

[Diagram: a database x = (x_1, …, x_n) ∈ D^n (n rows, each from a domain D) is held by the sanitizer San; users (government, researchers, marketers, …) send queries and San returns answers]

"I just want to learn a few harmless global statistics."

"Can I combine these to learn some private info?"

What is privacy?

Clearly we cannot undo the harm done by others.
Can we minimize the additional harm while providing utility?

Goal: Whether or not I contribute my data does not affect my privacy.

Output Perturbation

[Diagram: San holds x = (x_1, …, x_n) and random coins; given a query f it returns f(x) + noise]

San controls:
which functions f
the kind of perturbation

When Can I Release f(x) accurately?

Intuition: global information is “insensitive” to individual data and is safe

f(x1,…,xn) is sensitive if changing a few entries can drastically change its value

Talk Outline

A framework for output perturbation based on "sensitivity"
Formalize "sensitivity" and relate it to privacy definitions
Examples of sensitivity-based analysis
New ideas

Basic models for privacy
Local vs. global
Noninteractive vs. interactive

Related Work

Relevant work in statistics, data mining, computer security, databases
Largely: no precise definitions and analysis of privacy
Recently: a foundational approach
[DN03, EGS03, DN04, BDMN05, KMN05, CDMSW05, CDMT05, MS06, CM06, …]

This work extends [DN03, DN04, BDMN05]

Privacy as Indistinguishability

[Diagram: two runs of San with its random coins, one on x = (x_1, …, x_n) and one on x' which differs from x in a single row (x_2 replaced by x_2'); each run produces a transcript of queries and answers (query 1, answer 1, …, query T, answer T), denoted T(x) and T(x')]

Requirement: when x and x' differ in 1 row, the two transcript distributions are at "distance" < ε.

ε-Indistinguishability

A sanitizer is ε-indistinguishable if for all pairs x, x' ∈ D^n which differ on at most one entry, for all adversaries A, and for all transcripts t:

Pr[T_A(x) = t] / Pr[T_A(x') = t] ≤ e^ε

Semantically Flavored Definitions

Indistinguishability is easy to work with, but does not directly say what the adversary can do and learn.

"Ideal" semantic definition: the adversary does not change his beliefs about me.
Problem: dependencies, e.g. in the form of side information.
Say you know that I am 20 pounds heavier than the average Israeli… you will learn my weight from census results, whether or not I participate.

Ways to get around this:
Assume "independence" of X_1, …, X_n [DN03, DN04, BDMN05]
Compare "what A knows now" vs. "what A would have learned anyway" [DM]

Incremental Risk

Suppose the adversary has prior "beliefs" about x: a probability distribution, i.e. a random variable X = (X_1, …, X_n).

Given a transcript t, the adversary updates his "beliefs" according to Bayes' rule; the new distribution is X'_i | T(X) = t.

Incremental Risk

Two options:
I participate in the census (input = X)
I do not participate (input Y_i = X_1, …, X_{i-1}, *, X_{i+1}, …, X_n)

Privacy: whether I participate or not does not significantly influence the adversary's posterior beliefs:
for all transcripts t, for all i: X'_i | T(X) = t ≈ X'_i | T(Y_i) = t

[Diagram: San run on X vs. San run on Y_i]

"Proof": indistinguishability guarantees that the Bayesian updates are the same to within a factor of 1 ± ε.

"Bugger! It's the same whether you participate or not."

Recall – ε-Indistinguishability

For all pairs x, x' ∈ D^n s.t. dist(x, x') = 1, for all transcripts t:

Pr[T_A(x) = t] / Pr[T_A(x') = t] ≤ e^ε

An Example – Sum Queries

[Diagram: the user asks San "please let me know f_A(x) = Σ_{i∈A} x_i"; San returns f_A(x) + noise]

x ∈ [0,1]^n, f_A(x) = Σ_{i∈A} x_i
Can be used as a basis for other tasks: clustering, learning, classification… [BDMN05]

Answer: Σ_{i∈A} x_i + Y where Y ~ Lap(1/ε)
Laplace distribution: h(y) ∝ e^{-ε|y|}
Note: |f_A(x) - f_A(x')| ≤ 1 whenever x, x' differ in one entry

Sum Queries – Answering a Query

Property of Lap: for all y, y': h(y)/h(y') ≤ e^{ε|y - y'|}

Pr[T(x) = t] ∝ e^{-ε|f_A(x) - t|}
Pr[T(x') = t] ∝ e^{-ε|f_A(x') - t|}

Pr[T(x) = t] / Pr[T(x') = t] ≤ e^{ε|f_A(x) - f_A(x')|} ≤ e^ε

Sum Queries – Proof of ε-Indistinguishability

max_{dist(x,x')=1} |f_A(x) - f_A(x')| = 1

[Figure: the noise distributions centered at f(x) and f(x')]

We chose the noise magnitude to cover for max |f(x) - f(x')|.

Sensitivity: S_f = max_{dist(x,x')=1} ||f(x) - f(x')||_1
Local sensitivity: LS_f(x) = max_{x': dist(x,x')=1} ||f(x) - f(x')||_1

Sensitivity

[Diagram: two neighboring databases x and x' with dist(x, x') = 1 (they differ in one row); San answers f(x) + noise on the first and f(x') + noise on the second]

Calibrating Noise to Sensitivity

[Diagram: the user asks San "please let me know f(x)"; San returns f(x) + Lap(S_f / ε)]

x ∈ D^n
Noise density: h(y) ∝ e^{-(ε/S_f) ||y||_1}
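As a sketch, the general mechanism might look like the following in Python; the caller must supply an upper bound on S_f, and the function name, interface, and the mean example are illustrative rather than taken from the talk:

```python
import numpy as np

def laplace_mechanism(f, x, sensitivity, eps, rng):
    """Release f(x) + i.i.d. Lap(sensitivity/eps) noise on every coordinate.

    `sensitivity` must upper-bound S_f = max over databases differing in one
    row of ||f(x) - f(x')||_1; it is a property of f, not of this x.
    """
    value = np.atleast_1d(np.asarray(f(x), dtype=float))
    noise = rng.laplace(loc=0.0, scale=sensitivity / eps, size=value.shape)
    return value + noise

rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=200)
# Example: the mean of values in [0,1]^n has sensitivity 1/n.
print(laplace_mechanism(lambda d: d.mean(), x, sensitivity=1.0 / len(x), eps=0.5))
```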

Calibrating Noise to Sensitivity - Why it Works?

Sf = max |f(x)-f(x’)|1

Property of Lap: x,y: h(x)/h(y) e||x-y||1

Pr[T(x)=t] / Pr[T(x’)=t] e / Sf ||fA(x)- fA(x’)||1 e

dist(x,x’)=1 h(y) e-/Sf ||y||1

Main Result

Theorem: If a user U is limited to T adaptive queries, each of sensitivity at most S_f, then ε-indistinguishability holds if i.i.d. noise Lap(S_f · T / ε) is added to each query answer.

The same idea works with other metrics and noise distributions.

Which useful functions are insensitive?
All useful functions should be insensitive…
Statistical conclusions should not depend on small variations in the data.
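A hedged sketch of a sanitizer enforcing this query budget; the Sanitizer class and its interface are invented for illustration, the talk only states the noise magnitude Lap(S_f·T/ε):

```python
import numpy as np

class Sanitizer:
    """Answers up to T adaptive queries, each assumed to have sensitivity
    at most s_f, adding i.i.d. Lap(s_f * T / eps) noise to every answer."""

    def __init__(self, x, eps, T, s_f, seed=0):
        self.x, self.eps, self.T, self.s_f = x, eps, T, s_f
        self.asked = 0
        self.rng = np.random.default_rng(seed)

    def answer(self, f):
        if self.asked >= self.T:
            raise RuntimeError("query budget of T queries exhausted")
        self.asked += 1
        return f(self.x) + self.rng.laplace(0.0, self.s_f * self.T / self.eps)

# Three adaptive sum queries over x in [0,1]^n; each has sensitivity 1.
x = np.random.default_rng(2).uniform(0.0, 1.0, size=1000)
san = Sanitizer(x, eps=1.0, T=3, s_f=1.0)
a1 = san.answer(lambda d: d[:500].sum())
a2 = san.answer(lambda d: d[500:].sum())   # the choice here may depend on a1
print(a1, a2, san.answer(lambda d: d.sum()))
```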

Using insensitive functions

Strategies:

Use the theorem: output f(x) + Lap(S_f / ε).
But S_f may be hard to analyze/compute, and S_f may be high for functions considered "insensitive".

Express f in terms of insensitive functions.
The resulting noise then depends on the input (in form and magnitude).

Example - Expressing f in terms of insensitive functions

x ∈ {0,1}^n, f(x) = (Σ x_i)^2
S_f = n^2 - (n-1)^2 = 2n - 1, so a_f = (Σ x_i)^2 + Lap(2n/ε); if f(x) << n the noise dominates.

However f(x) = (g(x))^2 where g(x) = Σ x_i, and S_g = 1, so it is better to query for g:
get a_g = Σ x_i + Lap(1/ε) and estimate f(x) as (a_g)^2 - (1/ε)^2.
Taking ε constant results in stddev O(Σ x_i).
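The following sketch contrasts the two strategies on synthetic data; the bias-correction term follows the slide's estimator (note that the exact variance of Lap(1/ε) is 2/ε², so the correction is only approximate):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.integers(0, 2, size=1000)       # x in {0,1}^n
n, eps = len(x), 0.5
s = int(x.sum())

# Direct query: f(x) = (sum x_i)^2 has sensitivity 2n-1, so noise of scale
# about 2n/eps is added and can dwarf f(x) whenever sum x_i << n.
a_f = s ** 2 + rng.laplace(0.0, 2 * n / eps)

# Indirect: query g(x) = sum x_i (sensitivity 1), then square the noisy
# answer; subtracting (1/eps)^2 reduces the bias introduced by squaring
# the noise (the slide's estimator).
a_g = s + rng.laplace(0.0, 1.0 / eps)
f_est = a_g ** 2 - (1.0 / eps) ** 2

print("true:", s ** 2, " direct:", round(a_f), " via g:", round(f_est))
```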

Useful Insensitive functions

Means, variances, … (with appropriate assumptions on the data)
Histograms & contingency tables
Singular value decomposition
Distance to a property
Functions with low query complexity

Histograms/Contingency Tables

x_1, …, x_n ∈ D, where D is partitioned into d disjoint bins b_1, …, b_d
h(x) = (v_1, …, v_d) where v_j = |{i : x_i ∈ b_j}|
S_h = 2: changing one value x_i changes the count vector by ≤ 2 (in L_1), irrespective of d
Add Laplace noise with std. dev. 2/ε to each count

[Figure: histogram over bins b_1, b_2, …, b_d]

(Can do that with sum queries…)
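A small illustrative sketch of the histogram release; the bin partition and data below are made up:

```python
import numpy as np

def private_histogram(x, bins, eps, rng):
    """Release all d bin counts at once, each with Lap(2/eps) noise.

    Moving one row from one bin to another changes the count vector by at
    most 2 in L_1 norm, so the noise scale does not grow with d.
    """
    counts = np.array([sum(1 for xi in x if xi in b) for b in bins], float)
    return counts + rng.laplace(0.0, 2.0 / eps, size=counts.shape)

rng = np.random.default_rng(4)
x = rng.integers(0, 10, size=500)                        # domain D = {0,...,9}
bins = [{0, 1}, {2, 3}, {4, 5}, {6, 7}, {8, 9}]          # d = 5 disjoint bins
print(np.round(private_histogram(x, bins, eps=0.5)))
```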

Distance to a Property

Say P = a set of "good" databases.
Distance to P = the minimum number of points in x that must be changed to make x ∈ P.
This always has sensitivity 1, so add Laplace noise with stddev 1/ε.

Examples:
distance to being clusterable
weight of the minimum cut in a graph

[Figure: a point x and its distance to the set P]
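As a toy instance in code, where the property "at most k ones" is a hypothetical stand-in for the slide's examples (clusterability, min-cut weight):

```python
import numpy as np

def distance_to_property(x, k):
    """Distance from x in {0,1}^n to P = {databases with at most k ones}:
    the number of entries that must be changed to land in P."""
    return max(0, int(np.sum(x)) - k)

def noisy_distance(x, k, eps, rng):
    # Changing one row changes any such distance by at most 1 (sensitivity 1),
    # so Lap(1/eps) noise suffices, whatever the property P is.
    return distance_to_property(x, k) + rng.laplace(0.0, 1.0 / eps)

rng = np.random.default_rng(5)
x = rng.integers(0, 2, size=1000)
print(noisy_distance(x, k=400, eps=0.25, rng=rng))
```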

Approximations with Low Query Complexity

Lemma: Assume an algorithm A that randomly samples εn points and satisfies Pr[A(x) ∈ f(x) ± λ] > (1+ε)/2. Then S_f ≤ 2λ.

Proof: Consider x, x' that differ on point i, and let A_i be A conditioned on not choosing point i.
Pr[A_i(x) ∈ f(x) ± λ] > 1/2 (conditioned on point i not being sampled)
Pr[A_i(x') ∈ f(x') ± λ] > 1/2 (conditioned on point i not being sampled)
A_i(x) and A_i(x') have the same distribution, so there exists a point p within λ of both f(x) and f(x'), hence S_f ≤ 2λ.

Local Sensitivity

The median is typically insensitive, yet has large (global) sensitivity.

LS_f(x) = max_{x': dist(x,x')=1} ||f(x) - f(x')||_1

Example: f(x) = min(Σ x_i, 10) where x_i ∈ {0,1}
LS_f(x) = 1 if Σ x_i ≤ 10, and 0 otherwise

[Figure: f(x) as a function of Σ x_i, increasing up to 10 and flat afterwards]
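For this capped-sum example the local sensitivity has a simple closed form; a small sketch (the cap of 10 and the two test inputs follow the slide's example):

```python
import numpy as np

CAP = 10

def f(x):
    """f(x) = min(sum x_i, 10) for x in {0,1}^n."""
    return min(int(np.sum(x)), CAP)

def local_sensitivity(x):
    """LS_f(x): flipping one bit changes the sum by 1, which changes f
    only while the sum is still at most CAP."""
    return 1 if int(np.sum(x)) <= CAP else 0

x_low = np.array([1] * 10 + [0] * 90)    # sum = 10  ->  LS_f = 1
x_high = np.array([1] * 11 + [0] * 89)   # sum = 11  ->  LS_f = 0
print(local_sensitivity(x_low), local_sensitivity(x_high))
```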

Local Sensitivity – First Attempt

Calibrate the noise to LS_f(x): answer query f by f(x) + Lap(LS_f(x)/ε).

If x_1 = … = x_10 = 1 and x_11 = … = x_n = 0: answer = 10 + Lap(1/ε).
If x_1 = … = x_11 = 1 and x_12 = … = x_n = 0: answer = 10 exactly.

The noise magnitude itself may be disclosive!

[Figure: f(x) as a function of Σ x_i]

How to Calibrate Noise to Local Sensitivity?

The noise magnitude at a point x must depend on LS_f(y) for all y ∈ D^n:

N*_f(x) = max_{y ∈ D^n} ( LS_f(y) · e^{-ε·dist(x,y)} )

[Figure: the resulting smooth noise magnitude for the median]
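For the capped-sum example above, this maximum can be evaluated in closed form. The sketch below assumes the reconstructed definition N*_f(x) = max_y LS_f(y)·e^{-ε·dist(x,y)} (the exponent's coefficient was garbled in the slide):

```python
import numpy as np

CAP = 10

def smoothed_noise_magnitude(x, eps):
    """N*_f(x) for f(x) = min(sum x_i, CAP) over x in {0,1}^n.

    LS_f(y) = 1 exactly when sum(y) <= CAP, and the nearest such y is
    reached by flipping max(0, sum(x) - CAP) ones to zeros, so the maximum
    over y collapses to a single exponential term.
    """
    s = int(np.sum(x))
    return float(np.exp(-eps * max(0, s - CAP)))

x = np.array([1] * 30 + [0] * 70)                # sum = 30, far above the cap
print(smoothed_noise_magnitude(x, eps=0.5))      # small -> little noise needed
```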

Talk Outline

A framework for output perturbation based on "sensitivity"
Formalize "sensitivity" and relate it to privacy definitions
Examples of sensitivity-based analysis
New ideas

Basic models for privacy
Local vs. global
Noninteractive vs. interactive

Models for Data Privacy

[Diagram: individuals (You, Bob, Alice) → collection and sanitization (San) → users (government, researchers, marketers, …)]

Models for Data Privacy – Local vs. Global

[Diagram: Local model – each individual (You, Bob, Alice) runs their own San before collection and sanitization. Global model – the data is collected first and sanitized centrally by a single San]

Including "SFE"

Models for Data Privacy – Interactive vs. Noninteractive

[Diagram: Interactive model – users query the collected-and-sanitized data online. Noninteractive model – collection and sanitization happen once, and the result is then published]

Models for Data Privacy - Summary

Local (vs. Global):
no central trusted party; individuals interact directly with the (untrusted) user; individuals control their own privacy.

Noninteractive (vs. Interactive):
easier distribution (web site, book, CD, …); more secure, since the data can be erased once it is processed.

Almost all work in statistics and data mining is noninteractive!

Four Basic Models

Local, noninteractive
Local, interactive (??)
Global, noninteractive
Global, interactive

[Diagram: the four models ordered by power; some pairs are incomparable]

Interactive vs. Noninteractive

[Diagram: the four models, highlighting the comparison between interactive and noninteractive]

Separating Interactive from Noninteractive

Random samples: one can compute estimates for many statistics, with (essentially) no need to decide upon the queries ahead of time. But a random sample is not private (unless the domain and sample are small [CM06]).

Interaction gives the power of random samples, with privacy! E.g. sum queries f(x) = Σ_i f_i(x_i), even chosen adaptively.

Noninteractive schemes seem weaker.
Intuition: privacy means one cannot answer all questions ahead of time (e.g. [DN03]); the sanitization must be tailored to specific functions.

Separating Interactive from Noninteractive

Theorem: If D = {0,1}^d, then for any private noninteractive scheme there are many sum queries that cannot be learned from the sanitized output, unless d = o(log n).

Weaker than interactive: a noninteractive scheme cannot emulate a random sample when the data is complex.

Local vs. Global

[Diagram: the four models, highlighting the comparison between local and global]

Separating Local from Global

D = {0,1}^d with d on the order of log n; view x as an n×d matrix.
Global: rank(x) has sensitivity 1, so it can be released with low noise.
Local: cannot distinguish whether rank(x) = k or is much larger than k, for a suitable choice of d, n, k.

To sum up

Defined privacy in terms of indistinguishability; considered semantic versions of the definitions ("crypto" with non-negligible error).

Showed how to calibrate noise to sensitivity and to the number of queries. It seems that useful statistics should be insensitive; some commonly used functions have low sensitivity, and for the others – local sensitivity?

Began to explore the relationships between the basic models.

Questions

Which useful functions are insensitive? What would you like to compute?
Can we get stronger results using: local sensitivity? computational assumptions? [MS06] entropy in the data?
How do we deal with small databases?

Privacy in a broader context:
rationalizing privacy and privacy-related decisions; which types of privacy? how to decide upon the privacy parameters? …
Handling rich data: audio, video, pictures, text, …
