From Idiosyncratic to Stereotypical: Toward Privacy in Public Databases


Shuchi Chawla, Cynthia Dwork, Frank McSherry, Adam Smith, Larry Stockmeyer, Hoeteck Wee


Database Privacy

Census data, a prototypical example:
- Individuals provide information
- The census bureau publishes sanitized records
- Privacy is legally mandated; what utility can we achieve?

Our goals:
- Pin down what we mean by preservation of privacy
- Characterize the trade-off between privacy and utility: disguise individual identifying information while preserving macroscopic properties
- Develop a "good" sanitizing procedure with theoretical guarantees

An outline of this talk

- A mathematical formalism: what do we mean by privacy? Prior work; an abstract model of datasets; isolation; good sanitizations
- A candidate sanitization: a brief overview of results; a general argument for privacy of n-point datasets
- Open issues and concluding remarks

Privacy… a philosophical view-point

[Ruth Gavison] … includes protection from being brought to the attention of others …

- Matches intuition; inherently desirable
- Attention invites further loss of privacy
- Privacy is assured to the extent that one blends in with the crowd

An appealing definition, and one that can be converted into a precise mathematical statement!

Database Privacy

- Statistical approaches: alter the frequency (PRAN/DS/PERT) of particular features while preserving means; additionally, erase values that reveal too much
- Query-based approaches (involving a permanent trusted third party): query monitoring, which disallows queries that breach privacy, and perturbation, which adds noise to the query output [Dinur Nissim '03, Dwork Nissim '04]
- Statistical perturbation + adversarial analysis [Evfimievski et al. '03]: combines statistical techniques with analysis similar to that of the query-based approaches

Everybody’s First Suggestion

Learn the distribution, then output:
- a description of the distribution, or
- samples from the learned distribution

But we want the output to reflect facts on the ground: statistically insignificant facts can still be important, e.g., for allocating resources.

A geometric view

- Abstraction: points in a high-dimensional metric space, say R^d, drawn i.i.d. from some distribution
- Points are unlabeled; you are your collection of attributes
- Distance is everything

Real Database (RDB), private: n unlabeled points in d-dimensional space.

Sanitized Database (SDB), public: n' new points, possibly in a different space.

The adversary or Isolator

- Using the SDB and auxiliary information (AUX), the adversary outputs a point q
- q "isolates" a real point x if it is much closer to x than to x's neighbors, i.e., if B(q, cδ) contains fewer than T RDB points, where δ = |q − x|
- T-radius of x: the distance to its T-th nearest neighbor
- x is "safe" from q if δ = |q − x| > (T-radius of x)/(c − 1), since then B(q, cδ) contains x's entire T-neighborhood
- c is the privacy parameter, e.g., c = 4
- Large T and small c make for a stronger guarantee
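To make the definition concrete, here is a minimal sketch of the isolation predicate under this abstraction (the function names, the default parameters, and the choice to count x itself among the ball's points are ours, not the talk's):

```python
import numpy as np

def t_radius(x, rdb, t):
    """Distance from x to its t-th nearest neighbor in rdb.
    Assumes x is a row of rdb, so dists[0] == 0 is x itself."""
    dists = np.sort(np.linalg.norm(rdb - x, axis=1))
    return dists[t]

def isolates(q, x, rdb, c=4, t=10):
    """True if q (c, t)-isolates x: with delta = |q - x|, the ball
    B(q, c*delta) captures fewer than t RDB points."""
    delta = np.linalg.norm(q - x)
    in_ball = np.linalg.norm(rdb - q, axis=1) <= c * delta
    return int(in_ball.sum()) < t
```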

A good sanitization

A sanitizing algorithm compromises privacy if the adversary is able to considerably increase the probability of isolating a point by looking at its output.

A rigorous (and too ideal) definition: ∀ D ∀ I ∃ I′ such that, w.o.p. over RDB ∈_R D^n, ∀ aux z, ∀ x ∈ RDB:
| Pr[I(SDB, z) isolates x] − Pr[I′(z) isolates x] | ≤ ε/n

- The definition of ε can be forgiving, say 2^(−Ω(d)), or small but non-negligible (1 in a 1000)
- Quantification over x: if aux reveals information about some x, the privacy of every other y should still be preserved
- Provides a framework for describing the power of a sanitization method, and hence for comparisons

The Sanitizer

- The privacy of x is linked to its T-radius: randomly perturb x in proportion to its T-radius
- x' = San(x) ∈_R S(x, T-rad(x))
- Intuition: we are blending x in with its crowd
- If the number of dimensions d is large, there are "many" pre-images for x'; the adversary cannot conclusively pick any one
- We are adding random noise with mean zero to x, so several macroscopic properties should be preserved
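A minimal sketch of this perturbation, reusing t_radius from the earlier snippet (whether the noise lives on the sphere's surface or in the ball is a detail the slides leave open; we sample the surface):

```python
import numpy as np

rng = np.random.default_rng(0)

def sanitize_point(x, rdb, t):
    """Perturb x uniformly on the sphere of radius T-rad(x) centered at x."""
    r = t_radius(x, rdb, t)
    u = rng.normal(size=x.shape)   # a normalized Gaussian is a uniform direction
    u /= np.linalg.norm(u)
    return x + r * u

def sanitize(rdb, t=10):
    """San applied pointwise: n private points in, n sanitized points out."""
    return np.array([sanitize_point(x, rdb, t) for x in rdb])
```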

Results on privacy… An overview

| Distribution | Num. of points | Revealed to adversary | Auxiliary information |
|---|---|---|---|
| Uniform on surface of sphere | 2 | Both sanitized points | Distribution, 1-radius |
| Uniform over a bounding box or surface of sphere | n | One sanitized point, all other real points | Distribution, all real points |
| Gaussian | 2^o(d) | n sanitized points | Distribution |
| Gaussian | 2^Ω(d) | Work in progress | |

Results on utility… An overview

| Setting | Objective | Assumptions | Result |
|---|---|---|---|
| Worst-case | Find k clusters minimizing the largest diameter | - | The optimal diameter, as well as approximations to it, increases by at most a factor of 3 |
| Distributional | Find k maximum-likelihood clusters | Mixture of k Gaussians | Correct clustering with high probability as long as the means are pairwise sufficiently far |
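To make the worst-case objective concrete, here is a sketch of the diameter-based clustering it refers to, using Gonzalez's classic greedy 2-approximation (our illustration; the talk's factor-3 claim compares the optimum on real and sanitized data, which this only lets you probe empirically):

```python
import numpy as np

def greedy_k_centers(points, k):
    """Gonzalez's greedy heuristic: repeatedly pick the point farthest from
    the centers chosen so far; a 2-approximation for the k-center objective."""
    centers = [0]
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(dist))
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return centers

def largest_cluster_diameter(points, centers):
    """The objective from the table: assign each point to its nearest center
    and return the largest intra-cluster pairwise distance."""
    ctrs = points[np.array(centers)]
    assign = np.argmin(
        np.linalg.norm(points[:, None, :] - ctrs[None, :, :], axis=2), axis=1)
    diam = 0.0
    for j in range(len(centers)):
        cluster = points[assign == j]
        for p in cluster:
            diam = max(diam, float(np.linalg.norm(cluster - p, axis=1).max()))
    return diam
```

Running largest_cluster_diameter on the RDB and on sanitize(RDB) with the same k gives a quick empirical look at how much the perturbation inflates the objective.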

A special case - one sanitized point

- RDB = {x1, …, xn}
- The adversary is given the n − 1 real points x2, …, xn and one sanitized point x'1; T = 1; c = 4; "flat" prior
- Recall: x'1 ∈_R S(x1, |x1 − y|), where y is the nearest neighbor of x1

Main idea: consider the posterior distribution on x1, and show that the adversary cannot isolate a large probability mass under this distribution.
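A Monte Carlo sketch of this setup (entirely our construction, with a Gaussian RDB for concreteness): sanitize x1, let the adversary play the naive strategy q = x'1, and estimate how often it isolates x1. We read "fewer than T RDB points" as counting points other than x1, so with T = 1 isolation means no other point lands in B(q, cδ).

```python
import numpy as np

rng = np.random.default_rng(1)

def isolation_rate(n=50, d=100, c=4, trials=200):
    """Fraction of trials in which the guess q = x'1 (c, 1)-isolates x1."""
    hits = 0
    for _ in range(trials):
        rdb = rng.normal(size=(n, d))
        x1, others = rdb[0], rdb[1:]
        r = np.linalg.norm(others - x1, axis=1).min()   # 1-radius of x1
        u = rng.normal(size=d)
        u /= np.linalg.norm(u)
        q = x1 + r * u                  # the sanitized point, used as the guess
        delta = np.linalg.norm(q - x1)  # equals r for this adversary
        in_ball = np.linalg.norm(others - q, axis=1) <= c * delta
        hits += int(in_ball.sum() == 0)
    return hits / trials

# In high dimension the rate is essentially 0; compare d=2 with d=100.
```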

Let Z = { p ∈ R^d : p is a legal pre-image for x'1 } and Q = { p : if x1 = p, then x1 is isolated by q }.

We show that Pr[Q ∩ Z | x'1] ≤ 2^(−Ω(d)) · Pr[Z | x'1]:

Pr[x1 ∈ Q ∩ Z | x'1] = (probability mass contributed by Q ∩ Z) / (contribution from Z) ≤ 2^(1−d) / (1/4) = 2^(3−d)

[Figure: the sanitized point x'1 amid the real points x2, …, x6, with the legal pre-image region Z and the isolated region Q ∩ Z; a point p isolated by q satisfies |p − q| ≤ (1/3)|p − x'1|]

Contribution from Z

- Pr[x1 = p | x'1] ∝ Pr[x'1 | x1 = p] ∝ 1/r^d, where r = |x'1 − p|: as r increases, x'1 gets randomized over a larger area, proportional to r^d, hence the inverse dependence
- Pr[x'1 | x1 ∈ S] ∝ ∫_S 1/r^d ≈ the solid angle subtended by S at x'1
- Z subtends a solid angle equal to at least half a sphere at x'1

[Figure: a region S of pre-images p at distance r from x'1, inside the legal pre-image region Z, with the real points x2, …, x6]

Contribution from Q ∩ Z

- The ellipsoid is roughly as far from x'1 as its longest radius
- The contribution from the ellipsoid is at most 2^(−d) times the total solid angle
- Therefore, Pr[x1 ∈ Q ∩ Z] / Pr[x1 ∈ Z] ≲ 2^(−d)

[Figure: Q ∩ Z contained in an ellipsoid at distance roughly r from x'1, with the region Z and the real points x2, …, x6]
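Combining the two estimates, the posterior argument can be summarized as follows (our rendering; the slides' constants vary slightly between slides, and only the 2^(−Ω(d)) rate matters):

```latex
\[
\frac{\Pr[x_1 \in Q \cap Z \mid x'_1]}{\Pr[x_1 \in Z \mid x'_1]}
  = \frac{\int_{Q \cap Z} r^{-d}\, dp}{\int_{Z} r^{-d}\, dp}
  \le \frac{2^{-d} \cdot \Omega_{\mathrm{tot}}}{(1/4) \cdot \Omega_{\mathrm{tot}}}
  = 2^{2-d} = 2^{-\Omega(d)},
\]
% where \Omega_{\mathrm{tot}} is the total solid angle at x'_1.
```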

The general case… n sanitized points

- The initial intuition is wrong: privacy of x1 given x'1 and all the other points in the clear does not imply privacy of x1 given x'1 and sanitizations of the others!
- Sanitization is non-oblivious: other sanitized points reveal information about x if x is their nearest neighbor

Where we are now:
- Consider some example of a safe sanitization (not necessarily using perturbations): density regions? Histograms?
- Relate perturbations to the safe sanitization
- For the uniform distribution, a histogram over fixed-size cells gives an exponentially low probability of isolation (see the sketch below)
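A minimal sketch of that histogram sanitization, assuming axis-aligned cells of a fixed width (the cell width and the output format are our choices):

```python
import numpy as np

def histogram_sanitize(rdb, cell_width=1.0):
    """Release only grid-cell counts: each point is reduced to the index
    of its cell, and the output is a list of (cell index, count) pairs.
    Individual coordinates never appear in the output."""
    cells = np.floor(rdb / cell_width).astype(int)
    uniq, counts = np.unique(cells, axis=0, return_counts=True)
    return [(tuple(cell), int(cnt)) for cell, cnt in zip(uniq, counts)]
```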

Future directions

- Extend the privacy argument to other "nice" distributions
- For what distributions is there no meaningful privacy-utility trade-off?
- Characterize acceptable auxiliary information; think of auxiliary information as an a priori distribution
- The low-dimensional case: is it inherently impossible?
- Discrete-valued attributes: our proofs require a "spread" in all attributes
- Extend the utility argument to other interesting macroscopic properties, e.g., correlations

Conclusions

Our work so far:
- A first step towards understanding the privacy-utility trade-off
- A general and rigorous definition of privacy
- A work in progress!

How does this compare to other frameworks, e.g., query-based approaches?
- Query-based approaches directly identify good and bad functions
- Our approach summarizes "good" functions by a "sanitized database"

Questions?
