73
Challenges in Privacy-Preserving Analysis of Structured Data Kamalika Chaudhuri University of California, San Diego Computer Science and Engineering

Challenges in Privacy-Preserving Analysis of Structured Data

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Challenges in Privacy-Preserving Analysis of Structured Data

Challenges in Privacy-Preserving Analysis of Structured Data

Kamalika Chaudhuri

University of California, San Diego

Computer Science and Engineering

Page 2: Challenges in Privacy-Preserving Analysis of Structured Data

Sensitive Structured Data

Medical Records

Search Logs

Social Networks

Page 3: Challenges in Privacy-Preserving Analysis of Structured Data

This Talk: Two Case Studies

1. Privacy-preserving HIV Epidemiology

2. Privacy in Time-series data

Page 4: Challenges in Privacy-Preserving Analysis of Structured Data

HIV Epidemiology

Goal: Understand how HIV spreads among people

Page 5: Challenges in Privacy-Preserving Analysis of Structured Data

HIV Transmission Data

distance (Seq-A, Seq-B) < t

HIV transmission

Virus Seq-A

A

Virus Seq-B

B

Page 6: Challenges in Privacy-Preserving Analysis of Structured Data

From Sequences to Transmission Graphs

Node = Patient

Edge = Plausible transmission

Viral Sequences

Page 7: Challenges in Privacy-Preserving Analysis of Structured Data

…Growing over Time

Node = Patient

Edge = Transmission

2015

Page 8: Challenges in Privacy-Preserving Analysis of Structured Data

…Growing over Time

Node = Patient

Edge = Transmission

2015 2016

Page 9: Challenges in Privacy-Preserving Analysis of Structured Data

…Growing over Time

Node = Patient

Edge = Transmission

2015 2016 2017

Page 10: Challenges in Privacy-Preserving Analysis of Structured Data

…Growing over Time

2015 2016 2017

Release properties of G with privacy across timeGoal:

Page 11: Challenges in Privacy-Preserving Analysis of Structured Data

Problem: Continual Graph Statistics Release

Given: (Growing) graph GAt time t, nodes and adjacent edges arrive(@Vt, @Et)

Goal: At time t, release f(Gt), where f = graph statistic, and Gt = ([st@Vs,[st@Es)

while preserving patient privacy and high accuracy

Page 12: Challenges in Privacy-Preserving Analysis of Structured Data

What kind of Privacy?

Patient A is in the graphHide:

Release: Large scale properties

Node = Patient

Edge = Transmission

Page 13: Challenges in Privacy-Preserving Analysis of Structured Data

What kind of Privacy?

Node = Patient

Edge = Transmission

A particular patient has HIVHide:

Release: Statistical properties (degree distribution, clusters, does therapy help, etc)

Privacy notion: Node Differential Privacy

Page 14: Challenges in Privacy-Preserving Analysis of Structured Data

Talk Outline

• The Problem: Private HIV Epidemiology

• Privacy Definition: Differential Privacy

Page 15: Challenges in Privacy-Preserving Analysis of Structured Data

Differential Privacy [DMNS06]

“similar”

RandomizedAlgorithm

Randomized Algorithm

Data +

Data +

Participation of a single person does not change output

Page 16: Challenges in Privacy-Preserving Analysis of Structured Data

Differential Privacy: Attacker’s View

Prior Knowledge +

AlgorithmOutput on Data &

=Conclusion

on

Prior Knowledge +

AlgorithmOutput on Data &

=Conclusion

on

a. Algorithm could draw personal conclusions about Alice

b. Alice has the agency to participate or not

Note:

Page 17: Challenges in Privacy-Preserving Analysis of Structured Data

Differential Privacy [DMNS06]

For all D, D’ that differ in one person’s value,t

D D’p[A(D) = t] p[A(D’) = t]

If A = -differentially private randomized algorithm, then:✏

sup

t

��� logp(A(D) = t)

p(A(D0) = t)

��� ✏

Page 18: Challenges in Privacy-Preserving Analysis of Structured Data

Differential Privacy

1. Provably strong notion of privacy

2. Good approximations for many functions

e.g, means, histograms, etc.

Page 19: Challenges in Privacy-Preserving Analysis of Structured Data

Node Differential Privacy

Node = Patient

Edge = Transmission

Page 20: Challenges in Privacy-Preserving Analysis of Structured Data

Node Differential Privacy

Node = Patient

Edge = Transmission

One person’s value = One node + adjacent edges

Page 21: Challenges in Privacy-Preserving Analysis of Structured Data

Talk Outline

• The Problem: Private HIV Epidemiology

• Privacy Definition: Node Differential Privacy

• Challenges

Page 22: Challenges in Privacy-Preserving Analysis of Structured Data

Problem: Continual Graph Statistics Release

Given: (Growing) graph GAt time t, nodes and adjacent edges arrive(@Vt, @Et)

Goal: At time t, release f(Gt), where f = graph statistic, and Gt = ([st@Vs,[st@Es)

with node differential privacy and high accuracy

Page 23: Challenges in Privacy-Preserving Analysis of Structured Data

Why is Continual Release of Graphs with Node Differential Privacy hard?

1. Node DP challenging in static graphs [KNRS13, BBDS13]

2. Continual release of graph data has extra challenges

Page 24: Challenges in Privacy-Preserving Analysis of Structured Data

Challenge 1: Node DP

Removing one node can change properties by a lot (even for static graphs)

#edges = 6 (size of V) #edges = 0

Hiding one node needs high noise low accuracy

Page 25: Challenges in Privacy-Preserving Analysis of Structured Data

Prior Work: Node DP in Static Graphs

- Project to low degree graph G’ and use node DP on G’- Projection algorithm needs to be “smooth” and

computationally efficient

Approach 1 [BCS15]:

Approach 2 [KNRS13, RS15]:

- Assume bounded max degree

Page 26: Challenges in Privacy-Preserving Analysis of Structured Data

Challenge 2: Continual Release of Graphs

- Methods for tabular data [DNPR10, CSS10] do not apply

- Sequential composition gives poor utility

- Graph projection methods are not “smooth” over time

Page 27: Challenges in Privacy-Preserving Analysis of Structured Data

Talk Outline

• The Problem: Private HIV Epidemiology

• Privacy Definition: Node Differential Privacy

• Challenges

• Approach

Page 28: Challenges in Privacy-Preserving Analysis of Structured Data

Algorithm: Main Ideas

Strategy 1: Assume bounded max degree of G (from domain)

Strategy 2: Privately release “difference sequence” of statistic(instead of the direct statistic)

Page 29: Challenges in Privacy-Preserving Analysis of Structured Data

Difference Sequence

GraphSequence:

G1 G2 G3

StatisticSequence: f(G1) f(G2) f(G3)

DifferenceSequence:

f(G1) f(G2) - f(G1) f(G3) - f(G2)

Page 30: Challenges in Privacy-Preserving Analysis of Structured Data

Key Observation

Key Observation: For many graph statistics, when G is degree bounded, the difference sequence has low sensitivity

Example Theorem: If max degree(G) = D, then sensitivity of the difference sequence for #high degree nodes is at most 2D + 1.

Page 31: Challenges in Privacy-Preserving Analysis of Structured Data

From Observation to Algorithm

Algorithm:

1. Add noise to each item of difference sequence to hide effect of single node and publish

2. Reconstruct private statistic sequence from private difference sequence

Page 32: Challenges in Privacy-Preserving Analysis of Structured Data

How does this work?

Page 33: Challenges in Privacy-Preserving Analysis of Structured Data

Experiments - Privacy vs. Utility

#high degree nodes

Our Algorithm, DP Composition 1, DP Composition 2

#edges

Baselines:

Page 34: Challenges in Privacy-Preserving Analysis of Structured Data

Experiments - #Releases vs. Utility

#high degree nodes#edges

Our Algorithm, DP Composition 1, DP Composition 2Baselines:

Page 35: Challenges in Privacy-Preserving Analysis of Structured Data

Talk Agenda

Privacy is application-dependent!

Two applications:

1. HIV Epidemiology

2. Privacy of time-series data - activity monitoring, power consumption, etc

Page 36: Challenges in Privacy-Preserving Analysis of Structured Data

Time Series Data

Physical ActivityMonitoring

Location traces

Page 37: Challenges in Privacy-Preserving Analysis of Structured Data

Example: Activity Monitoring

Hide: Activity at each time against adversary with prior knowledge

Data: Activity trace of a subject

Release: (Approximate) aggregate activity

Page 38: Challenges in Privacy-Preserving Analysis of Structured Data

Why is Differential Privacy not Right for Correlated data?

Page 39: Challenges in Privacy-Preserving Analysis of Structured Data

1-DP: Output histogram of activities + noise with stdev T

Correlation Network

Example: Activity Monitoring

D = (x1, .., xT), xt = activity at time t

Too much noise - no utility!

Data from a single subject

Page 40: Challenges in Privacy-Preserving Analysis of Structured Data

Correlation Network

Example: Activity Monitoring

D = (x1, .., xT), xt = activity at time t

1-entry-DP: Output activity histogram + noise with stdev 1

Not enough noise - activities across time are correlated!

Page 41: Challenges in Privacy-Preserving Analysis of Structured Data

Correlation Network

Example: Activity Monitoring

D = (x1, .., xT), xt = activity at time t

1-entry-group DP: Output activity histogram + noise with stdev T

Too much noise - no utility!

Page 42: Challenges in Privacy-Preserving Analysis of Structured Data

How to define privacy for Correlated Data ?

Page 43: Challenges in Privacy-Preserving Analysis of Structured Data

Pufferfish Privacy [KM12]

Secret Set S

S: Information to be protected

e.g: Alice’s age is 25, Bob has a disease

Page 44: Challenges in Privacy-Preserving Analysis of Structured Data

Pufferfish Privacy [KM12]

Secret Set SSecret Pairs Set Q

Q: Pairs of secrets we want to be indistinguishable

e.g: (Alice’s age is 25, Alice’s age is 40)

(Bob is in dataset, Bob is not in dataset)

Page 45: Challenges in Privacy-Preserving Analysis of Structured Data

Pufferfish Privacy [KM12]

Secret Set SSecret Pairs Set Q

Distribution Class ⇥

e.g: (connection graph G, disease transmits w.p [0.1, 0.5])

(Markov Chain with transition matrix in set P)

: A set of distributions that plausibly generate the data⇥

May be used to model correlation in data

Page 46: Challenges in Privacy-Preserving Analysis of Structured Data

Pufferfish Privacy [KM12]

Secret Set SSecret Pairs Set Q

Distribution Class ⇥

whenever P (si|✓), P (sj |✓) > 0

p(A(X)|sj , ✓)p(A(X)|si, ✓)

t

p✓,A(A(X) = t|si, ✓) e✏ · p✓,A(A(X) = t|sj , ✓)

An algorithm A is -Pufferfish private with parameters

(S,Q,⇥) if for all (si, sj) in Q, for all , all t,✓ 2 ⇥ X ⇠ ✓,

Page 47: Challenges in Privacy-Preserving Analysis of Structured Data

Pufferfish Interpretation of DP

Theorem: Pufferfish = Differential Privacy when:

S = { si,a := Person i has value a, for all i, all a in domain X }

Q = { (si,a si,b), for all i and (a, b) pairs in X x X }

= { Distributions where each person i is independent }⇥

Page 48: Challenges in Privacy-Preserving Analysis of Structured Data

Pufferfish Interpretation of DP

Theorem: Pufferfish = Differential Privacy when:

S = { si,a := Person i has value a, for all i, all a in domain X }

Q = { (si,a si,b), for all i and (a, b) pairs in X x X }

= { Distributions where each person i is independent }⇥

Theorem: No utility possible when:

= { All possible distributions }⇥

Page 49: Challenges in Privacy-Preserving Analysis of Structured Data

How to get Pufferfish privacy?

Special case mechanisms [KM12, HMD12]

Is there a more general Pufferfish mechanism for a large class of correlated data?

Our work: Yes, the Markov Quilt Mechanism

(Also concurrent work [GK16])

Page 50: Challenges in Privacy-Preserving Analysis of Structured Data

Correlation Measure: Bayesian Networks

Node: variable

Directed Acyclic Graph

Pr(X1, X2, . . . , Xn) =Y

i

Pr(Xi|parents(Xi))

Joint distribution of variables:

Page 51: Challenges in Privacy-Preserving Analysis of Structured Data

A Simple Example

X1 X2 X3 Xn

Xi in {0, 1}

Model:

State Transition Probabilities:

0 1

1 - p

1 - p

pp

Page 52: Challenges in Privacy-Preserving Analysis of Structured Data

A Simple Example

X1 X2 X3 Xn

Xi in {0, 1}

Model:

State Transition Probabilities:

0 1

1 - p

1 - p

pp

Pr(X2 = 0| X1 = 0) = p

….

Pr(X2 = 0| X1 = 1) = 1 - p

Page 53: Challenges in Privacy-Preserving Analysis of Structured Data

A Simple Example

X1 X2 X3 Xn

Xi in {0, 1}

Model:

State Transition Probabilities:

0 1

1 - p

1 - p

pp

Pr(X2 = 0| X1 = 0) = p

….

Influence of X1 diminishes with distance

Pr(Xi = 0| X1 = 0) =1

2+

1

2(2p� 1)i�1

Pr(X2 = 0| X1 = 1) = 1 - p

1

2� 1

2(2p� 1)i�1Pr(Xi = 0| X1 = 1) =

Page 54: Challenges in Privacy-Preserving Analysis of Structured Data

Algorithm: Main Idea

Goal: Protect X1

X1 X2 X3 Xn

Page 55: Challenges in Privacy-Preserving Analysis of Structured Data

Algorithm: Main Idea

Goal: Protect X1

X1 X2 X3 Xn

Local nodes Rest(high correlation) (almost independent)

Page 56: Challenges in Privacy-Preserving Analysis of Structured Data

Algorithm: Main Idea

Goal: Protect X1

X1 X2 X3 Xn

Add noise to hidelocal nodes

Small correctionfor rest+

Local nodes Rest(high correlation) (almost independent)

Page 57: Challenges in Privacy-Preserving Analysis of Structured Data

Measuring “Independence”

Max-influence of Xi on a set of nodes XR:

To protect Xi, correction term needed for XR is exp(e(XR|Xi))

e(X

R

|Xi

) = max

a,b

sup

✓2⇥max

xR

log

Pr(X

R

= x

R

|Xi

= a, ✓)

Pr(X

R

= x

R

|Xi

= b, ✓)

Low e(XR|Xi) means XR is almost independent of Xi

Page 58: Challenges in Privacy-Preserving Analysis of Structured Data

How to find large “almost independent” sets

Brute force search is expensive

Use structural properties of the Bayesian network

Page 59: Challenges in Privacy-Preserving Analysis of Structured Data

Markov Blanket

Markov Blanket(Xi) =Set of nodes XS s.t Xi is independent of X\(Xi U XS)given XS

(usually, parents, children,other parents of children)

Xi

XS

Markov Blanket (Xi)

Page 60: Challenges in Privacy-Preserving Analysis of Structured Data

Define: Markov Quilt

XQ is a Markov Quilt of Xi if:

2. Xi lies in XN

1. Deleting XQ breaks graph into XN and XR

3. XR is independent of Xi

given XQ

Xi XQ

XR

XN

(For Markov Blanket XN = Xi)

Page 61: Challenges in Privacy-Preserving Analysis of Structured Data

Why do we need Markov Quilts?

Given a Markov Quilt,

Xi XQ

XR

XN

XN = local nodes for Xi XQ U XR = rest

Page 62: Challenges in Privacy-Preserving Analysis of Structured Data

From Markov Quilts to Amount of Noise

Xi XQ

XR

XN Stdev of noise to protect Xi:

Score(XQ) =

Correction for XQ U XR

Noise due to XN

Let XQ = Markov Quilt for Xi

card(XN )

✏� e(XQ|Xi)

Search all Markov Quilts to find one that needs min noise

Page 63: Challenges in Privacy-Preserving Analysis of Structured Data

Privacy Properties

Privacy: MQM is -Pufferfish private✏

Page 64: Challenges in Privacy-Preserving Analysis of Structured Data

Graceful Composition

MQM for Markov Chains has:

- Additive sequential composition

- Parallel composition with a correction term

X1 X2 X3 Xn

Page 65: Challenges in Privacy-Preserving Analysis of Structured Data

Simulations - Task

X1 X2 X3 Xn

Xi in {0, 1}

Model:

State Transition Probabilities:

0 1

1 - p

q

1-qp

Model Class:⇥ = [`, 1� `]

(implies p and q can lieanywhere in )⇥

Sequence length = 100

Page 66: Challenges in Privacy-Preserving Analysis of Structured Data

Simulations - Results

Methods: - Two versions of Markov Quilt Mechanism (MQMExact, MQMApprox)- GK16

0.1 0.15 0.2 0.25 0.3 0.35 0.40

1

2

3

4

5

L1er

ror

GK16MQM ApproxMQM Exact

0.1 0.15 0.2 0.25 0.3 0.35 0.40

0.2

0.4

0.6

0.8

1

L1er

ror

GK16MQM ApproxMQM Exact

` `✏ = 0.2 ✏ = 1

Page 67: Challenges in Privacy-Preserving Analysis of Structured Data

Real Data - Activity Measurement

Dataset on physical activity by three groups of subjects: 40 cyclists, 16 older women and 36 overweight women

4 states (active, standing still, standing moving, sedentary)

Over 9,000 observations per subject

Methods:

MQMExact and MQMApprox

GK16 does not apply

GroupDP

⇥ = { Empirical data generating distribution }

Page 68: Challenges in Privacy-Preserving Analysis of Structured Data

Real Data - Activity Measurement

Active Stand Still Stand Moving Sedentary0

0.2

0.4

0.6

0.8

1

Rel

ativ

e Fr

eque

ncy

Group-DPMQM ApproxMQM Exact

Active Stand Still Stand Moving Sedentary0

0.2

0.4

0.6

0.8

1

Rel

ativ

e Fr

eque

ncy

Group-DPMQM ApproxMQM Exact

Active Stand Still Stand Moving Sedentary0

0.2

0.4

0.6

0.8

1

Rel

ativ

e Fr

eque

ncy

Group-DPMQM ApproxMQM Exact

Cyclists

Older Overweight

Aggregated results (over groups)

✏ = 1

Page 69: Challenges in Privacy-Preserving Analysis of Structured Data

Real Data - Power Consumption

Dataset on power consumption in a single household

Power consumption discretized to 51 levels (51 states)

Over 1 million observations

Methods:

MQMExact vs. MQMApprox

GK16 does not apply

GroupDP has too little utility

⇥ = { Empirical data generating distribution }

Page 70: Challenges in Privacy-Preserving Analysis of Structured Data

Real Data - Power Consumption

Methods: Two versions of Markov Quilt Mechanism (MQMExact, MQMApprox)

✏ = 0.2 ✏ = 1

Page 71: Challenges in Privacy-Preserving Analysis of Structured Data

Conclusion

• Real problems have complex privacy challenges

• Rigorous privacy definitions are available

• For any privacy problem, important to think:

• What do we need to hide?

• What do we need to reveal?

Page 72: Challenges in Privacy-Preserving Analysis of Structured Data

References

• “Differentially Private Continual Release of Graph Statistics”, S. Song, S. Mehta, S. Vinterbo, S. Little and K. Chaudhuri, Arxiv, 2018

• “Pufferfish Privacy Mechanisms for Correlated Data”, S. Song, Y. Wang and K. Chaudhuri, SIGMOD 2018.

• “Composition Properties of Inferential Privacy on Time-Series Data”, S. Song and K. Chaudhuri, Allerton 2018.

Page 73: Challenges in Privacy-Preserving Analysis of Structured Data

Thanks!