
Privacy through Accountability

Anupam Datta

Associate Professor

CSD, ECE, CyLab

Carnegie Mellon University


Personal Information is Everywhere

Research Challenge

Ensure organizations respect privacy expectations, regulations, and organizational policies in the collection, use, and disclosure of personal information

Programs and People

Web Advertising

Example privacy policies:

Not use detailed location (full IP address) for advertising

Not use health information for advertising


Privacy through Accountability: An Emerging Research Area

Privacy as a right to restrictions on personal information flow

Computational accountability mechanisms for enforcement

http://www.andrew.cmu.edu/user/danupam/privacy.html

Today: Focus on Web Privacy

1. Bootstrapping Privacy Compliance in Big Data Systems

Methodology

Tool and application to Bing’s advertising system

Focus on current policies

2. Information Flow Experiments

Methodology

Tool and application to Google’s advertising system

Focus on principles that go beyond current policies

Bootstrapping Privacy Compliance in Big Data Systems

With S. Sen (CMU) and S. Guha, S. Rajamani, J. Tsai, J. M. Wing (MSR)

2014 IEEE Symposium on Security & Privacy

(Best Student Paper Award)

Privacy Compliance for Bing

Setting:

Auditor has access to source code


The Privacy Compliance Challenge


Specification

Verification

Scale Compliance?

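A Streamlined Audit Workflow

Figure (workflow diagram): a policy is encoded and refined as a Legalease policy; code analysis and developer annotations produce annotated code in Grok; the Checker compares the two and reports potential violations, which are resolved by fixing code or updating Grok.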


Workflow for privacy compliance

Legalease, usable yet formal policy specification language

Grok, bootstrapped data inventory for big data systems

Scalable implementation for Bing


Specification: Legalease

Usable. Usable by lawyers and privacy champs.

Expressive. Expressive enough for real-world policies.

Precise. Precise semantics for local reasoning.

Legalease: Example Policy

DENY Datatype IPAddress
    UseForPurpose Advertising
EXCEPT
    ALLOW Datatype IPAddress:Truncated

ALLOW
    UseForPurpose AbuseDetect
EXCEPT
    DENY Datatype IPAddress, AccountInfo

We will not use full IP Address for Advertising. IP Address may be used for detecting abuse. In such cases, it will not be combined with account information.


Legalease : Policy Checking


Program

A Lattice of Policy Labels

• If “IPAddress” use is allowed then so is everything below it

• If “IPAddress:Truncated” use is denied then so is everything above it

Figure (lattice): T at the top, IPAddress below it, IPAddress:Truncated below IPAddress.
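The two lattice rules above can be sketched in Python. This is a minimal illustration, not the Legalease implementation: the `PARENT` map and the functions `is_below`, `allow_covers`, and `deny_covers` are hypothetical names, and the lattice is reduced to the three labels shown on the slide, with `T` read as the top element.

```python
# Hypothetical three-label chain: IPAddress:Truncated < IPAddress < T.
PARENT = {"IPAddress:Truncated": "IPAddress", "IPAddress": "T", "T": None}


def is_below(label, other):
    """Return True if `label` equals `other` or sits below it in the lattice."""
    while label is not None:
        if label == other:
            return True
        label = PARENT[label]
    return False


def allow_covers(allowed_label, label):
    """Allowing a label also allows everything below it."""
    return is_below(label, allowed_label)


def deny_covers(denied_label, label):
    """Denying a label also denies everything above it."""
    return is_below(denied_label, label)
```

Under these assumptions, `allow_covers("IPAddress", "IPAddress:Truncated")` holds, and so does `deny_covers("IPAddress:Truncated", "IPAddress")`, mirroring the two bullets.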


Designed for Precision

Designed for Expressivity (Bing, October 2013)

Designed for Expressivity (Google, October 2013)


Designed for Usability

Exceptions: how legal texts are structured (one-to-one correspondence)

Local reasoning: each exception refines its immediate parent (formally proven property)

Independent of code

H. DeYoung, D. Garg, L. Jia, D. Kaynar, and A. Datta, “Experiences in the logical specification of the HIPAA and GLBA privacy laws”

Legalease Usability

Survey taken by 12 policy authors within Microsoft

Encode Bing data usage policy after a brief tutorial

Time spent: 2.4 mins on the tutorial, 14.3 mins on encoding policy

High overall correctness


Map-Reduce Programming Systems

Scope, Hive, Dremel

Data in the form of tables

Code transforms columns to columns

No shared state

Limited hidden flows

Figure: Process 1 with Datasets A, B, and C.

Verification

Nightly audit of all jobs executed.

Static source code analysis.

What data, stored where?

Who used.

Grok

Figure: data flow graph connecting Processes 1 to 6 and Datasets A to J.

Figure: data flow graph of Processes 1 to 6 and Datasets A to J, with processes carrying the purpose labels NewAcct, Login, Check Hijack, GeoIP, Check Fraud, and Reporting.

Grok

Purpose Labels

Annotate programs with purpose labels


Initial Data Labels

Heuristics and Annotations

Figure: data flow graph of Processes 1 to 6 and Datasets A to J, with purpose labels on processes and initial data labels on dataset columns (Name, Age, IPAddress, IDX, Country, Timestamp, Hash); columns without a label yet are marked ??.

Flow Labels

Source labels propagated via data flow graph

Figure: data flow graph of Processes 1 to 6 and Datasets A to J after label propagation; data labels shown are Name, Age, IPAddress, IDX, Profile, Country, Timestamp, and Hash.

D. E. Denning. “A lattice model of secure information flow”
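Propagating source labels along a data flow graph can be sketched as a worklist computation. This is a minimal illustration under stated assumptions, not Grok's implementation: the function `propagate_labels`, the graph fragment, and the seed labels are hypothetical, and labels arriving from several inputs are combined by set union.

```python
from collections import deque


def propagate_labels(edges, seed_labels):
    """Push labels from labeled nodes to every node reachable downstream.

    edges: dict mapping a node to the nodes it flows into.
    seed_labels: dict mapping some nodes to their initial set of labels.
    """
    labels = {node: set(ls) for node, ls in seed_labels.items()}
    queue = deque(seed_labels)
    while queue:
        node = queue.popleft()
        for succ in edges.get(node, ()):
            current = labels.setdefault(succ, set())
            missing = labels[node] - current
            if missing:
                # Labels only grow, so the loop terminates even on cycles.
                current |= missing
                queue.append(succ)
    return labels


# Hypothetical fragment: two datasets feed a process that writes a third.
edges = {
    "Dataset A": ["Process 1"],
    "Dataset B": ["Process 1"],
    "Process 1": ["Dataset C"],
}
seeds = {"Dataset A": {"IPAddress"}, "Dataset B": {"Age"}}
result = propagate_labels(edges, seeds)
```

In this fragment, Dataset C ends up carrying both labels because it is written by a process that reads both labeled inputs.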


Nightly Compliance Process

Generate report

Static code analysis

Manual audit

Figure: data flow graph of Processes 1 to 6 and Datasets A to J. Column names shown: FIMLast Name, LiveId, Age, ss_user_ip, M_ANID, MCMUID, LocIds, csts, msMUID2, msnANID, User Anid DB. Job steps shown: Read Dataset D, Read Dataset G, Transform Data, Write Dataset H, I.

Positive Patterns (40 Taxonomy values, 400 patterns)

Negative Patterns (2500 total entries)

Granular Overrides (116 total entries)

-- DENY DataType UniqueIdentifier WITH PII InStore BingStore
SELECT *
FROM (SELECT * FROM Report
      WHERE Taxonomy='ANID' AND Confidence>='High') AS ID
INNER JOIN
     (SELECT * FROM Report
      WHERE TaxonomyGroup='PII' AND Confidence>='High') AS P
ON ID.VC = P.VC

Figure (funnel), labels as shown: files 25M+ schemas 2M+ privacy elements* 300K+ audit candidates 10K+ teams 8 audit items 1K+

Why Bootstrapping Grok Works

Pick the nodes which will label the most of the graph

~200 annotations label 60% of nodes

A small number of annotations is enough to get off the ground.
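One way to read "pick the nodes which will label the most of the graph" is as a greedy coverage heuristic. The sketch below is an assumption about how such a choice could be made, not the procedure used for Grok: the functions `reachable` and `pick_annotation_targets` and the small graph are hypothetical, and the nodes an annotation would label are approximated by downstream reachability.

```python
def reachable(edges, start):
    """Return the set of nodes reachable from `start`, including itself."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for succ in edges.get(node, ()):
            if succ not in seen:
                seen.add(succ)
                stack.append(succ)
    return seen


def pick_annotation_targets(edges, nodes, budget):
    """Greedily pick up to `budget` nodes, each covering the most unlabeled nodes."""
    cover = {n: reachable(edges, n) for n in nodes}
    labeled, chosen = set(), []
    for _ in range(budget):
        best = max(nodes, key=lambda n: len(cover[n] - labeled))
        if not cover[best] - labeled:
            break  # nothing left to gain from another annotation
        chosen.append(best)
        labeled |= cover[best]
    return chosen, labeled


# Hypothetical graph: a chain fed by A and B, plus a separate pair E -> P3.
edges = {"A": ["P1"], "B": ["P1"], "P1": ["C"], "C": ["P2"], "P2": ["D"], "E": ["P3"]}
nodes = ["A", "B", "C", "D", "E", "P1", "P2", "P3"]
chosen, labeled = pick_annotation_targets(edges, nodes, budget=2)
```

With a budget of two annotations, this toy graph has seven of its eight nodes labeled, which illustrates why a small number of well-placed annotations can cover a large share of a graph.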


Scale

77,000 jobs run each day, by 7,000 entities

300 functional groups

1.1 million unique lines of code; 21% changes daily on average

46 million table schemas

32 million files

Manual audit infeasible

Information flow analysis takes ~30 mins daily



Information Flow Experiments Methodology

With Michael Carl Tschantz (CMU, now UC Berkeley), Amit Datta (CMU), and Jeannette M. Wing (CMU, now Microsoft Research)

Personalized Web Advertising

Figure: Google takes the user's browsing history together with confounding inputs (other users, advertisers, websites) and shows ads to the user.

Probabilistic Interference

Experimental Design

Figure: a scientist assigns a drug to the experimental group and a placebo to the control group.

Information Flow Experiment (IFE)

Figure: Group 1 visits substance abuse websites and sees rehab ads; Group 2 idles and sees generic ads.

IFE Methodology

Figure: the experimenter assigns the control treatment and the experimental treatment by random permutation, collects measurements from the Internet, and applies significance testing to obtain a p-value.

Information Flow Experiments as Science

Experimental Science      Information Flow
Natural process           System in question
Population of units       Subset of interactions
…                         …
Causation                 Information flow

Theorem

Pearl’s Causation = Probabilistic Interference


Information Flow Experiments on Personalized Ad Settings: A Tale of Opacity, Choice and Discrimination

With Amit Datta (CMU) and Michael Carl Tschantz (UC Berkeley)

Google Ad Settings


Goals

Study transparency, choice, fairness

Methodology and tool (AdFisher)

Automation, statistical rigor, scalability, explanations

Figure: diagram relating Browsing Behavior, Internal State, Ads Received, and Ad Settings.

Experiment 1: Opacity

Experimental group visits top 100 substance abuse sites

Control group idles

Then both groups visit Times of India and collect ads


Experiment 1: Significant Opacity

Substance abuse: significant effect on ads, no effect on ad settings

Disability: significant effect on ads, “unrelated” effect on ad settings

Treatment          p-value
Substance abuse    0.0000053
Disability         0.0000053
Mental disorder    0.053
Infertility        0.11
Adult websites     0.42

Statistical significance

Experiment 1: Opacity Explanation

Top ads for group visiting substance abuse webpages

The Watershed Rehab www.thewatershed.com/Help

Watershed Rehab www.thewatershed.com/Rehab

The Watershed Rehab Ads by Google

Veteran Home Loans www.vamortgagecenter.com

CAD Paper Rolls paper-roll.net/Cad-Paper

Top ads for control group

Alluria Alert www.bestbeautybrand.com

Best Dividend Stocks dividends.wyattresearch.com

10 Stocks to Hold Forever www.streetauthority.com

Delivery Drivers Wanted get.lyft.com/drive

VA Home Loans Start Here www.vamortgagecenter.com

Experiment 2: Choice

Experimental group visits top 100 dating sites; then removes dating interest from ad settings

Control group visits top 100 dating sites; then keeps dating interest

Then both groups visit Times of India and collect ads


Experiment 2: Choice Buttons have an Effect

Treatment      p-value
Opting out     0.0000053
Dating         0.0000053
Weight loss    0.041

Statistical significance

Experiment 2: Choice Explanation

Top ads for group keeping dating interest

Are You Single? www.zoosk.com/Dating

Top 5 Online Dating Sites www.consumer-rankings.com/Dating

Why can't I find a date? www.gk2gk.com

Latest Breaking News www.onlineinsider.com

Gorgeous Russian Ladies anastasiadate.com


Top ads for group removing dating interest

Car Loans w/ Bad Credit www.car.com/Bad-Credit-Car-Loan

Individual Health Plans www.individualhealthquotes.com

Crazy New Obama Tax www.endofamerica.com

Atrial Fibrillation Guide www.johnshopkinshealthalerts.com

Free $5 - $25 Gift Cards swagbucks.com

Experiment 3: Discrimination

Experimental group visits top 100 job sites with gender set to male in ad settings

Control group visits top 100 job sites with gender set to female in ad settings

Then both groups visit Times of India and collect ads


Experiment 3: Discrimination Explanation

Top ads for female group

Jobs (Hiring Now) www.jobsinyourarea.co

4Runner Parts Service www.westernpatoyotaservice.com

Criminal Justice Program www3.mc3.edu/Criminal+Justice

Goodwill - Hiring goodwill.careerboutique.com

UMUC Cyber Training www.umuc.edu/cybersecuritytraining


Top ads for male group

$200k+ Jobs - Execs Only careerchange.com

Find Next $200k+ Job careerchange.com

Become a Youth Counselor www.youthcounseling.degreeleap.com

CDL-A OTR Trucking Jobs www.tadrivers.com/OTRJobs

Free Resume Templates resume-templates.resume-now.com


Information Flow Experiments More on methodology


Google Exhibits Complex Behavior

Figure: plot of ad id (axis 0 to 45) against reload number (axis 0 to 200).

Browser Instances are Not Independent

Figure (bar chart): values 17, 13, 13, 13, 12, 11, 10, 10, 8, 7.

Which Statistical Test to Use?

Our idea: use a non-parametric test

Does not require a model of Google

Specifically, a permutation test

Does not require independence among browser instances or the assumption that ads are independent and identically distributed

Permutation Test over Keywords

Figure (bar chart): keyword counts for browser instances 1 to 10: 0, 5, 6, 30, 30, 0, 19, 22, 31, 2.

Figure (sorted): counts 0, 0, 2, 5, 6, 19, 22, 30, 30, 31 for instances 1, 6, 10, 2, 3, 7, 8, 4, 5, 9.

Figure (first grouping): instances 1, 6, 10, 2, 3 total 13; instances 7, 8, 4, 5, 9 total 132; difference 119.

Figure (regrouping): instances 9, 6, 10, 2, 3 total 44; instances 7, 8, 4, 5, 1 total 101; value shown 67.

Figure (final build): values shown -57, 119, 67, 7.
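The permutation test over keyword counts can be sketched with the ten per-instance counts that appear on the slides. This is a minimal illustration, not AdFisher's code: the names `counts`, `group_a`, `statistic`, and `exact_p_value` are hypothetical, the test statistic is assumed to be the difference between group totals, and exhaustive enumeration of all regroupings is an assumption that is only practical for a handful of instances.

```python
from fractions import Fraction
from itertools import combinations

# Keyword counts per browser instance, as read from the slide figures.
counts = {1: 0, 2: 5, 3: 6, 4: 30, 5: 30, 6: 0, 7: 19, 8: 22, 9: 31, 10: 2}

# First grouping shown on the slides; which group is "experimental" is an assumption.
group_a = {1, 6, 10, 2, 3}


def statistic(a_ids):
    """Total count outside `a_ids` minus total count inside `a_ids`."""
    inside = sum(counts[i] for i in a_ids)
    outside = sum(v for i, v in counts.items() if i not in a_ids)
    return outside - inside


observed = statistic(group_a)


def exact_p_value():
    """Fraction of equal-sized regroupings whose statistic is at least the observed one."""
    extreme = total = 0
    for subset in combinations(sorted(counts), len(group_a)):
        total += 1
        if statistic(set(subset)) >= observed:
            extreme += 1
    return Fraction(extreme, total)
```

The test makes no assumption about how the counts are distributed or whether instances are independent; it only asks how often a relabeling of the same measurements would look at least as extreme as the grouping actually used.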

Conclusion

A rigorous methodology for information flow experiments

1. Probabilistic interference = Pearl’s causation

2. Experimental design for causal determination

3. Significance testing with non-parametric statistics

An experimental study of Google Ads

1. AdFisher Tool

2. Findings of opacity, choice and discrimination


Prior Work on Behavioral Marketing

Authors            Test                     Limitation
Guha et al.        Cosine similarity        No statistical significance
Balebako et al.    Cosine similarity        No statistical significance
Wills and Tatar    Ad hoc examination       No statistical significance
Liu et al.         Process of elimination   No statistical significance
Barford et al.     χ2 test                  Assumes ads identically distributed
Lécuyer et al.     Parametric model         Correlation, not causation; assumes ads are independent

Privacy as Restrictions on Personal Information Flow

Figure (chart): approaches placed by kind of restriction (temporal; purpose and role based) and kind of information flow (direct; interference; probabilistic interference). Entries shown: EPAL, XACML, *-access control, Purpose Planning, FOTLs [Formal Contextual Integrity, Reduce audit algorithm, Basin et al.], Grok + Legalease, Jif, FlowCaml,… [Hayati & Abadi], Information Flow Experiments, Differential Privacy. Application areas shown: Web Privacy, Healthcare Privacy.

Summary

1. Information Flow Experiments

Methodology

Tool and application to Google’s advertising system with findings of opacity, choice and discrimination

2. Privacy Compliance in Big Data Systems

Methodology

Tool and application to Bing’s compliance workflow, privacy policies and advertising programs on production system

