24
Enabling Privacy in Provenance- Aware Workflow Systems Susan B. Davidson 1 Joint work with Sanjeev Khanna, Sudeepa Roy, Julia Stoyaonovich, Val Tannen 1 Yi Chen 2 Tova Milo 3 1 University of Pennsylvania 2 Arizona State University 3 Tel Aviv University

Enabling Privacy in Provenance- Aware Workflow Systems Susan B. Davidson 1 Joint work with Sanjeev Khanna, Sudeepa Roy, Julia Stoyaonovich, Val Tannen

Embed Size (px)

Citation preview

Enabling Privacy in Provenance-Aware Workflow Systems

Susan B. Davidson1

Joint work with

Sanjeev Khanna, Sudeepa Roy, Julia Stoyaonovich, Val Tannen 1

Yi Chen 2

Tova Milo3

1University of Pennsylvania 2Arizona State University 3Tel Aviv University

2

Science has been revolutionized

• High-throughput technologies generate floods of data, which must be analyzed to create knowledge.

• Scientists are turning to scientific workflow systems to help manage the analysis.

Scientific Workflow

Split Entries

Align Sequences

Functional Data Curate Annotations

Format-2

Format-1

Format-3

Construct Trees

3

Graphical representation of a sequence of actions to perform a task (e.g., a biological experiment)

Vertex ≡ Module (program) Edge ≡ Dataflow

Run: An execution of the workflow Actual data appears on the edges

.

TGCCGTGTGGCTAAATG…

CTGTGC

…CTAAATGTCTGTGC…

GGCTAAATGTCTGTGCCGTG

TGGCGTC…

ATCCGTGTGGCTA..

Need for provenance

4

TGCCGTGTGGCTAAATGTCTGTGC

CCCTTTCCGTGTGGCTAAATGTCTGTGC

TGCCGTGTGGCTAAATGTCTGTGC

GTCTGTGC…

TGCCGTGTGGCTAAATGTCTGTGC

GTCTGTGC…

TGCCGTGTGGCTAAATGTCTGTGC…

ATGGCCGTGTGGTCTGTGCCTAACTAACTAA…

Biologist’s workspace

Which sequences have been used to produce this tree?

How has this tree been generated?

?

s

Split Entries

Align Sequences

Functional Data Curate Annotations

Format

Format

Format

Construct Trees

t

Need for privacy

5

Outline

• The need: Workflow provenance repositories

• The challenge of privacy

• Privacy-aware search and query

Workflow Provenance Repositories

6

Current situation: Workflow repositories

• To enable sharing and reuse, repositories of workflow specifications are being created– e.g. myExperiment.org– Keyword search is used to find

specifications of interest

• Several workflow systems are storing provenance information – Module executions, input parameters,

input/output data– “Input-only”

7

The Vision• “Workflow Provenance” repositories will store

specifications as well as executions (i.e. provenance information)– Searchable– Queryable– Privacy protecting

• Searching/querying these repositories can be used to– Find/reuse workflows– Understand meaning of a workflow– Correct/debug erroneous specifications– Find “bad” data and what its downstream effect

might be8

9

“You are better off designing in security and privacy… from the start, rather than trying to add them later.”

(Shapiro, CACM 53:6, June 2010)

Privacy Concerns in Scientific Workflows

10

Data, module, structural

Example: Module Privacy

11

P: (X1, X2, X3, X4, X5)

(X1, X2, X3) (X1, X2, X4, X5)

P has cancer?

Split entries

Check for Cancer

Check for Infectious disease

Create Report

report

Patient record: Gender, smoking habits, Familial environment,blood pressure, blood test report, …

P has an infectious disease?

If X1 > 60 OR

(X2 < 800 AND

X5=1) AND ….

Module functionality

should be kept secret

(From patient’s standpoint): output should not be guessed given input data values

(From module owner’s standpoint): no one should be able to simulate the module and use it elsewhere.

Example: Structural Privacy

12

M1 M2

Protein+

Functional annotation

Protein

M2 compares domains of proteins (more precise but

more time consuming)

M1 compares the entire protein

against already annotated genomes

Relationships between certain data/module pairs should be kept secret

13

Privacy concerns at a glance

P: (X1, X2, X3, X4)

(X1, X2, X3) (X1, X2, X4)

P has cancer?

P has infectious disease?

Data Privacy Data items are

private

Module Privacy Module

functionality is private (x, f(x))

Structural Privacy Execution paths

between certain data is private

Split entries

Check for cancer Check for infectious disease

Create Report

report

smoking habits, blood pressure,blood test report, ……

Check for cancer

DB

14

Module Privacy (a hint)• A module f = a function• For every input x to f, f(x) value should not be revealed

– Enough equivalent possible f(x) values w.r.t. visible information

According to required privacy

guarantee

• There is a knife, a fork and a spoon in this figure

arxiv.org/abs/1005.5543

Search and Query

15

I

O

DetermineGenetic

Susceptibility

M1

M2 EvaluateDisorder Risk

Consult External Databases

M3 Expand SNPSet

M4

GenerateDatabase Queries

QueryOMIM

CombineDisorder Sets

M5

M6

M8

W1 W2 W4

QueryPubMed M7

SNPs, ethnicity

lifestyle, family history, physical symptoms

disorders

prognosis

SNPs

disorders disorders

query query

W3

….

Workflow model, revisited Vertex ≡ Module

Atomic or composite (subworkflow)

Edge ≡ DataflowW1

W3

W2 W4

Workflow hierarchy

17

Access Views

• Prefixes of workflow hierarchy can be used to define the finest level of granularity the user can “see”

W1

W3

W2 W4 W1 W2 W4

W1 W2

W1:

I OM1 M2

I

O

DetermineGenetic

Susceptibility

M1

M2 EvaluateDisorder Risk

Consult External Databases

M3 Expand SNPSet

M4

GenerateDatabase Queries

QueryOMIM

CombineDisorder Sets

M5

M6

M8

W1 W2 W4

QueryPubMed M7

SNPs, ethnicity

lifestyle, family history, physical symptoms

disorders

prognosis

SNPs

disorders disorders

query query

Search Keyword search: “Database, disorder risk” Result should be concise and informative

W1 W2 W4

“WISE: a Workflow Information Search Engine”,(ICDE 2009)

I

O

M2

EvaluateDisorder Risk

M3

Expand SNPSet

GenerateDatabase Queries

QueryOMIM

CombineDisorder Sets

M5

M6

M8

QueryPubMed

M7

SNPs, ethnicity

lifestyle, family history, physical symptoms

disorders

prognosis

SNPs

disorders

disorders

query query

Result of “database, disorder risk”

W1 W2 W4

I

O

DetermineGenetic

Susceptibility

M1

M2 EvaluateDisorder Risk

Consult External Databases

M3 Expand SNPSet

M4

W1 W2

SNPs, ethnicity

lifestyle, family history, physical symptoms

disorders

prognosis

SNPs

Result of “database, disorder risk”

W1 W2

I

O

M2

EvaluateDisorder Risk

M3

Expand SNPSet

SNPs, ethnicity

lifestyle, family history, physical symptoms

prognosis

SNPs

disorders

M4 Consult External Databases

21

Queries

• “Workflows in which Expand SNP is performed before Query PubMed.”

• “Data items that were used to produce d19.”

• …

• Shares with search the need to match certain terms and give a view of the result and preserve privacy.

22

Research Challenges• More ways of enabling scientists to construct

workflows

• Formalizing privacy notions

• Provenance query language?

• Efficiently identifying data that users can access– users may have different privileges, yielding many

different access views.

• …

Acknowledgments• Members of the PennDB

group– Susan Davidson– Sanjeev Khanna– Sudeepa Roy– Julia Stoyanovich– Val Tannen

• Friend of PennDB and Tel Aviv faculty– Tova Milo

• PennDB alum and ASU faculty– Yi Chen

• Bioinformatics collaborators– Sarah Cohen-BoulakiaThis work was supported in part by NSF IIS-0803524, IIS-0629846, IIS-0915438, CCF-0635084, and IIS-0904314; NSF CAREER award IIS-0845647; and CRA 0937060

24