Page 1: Yolanda Gil, PhD Information Sciences Institute and Department of Computer Science

1

Yolanda Gil, PhD
Information Sciences Institute and Department of Computer Science
University of Southern California

gil@isi.edu

http://www.isi.edu/~gil

Scientific Reproducibility through Semantic Workflows and Shared Provenance Representations

2

NSF Workshop on Challenges of Scientific Workflows [Gil et al IEEE Computer 2007]

Despite investments in CyberInfrastructure as an enabler of a significant paradigm change in science:
• Reproducibility, key to the scientific method, is threatened
• Exponential growth in compute, sensors, data storage, and network, BUT the growth of science is not at the same exponential rate

What is missing:
• Perceived importance of capturing and sharing process in accelerating the pace of scientific advances
• Process (method/protocol) is increasingly complex and highly distributed

Workflows are emerging as a paradigm for process-model driven science that captures the analysis itself

Workflows need to be first class citizens in science CyberInfrastructure
• Enable reproducibility
• Accelerate scientific progress by automating processes

Interdisciplinary and intradisciplinary research challenges

Report available at http://www.isi.edu/nsf-workflows06

3

Benefits of Workflow Systems [Taylor et al 07]

Managing execution
Remote job submission
Dependencies among steps
Failure recovery
Managing distributed computation
Move data when needed
Managing large data sets
Efficiency, reliability
Security and access control
Access to shared resources
Provenance recording
Low-cost high-fidelity reproducibility
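As a rough illustration of what "dependencies among steps" and "failure recovery" mean in practice, the sketch below runs steps in dependency order and retries a step that fails. It is a minimal sketch under stated assumptions, not a description of how any particular workflow system works: the function name `run_workflow`, the step names, and the retry limit are all hypothetical.

```python
from graphlib import TopologicalSorter


def run_workflow(steps, deps, max_retries=2):
    """Run callables in dependency order, retrying failed steps.

    steps: dict mapping step name -> zero-argument callable
    deps:  dict mapping step name -> set of step names it depends on
    Returns a log of (step, attempt, status) tuples.
    """
    log = []
    # static_order() yields each step only after all of its dependencies.
    for name in TopologicalSorter(deps).static_order():
        for attempt in range(1, max_retries + 2):
            try:
                steps[name]()
                log.append((name, attempt, "ok"))
                break
            except Exception:
                log.append((name, attempt, "failed"))
        else:
            # Every attempt failed: stop rather than run dependent steps.
            raise RuntimeError(f"step {name!r} failed after {max_retries + 1} attempts")
    return log


# Hypothetical three-step workflow; "simulate" fails once, then succeeds.
calls = {"simulate": 0}


def flaky_simulate():
    calls["simulate"] += 1
    if calls["simulate"] == 1:
        raise IOError("transient failure")


steps = {"stage_in": lambda: None, "simulate": flaky_simulate, "stage_out": lambda: None}
deps = {"stage_in": set(), "simulate": {"stage_in"}, "stage_out": {"simulate"}}
log = run_workflow(steps, deps)
```

The returned log doubles as a very small execution record: it notes which step ran, on which attempt, and with what outcome.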

4

Capabilities Available Today: Wings/Pegasus Workflows for Seismic Hazard Analysis [Gil et al 07] (see also [Maechlin et al 05] [Deelman et al 06])

Input data: a site and an earthquake forecast model
• thousands of possible fault ruptures and rupture variations, each a file, unevenly distributed
• ~110,000 rupture variations to be simulated for that site

High-level template combines 11 application codes

8048 application nodes in the workflow instance generated by Wings

Provenance records kept for 100,000 workflow data products
• Generated more than 2M triples of metadata

24,135 nodes in the executable workflow generated by Pegasus, including:
• data stage-in jobs, data stage-out jobs, data registration jobs

Executed in USC HPCC cluster (1820 nodes w/ dual processors) but only < 144 available
• Including MPI jobs, each runs on hundreds of processors for 25-33 hours
• Runtime was 1.9 CPU years
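The executable workflow has more nodes than the workflow instance in part because it includes stage-in, stage-out, and registration jobs. The toy sketch below shows the general idea of wrapping compute jobs with such auxiliary jobs. It is illustrative only and does not reproduce the Pegasus planner: the `Job` type, the `refine` function, and the file names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    kind: str  # "compute", "stage_in", "stage_out", or "register"
    inputs: tuple = ()
    outputs: tuple = ()


def refine(compute_jobs):
    """Wrap a list of compute jobs with auxiliary data jobs.

    Files that no compute job produces are treated as external inputs and
    get a stage-in job; files that no compute job consumes are treated as
    final products and get a stage-out job plus a registration job.
    """
    produced = {f for job in compute_jobs for f in job.outputs}
    consumed = {f for job in compute_jobs for f in job.inputs}

    executable = []
    for f in sorted(consumed - produced):
        executable.append(Job(f"stage_in_{f}", "stage_in", outputs=(f,)))
    executable.extend(compute_jobs)
    for f in sorted(produced - consumed):
        executable.append(Job(f"stage_out_{f}", "stage_out", inputs=(f,)))
        executable.append(Job(f"register_{f}", "register", inputs=(f,)))
    return executable


# Hypothetical two-step workflow with made-up file names.
abstract = [
    Job("extract", "compute", inputs=("ruptures.dat",), outputs=("seismogram.dat",)),
    Job("peak", "compute", inputs=("seismogram.dat",), outputs=("hazard.dat",)),
]
executable = refine(abstract)
kinds = [job.kind for job in executable]
```

In this toy case two compute jobs become five executable jobs, since one external input and one final product each need their own data jobs.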

5

The Wings/Pegasus Workflow System [Gil et al 07; Deelman et al 03; Deelman et al 05; Kim et al 08; Gil et al forthcoming]

WINGS: Semantic workflow environment (wings.isi.edu)
• Knowledge-based reasoning on workflows and data (W3C’s OWL)
• Semantic workflow catalogs
• Automation and assistance
• Execution-independent workflows

Pegasus: Automated workflow refinement and execution (pegasus.isi.edu)
• Optimize for performance, cost, reliability
• Assign execution resources
• Manage execution through DAGMan

Grid services (condor.uwisc.edu, www.globus.org)
• Daily operational use in many domains
• Secure and controlled sharing of distributed services, computing, data
• Scalable service-oriented architecture
• Commercial quality, open source

6

Semantic Workflows in WINGS [Gil et al IEEE IS 2010; Gil et al JETAI 2010; Gil et al eScience 2009; Kim et al JCCPE 2008; Gil et al 2007]

Semantic workflows:
• More than a dataflow graph
• Workflow variables: each constituent (node, link, component, dataset) has a corresponding variable
• Semantic constraints on workflow variables, both within and across variables
• Semantic descriptions of collections of data and components are concisely represented

[modelerInput_not_equal_to_classifierInput:
    (:modelerInput wflow:hasDataBinding ?ds1)
    (:classifierInput wflow:hasDataBinding ?ds2)
    equal(?ds1, ?ds2)
    (?t rdf:type wflow:WorkflowTemplate)
    -> (?t wflow:isInvalid "true"^^xsd:boolean)]

(TestData dcdom:isDiscrete false)
(TrainingData dcdom:isDiscrete false)
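The rule above marks a workflow template invalid when the modeler input and the classifier input are bound to the same dataset. The following is a minimal Python sketch of that single check, assuming the bindings are held in a plain dictionary. WINGS expresses the constraint as a rule over semantic descriptions rather than as code like this, and the function name and dataset identifiers are hypothetical.

```python
def template_is_invalid(bindings):
    """Mirror of the rule: invalid if both inputs bind to the same dataset.

    bindings: dict mapping a workflow variable name to a dataset identifier.
    An unbound variable cannot trigger the rule.
    """
    ds1 = bindings.get("modelerInput")
    ds2 = bindings.get("classifierInput")
    return ds1 is not None and ds1 == ds2


# Hypothetical dataset identifiers.
same = {"modelerInput": "weather-2009", "classifierInput": "weather-2009"}
different = {"modelerInput": "weather-train", "classifierInput": "weather-test"}

print(template_is_invalid(same))       # True
print(template_is_invalid(different))  # False
```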

7

Workflow Portal for Genetic Studies of Mental Disorders (with E. Deelman and C. Mason)

Existing repository of genotypic and phenotypic information

Goal: develop workflows useful for data in the repository

8

Designing a Workflow Collection for Population Genomics

Designed workflows for common analysis types
• Association tests
• CNV detection
• Variant discovery
• Family-based association analysis (TDT)

Developed workflow components by encapsulating widely-used heterogeneous open software
• Plink (Purcell, Harvard)
• R (Chambers et al)
• PennCNV (Penn) -- Hidden Markov Models
• Gnosis (State, Yale) -- sliding windows
• Allegro (Decode, Iceland) -- Multiterminal Binary Decision Diagrams
• Structure (Pritchard, Chicago) -- structured association
• FastLink (Schaffer, NCBI)
• (BWA) Burrows-Wheeler Aligner (Li & Durbin)
• SAMTools

9

Wings Workflows for Genetic Studies of Mental Disorders [Gil et al, forthcoming]

CNV Detection

Variant Discovery from Resequencing

Transmission Disequilibrium Test (TDT)

Association Tests

10

Major Features

Workflow system manages set up and execution
• Wings - set up
• Pegasus - execution

Initial collection of workflows captures common genomic analyses

Users can upload their own datasets
• Including collections of datasets

User data is secure
• Not accessible by others

11

Wings Replication of Crohn’s Disease Association Study from [Duerr et al, Science 06]

12

Wings Replication of Early-Onset Parkinson’s Disease Study from [Bayrakli et al, Human Mutation 07]

13

Observations about Reproducibility with Workflows [Gil et al, forthcoming]

Effort involved in reproducing results is minor
• 30 seconds to set up a workflow

A catalog of carefully crafted workflows of select state-of-the-art methods will cover a wide range of genomic analyses
• Our workflows were independently developed and used “as is”

Semantic representations abstract the analysis method from the software that implements it
• Our workflows used different analytic tools than the original studies
• Many implementations of same algorithm, some proprietary

Semantic constraints can be added to workflows to avoid analysis errors
• Eg: in association analysis workflow, added constraint to remove duplicate individuals initially to avoid problems downstream
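One way to picture the duplicate-individuals constraint is as a validation step that runs before the association test. The sketch below is an assumption about what such a step could look like, not the component used in the Wings workflows. The record layout and the `individual_id` field are hypothetical.

```python
def remove_duplicate_individuals(records, key="individual_id"):
    """Keep the first record seen for each individual, preserving order."""
    seen = set()
    unique = []
    for record in records:
        ident = record[key]
        if ident not in seen:
            seen.add(ident)
            unique.append(record)
    return unique


# Hypothetical genotype records; individual "A2" appears twice.
samples = [
    {"individual_id": "A1", "genotype": "AG"},
    {"individual_id": "A2", "genotype": "GG"},
    {"individual_id": "A2", "genotype": "GG"},
    {"individual_id": "A3", "genotype": "AA"},
]
deduplicated = remove_duplicate_individuals(samples)
```

Running the check at the start of the workflow means later steps never see the repeated individual, which is the point of placing the constraint early.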

14

Benefits of Semantic Workflows [Gil JSP-09]

Execution management:
• Automation of workflow execution
• Managing distributed computation
• Managing large data sets
• Security and access control
• Provenance recording
• Low-cost high fidelity reproducibility

Semantics and reasoning:
• User assistance to correctly explore analysis “design space”
• Validation of analyses
• Automated generation of metadata
• Workflow retrieval and discovery
• “Conceptual” reproducibility

15

W3C Provenance Group (Y. Gil, chair): Goals

Provide state-of-the-art understanding and develop a roadmap for development and possible standardization

Articulate requirements for accessing and reasoning about provenance information
• Develop use cases

Identify issues in provenance that are of direct concern to the Semantic Web
• Articulate relationships with other aspects of Web architecture

Report on state-of-the-art work on provenance

Report on a roadmap for provenance in the Semantic Web
• Identify starting points for provenance representations
• Identify elements of a provenance architecture that would benefit from standardization

16

W3C Provenance Group: Products of the Group to Date

Group formed in September 2009, open to new members
• All information is public: http://www.w3.org/2005/Incubator/prov/wiki/

Developed a set of key dimensions for provenance (11/09)
• Grouped into three major categories: content, management, use

Developed use cases for provenance (12/09)
• More than 30 use cases, including ~10 in science but others are relevant

Developed requirements for provenance from use cases (1/10)
• User requirements: what is the purpose of the provenance information
• Technical requirements: derived from the user requirements

Report on “Requirements for Provenance on the Web”

Currently developing state-of-the-art report (expected 6/10)

Started to develop recommendations (expected 9/10)
• Mappings across provenance vocabularies (eg: DC, OPM, SWAN,…)