This is a talk that was presented by Khalid Belhajjame at the eScience conference that took place in 2012 in Chicago.
Why Workflows Break - Understanding and Combating Decay in Taverna Workflows
Jun Zhao, Jose Manuel Gomez-Perez, Khalid Belhajjame, Graham Klyne, Esteban Garcia-Cuesta, Aleix Garrido, Kristina Hettne, Marco Roos, David De Roure, and Carole Goble
10 October, 2012
IEEE eScience 2012. Chicago, USA
http://www.flickr.com/photos/sheepies/3798650645/ @ CC BY-NC 2.0
Reproducibility: Why Bother?
◉ Results produced by scientists not only give insight, they lead to progress and are built upon
◉ Therefore, the ability to test results is important
◉ In the natural sciences, when a scientist claims an experimental result, other scientists should be able to check it.
◉ This should also be possible for experiments carried out in computational environments.
47 of 53 “landmark” publications could not be replicated
Inadequate cell lines and animal models
Nature, 483, 2012
Credit to Carole Goble JCDL 2012 Keynote
A famous quote
An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.
Jon B. Buckheit and David L. Donoho, WaveLab and reproducible research, 1995
Another quote
Abandoning the habit of secrecy in favor of process transparency and peer review was the crucial step by which alchemy became chemistry.
Eric S. Raymond, The art of UNIX programming, 2004
Workflows: A Means for Preserving Scientific Methods
Fortunately, there is a means that can be used to document the experiment that the scientist ran, and even re-run it!
[Figure: example workflow with inputs chromosome17 and chromosome37, KEGG pathway query steps, Detect common pathways steps, and output Common pathways]
Scientific workflows
Increasingly adopted in modern sciences
Transparent documentation of experimental methods
Repeatable and configurable
Workflow Decay
A decayed or reduced ability to be executed or produce the same results
Our Contributions
An empirical analysis for identifying and categorizing the causes of workflow decay
A software framework to assess workflow preservation
Storyline
The importance of reproducibility
Workflow as a means for preserving scientific methods
Understanding the causes of workflow decay
Combating decay
Lessons learnt and future work
Understanding The Causes of Workflow Decay
We adopted an empirical approach
To identify the causes of workflow decay
To quantify their severity
To do so, we analyzed a sample of real workflows to determine if they suffer from decay and the reasons that caused their decay
Experimental Setup
• Taverna workflows from myExperiment.org
– Taverna 1
– Taverna 2
• Selection process
– By the creation year
– By the creator
– By the domain
• Software environment
– Taverna 2.3
• Experiment metadata
– June-July 2012
– 4 researchers
Analyzed Workflows
Number of Taverna 1 workflows from 2007 to 2011
         2007  2008  2009  2010  2011
Tested     12    10    10    10    4*
Total      74   341   101    26    13
Number of Taverna 2 workflows from 2009 to 2012
         2009  2010  2011  2012
Tested     12    10    15     9
Total      97   308   289   184
Profile of Analyzed Workflows
The Proportion of Decay
75% of the 92 tested workflows failed either to execute or to produce the same result (where the result could be tested)
Those from the early years (2007-2009) had a 91% failure rate
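As a cross-check on the figures above, the sketch below sums the "Tested" rows of the two sample tables and confirms that they add up to the 92 workflows quoted here. The variable names are illustrative, and the failure count is derived from the 75% figure rather than stated in the talk.

```python
# Sample sizes transcribed from the Taverna 1 and Taverna 2 tables in the talk.
taverna1 = {
    "tested": {2007: 12, 2008: 10, 2009: 10, 2010: 10, 2011: 4},
    "total": {2007: 74, 2008: 341, 2009: 101, 2010: 26, 2011: 13},
}
taverna2 = {
    "tested": {2009: 12, 2010: 10, 2011: 15, 2012: 9},
    "total": {2009: 97, 2010: 308, 2011: 289, 2012: 184},
}

tested_t1 = sum(taverna1["tested"].values())
tested_t2 = sum(taverna2["tested"].values())
tested_all = tested_t1 + tested_t2

# Derived, not stated in the talk: 75% of the tested sample, rounded.
approx_failed = round(0.75 * tested_all)

print(tested_t1, tested_t2, tested_all, approx_failed)
```

Both Taverna versions contribute the same number of tested workflows, so neither dominates the overall failure rate.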
[Figure: proportion of decayed workflows, shown separately for Taverna 1 and Taverna 2]
The Cause of Decay
• Manual analysis
– By the validation report from Taverna workbench
– By interpreting experiment results reported by Taverna
• Identified 4 categories of causes
– Missing example data
– Missing execution environment
– Insufficient descriptions about workflows
– Volatile third-party resources
• Other unconsidered possible factors
– Changes in the local operating environment (hardware, OS, middleware, compiler, etc.)
Decay Caused by Third-Party Resources
Cause: Third-party resources are not available
– Refined cause: The underlying dataset, particularly a locally hosted in-house dataset, is no longer available
  Example: The researcher hosting the data changed institution and the server is no longer available
– Refined cause: Services are deprecated
  Example: DDBJ web services are no longer provided, despite being used in many myExperiment workflows
Cause: Third-party resources are available but not accessible
– Refined cause: Data is available but identified using different IDs than the ones known to the user
  Example: For scalability reasons the input data is superseded by a new dataset, making the workflow not executable or making it return wrong results
– Refined cause: Data is available but permission, a certificate, or network access is needed to reach it
  Example: Cannot get the input, which is a security token that can only be obtained by a registered user of ChemSpider
– Refined cause: Services are available but need permission, a certificate, or network access to be invoked
  Example: The security policies of the execution framework are updated due to new hosting institution rules
Cause: Third-party resources have changed
– Refined cause: Services are still available under the same identifiers but their functionality has changed
  Example: The web services are updated
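The three top-level causes above can be treated as a small classification problem. The following is a minimal sketch, not something from the talk: it assumes a resource is probed over HTTP, that a content fingerprint was recorded when the workflow last ran correctly, and that the status-code mapping shown is a reasonable approximation. `Decay` and `classify_probe` are hypothetical names.

```python
from enum import Enum
from typing import Optional


class Decay(Enum):
    NONE = "no decay detected"
    UNAVAILABLE = "not available"
    INACCESSIBLE = "available but not accessible"
    CHANGED = "changed"


def classify_probe(status: Optional[int],
                   fingerprint: Optional[str],
                   recorded_fingerprint: Optional[str]) -> Decay:
    """Map one probe of a third-party resource onto the talk's categories.

    status: HTTP status code, or None if the host could not be reached.
    fingerprint: hash of the current response (assumed to be computed elsewhere).
    recorded_fingerprint: hash stored when the workflow last ran correctly.
    The status-code mapping is an assumption made for illustration.
    """
    if status is None or status in (404, 410) or status >= 500:
        return Decay.UNAVAILABLE
    if status in (401, 403, 407):
        return Decay.INACCESSIBLE
    if recorded_fingerprint is not None and fingerprint != recorded_fingerprint:
        return Decay.CHANGED
    return Decay.NONE


print(classify_probe(None, None, "abc").value)
print(classify_probe(403, None, "abc").value)
print(classify_probe(200, "xyz", "abc").value)
```

A probe like this cannot see every refined cause in the table. Data that is still served but under different IDs, for example, may look like a missing resource from the outside.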
Summary of Decay Causes
• 50% of the decay was caused by the volatility of third-party resources
– Unavailable
– Inaccessible
– Updated
• Missing example data
– Unable to re-run
• Missing execution environment
– Such as local plugins
• Insufficient metadata
– Such as any required dependency libraries or permission information
Storyline
The importance of reproducibility
Workflow as a means for preserving scientific methods
Understanding the causes of workflow decay
• Combating decay
• Lessons learnt and future work
Combating Workflow Decay
• Objective: To provide enough information to
– Prevent decay
– Detect decay
– Repair decay
• Approach: Research Objects + Checklists
– Research Objects [1][2]: Aggregate workflow specifications together with auxiliary elements, such as example data inputs, annotations, and provenance traces, that can be used to prevent decay and/or repair the workflow in case of decay.
– Checklists: to check that sufficient information is preserved along with the workflows
[1] http://wf4ever.github.com/ro/
[2] http://wf4ever.github.com/ro-primer/
Wf4Ever Project
Checklists
• Checklists are a well-established tool for guiding practices to ensure safety, quality and consistency in the conduct of complex operations.
• They have been adopted by the biological research community to promote consistency across research datasets
• In our case, we use checklists to assess if a research object contains sufficient information for running the workflow and checking that its results are replicable.
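To make the idea concrete, here is a minimal sketch of a research object treated as an aggregate and evaluated against a checklist. It is not the Wf4Ever tooling or its vocabulary: `ResearchObject`, `RUNNABLE_CHECKLIST`, `evaluate`, and the file name are hypothetical, and the four checklist items are assumptions chosen to mirror the auxiliary elements named in the talk.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ResearchObject:
    """Hypothetical in-memory stand-in for a research object aggregate."""
    workflow: str
    example_inputs: List[str] = field(default_factory=list)
    annotations: Dict[str, str] = field(default_factory=dict)
    provenance_traces: List[str] = field(default_factory=list)


# Each checklist item pairs a description with a predicate over the aggregate.
Checklist = Dict[str, Callable[[ResearchObject], bool]]

RUNNABLE_CHECKLIST: Checklist = {
    "workflow definition is present": lambda ro: bool(ro.workflow),
    "example input data is present": lambda ro: bool(ro.example_inputs),
    "workflow is described": lambda ro: "description" in ro.annotations,
    "provenance trace is present": lambda ro: bool(ro.provenance_traces),
}


def evaluate(ro: ResearchObject, checklist: Checklist) -> List[str]:
    """Return the descriptions of the checklist items that the RO fails."""
    return [item for item, check in checklist.items() if not check(ro)]


# Hypothetical research object that lacks example inputs and provenance.
ro = ResearchObject(workflow="common_pathways.t2flow",
                    annotations={"description": "Detect common pathways"})
missing = evaluate(ro, RUNNABLE_CHECKLIST)
print(missing)
```

Returning the failed items rather than a single pass/fail flag tells the curator which information still has to be added to the research object.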
Checklist-ing the Reproducibility of a Workflow
The Minim model used in our approach is an adaptation of the MIM model [1][2].
[1] Matthew Gamble, Jun Zhao, Graham Klyne and Carole Goble. MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data. eScience 2012
[2] https://raw.github.com/wf4ever/ro-manager/master/src/iaeval/Minim/minim.rdf
Use Case
• 4 myExperiment packs
– 2 from genomics, 1 from geography, and 1 domain-neutral
• Experiment process:
– Transform them into research objects (ROs)
– Create checklist descriptions
• Observations
– 2 research objects were found not to contain the information necessary to run them; the 2 others failed because of updates to third-party resources and to the execution environment.
Storyline
The importance of reproducibility
Workflow as a means for preserving scientific methods
Understanding the causes of workflow decay
• Combating decay
• Lessons Learnt and future work
Lessons Learnt
1. Dependency is the root enemy of reproducible workflows
2. Documentation, i.e., annotation, is vital
3. Documentation should be easy to create
The Future Work
• Decay detection, explanation, and repair
• Reproducibility and provenance
• Working with scientists is vital for reproducible science
– GigaScience
– BioVel
– 2020 Science
Acknowledgement
The principles of provenance. Dagstuhl, March 1, 2012
EU Wf4Ever project (270129) funded under EU FP7 (ICT- 2009.4.1). (http://www.wf4ever-project.org)