Why Workflows Break - Understanding and Combating Decay in Taverna Workflows Jun Zhao, Jose Manuel Gomez-Perez, Khalid Belhajjame , Graham Klyne, Esteban Garcia-Cuesta, Aleix Garrido, Kristina Hettne, Marco Roos, David De Roure, and Carole Goble 10 October, 2012 IEEE eScience 2012. Chicago, USA http://www.flickr.com/photos/sheepies/3798650645/ @ CC BY-NC 2.0

Why Workflows Break

DESCRIPTION

This is a talk that was presented by Khalid Belhajjame at the eScience conference that took place in 2012 in Chicago.

Page 1: Why Workflows Break

Why Workflows Break - Understanding and Combating Decay in Taverna Workflows

Jun Zhao, Jose Manuel Gomez-Perez, Khalid Belhajjame, Graham Klyne, Esteban Garcia-Cuesta, Aleix Garrido, Kristina Hettne, Marco Roos, David De Roure, and Carole Goble

10 October, 2012

IEEE eScience 2012. Chicago, USA

http://www.flickr.com/photos/sheepies/3798650645/ @ CC BY-NC 2.0

Page 2: Why Workflows Break

Reproducibility: Why Bother?

◉ Results produced by scientists not only give insight, they lead to progress and are built upon

◉ Therefore, the ability to test results is important

◉ In natural sciences, when a scientist claims an experimental result, other scientists should be able to check it.

◉ This should also be possible for experiments carried out in computational environments.

Page 3: Why Workflows Break

47 of 53 “landmark” publications could not be replicated

Inadequate cell lines and animal models

Nature, 483, 2012

Credit to Carole Goble JCDL 2012 Keynote

Page 6: Why Workflows Break

A famous quote

An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment and the complete set of instructions which generated the figures.

Jon B. Buckheit and David L. Donoho, WaveLab and reproducible research, 1995

Page 7: Why Workflows Break

Another quote

Abandoning the habit of secrecy in favor of process transparency and peer review was the crucial step by which alchemy became chemistry.

Eric S. Raymond, The art of UNIX programming, 2004

Page 8: Why Workflows Break

Workflows: A Means for Preserving Scientific Methods

Fortunately, there is a means that can be used to document the experiment that the scientist ran, and even re-run it!

[Figure: example workflow with inputs "chromosome17" and "chromosome37", "Kegg pathway query" steps, a "Detect common pathways" step, and the output "Common pathways"]

Scientific workflows

Increasingly adopted in modern sciences.

Transparent documentation of experimental methods

Repeatable and configurable

Page 9: Why Workflows Break

Workflow Decay

A decayed or reduced ability to be executed or produce the same results

Our Contributions

An empirical analysis for identifying and categorizing the causes of workflow decay

A software framework to assess workflow preservation

Page 10: Why Workflows Break

Storyline

The importance of reproducibility

Workflow as a means for preserving scientific methods

Understanding the causes of workflow decay

Combating decay

Lessons learnt and future work

Page 11: Why Workflows Break

Understanding The Causes of Workflow Decay

We adopted an empirical approach

To identify the causes of workflow decay

To quantify their severity

To do so, we analyzed a sample of real workflows to determine if they suffer from decay and the reasons that caused their decay

Page 12: Why Workflows Break

Experimental Setup

Taverna workflows from myExperiment.org

Taverna 1

Taverna 2

Selection process

By the creation year

By the creator

By the domain

Software environment

Taverna 2.3

Experiment metadata

June-July 2012

4 researchers

Page 13: Why Workflows Break

Analyzed Workflows

Number of Taverna 1 workflows from 2007 to 2011

         2007  2008  2009  2010  2011
Tested     12    10    10    10    4*
Total      74   341   101    26    13

Number of Taverna 2 workflows from 2009 to 2012

         2009  2010  2011  2012
Tested     12    10    15     9
Total      97   308   289   184
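The sample sizes in the two tables can be tallied with a short script. This is only a minimal sketch that re-uses the counts listed above; the variable and function names are illustrative, and the asterisk on the 2011 Taverna 1 count is dropped because its footnote is not reproduced in these slides.

```python
# Counts copied from the two tables above: year -> (tested, total on myExperiment).
taverna1 = {2007: (12, 74), 2008: (10, 341), 2009: (10, 101), 2010: (10, 26), 2011: (4, 13)}
taverna2 = {2009: (12, 97), 2010: (10, 308), 2011: (15, 289), 2012: (9, 184)}


def summarize(counts):
    """Return (tested, total, sampled fraction) summed over all years."""
    tested = sum(t for t, _ in counts.values())
    total = sum(n for _, n in counts.values())
    return tested, total, tested / total


t1 = summarize(taverna1)
t2 = summarize(taverna2)
tested_overall = t1[0] + t2[0]

print(f"Taverna 1: {t1[0]} of {t1[1]} workflows tested ({t1[2]:.1%})")
print(f"Taverna 2: {t2[0]} of {t2[1]} workflows tested ({t2[2]:.1%})")
print(f"Overall tested: {tested_overall}")
```

The tested counts sum to 46 per Taverna version, 92 in total, which agrees with the number of tested workflows quoted on the "Proportion of Decay" slide.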

Page 14: Why Workflows Break

Profile of Analyzed Workflows

Page 15: Why Workflows Break

The Proportion of Decay

75% of the 92 tested workflows failed to be either executed or produce the same result (if testable)

Those from early years (2007-2009) had 91% failure rate

[Figure: proportion of decay, shown separately for Taverna 1 and Taverna 2 workflows]

Page 16: Why Workflows Break

The Cause of Decay

• Manual analysis
– By the validation report from Taverna workbench
– By interpreting experiment results reported by Taverna

• Identified 4 categories of causes
– Missing example data
– Missing execution environment
– Insufficient descriptions about workflows
– Volatile third-party resources

• Other unconsidered possible factors
– Changes in the local operating environment (hardware, OS, middleware, compiler, etc.)

Page 17: Why Workflows Break

Decay Caused by Third-Party Resources

Causes, refined causes, and examples

Cause: Third-party resources are not available

– Refined cause: Underlying datasets, particularly locally hosted in-house datasets, are no longer available
  Example: The researcher hosting the data changed institution and the server is no longer available

– Refined cause: Services are deprecated
  Example: DDBJ web services are no longer provided, despite the fact that they are used in many myExperiment workflows

Cause: Third-party resources are available but not accessible

– Refined cause: Data is available but identified using different IDs than the ones known to the user
  Example: For scalability reasons the input data is superseded by new data, making the workflow not executable or causing it to provide wrong results

– Refined cause: Data is available but permission, a certificate, or network access is needed to reach it
  Example: Cannot get the input, which is a security token that can only be obtained by a registered user of ChemSpider

– Refined cause: Services are available but need permission, a certificate, or network access to be invoked
  Example: The security policies of the execution framework are updated due to new hosting institution rules

Cause: Third-party resources have changed

– Refined cause: Services are still available under the same identifiers but their functionality has changed
  Example: The web services are updated
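The three top-level causes above can be read as a small decision procedure. The following is a hedged sketch, not the authors' tooling: the `Probe` record, its fields, and the idea of comparing a recorded interface fingerprint are all assumptions introduced for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Decay(Enum):
    NONE = "no decay detected"
    UNAVAILABLE = "third-party resource is not available"
    INACCESSIBLE = "third-party resource is available but not accessible"
    CHANGED = "third-party resource has changed"


@dataclass
class Probe:
    """Hypothetical result of probing one third-party resource."""
    reachable: bool             # did the resource respond at all?
    authorized: bool            # were we permitted to access or invoke it?
    current_fingerprint: str    # assumed hash of the service description seen now
    recorded_fingerprint: str   # assumed hash stored when the workflow was published


def classify(probe: Probe) -> Decay:
    """Map a probe result onto the three top-level causes listed above."""
    if not probe.reachable:
        return Decay.UNAVAILABLE
    if not probe.authorized:
        return Decay.INACCESSIBLE
    if probe.current_fingerprint != probe.recorded_fingerprint:
        return Decay.CHANGED
    return Decay.NONE


print(classify(Probe(True, False, "abc", "abc")).value)
```

A check of this kind would miss two of the refined causes: data that is still present but filed under different IDs, and services whose behaviour changes while their description stays the same. Detecting those would need example inputs and outputs to compare against.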

Page 19: Why Workflows Break

Summary of Decay Causes

• 50% of the decay was caused by volatility of 3rd-party resources
– Unavailable
– Inaccessible
– Updated

• Missing example data
– Unable to re-run

• Missing execution environment
– Such as local plugins

• Insufficient metadata
– Such as any required dependency libraries or permission information

Page 20: Why Workflows Break

Storyline

The importance of reproducibility

Workflow as a means for preserving scientific methods

Understanding the causes of workflow decay

• Combating decay

• Lessons learnt and future work

Page 21: Why Workflows Break

Combating Workflow Decay

• Objective: To provide enough information to
– Prevent decay
– Detect decay
– Repair decay

• Approach: Research Objects + Checklists
– Research Objects [1][2]: Aggregate workflow specifications together with auxiliary elements, such as example data inputs, annotations, provenance traces that can be used to prevent decay and/or repair the workflow in case of decay.
– Checklists: to check that sufficient information is preserved along with the workflows

[1] http://wf4ever.github.com/ro/

[2] http://wf4ever.github.com/ro-primer/

Wf4Ever Project

Page 22: Why Workflows Break

Checklists

• Checklists are a well established tool for guiding practices to ensure safety, quality and consistency in the conduct of complex operations.

• They have been adopted by the biological research community to promote consistency across research datasets

• In our case, we use checklists to assess if a research object contains sufficient information for running the workflow and checking that its results are replicable.
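The idea of checking a research object against a checklist can be sketched in a few lines. This is an illustration only: it does not use the Minim model or any Wf4Ever tool, and the dictionary keys and requirement wording are hypothetical, loosely based on the auxiliary elements named on the previous slide (workflow specification, example data inputs, annotations, provenance traces).

```python
# Hypothetical checklist: key expected in the research object -> requirement text.
CHECKLIST = {
    "workflow_definition": "workflow specification is present",
    "example_inputs": "example input data is present",
    "annotations": "annotations describing the workflow are present",
    "provenance_traces": "provenance traces of an earlier run are present",
}


def evaluate(research_object: dict) -> list:
    """Return the requirement texts that the research object does not satisfy."""
    return [
        requirement
        for key, requirement in CHECKLIST.items()
        if not research_object.get(key)  # missing or empty counts as unmet
    ]


# A research object modelled as a plain dict (an assumption made for this sketch).
ro = {"workflow_definition": "workflow.t2flow", "example_inputs": ["input.txt"]}

for unmet in evaluate(ro):
    print("UNMET:", unmet)
```

An empty result would mean that, as far as this simplified checklist is concerned, the research object carries enough information to attempt a re-run.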

Page 23: Why Workflows Break

Checklist-ing the Reproducibility of a Workflow

The Minim model used in our approach is an adaptation of the MIM model [1][2].

[1] Matthew Gamble, Jun Zhao, Graham Klyne and Carole Goble. MIM: A Minimum Information Model Vocabulary and Framework for Scientific Linked Data. eScience 2012

[2] https://raw.github.com/wf4ever/ro-manager/master/src/iaeval/Minim/minim.rdf

Page 24: Why Workflows Break

Use Case

• 4 myExperiment packs
– 2 from genomics, 1 from geography, and 1 domain-neutral

• Experiment process:
– Transform them into RO
– Create checklist descriptions

• Observations
– 2 research objects were found not to contain the necessary information to run them; 2 others failed because of updates to third-party resources and the execution environment.

Page 25: Why Workflows Break

Storyline

The importance of reproducibility

Workflow as a means for preserving scientific methods

Understanding the causes of workflow decay

• Combating decay

• Lessons Learnt and future work

Page 26: Why Workflows Break

Lessons Learnt

1. Dependency is the root enemy of reproducible workflows

2. Documentation, i.e., annotation, is vital

3. Documentation should be easy to create

Page 27: Why Workflows Break

The Future Work

• Decay detection, explanation, and repair

• Reproducibility and provenance

• Working with scientists is vital for reproducible science
– GigaScience
– BioVel
– 2020 Science

Page 28: Why Workflows Break

Acknowledgement

The principles of provenance. Dagstuhl, March 1, 2012

EU Wf4Ever project (270129) funded under EU FP7 (ICT- 2009.4.1). (http://www.wf4ever-project.org)