Presented at the Open Knowledge Conference 2011 in Berlin. This work is being done under the heading of DataONE. More information can be found at http://notebooks.dataone.org/workflows
Workflow Classification and Open-Sourcing Methods: Towards a New Publication Model
Richard Littauer, Karthik Ram, Bertram Ludäscher, William Michener, Rebecca Koskela
DataONE
Scientific Workflows
• Tools that help scientists:
  • Automate repetitive or difficult work
  • Provide reproducibility to their experiments
  • Track provenance
  • Share their data with other scientists
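To make these points concrete, here is a minimal sketch of what a workflow engine does, written in plain Python rather than in any real workbench. The step functions, the `run_workflow` helper and the toy input are all hypothetical; real systems record far richer provenance than this.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List, Tuple


def drop_missing(values: list) -> list:
    """Hypothetical step: remove missing (None) observations."""
    return [v for v in values if v is not None]


def mean(values: list) -> float:
    """Hypothetical step: arithmetic mean of the remaining observations."""
    return sum(values) / len(values)


def fingerprint(value: Any) -> str:
    """Short, stable hash of a JSON-serialisable value, used to identify data in the log."""
    encoded = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()[:12]


def run_workflow(steps: List[Callable[[Any], Any]], data: Any) -> Tuple[Any, List[Dict[str, str]]]:
    """Run the steps in order and record one provenance entry per step."""
    provenance = []
    for step in steps:
        result = step(data)
        provenance.append({
            "step": step.__name__,
            "input": fingerprint(data),
            "output": fingerprint(result),
        })
        data = result
    return data, provenance


result, log = run_workflow([drop_missing, mean], [2.0, None, 4.0])
for entry in log:
    print(entry)
```

Running the same steps on the same input yields the same result and the same log, which is the sense in which an automated workflow supports reproducibility. The log itself is a very small provenance trail that could be shared alongside the data.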
Workflow Workbenches
• These facilitate:
  • Creation (image credit: http://www.flickr.com/photos/ideacreamanuelapps/3542203718/)
  • Mapping (image credit: http://www.flickr.com/photos/fatguyinalittlecoat/5716492273)
  • Scheduling (image credit: http://www.flickr.com/photos/silent-penguin/232394/)
  • Execution (image credit: http://www.flickr.com/photos/pagedooley/4039784738/)
  • Visualisation (image credit: http://www.flickr.com/photos/cnon/5698746966/)
  • Re-use (image credit: http://www.flickr.com/photos/nihonbunka/32774212/)
Workflow Workbenches
• Not all scientists are coders.
• By using front-end visualizations and eliminating the need for lower-level coding (e.g., shell scripts), workbenches make it easier for scientists to do and share their work.
Image credit: http://www.flickr.com/photos/wouterverhelst/362538835/
Workflow Workbenches
• This is how workflows are commonly ‘sold’.
• However, the reality isn't quite there yet.
• Often it is just replacing one style of coding (conventional) with another (workflows).
• We’re trying to get to the bottom of how these promises cash out.
Image credit: http://www.flickr.com/photos/amagill/3366720659/
Our Study
• However, few studies have looked at how these workflows actually work.
• How do we classify workflows?
• Where do existing workflow systems fall short?
• How can the process of creating workflows be improved?
• How about executing them?
• And sharing them?
Image credit: http://www.flickr.com/photos/eleaf/2536358399
Our Study
• Some studies have been done.
• For example, as many as 30% of workflow components have been assessed to be so-called data conversion shims [4].
• This large percentage, together with the difficulty of developing custom shims, suggests that workflow design technology can still be improved.
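As a rough illustration of how such a percentage could be computed for a single workflow, the sketch below counts the components tagged as shims. The component names and the `kind` labels are invented for illustration; they are not taken from [4] or from any real workflow.

```python
from collections import Counter

# Hypothetical workflow: (component name, kind). Made-up example, not real data.
components = [
    ("fetch_sequences", "service"),
    ("xml_to_fasta", "shim"),
    ("align", "service"),
    ("fasta_to_table", "shim"),
    ("plot_alignment", "local"),
]


def shim_fraction(components):
    """Share of components whose only job is converting data between formats."""
    if not components:
        raise ValueError("workflow has no components")
    kinds = Counter(kind for _, kind in components)
    return kinds["shim"] / len(components)


print(f"{shim_fraction(components):.0%} of components are shims")
```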
Our Study
• Most importantly, these studies have not significantly changed the way we use workflows.
• In some cases, studies run on the same data came up with different results, which suggests that open data alone does not lead to reproducible science [5].
• Therefore, a greater understanding of workflows, and of how best to integrate them into open science, is called for.
Our Study
• We are analyzing a wide variety of workflow systems and publicly available workflows.
• Our main repository: http://www.myexperiment.org
  • Est. 2007
  • 4500+ users
  • 1850+ workflows (mostly Taverna 1 and 2, and RapidMiner)
  • Minable by SPARQL
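A sketch of what mining the repository with SPARQL could look like. The query is only assembled as text here and nothing is sent over the network. The `mecontrib` prefix URI and the `mecontrib:Workflow` class are assumptions about the myExperiment ontology and should be checked against its documentation before use. `dcterms:title` and `dcterms:created` are standard Dublin Core terms, but whether the repository attaches them to workflows is likewise an assumption.

```python
from textwrap import dedent

# ASSUMPTION: the mecontrib prefix URI and the Workflow class are illustrative
# guesses at the repository's vocabulary, not a verified schema.
PREFIXES = {
    "dcterms": "http://purl.org/dc/terms/",
    "mecontrib": "http://rdf.myexperiment.org/ontologies/contributions/",
}


def build_workflow_query(limit: int = 10) -> str:
    """Return a SPARQL SELECT listing workflow titles and creation dates, oldest first."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    prefix_lines = "\n".join(f"PREFIX {name}: <{uri}>" for name, uri in PREFIXES.items())
    body = dedent(f"""\
        SELECT ?workflow ?title ?created
        WHERE {{
          ?workflow a mecontrib:Workflow ;
                    dcterms:title ?title ;
                    dcterms:created ?created .
        }}
        ORDER BY ?created
        LIMIT {limit}""")
    return prefix_lines + "\n" + body


print(build_workflow_query(5))
```

Ordering by creation date is a deliberate choice: it returns the records in the order needed for looking at how workflows change over time.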
Our Study
• Methods:
  • For each workflow, we’re gathering three tiers of information:
    • Metadata
    • Description
    • ‘Worth’
Image credit: http://www.flickr.com/photos/jpvargas/83258973/
Tier 1
Metadata:
• Workflow source
• Workflow system
• Works on run
• Area of research
• Type
• Description
• User
• User total uploads
• Published citations
• Downloads
• Date uploaded
Tier 2
Description:
• Foreign components
• QA/QC steps
• Visual output
• Number of inputs
• Intermediate input
• Linear
• Embedded
• Embedded details
• Number of databases
• Type conversion
• Tag conversion
• Multiple outputs
• Processing
• Stats
• Scalable
• Smart reruns
• Provenance retained
• Multipurpose
• Research mining
• Query
• Loop
• Grid
• Accounts necessary
• External results
Tier 3
‘Worth’:
• Sufficiency of metadata
• Sufficiency of natural language description
• Reuse in published articles
• Relevant issues based on the system it was created in
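The three tiers map naturally onto a nested record. Below is a minimal sketch using Python dataclasses and only a handful of fields per tier. The class names, field names, example values and the 0 to 2 sufficiency scale are illustrative assumptions, not the project's actual schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Metadata:
    """Tier 1: facts read directly from the repository (subset of fields)."""
    source: str
    system: str
    works_on_run: bool
    downloads: int
    date_uploaded: date


@dataclass
class Description:
    """Tier 2: structural properties found by inspecting the workflow (subset)."""
    number_of_inputs: int
    linear: bool
    has_loop: bool
    type_conversion: bool


@dataclass
class Worth:
    """Tier 3: judgement calls. ASSUMPTION: scored 0 (poor) to 2 (good)."""
    metadata_sufficiency: int
    description_sufficiency: int
    reused_in_publications: bool


@dataclass
class WorkflowRecord:
    metadata: Metadata
    description: Optional[Description] = None  # filled in once the workflow is inspected
    worth: Optional[Worth] = None              # filled in last

    def tiers_completed(self) -> int:
        """Number of tiers that have been filled in for this workflow."""
        return 1 + (self.description is not None) + (self.worth is not None)


# Hypothetical record, for illustration only.
record = WorkflowRecord(
    metadata=Metadata("myExperiment", "Taverna 2", True, 120, date(2010, 5, 1)),
    description=Description(number_of_inputs=2, linear=True, has_loop=False, type_conversion=True),
)
print(record.tiers_completed())
```

Keeping the later tiers optional reflects that tier 1 can be collected automatically, while tiers 2 and 3 need a person to open and assess each workflow.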
Research Hypotheses
1. Most workflows perform simple but repetitive data acquisition tasks, as opposed to complex operations.
2. Workflows are becoming more complex over time.
3. Workflows are becoming more powerful over time.
4. Workflows become more complex as their creators gain experience.
5. Workflow re-use is proportional to the complexity of the tasks performed by the workflow.
6. Workflow re-use is proportional to the sufficiency of the documentation.
7. Workflow re-use is proportional to the age of the workflow.
8. Workflow re-use is proportional to the proficiency of the creator.
Image credit: http://www.flickr.com/photos/nauright/5391995939/
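Several of these hypotheses claim that re-use grows with some other property. One simple way such a claim could be checked, not necessarily the analysis planned for this study, is a correlation coefficient between the re-use measure and that property. The sketch below implements Pearson's r by hand; the scores and download figures are made up for illustration and are not results.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)


# Made-up values for five hypothetical workflows (not study data).
documentation_score = [0, 1, 1, 2, 2]
downloads_per_month = [3.0, 5.0, 4.0, 9.0, 8.0]

r = pearson(documentation_score, downloads_per_month)
print(f"r = {r:.2f}")
```

A value near 1 would be consistent with the hypothesis and a value near 0 would not. Pearson's r only measures linear association, so a rank-based coefficient may suit ordinal scores better.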
Data
• Still being gathered and analysed.
• We’re using the myExperiment download rate as a proxy for workflow reuse.
• One issue with this is the number of workflows created by each user.
• However, this should still allow for a diachronic analysis.
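The slides do not define "download rate", so the sketch below assumes one plausible reading: total downloads divided by days since upload, which keeps older workflows from looking more re-used simply because they have been online longer. Grouping by upload year then gives a simple diachronic view. The records and the snapshot date are made up for illustration.

```python
from collections import defaultdict
from datetime import date


def downloads_per_day(downloads: int, uploaded: date, today: date) -> float:
    """ASSUMED definition of download rate: downloads divided by days online (at least one)."""
    days_online = max((today - uploaded).days, 1)
    return downloads / days_online


def mean_rate_by_year(records, today):
    """Group workflows by upload year and average their download rates."""
    rates = defaultdict(list)
    for downloads, uploaded in records:
        rates[uploaded.year].append(downloads_per_day(downloads, uploaded, today))
    return {year: sum(r) / len(r) for year, r in sorted(rates.items())}


# Made-up records: (total downloads, upload date). Not study data.
records = [
    (400, date(2008, 1, 1)),
    (200, date(2008, 1, 1)),
    (100, date(2010, 1, 1)),
]
snapshot = date(2011, 1, 1)

by_year = mean_rate_by_year(records, snapshot)
for year, rate in by_year.items():
    print(year, round(rate, 3))
```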
Conclusion
Old publishing model:
Write paper. Submit paper. Drink wine.
New publishing model:
Write paper. Submit paper. Submit data. Get feedback. Replication (?)
Better publishing model:
Write paper using workflows. Submit paper. Submit data. Submit workflows. Get feedback. Replication that works.
As this is done, questions of how effective workflows are, and how they can be utilized in the new research and publishing paradigm, might be answered.
Image credits: http://www.flickr.com/photos/joelmontes/4762384399/, http://www.flickr.com/photos/mactitioner/5595830505
References
• [1] Kepler Project. http://www.kepler-project.org
• [2] Taverna. http://www.taverna.org.uk/
• [3] VisTrails. http://www.vistrails.org/
• [4] Cui Lin, Shiyong Lu, Xubo Fei, Darshan Pai, and Jing Hua. 2009. A Task Abstraction and Mapping Approach to the Shimming Problem in Scientific Workflows. In Proceedings of the 2009 IEEE International Conference on Services Computing (SCC '09). IEEE Computer Society, Washington, DC, USA. http://dx.doi.org/10.1109/SCC.2009.77
• [5] Coombes, K. R., Wang, J. & Baggerly, K. A. Microarrays: retracing steps. Nature Med. 13, 1276–1277 (2007).
DataONE Workflows Project: http://notebooks.dataone.org/workflows
Mendeley Research Group: http://www.mendeley.com/groups/1189721/scientific-workflows-and-workflow-systems/
Image credit: http://www.flickr.com/photos/wwworks/4759535950/