Sarah Cohen-Boulakia
Université Paris Sud, LRI CNRS UMR 8623
On leave at INRIA Virtual Plants & Zenith, Inst. of Comput. Biology, Montpellier
Sarah Cohen-Boulakia, Université Paris Sud
Repositories queried (IR-style) with workflow docum.
Open question: Query languages for repositories
◦ Given a high-level description of an (integration) task – a sketch
◦ Given an input and/or an output format/type
◦ Given a workflow – find similar workflows
◦ Search across workflow models (Galaxy, Taverna…) …
◦ Querying runs [KSB10, MPB10,…15]
Core of the problem: Workflow similarity
◦ Clear view of the state-of-the-art [SCB+14]
◦ Need to design hybrid and efficient solutions
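To make the notion of workflow similarity concrete, here is a minimal sketch of one very simple baseline: the Jaccard index over the sets of module labels of two workflows. It ignores the graph structure entirely and is not presented as the approach surveyed in [SCB+14]; the function name and the workflow contents are hypothetical.

```python
def jaccard_similarity(modules_a, modules_b):
    """Jaccard index of two collections of module labels (0.0 to 1.0)."""
    a, b = set(modules_a), set(modules_b)
    if not a and not b:
        return 1.0  # convention chosen here: two empty workflows are identical
    return len(a & b) / len(a | b)


# Hypothetical workflows, each reduced to the labels of its modules.
wf_blast = ["fetch_sequence", "blast", "parse_hits", "render_report"]
wf_align = ["fetch_sequence", "blast", "multiple_alignment"]

# 2 shared labels out of 5 distinct labels overall.
score = jaccard_similarity(wf_blast, wf_align)
print(score)
```

A set-based measure like this is cheap to compute over a large repository, which is one reason a hybrid solution might use it as a first filter before a more expensive structural comparison.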
Becomes a practical topic only now
◦ Large repositories are available + Smaller provenance repositories
Relationships with business workflows: BPQL, BPMN-Q [AS10], BP-QL [BEKM08], … are starting to consider logs
Reuse can be improved by providing citation
Discovering reused workflows in existing repositories
◦ Detecting graph patterns
Various techniques exist, again graph-based problems
Subgraph isomorphism [Ull76] or graph simulation [FLM+10]
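As an illustration of what detecting a graph pattern means here, the following sketch searches a target workflow graph for an edge-preserving, injective mapping of a pattern graph (non-induced subgraph isomorphism). It is a brute-force enumeration for tiny graphs only, not the algorithm of [Ull76] nor graph simulation [FLM+10]; it ignores node labels and isolated nodes, and all names are hypothetical.

```python
from itertools import permutations


def find_subgraph_match(pattern_edges, target_edges):
    """Return one injective node mapping pattern -> target that preserves
    every directed pattern edge, or None if no such mapping exists."""
    pattern_nodes = sorted({n for edge in pattern_edges for n in edge})
    target_nodes = sorted({n for edge in target_edges for n in edge})
    target_edge_set = set(target_edges)
    # Try every injective assignment of pattern nodes to target nodes.
    for candidate in permutations(target_nodes, len(pattern_nodes)):
        mapping = dict(zip(pattern_nodes, candidate))
        if all((mapping[u], mapping[v]) in target_edge_set
               for u, v in pattern_edges):
            return mapping
    return None


# Hypothetical target workflow: a linear pipeline of four processors.
target = [("fetch", "blast"), ("blast", "parse"), ("parse", "report")]
chain = [("a", "b"), ("b", "c")]   # pattern: a chain of three steps
cycle = [("a", "b"), ("b", "a")]   # pattern: a two-node cycle

chain_match = find_subgraph_match(chain, target)
cycle_match = find_subgraph_match(cycle, target)
```

The enumeration is exponential in the pattern size, so it only serves to pin down the definition; on repository-scale data a pruning or indexing technique would be needed.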
Constructing workflow citations
◦ Techniques to track copy-paste operations when designing workflow
◦ Workflow as citeable objects
◦ Storing/indexing workflows (graphs)
Illustration
[Figure, shown over several slides: workflows interconnected by ≥ 3 mutual processors (avg 11.4 processors per scientific workflow)]
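The interconnection criterion of the illustration (two workflows are linked when they share at least 3 processors) can be sketched as follows. This is only a guess at how such a graph could be built: the function name, the threshold parameter and the toy data are assumptions, and processors are compared by exact label.

```python
from itertools import combinations


def interconnect(workflows, min_shared=3):
    """Return {(name_a, name_b): shared_count} for every pair of workflows
    that has at least min_shared processors in common."""
    links = {}
    for name_a, name_b in combinations(sorted(workflows), 2):
        shared = set(workflows[name_a]) & set(workflows[name_b])
        if len(shared) >= min_shared:
            links[(name_a, name_b)] = len(shared)
    return links


# Hypothetical repository: workflow name -> set of processor labels.
workflows = {
    "wf1": {"fetch", "blast", "parse", "plot"},
    "wf2": {"fetch", "blast", "parse", "align"},
    "wf3": {"fetch", "align", "tree"},
}

links = interconnect(workflows)
print(links)
```

Only wf1 and wf2 reach the threshold in this toy data; the pairwise comparison is quadratic in the number of workflows, which is acceptable for a sketch but would call for indexing on a large repository.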
Workflows and provenance help in reproducibility
But tasks may require certain software to be pre-installed
Open question: Make SWFS infrastructure-aware
◦ Problem is well studied in operating systems / middleware
◦ SWFS need to communicate with operating system
New approaches are emerging, possibly combined with workflows (virtual environments): Docker, Reprozip, …
Reproducible papers
◦ Web-based interactive computational environment
◦ Combination of code execution, text, mathematics, plots and rich media into a single document
◦ Some systems export workflows as executable IPython papers
To be formalized
Many bioinformatics analyses are performed using scripts (instead of workflows)
Provenance of a script execution?
◦ noWorkflow [MBC+14], yesWorkflow [MSK+15]
Equivalence between scripts and workflows?
◦ Provenance-equivalence [CBC+14]? Other kinds of equivalence?
Aim
◦ Optimization of workflows (using ZOOM*userviews, DistillFlow…)
◦ Optimization of scripts (refactoring, …)
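To give a feel for what the provenance of a script execution could contain, here is a hand-rolled sketch that records each call of a decorated function together with its arguments and result. It is purely illustrative and does not reproduce how noWorkflow [MBC+14] or yesWorkflow [MSK+15] operate; the decorator, the log structure and the two analysis functions are hypothetical.

```python
import functools

provenance_log = []  # one record per traced call, in completion order


def traced(func):
    """Decorator that appends a provenance record after each call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        provenance_log.append({
            "function": func.__name__,
            "args": args,
            "kwargs": kwargs,
            "result": result,
        })
        return result
    return wrapper


@traced
def normalize(values):
    """Scale values so that they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


@traced
def top(values, k=1):
    """Return the k largest values."""
    return sorted(values, reverse=True)[:k]


best = top(normalize([1, 1, 2]), k=1)
call_order = [record["function"] for record in provenance_log]
```

Because the output of normalize appears as the input of top in the log, the records already describe a small dataflow graph, which hints at why the equivalence between scripts and workflows is a natural question.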
On-the-fly solutions have to be designed
◦ Data is too volatile to be updated as in data warehouses
One size cannot fit all
◦ combining ranking criteria or consensus ranking?
Exploiting alternative paths
◦ Tuning page-rank…?
Organizing challenges & providing gold standards to evaluate solutions
Many research opportunities …
… with big impact on large communities of users!
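The consensus-ranking question raised above can be illustrated with a Borda count, one simple positional way of merging several rankings of the same items. This is a sketch under assumptions: the slides do not say which aggregation method is intended, the item names and the three criteria are hypothetical, and ties are broken alphabetically.

```python
from collections import defaultdict


def borda_consensus(rankings):
    """Merge rankings (best item first) into one consensus ranking.
    In a ranking of n items, the item at position p earns n - 1 - p points."""
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, item in enumerate(ranking):
            scores[item] += n - 1 - position
    # Highest total first; alphabetical order breaks ties.
    return sorted(scores, key=lambda item: (-scores[item], item))


# Hypothetical rankings of the same three results under three criteria.
by_criterion_1 = ["item_b", "item_a", "item_c"]
by_criterion_2 = ["item_a", "item_b", "item_c"]
by_criterion_3 = ["item_a", "item_c", "item_b"]

consensus = borda_consensus([by_criterion_1, by_criterion_2, by_criterion_3])
print(consensus)
```

Here item_a collects 5 points, item_b 3 and item_c 1. Other aggregation rules can disagree with Borda on the same input, which is one reason gold standards are needed to evaluate solutions.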
Data Integration in the Life Sciences (DILS) is more important than ever
The growing volume of data, the growing number of sources and analytic tools, and the increasing complexity of analysis pipelines raise numerous challenges
Scientific workflows play a crucial role by their ability to combine analysis and integration and enhance reproducibility
Ranking is necessary to help prioritize research
New developments in Databases and Graphs (algorithmics) will have a major impact in DILS…
… and new algorithms from the DILS community may be reused by other communities!