Reproducibility Challenges in Computational Settings: What are they, why should we address them, and how? Andreas Rauber, Vienna University of Technology, http://www.ifs.tuwien.ac.at/~andi

Page 1: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Reproducibility Challenges in Computational Settings:

What are they, why should we address them, and how?

Andreas Rauber

Vienna University of Technology

http://www.ifs.tuwien.ac.at/~andi

Page 2: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Outline

What are the challenges in reproducibility?

What do we gain from reproducibility?

(and: why is non-reproducibility interesting?)

How to address the challenges of complex processes?

How to deal with “Big Data”?

Summary

Page 3: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Challenges in Reproducibility

Challenges in reproducibility or: Why data sharing is not enough

FAIR principles are a necessity

Data Management and DMPs are a necessity

But they are not sufficient if we want to

- Ensure reproducibility

- Enable metastudies

- Benefit from efficient eScience

…unless we define data broader than we commonly tend to do

Page 4: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Challenges in Reproducibility

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0038234

Page 5: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Challenges in Reproducibility

Excursion: Scientific Processes

Page 6: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Challenges in Reproducibility

Excursion: Scientific Processes

[Figure: Java vs. Matlab, panels set1_freq440Hz_Am12.0Hz, set1_freq440Hz_Am05.5Hz, set1_freq440Hz_Am11.0Hz]

Page 7: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Challenges in Reproducibility

Excursion: Scientific Processes

- Bug?

- Psychoacoustic transformation tables?

- Forgetting a transformation?

- Different implementation of filters?

- Limited accuracy of calculation?

- Difference in FFT implementation?

- ...?
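The "limited accuracy of calculation" item can be illustrated with a minimal sketch. It assumes nothing about the Java or Matlab code on the slides; it only shows that two implementations adding the same three numbers in a different order can disagree in the last bit, which is why results should be compared for consistency within a tolerance rather than for identity.

```python
import math


def accumulate(values):
    """Add the values left to right, as a plain loop in any language would."""
    total = 0.0
    for v in values:
        total += v
    return total


# Same inputs, two summation orders (illustrative values, not from the slides).
forward = accumulate([0.1, 0.2, 0.3])
backward = accumulate([0.3, 0.2, 0.1])

print(forward == backward)              # bitwise identity does not hold
print(math.isclose(forward, backward))  # agreement within a relative tolerance holds
```

The same effect appears at larger scale in filters and FFT routines, where the order of operations is an implementation detail.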

Page 8: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Challenges in Reproducibility

Workflows

Taverna

Page 9: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Challenges in Reproducibility

Page 10: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Challenges in Reproducibility

Large-scale quantitative analysis

Obtain workflows from MyExperiments.org

- March 2015: almost 2,700 WFs (approx. 300-400/year)

- Focus on Taverna 2 WFs: 1,443 WFs

- Published by their authors, so they should be of “better quality”

Try to re-execute the workflows

- Record data on the reasons for failure along the way

Analyse the most common reasons for failures
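A minimal sketch of such a re-execution study, under stated assumptions: the commands below are stand-ins for real workflow invocations, and the outcome labels are hypothetical categories, not the ones used in the study cited on the next slide.

```python
import subprocess
import sys
from collections import Counter


def try_rerun(cmd, timeout=10):
    """Run one command and return a coarse outcome label (hypothetical categories)."""
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout)
    except FileNotFoundError:
        return "missing executable"
    except subprocess.TimeoutExpired:
        return "timeout"
    return "ok" if result.returncode == 0 else "non-zero exit"


# Stand-ins for workflow runs; a real study would call the workflow engine here.
commands = [
    [sys.executable, "-c", "print('fine')"],
    [sys.executable, "-c", "import sys; sys.exit(3)"],
    ["definitely-not-an-installed-tool-xyz"],
]

# Tally the outcomes so the most common reasons for failure come first.
outcomes = Counter(try_rerun(cmd) for cmd in commands)
for reason, count in outcomes.most_common():
    print(f"{reason}: {count}")
```

A successful exit code says nothing yet about the correctness of the results, which is a separate check.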

Page 11: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Re-execution results

Majority of workflows fail

Only 23.6% are successfully executed

- No analysis yet on correctness of results…

Challenges in Reproducibility

Rudolf Mayer, Andreas Rauber, “A Quantitative Study on the Re-executability of Publicly Shared Scientific Workflows”, 11th IEEE Intl. Conference on e-Science, 2015.

Page 12: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Computer Science

613 papers in 8 ACM conferences

Process

- download paper and classify

- search for a link to code (paper, web, email twice)

- download code

- build and execute

Christian Collberg and Todd Proebsting. “Repeatability in Computer Systems Research,” CACM 59(3):62-69, 2016.

Page 13: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

In a nutshell – and another aspect of reproducibility:

Challenges in Reproducibility

Source: xkcd

Page 14: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Reproducibility – solved! (?)

Reproducibility is more than just sharing the data!

Provide source code, parameters, data, …

Ensure that it works:

Wrap it up in a container/virtual machine, …

done?

LXC

Page 15: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Outline

What are the challenges in reproducibility?

What do we gain by aiming for reproducibility?

How to address the challenges of complex processes?

How to deal with dynamic data?

Summary

Page 16: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Reproducibility – solved! (?)

Provide source code, parameters, data, …

Wrap it up in a container/virtual machine, …

Why do we want reproducibility?

Which levels of reproducibility are there?

What do we gain by different levels of reproducibility?

LXC

Page 17: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Reproducibility – solved! (?)

Dagstuhl Seminar: Reproducibility of Data-Oriented Experiments in e-Science

January 2016, Dagstuhl, Germany

Page 18: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Types of Reproducibility

The PRIMAD [1] model: which attributes can we “prime”?

- Data

  • Parameters

  • Input data

- Platform

- Implementation

- Method

- Research Objective

- Actors

What do we gain by priming one or the other?

[1] Juliana Freire, Norbert Fuhr, and Andreas Rauber. Reproducibility of Data-Oriented Experiments in eScience. Dagstuhl Reports, 6(1), 2016.
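A minimal sketch of how the PRIMAD attributes could be recorded per run, so that a reproduction can state which of them it primed (changed). The class, the function and all attribute values are hypothetical illustrations, not part of the Dagstuhl report.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class Experiment:
    """One run described by the PRIMAD attributes (values are free-text labels)."""
    platform: str
    research_objective: str
    implementation: str
    method: str
    actor: str
    parameters: str   # Data: parameters
    input_data: str   # Data: input data


def primed(original, rerun):
    """Return the names of the attributes that differ between two runs."""
    return [f.name for f in fields(Experiment)
            if getattr(original, f.name) != getattr(rerun, f.name)]


# Hypothetical example values.
original = Experiment(
    platform="Linux", research_objective="audio feature comparison",
    implementation="Java", method="rhythm features", actor="lab A",
    parameters="default", input_data="set1",
)
# A re-implementation by a different group: two attributes are primed.
rerun = replace(original, implementation="Matlab", actor="lab B")

print(primed(original, rerun))
```

Reporting the primed attributes alongside the results makes explicit what a successful reproduction actually demonstrates.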

Page 19: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Types of Reproducibility and Gains

Page 20: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Reproducibility Papers

Aim for reproducibility: for one’s own sake – and as chair of a conference track, editor, reviewer, supervisor, …

- Review of reproducibility of submitted work (material provided)

- Encouraging reproducibility studies

- (Messages to stakeholders in Dagstuhl Report)

Consistency of results, not identity!

Reproducibility studies and papers

- Not just re-running code / a virtual machine

- When is a reproducibility paper worth the effort / worth being published?

Page 21: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Reproducibility Papers

When is a reproducibility paper worth being published?

Page 22: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Learning from Non-Reproducibility

Do we always want reproducibility?

- Scientifically speaking: yes!

Research is addressing challenges:

- Looking for and learning from non-reproducibility!

Non-reproducibility if

- Some (unknown) aspect of a study influences results

- Technical: parameter sweep, bug in code, OS, … -> fix it!

- Non-technical: input data! (specifically: “the user”)

Page 23: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Learning from Non-Reproducibility

Challenges in MIR – “things sometimes don’t seem to work”

Virtual Box, Github, <your favourite tool> are starting points

Same features, same algorithm, different data ->

Same data, different listeners ->

Understanding “the rest”:

- Isolating unknown influence factors

- Generating hypotheses

- Verifying these to understand the “entire system”, cultural and other biases, …

Benchmarks and Meta-Studies

Page 24: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Reproducibility – solved! (?)

Provide source code, parameters, data, …

Wrap it up in a container/virtual machine, …

Provide context information

Encourage reproducibility studies beyond re-running

Use it to establish trust in your research & gain new insights

done?

LXC

Page 25: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Outline

What are the challenges in reproducibility?

What do we gain by aiming for reproducibility?

How to address the challenges of complex processes?

How to deal with “Big Data”?

Summary

Page 26: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Déjà vu…

http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0038234

Page 27: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

And the solution is…

Standardization and Documentation

- Standardized components, procedures, workflows

- Documenting complete system set-up across entire provenance chain

How to do this – efficiently?

Alexander Graham Bell’s Notebook, March 9 1876
https://commons.wikimedia.org/wiki/File:Alexander_Graham_Bell's_notebook,_March_9,_1876.PNG

Pieter Bruegel the Elder: De Alchemist (British Museum, London)

Page 28: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Documenting a Process

Context Model: establish what to document and how

Meta-model for describing process & context

- Extensible architecture integrated by core model

- Reusing existing models as much as possible

- Based on ArchiMate, implemented using OWL

Extracted by static and dynamic analysis

Page 29: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Context Model – Static Analysis

Analyses steps, platforms, services, tools called

Dependencies (packages, libraries)

HW, SW

Licenses, …

[Figure: Taverna Workflow, ArchiMate model, Context Model (OWL ontology)]

Script:

#!/bin/bash

# fetch data
java -jar GestBarragensWSClientIQData.jar
unzip -o IQData.zip

# fix encoding
#iconv -f LATIN1 -t UTF-8 iq.r > iq_utf8.r

# generate references
R --vanilla < iq_utf8.r > IQout.txt

# create pdf
pdflatex iq.tex
pdflatex iq.tex
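A minimal sketch of what a static analysis of the script on this slide could extract: the tools it calls. It is an illustration only, not the context-model extractor itself; `tools_called` is a hypothetical helper, and it handles only simple one-command lines (no pipes, loops or variables).

```python
import shlex

# The example script from this slide, including the commented-out iconv step.
SCRIPT = """#!/bin/bash
# fetch data
java -jar GestBarragensWSClientIQData.jar
unzip -o IQData.zip
# fix encoding
#iconv -f LATIN1 -t UTF-8 iq.r > iq_utf8.r
# generate references
R --vanilla < iq_utf8.r > IQout.txt
# create pdf
pdflatex iq.tex
pdflatex iq.tex
"""


def tools_called(script):
    """Return the distinct commands invoked, in order of first use."""
    tools = []
    for line in script.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and the shebang
        tool = shlex.split(line)[0]
        if tool not in tools:
            tools.append(tool)
    return tools


print(tools_called(SCRIPT))
```

Each tool found this way is a dependency whose version and licence would also have to be documented.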

Page 30: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Context Model – Dynamic Analysis

Process Migration Framework (PMF)

- designed for automatic redeployments into virtual machines

- uses strace to monitor system calls

- complete log of all accessed resources (files, ports)

- captures and stores process instance data

- analyse resources (file formats via PRONOM, PREMIS)
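A minimal sketch of the idea behind the monitoring step: read a system-call trace and list the files and ports a process touched. The trace lines are a hypothetical sample modelled on typical strace output, and the regular expressions cover only `open`/`openat` and `connect`; this is not the PMF implementation.

```python
import re

# Hypothetical sample in the style of strace output (paths and address are made up).
SAMPLE_TRACE = '''openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/home/user/iq_utf8.r", O_RDONLY) = 4
openat(AT_FDCWD, "/home/user/missing.cfg", O_RDONLY) = -1 ENOENT (No such file or directory)
connect(5, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("192.0.2.10")}, 16) = 0
'''

# Path argument and return value of open/openat calls.
OPEN_RE = re.compile(r'open(?:at)?\((?:AT_FDCWD, )?"([^"]+)".*?\)\s+=\s+(-?\d+)')
# Port number of IPv4 connect calls.
PORT_RE = re.compile(r'sin_port=htons\((\d+)\)')


def accessed_resources(trace):
    """Return (files opened successfully, ports connected to)."""
    files, ports = set(), set()
    for line in trace.splitlines():
        opened = OPEN_RE.search(line)
        if opened and int(opened.group(2)) >= 0:  # negative return value = failed call
            files.add(opened.group(1))
        port = PORT_RE.search(line)
        if port:
            ports.add(int(port.group(1)))
    return files, ports


files, ports = accessed_resources(SAMPLE_TRACE)
print(sorted(files), sorted(ports))
```

Such a log complements the static view: it shows what was actually used at run time, including resources no script mentions explicitly.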

Page 31: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Context Model – Dynamic Analysis

Taverna Workflow

Page 32: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

VFramework

Are these processes the same?

[Figure: Original environment, Repository, Redeployment environment; steps: Preserve, Redeploy]

Page 33: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

VFramework

Page 37: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

VFramework

#!/bin/bash

# fetch data
java -jar GestBarragensWSClientIQData.jar
unzip -o IQData.zip

# fix encoding
#iconv -f LATIN1 -t UTF-8 iq.r > iq_utf8.r

# generate references
R --vanilla < iq_utf8.r > IQout.txt

# create pdf
pdflatex iq.tex
pdflatex iq.tex

ADDED

NOT USED
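One possible reading of the ADDED / NOT USED labels is a comparison of the resources touched in the original run and in the redeployed run. The sketch below assumes that reading; the function and the file names in both sets are hypothetical, not captured data.

```python
def compare_runs(original, redeployed):
    """Diff the resource sets of two runs of the same process."""
    return {
        "added": sorted(redeployed - original),      # only seen after redeployment
        "not_used": sorted(original - redeployed),   # seen originally, missing now
        "common": sorted(original & redeployed),
    }


# Hypothetical resource sets, e.g. as produced by a system-call trace of each run.
original = {"IQData.zip", "iq.r", "iq_utf8.r", "iq.tex", "IQout.txt"}
redeployed = {"IQData.zip", "iq_utf8.r", "iq.tex", "IQout.txt", "iq.log"}

report = compare_runs(original, redeployed)
print(report["added"], report["not_used"])
```

Empty "added" and "not_used" lists are a necessary, not a sufficient, sign that the two processes are the same; the outputs still need to be compared.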

Page 38: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Reproducibility – solved! (?)

Provide source code, parameters, data, …

Wrap it up in a container/virtual machine, …

Provide context information

Encourage reproducibility studies beyond re-running

Use it to establish trust in your research & gain new insights

(Automatically) capture process execution context

Verify re-executions

done?

LXC

Page 39: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Outline

What are the challenges in reproducibility?

What do we gain by aiming for reproducibility?

How to address the challenges of complex processes?

How to deal with “Big Data”?

Summary

Page 40: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

RDA WG Data Citation

Research Data Alliance WG on Data Citation: Making Dynamic Data Citeable

WG endorsed in March 2014

- Concentrating on the problems of large, dynamic (changing) datasets

- Focus: identification of data! Not: PID systems, metadata, citation string, attribution, …

- Liaise with other WGs and initiatives on data citation (CODATA, DataCite, Force11, …)

- https://rd-alliance.org/working-groups/data-citation-wg.html

Page 41: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Data Citation – Output

14 Recommendations grouped into 4 phases:

- Preparing data and query store

- Persistently identifying specific data sets

- Resolving PIDs

- Upon modifications to the data infrastructure

2-page flyer: https://rd-alliance.org/recommendations-working-group-data-citation-revision-oct-20-2015.html

More detailed report: IEEE TCDL 2016
http://www.ieee-tcdl.org/Bulletin/v12n1/papers/IEEE-TCDL-DC-2016_paper_1.pdf
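A minimal sketch of the first three phases, under stated assumptions: data is versioned rather than overwritten, a cited subset is stored as a timestamped query with a result hash, and resolving the identifier re-executes the query against the data as it was. Table names, the PID string and the integer timestamps are hypothetical; the recommendations themselves are in the flyer and report linked above.

```python
import hashlib
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE readings (id INTEGER, value REAL, valid_from INTEGER, valid_to INTEGER);
CREATE TABLE query_store (pid TEXT PRIMARY KEY, sql TEXT, ts INTEGER, result_hash TEXT);
""")

FOREVER = 2**31  # sentinel: this version is still current

# Subset definition with a stable sort order, evaluated as of timestamp :ts.
QUERY = ("SELECT id, value FROM readings "
         "WHERE valid_from <= :ts AND :ts < valid_to ORDER BY id")


def insert(id_, value, now):
    db.execute("INSERT INTO readings VALUES (?, ?, ?, ?)", (id_, value, now, FOREVER))


def update(id_, value, now):
    # Close the old version instead of overwriting it, then add the new one.
    db.execute("UPDATE readings SET valid_to = ? WHERE id = ? AND valid_to = ?",
               (now, id_, FOREVER))
    insert(id_, value, now)


def run_as_of(sql, ts):
    rows = db.execute(sql, {"ts": ts}).fetchall()
    return rows, hashlib.sha256(repr(rows).encode()).hexdigest()


def cite(pid, sql, ts):
    """Store query, timestamp and result hash under a persistent identifier."""
    rows, digest = run_as_of(sql, ts)
    db.execute("INSERT INTO query_store VALUES (?, ?, ?, ?)", (pid, sql, ts, digest))
    return rows


def resolve(pid):
    """Re-execute the stored query and check the result against the stored hash."""
    sql, ts, stored = db.execute(
        "SELECT sql, ts, result_hash FROM query_store WHERE pid = ?", (pid,)).fetchone()
    rows, digest = run_as_of(sql, ts)
    return rows, digest == stored


insert(1, 10.0, now=1)
insert(2, 20.0, now=1)
cited = cite("pid:example-1", QUERY, ts=2)
update(2, 99.0, now=3)                 # the data changes after the citation
rows, verified = resolve("pid:example-1")
current, _ = run_as_of(QUERY, ts=3)
print(rows, verified, current)
```

The cited subset stays retrievable and verifiable although the underlying table has moved on, without storing a copy of the subset itself.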

Detailed presentation on

Tuesday, Session 9,

12:00-13:30

Page 43: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

3 Take-Away Messages

Message 1

Aim at achieving reproducibility at different levels

- Re-run, ask others to re-run

- Re-implement

- Port to different platforms

- Test on different data, vary parameters (and report!)

If something is not reproducible -> investigate! (you might be onto something!)

Encourage reproducibility studies!

Page 44: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

3 Take-Away Messages

Message 2

Aim for better procedures and documentation

Document the research process, environment, interim results, … (preferably automatically, 80:20, …)

The process is part of the data (and vice versa)

Source: xkcd

Pieter Bruegel the Elder: De Alchemist (British Museum, London)

Research Objects, Context Models, VFramework
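A minimal sketch of documenting the environment automatically, assuming a Python-based experiment. It records only the interpreter, the platform and the installed packages; `environment_snapshot` is a hypothetical helper, and a full context model covers far more (services, hardware, licences).

```python
import json
import platform
import sys
from importlib import metadata


def environment_snapshot():
    """Collect interpreter, platform and installed package versions."""
    packages = sorted(
        f"{dist.metadata.get('Name')}=={dist.version}"
        for dist in metadata.distributions()
        if dist.metadata.get("Name")  # skip distributions with broken metadata
    )
    return {
        "python": sys.version,
        "platform": platform.platform(),
        "machine": platform.machine(),
        "packages": packages,
    }


snapshot = environment_snapshot()
# Store this next to the results of every run, e.g. as environment.json.
print(json.dumps(snapshot, indent=2)[:300])
```

Writing the snapshot as part of every run costs a few lines and covers much of the routine documentation effort.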

Page 45: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

3 Take-Away Messages

Message 3

Aim for proper (research) data management (not just in academia!)

Data Management Plans, Research Infrastructure Services

Source: http://www.phdcomics.com/comics.php?f=1323

RDA WGDC: Dynamic Data Citation

Detailed presentation on

Tuesday, Session 9,

12:00-13:30

Page 46: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Summary

Trustworthy and efficient e-Science

Need to move beyond preserving code + data

Need to move beyond the focus on description

Capture process and entire execution context

Precisely identify data used in process

Verification of re-execution

Data and process re-use as basis for data-driven science

- evidence

- investment

- efficiency

Trust!!

Page 47: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Summary

Preaching and eating…

Do we do all this in our lab for our experiments?

No! (not yet?)

Researchers (also in CS) need assistance

Institutions and Research Infrastructures

… and some research on open questions on how to best do all of this (but mind the infamous 80:20 rule)

Page 48: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Summary

C. Glenn Begley, Alastair M. Buchan, Ulrich Dirnagl: Robust research: Institutions must do their part for reproducibility, Nature 525(7567), Sep 3 2015. Illustration by David Parkins.
http://www.nature.com/news/robust-research-institutions-must-do-their-part-for-reproducibility-1.18259?WT.mc_id=SFB_NNEWS_1508_RHBox

Page 49: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Acknowledgements

Johannes Binder, Rudolf Mayer, Tomasz Miksa, Stefan Pröll, Stephan Strodl, Marco Unterberger

TIMBUS

SBA: Secure Business Austria

RDA: Research Data Alliance WGDC

Page 50: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

References

Juliana Freire, Norbert Fuhr and Andreas Rauber. Reproducibility of Data-Oriented Experiments in eScience. Dagstuhl Reports, 6(1), 2016.

Andreas Rauber, Ari Asmi, Dieter van Uytvanck and Stefan Proell. Identification of Reproducible Subsets for Data Citation, Sharing and Re-Use. Bulletin of IEEE Technical Committee on Digital Libraries (TCDL), vol. 12, 2016.

Andreas Rauber, Tomasz Miksa, Rudolf Mayer and Stefan Proell. Repeatability and Re-Usability in Scientific Processes: Process Context, Data Identification and Verification. In Proceedings of the 17th International Conference on Data Analytics and Management in Data Intensive Domains (DAMDID), 2015.

Tomasz Miksa, Rudolf Mayer and Andreas Rauber. Ensuring sustainability of web services dependent processes. International Journal of Computational Science and Engineering (IJCSE), Vol. 10, No. 1/2, pp. 70-81, 2015.

Rudolf Mayer and Andreas Rauber. A Quantitative Study on the Re-executability of Publicly Shared Scientific Workflows. 11th IEEE Intl. Conference on e-Science, 2015.

Rudolf Mayer, Tomasz Miksa and Andreas Rauber. Ontologies for describing the context of scientific experiment processes. 10th IEEE Intl. Conference on e-Science, 2014.

Tomasz Miksa, Stefan Proell, Rudolf Mayer, Stephan Strodl, Ricardo Vieira, Jose Barateiro and Andreas Rauber. Framework for verification of preserved and redeployed processes. 10th International Conference on Preservation of Digital Objects (IPRES2013), 2013.

Tomasz Miksa, Stephan Strodl and Andreas Rauber. Process Management Plans. International Journal of Digital Curation, Vol. 9, No. 1, pp. 83-97, 2014.

Page 51: Reproducibility challenges in computational settings: what are they, why should we address them, and how?

Thank you!

http://www.ifs.tuwien.ac.at/imp