Making computations reproducible, Tokyo.SciPy #6, 2014-08-02


Abstract

Scientific computations tend to involve a number of experiments under different conditions.

It is important to manage computational experiments so that their results are reproducible.

In this talk we introduce 3 rules to make computations reproducible.


Outline

1. Introduction

2. Discipline
   - Three elements
   - Three rules
   - Complements

3. Practice

4. Summary


1. Introduction


Background

A lab notebook is indispensable for experimental research in natural science. One of its roles is to make experiments reproducible.

Why not for computational research?

Lack of reproducibility means lack of reliability.


Common problems

Common problems in computational experiments:
- I lost track of which result was obtained under which condition.
- I unintentionally overwrote previous results.
- I used inconsistent data and got invalid results.
- ...

Many of these problems are caused by inadequate management of experiments.


Goal

To archive all results of each experiment, together with all information required to reproduce them, so that we can retrieve and restore them easily, in a systematic way and at little cost.


Note

What is introduced in this talk is not an established methodology but a collection of field techniques. The same goes for the terminology.

In this talk, we will not deal with:
- distributed computation
- documentation or testing
- publishing a paper
- releasing OSS


2. Discipline


Three elements

We distinguish the following elements, which affect the reproducibility of computations:

Algorithm: an algorithm coded into a program
  (implemented by yourself, calling an external library, ...)

Data: input and output data, intermediate data to reuse

Environment: software and hardware environment
  (external library, server configuration, platform, ...)


Three rules

1. Give an Identifier to each element and archive them.
2. Record a machine-readable Recipe with human-readable comments.
3. Make every manipulation Mechanized.


Identifier: Give an Identifier to each element and archive them.

Algorithm
- use a version control system
Data
- give a name to distinguish the kind of data
- give a version to distinguish the concrete content
Environment
- find information about the platform
- find the version (and optionally the build parameters) of each library

Keep in mind to track all elements during the whole process:
- every piece of code under version control
- no data without an identifier
- no temporary environment


Recipe: Record a machine-readable Recipe with human-readable comments.

A recipe should include all information required to reproduce the results of an experiment (other than the contents of Algorithm, Data and Environment, which are stored elsewhere).

A recipe should be machine-readable so that the experiment can be re-conducted.

A recipe should include a human-readable comment on the purpose and/or meaning of the experiment.

A recipe should be generated automatically by tracking experiments.


Typically a recipe includes the following information:
- in which order
- which data is processed
- by which algorithm
- under which environment
- with which parameters

Typically a recipe consists of the following:
- a script file to run the whole process
- a configuration file which specifies parameters and identifiers
- a text file of comments


Mechanize: Make every manipulation Mechanized.

Run the whole process of an experiment by a single operation.
- No manual manipulation of data.
- No manual compilation of source code.
- Automated provisioning of an environment.


Complement: Tentative experiments

Too large an archive detracts from the substantive significance of reproducibility.

Tentative experiments with ephemeral results do not necessarily need to be recorded:
- tests of code
- trials on tiny data
- ...

If there is a possibility that a result might be used, referred to, or looked up afterward, it should be recorded.


Complement: Reuse of intermediate data

To reuse intermediate data, use an identifier.
- Explicitly specify the intermediate data to reuse by its identifier.
- Automatically detect available intermediate data based on dependencies.
- ...
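
One way to detect reusable intermediate data is to derive the identifier from what the data depends on. The sketch below is only an illustration: the function name cached, the cache directory layout and the use of pickle are assumptions, not part of the talk. The identifier here covers the step name and parameters only; a fuller version would also include the code revision and the identifiers of the input data.

```python
import hashlib
import json
import os
import pickle


def cached(cache_dir, step_name, params, compute):
    """Reuse an intermediate result keyed by step name and parameters.

    `cached` is a hypothetical helper; `params` must be JSON-serializable
    and is passed to `compute` as keyword arguments.
    """
    # The identifier is a hash of everything this sketch treats as a dependency.
    key = json.dumps({"step": step_name, "params": params}, sort_keys=True)
    identifier = hashlib.sha256(key.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, identifier + ".pkl")

    if os.path.exists(path):
        # Intermediate data with this identifier already exists: reuse it.
        with open(path, "rb") as f:
            return pickle.load(f), identifier

    result = compute(**params)
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result, identifier
```

Calling cached twice with the same step name and parameters runs compute only once; changing a parameter yields a different identifier and a fresh computation.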


3. Practice


Identify Algorithm

Use a version control system such as Git or Mercurial to manage source code.

It then becomes easy to record the revision and any uncommitted changes at each experiment.

(Learn the internals of the VCS if you need more flexible management.)
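
Recording the revision and uncommitted changes could look roughly like the following for Git. This is a minimal sketch, not the speaker's implementation: the function name vcs_state and the injectable run parameter are assumptions made so that the example can be exercised without a repository. Note that git diff HEAD does not cover untracked files.

```python
import subprocess


def vcs_state(repo_dir=".", run=subprocess.check_output):
    """Return the current Git revision and any uncommitted changes.

    `vcs_state` is a hypothetical helper. `run` defaults to
    subprocess.check_output and can be replaced for testing.
    """
    def git(*args):
        # `git -C <dir>` runs the command as if started in repo_dir.
        return run(["git", "-C", repo_dir, *args]).decode("utf-8")

    revision = git("rev-parse", "HEAD").strip()
    # Staged and unstaged changes to tracked files, relative to HEAD.
    diff = git("diff", "HEAD")
    return {"revision": revision, "dirty": bool(diff.strip()), "diff": diff}
```

Storing the returned dictionary next to the results lets a later reader check out the revision and re-apply the diff.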


Identify Data

File
Give appropriate names to directories and files; a resolved absolute path can then be used as an identifier.

If you cannot think of a meaningful name, use a time-stamp or a hash.

DB or other API
A pair of a URI and a query whose results are constant can be used as an identifier.

If the API behaves randomly, keep the results at hand (with a time-stamp).
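
For files, the two kinds of identifier mentioned above (a resolved absolute path as the name, a hash as the version) can be computed with the standard library. The helper name data_identifier and the choice of SHA-256 are assumptions for illustration.

```python
import hashlib
from pathlib import Path


def data_identifier(path, chunk_size=1 << 20):
    """Return (resolved absolute path, SHA-256 hex digest of the content).

    `data_identifier` is a hypothetical helper; the path names the data,
    the digest distinguishes its concrete content.
    """
    resolved = Path(path).resolve()
    digest = hashlib.sha256()
    with resolved.open("rb") as f:
        # Read in chunks so that large data files do not fill memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return str(resolved), digest.hexdigest()
```

If the file is later overwritten, the path stays the same but the digest changes, which makes accidental overwrites visible.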


Identify Environment

Python package
Use PyPA tools (virtualenv, setuptools and pip) or Conda/enstaller.

Library
Use HashDist. An alternative is to use CDE.

Platform
Use platform, a module in Python's standard library.

Server configuration
Use Ansible or another configuration management tool, and Vagrant or another provisioning tool.
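
As a rough sketch of the platform and package parts, the snippet below collects a snapshot with the platform module. The function name environment_snapshot is an assumption, and so is the use of importlib.metadata (available in Python 3.8 and later) to look up installed package versions; the talk itself names only the platform module.

```python
import platform
import sys
from importlib import metadata


def environment_snapshot(packages=()):
    """Collect platform information and versions of the named packages.

    `environment_snapshot` is a hypothetical helper; `packages` is a list
    of distribution names chosen by the caller.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "platform": platform.platform(),
        "machine": platform.machine(),
        "python": platform.python_version(),
        "executable": sys.executable,
        "packages": versions,
    }
```

The result is a plain dictionary, so it can be written into the recipe alongside the parameters.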


HashDist

A tool for developing, building and managing software stacks.

A software stack is described in YAML.
We can create, copy, move and remove software stacks.

$ git checkout stack.yaml
$ hit build stack.yaml


Recipe: configuration file

The configuration in a recipe should be in a machine-readable format.

Use the ConfigParser, PyYAML or json module to read/write parameters in INI, YAML or JSON format.

A recipe should include the following:
- command-line arguments
- environment variables
- random seed
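
The items above can be written to and read back from a JSON file with the json module. This is a minimal sketch under assumptions: the function names write_config and load_config and the key names are made up for illustration, and only Python's random module is seeded.

```python
import json
import os
import random
import sys


def write_config(path, params, seed, env_names=()):
    """Write parameters, command line, selected environment variables and seed."""
    config = {
        "argv": sys.argv,
        "env": {name: os.environ.get(name) for name in env_names},
        "seed": seed,
        "params": params,
    }
    with open(path, "w") as f:
        json.dump(config, f, indent=2, sort_keys=True)
    return config


def load_config(path):
    """Read a configuration back and restore the recorded random seed."""
    with open(path) as f:
        config = json.load(f)
    random.seed(config["seed"])  # same seed, same random sequence
    return config
```

Loading the same file twice puts the random generator into the same state, so a re-run draws the same random numbers.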


Recipe: script file

The script in a recipe should run the whole process by a single operation.

There are several ways to realize such a script:
- use a build tool (such as Autotools, SCons, or maf)
- use a job-flow tool (such as Ruffus or Luigi)
- write a small script by hand (e.g. run.py)
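
The hand-written option could be as small as the following run.py. The step functions load, process and summarize and their toy data are placeholders invented for this sketch; the point is only that one call runs every step in a fixed, recorded order.

```python
def run_pipeline(steps, data):
    """Apply each (name, function) step in order and log what was run."""
    log = []
    for name, step in steps:
        data = step(data)
        log.append(name)
    return data, log


# Placeholder steps; a real experiment would read files and call real code.
def load(_):
    return [3, 1, 2]


def process(values):
    return sorted(values)


def summarize(values):
    return {"min": values[0], "max": values[-1]}


STEPS = [("load", load), ("process", process), ("summarize", summarize)]

if __name__ == "__main__":
    # A single operation: `python run.py` runs the whole process.
    result, log = run_pipeline(STEPS, None)
    print(log, result)
```

Because the order of steps lives in STEPS rather than in someone's memory, the script itself documents in which order the data was processed.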


maf

“maf is a waf extension for writing computational experiments.”

Conduct computational experiments as build processes.
Focus on machine learning:
- list configurations
- run programs with each configuration
- aggregate and visualize their results


Recipe: automatic generation

Do it yourself, or use Sumatra.

“Sumatra: automated tracking of scientific computations”
- recording information about experiments, linking to data files
- command line & web interface
- integration with LaTeX/Sphinx

$ smt run --executable=python --main=main.py \
      conf.param input.data
$ smt comment "..."
$ smt info
$ smt repeat
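
For the "do it yourself" route, a very small tracker might wrap each run and append a record to a log file. This sketch does not use or imitate Sumatra's API; the function name tracked_run and the JSON-lines log format are assumptions.

```python
import datetime
import json
import subprocess


def tracked_run(command, comment, log_path):
    """Run a command and append a machine-readable record of the run.

    `tracked_run` is a hypothetical helper; `command` is a list of
    strings as accepted by subprocess.run.
    """
    started = datetime.datetime.now(datetime.timezone.utc).isoformat()
    completed = subprocess.run(command, capture_output=True, text=True)
    record = {
        "started": started,
        "command": command,
        "returncode": completed.returncode,
        "stdout": completed.stdout,
        "comment": comment,  # the human-readable part of the recipe
    }
    # One JSON object per line keeps the log easy to append to and parse.
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

A fuller record would also carry the code revision, data identifiers and environment information discussed earlier.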


4. Summary


Summary

We have introduced 3 rules to manage computational experiments so that their results are reproducible.

However, our method is just a makeshift patchwork of field techniques.

We need a tool that manages experiments in a more integrated, systematic and sophisticated manner for reproducible computations.


Links

PyPA      http://python-packaging-user-guide.readthedocs.org
Conda     http://conda.pydata.org
enstaller https://github.com/enthought/enstaller
HashDist  http://hashdist.github.io
CDE       http://www.pgbovine.net/cde.html
Ansible   http://www.ansible.com
Vagrant   http://www.vagrantup.com
SCons     http://www.scons.org
maf       https://github.com/pfi/maf
Ruffus    http://www.ruffus.org.uk
Luigi     https://github.com/spotify/luigi
Sumatra   http://neuralensemble.org/sumatra




Revision: f2b0e97 (2014-08-03)
