Making computations reproducible Tokyo.SciPy #6 2014-08-02 1 / 31

Making Computations Reproducible


  • Making computations reproducible Tokyo.SciPy #6 2014-08-02 1 / 31
  • Abstract: Scientific computations tend to involve a number of experiments under different conditions. It is important to manage computational experiments so that their results are reproducible. In this talk we introduce 3 rules to make computations reproducible. 2 / 31
  • Outline: 1. Introduction; 2. Discipline (Three elements, Three rules, Complements); 3. Practice; 4. Summary 3 / 31
  • 1. Introduction 4 / 31
  • Background: A lab notebook is indispensable for experimental research in natural science. One of its roles is to make experiments reproducible. Why not for computational research? Lack of reproducibility means lack of reliability. 5 / 31
  • Common problems: Common problems in computational experiments: I lost track of which result was obtained under which condition. I unintentionally overwrote previous results. I used inconsistent data and got invalid results. ... Many of these problems are caused by inappropriate management of experiments. 6 / 31
  • Goal: To archive all results of each experiment, together with all information required to reproduce them, so that we can retrieve and restore them easily, systematically and at almost no cost. 7 / 31
  • Note: What is introduced in this talk is not an established methodology but a collection of field techniques. The same goes for the wording. In this talk, we will not deal with: distributed computation; documentation or testing; publishing of a paper; release of OSS. 8 / 31
  • 2. Discipline 9 / 31
  • Three elements: We distinguish the following elements, which affect the reproducibility of computations. Algorithm: an algorithm coded into a program (implemented by yourself, calling an external library, ...). Data: input and output data, intermediate data to reuse. Environment: software and hardware environment (external library, server configuration, platform, ...). 10 / 31
  • Three rules: (1) Give an Identifier to each element and archive them. (2) Record a machine-readable Recipe with human-readable comments. (3) Make every manipulation Mechanized. 11 / 31
  • Identifier: Give an Identifier to each element and archive them. Algorithm: use a version control system. Data: give a name to distinguish the kind of data; give a version to distinguish the concrete content. Environment: find information about the platform; find the version (and optionally the build parameters) of each library. Keep in mind to track all elements during the whole process: all code under version control; no data without an identifier; no temporary environment. 12 / 31
  • Recipe: Record a machine-readable Recipe with human-readable comments. A recipe should include all information required to reproduce the results of an experiment (other than the contents of Algorithm, Data and Environment, which are stored elsewhere). A recipe should be machine-readable so that the experiment can be re-conducted. A recipe should include a human-readable comment on the purpose and/or meaning of the experiment. A recipe should be generated automatically by tracking experiments. 13 / 31
  • Typically a recipe includes the following information: in which order, which data is processed, by which algorithm, under which environment, with which parameters. Typically a recipe consists of the following: a script file to run the whole process; a configuration file which specifies parameters and identifiers; a text file of comments. 14 / 31
  • Mechanize: Make every manipulation Mechanized. Run the whole process of an experiment by a single operation. No manual manipulation of data. No manual compilation of source code. Automated provisioning of the environment. 15 / 31
  • Complement: Tentative experiments. Too large an archive detracts from the substantive significance of reproducibility. Tentative experiments with ephemeral results do not necessarily need to be recorded: tests of code; trials on tiny data; ... If there is a possibility of getting a result that might be used, referred to or looked up afterward, then it should be recorded. 16 / 31
  • Complement: Reuse of intermediate data. In order to reuse intermediate data, utilize an identifier: explicitly specify the intermediate data to reuse by its identifier; automatically detect available intermediate data based on dependencies; ... 17 / 31
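
One way to read the "reuse by identifier" idea is as a small on-disk cache. The sketch below is only an illustration, not the method from the talk: the function names `make_identifier` and `run_or_reuse`, the file layout and the choice of SHA-256 are all assumptions. The identifier is derived from the step name, its parameters and the identifiers of its inputs, so a matching file can be detected and reused automatically.

```python
import hashlib
import json
import pickle
from pathlib import Path


def make_identifier(step_name, params, input_ids):
    """Derive an identifier from everything the intermediate data depends on."""
    payload = json.dumps(
        {"step": step_name, "params": params, "inputs": list(input_ids)},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def run_or_reuse(cache_dir, step_name, params, input_ids, compute):
    """Return (identifier, result, reused); compute() is called only on a miss."""
    identifier = make_identifier(step_name, params, input_ids)
    path = Path(cache_dir) / f"{step_name}-{identifier}.pkl"
    if path.exists():
        # Reuse: data with the same identifier already exists.
        # (pickle is used for brevity; only load files you wrote yourself.)
        with path.open("rb") as f:
            return identifier, pickle.load(f), True
    result = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(result, f)
    return identifier, result, False
```

Changing a parameter or an input identifier changes the derived identifier, so stale intermediate data is never picked up by accident.
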
  • 3. Practice 18 / 31
  • Identify Algorithm: Use a version control system such as Git or Mercurial to manage source code. It is easy to record the revision and the uncommitted changes at each experiment. (Learn the internals of the VCS if you need more flexible management.) 19 / 31
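
As a minimal sketch of recording "a revision and uncommitted changes", assuming Git is the VCS and is on the PATH: the helper names `_git` and `algorithm_identifier` are hypothetical, while `git rev-parse HEAD` and `git diff HEAD` are standard Git commands. Note that `git diff HEAD` covers tracked files only, so untracked files would have to be handled separately.

```python
import subprocess


def _git(args, cwd):
    """Run a git command in cwd and return its standard output as text."""
    return subprocess.run(
        ["git", *args], cwd=cwd, check=True, capture_output=True, text=True
    ).stdout


def algorithm_identifier(repo_dir=".", run=_git):
    """Return the current revision plus uncommitted changes as a patch."""
    revision = run(["rev-parse", "HEAD"], repo_dir).strip()
    # Staged and unstaged changes to tracked files, relative to HEAD.
    patch = run(["diff", "HEAD"], repo_dir)
    return {"revision": revision, "dirty": bool(patch), "patch": patch}
```

Storing the returned dictionary next to the results of each run is enough to restore the exact source tree later: check out `revision`, then apply `patch`.
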
  • Identify Data. File: give appropriate names to directories and files; a resolved absolute path can then be used as an identifier. If no meaningful name comes to mind, use a time-stamp or a hash. DB or other API: a pair of a URI and a query whose results are constant can be used as an identifier. If the API behaves randomly, keep the results at hand (with a time-stamp). 20 / 31
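
The three kinds of file identifier mentioned above (resolved absolute path, hash, time-stamp) can each be produced with the standard library. This is a sketch under assumptions: the function names are hypothetical, and SHA-256 and the compact UTC time-stamp format are illustrative choices, not something the talk prescribes.

```python
import hashlib
import os
from datetime import datetime, timezone


def path_identifier(path):
    """Resolved absolute path (symlinks and relative parts removed)."""
    return os.path.realpath(path)


def content_hash(path, chunk_size=1 << 20):
    """SHA-256 of the file content, read in chunks to handle large files."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def timestamp_identifier(now=None):
    """Compact UTC time-stamp, e.g. for naming a snapshot of API results."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now.astimezone(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
```

A path names the data kind, while a content hash distinguishes concrete versions even when a file is overwritten in place.
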
  • Identify Environment. Python package: use PyPA tools (virtualenv, setuptools and pip) or Conda/enstaller. Library: use HashDist; utilizing CDE is an alternative. Platform: use platform, a module in Python's standard library. Server configuration: use Ansible or another configuration management tool, and Vagrant or another provisioning tool. 21 / 31
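
For the platform and Python-package parts, a minimal sketch using only the standard library might look as follows. The function name `environment_info` and the dictionary layout are assumptions; `importlib.metadata` requires Python 3.8 or later, which is newer than this talk.

```python
import platform
import sys
from importlib import metadata


def environment_info(packages=()):
    """Collect platform details and the installed versions of given packages."""
    info = {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "executable": sys.executable,
        "packages": {},
    }
    for name in packages:
        try:
            info["packages"][name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Record missing packages explicitly rather than skipping them.
            info["packages"][name] = None
    return info
```

The result is a plain dictionary, so it can be written into the recipe alongside the parameters.
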
  • HashDist: A tool for developing, building and managing software stacks. A software stack is described in YAML. We can create, copy, move and remove software stacks. $ git checkout stack.yml $ hit build stack.yaml 22 / 31
  • Recipe: configuration file. The configuration in a recipe should be in a machine-readable format. Use the ConfigParser, PyYAML or json module to read/write parameters in INI, YAML or JSON format. A recipe should include the following: command line arguments; environment variables; random seed. 23 / 31
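
A sketch of such a configuration file using the `json` module: the function names, the key names (`argv`, `env`, `seed`, `params`) and the decision to re-seed Python's `random` module on load are all illustrative assumptions.

```python
import json
import os
import random
import sys


def write_recipe_config(path, params, seed, env_names=(), argv=None):
    """Write parameters, command line, selected env vars and seed as JSON."""
    config = {
        "argv": list(sys.argv if argv is None else argv),
        "env": {name: os.environ.get(name) for name in env_names},
        "seed": seed,
        "params": params,
    }
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=2, sort_keys=True)
    return config


def load_recipe_config(path):
    """Read the configuration back and restore the random seed."""
    with open(path, encoding="utf-8") as f:
        config = json.load(f)
    random.seed(config["seed"])
    return config
```

Only the environment variables named in `env_names` are stored, which avoids dumping unrelated or sensitive variables into the archive.
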
  • Recipe: script file. The script in a recipe should run the whole process by a single operation. There are several alternatives to realize such a script: utilize a build tool (such as Autotools, SCons, and maf); utilize a job-flow tool (such as Ruffus, Luigi); write a small script by hand (e.g. run.py). 24 / 31
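
The hand-written option can be very small. The sketch below is a toy `run.py`, not taken from the talk: the experiment (drawing Gaussian samples and averaging them), the function names and the configuration keys `seed` and `n_samples` are all hypothetical placeholders for a real pipeline.

```python
import json
import random


def generate_data(config):
    """Stand-in for data preparation; seeded so that reruns are identical."""
    rng = random.Random(config["seed"])
    return [rng.gauss(0.0, 1.0) for _ in range(config["n_samples"])]


def compute_mean(data):
    """Stand-in for the actual algorithm."""
    return sum(data) / len(data)


def run(config):
    """Run the whole experiment, from input data to result, in one call."""
    data = generate_data(config)
    return {"config": config, "mean": compute_mean(data)}


if __name__ == "__main__":
    # Single operation: `python run.py` executes every step in order.
    print(json.dumps(run({"seed": 0, "n_samples": 100}), sort_keys=True))
```

Because every step is driven from `run`, nothing depends on manual manipulation between steps, and the same configuration always yields the same result.
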
  • maf: maf is a waf extension for writing computational experiments. Conduct computational experiments as build processes. Focus on machine learning: list configurations; run programs with each configuration; aggregate and visualize their results. 25 / 31
  • Recipe: automatic generation. Do it yourself, or use Sumatra. Sumatra: automated tracking of scientific computations; recording information about experiments, linking to data files; command line & web interface; integration with LaTeX/Sphinx. $ smt run --executable=python --main=main.py conf.param input.data $ smt comment "..." $ smt info $ smt repeat 26 / 31
  • 4. Summary 27 / 31
  • Summary: We have introduced 3 rules to manage computational experiments so that their results are reproducible. However, our method is just a makeshift patchwork of field techniques. We need a tool to manage experiments in a more integrated, systematic and sophisticated manner for reproducible computations. 28 / 31
  • Links: PyPA http://python-packaging-user-guide.readthedocs.org; Conda http://conda.pydata.org; enstaller https://github.com/enthought/enstaller; HashDist http://hashdist.github.io; CDE http://www.pgbovine.net/cde.html; Ansible http://www.ansible.com; Vagrant http://www.vagrantup.com; SCons http://www.scons.org; maf https://github.com/pfi/maf; Ruffus http://www.ruffus.org.uk; Luigi https://github.com/spotify/luigi; Sumatra http://neuralensemble.org/sumatra 29 / 31
  • References [1] G. K. Sandve, A. Nekrutenko, J. Taylor, E. Hovig, Ten Simple Rules for Reproducible Computational Research, PLoS Comput. Biol. 9(10): e1003285 (2013). doi:10.1371/journal.pcbi.1003285 [2] V. Stodden, F. Leisch, R. Peng, Implementing Reproducible Research, Open Science Framework (2014). osf.io/s9tya 30 / 31
  • Revision: f2b0e97 (2014-08-03) 31 / 31