
How can we be more reproducible?

Marie-Josée Cros

MIAT, INRA Toulouse

Reproducibility workshop, Journées Bioinformatique INRA, 22 March 2016

Context

● Scientific computing is central to research

● Scandals: Climategate (2009), in economics (2013) ...

● A high rate of retractions among publications

● In practice, very little reproducibility (47 of 53 cancer publications could not be reproduced, Begley and Ellis, Nature, 2012; 80% not reproducible in the first study, Collberg, Christian et al., Measuring Reproducibility in Computer Systems Research, 2014, 2015, http://reproducibility.cs.arizona.edu)

Movements for a more reproducible, open science

→ better science, sharing, reuse ...

http://reproducibleresearch.net https://okfn.org


Reproducibility

http://researchoject.org

Reproducing the results of a study without having the same materials.


Implementation

Policy

Practices

Tools

Policies

Code: available to the reviewers and readers of the paper

Copyright: copyright ownership and license clearly stated

Citation: credit the creators of code used or adapted in the resulting publications

Credit: software contributions included in scientific assessment

Curation: source code available for the useful lifetime of a publication

sciencecodemanifesto.org

To encourage, recognize, and build supporting resources ...

http://pantonprinciples.org

https://en.wikipedia.org/wiki/Open_science

http://www.software.ac.uk

http://f1000research.com


Practices

http://dx.doi.org/10.6084/m9.figshare.1286826

Practices throughout the research process:

● for managing and analyzing data: apply the rules for managing the data life cycle

● for programming: apply the rules for managing development and distribution

● for disseminating: dissemination platforms such as Open Science Framework, Research Compendia, MLOSS, IPOL, thedatahub …

● for training (« data scientist »): MOOCs, webinars, experience sharing, workshops ...


Data management practices

Data acquisition
● Where possible, store data in nonproprietary software formats
● Keep backups on stable media, and in geographically separated locations
● Have a clear organisation for your data files
● Always maintain effective metadata

Data analysis
● Keep code and data separate
● Use version control
● Test your code
● Split commonly used code off into functions/classes, put these into libraries
● Prioritize code robustness
● Maintain a consistent, repeatable computing environment
● Separate code from configuration
● Share your code

Verification (testing)
● How confident are you that your code is doing what you think it’s doing?
● Automated testing
● Test coverage measurement

Provenance tracking
● The lab notebook

Best practices for data management in neurophysiology https://rrcns.readthedocs.org/en/latest
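Two of the data-analysis practices above, "Separate code from configuration" and "Split commonly used code off into functions", can be illustrated with a minimal Python sketch. The parameter names (`threshold`, `window`), the functions and the sample data are hypothetical, chosen only for illustration; they do not come from the cited guide.

```python
import json
import statistics

# Hypothetical configuration. In practice it would live in its own file
# (for example config.json) next to the code, not inside it.
CONFIG_TEXT = '{"threshold": 0.5, "window": 3}'


def load_config(text):
    """Parse analysis parameters from a JSON string."""
    return json.loads(text)


def smooth(values, window):
    """Moving average over `window` consecutive values."""
    return [
        statistics.fmean(values[i:i + window])
        for i in range(len(values) - window + 1)
    ]


def count_above(values, threshold):
    """Count values strictly above `threshold`."""
    return sum(1 for v in values if v > threshold)


if __name__ == "__main__":
    config = load_config(CONFIG_TEXT)
    data = [0.0, 0.3, 0.9, 1.2, 0.6, 0.0]  # made-up measurements
    smoothed = smooth(data, config["window"])
    print(count_above(smoothed, config["threshold"]))
```

Changing a parameter then means editing the configuration only, so the version-controlled code stays identical between runs.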

● Advertise your data using datacasting tools
● Assign descriptive file names
● Backup your data
● Check data and other outputs for print and web accessibility
● Choose and use standard terminology to enable discovery
● Communicate data quality
● Confirm a match between data and their description in metadata
● Consider the compatibility of the data you are integrating
● Create a data dictionary
● Create and document a data backup policy
● Create, manage, and document your data storage system
● Decide what data to preserve
● Define expected data outcomes and types
● Define roles and assign responsibilities for data management
● Define the data model

DataONE Best practices https://www.dataone.org/all-best-practices
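One possible way to act on "Backup your data" is to verify every copy against the original. The sketch below does this with a SHA-256 checksum; the use of checksums, the function names and the file name are assumptions made for illustration and are not prescribed by the DataONE list.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_with_check(source, backup_dir):
    """Copy `source` into `backup_dir` and verify that both digests match."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / Path(source).name
    shutil.copy2(source, target)
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"backup of {source} is corrupted")
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical descriptive file name and made-up content.
        raw = Path(tmp) / "site_A_temperature_2016-03.csv"
        raw.write_text("date,temp_c\n2016-03-22,11.5\n")
        copy = backup_with_check(raw, Path(tmp) / "backup")
        print(copy.name, sha256_of(copy)[:12])
```

Storing the digest alongside the metadata also gives a simple way to confirm later that a file still matches its description.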


Programming practices

Write programs for people, not computers.
● A program should not require its readers to hold more than a handful of facts in memory at once.
● Make names consistent, distinctive, and meaningful.
● Make code style and formatting consistent.

Let the computer do the work.
● Make the computer repeat tasks.
● Save recent commands in a file for re-use.
● Use a build tool to automate workflows.

Make incremental changes.
● Work in small steps with frequent feedback and course correction.
● Use a version control system.
● Put everything that has been created manually in version control.

Don't repeat yourself (or others).
● Every piece of data must have a single authoritative representation in the system.
● Modularize code rather than copying and pasting.
● Re-use code instead of rewriting it.

Plan for mistakes.
● Add assertions to programs to check their operation.
● Use an off-the-shelf unit testing library.
● Turn bugs into test cases.
● Use a symbolic debugger.

Optimize software only after it works correctly.
● Use a profiler to identify bottlenecks.
● Write code in the highest-level language possible.

Document design and purpose, not mechanics.
● Document interfaces and reasons, not implementations.
● Refactor code in preference to explaining how it works.
● Embed the documentation for a piece of software in that software.

Collaborate.
● Use pre-merge code reviews.
● Use pair programming.
● Use an issue tracking tool.

Rule 1: For Every Result, Keep Track of How It Was Produced
Rule 2: Avoid Manual Data Manipulation Steps
Rule 3: Archive the Exact Versions of All External Programs Used
Rule 4: Version Control All Custom Scripts
Rule 5: Record All Intermediate Results, When Possible in Standardized Formats
Rule 6: For Analyses That Include Randomness, Note Underlying Random Seeds
Rule 7: Always Store Raw Data behind Plots
Rule 8: Generate Hierarchical Analysis Output
Rule 9: Connect Textual Statements to Underlying Results
Rule 10: Provide Public Access to Scripts, Runs, and Results

Best Practices for Scientific Computing http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001745
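The "Plan for mistakes" practices from Best Practices for Scientific Computing (add assertions, use an off-the-shelf unit testing library, turn bugs into test cases) might look like the following minimal sketch. The function `normalize` and the all-zero bug it guards against are hypothetical examples; `unittest` is Python's standard testing library.

```python
import unittest


def normalize(counts):
    """Scale non-negative counts so that they sum to 1."""
    assert all(c >= 0 for c in counts), "counts must be non-negative"
    total = sum(counts)
    if total == 0:
        # Hypothetical bug fix: an all-zero input used to divide by zero.
        return [0.0 for _ in counts]
    result = [c / total for c in counts]
    assert abs(sum(result) - 1.0) < 1e-9, "normalized values must sum to 1"
    return result


class NormalizeTest(unittest.TestCase):
    def test_simple_case(self):
        self.assertEqual(normalize([1, 1, 2]), [0.25, 0.25, 0.5])

    def test_all_zero_regression(self):
        # The former bug, kept as a permanent test case.
        self.assertEqual(normalize([0, 0]), [0.0, 0.0])

    def test_negative_input_is_rejected(self):
        with self.assertRaises(AssertionError):
            normalize([1, -1])


if __name__ == "__main__":
    # exit=False keeps the interpreter alive after the test run.
    unittest.main(argv=["normalize_test"], exit=False)
```

Note that `assert` statements are skipped when Python runs with the `-O` flag, so checks that must always hold are better raised as explicit exceptions.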

Ten Simple Rules for Reproducible Computational Research http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285
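Rules 1, 6 and 7 of the Ten Simple Rules (track how a result was produced, note the random seed, store the raw data behind it) can be combined in a few lines. This is only a sketch under stated assumptions: the "analysis" is a made-up mean of uniform random draws, and the record fields are hypothetical names, not a standard format.

```python
import hashlib
import json
import random


def run_analysis(seed, n_draws=1000):
    """Hypothetical stochastic analysis: mean of n_draws uniform samples."""
    rng = random.Random(seed)  # local generator, fully determined by the seed
    draws = [rng.random() for _ in range(n_draws)]
    return {"mean": sum(draws) / n_draws, "raw": draws}


def provenance_record(seed, n_draws, result):
    """Bundle a result with what is needed to regenerate and check it."""
    raw_bytes = json.dumps(result["raw"]).encode("utf-8")
    return {
        "seed": seed,
        "n_draws": n_draws,
        "mean": result["mean"],
        "raw_sha256": hashlib.sha256(raw_bytes).hexdigest(),
    }


if __name__ == "__main__":
    first = run_analysis(seed=2016)
    second = run_analysis(seed=2016)
    assert first["raw"] == second["raw"]  # same seed, same draws
    print(json.dumps(provenance_record(2016, 1000, first), indent=2))
```

Using a local `random.Random(seed)` instead of the global generator keeps the result independent of any other code that draws random numbers in the same process.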


Tools

A few tools

File version control: Git (git-scm.com) and code repositories: Github.com, Bitbucket.org, software forge (sourcesup.cru.fr) …

Literate programming and electronic lab notebooks: Sweave (http://leisch.userweb.mwn.de/Sweave), knitr (http://yihui.name/knitr), Emacs Org mode (orgmode.org), IPython (ipython.org), Matlab, Mathematica, Sage (sagemath.org) …

Workflow tracking environments: VisTrails.org, Taverna.org.uk, Galaxy (galaxyproject.org), Sumatra (neuralensemble.org/sumatra/), Kepler (kepler-project.org), Pegasus (pegasus.isi.edu) …

Environment capture: virtual machines, Linux packaging (CDE pgbovine.net/cde.html), Docker.com …

Repositories: FigShare.com, Zenodo.org, ResearchCompendia.org, dataverse.org, Dryad (datadryad.org), RunMyCode.org, MyExperiment.org, recomputation.org, Open Science Framework (osf.io) …
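Environment capture is listed above with heavyweight tools (virtual machines, CDE, Docker). As a much lighter complement, and not a replacement for them, a script can at least record the interpreter and package versions it ran with. The sketch below uses only the Python standard library (Python 3.8 or later); the function name and the layout of the snapshot are assumptions made for illustration.

```python
import json
import platform
import sys
from importlib import metadata


def environment_snapshot():
    """Describe the interpreter, the OS and the installed distributions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions with broken metadata
            packages[name] = dist.version
    return {
        "python": sys.version.split()[0],
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


if __name__ == "__main__":
    print(json.dumps(environment_snapshot(), indent=2))
```

Saving this snapshot next to each result, and depositing both in one of the repositories above, gives readers a starting point for rebuilding the environment.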

What should be done?

Publisher: display how open a publication is, and evaluate the implementation as well (not only the result)

Research institute / funder: weight the importance of publications, provide infrastructure

Network: share practices, exchange, train, define standards and formats

Laboratory: encourage intelligible description / transparency / exchange

Individual: move to literate programming, keep an electronic lab notebook, manage versions, test, make data/code/processes accessible

Even though changing one's practices is not always easy, everyone can make their work more reliable and reproducible!

A multi-faceted, multi-stakeholder effort that must be recognized and encouraged. There is no consensus on a best way to improve and measure reproducibility.

Some references

● Reproducibility and Scientific Research. Carole Goble, Open Data Manchester, 2015. http://fr.slideshare.net/carolegoble/open-sciencemcrgoble2015

● 101 Innovations in Scholarly Communication. The changing research workflow. https://figshare.com/articles/101_Innovations_in_Scholarly_Communication_the_Changing_Research_Workflow/1286826

● DataONE Best Practices. https://www.dataone.org/all-best-practices

● The FAIR (Findable, Accessible, Interoperable, Re-usable) data principles. https://www.force11.org/group/fairgroup/fairprinciples

● Best practices for data management in neurophysiology. http://rrcns.readthedocs.org/en/latest/index.html#best-practices-for-data-management-in-neurophysiology

● Best Practices for Scientific Computing. Wilson G, Aruliah DA, Brown CT, Chue Hong NP, Davis M, Guy RT, Haddock SH, Huff KD, Mitchell IM, Plumbley MD, Waugh B, White EP, Wilson P. PLoS Biology 12(1), 2014. http://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1001745

● Ten Simple Rules for Reproducible Computational Research. Sandve GK, Nekrutenko A, Taylor J, Hovig E. PLoS Computational Biology 9(10), 2013. http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1003285

● Données de la recherche (research data), IST INRA. https://wiki.inra.fr/wiki/donneesrechercheist

● MOOC Coursera Reproducible Research. https://www.coursera.org/learn/reproducible-research

● Carpentry sites. http://software-carpentry.org http://www.datacarpentry.org

● Webinars on Reproducible Research. https://github.com/alegrand/RR_webinars/blob/master

● Pratiques & Outils wiki of the INRA Cati Cascisdi. http://wiki.inra.fr/wiki/cascisdi

● Reproductibilité des calculs scientifiques (reproducibility of scientific computations). MJ Cros. http://www7.inra.fr/mia/T/cros/Interests.html

● R^4 mailing list: [email protected]
