Outline
• What problem are we going to solve?
• What is reproducible research?
• What tools do we need for this?
Reproducible Research
• Electronic journals are largely electronic only in their delivery mechanism. A few trees survive, but for the author and the reader little has changed.
• most recipients of electronic documents have a computational engine available
• this suggests that we could in fact move (in a structured way) to navigable documents with dynamic content
• these documents would allow the reader to recreate (and modify) the results being reported
Reproducibility
• means different things in different disciplines
– Physics tends to require a much higher level of reproducibility than biology (lower between-experiment variance)
– computation should be essentially reproducible, but we seem to be holding it up to a lesser standard
• we are not talking about data collection, but rather about data analysis
• and I want to be careful, at the outset, to be clear that I am not just talking about publishing in the scientific literature; as we will see from the use cases, there are more general uses for these ideas
Use Cases
1. A scientific paper:
• given some set of data
• do some manipulations, fit models
• draw conclusions
2. Computational inference
• simulation study
• MCMC
• bootstrapping
Use Cases
3. Students/post-docs need to take up and extend an existing project
• previous work by you, or another student
4. Consulting reports
• you have a client with some data who wants you to provide them with an analysis
5. Analysis of a very controversial topic
• e.g. global warming
6. Book authoring
• all 3 of my books are written using this approach
Issues (by use case)
1. A publication about scientific computing is not scholarship; it is merely an advertisement of scholarship: the scholarship lies elsewhere (Claerbout)
2. Very similar to 1. There is an element of advertising: you must believe that the author did the right thing
3. It is hard for them to get started; often months are wasted on trivial things (this was Claerbout’s motivation)
Issues (by use case)
4. Client shows up at the last minute with new data
• how do you make sure that all tables, figures and facts are updated?
• a slightly better scenario is where they come the next week with a very similar problem, and you can reuse your existing template
5. Controversial topic (e.g. global warming)
• an explicit reconstruction is needed
• you may want to have a variety of knobs/buttons so that users can verify the extent to which your solution is robust to model assumptions
Related Research
• Claerbout's lab at Stanford
– use of Makefiles
• Buckheit and Donoho (1995)
– plots should be reproducible
• Vince Carey
– Literate Programming
• Duncan Temple Lang
– Literate Programming
– extensible dynamic docs
• Tony Rossini
– Literate Data Analysis
• Fritz Leisch
– Sweave
• Peng et al.
– reproducibility of epidemiological research
What we are not talking about
• notice that in my original definition I want to condition on the data used
• we are not discussing the important, but different, task of replication of findings on independent data, or the validity of the data
• Peng et al. provide some very good discussion of the issues of replication of findings
Scientific Publication: Author
• an author selects a set of data
• they transform that data to produce pictures, tables and statistics
• from these they draw conclusions
Scientific Publication: Reader
• get a static document
• you can read it, try to understand what the author did
• electronic publishing has made it easier to
– get the document
– carry it around
– search it
• but not to understand or comprehend it
Dissection
• Data collection
– not reproducible, or of limited reproducibility
– not part of our discussion
– but, in all cases, the author must have selected some specific version to analyze
– our interest begins there
– from that point on, things should be reproducible
Dissection
• Data analysis: creation of figures, statistics, tables
– should be reproducible
– the data have been fixed
– well defined and understood statistical methods have been applied
– many readers should be able to understand the complete details
Dissection
• Conclusions
– depend on the data and the analysis
– given them, we can agree or disagree with the author
– if we don’t understand the analysis, how do we agree with the conclusions?
– download their data, reanalyze… frustration… unable to reproduce… my fault?… their fault?
A case study
• they developed gene expression signatures that are predictive of resistance to therapy
• they identified signatures that are consistent with activation of the Src and Rb/E2F pathways
• they put much, but not quite all, data online
Concern
• Run Batch Effects Potentially Compromise the Usefulness of Genomic Signatures for Ovarian Cancer, Keith A. Baggerly, Kevin R. Coombes, and E. Shannon Neeley, JCO 2008 26: 1186-1187
• they point to a variety of concerns with the analysis
– primarily that batch effects in the microarrays were not properly accounted for
– sample mislabeling at some point (not clear if this was just for publication or part of the analysis)
• details of their analysis, including their code, documentation, figures, and results are available from http://bioinformatics.mdanderson.org/Supplements/ReproRsch-Ovary
Rejoinder
• Journal of Clinical Oncology, Vol 26, No 7 (March 1), 2008: pp. 1187-1188
• it seems to take “reproduce” to be more restrictive than I might
A second look
• Vince Carey downloaded and looked at the corrected data (raw and processed)
• sample labels seem to be permuted
• still batch effects after label adjustment
• FDR is very significant for batch effect
• we see an association between survival time and run date
Where are we?
• the original authors are to be commended
– we may not agree with their analysis
– but they provided sufficient data for us to question it
– we seem to have questions but no easy way to fix the problem
– more complete code from the original authors would have helped to establish problems earlier and may eventually help resolve the matter
Compendiums
• we need to provide an entity that contains
– text: the written content of the article(s)
– code: computer code that will execute to provide outputs such as tables and graphics
– data: on which the code operates and about which the text is reporting
• there must be some organization (markup) that separates the different components
• there must be a system or protocol for transformation
Tools/Ideas
• literate programming gives us the basis for our approach
• basically, a literate program is a software program that is arranged for human reading, and then processed to produce a format more suitable for computation
• we need something a bit more, as we have to associate the data with the document, and generally some processing tools
• we have named such an entity a compendium
Properties of a good solution
• validity: does the supplied code run, and does it provide the right answers?
• accessibility: can a user obtain working copies? can the user access code and data easily?
• concurrency: can the compendium be updated? can it be extended? how do potential users find compendiums?
Requirements
• authoring: creation, testing, versioning
• server-side: organization, distribution
• client-side: retrieval, translation (weave, tangle, …)
• we have found that some form of caching is important, as even minor changes in the text/code/data require complete reprocessing
Outputs
• papers suitable for publication
• interim reports
• long and short versions of articles
• reports for clients, etc.
Tools Needed
[diagram: dynamic documents, programs and data feeding into the transformation]
The dynamic documents are transformed, using the data and programs, to produce the finished outputs.
Compendiums: An Implementation
• Sweave is a system for combining text and R code in alternating chunks
• the document looks like LaTeX, but with code inserted in a special (but easy to use) way
• the document can be woven to produce a LaTeX document with all code chunks replaced by their outputs
Sweave
\section{Data}
We see an interesting pattern in Figure~\ref{F1}.

<<F1, fig=TRUE>>=
plot(data.x, data.y)
@

And so we like it.

• on the left we see a section of an Sweave document
• first, standard LaTeX, and then a small code chunk that is R code
• after weaving, the code chunk will be replaced by the code to include the plot (which is in eps or pdf)
Compendiums: An Implementation
• the R package system provides a mechanism both for packaging together data, code and Sweave documents and for distributing these
• with these two tools we have a proof of concept: one can carry out reproducible research with these tools
• I can give you a package that represents a paper, and you can run it on your machine to reproduce that paper
Compendiums
• we need an authoring environment with tools to construct and test compendiums
• since any change to the document, code or data will require reprocessing the whole document, many have found that some form of caching is needed
• the weaver package (S. Falcon) provides such a tool
Other tools
• odfWeave is a tool that provides an Sweave-like approach for producing ODF documents
• we need a Microsoft-compatible version (and there is some hope, as there is interest at Microsoft in very similar issues)
Problems
• reliance on external libraries/software
• reliance on operating system
• use of large external data sources
• it is relatively easy to get out of date
• security
– can we protect private data but still allow some modeling?
– encryption systems
Where to next
• in our JCGS paper, we propose many extensions of the concept
• multilanguage compendiums
– users are free to use any language or statistical package to process chunks
– some in Perl, some using SAS, some using SPSS, etc.
– this raises large concerns about the evaluation environment (how do the outputs of one chunk become the inputs of the next?)
Where to next
• some questions of how long a compendium should be reproducible
• software maintenance: this becomes a problem, as you have to maintain and update your documents as software changes
• will the algorithms still be available? are they still best practices?
Scientific Publishing
• return to some of the ideas of Peng et al.
• they specify the following criteria for reproducible epidemiological research
– data: must be publicly available
– computer code: to produce data tables, figures, etc. must be available
– documentation: for said code should be available, as should any underlying infrastructure
– distribution: all of the materials should be distributed in some standard/usual way
Data Licenses
• Peng et al. also consider the following data licenses
– full access: the data can be used for any purpose
– attribution: the data can be used, but the original authors must be cited in some specified way
– share alike: the data can be used, but modifications, updates, etc. must be made available under the same rules as the original data
– reproduction: the data can be used only to reproduce the tables, figures, etc.
Data Issues
• Peng et al. do raise the issue that some data are intended to be private and cannot be shared
• there are ways to get around some issues (most government census departments have methods now)
• Peng et al. do not consider encryption, i.e. that the data are provided in an encrypted format that only certain individuals can unlock and process
• others can still run the document, but cannot see the specific data
References
• Peng et al., AJE, 2006, 783-789
• Gentleman and Temple Lang, JCGS, 2007, 16, 1-24
• Rossini, A., DSC 2001, Literate Statistical Practice
• Leisch, Compstat 2002, Sweave: Dynamic Generation of Statistical Reports Using Literate Data Analysis
Thanks
• Duncan Temple Lang
• Ross Ihaka
• Tony Rossini
• Seth Falcon
• Vince Carey
• Thomas Lumley