Outline
• What problem are we going to solve?
• What is reproducible research?
• What tools do we need for this?
Reproducible Research
• Electronic journals are largely electronic only in their delivery mechanism. A few trees survive, but for the author and the reader little has changed.
• most recipients of electronic documents have a computational engine available
• this suggests that we could in fact move (in a structured way) to navigable documents with dynamic content
• these documents would allow the reader to recreate (and modify) the results being reported
Reproducibility
• means different things in different disciplines
– Physics tends to require a much higher level of reproducibility than biology (lower between-experiment variance)
– computation should be essentially reproducible, but we seem to be holding it up to a lesser standard
• we are not talking about data collection, but rather about data analysis
• and I want to be careful, at the outset, to be clear that I am not just talking about publishing in the scientific literature; as we will see from the use cases, there are more general uses for these ideas
Use Cases
1. A scientific paper:
• given some set of data
• do some manipulations, fit models
• draw conclusions
2. Computational inference
• simulation study
• MCMC
• bootstrapping
Use Cases
3. Students/post-docs need to take up and extend an existing project
• previous work by you, or another student
4. Consulting reports
• you have a client with some data who wants you to provide them with an analysis
5. Analysis of a very controversial topic
• e.g. global warming
6. Book authoring
• all 3 of my books are written using this approach
Issues (by use case)
1. A publication about scientific computing is not scholarship; it is merely an advertisement of scholarship: the scholarship lies elsewhere (Claerbout)
2. Very similar to 1. There is an element of advertising: you must believe that the author did the right thing
3. It is hard for them to get started; often months are wasted on trivial things (this was Claerbout’s motivation)
Issues (by use case)
4. Client shows up at the last minute with new data
• how do you make sure that all tables, figures and facts are updated?
• a slightly better scenario is where they come the next week with a very similar problem, and you can reuse your existing template
5. Controversial topic (e.g. global warming)
• an explicit reconstruction is needed
• you may want to have a variety of knobs/buttons so that users can verify the extent to which your solution is robust to model assumptions
Related Research
• Claerbout's lab at Stanford
– use of Makefiles
• Buckheit and Donoho (1995)
– plots should be reproducible
• Vince Carey
– Literate Programming
• Duncan Temple Lang
– Literate Programming
– extensible dynamic docs
• Tony Rossini
– Literate Data Analysis
• Fritz Leisch
– Sweave
• Peng et al.
– reproducibility of epidemiological research
What we are not talking about
• notice that in my original definition I want to condition on the data used
• we are not discussing the important, but different, task of replication of findings on independent data, or the validity of the data
• Peng et al. provide some very good discussion of the issues of replication of findings
Scientific Publication: Author
• an author selects a set of data
• they transform that data to produce pictures, tables and statistics
• from these they draw conclusions
Scientific Publication: Reader
• get a static document
• you can read it, try to understand what the author did
• electronic publishing has made it easier to
– get the document
– carry it around
– search it
• but not to understand or comprehend it
Dissection
• Data collection
– not reproducible, or of limited reproducibility
– not part of our discussion
– but, in all cases, the author must have selected some specific version to analyze
– our interest begins there
– from that point on, things should be reproducible
Dissection
• Data analysis: creation of figures, statistics, tables
– should be reproducible
– the data have been fixed
– well defined and understood statistical methods have been applied
– many readers should be able to understand the complete details
Dissection
• Conclusions
– depend on the data and the analysis
– given them, we can agree or disagree with the author
– if we don’t understand the analysis, how do we agree with the conclusions?
– download their data, reanalyze… frustration… unable to reproduce… my fault?… their fault?
A case study
• they developed gene expression signatures that are predictive of resistance to therapy
• they identified signatures that are consistent with activation of the Src and Rb/E2F pathways
• they put much, but not quite all, data online
Concern
• Run Batch Effects Potentially Compromise the Usefulness of Genomic Signatures for Ovarian Cancer, Keith A. Baggerly, Kevin R. Coombes, and E. Shannon Neeley, JCO 2008 26: 1186-1187
• they point to a variety of concerns with the analysis
– primarily that batch effects in the microarrays were not properly accounted for
– sample mislabeling at some point (not clear if this was just for publication or part of the analysis)
• details of their analysis, including their code, documentation, figures, and results are available from http://bioinformatics.mdanderson.org/Supplements/ReproRsch-Ovary
Rejoinder
• Journal of Clinical Oncology, Vol 26, No 7 (March 1), 2008: pp. 1187-1188
• it seems to take “reproduce” to be more restrictive than I might
A second look
• Vince Carey downloaded and looked at the corrected data (raw and processed)
• sample labels seem to be permuted
• still batch effects after label adjustment
• FDR is very significant for batch effect
• we see an association between survival time and run date
Where are we?
• the original authors are to be commended
– we may not agree with their analysis
– but they provided sufficient data for us to question it
– we seem to have questions but no easy way to fix the problem
– more complete code from the original authors would have helped to establish problems earlier and may eventually help resolve the matter
Compendiums
• we need to provide an entity that contains
– text: the written content of the article(s)
– code: computer code that will execute to provide outputs such as tables and graphics
– data: on which the code operates and about which the text is reporting
• there must be some organization (markup) that separates the different components
• there must be a system or protocol for transformation
Tools/Ideas
• literate programming gives us the basis for our approach
• basically, a literate program is a software program that is arranged for human reading, and then processed to produce a format more suitable for computation
• we need something a bit more, as we have to associate the data with the document, and generally some processing tools
• we have named such an entity a compendium
Properties of a good solution
• validity: does the supplied code run, and does it provide the right answers?
• accessibility: can a user obtain working copies? can the user access code and data easily?
• concurrency: can the compendium be updated? can it be extended? how do potential users find compendiums?
Requirements
• authoring: creation, testing, versioning
• server-side: organization, distribution
• client-side: retrieval, translation (weave, tangle, …)
• we have found that some form of caching is important, as even minor changes in the text/code/data require complete reprocessing
Outputs
• papers suitable for publication
• interim reports
• long and short versions of articles
• reports for clients, etc.
Tools Needed
[diagram: dynamic documents, programs and data feeding into the transformation]
The dynamic documents are transformed, using the data and programs, to produce the finished outputs.
Compendiums: An Implementation
• Sweave is a system for combining text and R code in alternating chunks
• the document looks like LaTeX, but with code inserted in a special (but easy to use) way
• the document can be woven to produce a LaTeX document with all code chunks replaced by their outputs
Sweave
\section{Data}
We see an interesting pattern in Figure~\ref{F1}.

<<F1, fig=TRUE>>=
plot(data.x, data.y)
@

And so we like it.

• on the left we see a section of an Sweave document
• first, standard LaTeX, and then a small code chunk that is R code
• after weaving, the code chunk will be replaced by the code to include the plot (which is in eps or pdf)
Compendiums: An Implementation
• the R package system provides a mechanism both for packaging together data, code and Sweave documents and for distributing these
• with these two tools we have a proof of concept: one can carry out reproducible research with these tools
• I can give you a package that represents a paper, and you can run it on your machine to reproduce that paper
Compendiums
• we need an authoring environment with tools to construct and test compendiums
• since any change to the document, code or data will require reprocessing the whole document, many have found that some form of caching is needed
• the weaver package (S. Falcon) provides such a tool
Other tools
• odfWeave is a tool that provides an Sweave-like approach for producing ODF documents
• we need a Microsoft-compatible version (and there is some hope, as there is interest at Microsoft in very similar issues)
Problems
• reliance on external libraries/software
• reliance on operating system
• use of large external data sources
• it is relatively easy to get out of date
• security
– can we protect private data but still allow some modeling?
– encryption systems
Where to next
• in our JCGS paper, we propose many extensions of the concept
• multilanguage compendiums
– users are free to use any language or statistical package to process chunks
– some in Perl, some using SAS, some using SPSS, etc.
– this raises large concerns about the evaluation environment (how do the outputs of one chunk become the inputs of the next?)
Where to next
• some questions of how long a compendium should be reproducible
• software maintenance: this becomes a problem, as you have to maintain and update your documents as software changes
• will the algorithms still be available? are they still best practices?
Scientific Publishing
• return to some of the ideas of Peng et al.
• they specify the following criteria for reproducible epidemiological research
– data: must be publicly available
– computer code: to produce data tables, figures, etc. must be available
– documentation: for said code should be available, as should any underlying infrastructure
– distribution: all of the materials should be distributed in some standard/usual way
Data Licenses
• Peng et al. also consider the following data licenses
– full access: the data can be used for any purpose
– attribution: the data can be used, but the original authors must be cited in some specified way
– share alike: the data can be used, but modifications, updates, etc. must be made available under the same rules as the original data
– reproduction: the data can be used only to reproduce the tables, figures, etc.
Data Issues
• Peng et al. do raise the issue that some data are intended to be private and cannot be shared
• there are ways to get around some issues (most government census departments have methods now)
• Peng et al. do not consider encryption, i.e. that the data are provided in an encrypted format that only certain individuals can unlock and process
• others can still run the document, but cannot see the specific data
References
• Peng et al., AJE, 2006, 783-789
• Gentleman and Temple Lang, JCGS, 2007, 16, 1-24
• Rossini, A., DSC 2001, Literate Statistical Practice
• Leisch, Compstat 2002, Sweave: Dynamic Generation of Statistical Reports Using Literate Data Analysis
Thanks
• Duncan Temple Lang
• Ross Ihaka
• Tony Rossini
• Seth Falcon
• Vince Carey
• Thomas Lumley