QMUL - MSc - reproducible research, sustainably software.
Citation preview
Programming in R?
If/else
Logical Operators
going further
Reproducible Research & Sustainable Software
@yannick__ http://yannick.poulet.org
SBSM035 - Stats/Bioinformatics/Programming
Why care?
Science, Vol. 314, p. 1875, 22 December 2006 (www.sciencemag.org)
Retraction
WE WISH TO RETRACT our Research Article “Structure of MsbA from E. coli: A homolog of the multidrug resistance ATP binding cassette (ABC) transporters” and both of our Reports “Structure of the ABC transporter MsbA in complex with ADP•vanadate and lipopolysaccharide” and “X-ray structure of the EmrE multidrug transporter in complex with a substrate” (1–3).
The recently reported structure of Sav1866 (4) indicated that our MsbA structures (1, 2, 5) were incorrect in both the hand of the structure and the topology. Thus, our biological interpretations based on these inverted models for MsbA are invalid.
An in-house data reduction program introduced a change in sign for anomalous differences. This program, which was not part of a conventional data processing package, converted the anomalous pairs (I+ and I−) to (F− and F+), thereby introducing a sign change. As the diffraction data collected for each set of MsbA crystals and for the EmrE crystals were processed with the same program, the structures reported in (1–3, 5, 6) had the wrong hand.
The error in the topology of the original MsbA structure was a consequence of the low resolution of the data as well as breaks in the electron density for the connecting loop regions. Unfortunately, the use of the multicopy refinement procedure still allowed us to obtain reasonable refinement values for the wrong structures.
The Protein Data Bank (PDB) files 1JSQ, 1PF4, and 1Z2R for MsbA and 1S7B and 2F2M for EmrE have been moved to the archive of obsolete PDB entries. The MsbA and EmrE structures will be recalculated from the original data using the proper sign for the anomalous differences, and the new Cα coordinates and structure factors will be deposited.
We very sincerely regret the confusion that these papers have caused and, in particular, subsequent research efforts that were unproductive as a result of our original findings.

GEOFFREY CHANG, CHRISTOPHER B. ROTH, CHRISTOPHER L. REYES, OWEN PORNILLOS, YEN-JU CHEN, ANDY P. CHEN
Department of Molecular Biology, The Scripps Research Institute, La Jolla, CA 92037, USA.
References
1. G. Chang, C. B. Roth, Science 293, 1793 (2001).
2. C. L. Reyes, G. Chang, Science 308, 1028 (2005).
3. O. Pornillos, Y.-J. Chen, A. P. Chen, G. Chang, Science 310, 1950 (2005).
4. R. J. Dawson, K. P. Locher, Nature 443, 180 (2006).
5. G. Chang, J. Mol. Biol. 330, 419 (2003).
6. C. Ma, G. Chang, Proc. Natl. Acad. Sci. U.S.A. 101, 2852 (2004).
• Avoid costly mistakes
• Be faster: “stand on the shoulders of giants”
• Increase impact / visibility
Reproducible Research & Sustainable Software
“Big data” biology is hard.
• Biology/life is complex.
• The field is young.
• Biologists lack computational training.
• Generally, analysis tools suck:
  • badly written
  • badly tested
  • hard to install
  • output quality often questionable
• Understanding/visualizing/massaging data is hard.
• Datasets continue to grow!
“Big data” biology is hard.
We need great tools.
Some sources of inspiration
arXiv:1210.0530v3 [cs.MS] 29 Nov 2012

Best Practices for Scientific Computing
Greg Wilson, D. A. Aruliah, C. Titus Brown, Neil P. Chue Hong, Matt Davis, Richard T. Guy, Steven H. D. Haddock, Katy Huff, Ian M. Mitchell, Mark D. Plumbley, Ben Waugh, Ethan P. White, Paul Wilson
Scientists spend an increasing amount of time building and using software. However, most scientists are never taught how to do this efficiently. As a result, many are unaware of tools and practices that would allow them to write more reliable and maintainable code with less effort. We describe a set of best practices for scientific software development that have solid foundations in research and experience, and that improve scientists’ productivity and the reliability of their software.
Software is as important to modern scientific research as telescopes and test tubes. From groups that work exclusively on computational problems, to traditional laboratory and field scientists, more and more of the daily operation of science revolves around computers. This includes the development of new algorithms, managing and analyzing the large amounts of data that are generated in single research projects, and combining disparate datasets to assess synthetic problems.
Scientists typically develop their own software for these purposes because doing so requires substantial domain-specific knowledge. As a result, recent studies have found that scientists typically spend 30% or more of their time developing software [19, 52]. However, 90% or more of them are primarily self-taught [19, 52], and therefore lack exposure to basic software development practices such as writing maintainable code, using version control and issue trackers, code reviews, unit testing, and task automation.
We believe that software is just another kind of experimental apparatus [63] and should be built, checked, and used as carefully as any physical apparatus. However, while most scientists are careful to validate their laboratory and field equipment, most do not know how reliable their software is [21, 20]. This can lead to serious errors impacting the central conclusions of published research [43]: recent high-profile retractions, technical comments, and corrections because of errors in computational methods include papers in Science [6], PNAS [39], the Journal of Molecular Biology [5], Ecology Letters [37, 8], the Journal of Mammalogy [33], and Hypertension [26].
In addition, because software is often used for more than a single project, and is often reused by other scientists, computing errors can have disproportional impacts on the scientific process. This type of cascading impact caused several prominent retractions when an error from another group’s code was not discovered until after publication [43]. As with bench experiments, not everything must be done to the most exacting standards; however, scientists need to be aware of best practices both to improve their own approaches and for reviewing computational work by others.
This paper describes a set of practices that are easy to adopt and have proven effective in many research settings. Our recommendations are based on several decades of collective experience both building scientific software and teaching computing to scientists [1, 65], reports from many other groups [22, 29, 30, 35, 41, 50, 51], guidelines for commercial and open source software development [61, 14], and on empirical studies of scientific computing [4, 31, 59, 57] and software development in general (summarized in [48]). None of these practices will guarantee efficient, error-free software development, but used in concert they will reduce the number of errors in scientific software, make it easier to reuse, and save the authors of the software time and effort that can be used for focusing on the underlying scientific questions.
1. Write programs for people, not computers.
Scientists writing software need to write code that both executes correctly and can be easily read and understood by other programmers (especially the author’s future self). If software cannot be easily read and understood, it is much more difficult to know that it is actually doing what it is intended to do. To be productive, software developers must therefore take several aspects of human cognition into account: in particular, that human working memory is limited, human pattern matching abilities are finely tuned, and human attention span is short [2, 23, 38, 3, 55].
First, a program should not require its readers to hold more than a handful of facts in memory at once (1.1). Human working memory can hold only a handful of items at a time, where each item is either a single fact or a “chunk” aggregating several facts [2, 23], so programs should limit the total number of items to be remembered to accomplish a task. The primary way to accomplish this is to break programs up into easily understood functions, each of which conducts a single, easily understood task. This serves to make each piece of the program easier to understand in the same way that breaking up a scientific paper using sections and paragraphs makes it easier to read. For example, a function to calculate the area of a rectangle can be written to take four separate coordinates:

def rect_area(x1, y1, x2, y2):
    ...calculation...

or to take two points:

def rect_area(point1, point2):
    ...calculation...

The latter function is significantly easier for people to read and remember, while the former is likely to lead to errors, not …
1. Write programs for people, not computers.
2. Automate repetitive tasks.
3. Use the computer to record history.
4. Make incremental changes.
5. Use version control.
6. Don’t repeat yourself (or others).
7. Plan for mistakes.
8. Optimize software only after it works correctly.
9. Document the design and purpose of code rather than its mechanics.
10. Conduct code reviews.
Specific Approaches/Tools
• Planning for mistakes
• Automated testing
• Continuous integration
• Writing for people: use style guide
Code for people: Use a style guide
• For R: http://r-pkgs.had.co.nz/style.html
R style guide extract
Coding for people: Indent your code!
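A before/after sketch of the point (the variable names and values are invented for illustration): the same if/else, first unindented, then indented so the nesting is visible at a glance.

```r
# Hard to scan: no indentation, the structure is invisible.
# if (headwidth > 2) {
# caste <- 'soldier'
# } else {
# caste <- 'worker'
# }

# Indented: the branch structure is obvious.
headwidth <- 2.5  # invented example value
if (headwidth > 2) {
  caste <- 'soldier'
} else {
  caste <- 'worker'
}
caste  # "soldier"
```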
Programming better
• variable naming
• coding width: 100 characters
• indenting
• Follow conventions -eg “Google R Style”
• Versioning: DropBox & http://github.com/
• Automated testing
• “being able to use, understand, and improve your code in 6 months & in 60 years” - paraphrasing Damian Conway
preprocess_snps <- function(snp_table, testing = FALSE) {
  if (testing) {
    # run a bunch of tests of extreme situations.
    # quit if a test gives a weird result.
  }
  # real part of function.
}
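One way to make that testing branch concrete is with base R’s stopifnot(), which aborts with an error if any check fails. This is only a sketch: preprocess_snps, its column names, and the filtering step are hypothetical.

```r
# Hypothetical sketch of "planning for mistakes": sanity checks run before
# the real work, assuming a snp_table with 'position' and 'frequency' columns.
preprocess_snps <- function(snp_table, testing = FALSE) {
  if (testing) {
    # quit with an error if any sanity check gives a weird result
    stopifnot(is.data.frame(snp_table))
    stopifnot(all(c("position", "frequency") %in% colnames(snp_table)))
    stopifnot(all(snp_table$frequency >= 0 & snp_table$frequency <= 1))
  }
  # real part of the function: e.g., drop invariant sites
  snp_table[snp_table$frequency > 0, ]
}

snps <- data.frame(position  = c(10, 20, 30),
                   frequency = c(0.0, 0.25, 0.5))
preprocess_snps(snps, testing = TRUE)  # returns the two variable sites
```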
Line length Strive to limit your code to 80 characters per line. This fits comfortably on a printed page with a reasonably sized font. If you find yourself running out of room, this is a good indication that you should encapsulate some of the work in a separate function.
R style guide extract
ant_measurements <- read.table(file = '~/Downloads/Web/ant_measurements.txt', header=TRUE, sep='\t', col.names = c('colony', 'individual', 'headwidth', 'mass'))
ant_measurements <- read.table(
  file      = '~/Downloads/Web/ant_measurements.txt',
  header    = TRUE,
  sep       = '\t',
  col.names = c('colony', 'individual', 'headwidth', 'mass')
)
Code for people: Use a style guide
• For R: http://r-pkgs.had.co.nz/style.html
• For Ruby: https://github.com/bbatsov/ruby-style-guide
Automatically check your code:
install.packages("lint")  # once
library(lint)             # every time
lint("file_to_check.R")
Eliminate redundancy
DRY: Don’t Repeat Yourself
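As a minimal illustration of DRY in R (the file names, column, and normalization step are made up for this sketch): copy-pasted per-dataset code is replaced by a single function that is written once and tested once.

```r
# Repetitive version: the same steps copy-pasted for every dataset.
# colony_a <- read.table('colony_a.txt', header = TRUE)
# colony_a$mass <- colony_a$mass / mean(colony_a$mass)
# colony_b <- read.table('colony_b.txt', header = TRUE)
# colony_b$mass <- colony_b$mass / mean(colony_b$mass)

# DRY version: one function; a fix or improvement happens in one place.
normalize_mass <- function(measurements) {
  measurements$mass <- measurements$mass / mean(measurements$mass)
  measurements
}

colony_a <- data.frame(mass = c(2, 4, 6))   # stands in for read.table(...)
normalized <- normalize_mass(colony_a)
normalized$mass  # 0.5 1.0 1.5
```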
knitr (Sweave)
Analyzing & reporting in a single file.
analysis.Rmd
### in R:
library(knitr)
knit("analysis.Rmd")   # --> creates analysis.md
### in shell:
pandoc analysis.md -o analysis.pdf   # --> creates analysis.pdf
A minimal R Markdown example
I know the value of pi is 3.1416, and 2 times pi is 6.2832. To compile me type:

library(knitr); knit('minimal.Rmd')
A paragraph here. A code chunk below:
1+1
## [1] 2
.4-.7+.3 # what? it is not zero!
## [1] 5.551e-17
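The surprise above is ordinary floating-point rounding, not an R bug: decimal fractions like 0.4 and 0.7 have no exact binary representation, so the arithmetic leaves a tiny residual. Comparisons should therefore use a tolerance, e.g. with all.equal():

```r
# The residual is tiny but nonzero, so exact comparison fails.
x <- .4 - .7 + .3
x == 0                    # FALSE: exact comparison fails
isTRUE(all.equal(x, 0))   # TRUE: comparison within a tolerance
abs(x) < 1e-9             # TRUE: an explicit tolerance check
```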
Graphics work too
library(ggplot2)
qplot(speed, dist, data = cars) + geom_smooth()
[scatterplot of speed vs. dist for the cars data, with a smoothed fit]

Figure 1: A scatterplot of cars
Organize mindfully
Education
A Quick Guide to Organizing Computational Biology Projects
William Stafford Noble 1,2
1 Department of Genome Sciences, School of Medicine, University of Washington, Seattle, Washington, United States of America; 2 Department of Computer Science and Engineering, University of Washington, Seattle, Washington, United States of America
Introduction
Most bioinformatics coursework focuses on algorithms, with perhaps some components devoted to learning programming skills and learning how to use existing bioinformatics software. Unfortunately, for students who are preparing for a research career, this type of curriculum fails to address many of the day-to-day organizational challenges associated with performing computational experiments. In practice, the principles behind organizing and documenting computational experiments are often learned on the fly, and this learning is strongly influenced by personal predilections as well as by chance interactions with collaborators or colleagues.
The purpose of this article is to describe one good strategy for carrying out computational experiments. I will not describe profound issues such as how to formulate hypotheses, design experiments, or draw conclusions. Rather, I will focus on relatively mundane issues such as organizing files and directories and documenting progress. These issues are important because poor organizational choices can lead to significantly slower research progress. I do not claim that the strategies I outline here are optimal. These are simply the principles and practices that I have developed over 12 years of bioinformatics research, augmented with various suggestions from other researchers with whom I have discussed these issues.
Principles
The core guiding principle is simple: Someone unfamiliar with your project should be able to look at your computer files and understand in detail what you did and why. This “someone” could be any of a variety of people: someone who read your published article and wants to try to reproduce your work, a collaborator who wants to understand the details of your experiments, a future student working in your lab who wants to extend your work after you have moved on to a new job, your research advisor, who may be interested in understanding your work or who may be evaluating your research skills. Most commonly, however, that “someone” is you. A few months from now, you may not remember what you were up to when you created a particular set of files, or you may not remember what conclusions you drew. You will either have to then spend time reconstructing your previous experiments or lose whatever insights you gained from those experiments.
This leads to the second principle, which is actually more like a version of Murphy’s Law: Everything you do, you will probably have to do over again. Inevitably, you will discover some flaw in your initial preparation of the data being analyzed, or you will get access to new data, or you will decide that your parameterization of a particular model was not broad enough. This means that the experiment you did last week, or even the set of experiments you’ve been working on over the past month, will probably need to be redone. If you have organized and documented your work clearly, then repeating the experiment with the new data or the new parameterization will be much, much easier.
To see how these two principles are applied in practice, let’s begin by considering the organization of directories and files with respect to a particular project.
File and Directory Organization
When you begin a new project, you will need to decide upon some organizational structure for the relevant directories. It is generally a good idea to store all of the files relevant to one project under a common root directory. The exception to this rule is source code or scripts that are used in multiple projects. Each such program might have a project directory of its own.
Within a given project, I use a top-level organization that is logical, with chronological organization at the next level, and logical organization below that. A sample project, called msms, is shown in Figure 1. At the root of most of my projects, I have a data directory for storing fixed data sets, a results directory for tracking computational experiments performed on that data, a doc directory with one subdirectory per manuscript, and directories such as src for source code and bin for compiled binaries or scripts.
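The layout described above can be bootstrapped in a few lines of Ruby (a sketch only; msms is the sample project name from the article, and the directory list follows the paragraph above):

```ruby
require "fileutils"

# Sample project root, taken from the article's example project.
root = "msms"

# Top-level layout: fixed data, experiment results, per-manuscript
# docs, source code, and compiled binaries or scripts.
%w[data results doc src bin].each do |dir|
  FileUtils.mkdir_p(File.join(root, dir))
end
```

FileUtils.mkdir_p creates intermediate directories as needed and is a no-op if they already exist, so the script is safe to re-run.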
Within the data and results directories, it is often tempting to apply a similar, logical organization. For example, you may have two or three data sets against which you plan to benchmark your algorithms, so you could create one directory for each of them under data. In my experience, this approach is risky, because the logical structure of your final set of experiments may look drastically different from the form you initially designed. This is particularly true under the results directory, where you may not even know in advance what kinds of experiments you will need to perform. If you try to give your directories logical names, you may end up with a very long list of directories with names that, six months from now, you no longer know how to interpret.
Instead, I have found that organizing my data and results directories chronologically makes the most sense. Indeed,
Citation: Noble WS (2009) A Quick Guide to Organizing Computational Biology Projects. PLoS Comput Biol 5(7): e1000424. doi:10.1371/journal.pcbi.1000424
Editor: Fran Lewitter, Whitehead Institute, United States of America
Published July 31, 2009
Copyright: © 2009 William Stafford Noble. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: The author received no specific funding for writing this article.
Competing Interests: The author has declared that no competing interests exist.
* E-mail: [email protected]
PLoS Computational Biology | www.ploscompbiol.org 1 July 2009 | Volume 5 | Issue 7 | e1000424
with this approach, the distinction between data and results may not be useful. Instead, one could imagine a top-level directory called something like experiments, with subdirectories with names like 2008-12-19. Optionally, the directory name might also include a word or two indicating the topic of the experiment therein. In practice, a single experiment will often require more than one day of work, and so you may end up working a few days or more before creating a new subdirectory. Later, when you or someone else wants to know what you did, the chronological structure of your work will be self-evident.
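A minimal Ruby sketch of creating such a dated experiment directory (the experiments directory name comes from the paragraph above; the topic tag is illustrative, not prescribed by the article):

```ruby
require "fileutils"

# Year-month-day names sort alphabetically into chronological order.
date  = Time.now.strftime("%Y-%m-%d")   # e.g. "2009-01-15"
topic = "cross-validation"              # hypothetical topic tag
dir   = File.join("experiments", "#{date}-#{topic}")

FileUtils.mkdir_p(dir)
```

Because the date comes first, a plain `ls` of experiments/ doubles as a timeline of the project.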
Below a single experiment directory, the organization of files and directories is logical, and depends upon the structure of your experiment. In many simple experiments, you can keep all of your files in the current directory. If you start creating lots of files, then you should introduce some directory structure to store files of different types. This directory structure will typically be generated automatically from a driver script, as discussed below.
The Lab Notebook
In parallel with this chronological directory structure, I find it useful to maintain a chronologically organized lab notebook. This is a document that resides in the root of the results directory and that records your progress in detail. Entries in the notebook should be dated, and they should be relatively verbose, with links or embedded images or tables displaying the results of the experiments that you performed. In addition to describing precisely what you did, the notebook should record your observations, conclusions, and ideas for future work. Particularly when an experiment turns out badly, it is tempting simply to link the final plot or table of results and start a new experiment. Before doing that, it is important to document how you know the experiment failed, since the interpretation of your results may not be obvious to someone else reading your lab notebook.
In addition to the primary text describing your experiments, it is often valuable to transcribe notes from conversations as well as e-mail text into the lab notebook. These types of entries provide a complete picture of the development of the project over time.
In practice, I ask members of my research group to put their lab notebooks online, behind password protection if necessary. When I meet with a member of my lab or a project team, we can refer to the online lab notebook, focusing on the current entry but scrolling up to previous entries as necessary. The URL can also be provided to remote collaborators to give them status updates on the project.
Note that if you would rather not create your own "home-brew" electronic notebook, several alternatives are available. For example, a variety of commercial software systems have been created to help scientists create and maintain electronic lab notebooks [1–3]. Furthermore, especially in the context of collaborations, storing the lab notebook on a wiki-based system or on a blog site may be appealing.
Figure 1. Directory structure for a sample project. Directory names are in large typeface, and filenames are in smaller typeface. Only a subset of the files are shown here. Note that the dates are formatted <year>-<month>-<day> so that they can be sorted in chronological order. The source code src/ms-analysis.c is compiled to create bin/ms-analysis and is documented in doc/ms-analysis.html. The README files in the data directories specify who downloaded the data files from what URL on what date. The driver script results/2009-01-15/runall automatically generates the three subdirectories split1, split2, and split3, corresponding to three cross-validation splits. The bin/parse-sqt.py script is called by both of the runall driver scripts. doi:10.1371/journal.pcbi.1000424.g001
In each results folder:
• script getResults.rb
• intermediates
• output
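A sketch of what a tiny driver script along these lines might look like. The slide names getResults.rb but does not show its contents, so the body below is entirely an assumption: it just creates the two areas listed above and logs the run so the experiment can be reproduced from the script alone.

```ruby
#!/usr/bin/env ruby
require "fileutils"

# Create the two output areas listed on the slide.
%w[intermediates output].each { |d| FileUtils.mkdir_p(d) }

# Hypothetical analysis step: record when the script ran, so the
# results directory documents its own provenance.
File.write(File.join("output", "log.txt"), "run at #{Time.now}\n")
```

The point is not the analysis itself but the pattern: one script per results folder that regenerates everything beneath it.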
Track versions of everything
GitHub: Facebook for code
• Easy versioning
• Random people use your stuff
• And find problems and fix and improve it!
• Greater impact / better planet
• Easily update
• Easily collaborate
• Identify trends
• Build online reputation

Demo
Learn how: https://try.github.io/levels/1/challenges/1
Programming languages
Choosing a programming language

• Excel. Good: quick & dirty. Bad: easy to make mistakes; doesn't scale.
• R. Good: numbers, stats, genomics. Bad: programming.
• Unix command-line (== shell == bash). Good: can't escape it; quick & dirty; HPC. Bad: programming, complicated things.
• Java. Good: 1990s, user interfaces. Bad: overcomplicated.
• Perl. 1980s. Everything.
• Python. Good: scripting, text. Bad: ugly.
• Ruby. Good: scripting, text.
• Javascript/Node. Good: scripting, flexibility (web & client), community. Bad: only little bio-stuff.
Ruby.“Friends don’t let friends do Perl” - reddit user
### in Perl:
open INFILE, "my_file.txt";
while (defined ($line = <INFILE>)) {
  chomp($line);
  @letters = split(//, $line);
  @reverse_letters = reverse(@letters);
  $reverse_string = join("", @reverse_letters);
  print $reverse_string, "\n";
}
### in Ruby:
File.open("my_file.txt").each { |line| puts line.chomp.reverse }
• example: "reverse each line in file"
• read file; with each line:
  • remove the invisible "end of line" character
  • reverse the contents
  • print the reversed line
More ruby examples.
5.times { puts "Hello world" }
# Sorting people
people_sorted_by_age = people.sort_by { |person| person.age }
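The sort_by line above assumes a `people` collection already exists; a self-contained version, where the Person struct and the sample data are made up for illustration:

```ruby
# Hypothetical Person type and sample data, for illustration only.
Person = Struct.new(:name, :age)
people = [Person.new("Ana", 34), Person.new("Bo", 29), Person.new("Chi", 41)]

# sort_by orders the collection by whatever the block returns.
people_sorted_by_age = people.sort_by { |person| person.age }
# youngest first: Bo (29), Ana (34), Chi (41)
```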
+ many tools for bio-data, e.g. check http://biogems.info
Getting help.
• In real life: Make friends with people. Talk to them.
• Online:
  • Specific discussion mailing lists (e.g.: R, Stacks, bioruby, MAKER...)
  • Programming: http://stackoverflow.com
  • Bioinformatics: http://www.biostars.org
  • Sequencing-related: http://seqanswers.com
  • Stats: http://stats.stackexchange.com
• Codeschool!
“Can you BLAST this for me?”
• Once I wanted to set up a BLAST server.
Anurag Priyam, Mechanical engineering student, Kharagpur
Aim: An open source idiot-proof web-interface for custom BLAST
Anurag Priyam, Mechanical engineering student, IIT Kharagpur
Sure, I can help you…
“Can you BLAST this for me?”
Antgenomes.org SequenceServer BLAST made easy
(well, we’re trying...)
Aim: An open source idiot-proof web-interface for custom BLAST
Today: SequenceServer
Used in >200 labs
xkcd