Programming in R?

2014 11-13-sbsm032-reproducible research


DESCRIPTION

QMUL - MSc - reproducible research, sustainable software.


Page 1: 2014 11-13-sbsm032-reproducible research

Programming in R?

Page 2: 2014 11-13-sbsm032-reproducible research

If/else
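As a reminder of the syntax this slide covers, a minimal if/else sketch in R (the variable and threshold are made up for illustration):

### in R:
x <- 5
if (x > 3) {
  print("x is greater than 3")
} else {
  print("x is 3 or less")
}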

Page 3: 2014 11-13-sbsm032-reproducible research

Logical Operators
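For reference, a quick sketch of the logical operators available in R (values chosen arbitrarily):

### in R:
a <- TRUE
b <- FALSE
a & b     # AND -> FALSE
a | b     # OR  -> TRUE
!a        # NOT -> FALSE
5 == 5    # equal     -> TRUE
5 != 3    # not equal -> TRUE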

Page 4: 2014 11-13-sbsm032-reproducible research
Page 5: 2014 11-13-sbsm032-reproducible research
Page 6: 2014 11-13-sbsm032-reproducible research
Page 7: 2014 11-13-sbsm032-reproducible research

going further

Page 8: 2014 11-13-sbsm032-reproducible research
Page 9: 2014 11-13-sbsm032-reproducible research

Reproducible Research & Sustainable Software

@yannick__ http://yannick.poulet.org

SBSM035 - Stats/Bioinformatics/Programming

Page 10: 2014 11-13-sbsm032-reproducible research

Why care?

Page 11: 2014 11-13-sbsm032-reproducible research
Page 12: 2014 11-13-sbsm032-reproducible research
Page 13: 2014 11-13-sbsm032-reproducible research

www.sciencemag.org SCIENCE VOL 314, 22 DECEMBER 2006, p. 1875

Aquaculture in Offshore Zones

THE EDITORIAL BY ROSAMOND NAYLOR, “Offshore aquaculture legislation” (8 Sept., p. 1363), suggests that the motivation for moving aquaculture into the open ocean is that “marine fish farming near the shore is limited by state regulations.” Although unworkable regulations may exist in a few states, in the larger scheme this is irrelevant. Of the offshore aquaculture projects currently under way, none are occurring in the U.S. Exclusive Economic Zone (EEZ); rather, they are happening in state waters. Even historically, only two aquaculture projects have ever occurred in federal waters (1).

Much of Naylor’s stated concern over offshore aquaculture is based on historical experience with near-shore fish farms. This is in spite of years of more relevant offshore operations that reveal little, if any, negative impact on the environment or local ecosystems (2, 3). Naylor criticizes the National Offshore Aquaculture Act of 2005 because it lacks specific environmental standards. Yet, she recommends California’s recent Sustainable Oceans Act as a legislative model, although it is similarly silent, leaving those details to rule-making in response to the best available science.

Naylor criticizes the use of fishmeal as an aquaculture ingredient, ignoring the fact that industrial fisheries are well managed and would occur with or without aquaculture’s demand. Naylor ignores the higher efficiency of using fishmeal to feed fish compared with its use in land-based livestock operations (4). Also ignored is the inefficiency of using small pelagic fish in the natural setting to feed predator fish (5).

Researchers and entrepreneurs currently developing the technologies needed for offshore aquaculture share a vision of a well-managed industry governed by regulations with a rational basis in the ecology of the oceans and the economic realities of the marketplace.

CLIFFORD A. GOUDEY
Massachusetts Institute of Technology, Cambridge, MA 02139, USA.

References and Notes
1. The SeaStead project a decade ago, four miles off Massachusetts (see www.nmfs.noaa.gov/mb/sk/saltonstallken/enhancement.htm), and the recent Offshore Aquaculture Consortium experimental cage operation 22 miles off Mississippi (see www.masgc.org/oac/).
2. See www.lib.noaa.gov/docaqua/reports_noaaresearch/hooarrprept.htm/.
3. See www.blackpearlsinc.com/PDF/hoarpi.pdf.
4. See www.salmonoftheamericas.com/env_food.html.
5. D. Pauly, V. Christensen, Nature 374, 255 (2002).

IN HER PROVOCATIVE EDITORIAL “Offshore aquaculture legislation” (8 Sept., p. 1363), R. Naylor raises valid points regarding regulation of oceanic aquaculture, since it is sure to grow in the future because of dwindling global fishery supplies. This growth is


Retraction

WE WISH TO RETRACT our Research Article “Structure of MsbA from E. coli: A homolog of the multidrug resistance ATP binding cassette (ABC) transporters” and both of our Reports “Structure of the ABC transporter MsbA in complex with ADP•vanadate and lipopolysaccharide” and “X-ray structure of the EmrE multidrug transporter in complex with a substrate” (1–3).

The recently reported structure of Sav1866 (4) indicated that our MsbA structures (1, 2, 5) were incorrect in both the hand of the structure and the topology. Thus, our biological interpretations based on these inverted models for MsbA are invalid.

An in-house data reduction program introduced a change in sign for anomalous differences. This program, which was not part of a conventional data processing package, converted the anomalous pairs (I+ and I-) to (F- and F+), thereby introducing a sign change. As the diffraction data collected for each set of MsbA crystals and for the EmrE crystals were processed with the same program, the structures reported in (1–3, 5, 6) had the wrong hand.

The error in the topology of the original MsbA structure was a consequence of the low resolution of the data as well as breaks in the electron density for the connecting loop regions. Unfortunately, the use of the multicopy refinement procedure still allowed us to obtain reasonable refinement values for the wrong structures.

The Protein Data Bank (PDB) files 1JSQ, 1PF4, and 1Z2R for MsbA and 1S7B and 2F2M for EmrE have been moved to the archive of obsolete PDB entries. The MsbA and EmrE structures will be recalculated from the original data using the proper sign for the anomalous differences, and the new Cα coordinates and structure factors will be deposited.

We very sincerely regret the confusion that these papers have caused and, in particular, subsequent research efforts that were unproductive as a result of our original findings.

GEOFFREY CHANG, CHRISTOPHER B. ROTH, CHRISTOPHER L. REYES, OWEN PORNILLOS, YEN-JU CHEN, ANDY P. CHEN
Department of Molecular Biology, The Scripps Research Institute, La Jolla, CA 92037, USA.

References
1. G. Chang, C. B. Roth, Science 293, 1793 (2001).
2. C. L. Reyes, G. Chang, Science 308, 1028 (2005).
3. O. Pornillos, Y.-J. Chen, A. P. Chen, G. Chang, Science 310, 1950 (2005).
4. R. J. Dawson, K. P. Locher, Nature 443, 180 (2006).
5. G. Chang, J. Mol. Biol. 330, 419 (2003).
6. C. Ma, G. Chang, Proc. Natl. Acad. Sci. U.S.A. 101, 2852 (2004).


Page 14: 2014 11-13-sbsm032-reproducible research

• Avoid costly mistakes

• Be faster: “stand on the shoulders of giants”

• Increase impact / visibility

Reproducible Research & Sustainable Software

Page 15: 2014 11-13-sbsm032-reproducible research

“Big data” biology is hard.

Page 16: 2014 11-13-sbsm032-reproducible research

• Biology/life is complex.
• Field is young.
• Biologists lack computational training.
• Generally, analysis tools suck:
  • badly written
  • badly tested
  • hard to install
  • output quality… often questionable.
• Understanding/visualizing/massaging data is hard.
• Datasets continue to grow!

“Big data” biology is hard.

Page 17: 2014 11-13-sbsm032-reproducible research

We need great tools.

Page 18: 2014 11-13-sbsm032-reproducible research

Some sources of inspiration

Page 19: 2014 11-13-sbsm032-reproducible research
Page 20: 2014 11-13-sbsm032-reproducible research

arXiv:1210.0530v3 [cs.MS] 29 Nov 2012

Best Practices for Scientific Computing
Greg Wilson, D. A. Aruliah, C. Titus Brown, Neil P. Chue Hong, Matt Davis, Richard T. Guy, Steven H. D. Haddock, Katy Huff, Ian M. Mitchell, Mark D. Plumbley, Ben Waugh, Ethan P. White, Paul Wilson

Affiliations: Software Carpentry; University of Ontario Institute of Technology; Michigan State University; Software Sustainability Institute; Space Telescope Science Institute; University of Toronto; Monterey Bay Aquarium Research Institute; University of Wisconsin; University of British Columbia; Queen Mary University of London; University College London; Utah State University; University of Wisconsin.

Scientists spend an increasing amount of time building and using software. However, most scientists are never taught how to do this efficiently. As a result, many are unaware of tools and practices that would allow them to write more reliable and maintainable code with less effort. We describe a set of best practices for scientific software development that have solid foundations in research and experience, and that improve scientists’ productivity and the reliability of their software.

Software is as important to modern scientific research as telescopes and test tubes. From groups that work exclusively on computational problems, to traditional laboratory and field scientists, more and more of the daily operation of science revolves around computers. This includes the development of new algorithms, managing and analyzing the large amounts of data that are generated in single research projects, and combining disparate datasets to assess synthetic problems.

Scientists typically develop their own software for these purposes because doing so requires substantial domain-specific knowledge. As a result, recent studies have found that scientists typically spend 30% or more of their time developing software [19, 52]. However, 90% or more of them are primarily self-taught [19, 52], and therefore lack exposure to basic software development practices such as writing maintainable code, using version control and issue trackers, code reviews, unit testing, and task automation.

We believe that software is just another kind of experimental apparatus [63] and should be built, checked, and used as carefully as any physical apparatus. However, while most scientists are careful to validate their laboratory and field equipment, most do not know how reliable their software is [21, 20]. This can lead to serious errors impacting the central conclusions of published research [43]: recent high-profile retractions, technical comments, and corrections because of errors in computational methods include papers in Science [6], PNAS [39], the Journal of Molecular Biology [5], Ecology Letters [37, 8], the Journal of Mammalogy [33], and Hypertension [26].

In addition, because software is often used for more than a single project, and is often reused by other scientists, computing errors can have disproportional impacts on the scientific process. This type of cascading impact caused several prominent retractions when an error from another group’s code was not discovered until after publication [43]. As with bench experiments, not everything must be done to the most exacting standards; however, scientists need to be aware of best practices both to improve their own approaches and for reviewing computational work by others.

This paper describes a set of practices that are easy to adopt and have proven effective in many research settings. Our recommendations are based on several decades of collective experience both building scientific software and teaching computing to scientists [1, 65], reports from many other groups [22, 29, 30, 35, 41, 50, 51], guidelines for commercial and open source software development [61, 14], and on empirical studies of scientific computing [4, 31, 59, 57] and software development in general (summarized in [48]). None of these practices will guarantee efficient, error-free software development, but used in concert they will reduce the number of errors in scientific software, make it easier to reuse, and save the authors of the software time and effort that can be used for focusing on the underlying scientific questions.

1. Write programs for people, not computers.
Scientists writing software need to write code that both executes correctly and can be easily read and understood by other programmers (especially the author’s future self). If software cannot be easily read and understood, it is much more difficult to know that it is actually doing what it is intended to do. To be productive, software developers must therefore take several aspects of human cognition into account: in particular, that human working memory is limited, human pattern matching abilities are finely tuned, and human attention span is short [2, 23, 38, 3, 55].

First, a program should not require its readers to hold more than a handful of facts in memory at once (1.1). Human working memory can hold only a handful of items at a time, where each item is either a single fact or a “chunk” aggregating several facts [2, 23], so programs should limit the total number of items to be remembered to accomplish a task. The primary way to accomplish this is to break programs up into easily understood functions, each of which conducts a single, easily understood, task. This serves to make each piece of the program easier to understand in the same way that breaking up a scientific paper using sections and paragraphs makes it easier to read. For example, a function to calculate the area of a rectangle can be written to take four separate coordinates:

def rect_area(x1, y1, x2, y2):
    ...calculation...

or to take two points:

def rect_area(point1, point2):
    ...calculation...

The latter function is significantly easier for people to read and remember, while the former is likely to lead to errors, not


1. Write programs for people, not computers.
2. Automate repetitive tasks.
3. Use the computer to record history.
4. Make incremental changes.
5. Use version control.
6. Don’t repeat yourself (or others).
7. Plan for mistakes.
8. Optimize software only after it works correctly.
9. Document the design and purpose of code rather than its mechanics.
10. Conduct code reviews.

Page 21: 2014 11-13-sbsm032-reproducible research
Page 22: 2014 11-13-sbsm032-reproducible research
Page 23: 2014 11-13-sbsm032-reproducible research

Specific Approaches/Tools

• Planning for mistakes

• Automated testing

• Continuous integration

• Writing for people: use style guide

Page 24: 2014 11-13-sbsm032-reproducible research

Code for people: Use a style guide
• For R: http://r-pkgs.had.co.nz/style.html
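To illustrate the kind of rules such a guide covers (this snippet is ours, not from the deck; the file and column names are invented): meaningful lowercase names, <- for assignment, and spaces around operators.

### in R:
# harder to read:
d<-read.csv('ants.csv');hw2=d$headwidth*2

# easier to read:
ant_measurements  <- read.csv('ants.csv')
doubled_headwidth <- ant_measurements$headwidth * 2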

Page 25: 2014 11-13-sbsm032-reproducible research

R style guide extract

Page 26: 2014 11-13-sbsm032-reproducible research

Coding for people: Indent your code!

Programming better

• variable naming

• coding width: 100 characters

• indenting

• Follow conventions, e.g. “Google R Style”

• Versioning: DropBox & http://github.com/

• Automated testing

• “being able to use, understand and improve your code in 6 months & in 60 years” - paraphrasing Damian Conway

preprocess_snps <- function(snp_table, testing = FALSE) {
  if (testing) {
    # run a bunch of tests of extreme situations.
    # quit if a test gives a weird result.
  }
  # real part of the function.
}
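Ad-hoc checks like the testing flag above can also be complemented by a separate test file; a minimal sketch using the testthat package (the expectations and example data are invented, and assume preprocess_snps returns a filtered table):

### in R:
# install.packages("testthat")   # once
library(testthat)

test_that("preprocess_snps keeps the rows of a clean table", {
  clean <- data.frame(snp = c("rs1", "rs2"), freq = c(0.1, 0.4))
  expect_equal(nrow(preprocess_snps(clean)), 2)
})

test_that("preprocess_snps complains about an empty table", {
  expect_error(preprocess_snps(data.frame()))
})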


Page 27: 2014 11-13-sbsm032-reproducible research

Line length Strive to limit your code to 80 characters per line. This fits comfortably on a printed page with a reasonably sized font. If you find yourself running out of room, this is a good indication that you should encapsulate some of the work in a separate function.

R style guide extract

Hard to read (everything on one line):

ant_measurements <- read.table(file = '~/Downloads/Web/ant_measurements.txt', header=TRUE, sep='\t', col.names = c('colony', 'individual', 'headwidth', 'mass'))

Easier to read (one argument per line):

ant_measurements <- read.table(
  file      = '~/Downloads/Web/ant_measurements.txt',
  header    = TRUE,
  sep       = '\t',
  col.names = c('colony', 'individual', 'headwidth', 'mass')
)

Page 28: 2014 11-13-sbsm032-reproducible research

Code for people: Use a style guide
• For R: http://r-pkgs.had.co.nz/style.html
• For Ruby: https://github.com/bbatsov/ruby-style-guide

Automatically check your code:

install.packages("lint")   # once
library(lint)              # every time
lint("file_to_check.R")

Page 29: 2014 11-13-sbsm032-reproducible research

Eliminate redundancy
DRY: Don’t Repeat Yourself
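A small sketch of what DRY can look like in R, using the ant_measurements columns from the read.table example a few slides earlier (the helper function is ours):

### in R:
# repetitive:
mean_headwidth <- mean(ant_measurements$headwidth, na.rm = TRUE)
mean_mass      <- mean(ant_measurements$mass,      na.rm = TRUE)

# DRY: write it once, reuse it:
column_mean <- function(data, column) {
  mean(data[[column]], na.rm = TRUE)
}
mean_headwidth <- column_mean(ant_measurements, "headwidth")
mean_mass      <- column_mean(ant_measurements, "mass")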

Page 30: 2014 11-13-sbsm032-reproducible research

knitr (sweave)
Analyzing & reporting in a single file.

analysis.Rmd

### in R:
library(knitr)
knit("analysis.Rmd")                  # --> creates analysis.md

### in shell:
pandoc analysis.md -o analysis.pdf    # --> creates analysis.pdf

A minimal R Markdown example

I know the value of pi is 3.1416, and 2 times pi is 6.2832. To compile me type:

library(knitr); knit("minimal.Rmd")

A paragraph here. A code chunk below:

1+1

## [1] 2

.4-.7+.3 # what? it is not zero!

## [1] 5.551e-17

Graphics work too

library(ggplot2)

qplot(speed, dist, data = cars) + geom_smooth()

[Plot output: a scatterplot of dist against speed for the cars dataset, with a smoothed trend line.]

Figure 1: A scatterplot of cars
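For orientation, the R Markdown source behind output like the example above would look roughly like this (a sketch, not the actual minimal.Rmd):

I know the value of pi is `r pi`, and 2 times pi is `r 2 * pi`.

A paragraph here. A code chunk below:

```{r}
1 + 1
.4 - .7 + .3  # what? it is not zero!
```

Graphics work too:

```{r}
library(ggplot2)
qplot(speed, dist, data = cars) + geom_smooth()
```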

Page 31: 2014 11-13-sbsm032-reproducible research

Organize mindfully

Page 32: 2014 11-13-sbsm032-reproducible research

Education

A Quick Guide to Organizing Computational Biology Projects
William Stafford Noble (1,2)

1 Department of Genome Sciences, School of Medicine, University of Washington, Seattle, Washington, United States of America
2 Department of Computer Science and Engineering, University of Washington, Seattle, Washington, United States of America

Citation: Noble WS (2009) A Quick Guide to Organizing Computational Biology Projects. PLoS Comput Biol 5(7): e1000424. doi:10.1371/journal.pcbi.1000424

Introduction

Most bioinformatics coursework focuses on algorithms, with perhaps some components devoted to learning programming skills and learning how to use existing bioinformatics software. Unfortunately, for students who are preparing for a research career, this type of curriculum fails to address many of the day-to-day organizational challenges associated with performing computational experiments. In practice, the principles behind organizing and documenting computational experiments are often learned on the fly, and this learning is strongly influenced by personal predilections as well as by chance interactions with collaborators or colleagues.

The purpose of this article is to describe one good strategy for carrying out computational experiments. I will not describe profound issues such as how to formulate hypotheses, design experiments, or draw conclusions. Rather, I will focus on relatively mundane issues such as organizing files and directories and documenting progress. These issues are important because poor organizational choices can lead to significantly slower research progress. I do not claim that the strategies I outline here are optimal. These are simply the principles and practices that I have developed over 12 years of bioinformatics research, augmented with various suggestions from other researchers with whom I have discussed these issues.

Principles

The core guiding principle is simple: Someone unfamiliar with your project should be able to look at your computer files and understand in detail what you did and why. This “someone” could be any of a variety of people: someone who read your published article and wants to try to reproduce your work, a collaborator who wants to understand the details of your experiments, a future student working in your lab who wants to extend your work after you have moved on to a new job, your research advisor, who may be interested in understanding your work or who may be evaluating your research skills. Most commonly, however, that “someone” is you. A few months from now, you may not remember what you were up to when you created a particular set of files, or you may not remember what conclusions you drew. You will either have to then spend time reconstructing your previous experiments or lose whatever insights you gained from those experiments.

This leads to the second principle, which is actually more like a version of Murphy’s Law: Everything you do, you will probably have to do over again. Inevitably, you will discover some flaw in your initial preparation of the data being analyzed, or you will get access to new data, or you will decide that your parameterization of a particular model was not broad enough. This means that the experiment you did last week, or even the set of experiments you’ve been working on over the past month, will probably need to be redone. If you have organized and documented your work clearly, then repeating the experiment with the new data or the new parameterization will be much, much easier.

To see how these two principles are applied in practice, let’s begin by considering the organization of directories and files with respect to a particular project.

File and Directory Organization

When you begin a new project, you will need to decide upon some organizational structure for the relevant directories. It is generally a good idea to store all of the files relevant to one project under a common root directory. The exception to this rule is source code or scripts that are used in multiple projects. Each such program might have a project directory of its own.

Within a given project, I use a top-level organization that is logical, with chronological organization at the next level, and logical organization below that. A sample project, called msms, is shown in Figure 1. At the root of most of my projects, I have a data directory for storing fixed data sets, a results directory for tracking computational experiments performed on that data, a doc directory with one subdirectory per manuscript, and directories such as src for source code and bin for compiled binaries or scripts.

Within the data and results directories, it is often tempting to apply a similar, logical organization. For example, you may have two or three data sets against which you plan to benchmark your algorithms, so you could create one directory for each of them under data. In my experience, this approach is risky, because the logical structure of your final set of experiments may look drastically different from the form you initially designed. This is particularly true under the results directory, where you may not even know in advance what kinds of experiments you will need to perform. If you try to give your directories logical names, you may end up with a very long list of directories with names that, six months from now, you no longer know how to interpret.

Instead, I have found that organizing my data and results directories chronologically makes the most sense. Indeed, with this approach, the distinction between data and results may not be useful. Instead, one could imagine a top-level directory called something like experiments, with subdirectories with names like 2008-12-19. Optionally, the directory name might also include a word or two indicating the topic of the experiment therein. In practice, a single experiment will often require more than one day of work, and so you may end up working a few days or more before creating a new subdirectory. Later, when you or someone else wants to know what you did, the chronological structure of your work will be self-evident.

Below a single experiment directory, the organization of files and directories is logical, and depends upon the structure of your experiment. In many simple experiments, you can keep all of your files in the current directory. If you start creating lots of files, then you should introduce some directory structure to store files of different types. This directory structure will typically be generated automatically from a driver script, as discussed below.

The Lab Notebook

In parallel with this chronological directory structure, I find it useful to maintain a chronologically organized lab notebook. This is a document that resides in the root of the results directory and that records your progress in detail. Entries in the notebook should be dated, and they should be relatively verbose, with links or embedded images or tables displaying the results of the experiments that you performed. In addition to describing precisely what you did, the notebook should record your observations, conclusions, and ideas for future work. Particularly when an experiment turns out badly, it is tempting simply to link the final plot or table of results and start a new experiment. Before doing that, it is important to document how you know the experiment failed, since the interpretation of your results may not be obvious to someone else reading your lab notebook.

In addition to the primary text describing your experiments, it is often valuable to transcribe notes from conversations as well as e-mail text into the lab notebook. These types of entries provide a complete picture of the development of the project over time.

In practice, I ask members of my research group to put their lab notebooks online, behind password protection if necessary. When I meet with a member of my lab or a project team, we can refer to the online lab notebook, focusing on the current entry but scrolling up to previous entries as necessary. The URL can also be provided to remote collaborators to give them status updates on the project.

Note that if you would rather not create your own “home-brew” electronic notebook, several alternatives are available. For example, a variety of commercial software systems have been created to help scientists create and maintain electronic lab notebooks [1–3]. Furthermore, especially in the context of collaborations, storing the lab notebook on a wiki-based system or on a blog site may be appealing.

Figure 1. Directory structure for a sample project. Directory names are in large typeface, and filenames are in smaller typeface. Only a subset of the files are shown here. Note that the dates are formatted <year>-<month>-<day> so that they can be sorted in chronological order. The source code src/ms-analysis.c is compiled to create bin/ms-analysis and is documented in doc/ms-analysis.html. The README files in the data directories specify who downloaded the data files from what URL on what date. The driver script results/2009-01-15/runall automatically generates the three subdirectories split1, split2, and split3, corresponding to three cross-validation splits. The bin/parse-sqt.py script is called by both of the runall driver scripts. doi:10.1371/journal.pcbi.1000424.g001

In each results folder:
• script getResults.rb
• intermediates
• output
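A sketch of how a dated results directory in this scheme could be created and filled from R (the folder layout follows Noble's <year>-<month>-<day> convention; the file names and placeholder data are ours):

### in R:
results_dir <- file.path("results", format(Sys.Date(), "%Y-%m-%d"))
dir.create(results_dir, recursive = TRUE, showWarnings = FALSE)

# placeholder output table, just to show the pattern
summary_table <- data.frame(colony = c("A", "B"), mean_mass = c(1.2, 3.4))
write.csv(summary_table, file.path(results_dir, "summary.csv"), row.names = FALSE)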

Page 33: 2014 11-13-sbsm032-reproducible research

Track versions of everything

Page 34: 2014 11-13-sbsm032-reproducible research

Github: Facebook for code

Page 35: 2014 11-13-sbsm032-reproducible research
Page 36: 2014 11-13-sbsm032-reproducible research

Github: Facebook for code
• Easy versioning

• Random people use your stuff

• And find problems and fix and improve it!

• Greater impact / better planet

• Easily update

• Easily collaborate

• Identify trends

• Build online reputation

Demo

Page 37: 2014 11-13-sbsm032-reproducible research

Learn how: https://try.github.io/levels/1/challenges/1

Page 38: 2014 11-13-sbsm032-reproducible research

Programming languages

Page 39: 2014 11-13-sbsm032-reproducible research

Choosing a programming language (Good / Bad)

• Excel: quick & dirty / easy to make mistakes, doesn’t scale
• R: numbers, stats, genomics / programming
• Unix command-line (== shell == bash): can’t escape it, quick & dirty, HPC / programming, complicated things
• Java: 1990s, user interfaces / overcomplicated
• Perl: 1980s. Everything.
• Python: scripting, text / ugly
• Ruby: scripting, text
• Javascript/Node: scripting, flexibility (web & client), community / only little bio-stuff

Page 40: 2014 11-13-sbsm032-reproducible research

Ruby.
“Friends don’t let friends do Perl” - reddit user

### in PERL:
open INFILE, "my_file.txt";
while (defined ($line = <INFILE>)) {
  chomp($line);
  @letters = split(//, $line);
  @reverse_letters = reverse(@letters);
  $reverse_string = join("", @reverse_letters);
  print $reverse_string, "\n";
}

### in Ruby:
File.open("a").each { |line| puts line.chomp.reverse }

• example: “reverse each line in file”
• read the file; with each line:
  • remove the invisible “end of line” character
  • reverse the contents
  • print the reversed line

Page 41: 2014 11-13-sbsm032-reproducible research

More ruby examples.

5.times { puts "Hello world" }

# Sorting people
people_sorted_by_age = people.sort_by { |person| person.age }

Plus many tools for bio-data, e.g. check http://biogems.info

Page 42: 2014 11-13-sbsm032-reproducible research

Getting help.

• In real life: Make friends with people. Talk to them.

• Online:
  • Specific discussion mailing lists (e.g.: R, Stacks, bioruby, MAKER...)
  • Programming: http://stackoverflow.com
  • Bioinformatics: http://www.biostars.org
  • Sequencing-related: http://seqanswers.com
  • Stats: http://stats.stackexchange.com

• Codeschool!

Page 43: 2014 11-13-sbsm032-reproducible research
Page 44: 2014 11-13-sbsm032-reproducible research
Page 45: 2014 11-13-sbsm032-reproducible research

“Can you BLAST this for me?”

Page 46: 2014 11-13-sbsm032-reproducible research

• Once I wanted to set up a BLAST server.

Anurag Priyam, Mechanical engineering student, IIT Kharagpur

Aim: an open source idiot-proof web-interface for custom BLAST

“Sure, I can help you…”

Page 47: 2014 11-13-sbsm032-reproducible research

“Can you BLAST this for me?”

Antgenomes.org
SequenceServer: BLAST made easy

(well, we’re trying...)

Aim: An open source idiot-proof web-interface for custom BLAST

Page 48: 2014 11-13-sbsm032-reproducible research

Today: SequenceServer
Used in >200 labs

Page 49: 2014 11-13-sbsm032-reproducible research

• Once I wanted to set up a BLAST server.

Anurag Priyam, Mechanical engineering student, IIT Kharagpur

Aim: an open source idiot-proof web-interface for custom BLAST

“Sure, I can help you…”

Page 50: 2014 11-13-sbsm032-reproducible research
Page 51: 2014 11-13-sbsm032-reproducible research

xkcd