Reproducible Research Karl Broman Biostatistics & Medical Informatics, UW–Madison kbroman.org github.com/kbroman @kwbroman Slides: bit.ly/Memphis2015b

Reproducible Research - Biostatistics and Medical Informatics › ~kbroman › presentations › repro_rese… · Reproducible Research Author: Karl Broman Created Date: 11/3/2015



Reproducible Research

Karl Broman

Biostatistics & Medical Informatics, UW–Madison

kbroman.org

github.com/kbroman

@kwbroman

Slides: bit.ly/Memphis2015b


Karl -- this is very interesting, however you used an old version of the data (n=143 rather than n=226).

I'm really sorry you did all that work on the incomplete dataset.

Bruce


The results in Table 1 don't seem to correspond to those in Figure 2.


In what order do I run these scripts?


Where did we get this data file?


Why did I omit those samples?


How did I make that figure?


"Your script is now giving an error."


"The attached is similar to the code we used."



Reproducible

vs.

Replicable


Reproducible

vs.

Correct


Levels of quality

▶ Are the tables and figures reproducible from the code and data?

▶ Does the code actually do what you think it does?

▶ In addition to what was done, is it clear why it was done?

(e.g., how were parameter settings chosen?)

▶ Can the code be used for other data?

▶ Can you extend the code to do other things?


Steps toward reproducible research

kbroman.org/steps2rr


1. Everything with a script

If you do something once, you'll do it 1000 times.


2. Organize your data & code

File organization and naming are powerful weapons against chaos.

– Jenny Bryan


Your closest collaborator is you six months ago, but you don't reply to emails.

(paraphrasing Mark Holder)


RawData/        Notes/
DerivedData/    Refs/

Python/         ReadMe.txt
R/              ToDo.txt
Ruby/           Makefile
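One way to make this layout repeatable is to create it with a script. The following is a minimal sketch, not part of the original slides: the directory and file names come from the slide, while the function name `create_project` is a hypothetical choice for illustration.

```python
from pathlib import Path

# Names taken from the slide; the function name is hypothetical.
DIRS = ["RawData", "DerivedData", "Notes", "Refs", "Python", "R", "Ruby"]
FILES = ["ReadMe.txt", "ToDo.txt", "Makefile"]


def create_project(root):
    """Create the project skeleton under root, leaving existing files alone."""
    root = Path(root)
    for name in DIRS:
        # parents=True also creates root itself if it is missing
        (root / name).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        # touch() creates an empty file and does not overwrite existing content
        (root / name).touch(exist_ok=True)
    return root
```

Calling `create_project("MyProject")` twice is harmless: existing directories and files are kept as they are.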


3. Automate the process (GNU Make)

R/analysis.html: R/analysis.Rmd Data/cleandata.csv
	cd R;R -e "rmarkdown::render('analysis.Rmd')"

Data/cleandata.csv: R/prepData.R RawData/rawdata.csv
	cd R;R CMD BATCH prepData.R

RawData/rawdata.csv: Python/xls2csv.py RawData/rawdata.xls
	Python/xls2csv.py RawData/rawdata.xls > RawData/rawdata.csv
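Each rule above names a target, the files it depends on, and the command that rebuilds it. Make reruns a command only when its target is missing or older than one of its prerequisites. The sketch below illustrates that timestamp check in Python. It is a simplification, not how GNU Make is implemented: the function name `needs_rebuild` is hypothetical, and it assumes every prerequisite already exists, whereas Make would first try to build a missing one.

```python
import os


def needs_rebuild(target, prerequisites):
    """Return True if target is missing or older than any prerequisite."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    # Any prerequisite modified after the target makes the target stale.
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)
```

With the second rule, for example, `needs_rebuild("Data/cleandata.csv", ["R/prepData.R", "RawData/rawdata.csv"])` would be true right after editing `prepData.R`.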


4. Turn scripts into reproducible reports


5. Turn repeated code into functions

# Python
def read_genotypes(filename):
    "Read matrix of genotype data"

# R
plot_genotypes <- function(genotypes, ...)
{
}
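The slide shows only the function signatures. As an illustration of what a filled-in version might look like, here is a minimal sketch of `read_genotypes`. The file format is an assumption, since the slides do not specify one: a CSV file whose header row holds marker names and whose first column holds individual IDs. The return value (three plain lists) is likewise a choice made for this example.

```python
import csv


def read_genotypes(filename):
    "Read matrix of genotype data"
    # Assumed format (not from the slides): CSV with a header row of marker
    # names and one row per individual, with the ID in the first column.
    with open(filename, newline="") as f:
        rows = [row for row in csv.reader(f) if row]  # skip blank lines
    if not rows:
        raise ValueError(f"{filename} contains no data")
    markers = rows[0][1:]
    ids = [row[0] for row in rows[1:]]
    genotypes = [row[1:] for row in rows[1:]]
    return ids, markers, genotypes
```

Once the parsing lives in one function, every script that needs the data calls it, and a change in file format is fixed in one place.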


6. Create a package/module

Don't repeat yourself


7. Use version control (git/GitHub)


8. License your software

Pick a license, any license

– Jeff Atwood


Summary

1. Everything with a script

2. Organize your data & code

3. Automate the process (GNU Make)

4. Turn scripts into reproducible reports

5. Turn repeated code into functions

6. Create a package/module

7. Use version control (git/GitHub)

8. Pick a license, any license


Slides: bit.ly/Memphis2015b

kbroman.org

github.com/kbroman

@kwbroman
