How to write bioinformatics software that people will use & cite A/Prof Torsten Seemann @torstenseemann Bioinfosummer 2016 - Adelaide AU, Fri 2 Dec

How to write bioinformatics software that people will use & cite

A/Prof Torsten Seemann

@torstenseemann

Bioinfosummer 2016 - Adelaide AU, Fri 2 Dec

Who am I?

Doherty Applied Microbial Genomics

Microbial genomics and bioinformatics

Public health and clinical microbiology

Before bioinformatics

● Undergraduate
  ○ Science / Engineering - Computer Science + Electrical Engineering

● Honours
  ○ Computer Science - Digital image compression

● PhD
  ○ Computer Science - Digital image processing

● Never studied any biology

An opportunity

First “fully Aussie” bacterial genome

● Leptospira hardjobovis str. L550
● 2 chromosomes
● 4 Mbp

● $1M project
● Sanger sequencing

● Led by Dieter Bulach

First Illumina instrument in Australia

● Dept Microbiology, Monash University, 2008

● 36 bp single end reads

● 2 weeks to run

● 2 lanes for 1.6 Mbp genome

Things have improved a bit since then

Q32

36 bp

Q20

Why am I here?

Bioinformatics software and me

Installed >1000 packages manually

Authored >100 packages into Brew

Written and maintain >10 packages

How to get a bioinformatics headache

1. See tweet about new published tool
2. Read abstract - sounds awesome!
3. Fail to find link to source code - eventually Google it
4. Attempt to compile and install it
5. Google for 30 min for fixes
6. Finally get it built
7. Run it on tiny data set
8. Get a vague error
9. Delete and never revisit it again

Should I stay for this talk?

YES - it will help you write good tools

YES - it will help you identify bad tools

Should you write a tool?

Should you write a new tool?

● NO
  ○ It already exists
  ○ You are unable to maintain it
  ○ You won’t really use it

● YES
  ○ YOU need the tool
  ○ YOU will use the tool
  ○ YOU want others to use the tool
  ○ Desire to give back to the community

Eating my own dog food

Lessons from the Prokka experience

● Nearly all feedback is positive

● People all over the world are grateful

● Warm fuzzy feeling inside

● Increase your public profile

● But maintenance burden and guilt

Discoverability

Choosing a home base

University page

Personal home page

Naming

● Try to be unique
  ○ Google to check for conflicts
  ○ Consider how internationals will pronounce it
  ○ Be creative!

● Avoid dodgy acronyms
  ○ Try not to win a JABBA Award
  ○ “Just Another Bogus Bioinformatics Acronym”

Don’t be this person

First impressions count

● Keep It Simple Stupid

● First page of documentation
  ○ What does it do?
  ○ How do I install it?
  ○ How do I run it?

● Try to keep in one place
  ○ Otherwise becomes inconsistent or missed

Usability

A lesson from history

Print something useful if no parameters

% biotool

Please use --help for instructions
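
A minimal Python sketch of this behaviour, assuming the slide's hypothetical `biotool`. The function name `main`, the `err` parameter and the choice of exit code 2 are illustrative assumptions, not from the talk.

```python
import sys

def main(argv, err=sys.stderr):
    """Return an exit code; print a hint when called with no arguments."""
    if not argv:
        # Nothing to do: point the user at the help instead of failing silently.
        print("Please use --help for instructions", file=err)
        return 2  # assumed convention: 2 = error with the command line
    # ... the real work of the tool would go here (hypothetical) ...
    return 0
```

A script would typically end with `sys.exit(main(sys.argv[1:]))`.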

Always have a --help flag

% biotool -h

% biotool --help

Usage: biotool [options] seq.fa
  --help       Show this help
  --version    Print version and exit
  --top N      Keep top N sequences
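
One way to get this for free in Python is `argparse`, which generates `-h` / `--help` from the option definitions. This is a sketch only: the description text and the default of 10 for `--top` are assumptions made for illustration.

```python
import argparse

def help_parser():
    """Build a parser for the hypothetical 'biotool' from the slide."""
    parser = argparse.ArgumentParser(
        prog="biotool",
        description="Keep the top N sequences from a FASTA file.",  # assumed wording
    )
    # argparse adds -h / --help automatically; no extra code is needed.
    parser.add_argument("fasta", metavar="seq.fa", help="input FASTA file")
    parser.add_argument("--top", type=int, metavar="N", default=10,
                        help="keep top N sequences (default: %(default)s)")
    return parser
```

Calling `help_parser().parse_args()` in the script then handles `biotool -h` and `biotool --help` without further work.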

Always have a --version flag

% biotool -v

% biotool -V

% biotool --version

biotool 1.3
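
In Python, `argparse` has a built-in `version` action for this. A sketch, assuming the tool name and the version string `1.3` from the example above; the names `VERSION` and `version_parser` are illustrative.

```python
import argparse

VERSION = "1.3"  # version string taken from the slide's example output

def version_parser():
    """Parser for a hypothetical 'biotool' with a version flag."""
    parser = argparse.ArgumentParser(prog="biotool")
    # action="version" prints the string and exits with status 0.
    parser.add_argument("-V", "--version", action="version",
                        version=f"%(prog)s {VERSION}")
    return parser
```

Keeping the version in a single constant makes it harder for the code, the documentation and the release tag to drift apart.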

Always raise an error when things go wrong

% biotool seq.fa

ERROR: can not open file ‘seq.fa’
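
A hedged Python sketch of reporting a clear error rather than a bare traceback. The function name `read_fasta_text` is hypothetical, and the message also includes the operating system's reason, which goes slightly beyond the slide.

```python
import sys

def read_fasta_text(path, err=sys.stderr):
    """Return the file contents, or None after reporting a clear error."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError as exc:
        # Say what failed and why, on stderr, so pipelines and users notice.
        print(f"ERROR: cannot open file '{path}': {exc.strerror}", file=err)
        return None
```

The caller should then exit with a non-zero status when `None` comes back.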

Check that dependencies are installed

% biotool seq.fa

Checking BLAST... ok
Checking SAMtools... NOT FOUND!

Please install ‘samtools’ and add it to your PATH.
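
A small Python sketch of such a check using `shutil.which`, which looks a command up on the PATH. The function names and the `report` parameter are illustrative assumptions; a real tool would pass executable names such as `blastn` or `samtools`.

```python
import shutil

def missing_tools(names):
    """Return the tools from `names` that are not found on the PATH."""
    return [name for name in names if shutil.which(name) is None]

def check_dependencies(names, report=print):
    """Report each tool's status; return True only if all were found."""
    missing = set(missing_tools(names))
    for name in names:
        status = "NOT FOUND!" if name in missing else "ok"
        report(f"Checking {name}... {status}")
    for name in sorted(missing):
        report(f"Please install '{name}' and add it to your PATH.")
    return not missing
```

Running the check at start-up means the user hears about a missing dependency immediately, not after an hour of computation.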

Always let users control output filenames

% biotool seq.fa

Processing ‘seq.fa’
Wrote result to ‘filt.seq.fa.out’

# ARGH!

% biotool --out seq.filt.fa
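
A sketch of an `--out` option in Python. The short form `-o`, the use of `-` to mean stdout, and both function names are assumptions for illustration only.

```python
import argparse
import sys

def open_output(path):
    """Return a writable handle: stdout for '-', otherwise the named file."""
    # Note: the caller should not close the handle when it is sys.stdout.
    return sys.stdout if path == "-" else open(path, "w")

def out_parser():
    """Parser for a hypothetical 'biotool' whose output location is user-chosen."""
    parser = argparse.ArgumentParser(prog="biotool")
    parser.add_argument("fasta", help="input FASTA file")
    # The user, not the tool, decides where the result goes.
    parser.add_argument("-o", "--out", default="-", metavar="FILE",
                        help="output file (default: stdout)")
    return parser
```

Defaulting to stdout also keeps the tool usable in a pipe.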

KISS - run with minimum parameters

% biotool seq.fa
ERROR: missing -x parameter

% biotool -x 3 seq.fa
ERROR: missing -y parameter

% biotool -x 3 -y 7 seq.fa
ERROR: need -n name

# ARGH!
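
One way to avoid this is to give every setting a default, deriving it from the input where possible. A Python sketch: the values 3 and 7 are placeholders borrowed from the example above, and `resolve_settings` is a hypothetical helper.

```python
from pathlib import Path

def resolve_settings(fasta, x=None, y=None, name=None):
    """Fill in any setting the user left out with a sensible default."""
    return {
        "fasta": fasta,
        "x": 3 if x is None else x,  # placeholder default
        "y": 7 if y is None else y,  # placeholder default
        # Derive a name from the input file rather than demanding one.
        "name": Path(fasta).stem if name is None else name,
    }
```

With this, `biotool seq.fa` runs straight away, and the options remain available to users who need them.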

Standards

Command line interface

Use the standard getopt interface
Short options ( -h ) and long options ( --help )

● C         #include <getopt.h>
● C++       boost::program_options
● Python    import argparse
● Perl      use Getopt::Long
● R         library(argparse)

Unix exit codes

● A small non-negative integer (0 to 255)

● Loose standards
  ○ 0 = success
  ○ 1 = general failure
  ○ 2 = error with command line
  ○ 3..127 = user defined specific failures

● Result is in the shell $? variable
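
A Python sketch of these conventions. The constant names and the meaning given to code 3 (empty input) are assumptions made for illustration; only the values 0, 1 and 2 come from the list above.

```python
import os

EXIT_OK = 0         # success
EXIT_FAILURE = 1    # general failure
EXIT_USAGE = 2      # error with the command line
EXIT_BAD_INPUT = 3  # hypothetical tool-specific code from the 3..127 range

def run(argv):
    """Return the exit code a hypothetical 'biotool' would finish with."""
    if len(argv) != 1:
        return EXIT_USAGE
    if not os.path.exists(argv[0]):
        return EXIT_FAILURE
    if os.path.getsize(argv[0]) == 0:
        return EXIT_BAD_INPUT
    return EXIT_OK
```

The script would end with `sys.exit(run(sys.argv[1:]))` so the code reaches the shell.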

Accessing exit codes in the shell

% ls /tmp/fake
ls: cannot access /tmp/fake
% echo $?
1

% ls /proc/cpuinfo
/proc/cpuinfo
% echo $?
0
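
The same exit code is what a wrapper script or pipeline manager sees. A minimal Python sketch using the standard `subprocess` module; the helper name `exit_code_of` is an assumption.

```python
import subprocess

def exit_code_of(command):
    """Run a command (a list of strings) and return its exit code."""
    result = subprocess.run(command,
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode  # same value the shell exposes as $?
```

For example, `exit_code_of(["ls", "/tmp/fake"])` is non-zero when the path does not exist, so the caller can stop instead of carrying on with missing data.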

Using stdin, stderr and stdout

● Stdin (0)    command < input
● Stdout (1)   command > output
● Stderr (2)   command 2> errors

● All          command < input > output 2> errors

● Allows piping!
  sort input | command1 1> output 2> errors
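
A Python sketch of a tool that respects the three streams: data in on stdin, results out on stdout, messages on stderr. The task itself (upper-casing FASTA sequence lines) and the function name are invented purely for illustration.

```python
import sys

def uppercase_sequences(lines, out=sys.stdout, log=sys.stderr):
    """Copy FASTA lines to `out`, upper-casing sequence lines; log to `log`."""
    count = 0
    for line in lines:
        if line.startswith(">"):
            count += 1
            out.write(line)          # header lines pass through unchanged
        else:
            out.write(line.upper())  # results go to stdout
    # Progress and diagnostics go to stderr so they never pollute the output.
    print(f"Processed {count} sequences", file=log)
    return count
```

Calling `uppercase_sequences(sys.stdin)` makes the script usable as `command < input > output 2> errors`.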

This makes your tool useful in streaming

% zcat seq.fastq.gz |
    cutadapt -a adapters.fa |
    qualtrim -Q 20 |
    bwa mem -t 8 ref.fa |
    samtools sort --threads 4 > seq.bam
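
To sit in the middle of a pipe like this, a tool should process records as they arrive and not load the whole file first. A Python sketch of a lazy FASTQ filter, assuming simple four-line records; the function names and the 36 bp default are illustrative assumptions.

```python
import itertools

def fastq_records(lines):
    """Yield (header, seq, plus, qual) tuples from an iterable of lines."""
    it = iter(lines)
    while True:
        record = tuple(itertools.islice(it, 4))
        if len(record) < 4:
            return  # end of input (or a truncated final record)
        yield record

def drop_short_reads(lines, min_len=36):
    """Lazily yield the lines of reads with at least `min_len` bases."""
    for header, seq, plus, qual in fastq_records(lines):
        if len(seq.rstrip("\n")) >= min_len:
            yield from (header, seq, plus, qual)
```

A script would use it as `sys.stdout.writelines(drop_short_reads(sys.stdin))`, holding only one read in memory at a time.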

Use standards compliant files *

● Feature coordinates
  ○ BED, GFF

● Columnar data (put headings!)
  ○ TSV
  ○ CSV

● Structured data
  ○ JSON
  ○ YAML

* XML excepted
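
A short Python sketch of writing columnar data as TSV with a heading line, and structured data as JSON, using only the standard library. The helper names and the example column names are assumptions.

```python
import csv
import io
import json

def to_tsv(rows, columns):
    """Render a list of dicts as TSV text, starting with a heading line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()  # put headings!
    writer.writerows(rows)
    return buf.getvalue()

def to_json(rows):
    """Render the same rows as JSON for structured consumers."""
    return json.dumps(rows, indent=2, sort_keys=True)
```

For example, `to_tsv([{"seq": "contig1", "length": 1200}], ["seq", "length"])` gives a two-line table that spreadsheet software, R and pandas can all read without custom parsing.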

Installation

Keeping your audience

“Each equation in a book will halve your audience”

“Each difficulty encountered in installation will halve your number of users”

Traditional systems level packaging

● Debian / DEB
  apt-get install blast
  dpkg -i blast-2.2.5-amd64.deb

● Redhat / RPM
  yum install blast
  rpm -i blast-2.2.5-x86_64.rpm

● Various others

Cross platform solutions: Linux, Mac, Windows

● Brew
  brew install blast

● Conda
  conda install blast

● Others
  ○ GUIX, ...
  ○ Docker, AMI images

Language specific repositories

● Python - PyPI
  pip install ariba

● Perl - CPAN
  cpanm Bio::Roary

● R - CRAN
  install.packages("edgeseq3")

Marketing

Publish it

● Preprint archive
  ○ PeerJ, bioRxiv

● Method focussed journal
  ○ Bioinformatics, BMC Bioinformatics

● Software focussed journal
  ○ Journal of Open Source Software

Plug it

● Twitter
  ○ Ask someone popular you know to retweet it

● Blog
  ○ Start a general blog and slot

● Conferences
  ○ Tell people about it

Support your users

● Reply to emails

● Monitor your “Issues” web site

● Monitor Biostars and SeqAnswers

● Have a mailing list

● Update your documentation

● Fix bugs

Conclusions

Take home messages

● Make it as painless as possible to install

● Keep documentation clear and simple

● Get people to use it before you publish

● People are not judging your coding skills

● But they will curse you if you waste their time

● Most users are grateful - leads to free beer

● A good tool is worth much more than a paper

Acknowledgments

● Gary Glonek
● David Adelson

● Bernard Pope - VLSCI
● Dieter Bulach - VLSCI
● Anna Syme - VLSCI
● David Powell - Monash University
● Anders Goncalves da Silva - University of Melbourne

The end.