How to write bioinformatics software that people will use & cite
A/Prof Torsten Seemann
@torstenseemann
Bioinfosummer 2016 - Adelaide AU, Fri 2 Dec
Who am I?
Doherty Applied Microbial Genomics
Microbial genomics and bioinformatics
Public health and clinical microbiology
Before bioinformatics
● Undergraduate
○ Science / Engineering - Computer Science + Electrical Engineering
● Honours
○ Computer Science - Digital image compression
● PhD
○ Computer Science - Digital image processing
● Never studied any biology
An opportunity
First “fully Aussie” bacterial genome
● Leptospira hardjobovis str. L550
● 2 chromosomes
● 4 Mbp
● $1M project
● Sanger sequencing
● Led by Dieter Bulach
First Illumina instrument in Australia
● Dept Microbiology, Monash University, 2008
● 36 bp single end reads
● 2 weeks to run
● 2 lanes for 1.6 Mbp genome
Things have improved a bit since then
[Figure labels: Q32, 36 bp, Q20]
Why am I here?
Bioinformatics software and me
Installed >1000 packages manually
Authored >100 packages into Brew
Written and maintain >10 packages
How to get a bioinformatics headache
1. See tweet about new published tool
2. Read abstract - sounds awesome!
3. Fail to find link to source code - eventually Google it
4. Attempt to compile and install it
5. Google for 30 min for fixes
6. Finally get it built
7. Run it on tiny data set
8. Get a vague error
9. Delete and never revisit it again
Should I stay for this talk?
YES: It will help you write good tools
YES: It will help you identify bad tools
Should you write a tool?
Should you write a new tool?
● NO
○ It already exists
○ You are unable to maintain it
○ You won’t really use it
● YES
○ YOU need the tool
○ YOU will use the tool
○ YOU want others to use the tool
○ Desire to give back to the community
Eating my own dog food
Lessons from the Prokka experience
● Nearly all feedback is positive
● People all over the world are grateful
● Warm fuzzy feeling inside
● Increase your public profile
● But maintenance burden and guilt
Discoverability
Choosing a home base
University page
Personal home page
Naming
● Try to be unique
○ Google to check for conflicts
○ Consider how internationals will pronounce it
○ Be creative!
● Avoid dodgy acronyms
○ Try not to win a JABBA Award
○ “Just Another Bogus Bioinformatics Acronym”
Don’t be this person
First impressions count
● Keep It Simple Stupid
● First page of documentation
○ What does it do?
○ How do I install it?
○ How do I run it?
● Try to keep it in one place
○ Otherwise it becomes inconsistent or gets missed
Usability
A lesson from history
Print something useful if no parameters
% biotool
Please use --help for instructions
Always have a --help flag
% biotool -h
% biotool --help
Usage: biotool [options] seq.fa
  --help      Show this help
  --version   Print version and exit
  --top N     Keep top N sequences
Always have a --version flag
% biotool -v
% biotool -V
% biotool --version
biotool 1.3
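The three behaviours above (a hint when run with no parameters, a `--help` flag, and a `--version` flag) can be sketched with Python's `argparse`. This is only an illustration: `biotool` is the hypothetical tool from the slides, and the one-line description, the `build_parser` and `main` names, and the choice of exit code 2 for the no-parameter case are assumptions.

```python
import argparse
import sys


def build_parser():
    """Build the parser for the hypothetical 'biotool' from the slides."""
    parser = argparse.ArgumentParser(
        prog="biotool",
        description="Keep the top N sequences from a FASTA file.",  # assumed wording
    )
    # argparse registers -h/--help on its own; only --version needs adding.
    parser.add_argument(
        "-v", "-V", "--version",
        action="version",
        version="%(prog)s 1.3",
        help="print version and exit",
    )
    parser.add_argument(
        "--top", type=int, metavar="N",
        help="keep top N sequences (assumption: keep all if omitted)",
    )
    parser.add_argument("fasta", metavar="seq.fa", help="input FASTA file")
    return parser


def main(argv):
    """Parse argv (without the program name) and return an exit code."""
    if not argv:
        # No parameters: point the user at --help instead of failing silently.
        print("Please use --help for instructions", file=sys.stderr)
        return 2
    args = build_parser().parse_args(argv)
    # Placeholder for the real work of the tool.
    print(f"Would keep top {args.top} sequences from {args.fasta}", file=sys.stderr)
    return 0
```

In a real script, the last line would typically be `sys.exit(main(sys.argv[1:]))`, so the return value becomes the process exit code.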
Always raise an error when things go wrong
% biotool seq.fa
ERROR: can not open file ‘seq.fa’
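A minimal sketch of this behaviour in Python, assuming a hypothetical helper named `read_fasta_text`: the failure is reported on stderr in plain language and the process exits with a non-zero code, rather than carrying on or dumping a stack trace.

```python
import sys


def read_fasta_text(path):
    """Return the contents of path, or exit with a clear error message."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError as err:
        # Report on stderr and exit non-zero so shells and pipelines notice.
        print(f"ERROR: can not open file '{path}': {err.strerror}", file=sys.stderr)
        sys.exit(1)
```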
Check that dependencies are installed
% biotool seq.fa
Checking BLAST... ok
Checking SAMtools... NOT FOUND!
Please install ‘samtools’ and add it to your PATH.
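One way to sketch such a check in Python is with `shutil.which`, which looks an executable up on the PATH. The function name `check_dependencies` is hypothetical, and which executables to look for depends on the tool.

```python
import shutil
import sys


def check_dependencies(tools):
    """Report whether each tool is on PATH and return the missing ones."""
    missing = []
    for tool in tools:
        found = shutil.which(tool) is not None
        status = "ok" if found else "NOT FOUND!"
        print(f"Checking {tool}... {status}", file=sys.stderr)
        if not found:
            missing.append(tool)
    return missing
```

A tool could call `check_dependencies(["blastn", "samtools"])` at startup and, if the returned list is not empty, print an install hint for each missing name and exit with a non-zero code.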
Always let users control output filenames
% biotool seq.fa
Processing ‘seq.fa’
Wrote result to ‘filt.seq.fa.out’
# ARGH!
% biotool --out seq.filt.fa
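A sketch of user-controlled output in Python, under the assumption that the tool writes to stdout unless `--out` is given (the `-` convention for stdout and the helper name `open_output` are illustrative choices, not taken from the slides):

```python
import argparse
import contextlib
import sys


def build_parser():
    """Parser with an --out option; the default '-' means stdout."""
    parser = argparse.ArgumentParser(prog="biotool")
    parser.add_argument("--out", metavar="FILE", default="-",
                        help="output file (default: '-', meaning stdout)")
    parser.add_argument("fasta", metavar="seq.fa", help="input FASTA file")
    return parser


@contextlib.contextmanager
def open_output(path):
    """Yield a writable handle: stdout for '-', otherwise the named file."""
    if path == "-":
        yield sys.stdout
    else:
        with open(path, "w") as handle:
            yield handle
```

The tool then never invents a filename such as ‘filt.seq.fa.out’ on the user's behalf: results go wherever the user asked, or to stdout where they can be redirected.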
KISS - run with minimum parameters
% biotool seq.fa
ERROR: missing -x parameter
% biotool -x 3 seq.fa
ERROR: missing -y parameter
% biotool -x 3 -y 7 seq.fa
ERROR: need -n name
# ARGH!
Standards
Use the standard getopt interface
Short options ( -h ) and long options ( --help )
● C #include <getopt.h>
● C++ boost::program_options
● Python import argparse
● Perl use Getopt::Long
● R library(argparse)
Command line interface
Unix exit codes
● A non-negative integer
● Loose standards
○ 0 = success
○ 1 = general failure
○ 2 = error with command line
○ 3..127 = user defined specific failures
● Result is in the shell $? variable
Accessing exit codes in the shell
% ls /tmp/fake
ls: cannot access /tmp/fake
% echo $?
1
% ls /proc/cpuinfo
/proc/cpuinfo
% echo $?
0
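The same exit codes are visible to any program that launches a tool, not just to the shell. As a rough sketch, Python's `subprocess.run` exposes the code as `returncode`. The constant names and the `describe` helper are hypothetical and simply mirror the loose convention listed above.

```python
import subprocess
import sys

# Exit codes following the loose convention described above.
EXIT_OK = 0
EXIT_FAILURE = 1
EXIT_USAGE = 2


def exit_code(command):
    """Run command (a list of arguments) and return its exit code."""
    return subprocess.run(command).returncode


def describe(code):
    """Translate an exit code into a short human-readable label."""
    labels = {
        EXIT_OK: "success",
        EXIT_FAILURE: "general failure",
        EXIT_USAGE: "error with command line",
    }
    return labels.get(code, "tool-specific failure")
```

For example, `describe(exit_code(["biotool", "seq.fa"]))` would label the result of running the hypothetical `biotool`. This is why a tool that always exits 0 is hard to use safely in a pipeline.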
Using stdin, stderr and stdout
● Stdin (0): command < input
● Stdout (1): command > output
● Stderr (2): command 2> errors
● All: command < input > output 2> errors
● Allows piping!
sort input | command1 1> output 2> errors
This makes your tool useful in streaming
% zcat seq.fastq.gz |
    cutadapt -a adapters.fa |
    qualtrim -Q 20 |
    bwa mem -t 8 ref.fa |
    samtools sort --threads 4 > seq.bam
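To take part in a pipeline like this, a tool reads records from stdin, writes results to stdout, and keeps its diagnostics on stderr. The sketch below is a deliberately trivial filter, not any real tool: `uppercase_fasta` is a hypothetical function that upper-cases FASTA sequence lines.

```python
import sys


def uppercase_fasta(infile, outfile, logfile):
    """Copy FASTA from infile to outfile, upper-casing sequence lines."""
    records = 0
    for line in infile:  # stream line by line; never load the whole input
        if line.startswith(">"):
            records += 1
            outfile.write(line)
        else:
            outfile.write(line.upper())
    # Diagnostics go to the log stream so they never contaminate piped data.
    print(f"Processed {records} records", file=logfile)
    return 0
```

Called as `uppercase_fasta(sys.stdin, sys.stdout, sys.stderr)`, the function can sit between any two stages of a pipe. Passing the streams in as arguments also makes it easy to test with in-memory files.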
Use standards compliant files *
● Feature coordinates
○ BED, GFF
● Columnar data (put headings!)
○ TSV
○ CSV
● Structured data
○ JSON
○ YAML
* XML excepted
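As a small illustration of the columnar and structured cases, Python's standard `csv` and `json` modules can emit TSV with a heading row and JSON without any hand-rolled formatting. The column names and the helper names `write_tsv` and `to_json` are made up for the example.

```python
import csv
import json


def write_tsv(rows, fieldnames, handle):
    """Write rows (a list of dicts) as TSV, starting with a heading line."""
    writer = csv.DictWriter(handle, fieldnames=fieldnames,
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()  # "put headings!"
    writer.writerows(rows)


def to_json(rows):
    """Serialise the same rows as JSON for structured consumers."""
    return json.dumps(rows, indent=2)
```

Calling `write_tsv` with the rows `[{"seq_id": "contig1", "length": 1200}]` and the field names `["seq_id", "length"]` writes a `seq_id<TAB>length` heading line followed by one data line.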
Installation
Keeping your audience
“Each equation in a book will halve your audience”
“Each difficulty encountered in installation will halve your number of users”
Traditional systems level packaging
● Debian / DEB
apt-get install blast
dpkg -i blast-2.2.5-amd64.deb
● Redhat / RPM
yum install blast
rpm -i blast-2.2.5-x86_64.rpm
● Various others
Cross platform solutions: Linux, Mac, Windows
● Brew
brew install blast
● Conda
conda install blast
● Others
○ GUIX, ...
○ Docker, AMI images
Language specific repositories
● Python - PIP
pip install ariba
● Perl - CPAN
cpanm Bio::Roary
● R - CRAN
install.packages("edgeseq3")
Marketing
Publish it
● Preprint archive
○ PeerJ, bioRxiv
● Method focussed journal
○ Bioinformatics, BMC Bioinformatics
● Software focussed journal
○ Journal of Open Source Software
Plug it
● Twitter
○ Ask someone popular you know to retweet it
● Blog
○ Start a general blog and slot
● Conferences
○ Tell people about it
Support your users
● Reply to emails
● Monitor your “Issues” web site
● Monitor Biostars and SeqAnswers
● Have a mailing list
● Update your documentation
● Fix bugs
Conclusions
Take home messages
● Make it as painless as possible to install
● Keep documentation clear and simple
● Get people to use it before you publish
● People are not judging your coding skills
● But they will curse you if you waste their time
● Most users are grateful - leads to free beer
● A good tool is worth much more than a paper
Acknowledgments
● Gary Glonek
● David Adelson
● Bernard Pope - VLSCI
● Dieter Bulach - VLSCI
● Anna Syme - VLSCI
● David Powell - Monash University
● Anders Goncalves da Silva - University of Melbourne
References
1. https://gigascience.biomedcentral.com/articles/10.1186/2047-217X-2-15
2. http://berniepope.id.au/scientific_software_etiquette.html
3. http://thegenomefactory.blogspot.com.au/
The end.