How to write bioinformatics software people will use and cite - t.seemann - fri 2 dec - bis 2016 -...


How to write bioinformatics software that people will use & cite

A/Prof Torsten Seemann

@torstenseemann

Bioinfosummer 2016 - Adelaide AU, Fri 2 Dec

Who am I ?

Doherty Applied Microbial Genomics

Microbial genomics and bioinformatics

Public health and clinical microbiology

Before bioinformatics

● Undergraduate
  ○ Science / Engineering - Computer Science + Electrical Engineering

● Honours
  ○ Computer Science - Digital image compression

● PhD
  ○ Computer Science - Digital image processing

● Never studied any biology

An opportunity

First “fully Aussie” bacterial genome

● Leptospira hardjobovis str. L550
● 2 chromosomes
● 4 Mbp

● $1M project
● Sanger sequencing

● Led by Dieter Bulach

First Illumina instrument in Australia

● Dept Microbiology, Monash University, 2008

● 36 bp single end reads

● 2 weeks to run

● 2 lanes for 1.6 Mbp genome

Things have improved a bit since then

[Figure: labels Q32, 36 bp, Q20]

Why am I here?

Bioinformatics software and me

Installed >1000 packages manually

Authored >100 packages into Brew

Written and maintain >10 packages

How to get a bioinformatics headache

1. See tweet about new published tool
2. Read abstract - sounds awesome!
3. Fail to find link to source code - eventually Google it
4. Attempt to compile and install it
5. Google for 30 min for fixes
6. Finally get it built
7. Run it on tiny data set
8. Get a vague error
9. Delete and never revisit it again

Should I stay for this talk ?

YES - It will help you write good tools

YES - It will help you identify bad tools

Should you write a tool?

Should you write a new tool?

● NO
  ○ It already exists
  ○ You are unable to maintain it
  ○ You won’t really use it

● YES
  ○ YOU need the tool
  ○ YOU will use the tool
  ○ YOU want others to use the tool
  ○ Desire to give back to the community

Eating my own dog food

Lessons from the Prokka experience

● Nearly all feedback is positive

● People all over the world are grateful

● Warm fuzzy feeling inside

● Increase your public profile

● But maintenance burden and guilt

Discoverability

Choosing a home base

University page

Personal home page

Naming

● Try to be unique
  ○ Google to check for conflicts
  ○ Consider how internationals will pronounce it
  ○ Be creative!

● Avoid dodgy acronyms
  ○ Try not to win a JABBA Award
  ○ “Just Another Bogus Bioinformatics Acronym”

Don’t be this person

First impressions count

● Keep It Simple Stupid

● First page of documentation
  ○ What does it do?
  ○ How do I install it?
  ○ How do I run it?

● Try to keep in one place
  ○ Otherwise becomes inconsistent or missed

Usability

A lesson from history

Print something useful if no parameters

% biotool

Please use --help for instructions

Always have a --help flag

% biotool -h

% biotool --help

Usage: biotool [options] seq.fa
  --help       Show this help
  --version    Print version and exit
  --top N      Keep top N sequences

Always have a --version flag

% biotool -v

% biotool -V

% biotool --version

biotool 1.3
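The three habits above (a hint when run with no parameters, a --help flag, a --version flag) can be sketched with Python's argparse. This is only an illustration: `biotool`, `--top` and the version string come from the slides, while the default of 10, the positional name `fasta` and the return codes are assumptions made for the example.

```python
import argparse
import sys


def build_parser():
    """Build a command-line parser for the hypothetical 'biotool'."""
    parser = argparse.ArgumentParser(
        prog="biotool",
        description="Keep the top N sequences from a FASTA file.",
    )
    # argparse adds -h/--help automatically.
    parser.add_argument("-v", "-V", "--version", action="version",
                        version="%(prog)s 1.3")
    # The default of 10 is an assumption; the slides do not give one.
    parser.add_argument("--top", type=int, default=10, metavar="N",
                        help="keep top N sequences (default: %(default)s)")
    parser.add_argument("fasta", nargs="?", help="input FASTA file")
    return parser


def main(argv):
    """Parse argv (without the program name) and return an exit code."""
    args = build_parser().parse_args(argv)
    if args.fasta is None:
        # No input given: print a useful hint instead of doing nothing.
        print("Please use --help for instructions", file=sys.stderr)
        return 2
    print(f"Would keep top {args.top} sequences from {args.fasta}")
    return 0

# In a real script: sys.exit(main(sys.argv[1:]))
```

With this layout `main(["--version"])` prints `biotool 1.3` and exits, and `main([])` prints the hint rather than a stack trace.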

Always raise an error when things go wrong

% biotool seq.fa

ERROR: can not open file ‘seq.fa’
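A minimal sketch of this behaviour in Python, assuming a hypothetical tool that simply counts FASTA records: the failure is reported on stderr in plain language and reflected in a non-zero return code. The function names and the counting task are invented for the example.

```python
import sys


def count_sequences(path):
    """Count FASTA records (lines starting with '>') in a file."""
    with open(path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def main(argv):
    """Return 0 on success, 1 if the file cannot be read, 2 on bad usage."""
    if len(argv) != 1:
        print("Usage: biotool seq.fa", file=sys.stderr)
        return 2
    try:
        total = count_sequences(argv[0])
    except OSError as err:
        # Say what went wrong and with which file; do not fail silently.
        print(f"ERROR: cannot open file '{argv[0]}': {err.strerror}",
              file=sys.stderr)
        return 1
    print(total)
    return 0

# In a real script: sys.exit(main(sys.argv[1:]))
```

Catching `OSError` covers missing files and permission problems alike, so the user sees one clear message instead of a traceback.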

Check that dependencies are installed

% biotool seq.fa

Checking BLAST... ok
Checking SAMtools... NOT FOUND!

Please install ‘samtools’ and add it to your PATH.
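One way such a check might look in Python, using `shutil.which` to search the PATH. The function name `check_dependencies` and the exact message wording are assumptions modelled on the output above.

```python
import shutil
import sys


def check_dependencies(tools):
    """Report whether each executable in 'tools' is on PATH.

    Returns True only if every tool was found.
    """
    all_found = True
    for name in tools:
        found = shutil.which(name) is not None
        status = "ok" if found else "NOT FOUND!"
        print(f"Checking {name}... {status}", file=sys.stderr)
        if not found:
            print(f"Please install '{name}' and add it to your PATH.",
                  file=sys.stderr)
            all_found = False
    return all_found

# Example: check_dependencies(["blastn", "samtools"])
```

Running the check once at start-up means the user learns about a missing dependency immediately, not an hour into a run.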

Always let users control output filenames

% biotool seq.fa

Processing ‘seq.fa’
Wrote result to ‘filt.seq.fa.out’

# ARGH!

% biotool --out seq.filt.fa
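A sketch of an `--out` option in Python, assuming that standard output is the default destination and that "-" stands for stdout. Both conventions are assumptions for this example, as is the helper name `open_output`.

```python
import argparse
import contextlib
import sys


def build_parser():
    """Parser with a user-controlled output filename."""
    parser = argparse.ArgumentParser(prog="biotool")
    parser.add_argument("--out", default="-", metavar="FILE",
                        help="output file (default: standard output)")
    parser.add_argument("fasta", help="input FASTA file")
    return parser


@contextlib.contextmanager
def open_output(path):
    """Yield a writable handle: stdout for '-', otherwise the named file."""
    if path == "-":
        yield sys.stdout  # never close stdout ourselves
    else:
        with open(path, "w") as handle:
            yield handle

# Example:
#   args = build_parser().parse_args(["--out", "seq.filt.fa", "seq.fa"])
#   with open_output(args.out) as out:
#       out.write(">seq1\nACGT\n")
```

The tool never invents a filename such as `filt.seq.fa.out`; the user either names the file or redirects stdout.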

KISS - run with minimum parameters

% biotool seq.fa
ERROR: missing -x parameter

% biotool -x 3 seq.fa
ERROR: missing -y parameter

% biotool -x 3 -y 7 seq.fa
ERROR: need -n name

# ARGH!

Standards

Use the standard getopt interface
Short options ( -h ) and long options ( --help )

● C: #include <getopt.h>
● C++: boost::program_options
● Python: import argparse
● Perl: use Getopt::Long
● R: library(argparse)

Command line interface

Unix exit codes

● A non-negative integer

● Loose standards
  ○ 0 = success
  ○ 1 = general failure
  ○ 2 = error with command line
  ○ 3..127 = user defined specific failures

● Result is stored in the shell variable $?

Accessing exit codes in the shell

% ls /tmp/fake
ls: cannot access /tmp/fake
% echo $?
1

% ls /proc/cpuinfo
/proc/cpuinfo
% echo $?
0
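The same exit code that the shell exposes as $? is what a calling program sees. A small Python sketch, where the constant names and the helper `child` are illustrative and not part of any standard:

```python
import subprocess
import sys

# Loose conventions from the exit-code list above; the names are illustrative.
EXIT_OK = 0
EXIT_FAILURE = 1
EXIT_USAGE = 2


def exit_code(cmd):
    """Run a command and return its exit code (the shell's $?)."""
    result = subprocess.run(cmd, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode


def child(code):
    """Build a command that starts Python and exits at once with 'code'."""
    return [sys.executable, "-c", f"import sys; sys.exit({code})"]

# Example: exit_code(child(3)) gives 3, a "user defined" failure code.
```

Pipelines and workflow managers rely on this value to decide whether to continue, so a tool that prints an error but exits with 0 will silently corrupt downstream steps.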

Using stdin, stderr and stdout

● Stdin (0): command < input
● Stdout (1): command > output
● Stderr (2): command 2> errors

● All: command < input > output 2> errors

● Allows piping!
  sort input | command1 1> output 2> errors

This makes your tool useful in streaming

% zcat seq.fastq.gz |
    cutadapt -a adapters.fa |
    qualtrim -Q 20 |
    bwa mem -t 8 ref.fa |
    samtools sort --threads 4 > seq.bam
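For a tool to slot into a pipeline like this, it has to read from stdin, write results to stdout, and keep log messages on stderr. A minimal Python sketch of such a filter, assuming a hypothetical task of dropping FASTA sequences shorter than a minimum length (the function names and the task are invented for the example):

```python
import sys


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA stream."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def filter_fasta(instream, outstream, min_len, log=sys.stderr):
    """Copy sequences of at least min_len bases; return how many were kept."""
    kept = 0
    for header, seq in read_fasta(instream):
        if len(seq) >= min_len:
            outstream.write(f">{header}\n{seq}\n")
            kept += 1
    # Log messages go to stderr so they never pollute the piped data.
    print(f"Kept {kept} sequences", file=log)
    return kept

# In a real script: filter_fasta(sys.stdin, sys.stdout, min_len=100)
```

Because records are processed one at a time, the filter starts emitting output before the input has finished arriving, which is what makes long pipelines practical.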

Use standards compliant files *

● Feature coordinates
  ○ BED, GFF

● Columnar data (put headings!)
  ○ TSV
  ○ CSV

● Structured data
  ○ JSON
  ○ YAML

* XML excepted
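Python's standard library already writes the columnar and structured formats listed above, so there is rarely a reason to hand-roll output. A small sketch, with column names chosen purely for illustration:

```python
import csv
import io
import json


def to_tsv(rows, columns):
    """Render a list of dicts as TSV text with a heading line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=columns, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()  # put headings!
    writer.writerows(rows)
    return buf.getvalue()


def to_json(rows):
    """Render the same records as JSON text."""
    return json.dumps(rows, indent=2)

# Example:
#   rows = [{"seqid": "contig1", "length": 1200}]
#   to_tsv(rows, ["seqid", "length"])
```

Using the csv module also takes care of quoting fields that happen to contain the delimiter, which ad hoc `"\t".join(...)` code tends to miss.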

Installation

Keeping your audience

“Each equation in a book will halve your audience”

“Each difficulty encountered in installation will halve your number of users”

Traditional systems level packaging

● Debian / DEB
  apt-get install blast
  dpkg -i blast-2.2.5-amd64.deb

● Redhat / RPM
  yum install blast
  rpm -i blast-2.2.5-x86_64.rpm

● Various others

Cross platform solutions: Linux, Mac, Windows

● Brew
  brew install blast

● Conda
  conda install blast

● Others
  ○ GUIX, ...
  ○ Docker, AMI images

Language specific repositories

● Python - PIP
  pip install ariba

● Perl - CPAN
  cpanm Bio::Roary

● R - CRAN
  install.packages("edgeseq3")

Marketing

Publish it

● Preprint archive
  ○ PeerJ, bioRxiv

● Method focussed journal
  ○ Bioinformatics, BMC Bioinformatics

● Software focussed journal
  ○ Journal of Open Source Software

Plug it

● Twitter
  ○ Ask someone popular you know to retweet it

● Blog
  ○ Start a general blog and slot

● Conferences
  ○ Tell people about it

Support your users

● Reply to emails

● Monitor your “Issues” web site

● Monitor Biostars and SeqAnswers

● Have a mailing list

● Update your documentation

● Fix bugs

Conclusions

Take home messages

● Make it as painless as possible to install

● Keep documentation clear and simple

● Get people to use it before you publish

● People are not judging your coding skills

● But they will curse you if you waste their time

● Most users are grateful - leads to free beer

● A good tool is worth much more than a paper

Acknowledgments

● Gary Glonek
● David Adelson

● Bernard Pope - VLSCI
● Dieter Bulach - VLSCI
● Anna Syme - VLSCI
● David Powell - Monash University
● Anders Goncalves da Silva - University of Melbourne

The end.
