Presented at the ISMB/ECCB 2011 conference. https://www.iscb.org/cms_addon/conferences/ismbeccb2011/highlights.php#HL13
Initial steps towards a production platform for DNA sequence analysis on the grid
ISMB/ECCB conference – 18 July 2011
Barbera van Schaik, Angela Luyf, Michel de Vries,
Frank Baas, Antoine van Kampen and Silvia Olabarriaga
Overview
Grid computing and workflow technology
Example: Virus discovery
Analysis of larger data sets
Example: Genome of the Netherlands
Challenges and summary
Sequencing, Moore’s law and personnel
http://www.politigenomics.com/2009/02/the-scale-up.html
[Figure: acceleration over time. Note: only the slope is meaningful in this graph.]
What are the options?
Local cluster
Desktop grid
Super computer
Hadoop cluster
GPU cluster
Cloud computing
(Inter)national grid
DNA computing
National computing facilities
Each system has its own interface
Need to learn how they all work
Grids
Distributed resources
Computing
Data storage
Open protocols
It's all about sharing
Resources
Methods
Collaborations
Dutch grid (resources)
http://www.biggrid.nl/
People, resources and data flow
My role
Grid
Sequence facility
Research laboratories
Bioinformatics NGS team
e-BioScience team
Example: Virus discovery
Virus discovery unit
VIDISCA method
GenBank - NR
Goal: Identify known and discover new viruses in samples
Michel de Vries et al. (2011) PLoS ONE
BLAST analysis workflow
Input: sequence reads
Conversion step (sff to fasta)
BLAST
Output: BLAST results
Workflow description (XML)
Component 1 (XML)
Component 2 (XML)
Implementation of workflow components
Executable/script: sff2fasta.pl
In: sequences (sff)
Out: sequences (fasta)
Executable/script: BLAST
In: sequences (fasta)
In: database (fasta)
Out: blast result (txt)
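The two components above can be sketched as small records with named, typed ports. This is only an illustration in Python, not the XML component format used by the workflow engine: the `Component` class and the helpers `can_link` and `unbound_inputs` are hypothetical, while the executables, port names and file formats are taken from the slide.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """One workflow component: an executable plus named input/output ports."""
    executable: str
    inputs: dict   # port name -> file format
    outputs: dict  # port name -> file format


# Executables, ports and formats as listed on the slide.
sff2fasta = Component("sff2fasta.pl", {"sequences": "sff"}, {"sequences": "fasta"})
blast = Component(
    "BLAST",
    {"sequences": "fasta", "database": "fasta"},
    {"blast result": "txt"},
)


def can_link(producer, out_port, consumer, in_port):
    """True when the producer's output format matches the consumer's input format."""
    return producer.outputs[out_port] == consumer.inputs[in_port]


def unbound_inputs(consumer, linked_ports):
    """Input ports not fed by another component; these must be workflow inputs."""
    return sorted(set(consumer.inputs) - set(linked_ports))


if __name__ == "__main__":
    print(can_link(sff2fasta, "sequences", blast, "sequences"))
    print(unbound_inputs(blast, ["sequences"]))
```

In this sketch the converter's fasta output can feed the BLAST `sequences` port, and the `database` port remains a workflow-level input.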
Tristan Glatard (2008) Future generation computer systems
http://gwendia.i3s.unice.fr/doku.php?id=gwendia
Run workflow on the grid
Silvia Olabarriaga et al (2010) IEEE Transactions on Information Technology in Biomedicine
Tristan Glatard (2008) International Journal of High Performance Computing Applications
Graphical user interface: VBrowser
http://www.vl-e.nl/vbrowser
Workflow monitoring
Speed up
BLAST
15 experiments
722 samples
2 databases: Human ribosomal, Viruses
Total CPU time: 413 hrs (~17 days)
Elapsed time workflow: 13.7 hrs = 30x speed up
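The speed-up figure follows directly from the two timings on the slide. A minimal check in Python, assuming the speed-up is defined as total CPU time divided by elapsed workflow time (the function name `speed_up` is ours):

```python
def speed_up(total_cpu_hours, elapsed_hours):
    """Ratio of summed CPU time to wall-clock time of the whole workflow."""
    return total_cpu_hours / elapsed_hours


cpu_hours = 413.0      # total CPU time from the slide
elapsed_hours = 13.7   # elapsed workflow time from the slide

cpu_days = cpu_hours / 24
ratio = speed_up(cpu_hours, elapsed_hours)

print(f"CPU time: {cpu_days:.1f} days")  # prints "CPU time: 17.2 days"
print(f"Speed-up: {ratio:.1f}x")         # prints "Speed-up: 30.1x"
```

The ratio compares the grid run with a hypothetical run of all jobs one after another on a single core.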
Angela Luyf, Barbera van Schaik et al (2010) BMC Bioinformatics
Benefits workflow technology
Agile development
Re-use of components
Iteration strategy
Knowledge about analysis steps captured in workflow
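In workflow languages, an iteration strategy usually describes how lists of inputs are combined into individual tasks, for example pairwise ("dot") or as all combinations ("cross"). The Python sketch below illustrates the two options. It assumes that every sample is searched against every database, which the slides do not state explicitly, and the file and database names are hypothetical.

```python
from itertools import product


def dot(left, right):
    """Pair the i-th item of one list with the i-th item of the other."""
    return list(zip(left, right))


def cross(left, right):
    """Combine every item of one list with every item of the other."""
    return list(product(left, right))


# Hypothetical names; the counts (722 samples, 2 databases) are from the slides.
samples = [f"sample{i:03d}.fasta" for i in range(1, 723)]
databases = ["human_ribosomal", "viruses"]

tasks = cross(samples, databases)
print(len(tasks))  # one BLAST task per (sample, database) pair
```

Under this assumption the cross strategy expands two short input lists into 1444 independent tasks without any change to the components themselves.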
Analysis of larger data sets
Genome of the Netherlands (GoNL)
Whole genome sequencing of 250 trios
Enrich biobanks
Reference set for disease studies
http://www.bbmri.nl/
http://www.nlgenome.nl/
770 samples
45 TB raw data
Many partners (data sharing)
Analysis on distributed sites
GoNL alignment pipeline
BWA aln, sampe, sam-to-bam, sort bam, index
Picard mark duplicates
GATK realignment
GATK recalibration
Picard fix mates
Pair1.fastq
Pair2.fastq
Pipeline similar to what is used at the Broad Institute. Implemented for GoNL by Freerk van Dijk (Groningen)
Reference genome
Result.bam
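The first pipeline stage (BWA aln and sampe) can be sketched as a list of command lines for one paired-end lane. This is an illustration only, not the GoNL implementation: the function names and the intermediate file names are hypothetical, the reference is assumed to be indexed already with `bwa index`, and the later stages (sorting, Picard, GATK) are left out because their command lines differ between tool versions.

```python
import subprocess


def bwa_paired_end_plan(reference, pair1, pair2, prefix):
    """Return (argv, stdout_file) tuples for the BWA steps, in execution order."""
    sai1 = f"{prefix}_1.sai"  # hypothetical intermediate file names
    sai2 = f"{prefix}_2.sai"
    sam = f"{prefix}.sam"
    return [
        (["bwa", "aln", reference, pair1], sai1),
        (["bwa", "aln", reference, pair2], sai2),
        (["bwa", "sampe", reference, sai1, sai2, pair1, pair2], sam),
    ]


def run_plan(plan):
    """Run each command, writing its standard output to the planned file."""
    for argv, stdout_file in plan:
        with open(stdout_file, "wb") as out:
            subprocess.run(argv, stdout=out, check=True)


plan = bwa_paired_end_plan("reference.fa", "Pair1.fastq", "Pair2.fastq", "Result")
for argv, stdout_file in plan:
    print(" ".join(argv), ">", stdout_file)
```

Keeping the plan separate from `run_plan` means the same list of commands could be handed to a local shell or wrapped as grid jobs.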
160 samples (478 lanes) are currently analyzed on the Dutch grid
Development and small tests:
Nov 22, 2010 - now
Analysis:
Mar 25, 2011 - Jul 15, 2011
Jobs: 13,981
Total CPU time: 5.5 years
Disk space used: 315 TB
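A few derived figures put these totals in perspective. The Python sketch below assumes that a CPU year means 365 days and that the work was spread over the whole analysis period, so the results are rough averages rather than measured values.

```python
from datetime import date

HOURS_PER_YEAR = 365 * 24  # assumption: one CPU year is 365 days

jobs = 13_981                         # from the slide
cpu_hours = 5.5 * HOURS_PER_YEAR      # 5.5 CPU years from the slide
samples = 160                         # samples analyzed so far, from the slide
disk_tb = 315                         # disk space used, from the slide
start, end = date(2011, 3, 25), date(2011, 7, 15)

analysis_days = (end - start).days
cpu_hours_per_job = cpu_hours / jobs
avg_busy_cores = cpu_hours / (analysis_days * 24)
disk_tb_per_sample = disk_tb / samples

print(f"Analysis period: {analysis_days} days")
print(f"Average CPU time per job: {cpu_hours_per_job:.1f} hrs")
print(f"Average number of busy cores: {avg_busy_cores:.1f}")
print(f"Disk space per sample: {disk_tb_per_sample:.2f} TB")
```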
Challenges
• Error handling
• Data management
• Data protection
• Provenance tracking
• Transparent addition of other resources
Summary
More research and development needed in e-bioscience
Latest IT infrastructures needed for scaling up NGS data analysis (grids, clouds, big clusters)
Workflow technology assists agile implementation of bioinformatics software
Separate workflow development from IT infrastructure for easier migration and expansion (middleware)
Acknowledgements
Genome of the Netherlands, NL
Cisca Wijmenga
Morris Swertz
All project partners
Virus discovery unit, AMC
Lia van der Hoek
Michel de Vries
Department of genome analysis, AMC
Frank Baas
Ted Bradley
Marja Jakobs
Bioinformatics Laboratory, AMC
Antoine van Kampen
NGS bioinformatics team
Aldo Jongejan
Marcel Willemsen
e-Bioscience team
Silvia Olabarriaga
Angela Luyf
Mark Santcroos
Shayan Shahand
University of Amsterdam
Piter de Boer
BiG Grid
Jan Just Keijser
Tom Visser
Grid support
Modalis, France
Johan Montagnat
Creatis, France
Tristan Glatard
http://www.bioinformaticslaboratory.nl/
BWA on grid – component description
BWA on grid – workflow description
e-BioInfra gateway
No grid certificate needed
Data upload via sFTP (intranet)
Synced with grid storage
Workflows are started from web page
http://orange.ebioscience.amc.nl/ebioinfragateway/
Implemented workflow components for next generation sequencing
Existing software
• BLAST
• BLAT
• BWA
• Annovar
• Varscan
• Newbler
• FastQC
• Roche software
• GATK
• Picard
• Samtools
In-house software
• Data format converters
• Quality trimming
• Alternative splice product detection
• CDR3 detection (T- and B-cell variation)
• Genome comparison (small genomes)