ENABLING REPRODUCIBLE IN-SILICO DATA ANALYSES WITH NEXTFLOW
Paolo Di Tommaso, CRG
Wellcome Trust Sanger Institute, 1 May 2018, Cambridge
WHO IS THIS CHAP?
@PaoloDiTommaso
Research software engineer
Comparative Bioinformatics, Notredame Lab
Center for Genomic Regulation (CRG)
Author of the Nextflow project
AGENDA
• The challenges of computational workflows
• Nextflow main principles
• Handling parallelisation and portability
• Deployment scenarios
• Comparison with other tools
• Future plans
GENOMIC WORKFLOWS
• Data analysis applications to extract information from (large) genomic datasets
• Embarrassingly parallel: can spawn 100s to 100k jobs over a distributed cluster
• Mash-up of many different tools and scripts
• Complex dependency trees and configuration → very fragile ecosystem
Steinbiss et al., Companion parasite genome annotation pipeline, DOI: 10.1093/nar/gkw292
Reproducing the results of a typical computational biology paper requires about 280 hours, i.e. roughly 1.7 months!
THE SAME APPLICATION DEPLOYED IN DIFFERENT ENVIRONMENTS PRODUCES DIFFERENT RESULTS (!)
Platform                              Amazon Linux   Debian Linux   Mac OSX
Number of chromosomes                           36             36         36
Overall length (bp)                     32,032,223     32,032,223 32,032,223
Number of genes                              7,781          7,783      7,771
Gene density                                236.64         236.64     236.32
Number of coding genes                       7,580          7,580      7,570
Average coding length (bp)                   1,764          1,764      1,762
Number of genes with multiple CDS              113            113        111
Number of genes with known function          4,147          4,147      4,142
Number of t-RNAs                                88             90         88
Comparison of the Companion pipeline annotation of Leishmania infantum genome executed across different platforms *
* Di Tommaso P, et al., Nextflow enables computational reproducibility, Nature Biotech, 2017
CHALLENGES
• Reproducibility: replicate results over time
• Portability: run across different platforms
• Scalability: deploy big distributed workloads
• Usability: streamline execution and deployment of complex workloads, i.e. remove complexity rather than adding more
• Consistency: track changes and revisions consistently for code, config files and binary dependencies
PUSH-THE-BUTTON PIPELINES
HOW?
• Fast prototyping ⇒ a custom DSL that enables task composition and simplifies most use cases, plus a general-purpose programming language for corner cases
• Easy parallelisation ⇒ a declarative, reactive programming model based on the dataflow paradigm; implicit, portable parallelism
• Self-contained ⇒ functional approach: a task execution is idempotent, i.e. it cannot modify the state of other tasks; dependencies are isolated with containers
• Portable deployments ⇒ an executor abstraction layer + separation of deployment configuration from implementation logic
Orchestration & Parallelisation
Scalability & Portability
Deployment & Reproducibility
containers
Git, GitHub
TASK EXAMPLE

bwa mem reference.fa sample.fq \
    | samtools sort -o sample.bam

process align_sample {

    input:
    file 'reference.fa' from genome_ch
    file 'sample.fq' from reads_ch

    output:
    file 'sample.bam' into bam_ch

    script:
    """
    bwa mem reference.fa sample.fq \
        | samtools sort -o sample.bam
    """
}
TASKS COMPOSITION

process align_sample {

    input:
    file 'reference.fa' from genome_ch
    file 'sample.fq' from reads_ch

    output:
    file 'sample.bam' into bam_ch

    script:
    """
    bwa mem reference.fa sample.fq \
        | samtools sort -o sample.bam
    """
}

process index_sample {

    input:
    file 'sample.bam' from bam_ch

    output:
    file 'sample.bai' into bai_ch

    script:
    """
    samtools index sample.bam
    """
}
DATAFLOW
• Declarative computational model for parallel process executions
• Processes wait for data; when an input set is ready, the process is executed
• Processes communicate via dataflow variables, i.e. async FIFO queues called channels
• Parallelisation and task dependencies are implicitly defined by the process input/output declarations
HOW PARALLELISATION WORKS

samples_ch = Channel.fromPath('data/sample.fastq')

process FASTQC {

    input:
    file reads from samples_ch

    output:
    file 'fastqc_logs' into fastqc_ch

    script:
    """
    mkdir fastqc_logs
    fastqc -o fastqc_logs -f fastq -q ${reads}
    """
}
samples_ch = Channel.fromPath('data/*.fastq')

process FASTQC {

    input:
    file reads from samples_ch

    output:
    file 'fastqc_logs' into fastqc_ch

    script:
    """
    mkdir fastqc_logs
    fastqc -o fastqc_logs -f fastq -q ${reads}
    """
}

With the glob pattern the channel emits every matching file, and Nextflow spawns one FASTQC task per file, in parallel.
IMPLICIT PARALLELISM
[diagram: a Channel.fromPath("data/*.fastq") channel fans out into parallel FASTQC and clustalo tasks]
SUPPORTED PLATFORMS
DEPLOYMENT SCENARIOS
LOCAL EXECUTION
• Common development scenario
• Dependencies can be managed using a container runtime
• Parallelisation is managed by spawning POSIX processes
• Can scale vertically using a fat server / shared-memory machine
[diagram: nextflow on a laptop / workstation, running tasks via docker/singularity over the local OS and storage]
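This local scenario can be sketched in configuration. Assuming Docker is installed, a minimal nextflow.config might look as follows (the image name is a placeholder):

```
// nextflow.config (sketch): run every task in a container on the local machine
docker.enabled    = true           // or: singularity.enabled = true
process.container = 'user/image'   // placeholder image name
```

Running `nextflow run main.nf` then executes each task inside the given container, parallelised as local processes.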
CENTRALISED ORCHESTRATION
• Nextflow orchestrates workflow execution by submitting jobs to a compute cluster, e.g. SLURM
• It can run on the head node or a compute node
• Requires shared storage to exchange data between tasks
• Ideal for coarse-grained parallelism
[diagram: nextflow submits jobs to cluster nodes over shared NFS/Lustre storage]
DISTRIBUTED ORCHESTRATION
[diagram: a single launcher job, submitted from the login node of an HPC cluster, starts a nextflow driver plus nextflow workers on the allocated nodes, sharing NFS/Lustre storage]
• A single job request allocates the desired compute nodes
• Nextflow deploys its own embedded compute cluster
• The main instance orchestrates the workflow execution
• The worker instances execute workflow jobs (work-stealing approach)
KUBERNETES
• Next-generation cloud-native clustering for containerised workloads
• Workflow orchestration is still needed on top of it
• The latest Nextflow version includes a new command that streamlines workflow deployment to K8s
K8S DEPLOYMENT
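The new command referred to above is `kuberun`. A sketch of its use, where the pipeline name and the persistent volume claim are placeholders:

```
# sketch: launch a pipeline in a Kubernetes cluster (names are placeholders)
nextflow kuberun user/my-pipeline -v my-pvc-claim:/workspace
```

The `-v` option mounts a Kubernetes persistent volume claim as the shared work directory the tasks use to exchange data.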
PORTABILITY
process {
    executor  = 'slurm'
    queue     = 'my-queue'
    memory    = '8 GB'
    cpus      = 4
    container = 'user/image'
}
The same pipeline runs on AWS Batch by changing only the configuration:

process {
    executor  = 'awsbatch'
    queue     = 'my-queue'
    memory    = '8 GB'
    cpus      = 4
    container = 'user/image'
}
CONFIGURATION DECOUPLING IS THE KEY TO
PORTABLE DEPLOYMENTS
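One common way to realise this decoupling is configuration profiles: the same script is launched against different named configuration sets. A sketch, with illustrative profile and queue names:

```
// nextflow.config (sketch): one profile per deployment target
profiles {
    standard {
        process.executor = 'local'
    }
    cluster {
        process.executor = 'slurm'
        process.queue    = 'long'
    }
    cloud {
        process.executor = 'awsbatch'
        process.queue    = 'my-batch-queue'
    }
}
```

`nextflow run main.nf -profile cluster` then selects the SLURM settings without touching the workflow code.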
DEMO!
A QUICK COMPARISON
GALAXY vs. NEXTFLOW

Nextflow:
• Command-line oriented tool
• Can incorporate any tool without extra adapters
• Fine control over task parallelisation
• Scalability: 100 ⇒ 1M jobs
• One-liner installer
• Suited for production workflows and experienced bioinformaticians

Galaxy:
• Web-based platform
• Built-in integration with many tools and datasets
• Little control over task parallelisation
• Scalability: 10 ⇒ 1K jobs
• Complex installation and maintenance
• Suited for training and less experienced bioinformaticians
SNAKEMAKE vs. NEXTFLOW

Nextflow:
• Command-line oriented tool
• Push model
• Can manage any data structure
• Computes the DAG at runtime
• Supports all major container runtimes
• Built-in support for clusters and cloud
• No support (yet) for sub-workflows
• Built-in support for Git/GitHub, etc. to manage pipeline revisions
• Groovy/JVM based

Snakemake:
• Command-line oriented tool
• Pull model
• Rules defined using file-name patterns
• Computes the DAG ahead of execution
• Built-in support for Singularity
• Custom scripts for cluster deployments
• Support for sub-workflows
• No support for source code management systems
• Python based
CWL vs. NEXTFLOW

Nextflow:
• Language + application runtime
• DSL on top of a general-purpose programming language
• Concise, fluent (at least it tries to be!)
• Community driven
• Single implementation, quick iterations

CWL:
• Language specification
• Declarative meta-language (YAML/JSON)
• Verbose
• Committee driven
• Many vendors/implementations (and specification versions)
CONTAINERISATION
• Nextflow envisioned the use of software containers to fix computational reproducibility
• Mar 2014 (ver 0.7): support for Docker
• Dec 2016 (ver 0.23): support for Singularity
[diagram: the Nextflow runtime dispatching jobs, each wrapped in a container]
SINGULARITY FEATURES
Kurtzer et al. Singularity: Scientific containers for mobility of compute. PLoS ONE 12(5): e0177459
BENCHMARK*
* Di Tommaso P, Palumbo E, Chatzou M, Prieto P, Heuer ML, Notredame C. (2015) The impact of Docker containers on the performance of genomic pipelines. PeerJ 3:e1273 https://dx.doi.org/10.7717/peerj.1273
Container execution can have an impact on short-running tasks, i.e. < 1 min.
SINGULARITY BENCHMARK
https://github.com/wresch/python_import_problem
The Singularity image format speeds up the execution of Python applications with many imports from a shared file system!
WHEN USE CONTAINERS?
ALWAYS!
BEST PRACTICES
• Helps to isolate dependencies from the dev or local deployment environment
• Provides a reproducible sandbox for third-party users
• Binary images protect against software decay
• Make it transparent, i.e. always include the Dockerfile
• The Docker image format is the de-facto standard; it can be executed by different runtimes, e.g. Singularity, Shifter, uDocker, etc.
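To illustrate the "make it transparent" point, a minimal Dockerfile that keeps the container contents inspectable (base image and packages are illustrative):

```
# illustrative Dockerfile: pin the base image and install the pipeline tools
FROM debian:9

RUN apt-get update \
 && apt-get install -y --no-install-recommends bwa samtools \
 && rm -rf /var/lib/apt/lists/*
```

Shipping this file alongside the binary image lets anyone rebuild and audit the environment.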
ERROR RECOVERY
• Each task's outputs are saved in a separate directory
• This allows interrupted executions to be safely resumed, discarding only incomplete tasks
• Dramatically simplifies debugging!
• Computing resources can be defined in a *dynamic* manner, so that a failing task can be automatically re-executed with more memory, a longer timeout, etc.
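The dynamic-resources point can be sketched with Nextflow's retry directives; the process body and limits are illustrative:

```
// sketch: re-execute a failing task with more memory and time on each attempt
process align {
    errorStrategy 'retry'             // resubmit the task on failure
    maxRetries    3
    memory { 2.GB * task.attempt }    // 2 GB, then 4 GB, then 6 GB...
    time   { 1.hour * task.attempt }

    script:
    """
    bwa mem reference.fa sample.fq | samtools sort -o sample.bam
    """
}
```

Because the memory and time directives are closures evaluated per attempt, each retry is resubmitted with larger resource requests.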
EXECUTION REPORT
EXECUTION TIMELINE
DAG VISUALISATION
EDITORS!
WHAT'S NEXT
IMPROVEMENTS
• Built-in support for Bioconda recipes
• Better metadata and provenance handling
• Workflow composition, aka sub-workflows
• Support for more clouds, i.e. Azure and GCP
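The Bioconda item later surfaced in Nextflow as the conda process directive; a sketch, where the package spec is illustrative:

```
// sketch: declare per-process dependencies as Conda/Bioconda package specs
process fastqc {
    conda 'bioconda::fastqc=0.11.7'   // illustrative package and version

    input:
    file reads from samples_ch

    script:
    """
    fastqc ${reads}
    """
}
```

Nextflow can then build the Conda environment on the fly, as an alternative to containers for isolating dependencies.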
APACHE SPARK
• Native support for Apache Spark clusters and their execution model
• Allow hybrid Nextflow and Spark applications
• Mix the best of the two worlds: Nextflow for legacy tools / coarse-grained parallelisation, Spark for fine-grained / distributed execution, e.g. GATK4
• Participate in the GA4GH Cloud Work Stream working group
• TES: Task Execution API (working prototype)
• WES: Workflow Execution API
• Enable interoperability with GA4GH-compliant platforms, e.g. Cancer Genomics Cloud and Broad FireCloud
WHO IS USING NEXTFLOW?
• Community effort to collect production ready analysis pipelines built with Nextflow
• Initially supported by SciLifeLab, QBiC and A*Star Genome Institute Singapore
• https://nf-core.github.io

Alexander Peltzer, Phil Ewels, Andreas Wilm
CONCLUSION
• Data analysis reproducibility is hard and often underestimated.
• Nextflow does not provide a magic solution, but it enables best practices and provides support for community and industry standards.
• It strictly separates the application logic from the configuration and deployment logic, enabling self-contained workflows.
• Applications can be deployed across different environments in a reproducible manner with a single command.
• The functional/reactive model allows applications to scale to millions of jobs with ease.
ACKNOWLEDGMENT
Evan Floden
Emilio Palumbo
Cedric Notredame
Notredame Lab, CRG
http://nextflow.io