ENABLING REPRODUCIBLE IN-SILICO DATA ANALYSES WITH NEXTFLOW

Paolo Di Tommaso, CRG

Wellcome Trust Sanger Institute, 1 May 2018, Cambridge

WHO IS THIS CHAP?

@PaoloDiTommaso

Research software engineer

Comparative Bioinformatics, Notredame Lab

Center for Genomic Regulation (CRG)

Author of Nextflow project

AGENDA

• The challenges with computational workflows

• Nextflow main principles

• Handling parallelisation and portability

• Deployment scenarios

• Comparison with other tools

• Future plans

GENOMIC WORKFLOWS

• Data analysis applications to extract information from (large) genomic datasets

• Embarrassingly parallel: can spawn from hundreds to 100k jobs over a distributed cluster

• Mash-up of many different tools and scripts

• Complex dependency trees and configuration → very fragile ecosystem

Steinbiss et al., Companion parasite genome annotation pipeline, DOI: 10.1093/nar/gkw292

Reproducing the results of a typical computational biology paper requires 280 hours, i.e. ≈1.7 months!
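The ≈1.7 months figure is consistent with reading the 280 hours as full-time working time. A quick check, assuming 8-hour working days and roughly 21 working days per month (both are assumptions of this sketch, not stated on the slide):

```python
# Assumptions (not from the slide): 8-hour days, about 21 working days per month.
HOURS_TO_REPRODUCE = 280      # figure quoted on the slide
HOURS_PER_DAY = 8             # assumed
WORKING_DAYS_PER_MONTH = 21   # assumed

working_days = HOURS_TO_REPRODUCE / HOURS_PER_DAY
working_months = working_days / WORKING_DAYS_PER_MONTH

print(f"{working_days:.0f} working days, about {working_months:.1f} months")
```

Under these assumptions the 280 hours correspond to 35 working days, which rounds to the 1.7 months quoted above.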

THE SAME APPLICATION DEPLOYED IN DIFFERENT ENVIRONMENTS PRODUCES DIFFERENT RESULTS (!)

Platform                              Amazon Linux   Debian Linux   Mac OSX
Number of chromosomes                 36             36             36
Overall length (bp)                   32,032,223     32,032,223     32,032,223
Number of genes                       7,781          7,783          7,771
Gene density                          236.64         236.64         236.32
Number of coding genes                7,580          7,580          7,570
Average coding length (bp)            1,764          1,764          1,762
Number of genes with multiple CDS     113            113            111
Number of genes with known function   4,147          4,147          4,142
Number of t-RNAs                      88             90             88

Comparison of the Companion pipeline annotation of the Leishmania infantum genome executed across different platforms *

* Di Tommaso P, et al., Nextflow enables computational reproducibility, Nature Biotech, 2017
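The "Gene density" row appears to be the number of coding genes per megabase of genome; that definition is inferred from the numbers in the table and is not stated on the slide. A minimal check under that assumption, using only values copied from the table:

```python
# Values copied from the table; "coding genes per Mb" is an inferred definition.
GENOME_LENGTH_BP = 32_032_223

coding_genes = {
    "Amazon Linux": 7_580,
    "Debian Linux": 7_580,
    "Mac OSX": 7_570,
}

def gene_density(n_coding_genes, length_bp=GENOME_LENGTH_BP):
    """Coding genes per megabase, rounded to two decimals as in the table."""
    return round(n_coding_genes / (length_bp / 1_000_000), 2)

for platform, n_genes in coding_genes.items():
    print(f"{platform}: {gene_density(n_genes)}")
```

This reproduces the tabulated 236.64, 236.64 and 236.32, so the lower density on Mac OSX follows directly from the ten coding genes missing on that platform.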

CHALLENGES

• Reproducibility, i.e. replicate results over time

• Portability, i.e. run across different platforms

• Scalability, i.e. deploy big distributed workloads

• Usability, i.e. streamline execution and deployment of complex workloads: remove complexity instead of adding more

• Consistency, i.e. track changes and revisions consistently for code, config files and binary dependencies

PUSH-THE-BUTTON PIPELINES

HOW?

• Fast prototyping ⇒ custom DSL that enables task composition and simplifies most use cases + general-purpose programming language for corner cases

• Easy parallelisation ⇒ declarative reactive programming model based on the dataflow paradigm, implicit portable parallelism

• Self-contained ⇒ functional approach: a task execution is idempotent, i.e. it cannot modify the state of other tasks + dependencies isolated with containers

• Portable deployments ⇒ executor abstraction layer + deployment configuration decoupled from implementation logic

Orchestration & Parallelisation

Scalability & Portability

Deployment & Reproducibility

containers

Git / GitHub

TASK EXAMPLE

    bwa mem reference.fa sample.fq \
      | samtools sort -o sample.bam

TASK EXAMPLE

    process align_sample {

        input:
        file 'reference.fa' from genome_ch
        file 'sample.fq' from reads_ch

        output:
        file 'sample.bam' into bam_ch

        script:
        """
        bwa mem reference.fa sample.fq \
          | samtools sort -o sample.bam
        """
    }

TASKS COMPOSITION

    process align_sample {

        input:
        file 'reference.fa' from genome_ch
        file 'sample.fq' from reads_ch

        output:
        file 'sample.bam' into bam_ch

        script:
        """
        bwa mem reference.fa sample.fq \
          | samtools sort -o sample.bam
        """
    }

    process index_sample {

        input:
        file 'sample.bam' from bam_ch

        output:
        file 'sample.bai' into bai_ch

        script:
        """
        samtools index sample.bam
        """
    }

DATAFLOW

• Declarative computational model for parallel process executions

• Processes wait for data; when an input set is ready, the process is executed

• They communicate by using dataflow variables, i.e. async FIFO queues called channels

• Parallelisation and task dependencies are implicitly defined by process in/out declarations
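The dataflow idea can be sketched outside Nextflow with plain FIFO queues. The example below is only an illustration of the model, not how Nextflow is implemented: the `process` helper, the `DONE` sentinel and the `align` / `index` functions are hypothetical stand-ins for this sketch.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of a channel (an assumption of this sketch)

def process(task, inputs, outputs):
    """Run `task` on every item arriving on `inputs`, emitting results on `outputs`."""
    def worker():
        while True:
            item = inputs.get()          # block until data is available
            if item is DONE:
                outputs.put(DONE)        # propagate end-of-stream downstream
                return
            outputs.put(task(item))
    thread = threading.Thread(target=worker)
    thread.start()
    return thread

# Hypothetical stand-ins for the align and index steps.
def align(sample):
    return sample.replace(".fq", ".bam")

def index(bam):
    return bam + ".bai"

reads_ch, bam_ch, bai_ch = queue.Queue(), queue.Queue(), queue.Queue()

# The channel wiring is the only place where the dependency is expressed;
# the declaration order of the two processes does not matter.
threads = [process(index, bam_ch, bai_ch), process(align, reads_ch, bam_ch)]

for sample in ["a.fq", "b.fq", "c.fq"]:
    reads_ch.put(sample)
reads_ch.put(DONE)

results = []
while (item := bai_ch.get()) is not DONE:
    results.append(item)
for thread in threads:
    thread.join()

print(results)
```

Each process in this sketch handles one item at a time, so the output order matches the input order. Nextflow goes further and runs independent tasks of the same process in parallel.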

HOW PARALLELISATION WORKS

    samples_ch = Channel.fromPath('data/sample.fastq')

    process FASTQC {

        input:
        file reads from samples_ch

        output:
        file 'fastqc_logs' into fastqc_ch

        """
        mkdir fastqc_logs
        fastqc -o fastqc_logs -f fastq -q ${reads}
        """
    }

HOW PARALLELISATION WORKS

    samples_ch = Channel.fromPath('data/*.fastq')

    process FASTQC {

        input:
        file reads from samples_ch

        output:
        file 'fastqc_logs' into fastqc_ch

        """
        mkdir fastqc_logs
        fastqc -o fastqc_logs -f fastq -q ${reads}
        """
    }
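The only difference between the two versions is the glob pattern: the task definition is unchanged, and one task is run per matching file. A rough Python analogy, where `fastqc_task` and `run_for_each` are hypothetical names used only for this sketch and the "task" merely counts lines instead of running FastQC:

```python
import glob
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def fastqc_task(path):
    """Hypothetical stand-in for one FASTQC task: count the lines of one file."""
    with open(path) as handle:
        return os.path.basename(path), sum(1 for _ in handle)

def run_for_each(pattern, task, max_workers=4):
    """Run one independent task per file matching `pattern`, in parallel."""
    paths = sorted(glob.glob(pattern))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(task, paths))

# Demo data: three small files standing in for data/*.fastq.
with tempfile.TemporaryDirectory() as data_dir:
    for name, n_lines in [("a.fastq", 4), ("b.fastq", 8), ("c.fastq", 12)]:
        with open(os.path.join(data_dir, name), "w") as handle:
            handle.write("x\n" * n_lines)
    # Same task, different pattern: one file gives one task, three files give three.
    single = run_for_each(os.path.join(data_dir, "a.fastq"), fastqc_task)
    many = run_for_each(os.path.join(data_dir, "*.fastq"), fastqc_task)

print(single)
print(many)
```

Widening the pattern changes how many tasks run, without touching the task itself.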

IMPLICIT PARALLELISM

[Diagram: Channel.fromPath("data/*.fastq") feeding the FASTQC process]

SUPPORTED PLATFORMS

DEPLOYMENT SCENARIOS

LOCAL EXECUTION

• Common development scenario

• Dependencies can be managed using a container runtime

• Parallelisation is managed by spawning POSIX processes

• Can scale vertically using a fat server / shared-memory machine

[Diagram: laptop / workstation stack with nextflow and docker/singularity running on the OS and local storage]

CENTRALISED ORCHESTRATION

• Nextflow orchestrates workflow execution by submitting jobs to a compute cluster, e.g. SLURM

• It can run on the head node or on a compute node

• Requires shared storage to exchange data between tasks

• Ideal for coarse-grained parallelism

[Diagram: nextflow submits jobs to the cluster nodes of a computer cluster backed by NFS/Lustre shared storage]

DISTRIBUTED ORCHESTRATION

• A single job request allocates the desired compute nodes

• Nextflow deploys its own embedded compute cluster

• The main instance orchestrates the workflow execution

• The worker instances execute workflow jobs (work stealing approach)

[Diagram: a job request from the login node reaches the HPC cluster, where a launcher wrapper starts a nextflow cluster made of one nextflow driver and several nextflow workers on cluster nodes sharing NFS/Lustre storage]

KUBERNETES

• Next-generation cloud-native clustering for containerised workloads

• Workflow orchestration is still needed

• The latest NF version includes a new command that streamlines workflow deployment to K8s

K8S DEPLOYMENT

PORTABILITY

PORTABILITY

    process {
        executor = 'slurm'
        queue = 'my-queue'
        memory = '8 GB'
        cpus = 4
        container = 'user/image'
    }

PORTABILITY

    process {
        executor = 'awsbatch'
        queue = 'my-queue'
        memory = '8 GB'
        cpus = 4
        container = 'user/image'
    }

CONFIGURATION DECOUPLING IS THE KEY TO PORTABLE DEPLOYMENTS
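The two configuration snippets differ only in the executor name, while the task itself never changes. A minimal Python sketch of that separation follows; the profile names `cluster` and `cloud` and the `resolve` helper are hypothetical, and this illustrates the principle rather than Nextflow's own configuration loader.

```python
# Task logic: identical on every platform.
TASK = {
    "name": "align_sample",
    "script": "bwa mem reference.fa sample.fq | samtools sort -o sample.bam",
}

# Deployment settings: the only part that changes between platforms.
# Values mirror the two configuration snippets; the profile names are made up.
PROFILES = {
    "cluster": {"executor": "slurm", "queue": "my-queue", "memory": "8 GB",
                "cpus": 4, "container": "user/image"},
    "cloud": {"executor": "awsbatch", "queue": "my-queue", "memory": "8 GB",
              "cpus": 4, "container": "user/image"},
}

def resolve(task, profile_name, profiles=PROFILES):
    """Combine the unchanged task definition with one deployment profile."""
    return {**task, **profiles[profile_name]}

on_cluster = resolve(TASK, "cluster")
on_cloud = resolve(TASK, "cloud")

print(on_cluster["executor"], on_cloud["executor"])
print(on_cluster["script"] == on_cloud["script"])
```

Because the script never mentions the executor, moving between platforms only means selecting a different profile.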

DEMO!

A QUICK COMPARISON

GALAXY vs. NEXTFLOW

Nextflow

• Command line oriented tool

• Can incorporate any tool without any extra adapter

• Fine control over task parallelisation

• Scalability 100⇒1M jobs

• One-liner installer

• Suited for production workflows + experienced bioinformaticians

Galaxy

• Web based platform

• Built-in integration with many tools and datasets

• Little control over task parallelisation

• Scalability 10⇒1K jobs

• Complex installation and maintenance

• Suited for training + inexperienced bioinformaticians

SNAKEMAKE vs. NEXTFLOW

Nextflow

• Command line oriented tool

• Push model

• Can manage any data structure

• Computes the DAG at runtime

• All major container runtimes

• Built-in support for clusters and cloud

• No support (yet) for sub-workflows

• Built-in support for Git/GitHub, etc., to manage pipeline revisions

• Groovy/JVM based

Snakemake

• Command line oriented tool

• Pull model

• Rules defined using file name patterns

• Computes the DAG ahead of execution

• Built-in support for Singularity

• Custom scripts for cluster deployments

• Support for sub-workflows

• No support for source code management systems

• Python based

CWL vs. NEXTFLOW

Nextflow

• Language + application runtime

• DSL on top of a general-purpose programming language

• Concise, fluent (at least it tries to be!)

• Community driven

• Single implementation, quick iterations

CWL

• Language specification

• Declarative meta-language (YAML/JSON)

• Verbose

• Committee driven

• Many vendors/implementations (and specification versions)

CONTAINERISATION

CONTAINERISATION

• Nextflow envisioned the use of software containers to fix computational reproducibility

• Mar 2014 (ver 0.7), support for Docker

• Dec 2016 (ver 0.23), support for Singularity

[Diagram: Nextflow running three jobs]

SINGULARITY FEATURES

Kurtzer et al. Singularity: Scientific containers for mobility of compute. PLoS ONE 12(5): e0177459

BENCHMARK*

* Di Tommaso P, Palumbo E, Chatzou M, Prieto P, Heuer ML, Notredame C. (2015) The impact of Docker containers on the performance of genomic pipelines. PeerJ 3:e1273 https://dx.doi.org/10.7717/peerj.1273

Container execution can have an impact on short-running tasks, i.e. < 1 min

SINGULARITY BENCHMARK

https://github.com/wresch/python_import_problem

The Singularity image format speeds up the execution of Python code that performs many imports from a shared file system!

WHEN TO USE CONTAINERS?

ALWAYS!

BEST PRACTICES

• Containers help to isolate dependencies from the dev or local deployment environment

• They provide a reproducible sandbox for third-party users

• Binary images protect against software decay

• Make it transparent, i.e. always include the Dockerfile

• The Docker image format is the de facto standard; it can be executed by different runtimes, e.g. Singularity, Shifter, uDocker, etc.

ERROR RECOVERY

• Each task's outputs are saved in a separate directory

• This allows to safely record interrupted executions discarding

• Dramatically simplifies debugging!

• Computing resources can be defined in a *dynamic* manner, so that a failing task can be automatically re-executed with more memory, a longer timeout, etc.
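The retry-with-more-resources idea can be sketched as a loop that scales the memory request with the attempt number. This is an illustration in plain Python, not Nextflow code: `run_with_retries`, `OutOfMemory` and `hungry_task` are hypothetical names. In Nextflow itself the same effect is usually obtained with the `errorStrategy 'retry'` and `maxRetries` directives together with a resource expression based on `task.attempt`.

```python
class OutOfMemory(Exception):
    """Raised by the hypothetical task when it is given too little memory."""

def run_with_retries(task, base_memory_gb, max_attempts=3):
    """Re-run `task` after a failure, scaling memory with the attempt number."""
    for attempt in range(1, max_attempts + 1):
        memory_gb = base_memory_gb * attempt   # 1x, 2x, 3x the base request
        try:
            return task(memory_gb), attempt
        except OutOfMemory:
            if attempt == max_attempts:
                raise                          # give up after the last attempt

def hungry_task(memory_gb):
    """Hypothetical task that needs at least 6 GB to finish."""
    if memory_gb < 6:
        raise OutOfMemory(f"{memory_gb} GB is not enough")
    return f"done with {memory_gb} GB"

result, attempts = run_with_retries(hungry_task, base_memory_gb=2)
print(result, "after", attempts, "attempts")
```

Starting from 2 GB, the task fails at 2 GB and 4 GB and succeeds on the third attempt with 6 GB, so no manual resubmission is needed.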

EXECUTION REPORT

EXECUTION REPORT

EXECUTION TIMELINE

DAG VISUALISATION

EDITORS !

WHAT'S NEXT

IMPROVEMENTS

• Built-in support for Bioconda recipes

• Better meta-data and provenance handling

• Workflow composition, aka sub-workflows

• Support for more clouds, i.e. Azure and GCP

APACHE SPARK

• Native support for Apache Spark clusters and execution model

• Allow hybrid Nextflow and Spark applications

• Mix the best of the two worlds: Nextflow for legacy tools and coarse-grained parallelisation, Spark for fine-grained distributed execution, e.g. GATK4

• Participate in the Cloud Work Stream working group

• TES: Task Execution API (working prototype)

• WES: Workflow Execution API

• Enable interoperability with GA4GH-compliant platforms, e.g. Cancer Genomics Cloud and Broad FireCloud

WHO IS USING NEXTFLOW?

• Community effort to collect production ready analysis pipelines built with Nextflow

• Initially supported by SciLifeLab, QBiC and A*Star Genome Institute Singapore

• https://nf-core.github.io

Alexander Peltzer

Phil Ewels

Andreas Wilm

CONCLUSION

• Data analysis reproducibility is hard and is often underestimated.

• Nextflow does not provide a magic solution, but it enables best practices and provides support for community and industry standards.

• It strictly separates the application logic from the configuration and deployment logic, enabling self-contained workflows.

• Applications can be easily deployed across different environments in a reproducible manner with a single command.

• The functional/reactive model allows applications to scale to millions of jobs with ease.

ACKNOWLEDGMENT

Evan Floden

Emilio Palumbo

Cedric Notredame

Notredame Lab, CRG

http://nextflow.io