27
SAN DIEGO SUPERCOMPUTER CENTER at the UNIVERSITY OF CALIFORNIA, SAN DIEGO Bioinformatics Meets Big Data Wayne Pfeiffer SDSC/UCSD August 15, 2013 (in US) August 16, 2013 (in Australia)

Bioinformatics Meets Big Data - Monash University · read mapping & variant calling DNA reads ... in FASTA format Variant calling: GATK, … Variants: SNPs & indels Alignment info

Embed Size (px)

Citation preview

Page 1: Bioinformatics Meets Big Data - Monash University · read mapping & variant calling DNA reads ... in FASTA format Variant calling: GATK, … Variants: SNPs & indels Alignment info

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Bioinformatics Meets Big Data

Wayne Pfeiffer

SDSC/UCSD

August 15, 2013 (in US)

August 16, 2013 (in Australia)

Page 2: Bioinformatics Meets Big Data - Monash University · read mapping & variant calling DNA reads ... in FASTA format Variant calling: GATK, … Variants: SNPs & indels Alignment info

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Questions for today

• What is causing the flood of data in bioinformatics?

• How much data are we talking about?

• What are typical compute- & data-intensive analyses in bioinformatics?

• What are their computational requirements & challenges?

• How can gateways help?

Page 3: Bioinformatics Meets Big Data - Monash University · read mapping & variant calling DNA reads ... in FASTA format Variant calling: GATK, … Variants: SNPs & indels Alignment info

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Cost of DNA sequencing has dropped much faster than cost of computing in recent years,

producing the flood of data

Page 4: Bioinformatics Meets Big Data - Monash University · read mapping & variant calling DNA reads ... in FASTA format Variant calling: GATK, … Variants: SNPs & indels Alignment info

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Size matters: how much data are we talking about?

• 3.1 GB for human genome

• Fits on flash drive; assumes FASTA format (1 B per base)

• >100 GB/day from a single Illumina HiSeq 2000

• 50 Gbases/day of reads in FASTQ format (2.5 B per base)

• 500 GB to 1 TB of reads needed as input for analysis of whole human genome, depending upon coverage

• 500 GB for 60x coverage

• 1 TB for 130x coverage

• Multiple TB often needed for subsequent analysis

• Especially when doing de novo assembly

• Sometimes two genomes per person!

• May only be looking for kB or MB in the end

read

read

Page 5: Bioinformatics Meets Big Data - Monash University · read mapping & variant calling DNA reads ... in FASTA format Variant calling: GATK, … Variants: SNPs & indels Alignment info

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

SDSC has a rich set of bioinformatics software;

representative codes are listed here

• Pairwise sequence alignment

• ATAC, BFAST, BLAST, BLAT, Bowtie, BWA

• Multiple sequence alignment (via CIPRES gateway)

• ClustalW, MAFFT

• RNA-Seq analysis

• GSNAP, Tophat

• De novo assembly

• ABySS, Edena, SOAPdenovo, Velvet

• Phylogenetic tree inference (via CIPRES gateway)

• BEAST with BEAGLE, GARLI, MrBayes, RAxML, RAxML-Light

• Tool kits

• BEDTools, GATK, SAMtools

Page 6: Bioinformatics Meets Big Data - Monash University · read mapping & variant calling DNA reads ... in FASTA format Variant calling: GATK, … Variants: SNPs & indels Alignment info

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Computational workflow for read mapping & variant calling

DNA reads in

FASTQ format

Read mapping, i.e.,

pairwise alignment:

BFAST, BWA, …

Reference

genome in

FASTA format

Variant calling:

GATK, …

Variants: SNPs

& indels

Alignment info

in SAM/BAM

format

Goal: identify simple variants, e.g.,

• single nucleotide polymorphisms (SNPs)

or single nucleotide variants (SNVs)

• short insertions & deletions (indels)

CACCGGCGCAGTCATTCTCAT

AAT

||||||||||| ||||||||||||

CACCGGCGCAGACATTCTCAT

AAT

CACCGGCGCAGTCATTCTCATAAT

|||||||||| |||||||||||

CACCGGCGCA ATTCTCATAAT

Page 7: Bioinformatics Meets Big Data - Monash University · read mapping & variant calling DNA reads ... in FASTA format Variant calling: GATK, … Variants: SNPs & indels Alignment info

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Pileup diagram shows mapping of reads to reference; example from HuTS shows a SNP in KRAS gene; this

means that cetuximab is not effective for chemotherapy

BWA analysis by Sam Levy, STSI; diagram from Andrew Carson, STSI

Page 8: Bioinformatics Meets Big Data - Monash University · read mapping & variant calling DNA reads ... in FASTA format Variant calling: GATK, … Variants: SNPs & indels Alignment info

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Computational workflow for de novo assembly & variant calling

DNA reads in

FASTQ format

De novo assembly:

SOAPdenovo,

Velvet, …

Reference

genome in

FASTA format

Contigs &

scaffolds in

FASTA format

Pairwise alignment:

ATAC, BLAST, …

Variant calling:

custom scripts

Alignment info

in various

formats

Variants: indels

& others

Goal: identify complex

variants, e.g.,

• long indels

• block replacements

• inversions

• translocations

Page 9: Bioinformatics Meets Big Data - Monash University · read mapping & variant calling DNA reads ... in FASTA format Variant calling: GATK, … Variants: SNPs & indels Alignment info

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Key conceptual steps in de novo assembly

1. Find reads that overlap by a specified

number of bases (the k-mer size),

typically by building a graph in memory

2. Merge overlapping, “good” reads into

longer contigs, typically by simplifying

the graph

3. Link contigs to form scaffolds using

paired-end information

Diagrams from Serafim Batzoglou, Stanford

Page 10: Bioinformatics Meets Big Data - Monash University · read mapping & variant calling DNA reads ... in FASTA format Variant calling: GATK, … Variants: SNPs & indels Alignment info

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Phylogenetics is the study of evolutionary relationships

among groups of organisms called taxa (typically species)

• The result of a phylogenetic analysis is a phylogeny, most often represented as a tree

• In olden times, phylogenies were based on morphology

• Now phylogenies are usually based on DNA sequences

/-------- Human

|

|---------- Chimpanzee

+

| /---------- Gorilla

| |

\---+ /-------------------------------- Orangutan

\-------------+

\----------------------------------------------- Gibbon

Page 11: Bioinformatics Meets Big Data - Monash University · read mapping & variant calling DNA reads ... in FASTA format Variant calling: GATK, … Variants: SNPs & indels Alignment info

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Computational workflow for de novo assembly plus phylogenetic analysis using DNA sequence data

DNA reads in

FASTQ format

Phylogenetic tree

inference: BEAST,

MrBayes, RAxML, …

. . . ...... . .

Human AAGCTTCACCGGCGCAGTCATTCTCATAAT...

Chimpanzee AAGCTTCACCGGCGCAATTATCCTCATAAT...

Gorilla AAGCTTCACCGGCGCAGTTGTTCTTATAAT...

Orangutan AAGCTTCACCGGCGCAACCACCCTCATGAT...

Gibbon AAGCTTTACAGGTGCAACCGTCCTCATAAT...

Multiple sequence alignment is matrix of taxa vs characters

Final output is phylogeny or tree with taxa at its tips

/-------- Human |

|---------- Chimpanzee

+

| /---------- Gorilla

| |

\---+ /-------------------------------- Orangutan

\-------------+

\----------------------------------------------- Gibbon

Aligned sequences

in various formats

Multiple sequence

alignment: ClustalW,

MAFFT, Mauve …

Gene sequences in

FASTA format

Gene finding:

BLAST, Glimmer,

Prodigal, …

Contigs & scaffolds

in FASTA format

De novo assembly:

Edena, SOAPdenovo,

Velvet, …

Page 12: Bioinformatics Meets Big Data - Monash University · read mapping & variant calling DNA reads ... in FASTA format Variant calling: GATK, … Variants: SNPs & indels Alignment info

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Here are some specs for SDSC’s supercomputers

Cores/ Memory/

Computer Processors node node (GB)

Gordon (NSF) 2.6-GHz Intel Sandy Bridge 16 64

TSCC 2.6-GHz Intel Sandy Bridge 16 64

Triton CC (ret) 2.4-GHz Intel Nehalems 8 48

Trestles (NSF) 2.4-GHz AMD Magny-Cours 32 64

Triton PDAF 2.5-GHz AMD Shanghais 32 512

Page 13: Bioinformatics Meets Big Data - Monash University · read mapping & variant calling DNA reads ... in FASTA format Variant calling: GATK, … Variants: SNPs & indels Alignment info

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Computational requirements are substantial for some codes & data sets

Code type, code, Input Output Memory Time

& data set (GB) (GB) (GB) (h) Cores Computer

Pairwise sequence alignment

BWA 0.7.5a 105 129 8 2 16 TSCC

398M 100-bp reads (1 lane)

De novo assembly

SOAPdenovo 1.05 424 77 387 26 16 Triton P+CC

1.7B 100-bp reads (whole genome)

Velvet 1.1.06 35 617 539 9 16 Triton PDAF

562M ≤50-bp reads (Chr 1)

Phylogenetic tree inference

MrBayes 3.1.2h <1 <1 13 194 32 Gordon

AA data, 73 taxa, 10.4k patterns,

3M generations (HL)

RAxML 7.2.7 <1 <1 171 106 160 Trestles

AA data, 1.6k taxa, 8.8k patterns,

160 bootstraps + 20 thorough searches (JG2)

Page 14: Bioinformatics Meets Big Data - Monash University · read mapping & variant calling DNA reads ... in FASTA format Variant calling: GATK, … Variants: SNPs & indels Alignment info

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

So how compute- and data-intensive are the bioinformatics analyses we considered?

Compute- Memory- I/O-

Analysis intensive intensive* intensive

Pairwise sequence alignment x

De novo assembly x x x

Tree inference (usually) x

Tree inference (sometimes) x x

* I.e., large memory per node is needed for shared-memory implementations

Here is a qualitative summary

Page 15: Bioinformatics Meets Big Data - Monash University · read mapping & variant calling DNA reads ... in FASTA format Variant calling: GATK, … Variants: SNPs & indels Alignment info

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Computational challenges abound in bioinformatics

• Large amounts of data, which can grow substantially during analysis

• Complex workflows, often with different computational requirements along the way

• Parallelism that varies between steps in the workflow pipeline

• Large shared memory needed for some analyses

Page 16: Bioinformatics Meets Big Data - Monash University · read mapping & variant calling DNA reads ... in FASTA format Variant calling: GATK, … Variants: SNPs & indels Alignment info

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Gateways or portals provide browser interfaces to execute bioinformatics analyses

• Galaxy: executes & documents workflows for

• Read mapping

• Multiple sequence alignment

• Variant calling

• Statistical analysis

• CIPRES: executes codes for

• Multiple sequence alignment

• Phylogenetic tree inference

Page 17: Bioinformatics Meets Big Data - Monash University · read mapping & variant calling DNA reads ... in FASTA format Variant calling: GATK, … Variants: SNPs & indels Alignment info

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

The CIPRES gateway (or portal) lets biologists run phylogenetics codes at SDSC via a browser interface;

http://www.phylo.org/index.php/portal

Page 18: Bioinformatics Meets Big Data - Monash University · read mapping & variant calling DNA reads ... in FASTA format Variant calling: GATK, … Variants: SNPs & indels Alignment info

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Browser interface simplifies access to community codes,

especially for users who only occasionally compute

• Users do not log onto HPC systems & so do not need to learn about Linux, parallelization, or job scheduling

• Users simply use browser interface to

• pick code, select options, & set parameters

• upload sequence data

• Numbers of cores, processes, & threads are selected automatically based on

• input options & parameters

• rules developed from benchmarking

• Occasionally we make special runs not allowed by rules

• In most cases, users do not need individual allocations

• Users still need to understand code options!

Page 19: Bioinformatics Meets Big Data - Monash University · read mapping & variant calling DNA reads ... in FASTA format Variant calling: GATK, … Variants: SNPs & indels Alignment info

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Parallel versions of six phylogenetics codes are available via the CIPRES gateway

Code & version Parallelization Cores Computer

MAFFT 7.037 Pthreads 8 Trestles

BEAST 1.7.5 Pthreads 8 Trestles

GARLI 2.0 MPI ≤32 Trestles

MrBayes 3.1.2h MPI/OpenMP 10 to 32 Gordon

MrBayes 3.2.1 MPI 8 to 16 Gordon

RAxML 7.6.6 MPI/Pthreads 8, 30, Trestles

or 60

RAxML-Light 1.0.9 bash/Pthreads ≤1,000 Trestles

Page 20: Bioinformatics Meets Big Data - Monash University · read mapping & variant calling DNA reads ... in FASTA format Variant calling: GATK, … Variants: SNPs & indels Alignment info

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

RAxML parallel efficiency is >0.5 up to 60 cores for >1,000 patterns*; speedup is superlinear for comprehensive analysis at some core counts;

scalability generally improves with number of patterns

* Number of patterns = number of unique columns in multiple sequence alignment

Page 21: Bioinformatics Meets Big Data - Monash University · read mapping & variant calling DNA reads ... in FASTA format Variant calling: GATK, … Variants: SNPs & indels Alignment info

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Rules for running RAxML on Trestles were developed

based on benchmarking

• Check number of searches specified by -N option

• If -N is not specified,

• Run with 8 Pthreads on 8 cores of a single node in shared queue

• If -N n is specified with n < 50,

• Run with 5 MPI processes & 6 Pthreads on 30 cores of a single node in normal queue

• If -N n is specified with n ≥ 50 or n = auto,

• Run with 10 MPI processes & 6 Pthreads on 60 cores of two nodes in normal queue

Page 22: Bioinformatics Meets Big Data - Monash University · read mapping & variant calling DNA reads ... in FASTA format Variant calling: GATK, … Variants: SNPs & indels Alignment info

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Some operational facts & considerations

• >100 jobs are usually running; a July 3 snapshot showed

• 66 MrBayes jobs using 920 cores on Gordon

• 79 BEAST jobs using 632 cores on Trestles

• 14 RAxML jobs using 896 cores on Trestles

• 1 GARLI job using 32 cores on Trestles

• Jobs are run on both systems to distribute load

• ~15% of load on Trestles is from CIPRES gateway jobs

• Jobs can run a long time; allowable limits are

• 168 hours (1 week) on Gordon

• 334 hours (2 weeks) on Trestles

• I/O is done via NFS (/projects), not Lustre (/oasis)

• BEAST & MrBayes output frequent, small updates to log files

• This can overwhelm the Lustre metadata servers

Page 23: Bioinformatics Meets Big Data - Monash University · read mapping & variant calling DNA reads ... in FASTA format Variant calling: GATK, … Variants: SNPs & indels Alignment info

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

The CIPRES gateway has been extremely popular

• >6,000 users have run on TeraGrid/XSEDE supercomputers

• ~173,000 jobs were run & ~29M Trestles SUs were used thru Feb 2013

• >600 publications have been enabled by CIPRES use

Year

12 24 36

Us

ers

/Mo

nth

200

400

600

800

Total Users

Repeat Users

New Users

2010 2011 2012 2013

Page 24: Bioinformatics Meets Big Data - Monash University · read mapping & variant calling DNA reads ... in FASTA format Variant calling: GATK, … Variants: SNPs & indels Alignment info

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Most CIPRES gateway jobs are submitted from US, but many come from elsewhere

• Screen shot shows locations of 1,000 consecutive user logons as

of April 20, 2011

• Highlighted dots show users online

Page 25: Bioinformatics Meets Big Data - Monash University · read mapping & variant calling DNA reads ... in FASTA format Variant calling: GATK, … Variants: SNPs & indels Alignment info

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Protected clover fern in Azores was shown to be an invasive species from Australia introduced from the US

• RAxML & MrBayes analyses were done via CIPRES gateway

• H. Schaefer, M.A. Carine, & F.J. Rumsey, “From European Priority Species to Invasive Weed: Marsilea azorica (Marsileaceae) is a Misidentified Alien,” Systematic Biology, v. 36, pp. 845-853 (2011)

Page 26: Bioinformatics Meets Big Data - Monash University · read mapping & variant calling DNA reads ... in FASTA format Variant calling: GATK, … Variants: SNPs & indels Alignment info

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Questions for you

• How many bytes of DNA sequence data are typically needed to analyze an entire human genome?

• 500 GB to 1 TB

• What kind of analysis can need >250 GB shared memory?

• De novo assembly of short reads

• What kind of analysis can take a week or more to run?

• Phylogenetic tree inference

• How do users typically access a gateway?

• Through a browser interface

Page 27: Bioinformatics Meets Big Data - Monash University · read mapping & variant calling DNA reads ... in FASTA format Variant calling: GATK, … Variants: SNPs & indels Alignment info

SAN DIEGO SUPERCOMPUTER CENTER

at the UNIVERSITY OF CALIFORNIA, SAN DIEGO

Questions for me?

Moki Dugway in Utah Mt Haeckel in California

Part of 107-mi bicycle ride

on May 10, 2013

First ascent of year

on May 27, 2013