Large Scale Resequencing: Approaches and Challenges

AGBT Tutorial Workshop 15th February, 2012

Thomas Keane, Vertebrate Resequencing Informatics group, Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK

thomas.keane@sanger.ac.uk

Figure: Sanger total sequence, 2007-2009

Figure: Sanger total sequence to date

Vertebrate Resequencing Informatics Group

• Established in 2008 with Jim Stalker
• PIs: Richard Durbin and David Adams

Initial projects

1000 Genomes Project (http://www.1000genomes.org)
• Data processing, releases, aligner evaluation, sequencing
• Pilot 2008-2009: ~5Tbp (Nature 2010;467)
• Phase 1 2009-2011: ~30Tbp
• Phase 2 2011-: ~36.9Tbp (low-coverage Illumina only)

Mouse Genomes Project (http://www.sanger.ac.uk/mousegenomes)
• Sequencing 17 laboratory mouse strains
• SNPs, indels, SVs, de novo assembly
• ~1.2Tbp (Nature 2011;477)

Investigating the role of rare genetic variants in health and disease

Whole genome cohorts: 4,000 individuals across two well-established and deeply phenotyped UK cohorts with ongoing longitudinal phenotype collection:
• TWINSUK – 2,000
• ALSPAC – 2,000
• 6x (18Gbp) per sample

Exomes: 6,000 exomes from 3 sets of extreme phenotype individuals
• Neurodevelopmental diseases – 3,000, e.g. schizophrenia, autism spectrum disorders
• Obesity – 2,000, e.g. severe childhood onset obesity
• Rare diseases – 1,000, e.g. severe insulin resistance, congenital heart disease, ciliopathies
• 5Gbp per sample

Expect to generate ~100Tbp by end of 2012
• ~40Tbp from BGI
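The ~100Tbp figure can be checked against the per-sample targets above. A quick sketch using only the sample counts and per-sample yields quoted on the slide (the variable names are illustrative):

```python
# Sample counts and per-sample yields as quoted on the slide.
GBP_PER_TBP = 1000

whole_genome_samples = 2000 + 2000             # TWINSUK + ALSPAC
whole_genome_gbp = whole_genome_samples * 18   # 6x coverage, 18 Gbp per sample

exome_samples = 3000 + 2000 + 1000             # neurodevelopmental + obesity + rare diseases
exome_gbp = exome_samples * 5                  # 5 Gbp per sample

total_tbp = (whole_genome_gbp + exome_gbp) / GBP_PER_TBP
print(f"Whole genomes: {whole_genome_gbp / GBP_PER_TBP:.0f} Tbp")
print(f"Exomes: {exome_gbp / GBP_PER_TBP:.0f} Tbp")
print(f"Total: {total_tbp:.0f} Tbp")
```

The whole genomes account for roughly 72 Tbp and the exomes for roughly 30 Tbp, which is in line with the ~100Tbp expectation.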

Current Status

Recently passed the 1000 Genomes Project in terms of total Gbp

What are the challenges?

Storage

Compute Power

Software/Workflows

Data Production Workflow

Figure: lane-level production workflow, ending in a freeze.
• Import: Fastq files per lane
• Alignment (bwa, smalt etc.): one BAM per lane
• Improvement: improved lane BAMs
• Merge up: library merge (lane BAMs into library BAMs), then sample/platform merge (sample labels in the figure: NA34842, NA87465)
• Freeze
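The merge-up step can be thought of as a two-level grouping: lanes by library, then libraries by sample. A minimal sketch of planning those merges, assuming each lane BAM carries a library and a sample label (the lane and library names below are hypothetical; only the sample IDs come from the slide, and this is not the group's actual pipeline code):

```python
from collections import defaultdict

# Hypothetical lane records: (lane BAM, library, sample).
lanes = [
    ("lane1.bam", "lib_A", "NA34842"),
    ("lane2.bam", "lib_A", "NA34842"),
    ("lane3.bam", "lib_B", "NA34842"),
    ("lane4.bam", "lib_C", "NA87465"),
]

def plan_merge_up(lane_records):
    """Group lane BAMs by library, then libraries by sample."""
    library_to_lanes = defaultdict(list)
    sample_to_libraries = defaultdict(set)
    for bam, library, sample in lane_records:
        library_to_lanes[library].append(bam)
        sample_to_libraries[sample].add(library)
    # Sort library names so the plan is deterministic.
    return dict(library_to_lanes), {s: sorted(libs) for s, libs in sample_to_libraries.items()}

library_merges, sample_merges = plan_merge_up(lanes)
for library, bams in library_merges.items():
    print(f"library merge {library}: {bams}")
for sample, libraries in sample_merges.items():
    print(f"sample merge {sample}: {libraries}")
```

The plan only lists which files belong together; the merging itself would be done by a BAM tool of choice.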

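Data Production Workflow

Figure: cross-sample merging and variant calling.
• Merge across: sample BAMs (NA19294, NA18943, NA19305, ..., NA19309) are merged into cross-sample BAMs per chromosome (Chr1, Chr2, Chr3, ...), with samples tracked as read groups (RG:NA19294, RG:NA18943, RG:NA19305)
• Variant calling, SNPs/indels: samtools, GATK, BEAGLE/Impute2
• Variant calling, SVs: Genome STRiP, SVMerge
• Final VCF
• VEP annotation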

Storage Challenges

Expect ~200Tbp of sequence in 2011-2012
• Working estimate including processing, release, and variant calling: 10 bytes per bp
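Taken together, the two numbers above imply a storage footprint of about 2 PB. A short worked calculation, assuming decimal units (1 Tbp = 10^12 bp, 1 PB = 10^15 bytes):

```python
TBP = 10**12                 # base pairs in one Tbp (decimal units assumed)
PB = 10**15                  # bytes in one petabyte (decimal units assumed)

expected_bp = 200 * TBP      # ~200Tbp expected in 2011-2012
bytes_per_bp = 10            # working estimate from the slide

total_bytes = expected_bp * bytes_per_bp
total_pb = total_bytes / PB
print(f"Estimated storage: {total_pb:.1f} PB")
```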

Storage considerations
• Scalability: can we easily add more storage units?
• Backup and disaster recovery: what do we really need to keep?
• Performance: sufficient I/O throughput to serve compute nodes
• Cost

Data formats
• Standardised formats: BAM & VCF

Minimise the number of copies
• Aim for two copies at most: original lanes + release (stripped) BAM

A Tiered Storage Solution

Figure: tiered storage diagram (labels: Off-site, Farm; link speeds: 3Gb/sec and 800Mb/sec)

Level 1
• Data: current release vertical BAMs
• Processes: BAM merging + splitting, variant calling (SNPs, indels, SVs)

Level 2
• Data: lane-level BAMs
• Processes: alignment, recalibration, local realignment

Level 3
• Data: previous release BAMs + variant calls backup

Data release + archiving: iRODS

Rule-Oriented Data management system
• Open source, with origins in the particle physics world
• Most important feature of iRODS is the Rule Engine
• Akin to a source control system

Customise own application-level metadata
• e.g. run, lane, plex, sample, library, ...

Stores/searches key-value metadata on files:
• List all files from UK10K studies:

    imeta -z seq qu -d study like 'UK10K_%'
    /seq/5363/5363_1.bam
    /seq/5363/5363_2.bam
    (.....and a whole lot more)

• Get metadata about a file:

    imeta ls -d /seq/6534/6534_3#7.bam sample
    attribute: sample
    value: QTL191953
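Output like the attribute/value pair above is easy to consume from a script. A minimal sketch, assuming the text consists of `attribute: ...` and `value: ...` lines as shown on the slide; the function name `parse_imeta_ls` is made up for illustration and is not part of iRODS:

```python
def parse_imeta_ls(output: str) -> dict:
    """Collect attribute/value pairs from imeta-ls-style text.

    Assumes each attribute line is followed by its value line;
    any other lines are ignored.
    """
    metadata = {}
    attribute = None
    for line in output.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a "key: value" line
        key, value = key.strip(), value.strip()
        if key == "attribute":
            attribute = value
        elif key == "value" and attribute is not None:
            # An attribute may carry several values, so keep a list.
            metadata.setdefault(attribute, []).append(value)
            attribute = None
    return metadata

# Example text copied from the slide.
example_output = "attribute: sample\nvalue: QTL191953\n"
print(parse_imeta_ls(example_output))
```

In practice the text would come from running the `imeta` command itself; here it is hard-coded so the sketch runs without an iRODS installation.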

Sanger production: BAM files from runs deposited per lane, per plex
• BMC Bioinformatics 2011, 12:361

Recently adopted for UK10K internal data release and archiving
• Users use metadata queries to find their data
• Files can be part of multiple releases

http://www.irods.org

Compute Pipeline Management: VRPipe

VRPipe
• Managed and automated execution of sequences of arbitrary software against massive datasets across large compute clusters
• Error handling, optimal memory requests, batching of jobs, retrying failures, failure reporting, highly extendable, detailed job statistics

1000 Genomes Phase 2 processed through VRPipe
• Tracked ~1 million jobs
• Total serial wall time: 9886 days, 3 hrs, 43 mins, 25 secs
• bwa_aln_fastq: ~2443 days total serial wall time
• Mean memory: 941MB/job (max 5637MB)
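The job statistics above can be turned into per-job figures. A rough calculation, treating "~1 million jobs" as exactly 1,000,000 (an assumption, so the results are approximate):

```python
from datetime import timedelta

# Total serial wall time as quoted on the slide.
total_wall = timedelta(days=9886, hours=3, minutes=43, seconds=25)
jobs = 1_000_000  # "~1 million jobs"; exact count not given

mean_seconds = total_wall.total_seconds() / jobs
bwa_share = 2443 / 9886  # bwa_aln_fastq days over total days

print(f"Mean wall time per job: ~{mean_seconds / 60:.1f} minutes")
print(f"bwa_aln_fastq share of wall time: ~{bwa_share:.0%}")
```

This works out to roughly 14 minutes per job on average, with bwa_aln_fastq accounting for about a quarter of the total wall time.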

2012
• Fully migrate all NGS processes to VRPipe (data processing, SNP/indel/SV variant calling, and RNA-seq/ChIP-Seq pipelines)
• Management front-ends
• Create distributable VM for cloud rollout

http://www.github.com/VertebrateResequencing/vr-pipe/wiki

sb10@sanger.ac.uk

Even more scale up in 2012 – HiSeq 2500

• Currently takes 1-2 weeks to sequence a human genome
• High-depth human genomes in a single day with the Illumina HiSeq 2500
• Caucasian family with a severe T-cell deficiency in affected sibling
• Single run on HiSeq 2500 by Illumina per individual

Sample    PF yield (Gbp)  % Align  % ≥Q30  Mismatch R1 (%)  Mismatch R2 (%)  Run time (hrs)
Father    117.7           89       92.6    0.4              0.5              25.5
Mother    125.7           90.2     92.8    0.4              0.5              25.5
Affected  124.4           90.3     92.4    0.4              0.5              25.5

What does the data look like?

Upcoming Changes in 2012

We cannot keep all of the data
• 2007-2008: Keep everything, including images from runs
• 2009: BAM/Fastq with all of the base quality information
• 2010-2011: Stripping original qualities and other unused tags
• 2012-: Current formats contain lots of repetition
  • Reference-based compression
  • Reducing quality information, e.g. quality binning or quality budgets
  • Potential formats: CRAM and/or Reduced BAM
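Quality binning, mentioned above, replaces each base quality with a representative value for its range, so fewer distinct symbols remain and the data compresses better. A minimal sketch, assuming Phred+33 encoded qualities; the bin boundaries are hypothetical and are not taken from the talk or from any specific tool:

```python
# Hypothetical bins: (lowest quality, highest quality, representative value).
BINS = [(0, 9, 5), (10, 19, 15), (20, 29, 25), (30, 41, 35)]

def bin_quality(q: int) -> int:
    """Map one Phred quality to the representative value of its bin."""
    for low, high, representative in BINS:
        if low <= q <= high:
            return representative
    raise ValueError(f"quality {q} outside binned range")

def bin_quality_string(qualities: str, offset: int = 33) -> str:
    """Bin every quality in a Phred+33 encoded string."""
    return "".join(chr(bin_quality(ord(c) - offset) + offset) for c in qualities)

original = "IIIIHGF@5+#"  # made-up quality string
binned = bin_quality_string(original)
print(binned)
print(len(set(original)), "distinct symbols ->", len(set(binned)))
```

The transformation is lossy: the original qualities cannot be recovered from the binned string, which is the trade-off the slide refers to.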

CRAM Format

Figure: example read TGAGCTCTAAGTACC shown with three quality strings (329183050298757, -2---30---9---7, 002020010022212). Panel labels: Do nothing, Lossless, Quality lossy; Horizontal, Vertical.

CRAM models for compression

Figure: compression comparison. Series labels: CRAM combination, CRAM lossless, Untreated, CRAM substitutions/insertions model.
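The idea behind reference-based compression is to store only where a read aligns and how it differs from the reference. A toy sketch of that idea, assuming ungapped alignments with substitutions only; this is not the CRAM format itself, and the reference string and the substitution are made up around the read shown on the slide:

```python
def encode_read(reference: str, position: int, read: str):
    """Store a read as (position, length, substitutions) relative to the reference.

    Assumes an ungapped alignment; insertions, deletions and clipping are ignored.
    """
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    substitutions = [
        (offset, base)
        for offset, (ref_base, base) in enumerate(zip(window, read))
        if base != ref_base
    ]
    return position, len(read), substitutions

def decode_read(reference: str, record) -> str:
    """Rebuild the read from the reference and the stored differences."""
    position, length, substitutions = record
    bases = list(reference[position:position + length])
    for offset, base in substitutions:
        bases[offset] = base
    return "".join(bases)

reference = "ACGTTGAGCTCTAAGTACCGGA"  # made-up reference containing the slide's read
read = "TGAGCTGTAAGTACC"              # slide's read with one substitution added
record = encode_read(reference, 4, read)
print(record)
assert decode_read(reference, record) == read
```

A read that matches the reference exactly needs only a position and a length, which is where most of the saving comes from.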

CRAM current performance

CRAM v0.6 released 13.2.12:
• Pairing information preservation regardless of distance
• Revised and improved lossless mode
• Option to preserve all unmapped reads
• Performance and bug fixes
• Arbitrary tags

http://www.ebi.ac.uk/ena/about/cram_toolkit
Source: Ewan Birney/Guy Cochrane, EBI

URLs
• VRPipe: https://github.com/VertebrateResequencing/vr-pipe
• iRODS@Sanger: BMC Bioinformatics 2011, 12:361
• http://www.slideshare.net/thomaskeane

Any questions?

Richard Durbin

David Adams
