AGBT Tutorial Workshop 15th February, 2012
Large Scale Resequencing: Approaches and Challenges
Thomas Keane
Vertebrate Resequencing Informatics Group
Wellcome Trust Sanger Institute, Hinxton, Cambridge, UK
thomas.keane@sanger.ac.uk
[Figure: Sanger total sequence (2007-2009), in Gbp]
[Figure: Sanger total sequence to date, in Gbp]
Vertebrate Resequencing Informatics Group
Established in 2008 with Jim Stalker
PIs: Richard Durbin and David Adams
Initial projects
1000 Genomes Project (http://www.1000genomes.org)
• Data processing, releases, aligner evaluation, sequencing
• Pilot 2008-2009: ~5 Tbp (Nature 2010;467)
• Phase 1 2009-2011: ~30 Tbp
• Phase 2 2011-: ~36.9 Tbp (low coverage, Illumina only)
Mouse Genomes Project (http://www.sanger.ac.uk/mousegenomes)
• Sequencing 17 laboratory mouse strains
• SNPs, indels, SVs, de novo assembly
• ~1.2 Tbp (Nature 2011;477)
UK10K
Investigating the role of rare genetic variants in health and disease
Whole genome cohorts: 4,000 individuals across two well-established and deeply phenotyped UK cohorts with ongoing longitudinal phenotype collection
• TWINSUK – 2,000
• ALSPAC – 2,000
• 6x (18 Gbp) per sample
Exomes: 6,000 exomes from 3 sets of extreme phenotype individuals
• Neurodevelopmental diseases – 3,000 (e.g. schizophrenia, autism spectrum disorders)
• Obesity – 2,000 (e.g. severe childhood onset obesity)
• Rare diseases – 1,000 (e.g. severe insulin resistance, congenital heart disease, ciliopathies)
• 5 Gbp per sample
Expect to generate ~100 Tbp by end of 2012
• ~40 Tbp from BGI
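As a rough check on the ~100 Tbp figure, the per-sample yields quoted for UK10K can be multiplied out. The sketch below uses only the sample counts and per-sample Gbp from the slide and assumes decimal units (1 Tbp = 1000 Gbp); the function and variable names are illustrative.

```python
# Sample counts and per-sample yields as quoted on the UK10K slide.
WGS_SAMPLES = 4000
WGS_GBP_PER_SAMPLE = 18      # 6x whole-genome coverage
EXOME_SAMPLES = 6000
EXOME_GBP_PER_SAMPLE = 5


def total_tbp(samples, gbp_per_sample):
    """Total yield in Tbp, assuming 1 Tbp = 1000 Gbp."""
    return samples * gbp_per_sample / 1000


wgs_tbp = total_tbp(WGS_SAMPLES, WGS_GBP_PER_SAMPLE)
exome_tbp = total_tbp(EXOME_SAMPLES, EXOME_GBP_PER_SAMPLE)
project_tbp = wgs_tbp + exome_tbp

print(f"Whole genomes: {wgs_tbp} Tbp")
print(f"Exomes:        {exome_tbp} Tbp")
print(f"Total:         {project_tbp} Tbp")
```

The whole genomes account for 72 Tbp and the exomes for 30 Tbp, which is consistent with the ~100 Tbp expectation.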
Current Status
Recently passed the 1000 Genomes Project in terms of total Gbp
What are the challenges?
NGS
Storage
Compute Power
Software/Workflows
Data Production Workflow
[Diagram: lane-level data flow, merging up to a freeze]
• Import: Fastq files
• Alignment (bwa, smalt etc.): lane-level BAMs
• Improvement: improved lane-level BAMs
• Merge up: library merge, then sample/platform merge (samples shown: NA34842, NA87465)
• Freeze
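The merge-up step can be pictured as a two-level grouping: lane BAMs are grouped by library, and libraries by sample. The sketch below only builds such a merge plan in memory; it is not the group's pipeline code. The file names and library names are hypothetical, and only the two sample names come from the slide.

```python
from collections import defaultdict

# Hypothetical lane records: (lane BAM, library, sample).
# Only the sample names NA34842 and NA87465 appear on the slide.
LANES = [
    ("lane1.bam", "libA", "NA34842"),
    ("lane2.bam", "libA", "NA34842"),
    ("lane3.bam", "libB", "NA34842"),
    ("lane4.bam", "libC", "NA87465"),
]


def merge_plan(lanes):
    """Group lane BAMs by library, then libraries by sample."""
    plan = defaultdict(lambda: defaultdict(list))
    for bam, library, sample in lanes:
        plan[sample][library].append(bam)
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {sample: dict(libs) for sample, libs in plan.items()}


plan = merge_plan(LANES)
for sample, libs in plan.items():
    for library, bams in libs.items():
        print(f"{sample}: library merge {library} <- {', '.join(bams)}")
    print(f"{sample}: sample merge <- {', '.join(libs)}")
```

Keeping the plan as plain nested dictionaries makes it easy to see which merges depend on which lane-level files before any job is submitted.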
Data Production Workflow
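[Diagram: cross-sample BAMs and variant calling]
• Merge across: per-sample BAMs (NA19294, NA18943, NA19305, ..., NA19309) are merged into cross-sample BAMs, one per chromosome (Chr1, Chr2, Chr3, ...), with each sample kept as a read group (RG:NA19294, RG:NA18943, RG:NA19305, ...)
• Variant calling, SNPs/indels: samtools, GATK, VQSR, BEAGLE/Impute2
• Variant calling, structural variants: Genome STRiP, SVMerge
• Final VCF
• VEP annotation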
Storage Challenges
Expect ~200 Tbp of sequence in 2011-2012
• Working estimate including processing, release, and variant calling: 10 bytes per bp
Storage considerations
• Scalability – can we easily add more storage units?
• Backup and disaster recovery – what do we really need to keep?
• Performance – sufficient I/O throughput to serve compute nodes
• Cost
Data formats
• Standardised formats – BAM & VCF
Minimise the number of copies
• Aim for two copies at most – original lanes + release (stripped) BAM
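The working estimate translates directly into a storage requirement. A minimal sketch, using the two figures from the slide (~200 Tbp, 10 bytes per bp) and assuming decimal prefixes; the binary (PiB) figure is shown only for comparison.

```python
# Figures from the slide.
SEQUENCE_TBP = 200     # expected sequence for 2011-2012
BYTES_PER_BP = 10      # working estimate incl. processing, release, calling


def storage_bytes(tbp, bytes_per_bp):
    """Bytes needed for `tbp` terabases at `bytes_per_bp` bytes per base."""
    return tbp * 10**12 * bytes_per_bp


total_bytes = storage_bytes(SEQUENCE_TBP, BYTES_PER_BP)
petabytes = total_bytes / 10**15   # decimal petabytes
pebibytes = total_bytes / 2**50    # binary pebibytes

print(f"{total_bytes:,} bytes = {petabytes} PB (about {pebibytes:.2f} PiB)")
```

At 10 bytes per bp, ~200 Tbp works out to about 2 PB, which is why the number of copies kept matters so much.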
A Tiered Storage Solution
[Diagram: three storage levels linked to the CPU farm at 3 Gb/sec and 800 Mb/sec, with off-site copies; each level is annotated with a relative cost and size]
Level 1
• Data: current release vertical BAMs
• Processes: BAM merging + splitting, variant calling (SNPs, indels, SVs)
Level 2
• Data: lane-level BAMs
• Processes: alignment, recalibration, local realignment
Level 3
• Data: previous release BAMs + variant calls backup
Data release + archiving: iRODS
Integrated Rule-Oriented Data System
• Open source – origins in the particle physics world
• Most important feature of iRODS is the Rule Engine
• Akin to a source control system
Customise own application-level metadata
• e.g. run, lane, plex, sample, library…
Stores/searches key-value metadata on files
• List all files from UK10K studies:
  imeta -z seq qu -d study like 'UK10K_%'
  /seq/5363/5363_1.bam
  /seq/5363/5363_2.bam
  (... and a whole lot more)
• Get metadata about a file:
  imeta ls -d /seq/6534/6534_3#7.bam sample
  attribute: sample
  value: QTL191953
Sanger production: BAM files from runs, per lane and per plex, are deposited (BMC Bioinformatics 2011, 12:361)
Recently adopted for UK10K internal data release and archiving
• Users use metadata queries to find their data
• Files can be part of multiple releases
[Diagram: storage servers nfs01, nfs02, nfs03 and nfs20, iRODS, and an off-site copy]
http://www.irods.org
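The key-value metadata idea can be illustrated without an iRODS server. The sketch below is a small in-memory stand-in, not the iRODS API and not how imeta is implemented: it assumes SQL-style wildcards for the "like" query (% for any run of characters, _ for any single character). The file paths and the sample value come from the slide; the study names and the function names are hypothetical.

```python
import re

# Hypothetical in-memory catalogue: file path -> key-value metadata.
# Study names are made up; the third entry has no study attribute at all.
CATALOGUE = {
    "/seq/5363/5363_1.bam": {"study": "UK10K_EXAMPLE_A"},
    "/seq/5363/5363_2.bam": {"study": "UK10K_EXAMPLE_B"},
    "/seq/6534/6534_3#7.bam": {"sample": "QTL191953"},
}


def like_to_regex(pattern):
    """Translate a SQL-style LIKE pattern into a compiled regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")      # any run of characters
        elif ch == "_":
            parts.append(".")       # exactly one character
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def query(catalogue, attribute, pattern):
    """Return sorted paths whose `attribute` value matches the LIKE pattern."""
    rx = like_to_regex(pattern)
    return sorted(
        path
        for path, meta in catalogue.items()
        if attribute in meta and rx.fullmatch(meta[attribute])
    )


def lookup(catalogue, path, attribute):
    """Return one metadata value for a file, or None if it is not set."""
    return catalogue[path].get(attribute)


uk10k_files = query(CATALOGUE, "study", "UK10K_%")
sample = lookup(CATALOGUE, "/seq/6534/6534_3#7.bam", "sample")
print(uk10k_files)
print(sample)
```

Because files are found by attribute rather than by path, the same file can belong to several releases simply by carrying more than one matching value.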
Compute Pipeline Management: VRPipe
VRPipe
• Managed and automated execution of sequences of arbitrary software against massive datasets across large compute clusters
• Error handling, optimal memory requests, batching of jobs, retrying failures, failure reporting, highly extendable, detailed job statistics
1000 Genomes Phase 2 processed through VRPipe
• Tracked ~1 million jobs
• Total serial wall time: 9886 days, 3 hrs, 43 mins, 25 secs
• bwa_aln_fastq: ~2443 days total serial wall time
• Mean memory: 941 MB/job (max 5637)
2012
• Fully migrate all NGS processes to VRPipe (data processing, SNP/indel/SV variant calling, and RNA-seq/ChIP-Seq pipelines)
• Management front-ends
• Create distributable VM for cloud rollout
http://www.github.com/VertebrateResequencing/vr-pipe/wiki
sb10@sanger.ac.uk
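The job statistics quoted for 1000 Genomes Phase 2 give a feel for the workload shape. A minimal sketch, treating "~1 million jobs" as exactly 1,000,000 (an assumption, since the slide gives only an approximate count), so the derived numbers are approximate too.

```python
from datetime import timedelta

# Figures from the slide; the job count is approximate there.
TOTAL_WALL = timedelta(days=9886, hours=3, minutes=43, seconds=25)
BWA_ALN_WALL = timedelta(days=2443)
JOBS = 1_000_000

mean_job = TOTAL_WALL / JOBS            # timedelta / int -> timedelta
bwa_share = BWA_ALN_WALL / TOTAL_WALL   # timedelta / timedelta -> float

# Roughly 14 minutes per job on average.
print(f"Mean serial wall time per job: {mean_job}")
# bwa_aln_fastq accounts for about a quarter of the total.
print(f"bwa_aln_fastq share of wall time: {bwa_share:.1%}")
```

An average job of about a quarter of an hour, repeated a million times, is the kind of workload where batching and automatic retries pay off.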
Even more scale up in 2012 – HiSeq 2500
Currently takes 1-2 weeks to sequence a human genome
High-depth human genomes in a single day – Illumina HiSeq 2500
• Caucasian family with a severe T-cell deficiency in affected sibling
• Single run on HiSeq 2500 by Illumina per individual

Sample    PF yield (Gbp)  % align  % ≥Q30  Mismatch R1 (%)  Mismatch R2 (%)  Run time (hrs)
Father    117.7           89       92.6    0.4              0.5              25.5
Mother    125.7           90.2     92.8    0.4              0.5              25.5
Affected  124.4           90.3     92.4    0.4              0.5              25.5
What does the data look like?
Upcoming Changes in 2012
We cannot keep all of the data
• 2007-2008: Keep everything including images from runs
• 2009: BAM/Fastq – all of the base quality information
• 2010-2011: Stripping original qualities and other unused tags
• 2012-: Current formats contain lots of repetition
Options for 2012 onwards
• Reference-based compression
• Reducing quality information, e.g. quality binning or quality budgets
• Potential formats: CRAM and/or Reduced BAM
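Quality binning, one of the options listed above, replaces each base quality with a representative value for its bin, so fewer distinct symbols remain to be stored. The sketch below is a generic illustration: the bin edges and representative values are hypothetical and are not taken from the slides or from any particular format.

```python
import bisect

# Hypothetical binning scheme: Phred scores 0-9, 10-19, 20-29 and 30+
# are each replaced by a single representative value.
BIN_EDGES = [10, 20, 30]
BIN_VALUES = [5, 15, 25, 35]


def bin_quality(q):
    """Map one Phred score to the representative value of its bin."""
    return BIN_VALUES[bisect.bisect_right(BIN_EDGES, q)]


def bin_read(quals):
    """Bin every quality score in a read."""
    return [bin_quality(q) for q in quals]


quals = [38, 37, 12, 2, 29, 30, 33, 8]
binned = bin_read(quals)

print("original:", quals, "distinct values:", len(set(quals)))
print("binned:  ", binned, "distinct values:", len(set(binned)))
```

The step is lossy by design: the original scores cannot be recovered, which is the trade-off against the smaller alphabet a compressor then has to encode.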
CRAM Format
[Figure: quality treatments for the read TGAGCTCTAAGTACC. Labels: do nothing, lossless, quality lossy (horizontal, vertical). Quality strings shown: 329183050298757, -2---30---9---7, 002020010022212]
[Figure: CRAM models for compression and CRAM current performance, log scale from 0.1 to 100. Series: untreated, CRAM substitutions/insertions model, CRAM lossless, CRAM combination model]
CRAM v0.6 released 13.2.12:
• Pairing information preservation regardless of distance
• Revised and improved lossless mode
• Option to preserve all unmapped reads
• Performance and bug fixes
• Arbitrary tags
http://www.ebi.ac.uk/ena/about/cram_toolkit
Source: Ewan Birney/Guy Cochrane, EBI
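The idea behind reference-based compression can be shown with a toy encoder: an aligned read is stored as its start position, its length, and only the bases that differ from the reference. This is a sketch of the general principle, not the CRAM format itself; it handles substitutions only (no insertions, deletions, qualities or tags). The reference string is hypothetical and the read is the slide's example read with one base changed for illustration.

```python
# Hypothetical reference; it contains the slide's example read at offset 4.
REFERENCE = "ACGTTGAGCTCTAAGTACCGGA"


def encode(read, ref, pos):
    """Store start, length, and only the bases that differ from the reference."""
    diffs = [(i, base) for i, base in enumerate(read) if ref[pos + i] != base]
    return pos, len(read), diffs


def decode(record, ref):
    """Rebuild the read from the reference plus the stored differences."""
    pos, length, diffs = record
    bases = list(ref[pos:pos + length])
    for i, base in diffs:
        bases[i] = base
    return "".join(bases)


# Slide read TGAGCTCTAAGTACC with the base at offset 8 changed from A to T.
read = "TGAGCTCTTAGTACC"
record = encode(read, REFERENCE, 4)
restored = decode(record, REFERENCE)

print(record)
print(restored == read)
```

A read that matches the reference exactly stores an empty difference list, which is where most of the saving comes from when reads are highly similar to the reference.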
URLs
• VRPipe: https://github.com/VertebrateResequencing/vr-pipe
• iRODS@Sanger: BMC Bioinformatics 2011, 12:361
• http://www.slideshare.net/thomaskeane
Any questions?
Richard Durbin
David Adams