Structural variation calling with the SVMerge pipeline. See http://svmerge.sourceforge.net
Vertebrate Resequencing Informatics 22nd March, 2011
Enhanced structural variant and breakpoint detection using SVMerge by integration of multiple detection methods and local assembly
Kim Wong/Thomas Keane Vertebrate Resequencing Informatics
http://svmerge.sourceforge.net
Genomic Structural Variation
Large DNA rearrangements (>100 bp)
Frequent causes of disease
  Referred to as genomic disorders
  Mendelian diseases or complex traits such as behaviors
  E.g. increase in gene dosage due to increase in copy number
  Prevalent in cancer genomes
Many types of genomic structural variation (SV)
  Insertions, deletions, copy number changes, inversions, translocations and complex events
Comparative genomic hybridization (CGH) traditionally used for copy number discovery
  CNVs of 1–50 kb in size have been under-ascertained
Next-gen sequencing revolutionised the field of SV discovery
  Parallel sequencing of the ends of large numbers of DNA fragments
  Examine alignment distance of reads to discover presence of genomic rearrangements
  Resolution down to ~100 bp
Deletion
SV Visualisation
  LookSeq viewer
  Read pairs displayed
  Y axis is aligned insert size
Deletions are easily spotted
  Read pairs are mapped further apart than expected
  Coverage is zero across the deletion sequence
Deletion in NOD/ShiLtJ
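The insert-size signal described on this slide can be illustrated with a minimal sketch. It assumes the aligned insert sizes have already been extracted from the BAM file. The function name `flag_deletion_pairs`, the `n_sd` cutoff of 4 standard deviations, and the example numbers are all hypothetical and are not taken from SVMerge or any of its callers.

```python
from statistics import mean, stdev


def flag_deletion_pairs(insert_sizes, library_sizes, n_sd=4.0):
    """Return indices of read pairs mapped further apart than expected.

    insert_sizes:  aligned insert sizes of the pairs under inspection
    library_sizes: insert sizes of normally mapped pairs from the same library
    n_sd:          hypothetical cutoff, in standard deviations above the mean
    """
    mu = mean(library_sizes)
    sd = stdev(library_sizes)
    cutoff = mu + n_sd * sd
    return [i for i, size in enumerate(insert_sizes) if size > cutoff]


# Made-up library with a mean insert size of 400 bp
library = [380, 400, 420, 390, 410, 405, 395, 400]
# Two of these pairs span a made-up ~2 kb deletion
observed = [402, 2450, 398, 2390]
print(flag_deletion_pairs(observed, library))  # [1, 3]
```

A real caller would also cluster the flagged pairs by position and check the drop in coverage between them. This sketch only covers the first step.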
Inversion
Mate pairs align in the same orientation
Coverage zero at breakpoints
Insertion
One end mapped reads
Coverage zero at breakpoint
Complex SV Events
[Figure panels: Insertion, Insertion, Inversion]
Human Examples
Stankiewicz and Lupski (2010) Ann. Rev. Med.
Example 2: Transposable element insertion in mice
SVMerge
Initially developed for the mouse genomes project
Several software packages currently available to discover SVs
  Various approaches using information from anomalously mapped read pairs OR read depth analysis
No single SV caller is able to detect the full range of structural variants
  Paired-end mapping information, for example, cannot detect SVs where the read pairs do not flank the SV breakpoints
  Insertion calls made using the split-mapping approach are also size-limited because the whole insertion breakpoint must be contained within a read
  Read-depth approaches can identify copy number changes without the need for read-pair support, but cannot find copy number neutral events
SVMerge, a meta SV calling pipeline, which makes SV predictions with a collection of SV callers
  Input is a BAM file per sample
  Run callers individually + outputs sanitized into standard BED format
  SV calls merged, and computationally validated using local de novo assembly
  Primarily a SV discovery/calling + validation tool
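The merge step can be sketched as follows. This is only an illustration of combining BED-style calls from several callers, not SVMerge's actual merging logic. The tuple layout, the function name `merge_calls`, and the rule "merge calls of the same type that overlap by at least one base" are assumptions made for the example.

```python
from collections import defaultdict


def merge_calls(calls):
    """Merge overlapping calls of the same SV type on the same chromosome.

    calls: iterable of (chrom, start, end, sv_type, caller) tuples using
           BED-style half-open coordinates (hypothetical layout).
    Returns a list of (chrom, start, end, sv_type, sorted_caller_names).
    """
    groups = defaultdict(list)
    for chrom, start, end, sv_type, caller in calls:
        groups[(chrom, sv_type)].append((start, end, caller))

    merged = []
    for (chrom, sv_type), items in sorted(groups.items()):
        items.sort()
        cur_start, cur_end, callers = items[0][0], items[0][1], {items[0][2]}
        for start, end, caller in items[1:]:
            if start < cur_end:  # overlaps the interval being built
                cur_end = max(cur_end, end)
                callers.add(caller)
            else:  # no overlap: emit the finished interval and start a new one
                merged.append((chrom, cur_start, cur_end, sv_type, sorted(callers)))
                cur_start, cur_end, callers = start, end, {caller}
        merged.append((chrom, cur_start, cur_end, sv_type, sorted(callers)))
    return merged


# Made-up calls from two callers
raw = [
    ("chr1", 1000, 2000, "DEL", "BreakDancerMax"),
    ("chr1", 1500, 2100, "DEL", "Pindel"),
    ("chr1", 5000, 5600, "DEL", "Pindel"),
    ("chr1", 1200, 1800, "INV", "BreakDancerMax"),
]
for call in merge_calls(raw):
    print(call)
```

Keeping the list of supporting callers on each merged call makes it easy to see later which events were found by more than one method.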
Local Assembly Validation
Key to the approach is the computational validation step
  Local assembly and breakpoint refinement
  All SV calls (except those lacking read pair support e.g. CNG/CNL)
Algorithm
  Gather mapped reads, and any unmapped mate-pairs (<1 kb of an insertion breakpoint, <2 kb of all other SV types)
  Run local velvet assembly
  Realign the contigs produced with exonerate
  Detect contig breaks proximal to the breakpoint(s)
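The read-gathering step can be illustrated with a small sketch that computes the reference windows from which reads would be collected. The 1 kb and 2 kb flanks come from the slide. The function name `assembly_windows`, the `"INS"` type label, and the choice to build one window per breakpoint are assumptions for illustration and may not match the SVMerge implementation.

```python
def assembly_windows(sv_type, breakpoints, chrom_length):
    """Return (start, end) windows around each predicted breakpoint.

    sv_type:      hypothetical type label; "INS" marks an insertion
    breakpoints:  predicted breakpoint positions on one chromosome
    chrom_length: chromosome length, used to clip windows at the ends
    """
    # Flank sizes from the slide: <1 kb for insertions, <2 kb otherwise
    flank = 1000 if sv_type == "INS" else 2000
    windows = []
    for pos in breakpoints:
        start = max(0, pos - flank)
        end = min(chrom_length, pos + flank)
        windows.append((start, end))
    return windows


print(assembly_windows("INS", [500], 10_000))         # [(0, 1500)]
print(assembly_windows("DEL", [5000, 9000], 10_000))  # [(3000, 7000), (7000, 10000)]
```

Reads mapped inside these windows, together with their unmapped mates, would then be passed to the assembler.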
Breakpoint Improvement (Real data)
Yalcin and Wong et al., in prep
Application to HapMap trio dataset
High-depth HapMap trio (NA18506, NA18507, NA18508)
  42x, 42x and 40x
Reads processed through Vert. Reseq. Pipeline
  Aligned to the GRCh37 human reference using BWA
  Single BAM file for each individual
BreakDancerMax, Pindel, RDXplorer, SECluster, and RetroSeq
Exclude calls
  Within 600 bp of a reference sequence gap
  Within 1 Mb of a centromere or telomere
Computational validation of raw candidate calls
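The exclusion rules above can be sketched as a simple distance filter. The 600 bp and 1 Mb margins come from the slide. The function names, the tuple layout, and the example coordinates are hypothetical, and real gap and centromere/telomere annotations would have to be supplied separately.

```python
def distance_to_interval(start, end, feat_start, feat_end):
    """Distance in bp between two half-open intervals; 0 if they overlap."""
    if end <= feat_start:
        return feat_start - end
    if feat_end <= start:
        return start - feat_end
    return 0


def keep_call(call, gaps, cen_tel, gap_margin=600, cen_tel_margin=1_000_000):
    """Return False if the call lies too close to an excluded feature.

    call:    (chrom, start, end)
    gaps:    list of (chrom, start, end) reference sequence gaps
    cen_tel: list of (chrom, start, end) centromere/telomere regions
    """
    chrom, start, end = call
    for features, margin in ((gaps, gap_margin), (cen_tel, cen_tel_margin)):
        for f_chrom, f_start, f_end in features:
            if f_chrom == chrom and distance_to_interval(start, end, f_start, f_end) <= margin:
                return False
    return True


# Made-up annotation and calls, for illustration only
gaps = [("chr1", 10_000, 20_000)]
cen_tel = [("chr1", 121_000_000, 125_000_000)]
calls = [
    ("chr1", 20_500, 21_000),            # 500 bp from the gap
    ("chr1", 30_000, 31_000),            # far from both features
    ("chr1", 120_500_000, 120_600_000),  # 400 kb from the centromere
]
print([keep_call(c, gaps, cen_tel) for c in calls])  # [False, True, False]
```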
How do the calls measure up?
Compared the overlap of the deletion, gain, and inversion calls against the curated Database of Genomic Variants
  Overlapped with calls in DGV at a rate significantly higher than expected by random chance
  Deletions in DGV: 71% (NA18506), 81% (NA18507), and 71% (NA18508)
  Copy number gains in DGV: 29% (NA18506), 32% (NA18507), and 36% (NA18508)
  Inversions in DGV: 47% (NA18506), 69% (NA18507), and 51% (NA18508)
Child calls not in DGV also called in the parents
  Further 18% deletions, 32% inversions, 54% duplications
  Estimated max. false positive rate of 11%, 21%, and 17%
All child-only SV calls comprise 11% of the child's final SV call set
  Considerable improvement from 'merged raw' (50% unique)
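An overlap rate of the kind reported above can be computed roughly as follows. The slide does not state the overlap criterion used against DGV, so this sketch assumes that any overlap of at least one base counts as a hit. The function names and example coordinates are hypothetical.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def fraction_in_catalogue(calls, catalogue):
    """Fraction of calls overlapping at least one catalogue entry."""
    if not calls:
        return 0.0
    hits = sum(1 for call in calls if any(overlaps(call, ref) for ref in catalogue))
    return hits / len(calls)


# Made-up calls and catalogue entries, for illustration only
calls = [
    ("chr1", 100, 200),
    ("chr1", 500, 600),
    ("chr2", 100, 200),
    ("chr2", 900, 950),
]
catalogue = [
    ("chr1", 150, 300),
    ("chr2", 0, 50),
    ("chr2", 920, 1000),
]
print(fraction_in_catalogue(calls, catalogue))  # 0.5
```

The same function could be applied to the child's calls against each parent's call set to estimate how many child calls are also seen in a parent.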
Complex SV Types
Yalcin and Wong et al., in prep
Future Work
SVMerge primarily a discovery and validation tool
  Extensible pipeline so that calls from any method can be easily incorporated
  Developed primarily for mouse genomes project
  Successfully applied to human trio dataset
  Computational validation approach reduces false positives
Complex SVs
  Cataloging repeating combinations of multiple SV events in small loci
2011 development
  Low coverage cross-population SV discovery
  Genotyping existing SVs in new samples
  Better support for heterozygous calls
  Integration of SVMerge into Vert. Reseq. pipeline for UK10K