29
Using the whole read: Structural Variation detection with RPSR Presented by Derek Bickhart

Using the whole read: Structural Variation detection with RPSR

  • Upload
    kateb

  • View
    38

  • Download
    0

Embed Size (px)

DESCRIPTION

Using the whole read: Structural Variation detection with RPSR. Presented by Derek Bickhart. Presentation Outline. Variant classification and detection Theory on read structure and bias Simulations and real data. Nuc. Genetic Variation. How genomes change over time. - PowerPoint PPT Presentation

Citation preview

Page 1: Using the whole read: Structural Variation detection with RPSR

Using the whole read: Structural Variation detection with RPSR

Presented by Derek Bickhart

Page 2: Using the whole read: Structural Variation detection with RPSR

Presentation Outline

• Variant classification and detection

• Theory on read structure and bias

• Simulations and real data

Page 3: Using the whole read: Structural Variation detection with RPSR

Genetic Variation

• Single nucleotide variations – SNP (human millions of variants)

• Indels – Insertions/Deletions (1 bp – 1000 bp)

• Mobile Elements – SINE, LINE Transposition (300bp - 6 kb)

• Genomic structural variation (1 kb – 5 Mb)

– Large-scale Insertions/Deletions (Copy Number Variation: CNV)

– Segmental Duplications (> 1kb, > 90% sequence similarity)

– Chromosomal Inversions, Translocations, Fusions.

How genomes change over time

Nuc

Chr

Page 4: Using the whole read: Structural Variation detection with RPSR

SVs contribute to phenotype

Picture from Seo et al. 2007. BMC Genetics

Picture from Wright et al. 2009. PLoS Genet.

SOX5

KIT

ASIP

Nuc

Chr

KIT

Page 5: Using the whole read: Structural Variation detection with RPSR

Tracking variants with LDNuc

Chr

In LinkageDisequilibrium

Form a Haplotype

A causative variant?

Page 6: Using the whole read: Structural Variation detection with RPSR

Underestimating the number of genetic variants

Disease variants Or

New high productive QTV

• Mostly impute these

• Still some issues:• Relatively new mutations• Ref. assembly errors• Difficult to sequence

locations

• Can we get biological function from imputation?

Page 7: Using the whole read: Structural Variation detection with RPSR

Sequencing to find novel variants

Raw Data

ACGTAAAGGTACGACGATCGACG

ACGTAAAGGGACGGTAAAGGTACGACGTAAAGGTACGAC

GGGACGACGATCGA

REF:ACGTAAAGGTACGACGATCGACG

ACGTAAAGGGACGGTAAAGGTACGACGTAAAGGTACGAC

GGGACGACGATCGA

REF:

Page 8: Using the whole read: Structural Variation detection with RPSR

Making use of new variants

• What do you do after variant calling?– Check for functional impact– Check for frequency

• Variants can be placed on chips• Deletions can be tracked

ACGACTAGACGATGGACGA

ACGACTA TGGACGA

WildType:

Deletion: ACGACTA TGGACGAACGACTAG

ACGACTAT

Page 9: Using the whole read: Structural Variation detection with RPSR

Presentation Outline

• Variant classification and detection

• Theory on read structure and bias

• Simulations and real data

Page 10: Using the whole read: Structural Variation detection with RPSR

Structural variant detection still proves to be a challenge

• No Perfect Calls

• Most datasets have high FDR

• A majority of variants are missed

Good PerformancePretty conservative

Moderate PerformanceToo many False Positives

Poor performance

Page 11: Using the whole read: Structural Variation detection with RPSR

Understanding the sequencing process

• DNA sheared to fragments

• Fragments follow a size distribution

• Fragments sequenced from both sides

• We don’t know the middle, but we know the size!

Page 12: Using the whole read: Structural Variation detection with RPSR

Using information from alignments

Reference Genome

Reference Genome

ACGAGATAGTAGATACCATAGACG

ACGAGATAGT ACCATAGACG

7bp

3bp

ACGAGATAGCCATAGACG

ACGAGATAGGGCCATAGACG3bp

1bp

Deletions Insertions

What the Alignment should look like:

Page 13: Using the whole read: Structural Variation detection with RPSR

Using information from alignments

Reference Genome

Reference Genome

CGATAGACGAC GGAGAGAGATAG GGAGAGAGATAG ACCCAGATAA

CGATAGACGAC GGAGAGAGATAG ACCCAGATAA

Tandem Duplication

What the Alignment should look like:

Page 14: Using the whole read: Structural Variation detection with RPSR

TTGCGATTGCGA

Getting more from your reads

Reference Genome

Unaligned

Aligned

Split ReadDeletion call

ACGACGAGGGTGTGATTGACGATCGATACGACGA

Aligned ReadUnaligned Read

ACGACGAGGGTGTGATTG CGATA

Page 15: Using the whole read: Structural Variation detection with RPSR

Overcoming read biases and creating useful information

• Chemistry problems confuse detection

• Alignment issues occur

• Use clustering algorithm

Chimeric Read

Repeat Repeat

Page 16: Using the whole read: Structural Variation detection with RPSR

Ease of Use

• Designed to process BWA-aligned BAM files

• Scalable to system resources

• Multi-threaded

• Tunable to reduce false positives

Page 17: Using the whole read: Structural Variation detection with RPSR

RPSR: Read Pair, Split Read

• Written in Java (version 8)

• Currently two modes:– Preprocess– Cluster

• Uses map-reduce paradigms for easy threading

Page 18: Using the whole read: Structural Variation detection with RPSR

Presentation Outline

• Variant classification and detection

• Theory on read structure and bias

• Simulations and real data

Page 19: Using the whole read: Structural Variation detection with RPSR

Simulation dataset

• Started with Cattle chr29– 51 megabases– Acrocentric– Synthetic 10X coverage

Variant Type Avg. Count Avg. sizeDeletion 12.3 350 bpTandem Dup 11.8 350 bp

Variants per simulated chromosome

Page 20: Using the whole read: Structural Variation detection with RPSR

Comparison Program

• Delly / Duppy

• Rausch et al. 2012. Bioinformatics

• Combined read pair, split read caller

• Discovers discordant reads, then uses split reads to validate

• Run with default settings and in “split read” mode

Page 21: Using the whole read: Structural Variation detection with RPSR

DELLY RPSRDELS DUPPY RPSRTAND0

50

100

150

200

250

300

350

400

Program results: Simulation

DELLY RPSRDELS DUPPY RPSRTAND0

2

4

6

8

10

12

14

DuplicationsDeletions

True Positives

Actual Positives

TotalCalls

Page 22: Using the whole read: Structural Variation detection with RPSR

Program results: Simulation

• RPSR vs Delly/Duppy precision

• RPSR is far more precise than Delly/Duppy

Program PrecisionRPSR Dels 8.7%Delly Dels 0.9%RPSR Tandem 73.7%Duppy Duplications 1.8%

Page 23: Using the whole read: Structural Variation detection with RPSR

Real Dataset: Angus Individual

• Provided by Jerry Taylor and Bob Schnabel

• Sequence statistics– 20 X coverage of Illumina reads– Reads quality trimmed– Aligned with BWA to UMD3.1

Page 24: Using the whole read: Structural Variation detection with RPSR

Program results: Angus• RPSRVariant type Total Calls Avg. Size

(bp)Total

LengthLargest call

Deletions 4171 237 991 kb 43.7 kb

Duplications 9617 554 5,335 kb 152.0 kb

Variant type Total Calls Avg. Size (bp)

Total Length

Largest call

Deletions 1867 1,304,683 2.4 gb 149 Mb

Duplications 10263 232,472 2.38 gb 150 Mb

• Delly/Duppy

Page 25: Using the whole read: Structural Variation detection with RPSR

Conclusions

• Structural variants are a type of mutation we can track inexpensively

• Accurate assessment of SVs is needed

• Future developments for RPSR– Smaller resource footprint– Automatic threshold detection

Page 26: Using the whole read: Structural Variation detection with RPSR

Acknowledgements

• Colleagues at the USDA– George Liu– Tad Sonstegard– Curt Van Tassell– AGIL

• Jerry Taylor• Bob Schnabel• The Reecy Lab

Projects mentioned in this presentation were funded in part by NRI grant numbers 2007-35205-17869 and 2011-67015-30183, and by USDA NIFA grant number 2013-00831

Page 27: Using the whole read: Structural Variation detection with RPSR

Questions?

Page 28: Using the whole read: Structural Variation detection with RPSR

Delly/Duppy

• After removing the initial calls greater than 1 megabase and then merging:

Variant type Total Calls Avg. Size (bp)

Total Length

Largest call

Deletions 1833 5219 9.6 Mb 885 kb

Duplications 97407 1217 118 Mb 2.3 Mb

Page 29: Using the whole read: Structural Variation detection with RPSR

Pipeline: Resource Consumption