30
The Value of Mate-pairs for Repeat Resolution An Analysis on Graphs Created From Short Reads Joshua Wetzel Department of Computer Science Rutgers University–Camden in conjunction with CBCB at University of Maryland [email protected] August 7th, 2009 Joshua Wetzel 1 / 30

The Value of Mate-pairs for Repeat Resolutioncrab.rutgers.edu/~rajivg/studentTalks/shortreads.pdf · The Assembly Problem and the De Bruijn Graph 1 Choose a value k 2 1 Node for each

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: The Value of Mate-pairs for Repeat Resolutioncrab.rutgers.edu/~rajivg/studentTalks/shortreads.pdf · The Assembly Problem and the De Bruijn Graph 1 Choose a value k 2 1 Node for each

The Value of Mate-pairs for Repeat ResolutionAn Analysis on Graphs Created From Short Reads

Joshua Wetzel

Department of Computer ScienceRutgers University–Camden

in conjunction withCBCB at University of Maryland

[email protected]

August 7th, 2009

Joshua Wetzel 1 / 30

Page 2: The Value of Mate-pairs for Repeat Resolutioncrab.rutgers.edu/~rajivg/studentTalks/shortreads.pdf · The Assembly Problem and the De Bruijn Graph 1 Choose a value k 2 1 Node for each

Next Generation Sequencing

Approx. 25 to 250 bp per readAdvantages:

high throughputrelatively low cost

Disadvantages:less certainty with error detectiongreater difficulty with resolution of repeats

lower likelihood of spanning an entire repeat within one read

Joshua Wetzel 2 / 30

Page 3: The Value of Mate-pairs for Repeat Resolutioncrab.rutgers.edu/~rajivg/studentTalks/shortreads.pdf · The Assembly Problem and the De Bruijn Graph 1 Choose a value k 2 1 Node for each

The Assembly Problem and the De Bruijn Graph

1 Choose a value k2 1 Node for each length k − 1 substring that exists in any

read3 Length k substrings are represented by directed edges

between k − 2 suffix/prefix overlapEdge multiplicity rep. # of times a substring appears

Example

CGATATTCGCTAATTCGCG

TTCGATTC

k = 5

Pevzner et al.

Joshua Wetzel 3 / 30

Page 4: The Value of Mate-pairs for Repeat Resolutioncrab.rutgers.edu/~rajivg/studentTalks/shortreads.pdf · The Assembly Problem and the De Bruijn Graph 1 Choose a value k 2 1 Node for each

Eulerian Paths

Every length k substring of read → 1 edge

With perfect data, graph is Eulerian

Any Eulerian tour → valid assembly of the reads

Good News:Eulerian Tour can be found quickly

Bad News:Repeats → many Eulerian tours

Joshua Wetzel 4 / 30

Page 5: The Value of Mate-pairs for Repeat Resolutioncrab.rutgers.edu/~rajivg/studentTalks/shortreads.pdf · The Assembly Problem and the De Bruijn Graph 1 Choose a value k 2 1 Node for each

Eulerian Simplifications

Three types of forks:

Forward

ForkBackward

Fork

Full

Fork

Reduce Unambiguous Traversal Regions into SingleNodes

Simple Path CompressionCompressing Tree-like RegionsSplitting Half-decision Nodes

Kingsford

Joshua Wetzel 5 / 30

Page 6: The Value of Mate-pairs for Repeat Resolutioncrab.rutgers.edu/~rajivg/studentTalks/shortreads.pdf · The Assembly Problem and the De Bruijn Graph 1 Choose a value k 2 1 Node for each

What Remains Now

A graph of only full-fork nodes (any two of which may beseparated by a non-decision node)

R1

R2R4

R3

We need additional information for finishingBest known avenue: Use matepairs

Joshua Wetzel 6 / 30

Page 7: The Value of Mate-pairs for Repeat Resolutioncrab.rutgers.edu/~rajivg/studentTalks/shortreads.pdf · The Assembly Problem and the De Bruijn Graph 1 Choose a value k 2 1 Node for each

Mate-pairs

Random fragment with a approximately known size

Both ends are sequenced

Unique path of same approximate size in the assemblygraph → partial traversal of the graph

Widely accepted empirically, but no work has beendone to assess their theoretical usefulness with shortreads

Joshua Wetzel 7 / 30

Page 8: The Value of Mate-pairs for Repeat Resolutioncrab.rutgers.edu/~rajivg/studentTalks/shortreads.pdf · The Assembly Problem and the De Bruijn Graph 1 Choose a value k 2 1 Node for each

Before We Spend Our Money

Mate-pair libraries are expensive.How much are they worth?

Can Repeats Be Resolved By Mate-pairsIf Data Is Perfect?

No sequencing errors

Perfect coverage of the genome - exactly one edge in thegraph for every length k substring of the genome

Assume we know exact length of each insert

Joshua Wetzel 8 / 30

Page 9: The Value of Mate-pairs for Repeat Resolutioncrab.rutgers.edu/~rajivg/studentTalks/shortreads.pdf · The Assembly Problem and the De Bruijn Graph 1 Choose a value k 2 1 Node for each

The Plan

IdealizedData Compress

ApplyMatepairInfo

FinishingComplexityReduction

ApplyEulerian

Simplifications

Create Perfect Graphswith k as:35, 50, 100

Simulate VariousLengths

Standardvs.Ideal

Measure

Joshua Wetzel 9 / 30

Page 10: The Value of Mate-pairs for Repeat Resolutioncrab.rutgers.edu/~rajivg/studentTalks/shortreads.pdf · The Assembly Problem and the De Bruijn Graph 1 Choose a value k 2 1 Node for each

Simulating Mate-pairs

1 Choose an insert size, j2 Choose a random starting point, s, in the known sequence3 Choose an ending point based on a Gaussian dist. using j

as the mean and a 10% stand. dev.4 First and last k letters are pairs with known distance

Joshua Wetzel 10 / 30

Page 11: The Value of Mate-pairs for Repeat Resolutioncrab.rutgers.edu/~rajivg/studentTalks/shortreads.pdf · The Assembly Problem and the De Bruijn Graph 1 Choose a value k 2 1 Node for each

Reducing Finishing Complexity Using Mate-pairs

Use a shortest path heuristicFind a pair of nodes where the mate sequences are locatedIf shortest path = insert size → assume path is correct

We can verify correctness by comparison to knownsequence

An assembler cannot do this

Solution:Only use mates if shortest path is unique

Joshua Wetzel 11 / 30

Page 12: The Value of Mate-pairs for Repeat Resolutioncrab.rutgers.edu/~rajivg/studentTalks/shortreads.pdf · The Assembly Problem and the De Bruijn Graph 1 Choose a value k 2 1 Node for each

Which Mate Pairs Can Be Used?

R1 R1

R2 R2

10

10 12

10

10 10

Good Bad

VS.

Joshua Wetzel 12 / 30

Page 13: The Value of Mate-pairs for Repeat Resolutioncrab.rutgers.edu/~rajivg/studentTalks/shortreads.pdf · The Assembly Problem and the De Bruijn Graph 1 Choose a value k 2 1 Node for each

Finishing Complexity

Goal: Reduce manual finishing efforts

Define Finishing Complexity:A value proportional to # of experiments needed to finishthe genome

Fin. Comp. of one node isd∑

i=2i = d2

+d2 − 1

FinComplex( G) = Sum over all nodes

Joshua Wetzel 13 / 30

Page 14: The Value of Mate-pairs for Repeat Resolutioncrab.rutgers.edu/~rajivg/studentTalks/shortreads.pdf · The Assembly Problem and the De Bruijn Graph 1 Choose a value k 2 1 Node for each

Increasing Clone Coverage to Infinity

Bartonella quintana Toulouse: Genome Size ≈ 1.5 Mb

0 10000 80000 160000 240000 320000 400000 480000

0

10

20

30

40

50

60

70

80

90

100

Num. Inserts vs. % Reduction in Finishing Complexity

Num. Inserts

% R

educ

tion

Insert Size (bp)

200035000300

60008000

Joshua Wetzel 14 / 30

Page 15: The Value of Mate-pairs for Repeat Resolutioncrab.rutgers.edu/~rajivg/studentTalks/shortreads.pdf · The Assembly Problem and the De Bruijn Graph 1 Choose a value k 2 1 Node for each

An Important Observation

Many areas of these Graphs Look Like This

R1 R210

10

10

These repeats can only be resolved by inserts thatspan exactly 1 repeat

Joshua Wetzel 15 / 30

Page 16: The Value of Mate-pairs for Repeat Resolutioncrab.rutgers.edu/~rajivg/studentTalks/shortreads.pdf · The Assembly Problem and the De Bruijn Graph 1 Choose a value k 2 1 Node for each

What Causes Localized Complexity?

A While Ago...

Evolution

Now

1 Big Repeat

2 Small Repeats

ATTCCGGCAGGATATGCTCAGGGGACC

ATTCCGGAGGA TATGCTCAGG GGACC

ATTCCGGAGGA TATGCTCCGG GGACC

How often does this situation arise in the graph?

Joshua Wetzel 16 / 30

Page 17: The Value of Mate-pairs for Repeat Resolutioncrab.rutgers.edu/~rajivg/studentTalks/shortreads.pdf · The Assembly Problem and the De Bruijn Graph 1 Choose a value k 2 1 Node for each

Graph Complexity Statistic: Concept

Goal: Measure the amount of localized complexity in thegraph (C-Statistic)

Label each fork in the graph as either trivial or non-trivial

Trivial Non-Trivial

R1 R2

8

10

7 8

8

7

Joshua Wetzel 17 / 30

Page 18: The Value of Mate-pairs for Repeat Resolutioncrab.rutgers.edu/~rajivg/studentTalks/shortreads.pdf · The Assembly Problem and the De Bruijn Graph 1 Choose a value k 2 1 Node for each

Graph Complexity Statistic

Let S be the set of all forks in the graph, G

Let N be the set of all non-trivial forks in G

Let Cv be the finishing complexity of node v

cstat(G) =

∑v∈N Cv

∑v∈S Cv

∗ 100

Hypothesis: Graphs with a high C-Stat will not experiencegood reductions in finishing complexity using traditionalmate-pair libraries

Joshua Wetzel 18 / 30

Page 19: The Value of Mate-pairs for Repeat Resolutioncrab.rutgers.edu/~rajivg/studentTalks/shortreads.pdf · The Assembly Problem and the De Bruijn Graph 1 Choose a value k 2 1 Node for each

C-Statistic Across All Graphs

0−10 10−20 20−30 30−40 40−50 50−60 60−70 70−80 80−90 90−100

C−Stat Range

% o

f Gra

phs

05

1015

20% of Graphs with C−Stat in a Particular Range

cStat

35mer50mer

100mer

Choosing Graph-Specific Insert Sizes is Important

Joshua Wetzel 19 / 30

Page 20: The Value of Mate-pairs for Repeat Resolutioncrab.rutgers.edu/~rajivg/studentTalks/shortreads.pdf · The Assembly Problem and the De Bruijn Graph 1 Choose a value k 2 1 Node for each

Tailoring Mate-pair Size to the Graph

Typical sequencing experiments choose 2 common sizemate-pair libraries without considering repeat size

Typical sizes: 2000, 6000, 8000, 35000 (bp)

Hypothesis:Given the prevalence of non-trivial repeats, we can likelyreduce finishing complexity more by choosing 2 librarysizes targeted at just barely crossing the forks of greatestcomplexity

How do we choose what size inserts to use?

Joshua Wetzel 20 / 30

Page 21: The Value of Mate-pairs for Repeat Resolutioncrab.rutgers.edu/~rajivg/studentTalks/shortreads.pdf · The Assembly Problem and the De Bruijn Graph 1 Choose a value k 2 1 Node for each

Complexity Bar Chart for Repeats

(0, 200) (201, 401) (402, 602) (804, 1004) (1005, 1205) (1206, 1406)

0

5

10

15

20

25

30

35

40

45

50

Complexity Bar Chart for Thermus thermophilus

Repeat Length Range (bp)

Fin

ishi

ng C

ompl

exity

Joshua Wetzel 21 / 30

Page 22: The Value of Mate-pairs for Repeat Resolutioncrab.rutgers.edu/~rajivg/studentTalks/shortreads.pdf · The Assembly Problem and the De Bruijn Graph 1 Choose a value k 2 1 Node for each

How the Complexity Bar Chart is Used

Two tallest bins (most combined complexity)

For of these two bins:1 Take the mean repeat length of all the repeats in the bin2 Add 3 ∗ k to the avg.3 Create inserts of this approximate size

Average optimal mate-pair length to cover nodes ofmost complexity was between 4.5k and 6k

Small σ for graphs with C-Stat ≥ 50Large σ for graphs with C-Stat < 50

Joshua Wetzel 22 / 30

Page 23: The Value of Mate-pairs for Repeat Resolutioncrab.rutgers.edu/~rajivg/studentTalks/shortreads.pdf · The Assembly Problem and the De Bruijn Graph 1 Choose a value k 2 1 Node for each

How Many Mate-pairs? - Poisson Process

Want mate-pair to begin in a non-decision node and crossexactly one repeatIn worst case non-decision node maybe be as short as k

Want at least one insert for every k bps in the genomeIf average arrival rate, λ, is 10 across k bp

P[X = x ] =λxe−λ

x!

P[X = 1] =10e10 ≈ 0.00045

P[X < 1] ≤ 0.00045

To get λ = 10 across k bp we need 10Gk inserts

Joshua Wetzel 23 / 30

Page 24: The Value of Mate-pairs for Repeat Resolutioncrab.rutgers.edu/~rajivg/studentTalks/shortreads.pdf · The Assembly Problem and the De Bruijn Graph 1 Choose a value k 2 1 Node for each

Simulations Across 360 Organisms

How Much Can We Reduce Finishing Complexity Using 2libraries each with 10G

k # of inserts if we use:

2 standard mate-pair libraries

The 2 ’ideal’ libraries based on the complexity chart

Joshua Wetzel 24 / 30

Page 25: The Value of Mate-pairs for Repeat Resolutioncrab.rutgers.edu/~rajivg/studentTalks/shortreads.pdf · The Assembly Problem and the De Bruijn Graph 1 Choose a value k 2 1 Node for each

Using Two ’Ideal’ Mate-Pair Libraries(Avgs: 4.5k to 6k for smallest)

0−10 10−20 20−30 30−40 40−50 50−60 60−70 70−80 80−90 90−100

C−Stat Range

% R

educ

tion

020

4060

80C−stat Range vs. % Reduction in Fin. Complexity

Read Length

35mer50mer

100mer

Overall Avg Reduction: 82.70%, σ ≈ 17%

Joshua Wetzel 25 / 30

Page 26: The Value of Mate-pairs for Repeat Resolutioncrab.rutgers.edu/~rajivg/studentTalks/shortreads.pdf · The Assembly Problem and the De Bruijn Graph 1 Choose a value k 2 1 Node for each

Using Standard Libraries: 2000 and 8000

0−10 10−20 20−30 30−40 40−50 50−60 60−70 70−80 80−90 90−100

C−Stat Range

% R

educ

tion

020

4060

8010

0C−stat Range vs. % Reduction in Fin. Complexity

Read Length

35mer50mer

100mer

Overall Avg Reduction: 47.52%, σ ≈ 22%

Joshua Wetzel 26 / 30

Page 27: The Value of Mate-pairs for Repeat Resolutioncrab.rutgers.edu/~rajivg/studentTalks/shortreads.pdf · The Assembly Problem and the De Bruijn Graph 1 Choose a value k 2 1 Node for each

Local Complexity Vs. Global Complexity(Using 2000 and 8000)

0 1000 2000 3000 4000 5000 6000 7000

20

40

60

80

100

Total Finsishing Complexity vs. C−Statistic

Total Complexity

C−

Sta

tistic

% Reduction in Finishing Complexity

< 80% >=80%

Joshua Wetzel 27 / 30

Page 28: The Value of Mate-pairs for Repeat Resolutioncrab.rutgers.edu/~rajivg/studentTalks/shortreads.pdf · The Assembly Problem and the De Bruijn Graph 1 Choose a value k 2 1 Node for each

Some Conclusions

Short inserts are better for repeat resolution

A large confounding factor in repeat resolution is localizedcomplexity caused by intra-repeat nucleotide mutations

The C-Stat is simple and useful measurement of localizedcomplexity

Most graphs have a C-Stat ≥ 60If C-Stat ≥ 50, then standard mp libs. are not useful forresolving repeats

Tailoring mp libs. to the repeat structure of the assemblygraph is a useful technique

Joshua Wetzel 28 / 30

Page 29: The Value of Mate-pairs for Repeat Resolutioncrab.rutgers.edu/~rajivg/studentTalks/shortreads.pdf · The Assembly Problem and the De Bruijn Graph 1 Choose a value k 2 1 Node for each

Are Mate-pairs Worth the Money?

Even with perfect data and an ideal approach to creatingmate-pair libs, we can only resolve an avg. of 83% offinishing complexity (σ ≈17%)

Does not bode well for noisy data

On the other hand:

The only known technique for better resolving localizedcomplexity is use of longer reads

Joshua Wetzel 29 / 30

Page 30: The Value of Mate-pairs for Repeat Resolutioncrab.rutgers.edu/~rajivg/studentTalks/shortreads.pdf · The Assembly Problem and the De Bruijn Graph 1 Choose a value k 2 1 Node for each

Thanks

Joshua Wetzel 30 / 30