Upload
others
View
4
Download
0
Embed Size (px)
Citation preview
The Value of Mate-pairs for Repeat ResolutionAn Analysis on Graphs Created From Short Reads
Joshua Wetzel
Department of Computer ScienceRutgers University–Camden
in conjunction withCBCB at University of Maryland
August 7th, 2009
Joshua Wetzel 1 / 30
Next Generation Sequencing
Approx. 25 to 250 bp per readAdvantages:
high throughputrelatively low cost
Disadvantages:less certainty with error detectiongreater difficulty with resolution of repeats
lower likelihood of spanning an entire repeat within one read
Joshua Wetzel 2 / 30
The Assembly Problem and the De Bruijn Graph
1 Choose a value k2 1 Node for each length k − 1 substring that exists in any
read3 Length k substrings are represented by directed edges
between k − 2 suffix/prefix overlapEdge multiplicity rep. # of times a substring appears
Example
CGATATTCGCTAATTCGCG
TTCGATTC
k = 5
Pevzner et al.
Joshua Wetzel 3 / 30
Eulerian Paths
Every length k substring of read → 1 edge
With perfect data, graph is Eulerian
Any Eulerian tour → valid assembly of the reads
Good News:Eulerian Tour can be found quickly
Bad News:Repeats → many Eulerian tours
Joshua Wetzel 4 / 30
Eulerian Simplifications
Three types of forks:
Forward
ForkBackward
Fork
Full
Fork
Reduce Unambiguous Traversal Regions into SingleNodes
Simple Path CompressionCompressing Tree-like RegionsSplitting Half-decision Nodes
Kingsford
Joshua Wetzel 5 / 30
What Remains Now
A graph of only full-fork nodes (any two of which may beseparated by a non-decision node)
R1
R2R4
R3
We need additional information for finishingBest known avenue: Use matepairs
Joshua Wetzel 6 / 30
Mate-pairs
Random fragment with a approximately known size
Both ends are sequenced
Unique path of same approximate size in the assemblygraph → partial traversal of the graph
Widely accepted empirically, but no work has beendone to assess their theoretical usefulness with shortreads
Joshua Wetzel 7 / 30
Before We Spend Our Money
Mate-pair libraries are expensive.How much are they worth?
Can Repeats Be Resolved By Mate-pairsIf Data Is Perfect?
No sequencing errors
Perfect coverage of the genome - exactly one edge in thegraph for every length k substring of the genome
Assume we know exact length of each insert
Joshua Wetzel 8 / 30
The Plan
IdealizedData Compress
ApplyMatepairInfo
FinishingComplexityReduction
ApplyEulerian
Simplifications
Create Perfect Graphswith k as:35, 50, 100
Simulate VariousLengths
Standardvs.Ideal
Measure
Joshua Wetzel 9 / 30
Simulating Mate-pairs
1 Choose an insert size, j2 Choose a random starting point, s, in the known sequence3 Choose an ending point based on a Gaussian dist. using j
as the mean and a 10% stand. dev.4 First and last k letters are pairs with known distance
Joshua Wetzel 10 / 30
Reducing Finishing Complexity Using Mate-pairs
Use a shortest path heuristicFind a pair of nodes where the mate sequences are locatedIf shortest path = insert size → assume path is correct
We can verify correctness by comparison to knownsequence
An assembler cannot do this
Solution:Only use mates if shortest path is unique
Joshua Wetzel 11 / 30
Which Mate Pairs Can Be Used?
R1 R1
R2 R2
10
10 12
10
10 10
Good Bad
VS.
Joshua Wetzel 12 / 30
Finishing Complexity
Goal: Reduce manual finishing efforts
Define Finishing Complexity:A value proportional to # of experiments needed to finishthe genome
Fin. Comp. of one node isd∑
i=2i = d2
+d2 − 1
FinComplex( G) = Sum over all nodes
Joshua Wetzel 13 / 30
Increasing Clone Coverage to Infinity
Bartonella quintana Toulouse: Genome Size ≈ 1.5 Mb
0 10000 80000 160000 240000 320000 400000 480000
0
10
20
30
40
50
60
70
80
90
100
Num. Inserts vs. % Reduction in Finishing Complexity
Num. Inserts
% R
educ
tion
Insert Size (bp)
200035000300
60008000
Joshua Wetzel 14 / 30
An Important Observation
Many areas of these Graphs Look Like This
R1 R210
10
10
These repeats can only be resolved by inserts thatspan exactly 1 repeat
Joshua Wetzel 15 / 30
What Causes Localized Complexity?
A While Ago...
Evolution
Now
1 Big Repeat
2 Small Repeats
ATTCCGGCAGGATATGCTCAGGGGACC
ATTCCGGAGGA TATGCTCAGG GGACC
ATTCCGGAGGA TATGCTCCGG GGACC
How often does this situation arise in the graph?
Joshua Wetzel 16 / 30
Graph Complexity Statistic: Concept
Goal: Measure the amount of localized complexity in thegraph (C-Statistic)
Label each fork in the graph as either trivial or non-trivial
Trivial Non-Trivial
R1 R2
8
10
7 8
8
7
Joshua Wetzel 17 / 30
Graph Complexity Statistic
Let S be the set of all forks in the graph, G
Let N be the set of all non-trivial forks in G
Let Cv be the finishing complexity of node v
cstat(G) =
∑v∈N Cv
∑v∈S Cv
∗ 100
Hypothesis: Graphs with a high C-Stat will not experiencegood reductions in finishing complexity using traditionalmate-pair libraries
Joshua Wetzel 18 / 30
C-Statistic Across All Graphs
0−10 10−20 20−30 30−40 40−50 50−60 60−70 70−80 80−90 90−100
C−Stat Range
% o
f Gra
phs
05
1015
20% of Graphs with C−Stat in a Particular Range
cStat
35mer50mer
100mer
Choosing Graph-Specific Insert Sizes is Important
Joshua Wetzel 19 / 30
Tailoring Mate-pair Size to the Graph
Typical sequencing experiments choose 2 common sizemate-pair libraries without considering repeat size
Typical sizes: 2000, 6000, 8000, 35000 (bp)
Hypothesis:Given the prevalence of non-trivial repeats, we can likelyreduce finishing complexity more by choosing 2 librarysizes targeted at just barely crossing the forks of greatestcomplexity
How do we choose what size inserts to use?
Joshua Wetzel 20 / 30
Complexity Bar Chart for Repeats
(0, 200) (201, 401) (402, 602) (804, 1004) (1005, 1205) (1206, 1406)
0
5
10
15
20
25
30
35
40
45
50
Complexity Bar Chart for Thermus thermophilus
Repeat Length Range (bp)
Fin
ishi
ng C
ompl
exity
Joshua Wetzel 21 / 30
How the Complexity Bar Chart is Used
Two tallest bins (most combined complexity)
For of these two bins:1 Take the mean repeat length of all the repeats in the bin2 Add 3 ∗ k to the avg.3 Create inserts of this approximate size
Average optimal mate-pair length to cover nodes ofmost complexity was between 4.5k and 6k
Small σ for graphs with C-Stat ≥ 50Large σ for graphs with C-Stat < 50
Joshua Wetzel 22 / 30
How Many Mate-pairs? - Poisson Process
Want mate-pair to begin in a non-decision node and crossexactly one repeatIn worst case non-decision node maybe be as short as k
Want at least one insert for every k bps in the genomeIf average arrival rate, λ, is 10 across k bp
P[X = x ] =λxe−λ
x!
P[X = 1] =10e10 ≈ 0.00045
P[X < 1] ≤ 0.00045
To get λ = 10 across k bp we need 10Gk inserts
Joshua Wetzel 23 / 30
Simulations Across 360 Organisms
How Much Can We Reduce Finishing Complexity Using 2libraries each with 10G
k # of inserts if we use:
2 standard mate-pair libraries
The 2 ’ideal’ libraries based on the complexity chart
Joshua Wetzel 24 / 30
Using Two ’Ideal’ Mate-Pair Libraries(Avgs: 4.5k to 6k for smallest)
0−10 10−20 20−30 30−40 40−50 50−60 60−70 70−80 80−90 90−100
C−Stat Range
% R
educ
tion
020
4060
80C−stat Range vs. % Reduction in Fin. Complexity
Read Length
35mer50mer
100mer
Overall Avg Reduction: 82.70%, σ ≈ 17%
Joshua Wetzel 25 / 30
Using Standard Libraries: 2000 and 8000
0−10 10−20 20−30 30−40 40−50 50−60 60−70 70−80 80−90 90−100
C−Stat Range
% R
educ
tion
020
4060
8010
0C−stat Range vs. % Reduction in Fin. Complexity
Read Length
35mer50mer
100mer
Overall Avg Reduction: 47.52%, σ ≈ 22%
Joshua Wetzel 26 / 30
Local Complexity Vs. Global Complexity(Using 2000 and 8000)
0 1000 2000 3000 4000 5000 6000 7000
20
40
60
80
100
Total Finsishing Complexity vs. C−Statistic
Total Complexity
C−
Sta
tistic
% Reduction in Finishing Complexity
< 80% >=80%
Joshua Wetzel 27 / 30
Some Conclusions
Short inserts are better for repeat resolution
A large confounding factor in repeat resolution is localizedcomplexity caused by intra-repeat nucleotide mutations
The C-Stat is simple and useful measurement of localizedcomplexity
Most graphs have a C-Stat ≥ 60If C-Stat ≥ 50, then standard mp libs. are not useful forresolving repeats
Tailoring mp libs. to the repeat structure of the assemblygraph is a useful technique
Joshua Wetzel 28 / 30
Are Mate-pairs Worth the Money?
Even with perfect data and an ideal approach to creatingmate-pair libs, we can only resolve an avg. of 83% offinishing complexity (σ ≈17%)
Does not bode well for noisy data
On the other hand:
The only known technique for better resolving localizedcomplexity is use of longer reads
Joshua Wetzel 29 / 30
Thanks
Joshua Wetzel 30 / 30