Bioinformatics Algorithms
Sequence Assembly
2
Copyright Notice
Most slides in this presentation are adapted from textbook slides and various other sources. The copyright belongs to the original authors. Thanks!
3
Genome Sequencing
Goal: figuring out the order of nucleotides across a genome
Problem: current DNA sequencing methods can handle only short stretches of DNA at once (<1-2 Kbp)
Solution: sequence, and then use computers to assemble the small pieces
4
Genome Sequencing
[Figure: a genome is cut into short fragments of DNA, the fragments are read as short DNA sequences, and the sequences are assembled into the sequenced genome]
Short DNA sequences: ACGTGGTAA CGTATACAC TAGGCCATA GTAATGGCG CACCCTTAG TGGCGTATA CATA…
Sequenced genome: ACGTGGTAATGGCGTATACACCCTTAGGCCATA
5
Sanger Sequencing
Timeline (1980-2000s):
1982: lambda virus, DNA stretches up to 30-40 Kbp (Sanger et al.)
1994: H. influenzae, 1.8 Mbp (Fleischmann et al.)
2001: H. sapiens, D. melanogaster, 3 Gbp (Venter et al.)
2007: Global Ocean Sampling Expedition, ~3,000 organisms, 7 Gbp (Venter et al.)
6
Sanger Sequencing
Advantages: long reads (~900 bp); suitable for small projects
Disadvantages: low throughput; expensive
7
Sequencing the Human Genome
[Figure: log10(price) versus year, 2000-2010]
2001: Human Genome Project, 2.7G$, 11 years
2001: Celera, 100M$, 3 years
2007: 454, 1M$, 3 months
2008: ABI SOLiD, 60K$, 2 weeks
2009: Illumina, Helicos, 40-50K$
2010: 5K$, a few days?
2012: 100$, <24 hrs?
8
Next Generation Sequencing
Alternative sequencing technologies to capillary sequencing, introduced in the mid-2000s.
Systems by Illumina Solexa and ABI SOLiD.
Much higher throughput (1-4 Gbp/day)
Lower cost / base pair
Very short fragment lengths
High error rate
Inherent ability to do paired-end (mate-pair) sequencing.
9
Technology Summary
Technology   Read length   Sequencing technology   Throughput (per run)   Cost (1 Mbp)*
Sanger       ~800 bp       Sanger                  400 Kbp                500$
454          ~400 bp       Polony                  500 Mbp                60$
Solexa       75 bp         Polony                  20 Gbp                 2$
SOLiD        75 bp         Polony                  60 Gbp                 2$
Helicos      30-35 bp      Single molecule         25 Gbp                 1$
*Source: Shendure & Ji, Nat Biotech, 2008
10
Assembly
Cut DNA into larger pieces (2 Kbp, 15 Kbp) and sequence both ends of each piece (Fleischmann et al., 1994)
[Figure: a piece with ~500 bp sequenced at each end and ~(length - 1,000) bp unsequenced in between; 2 Kbp mates and 15 Kbp mates linking contig 1 and contig 2]
Benefits: resolving repeats; better assembly of contigs; gap length estimation
11
Assembly: How Much DNA?
Low coverage: input is a few pieces to assemble; output is many contigs, many gaps
High coverage: input is many pieces to assemble; output is a few contigs, a few gaps
(Lander and Waterman, 1988)
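The coverage trade-off can be made concrete with the standard Lander-Waterman expectations for idealized, uniformly random reads. The sketch below is only an illustration: it assumes coverage c = N·L/G, an uncovered genome fraction of about e^(-c), and an expected contig count of about N·e^(-c) (ignoring the minimum detectable overlap). The function name and the example numbers are hypothetical, not taken from the slides.

```python
import math

def lander_waterman(num_reads, read_len, genome_len):
    """Return (coverage, uncovered fraction, expected contigs) for random reads.

    Assumes reads are placed uniformly at random and any overlap is detectable.
    """
    coverage = num_reads * read_len / genome_len
    uncovered = math.exp(-coverage)             # chance a given base is in no read
    contigs = num_reads * math.exp(-coverage)   # expected number of contigs
    return coverage, uncovered, contigs

# Low coverage: many contigs and gaps. High coverage: only a few.
low = lander_waterman(1_000, 800, 1_000_000)     # 0.8x coverage
high = lander_waterman(10_000, 800, 1_000_000)   # 8x coverage
```

Under these assumptions, raising coverage from 0.8x to 8x drops the expected number of contigs from several hundred to a handful.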
12
Assembly paradigms
Overlap-layout-consensus
  greedy (TIGR Assembler, phrap, CAP3...)
  graph-based (Celera Assembler, Arachne)
de Bruijn Graph based approaches (especially useful for short read sequencing)
13
Overlap-Layout-Consensus
Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA
Overlap: find potentially overlapping reads
Layout: merge reads into contigs and contigs into supercontigs (scaffolds)
Consensus: derive the DNA sequence and correct read errors
..ACGATTACAATAGGTT..
14
OVERLAP GRAPH
Edge types (between reads A and B):
  Regular dovetail
  Prefix dovetail
  Suffix dovetail
[Figure: relative placement of reads A and B for each edge type]
E.g.: edges are annotated with deltas of overlaps
15
OVERLAP GRAPH
Find the best match between the suffix of one read and the prefix of another
Due to sequencing errors, need to use dynamic programming to find the optimal overlap alignment
Apply a fast filtration method to filter out pairs of fragments that do not share a significantly long common substring
16
The Maximum Overlap Graph
Each edge, (u,v) is weighted with the length of the maximal overlap between a suffix of u and a prefix of v
[Figure: maximum overlap graph on four reads]
a = TACGA, b = ACCC, c = CTAAAG, d = GACA
Edges: a→d (2), a→b (1), d→b (1), b→c (1), c→d (1)
0-weight edges omitted!
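The edge weights of a maximum overlap graph can be computed with a simple exact suffix-prefix comparison. The sketch below assumes error-free reads and the labelling a = TACGA, b = ACCC, c = CTAAAG, d = GACA (consistent with the path dbc on the next slide). The function names are illustrative, and a real assembler would use an alignment that tolerates sequencing errors.

```python
def overlap(u, v):
    """Length of the longest suffix of u that equals a prefix of v (exact match)."""
    for n in range(min(len(u), len(v)), 0, -1):
        if u[-n:] == v[:n]:
            return n
    return 0

def overlap_graph(reads):
    """Build {(x, y): weight} for all ordered pairs of distinct reads.

    reads maps a label to its sequence; 0-weight edges are omitted.
    """
    graph = {}
    for x, seq_x in reads.items():
        for y, seq_y in reads.items():
            if x == y:
                continue
            w = overlap(seq_x, seq_y)
            if w > 0:
                graph[(x, y)] = w
    return graph

reads = {"a": "TACGA", "b": "ACCC", "c": "CTAAAG", "d": "GACA"}
graph = overlap_graph(reads)
```

This all-pairs comparison is quadratic in the number of reads, which is why the filtration step on the previous slide matters in practice.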
17
Paths and Layouts
The path dbc leads to the alignment:
[Figure: the maximum overlap graph on reads a, b, c, d from the previous slide]
GACA
---ACCC
------CTAAAG
18
Superstrings
Every path that covers every node is a superstring
Zero weight edges result in alignments like:
GACA
----GCCC
--------TTAAAG
Higher weights produce more overlap, and thus shorter strings
The shortest common superstring is the highest weight path that covers every node
19
Graph formulation of SCS
Input: A weighted, directed graph
Output: The highest-weight path that touches every node of the graph
Does this problem sound familiar?
20
The Greedy Algorithm
Algorithm greedy
  Sort edges in decreasing weight order
  For each edge in this order
    If the edge does not form a cycle
       and does not start at the same node as another edge in the set
       and does not end at the same node as another edge in the set
    then add the edge to the current set
  End for
End Algorithm
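A minimal runnable sketch of this greedy edge selection follows. It assumes the heaviest edges are taken first (the goal is a highest-weight path), uses a small union-find structure to detect cycles, and breaks ties by edge name. The function name and the input format are illustrative choices, and the example weights are the overlaps of the reads a = TACGA, b = ACCC, c = CTAAAG, d = GACA.

```python
def greedy_path_edges(weights):
    """Pick edges greedily from {(u, v): weight}, heaviest first.

    An edge is kept only if u has no chosen outgoing edge, v has no chosen
    incoming edge, and adding it would not close a cycle.
    """
    parent = {}

    def find(x):
        # Union-find root lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    has_out, has_in, chosen = set(), set(), []
    ordered = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    for (u, v), _w in ordered:
        if u in has_out or v in has_in:
            continue                  # would branch at u or merge at v
        if find(u) == find(v):
            continue                  # u and v already on one path: a cycle
        parent[find(u)] = find(v)
        has_out.add(u)
        has_in.add(v)
        chosen.append((u, v))
    return chosen

weights = {("a", "d"): 2, ("a", "b"): 1, ("d", "b"): 1, ("b", "c"): 1, ("c", "d"): 1}
path_edges = greedy_path_edges(weights)
```

On this input the chosen edges chain into the path a, d, b, c. Greedy selection is a heuristic and does not guarantee the shortest common superstring in general.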
21
Greedy Example
[Figure: greedy example on a weighted graph; edge weights shown: 7, 6, 5, 4, 3, 2, 1, 2, 2]
22
Handling repeats
1. Repeat detection
  pre-assembly: find fragments that belong to repeats
    statistically (most existing assemblers)
    repeat database (RepeatMasker)
  during assembly: detect "tangles" indicative of repeats (Pevzner, Tang, Waterman 2001)
  post-assembly: find repetitive regions and potential mis-assemblies
    Reputer, RepeatMasker
    "unhappy" mate-pairs (too close, too far, mis-oriented)
2. Repeat resolution
  find DNA fragments belonging to the repeat
  determine correct tiling across the repeat
23
Statistical repeat detection
Significant deviations from average coverage flagged as repeats.
- frequent k-mers are ignored
- “arrival” rate of reads in contigs compared with theoretical value
(e.g., 800 bp reads & 8x coverage - reads "arrive" every 100 bp)
Problem 1: assumption of uniform distribution of fragments - leads to false positives
non-random libraries
poor clonability regions
Problem 2: repeats with low copy number are missed - leads to false negatives
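The arrival-rate arithmetic can be illustrated with a short sketch. It assumes uniformly distributed reads, so that reads start about every read_len / coverage bases, and flags a contig when its observed read count exceeds the expectation by a fixed ratio. The function names, the contig data, and the threshold of 2.0 are hypothetical; they are not the statistic used by any particular assembler.

```python
def expected_spacing(read_len, coverage):
    """Average distance in bp between consecutive read starts."""
    return read_len / coverage

def flag_repeat_contigs(contigs, read_len, coverage, ratio_threshold=2.0):
    """Return names of contigs whose read density is suspiciously high.

    contigs maps a name to (length_bp, number_of_reads).
    """
    spacing = expected_spacing(read_len, coverage)
    flagged = []
    for name, (length, n_reads) in contigs.items():
        expected_reads = length / spacing
        if n_reads / expected_reads > ratio_threshold:
            flagged.append(name)      # likely a collapsed repeat
    return flagged

spacing = expected_spacing(800, 8)    # 800 bp reads at 8x coverage
flagged = flag_repeat_contigs({"c1": (10_000, 105), "c2": (5_000, 160)}, 800, 8)
```

A fixed ratio like this inherits both problems listed above: non-uniform libraries produce false positives, and low-copy repeats stay below the threshold.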
24
Consensus
A consensus sequence is derived from a profile of the assembled fragments
A sufficient number of reads is required to ensure a statistically significant consensus
Reading errors are corrected
25
Derive Consensus Sequence
Derive multiple alignment from pairwise read alignments (i.e., progressive alignment)
TAGATTACACAGATTACTGA TTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG TTACACAGATTATTGACTTCATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA CTA

Consensus:
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
Derive each consensus base by weighted voting.
Another approach based on finding a longest path in a DAG is given the popular assembler
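Column-wise voting over an existing multiple alignment can be sketched as follows. The sketch assumes the rows are already aligned to equal length, writes gaps as '-', and gives every read the same weight unless weights are supplied. The function name and the weighting scheme are illustrative assumptions.

```python
from collections import Counter

def consensus(rows, weights=None):
    """Derive a consensus string by weighted voting in each alignment column."""
    if weights is None:
        weights = [1.0] * len(rows)
    result = []
    for column in zip(*rows):
        votes = Counter()
        for base, weight in zip(column, weights):
            votes[base] += weight
        result.append(max(votes, key=votes.get))   # highest total weight wins
    return "".join(result)

rows = [
    "TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAAACTA",
    "TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA",
]
print(consensus(rows))
```

In practice the weights would typically come from per-base quality values, which this sketch does not model.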
26
Definitions
Let v and w be two strings over the alphabet Σ. The concatenation of v and w is denoted by vw, and v[i] is the ith symbol of v, 1 ≤ i ≤ |v|. v[i,j] denotes a substring of v, and for any x over Σ, x^k, k ≥ 1, is x concatenated with itself k times.
A string of length k is called a k-mer. The k-spectrum of v is the set of all k-mers that are substrings of v. [Example: v = abcd; the 2-spectrum of v is {ab, bc, cd} and the 3-spectrum of v is {abc, bcd}.]
A DNA strand is a string over the alphabet Σ = {a, g, c, t}. Characters of a DNA strand are called bases. The complement of a base α[i], denoted by $\overline{\alpha[i]}$, is defined by the following bijection of Σ onto Σ: {t → a, c → g, a → t, g → c}.
The reverse complement of a DNA strand α, denoted by $\bar{\alpha}$, is obtained by reversing α and complementing each base ($\bar{\alpha}[i] = \overline{\alpha[|\alpha| - i + 1]}$). Note that $\overline{\overline{\alpha[i]}} = \alpha[i]$ and $\bar{\bar{\alpha}} = \alpha$.
A DNA molecule is a pair of complementary DNA strands, m = {α_m, $\bar{\alpha}_m$}. We denote the length of m as |m| = |α_m| = h and call m an h-molecule.
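These definitions translate directly into code. The sketch below uses 0-based Python slicing instead of the 1-based indexing of the definitions, and the function names are illustrative.

```python
COMPLEMENT = {"a": "t", "t": "a", "c": "g", "g": "c"}

def k_spectrum(v, k):
    """Set of all k-mers that occur as substrings of v."""
    return {v[i:i + k] for i in range(len(v) - k + 1)}

def reverse_complement(strand):
    """Reverse the strand and complement each base."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))

print(k_spectrum("abcd", 2))
print(reverse_complement("aacg"))
```

Applying reverse_complement twice returns the original strand, matching the note in the definition.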
27
The bi-directed graphs
A bidirected graph is one in which each edge is given an independent orientation (or direction, or arrow – thus, 2 kinds of arrowheads) at each end. Thus, there are three kinds of bidirected edges: (1) those where the arrows point outward, towards the vertices, at both ends; (2) those where both arrows point inward, away from the vertices; and (3) those in which one arrow points away from its vertex and towards the opposite end, while the other arrow points in the same direction as the first, away from the opposite end and towards its own vertex.
28
The bi-directed graphs
We denote a bidirected graph by an incidence matrix I: V × E → {-2, -1, 0, 1, 2}. I(x,e) = 0 if edge e is not incident to node x, +1 if e is positive-incident to x [denoted by diamond], -1 if e is negative-incident to x, +2 if e is a self-loop on x (positive-incident) and -2 if e is a self-loop on x (negative-incident). The in-degree degn(x) and out-degree degp(x) of a vertex are defined as usual. The balance at a node x is bal(x) = degp(x) - degn(x); a graph is balanced if the balance of each vertex is 0. A walk is a sequence x1 e1 … xk-1 ek-1 xk where ei is an edge between nodes xi and xi+1, and ei and ei-1 have opposite orientations at xi.
bal(W) = 0, bal(X) = -1, bal(Y) = 1, bal(Z) = -1; the graph is not balanced. We can view a loopless directed graph as a special kind of bidirected graph in which every edge is positive-incident to one of its endpoints and negative-incident to the other; the definition of a walk then reduces to its usual meaning in directed graphs. However, in a bidirected graph the shortest walk between two vertices may have to repeat a vertex. Here, there is no walk between W and Z that does not repeat a vertex (which is not possible in a directed graph): the walk from node W to node Z is ABCBD. Observe that AD is not a walk but BD is.
[Figure: bidirected graph on nodes W, X, Y, Z with edges labelled A, B, C, D]
     A    B    C    D    E
W    1    0    0    0   -1
X   -1    1    0   -1   -1
Y    0   -1    2    0    0
Z    0    0    0   -1    0
29
The bi-directed de Bruijn graphs
A bi-directed de Bruijn graph
Nodes: all possible k-mers
Edges: ((v, dv), (u, du)), also written (v, u, dv, du)
There is an edge between v and u iff
suffix(v, k-1) = prefix(u, k-1) or suffix($\bar{v}$, k-1) = prefix($\bar{u}$, k-1)
and v[1].suffix(v, k-1).u[k] is a (k+1)-substring of S or $\bar{S}$,
or $\bar{v}$[1].suffix($\bar{v}$, k-1).$\bar{u}$[k] is a (k+1)-substring of S or $\bar{S}$
)( SSBDk
)&()( SvSvasvv
30
The bi-directed de Bruijn graphs
Canonical k-mer: of a k-mer and its reverse complement, the lexicographically smaller (or, by convention, larger) one is defined as the canonical k-mer, and the other one is the non-canonical k-mer.
The orientations of the arrow heads on the edges are chosen as follows.
If the canonical k-mers of nodes vi and vj overlap, then an edge (vi, vj, >, >) is introduced.
If the canonical k-mer of vi overlaps with the non-canonical k-mer of vj, then an edge (vi, vj, >, <) is introduced.
If the non-canonical k-mer of vi overlaps with the canonical k-mer of vj, then an edge (vi, vj, <, >) is introduced.
A walk W(vi, vj) between two nodes vi, vj ∈ V of a bi-directed graph G(V, E) is a sequence vi, e_i, v_i1, e_i1, v_i2, ..., v_im, e_im, vj such that for every intermediate vertex v_il, 1 ≤ l ≤ m, the orientation of the arrow head on the incoming edge adjacent to v_il matches the orientation of the arrow head on the outgoing edge.
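The orientation rules can be sketched in code by reading every (k+1)-mer of a read as an overlap between its first and last k-mer. This is only an illustration under stated assumptions: the canonical k-mer is taken to be the lexicographically smaller one, '>' marks a k-mer that appears in canonical form and '<' one that appears as its reverse complement, and each edge is stored in a normalized form so that a (k+1)-mer and its reverse complement yield the same edge. The function names and the tuple layout (vi, vj, di, dj) are hypothetical.

```python
COMP = str.maketrans("ACGT", "TGCA")
FLIP = {">": "<", "<": ">"}

def revcomp(s):
    """Reverse complement of an upper-case DNA string."""
    return s.translate(COMP)[::-1]

def canonical(kmer):
    """Lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, revcomp(kmer))

def orient(kmer):
    """'>' if the k-mer appears in canonical form, '<' otherwise."""
    return ">" if kmer == canonical(kmer) else "<"

def bidirected_edges(reads, k):
    """Collect normalized bi-directed edges (vi, vj, di, dj) from all (k+1)-mers."""
    edges = set()
    for read in reads:
        for i in range(len(read) - k):
            x, y = read[i:i + k], read[i + 1:i + k + 1]
            forward = (canonical(x), canonical(y), orient(x), orient(y))
            # The same overlap as seen on the opposite strand.
            backward = (canonical(y), canonical(x), FLIP[orient(y)], FLIP[orient(x)])
            edges.add(min(forward, backward))
    return edges

edges = bidirected_edges(["AAGC"], 3)
```

Palindromic k-mers, which equal their own reverse complement and occur only for even k, would need special handling that this sketch omits.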
31
The bi-directed de Bruijn graphs
Kundeti et al. BMC Bioinformatics 2010, 11:560
32
NGS Assembly using Bi-directed de Bruijn graph
1. Construct a bi-directed de Bruijn graph
2. de Bruijn graph simplification (Compaction)
3. Removal of errors
4. Assemble sequence through Chinese Postman Walk (CPW) on bi-directed de Bruijn Graphs.
33
Construct a bi-directed de Bruijn graphs
1. Generate Edges
Kundeti et al. BMC Bioinformatics 2010, 11:560
34
Construct a bi-directed de Bruijn graphs
2. Reduce multiplicity: sort all bi-directed edges
Sorting takes O(n) time, where n = Nr, N is the number of reads, and r is the average read length. Sorting is the dominant step in building the bi-directed de Bruijn graph.
3. Collect bi-directed vertices
4. Generate adjacency lists.
35
Graph Compaction
Compact chains into single edges
Reduction to familiar list ranking
List ranking:
◦ Distributed linked list with adjacency information
◦ Find distance from each node to end of list
Extend:
◦ Multiple linked lists
◦ Identify nodes with multiple edges
◦ Undirected
36
Graph Compaction
Kundeti et al. BMC Bioinformatics 2010, 11:560
37
Graph Compaction
Edges in the graph are nodes for list ranking
Sort by edges to assign unique edge labels
Sort by nodes to bring edges incident to a node together
Identify nodes on a chain
Mark adjacent edges on chains
Perform undirected list ranking
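To show what compaction produces, here is a deliberately simplified sketch. It works sequentially on an ordinary directed graph and merges every maximal chain of nodes that have exactly one incoming and one outgoing edge. It is not the list-ranking formulation described above and ignores bi-directed orientations; the function name and the input format are illustrative.

```python
from collections import defaultdict

def compact_chains(edges):
    """Return maximal non-branching paths of a directed graph as node lists.

    edges is an iterable of (u, v) pairs. Cycles made up only of chain nodes
    are not reported by this sketch.
    """
    succ, pred = defaultdict(list), defaultdict(list)
    for u, v in edges:
        succ[u].append(v)
        pred[v].append(u)
    nodes = set(succ) | set(pred)

    def is_chain_node(x):
        return len(succ[x]) == 1 and len(pred[x]) == 1

    paths = []
    for u in sorted(nodes):
        if is_chain_node(u):
            continue                  # paths start only at branching or end nodes
        for v in list(succ[u]):
            path = [u, v]
            while is_chain_node(path[-1]):
                path.append(succ[path[-1]][0])
            paths.append(path)        # this whole path becomes one compacted edge
    return paths

paths = compact_chains([("A", "B"), ("B", "C"), ("C", "D"), ("C", "E")])
```

Each returned path would be replaced by a single edge labelled with the concatenated sequence of the nodes along it.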
38
Errors Detection and Removal
Assumption: the incidence of errors is random.
An error is unlikely to occur repeatedly at the same base.
Each base in the genome is sampled, on average, as many times as the coverage, which is high in NGS.
Combining these: errors can be identified by their comparatively low frequency.
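The frequency argument can be sketched by counting k-mers across all reads and reporting those seen fewer than a chosen number of times. The function names, the toy reads, and the cutoff are hypothetical; choosing the cutoff depends on coverage and error rate and is not specified in the slides.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def weak_kmers(reads, k, min_count):
    """k-mers seen fewer than min_count times, i.e. likely sequencing errors."""
    return {kmer for kmer, n in kmer_counts(reads, k).items() if n < min_count}

# Five correct copies of a read plus one copy with a single misread base.
reads = ["ACGTAC"] * 5 + ["ACGTTC"]
suspects = weak_kmers(reads, 3, min_count=2)
```

Here only the two k-mers that cover the misread base fall below the cutoff.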
39
Errors Detection and Removal
Tips: misreading of one or more bases towards the end of a short read
TCGTTGCGTGCGTGAGCGT
k
Tip
40
Errors Detection and Removal
Bubbles: misreading of one or more bases in the middle of a short read.
TCGTTGCGTGCGTGAGCGT
k k
Bubble
41
Errors Detection and Removal
Spurious links: when an erroneous (k+1)-molecule happens to be identical to a legitimate (k+1)-molecule found elsewhere in the genome.
Spurious link
42
Euler Circuits
A path is a connected sequence of edges showing a route on the graph that starts at a vertex and ends at a vertex.
A path that starts and ends at the same vertex is called a circuit.
Circuits that cover every edge once and only once are called Euler circuits.
43
Euler’s Theorem
1. If a graph G is connected and all of its nodes have even degree, then G has an Euler circuit.
2. If G has an Euler circuit, then G must be connected and all of its node degrees must be even.
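The theorem gives a simple test, and Hierholzer's algorithm (a standard method, not named in the slides) then constructs a circuit. The sketch below assumes an undirected multigraph given as a list of (u, v) pairs; the function names are illustrative.

```python
from collections import defaultdict

def has_euler_circuit(edges):
    """True if the undirected graph is connected and every degree is even."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    if not adj or any(len(nbrs) % 2 for nbrs in adj.values()):
        return False
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:                      # depth-first search for connectivity
        x = stack.pop()
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return len(seen) == len(adj)

def euler_circuit(edges):
    """Return the vertices of an Euler circuit (Hierholzer), or None."""
    if not has_euler_circuit(edges):
        return None
    adj = defaultdict(list)
    for idx, (u, v) in enumerate(edges):
        adj[u].append((v, idx))
        adj[v].append((u, idx))
    used = [False] * len(edges)
    stack, circuit = [edges[0][0]], []
    while stack:
        x = stack[-1]
        while adj[x] and used[adj[x][-1][1]]:
            adj[x].pop()              # discard edges already traversed
        if adj[x]:
            y, idx = adj[x].pop()
            used[idx] = True
            stack.append(y)
        else:
            circuit.append(stack.pop())
    return circuit

triangle = [("A", "B"), ("B", "C"), ("C", "A")]
circuit = euler_circuit(triangle)
```

The returned list starts and ends at the same vertex and uses every edge exactly once.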
44
Chinese Postman Problem
Suppose there is a mailman who needs to deliver mail to a certain neighborhood. That mailman is lazy, so he wants to find the shortest route through the neighborhood, that meets the following criteria:
It is a closed circuit (it ends at the same point it starts).
He needs to go through every street at least once.
If the graph traveled has an Eulerian Circuit, this circuit is the ideal solution.
45
Chinese Postman Problem Solution (Edmonds and Johnson)
1. Find the odd nodes in G.
2. Calculate the shortest paths between all pairs of odd nodes.
3. Construct a complete graph F on the odd nodes. The weight of each edge is the length of the shortest path between its endpoints.
4. Find the minimum weight perfect matching (MWPM) on F.
5. For every edge (u,v) in the MWPM, duplicate the shortest path between u and v in the original graph G to construct a new multigraph G’.
6. Find the Eulerian circuit in G’.
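The steps can be sketched for small undirected graphs as follows. The sketch only computes the cost of the optimal route, which is the total edge weight plus the weight of the minimum matching of odd nodes. It assumes a connected graph with non-negative weights and nodes numbered 0..n-1, uses Floyd-Warshall for the shortest paths, and finds the matching by brute force, which is exponential in the number of odd nodes; a practical implementation would use a polynomial matching algorithm instead. The function name and the input format are illustrative.

```python
def chinese_postman_cost(n, edges):
    """Cost of a minimum closed walk covering every edge at least once.

    edges is a list of (u, v, weight) tuples on nodes 0..n-1.
    """
    inf = float("inf")
    dist = [[0 if i == j else inf for j in range(n)] for i in range(n)]
    degree = [0] * n
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], w)
        dist[v][u] = min(dist[v][u], w)
        degree[u] += 1
        degree[v] += 1
    for k in range(n):                # Floyd-Warshall all-pairs shortest paths
        for i in range(n):
            for j in range(n):
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]

    odd = [x for x in range(n) if degree[x] % 2 == 1]

    def min_matching(nodes):
        # Pair the first node with each candidate partner and recurse.
        if not nodes:
            return 0
        first, rest = nodes[0], nodes[1:]
        return min(dist[first][p] + min_matching([x for x in rest if x != p])
                   for p in rest)

    return sum(w for _, _, w in edges) + min_matching(odd)

square_with_diagonal = [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 0, 1), (0, 2, 1)]
cost = chinese_postman_cost(4, square_with_diagonal)
```

In the example, nodes 0 and 2 have odd degree, so the diagonal between them is traversed twice.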
46
Strategy for Solving Chinese Postman Problem
1. Eulerize the graph.
2. Find an Euler circuit on the new graph.
3. “Squeeze” this Euler circuit from the Eulerized graph onto the original graph by reusing an edge of the original graph each time the circuit on the eulerized graph uses an added edge.
47
Chinese Postman Walk (CPW) on bi-directed DBG
A Chinese Postman walk in a bi-directed graph is a bi-directed walk which visits every edge at least once.
A cyclic Chinese Postman walk of minimum cost on a weighted bi-directed graph is denoted as CPW
48
Chinese Postman Walk (CPW) on bi-directed DBG
Lemma 1. A connected bi-directed graph is Eulerian if and only if every vertex is balanced.
Lemma 2. A non-Eulerian bi-directed graph G = (V,E) has a cyclic Chinese Postman walk if and only if there is a corresponding multi-bi-directed graph Gm = (V,Em) which is Eulerian.
Lemma 3. Finding a cyclic CP walk on a bi-directed graph G(V,E) is equivalent to finding a minimum weight Eulerian multi-bi-directed graph Gm(V,Em) corresponding to G.
Lemma 4. If a bi-directed graph G(V,E) has a cyclic CP walk, then the cost of that walk is equal to the weight of Gm(V,Em).
49
Chinese Postman Walk (CPW) on bi-directed DBG
Lemma 5. A non-Eulerian bi-directed graph G(V,E) has a cyclic CP walk if and only if the balancing bi-partite graph B(P,Q,Eb) has a perfect matching.
Lemma 6. If G(V,E) is a non Eulerian bi-directed graph that has a cyclic CP walk, then every corresponding Eulerian multi-bi-directed graph Gm(V,Em) belongs to the family F.
50
Chinese Postman Walk (CPW) on bi-directed DBG