Bioinformatics Algorithms Sequence Assembly 2 Copy Right Notice Most slides in this presentation are adopted from slides of text book and various sources

Bioinformatics Algorithms

Sequence Assembly

2

Copyright Notice

Most slides in this presentation are adapted from textbook slides and various other sources. Copyright belongs to the original authors. Thanks!

3

Genome Sequencing

Goal: determine the order of nucleotides across a genome.

Problem: current DNA sequencing methods can handle only short stretches of DNA at once (<1-2 Kbp).

Solution: sequence short fragments, then use computers to assemble the small pieces.

4

Genome Sequencing

Short reads: ACGTGGTAA CGTATACAC TAGGCCATA GTAATGGCG CACCCTTAG TGGCGTATA CATA…

Assembled sequence: ACGTGGTAATGGCGTATACACCCTTAGGCCATA

[Figure: genome → short fragments of DNA → short DNA sequences (reads) → sequenced genome]

5

Sanger Sequencing

Timeline (1980-2000s):

1982: lambda virus, DNA stretches up to 30-40 Kbp (Sanger et al.)

1994: H. influenzae, 1.8 Mbp (Fleischmann et al.)

2001: H. sapiens, D. melanogaster, 3 Gbp (Venter et al.)

2007: Global Ocean Sampling Expedition, ~3,000 organisms, 7 Gbp (Venter et al.)

6

Sanger Sequencing

Advantages: long reads (~900 bp); suitable for small projects

Disadvantages: low throughput; expensive

7

Sequencing the Human Genome

[Figure: log10(price) versus year, 2000-2010]

2001: Human Genome Project, 2.7G$, 11 years

2001: Celera, 100M$, 3 years

2007: 454, 1M$, 3 months

2008: ABI SOLiD, 60K$, 2 weeks

2009: Illumina, Helicos, 40-50K$

2010: 5K$, a few days?

2012: 100$, <24 hrs?

8

Next Generation Sequencing

Alternatives to capillary sequencing, introduced in the mid-2000s.

Systems by Illumina Solexa and ABI SOLiD.

Much higher throughput (1-4 Gbp/day)

Lower cost / base pair

Very short fragment lengths

High error rate

Inherent ability to do paired-end (mate-pair) sequencing.

9

Technology Summary

Platform   Read length   Sequencing technology   Throughput (per run)   Cost (1 Mbp)*
Sanger     ~800 bp       Sanger                  400 Kbp                500$
454        ~400 bp       Polony                  500 Mbp                60$
Solexa     75 bp         Polony                  20 Gbp                 2$
SOLiD      75 bp         Polony                  60 Gbp                 2$
Helicos    30-35 bp      Single molecule         25 Gbp                 1$

*Source: Shendure & Ji, Nat Biotech, 2008

10

Assembly

Cut DNA into larger pieces (2 Kbp, 15 Kbp) and sequence both ends of each piece (Fleischmann et al., 1994)

[Figure: 2 Kbp and 15 Kbp mate pairs linking contig 1 and contig 2; each end read is ~500 bp, leaving ~(length − 1,000) bp between the two reads of a pair]

Benefits: resolving repeats; better assembly of contigs; gap length estimation

11

Assembly: How Much DNA?

Low coverage: a few pieces to assemble (input); many contigs, many gaps (output)

High coverage: many pieces to assemble (input); a few contigs, a few gaps (output)

(Lander and Waterman, 1988)
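The coverage trade-off can be put in numbers with the idealized Lander-Waterman estimates. The sketch below assumes reads land uniformly at random; the function name, parameter names, and the example genome size are hypothetical and chosen only for illustration.

```python
import math

def lander_waterman(genome_len, read_len, num_reads, min_overlap=0):
    """Idealized Lander-Waterman estimates (assumes uniformly random reads)."""
    coverage = num_reads * read_len / genome_len      # c = N * L / G
    sigma = 1 - min_overlap / read_len                # usable fraction of a read
    expected_contigs = num_reads * math.exp(-coverage * sigma)
    uncovered_fraction = math.exp(-coverage)          # genome fraction left in gaps
    return coverage, expected_contigs, uncovered_fraction

# Hypothetical 1 Mbp genome sequenced with 500 bp reads at 1x and at 8x coverage
low = lander_waterman(1_000_000, 500, 2_000)
high = lander_waterman(1_000_000, 500, 16_000)
```

Under these assumptions the 1x run is expected to leave hundreds of contigs, while the 8x run leaves only a handful, which matches the qualitative picture on the slide.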

12

Assembly paradigms

Overlap-layout-consensus
- greedy (TIGR Assembler, phrap, CAP3...)
- graph-based (Celera Assembler, Arachne)

de Bruijn graph-based approaches (especially useful for short-read sequencing)

13

Overlap-Layout-Consensus

Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA

Overlap: find potentially overlapping reads

Layout: merge reads into contigs and contigs into supercontigs (scaffolds)

Consensus: derive the DNA sequence and correct read errors

..ACGATTACAATAGGTT..

14

OVERLAP GRAPH

Edge types: regular dovetail, prefix dovetail, suffix dovetail

[Figure: reads A and B drawn in each of the three overlap configurations]

E.g., edges are annotated with deltas of overlaps

15

OVERLAP GRAPH

Find the best match between the suffix of one read and the prefix of another

Due to sequencing errors, need to use dynamic programming to find the optimal overlap alignment

Apply a fast filtration method to filter out pairs of fragments that do not share a significantly long common substring

16

The Maximum Overlap Graph

Each edge, (u,v) is weighted with the length of the maximal overlap between a suffix of u and a prefix of v

[Figure: maximum overlap graph on four reads, a = TACGA, b = ACCC, c = CTAAAG, d = GACA, with edge weights 1, 1, 1, 2, 1; 0-weight edges omitted]
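A minimal sketch of how such a graph could be computed. It assumes error-free reads and exact suffix/prefix matching (real assemblers need an overlap alignment, as noted earlier), and it assumes the read labels a = TACGA, b = ACCC, c = CTAAAG, d = GACA read off the figure. The function names are hypothetical.

```python
def overlap(u, v):
    """Length of the longest proper suffix of u that equals a prefix of v."""
    for n in range(min(len(u), len(v)) - 1, 0, -1):
        if u[-n:] == v[:n]:
            return n
    return 0

def overlap_graph(reads):
    """reads: dict label -> sequence. Returns {(u, v): weight}, 0-weight edges omitted."""
    edges = {}
    for a, u in reads.items():
        for b, v in reads.items():
            if a != b:
                w = overlap(u, v)
                if w > 0:
                    edges[(a, b)] = w
    return edges

reads = {"a": "TACGA", "b": "ACCC", "c": "CTAAAG", "d": "GACA"}
edges = overlap_graph(reads)
```

With these four reads the sketch yields five edges, one of weight 2 (a to d) and four of weight 1, in agreement with the weights shown in the figure.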

17

Paths and Layouts

The path dbc leads to the alignment:

GACA
   ACCC
      CTAAAG

[Figure: the same overlap graph as on the previous slide]

18

Superstrings

Every path that covers every node is a superstring

Zero weight edges result in alignments like:

GACA
    GCCC
        TTAAAG

Higher weights produce more overlap, and thus shorter strings

The shortest common superstring is the highest weight path that covers every node
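To make the link between path weight and superstring length concrete, here is a small sketch that spells the superstring for a given ordering of reads. It assumes exact overlaps between consecutive reads; the helper names are hypothetical.

```python
def overlap(u, v):
    """Length of the longest proper suffix of u that equals a prefix of v."""
    for n in range(min(len(u), len(v)) - 1, 0, -1):
        if u[-n:] == v[:n]:
            return n
    return 0

def spell_path(reads_in_order):
    """Merge consecutive reads along a path, collapsing each overlap once."""
    superstring = reads_in_order[0]
    for prev, nxt in zip(reads_in_order, reads_in_order[1:]):
        superstring += nxt[overlap(prev, nxt):]
    return superstring

with_overlaps = spell_path(["GACA", "ACCC", "CTAAAG"])   # edges of weight 1 and 1
zero_weight = spell_path(["GACA", "GCCC", "TTAAAG"])     # edges of weight 0 and 0
```

The first path has total weight 2 and spells a 12-character string; the second has total weight 0 and simply concatenates the reads into 14 characters.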

19

Graph formulation of SCS

Input: A weighted, directed graph

Output: The highest-weight path that touches every node of the graph

Does this problem sound familiar?

20

The Greedy Algorithm

Algorithm greedy
  Sort edges in decreasing weight order
  For each edge in this order
    If the edge does not form a cycle
       and it does not start at the same node as another edge in the set
       and it does not end at the same node as another edge in the set
    then add the edge to the current set
  End for
End Algorithm
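The greedy selection can be sketched in Python as follows. This is an illustrative version, not a reference implementation: it processes the heaviest edges first, breaks ties alphabetically (an arbitrary choice), and uses a union-find structure to detect cycles. The example edge weights are the overlaps among the reads a = TACGA, b = ACCC, c = CTAAAG, d = GACA.

```python
def greedy_path_edges(edges):
    """edges: dict {(u, v): weight}. Returns the edges kept by the greedy rule."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    has_out, has_in, chosen = set(), set(), []
    # Heaviest edges first; ties broken by edge name for determinism.
    for (u, v), _w in sorted(edges.items(), key=lambda e: (-e[1], e[0])):
        if u in has_out or v in has_in:
            continue                        # would share a start or an end node
        root_u, root_v = find(u), find(v)
        if root_u == root_v:
            continue                        # would close a cycle
        parent[root_u] = root_v
        has_out.add(u)
        has_in.add(v)
        chosen.append((u, v))
    return chosen

example = {("a", "d"): 2, ("a", "b"): 1, ("b", "c"): 1, ("c", "d"): 1, ("d", "b"): 1}
chosen = greedy_path_edges(example)
```

On this example the kept edges chain into the path a, d, b, c. Greedy selection is a heuristic and is not guaranteed to find the highest-weight path in general.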

21

Greedy Example

[Figure: example graph with edge weights 7, 6, 5, 4, 3, 2, 1, 2, 2]

22

Handling repeats

1. Repeat detection
- pre-assembly: find fragments that belong to repeats
  - statistically (most existing assemblers)
  - repeat database (RepeatMasker)
- during assembly: detect "tangles" indicative of repeats (Pevzner, Tang, Waterman 2001)
- post-assembly: find repetitive regions and potential mis-assemblies
  - Reputer, RepeatMasker
  - "unhappy" mate-pairs (too close, too far, mis-oriented)

2. Repeat resolution
- find DNA fragments belonging to the repeat
- determine correct tiling across the repeat

23

Statistical repeat detection

Significant deviations from average coverage flagged as repeats.

- frequent k-mers are ignored

- “arrival” rate of reads in contigs compared with theoretical value

(e.g., 800 bp reads & 8x coverage - reads "arrive" every 100 bp)

Problem 1: assumption of uniform distribution of fragments - leads to false positives

non-random libraries

poor clonability regions

Problem 2: repeats with low copy number are missed - leads to false negatives
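The arrival-rate test can be sketched as below. The threshold factor of 2 and the contig records are hypothetical values chosen for illustration; real assemblers use a proper statistical test rather than a fixed ratio.

```python
def expected_spacing(read_len, coverage):
    """Average distance (bp) between read starts under uniform sampling."""
    return read_len / coverage

def flag_repeat_contigs(contigs, read_len, coverage, factor=2.0):
    """contigs: list of (name, length_bp, num_reads).

    Flags contigs whose read arrival rate exceeds `factor` times the
    theoretical rate. `factor` is an illustrative assumption.
    """
    expected_rate = coverage / read_len          # reads starting per bp
    return [name for name, length, n in contigs
            if n / length > factor * expected_rate]

spacing = expected_spacing(800, 8)
flagged = flag_repeat_contigs(
    [("contig1", 10_000, 100), ("contig2", 10_000, 350)], 800, 8)
```

Here contig1 receives reads at the expected rate of one per 100 bp, while contig2 receives them 3.5 times as often and is flagged as a likely collapsed repeat.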

24

Consensus

A consensus sequence is derived from a profile of the assembled fragments

A sufficient number of reads is required to ensure a statistically significant consensus

Reading errors are corrected

25

Derive Consensus Sequence

Derive multiple alignment from pairwise read alignments (i.e., progressive alignment)

TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA

Consensus:
TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA

Derive each consensus base by weighted voting.

Another approach, based on finding a longest path in a DAG, is used in a popular assembler.
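A minimal sketch of column-wise voting on the five aligned reads of this example. It assumes equal weights for all reads (plain majority vote) and writes gaps as "-"; weighting by base quality, as the slide suggests, is not modeled here.

```python
from collections import Counter

def consensus(rows):
    """Majority vote per alignment column; all rows must have equal length."""
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*rows))

rows = [
    "TAGATTACACAGATTACTGA-TTGATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAAACTA",
    "TAG-TTACACAGATTATTGACTTCATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAA-CTA",
    "TAGATTACACAGATTACTGACTTGATGGGGTAA-CTA",
]
result = consensus(rows)
sequence = result.replace("-", "")   # drop gap columns to get the final sequence
```

Each isolated read error (a missing base, a substituted base, an inserted base) is outvoted by the other four reads.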

26

Definitions

Let $v$ and $w$ be two strings over the alphabet $\Sigma$. The concatenation of $v$ and $w$ is denoted by $vw$, and $v[i]$ is the $i$th symbol of $v$, $1 \le i \le |v|$. $v[i,j]$ denotes a substring of $v$, and for any $x \in \Sigma^*$, $x^k$, $k \ge 1$, is $x$ concatenated with itself $k$ times.

A string of length $k$ is called a $k$-mer. The $k$-spectrum of $v$ is the set of all $k$-mers that are substrings of $v$. [Example: $v = abcd$; the 2-spectrum of $v$ is $\{ab, bc, cd\}$, the 3-spectrum of $v$ is $\{abc, bcd\}$.]

A DNA strand is a string over the alphabet $\Sigma = \{a, g, c, t\}$. Characters of a DNA strand are called bases. The complement of a base $\alpha[i]$, denoted by $\overline{\alpha[i]}$, is defined by the following bijection of $\Sigma$ onto $\Sigma$: $\{t \to a, c \to g, a \to t, g \to c\}$.

The reverse complement of a DNA strand $\alpha$, denoted by $\bar{\alpha}$, is obtained by reversing $\alpha$ and complementing each base ($\bar{\alpha}[i] = \overline{\alpha[|\alpha| - i + 1]}$). Note that $\overline{\overline{\alpha[i]}} = \alpha[i]$ and $\bar{\bar{\alpha}} = \alpha$.

A DNA molecule is a pair of complementary DNA strands, $m = \{\alpha_m, \bar{\alpha}_m\}$. We denote the length of $m$ as $|m| = |\alpha_m| = h$ and call $m$ an $h$-molecule.
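These definitions translate directly into a few lines of Python. The function names below are hypothetical; the lowercase alphabet follows the slide.

```python
COMPLEMENT = {"t": "a", "c": "g", "a": "t", "g": "c"}

def k_spectrum(v, k):
    """Set of all k-mers that occur as substrings of v."""
    return {v[i:i + k] for i in range(len(v) - k + 1)}

def reverse_complement(strand):
    """Reverse the strand and complement each base."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))

two_spectrum = k_spectrum("abcd", 2)
three_spectrum = k_spectrum("abcd", 3)
rc = reverse_complement("aacg")
```

Applying `reverse_complement` twice returns the original strand, mirroring the identity stated above.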

27

The bi-directed graphs

A bidirected graph is one in which each edge is given an independent orientation (or direction, or arrow – thus, 2 kinds of arrowheads) at each end. Thus, there are three kinds of bidirected edges: (1) those where the arrows point outward, towards the vertices, at both ends; (2) those where both arrows point inward, away from the vertices; and (3) those in which one arrow points away from its vertex and towards the opposite end, while the other arrow points in the same direction as the first, away from the opposite end and towards its own vertex.

28

The bi-directed graphs

We denote a bidirected graph by an incidence matrix I: V × E → {−2, −1, 0, 1, 2}. I(x,e) = 0 if edge e is not incident to node x, +1 if e is positive-incident to x [denoted by a diamond], −1 if e is negative-incident to x, +2 if e is a positive-incident self-loop on x, and −2 if e is a negative-incident self-loop on x. The in-degree degn(x) and out-degree degp(x) of a vertex are defined as usual. The balance at a node x is bal(x) = degp(x) − degn(x); a graph is balanced if the balance of each vertex is 0. A walk is a sequence x1 e1 … xk−1 ek−1 xk where ei is an edge between nodes xi and xi+1, and ei and ei−1 have opposite orientations at xi.

bal(W) = 0, bal(X) = −1, bal(Y) = 1, bal(Z) = −1; the graph is not balanced. We can view a loopless directed graph as a special kind of bidirected graph in which every edge is positive-incident to one of its endpoints and negative-incident to the other; the definition of a walk then reduces to its usual meaning in directed graphs. However, in a bidirected graph the shortest walk between two vertices may have to repeat a vertex, which is not possible in a directed graph. Here, there is no walk between W and Z that does not repeat a vertex: the walk from node W to node Z is ABCBD. Observe that AD is not a walk but BD is.

[Figure: bidirected graph on nodes W, X, Y, Z with edges labeled A, B, C, D]

Incidence matrix:

     A    B    C    D    E
W    1    0    0    0   -1
X   -1    1    0   -1   -1
Y    0   -1    2    0    0
Z    0    0    0   -1    0

29

The bi-directed de Bruijn graphs

A bi-directed de Bruijn graph

Nodes: all possible k-mers

Edges: ((v, dv), (u, du)) or (v, u, dv, du)

There is an edge between v and u iff suffix(v, k-1) = prefix(u, k-1), with v and u each taken either as given or as its reverse complement (as recorded by dv and du), and v[1].suffix(v, k-1).u[k] is a (k+1)-substring of S or of the reverse complement of S.

30


Canonical k-mer: of a k-mer and its reverse complement, the lexicographically smaller (or, by the opposite convention, larger) one is defined as the canonical k-mer, and the other one is the non-canonical k-mer.

The orientations of the arrow heads on the edges are chosen as follows.

If the canonical k-mers of nodes vi and vj overlap, then an edge (vi, vj, >, >) is introduced. If the canonical k-mer of vi overlaps with the non-canonical k-mer of vj, then an edge (vi, vj, >, <) is introduced. If the non-canonical k-mer of vi overlaps with the canonical k-mer of vj, then an edge (vi, vj, <, >) is introduced.

A walk W(vi, vj) between two nodes vi, vj ∈ V of a bi-directed graph G(V, E) is a sequence vi, ei, vi1, ei1, vi2, ..., vim, eim, vj such that, for every intermediate vertex vil, 1 ≤ l ≤ m, the orientation of the arrowhead on the incoming edge at vil matches the orientation of the arrowhead on the outgoing edge.
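The orientation rules for edges can be sketched as follows. This is an illustrative reading, not the authors' code: it assumes the lexicographically smaller form is canonical, marks a k-mer read in canonical form with ">" and in non-canonical form with "<", and emits one edge per (k+1)-mer of each read. The case where both k-mers appear in non-canonical form, which the slide does not list, is recorded as (<, <). All function names are hypothetical.

```python
COMP = str.maketrans("ACGT", "TGCA")

def revcomp(s):
    """Reverse complement of an uppercase DNA string."""
    return s.translate(COMP)[::-1]

def oriented(kmer):
    """Return (canonical k-mer, '>' if kmer is already canonical else '<')."""
    rc = revcomp(kmer)
    return (kmer, ">") if kmer <= rc else (rc, "<")

def bidirected_edges(reads, k):
    """One edge (u, v, du, dv) per (k+1)-mer; nodes are canonical k-mers."""
    edges = []
    for read in reads:
        for i in range(len(read) - k):
            u, du = oriented(read[i:i + k])
            v, dv = oriented(read[i + 1:i + k + 1])
            edges.append((u, v, du, dv))
    return edges

same_strand = bidirected_edges(["AAC"], 2)
mixed = bidirected_edges(["ACT"], 2)
```

In the second example the k-mer CT is non-canonical (its reverse complement AG is smaller), so the edge joins nodes AC and AG with orientations (>, <).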

31


Kundeti et al. BMC Bioinformatics 2010, 11:560

32

NGS Assembly using Bi-directed de Bruijn graph

1. Construct a bi-directed de Bruijn graph

2. de Bruijn graph simplification (Compaction)

3. Removal of errors

4. Assemble sequence through Chinese Postman Walk (CPW) on bi-directed de Bruijn Graphs.

33

Construct a bi-directed de Bruijn graph

1. Generate Edges


34

Construct a bi-directed de Bruijn graph

2. Reduce multiplicity: sort all bi-directed edges

The sorting takes O(n) time, where n = Nr, N is the number of reads, and r is the average read length. Sorting is the dominant step in building the bi-directed de Bruijn graph.

3. Collect bi-directed vertices

4. Generate adjacency lists.
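Steps 2 to 4 can be sketched on a list of already generated edges. The edge tuples (u, v, du, dv) and the function name are assumptions made for illustration. Note that Python's `sorted` is a comparison sort, so this sketch does not reproduce the O(n) bound stated above.

```python
from itertools import groupby

def build_graph(edges):
    """edges: list of (u, v, du, dv) tuples, possibly with duplicates."""
    # Step 2: sort so that identical edges become adjacent, then collapse
    # each run into one edge with its multiplicity.
    unique = [(edge, len(list(run))) for edge, run in groupby(sorted(edges))]
    # Step 3: collect the vertices that appear on any edge.
    vertices = sorted({x for (u, v, _, _), _ in unique for x in (u, v)})
    # Step 4: adjacency lists; a bi-directed edge is stored at both endpoints.
    adjacency = {x: [] for x in vertices}
    for edge, count in unique:
        u, v = edge[0], edge[1]
        adjacency[u].append((edge, count))
        if v != u:
            adjacency[v].append((edge, count))
    return unique, vertices, adjacency

raw = [("AA", "AC", ">", ">"), ("AC", "AG", ">", "<"), ("AA", "AC", ">", ">")]
unique, vertices, adjacency = build_graph(raw)
```

The duplicated edge between AA and AC is collapsed into a single entry with multiplicity 2.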

35

Graph Compaction

Compact chains into single edges

Reduction to familiar list ranking

List ranking:
◦ Distributed linked list with adjacency information
◦ Find distance from each node to end of list

Extend:
◦ Multiple linked lists
◦ Identify nodes with multiple edges
◦ Undirected

36

Graph Compaction


37

Graph Compaction

Edges in the graph are nodes for list ranking

Sort by edges to assign unique edge labels

Sort by nodes to bring edges incident to a node together

Identify nodes on a chain

Mark adjacent edges on chains

Perform undirected list ranking
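For reference, basic list ranking on a single directed linked list can be sketched with pointer jumping. This is a sequential simulation of the parallel rounds and covers only the directed, single-list case; the extension to multiple undirected lists described above is not shown. The encoding of the list as a successor array with -1 marking the end is an assumption of this sketch.

```python
def list_rank(succ):
    """succ[i] is the index of the node after i, or -1 at the end of the list.

    Returns rank[i] = number of links from node i to the end of the list.
    """
    n = len(succ)
    succ = list(succ)
    rank = [0 if s == -1 else 1 for s in succ]
    while any(s != -1 for s in succ):
        # One synchronous round: every node jumps over its successor.
        new_rank, new_succ = rank[:], succ[:]
        for i in range(n):
            if succ[i] != -1:
                new_rank[i] = rank[i] + rank[succ[i]]
                new_succ[i] = succ[succ[i]]
        rank, succ = new_rank, new_succ
    return rank

# List order: 3 -> 1 -> 4 -> 0 -> 2 (end)
ranks = list_rank([2, 4, -1, 1, 0])
```

Each round doubles the distance covered by every pointer, so the number of rounds grows logarithmically with the list length.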

38

Error Detection and Removal

Assumption: errors occur at random.

An error is unlikely to occur repeatedly at the same base.

Each base in the genome is sampled, on average, as many times as the coverage, which is high in NGS.

Combining these: errors can be identified by their comparatively low frequency.
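The frequency argument can be illustrated with a simple k-mer count filter. The cutoff `min_count=2` and the toy reads are hypothetical; in practice the cutoff depends on coverage and error rate.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def trusted_kmers(reads, k, min_count=2):
    """k-mers seen at least min_count times; rarer ones are suspected errors."""
    return {kmer for kmer, c in kmer_counts(reads, k).items() if c >= min_count}

# Three correct copies of a read and one copy with a single misread base
reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGAAC"]
trusted = trusted_kmers(reads, 3)
```

The three k-mers that overlap the misread base each occur only once and are filtered out, while the k-mers from the correct reads are kept.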

39


Tips: misreading of one or more bases towards the end of a short read

[Figure: read TCGTTGCGTGCGTGAGCGT, with a stretch of length k marked, and the resulting tip]

40


Bubbles: misreading of one or more bases in the middle of a short read.

[Figure: read TCGTTGCGTGCGTGAGCGT, with two stretches of length k marked, and the resulting bubble]

41


Spurious links: arise when an erroneous (k+1)-molecule happens to be identical to a legitimate (k+1)-molecule found elsewhere in the genome.

[Figure: spurious link]

42

Euler Circuits

A path is a connected sequence of edges showing a route on the graph that starts at a vertex and ends at a vertex.

A path that starts and ends at the same vertex is called a circuit.

Circuits that cover every edge once and only once are called Euler circuits.

43

Euler’s Theorem

1. If a graph G is connected and all of its nodes have even degree, then G has an Euler circuit.

2. If G has an Euler circuit, then G must be connected and all of its nodes must have even degree.
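Both directions of the theorem can be tried out on small undirected graphs. The sketch below checks the even-degree and connectivity conditions and then builds a circuit with Hierholzer's algorithm, which is one standard construction and is not named on the slides. Function names are hypothetical.

```python
from collections import defaultdict

def has_euler_circuit(edges):
    """edges: list of (u, v) pairs of an undirected graph."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    if not adj or any(len(nbrs) % 2 for nbrs in adj.values()):
        return False
    start = next(iter(adj))
    seen, stack = {start}, [start]
    while stack:                      # depth-first search for connectivity
        for y in adj[stack.pop()]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return len(seen) == len(adj)

def euler_circuit(edges, start):
    """Hierholzer's algorithm; assumes has_euler_circuit(edges) is True."""
    adj = defaultdict(list)
    for i, (u, v) in enumerate(edges):
        adj[u].append((v, i))
        adj[v].append((u, i))
    used = [False] * len(edges)
    stack, circuit = [start], []
    while stack:
        x = stack[-1]
        while adj[x] and used[adj[x][-1][1]]:
            adj[x].pop()              # discard edges already traversed
        if adj[x]:
            y, i = adj[x].pop()
            used[i] = True
            stack.append(y)
        else:
            circuit.append(stack.pop())
    return circuit

triangle = [(1, 2), (2, 3), (3, 1)]
circuit = euler_circuit(triangle, 1)
```

A path graph such as 1-2-3 fails the degree test because its two end nodes have odd degree.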

44

Chinese Postman Problem

Suppose there is a mailman who needs to deliver mail to a certain neighborhood. That mailman is lazy, so he wants to find the shortest route through the neighborhood that meets the following criteria:

It is a closed circuit (it ends at the same point it starts).

He needs to go through every street at least once.

If the graph traveled has an Eulerian circuit, this circuit is the ideal solution.

45

Chinese Postman Problem Solution (Edmonds and Johnson)

1. Find the odd-degree nodes in G.

2. Calculate the shortest paths between all pairs of odd nodes.

3. Construct a complete graph F on the odd nodes. The weight of each edge is the length of the shortest path between its endpoints.

4. Find a minimum-weight perfect matching (MWPM) on F.

5. For every edge (u,v) in the MWPM, duplicate the shortest path between u and v in the original graph G to construct a new multigraph G’.

6. Find an Eulerian circuit in G’.
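Steps 1 to 4 can be sketched for small undirected graphs as below. Two simplifications are assumed: shortest paths come from Floyd-Warshall, and the minimum-weight perfect matching is found by brute-force enumeration, which is only practical for a handful of odd nodes (a polynomial-time matching algorithm would be used in practice). The sketch returns the length of the optimal postman tour and the matched pairs; duplicating the paths and extracting the Eulerian circuit (steps 5 and 6) are not shown. The graph is assumed to be connected, and all names are hypothetical.

```python
def chinese_postman_cost(nodes, edges):
    """edges: list of (u, v, weight) of a connected undirected graph.

    Returns (tour_length, matching) where matching pairs up the odd nodes.
    """
    INF = float("inf")
    dist = {u: {v: (0 if u == v else INF) for v in nodes} for u in nodes}
    degree = {u: 0 for u in nodes}
    for u, v, w in edges:
        dist[u][v] = min(dist[u][v], w)
        dist[v][u] = min(dist[v][u], w)
        degree[u] += 1
        degree[v] += 1
    for k in nodes:                                   # Floyd-Warshall (step 2)
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    odd = [u for u in nodes if degree[u] % 2 == 1]    # step 1

    def best_matching(remaining):                     # steps 3-4, brute force
        if not remaining:
            return 0, []
        first, rest = remaining[0], remaining[1:]
        best = (INF, [])
        for i, partner in enumerate(rest):
            cost, pairs = best_matching(rest[:i] + rest[i + 1:])
            cost += dist[first][partner]
            if cost < best[0]:
                best = (cost, [(first, partner)] + pairs)
        return best

    extra, matching = best_matching(odd)
    return sum(w for _, _, w in edges) + extra, matching

# Square A-B-C-D with one diagonal A-C, all edges of weight 1
square = [("A", "B", 1), ("B", "C", 1), ("C", "D", 1), ("D", "A", 1), ("A", "C", 1)]
cost, matching = chinese_postman_cost(["A", "B", "C", "D"], square)
```

In the example, A and C have odd degree, so the diagonal is traversed twice and the tour length is 5 + 1 = 6.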

46

Strategy for Solving Chinese Postman Problem

1. Eulerize the graph.

2. Find an Euler circuit on the new graph.

3. “Squeeze” this Euler circuit from the Eulerized graph onto the original graph by reusing an edge of the original graph each time the circuit on the Eulerized graph uses an added edge.

47

Chinese Postman Walk (CPW) on bi-directed DBG

A Chinese Postman walk in a bi-directed graph is a bi-directed walk which visits every edge at least once.

A cyclic Chinese Postman walk of minimum cost on a weighted bi-directed graph is denoted as CPW

48


Lemma 1. A connected bi-directed graph is Eulerian if and only if every vertex is balanced.

Lemma 2. A non-Eulerian bi-directed graph G = (V,E) has a cyclic Chinese Postman walk if and only if there exists a corresponding multi-bi-directed graph Gm = (V,Em) which is Eulerian.

Lemma 3. Finding a cyclic CP walk on a bi-directed graph G(V,E) is equivalent to finding a minimum-weight Eulerian multi-bi-directed graph Gm(V,Em) corresponding to G.

Lemma 4. If a bi-directed graph G(V,E) has a cyclic CP walk, then the cost of that walk is equal to the weight of the corresponding minimum-weight Eulerian multi-bi-directed graph Gm(V,Em).

49


Lemma 5. A non-Eulerian bi-directed graph G(V,E) has a cyclic CP walk if and only if the balancing bi-partite graph B(P,Q,Eb) has a perfect matching.

Lemma 6. If G(V,E) is a non-Eulerian bi-directed graph that has a cyclic CP walk, then every corresponding Eulerian multi-bi-directed graph Gm(V,Em) belongs to the family F.

50

