Bioinformatics Algorithms Sequence Assembly


Bioinformatics Algorithms

Sequence Assembly


Copyright Notice

Most slides in this presentation are adapted from the textbook's slides and various other sources. The copyright belongs to the original authors. Thanks!


Genome Sequencing

Goal: figuring out the order of nucleotides across a genome

Problem: current DNA sequencing methods can handle only short stretches of DNA at once (<1-2 Kbp)

Solution: sequence the short pieces, then use computers to assemble them


Genome Sequencing

[Figure: a genome is broken into short fragments of DNA, the fragments are read as short DNA sequences, and the short sequences are assembled into the sequenced genome.]

Short DNA sequences: ACGTGGTAA CGTATACAC TAGGCCATA GTAATGGCG CACCCTTAG TGGCGTATA CATA…

Sequenced genome: ACGTGGTAATGGCGTATACACCCTTAGGCCATA


Sanger Sequencing

[Timeline: 1980 to 2000s]

1982: lambda virus, DNA stretches up to 30-40 Kbp (Sanger et al.)

1994: H. influenzae, 1.8 Mbp (Fleischmann et al.)

2001: H. sapiens, D. melanogaster, 3 Gbp (Venter et al.)

2007: Global Ocean Sampling Expedition, ~3,000 organisms, 7 Gbp (Venter et al.)


Sanger Sequencing

Advantages: long reads (~900 bp); suitable for small projects

Disadvantages: low throughput; expensive


Sequencing the Human Genome

[Chart: log10(price) versus year, 2000 to 2010]

2001: Human Genome Project, 2.7G$, 11 years
2001: Celera, 100M$, 3 years
2007: 454, 1M$, 3 months
2008: ABI SOLiD, 60K$, 2 weeks
2009: Illumina, Helicos, 40-50K$
2010: 5K$, a few days?
2012: 100$, <24 hrs?


Next Generation Sequencing

Sequencing technologies that offer an alternative to capillary sequencing, introduced in the mid-2000s.

Systems by Illumina Solexa and ABI SOLiD.

Much higher throughput (1-4 Gbp / day)

Lower cost / base pair

Very short fragment lengths

High error rate

Inherent ability to do paired-end (mate-pair) sequencing.


Technology Summary

Technology   Read length   Sequencing technology   Throughput (per run)   Cost (1 Mbp)*
Sanger       ~800 bp       Sanger                  400 Kbp                500$
454          ~400 bp       Polony                  500 Mbp                60$
Solexa       75 bp         Polony                  20 Gbp                 2$
SOLiD        75 bp         Polony                  60 Gbp                 2$
Helicos      30-35 bp      Single molecule         25 Gbp                 1$

*Source: Shendure & Ji, Nat Biotech, 2008


Assembly

Cut DNA into larger pieces (2 Kbp, 15 Kbp) and sequence both ends of each piece (Fleischmann et al., 1994)

[Figure: a piece with ~500 bp sequenced at each end and ~(length - 1,000) bp unsequenced in between; 2 Kbp mates and 15 Kbp mates link contig 1 and contig 2.]

Resolving repeats

Better assembly of contigs, gap length estimation


Assembly: How Much DNA?

Low coverage:
  Input: a few pieces to assemble
  Output: many contigs, many gaps

High coverage:
  Input: many pieces to assemble
  Output: a few contigs, a few gaps

(Lander and Waterman, 1988)
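
The link between coverage and gaps can be made concrete with the usual idealized Lander-Waterman estimates. The sketch below assumes reads start uniformly at random along the genome; the function names and the `min_overlap_fraction` parameter are illustrative and do not come from the slides.

```python
import math

def coverage(num_reads, read_len, genome_len):
    """Average number of reads covering a base: c = N * L / G."""
    return num_reads * read_len / genome_len

def expected_uncovered_fraction(c):
    """Fraction of the genome expected to be hit by no read (Poisson model)."""
    return math.exp(-c)

def expected_contigs(num_reads, c, min_overlap_fraction=0.0):
    """Expected number of contigs, N * exp(-c * (1 - theta)).

    theta is the fraction of a read that must overlap its neighbour for the
    overlap to be detected (0 means any overlap counts).
    """
    return num_reads * math.exp(-c * (1.0 - min_overlap_fraction))
```

For example, 10,000 reads of 800 bp on a 1 Mbp genome give 8x coverage, and the estimate drops from roughly 1,350 contigs at 2x coverage to a handful at 8x.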


Assembly paradigms

Overlap-layout-consensus
  greedy (TIGR Assembler, phrap, CAP3...)
  graph-based (Celera Assembler, Arachne)

de Bruijn graph-based approaches (especially useful for short-read sequencing)


Overlap-Layout-Consensus

Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA

Overlap: find potentially overlapping reads

Layout: merge reads into contigs and contigs into supercontigs (scaffolds)

Consensus: derive the DNA sequence and correct read errors

..ACGATTACAATAGGTT..


OVERLAP GRAPH

Edge types:

Regular dovetail

Prefix dovetail

Suffix dovetail

[Figure: reads A and B drawn in each of the three overlap configurations.]

E.g.: edges are annotated with deltas of overlaps


OVERLAP GRAPH

Find the best match between the suffix of one read and the prefix of another

Due to sequencing errors, need to use dynamic programming to find the optimal overlap alignment

Apply a fast filtration method to filter out pairs of fragments that do not share a significantly long common substring
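
A minimal sketch of the dynamic-programming step: it aligns some suffix of read `a` against some prefix of read `b`, letting the alignment start anywhere in `a` for free. The scoring values (`match`, `mismatch`, `gap`) are illustrative assumptions, not parameters given in the slides.

```python
def overlap_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Best-scoring alignment of a suffix of a with a prefix of b.

    Returns (score, j), where b[:j] is the prefix of b used by the overlap.
    """
    n, m = len(a), len(b)
    # Row 0: no characters of a used yet, so a prefix of b costs only gaps.
    prev = [j * gap for j in range(m + 1)]
    for i in range(1, n + 1):
        # cur[0] = 0: the overlap may start at any position of a at no cost.
        cur = [0] * (m + 1)
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + s,  # align a[i-1] with b[j-1]
                         prev[j] + gap,    # a[i-1] against a gap
                         cur[j - 1] + gap) # b[j-1] against a gap
        prev = cur
    # The alignment must reach the end of a; choose where it stops in b.
    best_j = max(range(m + 1), key=lambda j: prev[j])
    return prev[best_j], best_j
```

For instance, `overlap_alignment("TTACGATTACA", "GATTACAGG")` finds the exact 7-base overlap `GATTACA`. The table costs O(|a|·|b|) per pair, which is why the filtration step is applied first.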


The Maximum Overlap Graph

Each edge, (u,v) is weighted with the length of the maximal overlap between a suffix of u and a prefix of v

[Figure: maximum overlap graph on four reads, a = TACGA, b = ACCC, c = CTAAAG, d = GACA, with edge weights 1, 1, 1, 2, 1.]

0-weight edges omitted!
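
The edge weights can be recomputed with a short sketch. It assumes error-free reads (exact suffix/prefix matches only) and uses the four reads of the figure; the function names are illustrative.

```python
def max_overlap(u, v):
    """Length of the longest proper suffix of u that is also a prefix of v."""
    for k in range(min(len(u), len(v)) - 1, 0, -1):
        if u.endswith(v[:k]):
            return k
    return 0

def overlap_graph(reads):
    """Map (source, target) to overlap length; 0-weight edges are omitted."""
    edges = {}
    for a, u in reads.items():
        for b, v in reads.items():
            if a != b:
                w = max_overlap(u, v)
                if w > 0:
                    edges[(a, b)] = w
    return edges

reads = {"a": "TACGA", "b": "ACCC", "c": "CTAAAG", "d": "GACA"}
graph = overlap_graph(reads)
```

Here `graph` contains five edges: a→d with weight 2, and a→b, b→c, c→d, d→b with weight 1 each.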


Paths and Layouts

The path dbc leads to the alignment:

[Figure: the maximum overlap graph on reads a, b, c, d, as on the previous slide.]

GACA
   ACCC
      CTAAAG


Superstrings

Every path that covers every node is a superstring

Zero-weight edges result in alignments like:

GACA
    GCCC
        TTAAAG

Higher weights produce more overlap, and thus shorter strings

The shortest common superstring is the highest-weight path that covers every node


Graph formulation of SCS

Input: A weighted, directed graph

Output: The highest-weight path that touches every node of the graph

Does this problem sound familiar?


The Greedy Algorithm

Algorithm greedy
    Sort edges in decreasing weight order
    For each edge in this order
        If the edge does not form a cycle, does not start at a node where an edge in the set already starts, and does not end at a node where an edge in the set already ends
            then add the edge to the current set
    End for
End Algorithm
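
A runnable sketch of the greedy heuristic follows. It takes the heaviest edges first, which is what the highest-weight-path objective calls for, and uses a small union-find structure for the cycle test; that data structure is an implementation choice made here, not something the slides specify.

```python
def greedy_path_edges(nodes, edges):
    """Pick edges for a high-weight path cover; edges are (u, v, weight)."""
    parent = {x: x for x in nodes}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    has_out, has_in, chosen = set(), set(), []
    # Stable sort: heaviest first, ties keep their input order.
    for u, v, w in sorted(edges, key=lambda e: -e[2]):
        if u in has_out or v in has_in:
            continue  # u already has a successor or v a predecessor
        if find(u) == find(v):
            continue  # adding the edge would close a cycle
        chosen.append((u, v, w))
        has_out.add(u)
        has_in.add(v)
        parent[find(u)] = find(v)
    return chosen

edges = [("a", "b", 1), ("a", "d", 2), ("b", "c", 1), ("c", "d", 1), ("d", "b", 1)]
chosen = greedy_path_edges("abcd", edges)
```

On the four-read overlap graph this selects a→d, b→c and d→b, which chain into the path a, d, b, c with total overlap 4. The heuristic is not guaranteed to find the optimal path in general.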


Greedy Example

[Figure: example graph with edge weights 7, 6, 5, 4, 3, 2, 2, 2, 1.]


Handling repeats

1. Repeat detection
   pre-assembly: find fragments that belong to repeats
      statistically (most existing assemblers)
      repeat database (RepeatMasker)
   during assembly: detect "tangles" indicative of repeats (Pevzner, Tang, Waterman 2001)
   post-assembly: find repetitive regions and potential mis-assemblies
      Reputer, RepeatMasker
      "unhappy" mate-pairs (too close, too far, mis-oriented)

2. Repeat resolution
   find DNA fragments belonging to the repeat
   determine correct tiling across the repeat


Statistical repeat detection

Significant deviations from average coverage flagged as repeats.

- frequent k-mers are ignored

- “arrival” rate of reads in contigs compared with theoretical value

(e.g., 800 bp reads & 8x coverage - reads "arrive" every 100 bp)

Problem 1: assumption of uniform distribution of fragments - leads to false positives

non-random libraries

poor clonability regions

Problem 2: repeats with low copy number are missed - leads to false negatives
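
The arrival-rate test can be sketched as follows. The expected count comes directly from the slide's example (800 bp reads at 8x coverage arrive every 100 bp); the Poisson-style cutoff `z` is a hypothetical threshold chosen for illustration, not one taken from any particular assembler.

```python
import math

def expected_reads(contig_len, read_len, coverage):
    """Reads expected to start in a contig if fragments are uniform."""
    return contig_len * coverage / read_len

def looks_like_repeat(contig_len, observed_reads, read_len, coverage, z=3.0):
    """Flag a contig whose read count is far above the expected count."""
    mean = expected_reads(contig_len, read_len, coverage)
    # Under a Poisson model the standard deviation is sqrt(mean).
    return observed_reads > mean + z * math.sqrt(mean)
```

A 1,000 bp contig is expected to hold 10 reads under these numbers, so 25 reads would be flagged while 12 would not. Both problems listed above still apply: non-uniform libraries can trigger the flag, and a two-copy repeat may stay under the cutoff.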


Consensus

A consensus sequence is derived from a profile of the assembled fragments

A sufficient number of reads is required to ensure a statistically significant consensus

Reading errors are corrected


Derive Consensus Sequence

Derive multiple alignment from pairwise read alignments (i.e., progressive alignment)

TAGATTACACAGATTACTGA TTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAAACTA
TAG TTACACAGATTATTGACTTCATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGCGTAA CTA
TAGATTACACAGATTACTGACTTGATGGGGTAA CTA

TAGATTACACAGATTACTGACTTGATGGCGTAA CTA

Derive each consensus base by weighted voting.

Another approach, based on finding a longest path in a DAG, is used by a popular assembler.
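
Column-wise voting can be sketched in a few lines. This version gives every read the same weight, a simplification of the weighted voting mentioned above (real assemblers typically weight by base quality), and treats a space as a gap symbol.

```python
from collections import Counter

def consensus(rows):
    """Majority vote per alignment column; a space stands for a gap."""
    width = max(len(r) for r in rows)
    padded = [r.ljust(width) for r in rows]
    # most_common(1) returns the most frequent symbol in the column;
    # on a tie it returns the symbol that was seen first.
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*padded))

rows = [
    "TAGATTACACAGATTACTGA TTGATGGCGTAA CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAAACTA",
    "TAG TTACACAGATTATTGACTTCATGGCGTAA CTA",
    "TAGATTACACAGATTACTGACTTGATGGCGTAA CTA",
    "TAGATTACACAGATTACTGACTTGATGGGGTAA CTA",
]
result = consensus(rows)
```

Each disagreeing column in this alignment is outvoted four to one, so `result` matches the consensus row shown on the slide.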


Definitions

Let v and w be two strings over the alphabet Σ. Concatenation of v and w is denoted by vw, and v[i] is the ith symbol in v, 1 ≤ i ≤ |v|. v[i,j] denotes a substring of v, and for any x ∈ Σ, $x^k$, k ≥ 1, is x concatenated with itself k times.

A string of length k is called a k-mer. The k-spectrum of v is the set of all k-mers that are substrings of v. [Example: v = abcd; 2-spectrum of v is {ab, bc, cd}, 3-spectrum of v is {abc, bcd}.]

A DNA strand is a string over the alphabet Σ = {a, g, c, t}. Characters of a DNA strand are called bases. The complement of a base α[i], denoted by $\overline{\alpha[i]}$, is defined by the following bijection of Σ onto Σ: {t → a, c → g, a → t, g → c}.

The reverse complement of a DNA strand α, denoted by $\overline{\alpha}$, is obtained by reversing α and complementing each base ($\overline{\alpha}[i] = \overline{\alpha[|\alpha| - i + 1]}$). Note that $\overline{\overline{\alpha[i]}} = \alpha[i]$ and $\overline{\overline{\alpha}} = \alpha$.

A DNA molecule is a pair of complementary DNA strands, $m = \{\alpha_m, \overline{\alpha_m}\}$. We denote the length of m as $|m| = |\alpha_m| = h$ and call m an h-molecule.
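
These definitions translate almost directly into code. A minimal sketch, using lowercase bases as in the definitions above; the function names are illustrative.

```python
_COMPLEMENT = str.maketrans("acgt", "tgca")

def k_spectrum(v, k):
    """Set of all k-mers that occur as substrings of v."""
    return {v[i:i + k] for i in range(len(v) - k + 1)}

def reverse_complement(strand):
    """Complement every base, then reverse the strand."""
    return strand.translate(_COMPLEMENT)[::-1]
```

For example, `k_spectrum("abcd", 2)` gives `{"ab", "bc", "cd"}`, and applying `reverse_complement` twice returns the original strand.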


The bi-directed graphs

A bidirected graph is one in which each edge is given an independent orientation (or direction, or arrow; thus, 2 kinds of arrowheads) at each end. Thus, there are three kinds of bidirected edges:

(1) those where the arrows point outward, towards the vertices, at both ends;

(2) those where both arrows point inward, away from the vertices;

(3) those in which one arrow points away from its vertex and towards the opposite end, while the other arrow points in the same direction as the first, away from the opposite end and towards its own vertex.


The bi-directed graphs

We denote a bidirected graph by an incidence matrix I: V × E → {-2, -1, 0, 1, 2}. I(x,e) = 0 if edge e is not incident to node x, +1 if e is positive-incident to x [denoted by a diamond], -1 if e is negative-incident to x, +2 if e is a self-loop on x (positive-incident) and -2 if e is a self-loop on x (negative-incident). The in-degree degn(x) and out-degree degp(x) of a vertex are defined as usual. The balance at a node x is bal(x) = degp(x) - degn(x); a graph is balanced if the balance of each vertex is 0. A walk is a sequence x_1 e_1 … x_{k-1} e_{k-1} x_k where e_i is an edge between nodes x_i and x_{i+1}, and e_{i-1} and e_i have opposite orientations at x_i.

[Figure: bidirected graph on nodes W, X, Y, Z with edges A, B, C, D.]

      A    B    C    D    E
W     1    0    0    0   -1
X    -1    1    0   -1   -1
Y     0   -1    2    0    0
Z     0    0    0   -1    0

bal(W) = 0, bal(X) = -1, bal(Y) = 1, bal(Z) = -1; the graph is not balanced. We can view a loopless directed graph as a special kind of bidirected graph if every edge is positive-incident to one of its endpoints and negative-incident to the other one; the definition of a walk then reduces to its usual meaning in directed graphs. However, in a bidirected graph it is possible for the shortest walk between two vertices to repeat a vertex. Here, observe that there is no walk between W and Z that does not repeat a vertex [not possible in a directed graph]: the walk from node W to node Z is ABCBD. Observe that AD is not a walk but BD is.
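
The walk condition can be checked mechanically. The sketch below stores the incidence matrix as a dictionary and encodes only edges A to D, the ones used by the example walks; the representation and the function name are assumptions made for illustration.

```python
# INCIDENCE[edge][vertex] is the matrix entry I(vertex, edge); zeros omitted.
INCIDENCE = {
    "A": {"W": 1, "X": -1},
    "B": {"X": 1, "Y": -1},
    "C": {"Y": 2},            # positive self-loop on Y
    "D": {"X": -1, "Z": -1},
}

def is_walk(vertices, edges, incidence=INCIDENCE):
    """Check x_1 e_1 x_2 ... e_{k-1} x_k against the bidirected walk rule."""
    if len(vertices) != len(edges) + 1:
        return False
    for i, e in enumerate(edges):
        # Each edge must join the two vertices it sits between.
        if set(incidence[e]) != {vertices[i], vertices[i + 1]}:
            return False
        if i > 0:
            x = vertices[i]
            before = incidence[edges[i - 1]][x] > 0
            after = incidence[e][x] > 0
            if before == after:
                return False  # same orientation at x: not allowed
    return True
```

With this encoding, `is_walk(["W", "X", "Y", "Y", "X", "Z"], ["A", "B", "C", "B", "D"])` holds, while the direct attempt `is_walk(["W", "X", "Z"], ["A", "D"])` fails because A and D are both negative-incident to X.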


The bi-directed de Bruijn graphs

A bi-directed de Bruijn graph $BD^k(S \cup \overline{S})$

Nodes: all possible k-mers

Edges: ((v, d_v), (u, d_u)) or (v, u, d_v, d_u)

There is an edge between v and u iff suffix(v, k-1) = prefix(u, k-1), or the same overlap holds with v or u replaced by its reverse complement, and v[1].suffix(v, k-1).u[k] is a (k+1)-substring of S or of $\overline{S}$


The bi-directed de Bruijn graphs

Canonical k-mer: of a k-mer and its reverse complement, the lexicographically smaller one (or, under the opposite convention, the larger one) is defined as the canonical k-mer, and the other one is the non-canonical k-mer.

The orientations of the arrow heads on the edges are chosen as follows.

If the canonical k-mers of nodes vi and vj overlap, then an edge (vi, vj, >, >) is introduced. If the canonical k-mer of vi overlaps with the non-canonical k-mer of vj, then an edge (vi, vj, >, <) is introduced. If the non-canonical k-mer of vi overlaps with the canonical k-mer of vj, then an edge (vi, vj, <, >) is introduced.

A walk W(vi, vj) between two nodes vi, vj ∈ V of a bi-directed graph G(V, E) is a sequence vi, ei, vi1, ei1, vi2, ..., vim, eim, vj such that for every intermediate vertex vil, 1 ≤ l ≤ m, the orientation of the arrow head on the incoming edge adjacent to vil matches the orientation of the arrow head on the outgoing edge.


The bi-directed de Bruijn graphs

Kundeti et al. BMC Bioinformatics 2010, 11:560


NGS Assembly using Bi-directed de Bruijn graph

1. Construct a bi-directed de Bruijn graph

2. de Bruijn graph simplification (Compaction)

3. Removal of errors

4. Assemble sequence through Chinese Postman Walk (CPW) on bi-directed de Bruijn Graphs.


Constructing a bi-directed de Bruijn graph

1. Generate Edges

Kundeti et al. BMC Bioinformatics 2010, 11:560


Constructing a bi-directed de Bruijn graph

2. Reduce multiplicity: sort all bi-directed edges

Sorting takes O(n), where n = Nr, N is the number of reads and r is the average read length. The sorting step is the dominant step in building the bi-directed de Bruijn graph.

3. Collect bi-directed vertices

4. Generate adjacency lists.
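
The four construction steps can be sketched sequentially as below. This is a simplification under stated assumptions: arrowhead orientations are left out, both strands are collapsed by storing each (k+1)-mer in canonical form, and Python's comparison sort stands in for the linear-time sort the slide refers to.

```python
from itertools import groupby

_COMPLEMENT = str.maketrans("acgt", "tgca")

def canonical(s):
    """Lexicographically smaller of s and its reverse complement."""
    return min(s, s.translate(_COMPLEMENT)[::-1])

def build_graph(reads, k):
    # 1. Generate edges: every (k+1)-mer of every read, in canonical form.
    edges = [canonical(r[i:i + k + 1]) for r in reads for i in range(len(r) - k)]
    # 2. Reduce multiplicity: sort, then count runs of identical edges.
    edges.sort()
    multiplicity = {e: len(list(run)) for e, run in groupby(edges)}
    # 3 and 4. Collect vertices and build adjacency lists.
    adjacency = {}
    for e, count in multiplicity.items():
        v, u = canonical(e[:k]), canonical(e[1:])
        adjacency.setdefault(v, []).append((u, count))
        adjacency.setdefault(u, []).append((v, count))
    return multiplicity, sorted(adjacency), adjacency

multiplicity, vertices, adjacency = build_graph(["aacgg", "acggt"], k=3)
```

The two toy reads share the 4-mer `acgg`, so that edge ends up with multiplicity 2 while the other two edges have multiplicity 1.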


Graph Compaction

Compact chains into single edges

Reduction to familiar list ranking

List ranking:
◦ Distributed linked list with adjacency information
◦ Find distance from each node to end of list

Extend:
◦ Multiple linked lists
◦ Identify nodes with multiple edges
◦ Undirected
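
List ranking is commonly solved by pointer jumping, which needs only a logarithmic number of rounds and parallelizes naturally. The sketch below simulates those rounds sequentially on directed lists; the undirected extension listed above is not covered, and the array representation is an assumption of this example.

```python
def list_rank(successor):
    """Distance from each node to the end of its list.

    successor[i] is the index of the next node, or None at a list end.
    Several independent lists may share the same array.
    """
    nxt = list(successor)
    rank = [0 if s is None else 1 for s in nxt]
    while any(s is not None for s in nxt):
        new_rank, new_nxt = rank[:], nxt[:]
        for i, s in enumerate(nxt):
            if s is not None:
                # Jump over the successor and absorb its distance.
                new_rank[i] = rank[i] + rank[s]
                new_nxt[i] = nxt[s]
        rank, nxt = new_rank, new_nxt
    return rank
```

For the single list 0 → 1 → 2 → 3 the ranks are `[3, 2, 1, 0]`, obtained in two rounds. Each round reads only the previous round's arrays, which is the property a parallel implementation relies on.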


Graph Compaction

Kundeti et al. BMC Bioinformatics 2010, 11:560


Graph Compaction

Edges in the graph are nodes for list ranking

Sort by edges to assign unique edge labels

Sort by nodes to bring edges incident to a node together

Identify nodes on a chain

Mark adjacent edges on chains

Perform undirected list ranking


Error Detection and Removal

Assumption: errors occur at random.

An error is unlikely to occur repeatedly at the same base.

Each base in the genome is sampled on average as many times as the coverage, which is high in NGS.

Combining these: errors can be identified by their comparatively low frequency
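
The frequency argument can be illustrated by counting k-mers and separating rare ones from frequent ones. The cutoff `min_count` is a hypothetical parameter; in practice it would be chosen relative to the coverage.

```python
from collections import Counter

def split_by_frequency(reads, k, min_count=2):
    """Return (solid, weak) k-mer sets; weak k-mers are likely errors."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    solid = {kmer for kmer, c in counts.items() if c >= min_count}
    weak = set(counts) - solid
    return solid, weak

reads = ["acgtac", "acgtac", "acgtac", "acgtaa"]
solid, weak = split_by_frequency(reads, k=4)
```

In this toy input the last read ends in `a` instead of `c`, so the 4-mer `gtaa` is seen once and lands in `weak`, while the k-mers supported by several reads are kept.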


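Error Detection and Removal

Tips: misreading of one or more bases towards the end of a short read

[Figure: read TCGTTGCGTGCGTGAGCGT with a window of length k marked at its end; the resulting structure in the graph is labeled "Tip".]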


Error Detection and Removal

Bubbles: misreading of one or more bases in the middle of a short read.

[Figure: read TCGTTGCGTGCGTGAGCGT with two windows of length k marked; the resulting structure in the graph is labeled "Bubble".]


Error Detection and Removal

Spurious links: when an erroneous (k+1)-molecule happens to be identical to a legitimate (k+1)-molecule found elsewhere in the genome.

[Figure: graph with the extra edge labeled "Spurious link".]


Euler Circuits

A path is a connected sequence of edges showing a route on the graph that starts at a vertex and ends at a vertex.

A path that starts and ends at the same vertex is called a circuit.

Circuits that cover every edge once and only once are called Euler circuits.


Euler’s Theorem

1. If a graph G is connected and all its nodes have even degree, then G has an Euler circuit.

2. If G has an Euler circuit, then G must be connected and all its node degrees must be even.
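
The theorem is constructive. One standard way to build the circuit is Hierholzer's algorithm, sketched here for undirected multigraphs; the slides do not name a specific algorithm, so this choice is an assumption.

```python
from collections import defaultdict

def euler_circuit(edges):
    """Return a list of vertices forming an Euler circuit, or None.

    edges is a list of undirected (u, v) pairs; parallel edges are allowed.
    """
    adj = defaultdict(list)
    for idx, (u, v) in enumerate(edges):
        adj[u].append((v, idx))
        adj[v].append((u, idx))
    if any(len(nbrs) % 2 for nbrs in adj.values()):
        return None  # some node has odd degree
    if not edges:
        return []
    used = [False] * len(edges)
    stack, circuit = [edges[0][0]], []
    while stack:
        v = stack[-1]
        # Drop edges that were already walked from the other endpoint.
        while adj[v] and used[adj[v][-1][1]]:
            adj[v].pop()
        if adj[v]:
            w, idx = adj[v].pop()
            used[idx] = True
            stack.append(w)
        else:
            circuit.append(stack.pop())
    if len(circuit) != len(edges) + 1:
        return None  # edges left over: the graph is not connected
    return circuit[::-1]
```

Both conditions of the theorem appear as explicit checks: odd degrees are rejected up front, and a disconnected graph is detected when some edges are never reached.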


Chinese Postman Problem

Suppose there is a mailman who needs to deliver mail to a certain neighborhood. That mailman is lazy, so he wants to find the shortest route through the neighborhood, that meets the following criteria:

It is a closed circuit (it ends at the same point it starts).

He needs to go through every street at least once.

If the graph traveled has an Eulerian Circuit, this circuit is the ideal solution.


Chinese Postman Problem Solution (Edmonds and Johnson)

1. Find the odd-degree nodes in G.

2. Calculate the shortest paths between all pairs of odd nodes.

3. Construct a complete graph F on the odd nodes. The weight of each edge is the length of the shortest path between its endpoints.

4. Find the minimum-weight perfect matching (MWPM) on F.

5. For every edge (u,v) in the MWPM, duplicate the shortest path between u and v in the original graph G to construct a new multigraph G’.

6. Find the Eulerian circuit in G’.
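
Steps 1 to 4 can be sketched for small undirected graphs as follows. Two substitutions are made for brevity and are assumptions of this sketch: Floyd-Warshall computes the shortest paths, and the matching is found by brute-force recursion rather than a polynomial-time matching algorithm, so it only suits graphs with a handful of odd nodes. The graph is assumed to be connected.

```python
def chinese_postman_cost(nodes, edges):
    """Return (tour cost, matched odd-node pairs); edges are (u, v, weight)."""
    INF = float("inf")
    dist = {u: {v: (0 if u == v else INF) for v in nodes} for u in nodes}
    degree = {u: 0 for u in nodes}
    for u, v, w in edges:
        degree[u] += 1
        degree[v] += 1
        if w < dist[u][v]:
            dist[u][v] = dist[v][u] = w
    # Step 2: all-pairs shortest paths (Floyd-Warshall).
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    # Step 1: odd-degree nodes (there is always an even number of them).
    odd = [u for u in nodes if degree[u] % 2 == 1]

    # Steps 3 and 4: cheapest way to pair up the odd nodes.
    def best_matching(rest):
        if not rest:
            return 0, []
        first, others = rest[0], rest[1:]
        best = (INF, [])
        for i, partner in enumerate(others):
            cost, pairs = best_matching(others[:i] + others[i + 1:])
            cost += dist[first][partner]
            if cost < best[0]:
                best = (cost, [(first, partner)] + pairs)
        return best

    extra, pairs = best_matching(odd)
    # Every edge is walked once, plus the duplicated shortest paths.
    return sum(w for _, _, w in edges) + extra, pairs
```

On a unit-weight square 1-2-3-4 with the diagonal 1-3, nodes 1 and 3 are odd, so the diagonal is walked twice and the tour costs 6. Steps 5 and 6 would then duplicate the matched paths and extract an Euler circuit from the resulting multigraph.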


Strategy for Solving Chinese Postman Problem

1. Eulerize the graph.

2. Find an Euler circuit on the new graph.

3. “Squeeze” this Euler circuit from the Eulerized graph onto the original graph by reusing an edge of the original graph each time the circuit on the Eulerized graph uses an added edge.


Chinese Postman Walk (CPW) on bi-directed DBG

A Chinese Postman walk in a bi-directed graph is a bi-directed walk which visits every edge at least once.

A cyclic Chinese Postman walk of minimum cost on a weighted bi-directed graph is denoted as CPW


Chinese Postman Walk (CPW) on bi-directed DBG

Lemma 1. A connected bi-directed graph is Eulerian if and only if every vertex is balanced.

Lemma 2. A non-Eulerian bi-directed graph G = (V,E) has a cyclic Chinese Postman walk if and only if there is a corresponding multi-bi-directed graph Gm = (V,Em) which is Eulerian.

Lemma 3. Finding a cyclic CP walk on a bi-directed graph G(V,E) is equivalent to finding a minimum-weight Eulerian multi-bi-directed graph Gm(V,Em) corresponding to G.

Lemma 4. If a bi-directed graph G(V,E) has a cyclic CP walk, then the cost of that walk is equal to the weight of the corresponding Eulerian multi-bi-directed graph Gm(V,Em).


Chinese Postman Walk (CPW) on bi-directed DBG

Lemma 5. A non-Eulerian bi-directed graph G(V,E) has a cyclic CP walk if and only if the balancing bi-partite graph B(P,Q,Eb) has a perfect matching.

Lemma 6. If G(V,E) is a non-Eulerian bi-directed graph that has a cyclic CP walk, then every corresponding Eulerian multi-bi-directed graph Gm(V,Em) belongs to the family F.


Chinese Postman Walk (CPW) on bi-directed DBG