Lecture 5. Sequence Assembly The Chinese University of Hong Kong CSCI3220 Algorithms for Bioinformatics


Lecture outline
1. The sequence assembly problem
   – Several general approaches
2. Related graph problems
   – Hamiltonian path
   – Eulerian path
3. Sequence assembly by using de Bruijn graphs

Last update: 10-Jul-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 2


THE SEQUENCE ASSEMBLY PROBLEM

Part 1

Reference sequence
• In the last lecture, we studied the problem of short read alignment
• Assumptions behind the short read alignment problem:
  – There is a reference genome
  – The reference is similar to, but not exactly the same as, the DNA sequence from which the short reads were generated
• Sometimes, no good references are available

Lack of a reference sequence
• Some situations in which a good reference sequence is not available:
  – Sequencing a new type of bacteria
  – Sequencing a genomic region that was previously poorly annotated
• In these situations, we need to reconstruct the sequence by assembling the short reads.
  – This process is called sequence assembly, or “de novo assembly”.

Sequence assembly
• General idea:
  – Find sequencing reads with substantial suffix-prefix overlaps
    • Example: assembling ACCGAGT and CCGAGTC into ACCGAGTC
  – Similar to solving a jigsaw puzzle
  – Reasons that it is possible:
    • We sequence multiple copies of the DNA
    • The different copies are independently, randomly fragmented
      – …ACCGAGTC… → … ACCGAGT C…
      – …ACCGAGTC… → …A CCGAGTC …
  – The larger the overlap, the more likely it is that the reads really come from nearby positions and should be assembled

Image credit: slideteam.net
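The suffix-prefix merge in the example above can be sketched in a few lines of Python. This is only an illustration: the function names `overlap_length` and `merge` are made up for this sketch and are not part of the lecture.

```python
def overlap_length(x: str, y: str) -> int:
    """Length of the longest suffix of x that is also a prefix of y."""
    for k in range(min(len(x), len(y)), 0, -1):
        if x[-k:] == y[:k]:
            return k
    return 0


def merge(x: str, y: str) -> str:
    """Append y to x, writing the overlapping characters only once."""
    return x + y[overlap_length(x, y):]


print(overlap_length("ACCGAGT", "CCGAGTC"))  # 6
print(merge("ACCGAGT", "CCGAGTC"))           # ACCGAGTC
```

The two reads share the six characters CCGAGT, so only the final C of the second read is appended.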

Shortest superstring formulation
• Example revisited: Suppose we have obtained the following short reads from multiple copies of an unknown sequence s:
  – ACA, ATA, ATA, ATT, TAG, TAT, TTC
  – How can we get back sequence s from these short reads?
• One possible formulation: Shortest superstring
  – Find the shortest string s’ such that every observed read is a substring of s’
    • There is no guarantee that s’ must be equal to the actual sequence s
    • Very difficult problem (NP-hard): no polynomial-time algorithm is known
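For a handful of reads, the shortest superstring can still be found by brute force. The sketch below is not a method from the lecture: it tries every ordering of the distinct reads and merges neighbours with their maximum overlap, which takes factorial time and is only feasible for toy inputs. The function names are made up for illustration.

```python
from itertools import permutations


def overlap(x, y):
    """Length of the longest suffix of x that equals a prefix of y."""
    for k in range(min(len(x), len(y)), 0, -1):
        if x.endswith(y[:k]):
            return k
    return 0


def shortest_superstring(reads):
    """Exhaustive search over all read orderings (toy inputs only)."""
    # Drop duplicates and reads that are substrings of other reads
    unique = sorted(set(reads))
    unique = [r for r in unique if not any(r != s and r in s for s in unique)]
    if not unique:
        return ""
    best = None
    for order in permutations(unique):
        s = order[0]
        for prev, nxt in zip(order, order[1:]):
            s += nxt[overlap(prev, nxt):]
        if best is None or len(s) < len(best):
            best = s
    return best


reads = ["ACA", "ATA", "ATA", "ATT", "TAG", "TAT", "TTC"]
s_prime = shortest_superstring(reads)
print(s_prime, len(s_prime))
```

Every input read is a substring of the returned string, but nothing forces that string to equal the actual sequence s.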

Maximum overlap
• One heuristic method for solving shortest superstring: Greedily merge the two strings with the maximum overlap
  – For example, ATA may be followed by TAG, since the last two characters of ATA are the same as the first two characters of TAG

Maximum overlap
• Detailed steps:
  1. First remove any input short read that is a substring of another. For example,
     • If both ACAT and ACA are in the input, remove ACA.
     • If there are multiple copies of ACA in the input, remove all but one of them.
  2. For the remaining list of short reads, find the ordered pair (x, y) with the maximum overlap between x’s suffix and y’s prefix. If y contains l characters and the overlap involves k characters, remove y and replace x with the merged sequence xy[k+1..l].
     • Tie-breaking rule for our discussion: If multiple pairs have the maximum overlap, merge the first pair according to the lexicographic order of the read IDs.
     • (1, 2) < (1, 3) < (2, 1) < (2, 3) < (3, 1) < (3, 2)
  3. Repeat steps 1 and 2 until there is only one sequence left.

Maximum overlap example
• Input short reads: ACA, ATA, ATA, ATT, TAG, TAT, TTA
  → (Removing reads that are substrings of others) ACA, ATA, ATT, TAG, TAT, TTA
  → ACA, ATAG, ATT, TAT, TTA
  → ACA, ATAG, ATTA, TAT
  → ACA, ATAG, ATTAT
  → ACA, ATTATAG
  → ACATTATAG
• Results:
  1. Reconstructed string different from actual one (TATACATTAG)
  2. Also one character shorter than the actual one (9 vs. 10)
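The maximum overlap steps can be sketched as follows. This is a minimal, unoptimized illustration under two assumptions: read IDs are taken to be positions in the current list, and the merged sequence keeps the position of x. The function names are made up for this sketch.

```python
def overlap(x, y):
    """Length of the longest suffix of x that equals a prefix of y."""
    for k in range(min(len(x), len(y)), 0, -1):
        if x.endswith(y[:k]):
            return k
    return 0


def remove_contained(reads):
    """Step 1: drop duplicates and reads that are substrings of other reads."""
    kept = []
    for i, r in enumerate(reads):
        duplicate = r in reads[:i]
        contained = any(r != s and r in s for s in reads)
        if not duplicate and not contained:
            kept.append(r)
    return kept


def greedy_assemble(reads):
    reads = remove_contained(list(reads))
    while len(reads) > 1:
        # Step 2: scan ordered pairs (i, j) in lexicographic order; the strict
        # ">" keeps the first pair among those with the maximum overlap.
        best_k, best_i, best_j = -1, None, None
        for i, x in enumerate(reads):
            for j, y in enumerate(reads):
                if i != j:
                    k = overlap(x, y)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        reads[best_i] = reads[best_i] + reads[best_j][best_k:]
        del reads[best_j]
        # Step 3: repeat, starting again from step 1
        reads = remove_contained(reads)
    return reads[0] if reads else ""


print(greedy_assemble(["ACA", "ATA", "ATA", "ATT", "TAG", "TAT", "TTA"]))
# ACATTATAG
```

On the example input this follows the same sequence of merges as the trace above and ends with ACATTATAG.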

How good is maximum overlap?
• How long is the reconstructed sequence?
  – It has been proved that if the shortest superstring has length m, then the reconstructed string has length ≤ 4m. (Note that m is different from n, the length of the actual sequence s.)
    • Is there an example with a reconstructed string with length ≈ 2m?
  – It is possible to have m < n, i.e., the shortest superstring is not the actual sequence s
    • Is there an example with a reconstructed string with length << n?

Quick quiz
• Suppose we want to assemble a set of n reads, each with a length of m. What is the time complexity of the maximum overlap algorithm?

Practical issues
• There are many practical issues in sequence assembly:
  1. Non-uniqueness due to short read length
  2. Non-uniqueness due to heterogeneity
  3. Incomplete coverage
  4. Sequencing errors
  5. Ambiguity due to repeats
  ...

Practical difficulties and solutions
• Short reads: ACA, ATA, ATA, ATT, TAG, TAT, TTC
• Problems:
  1. Non-uniqueness due to short read length: TAT may also be followed by ATT (instead of ATA as in TATACATTAG)
• Solution:
  – Use longer reads and/or paired reads, and assemble reads only when they have substantial overlaps
    • Limited by current technology: <250nt per NGS read

Paired end sequencing
• Sequence both ends of a fragment
• The two reads are called a mate pair
• Insert sizes form a distribution due to random fragmentation and manual size selection
  – One read is likely within a certain distance range from the other
  – If the location of one read is ambiguous, may use the location information of the other to help
• More difficult in practice due to imprecise insert size
• The idea of paired end sequencing is also useful in short read alignment

(Figure: a fragment is sequenced from both ends to give Read1 and Read2; the read length, insert size and fragment length are marked. Labels in the figure: TATATAATT; ...CAT...ATT...GGG. Suppose s=TATACATTAGGG here.)

Practical difficulties and solutions
• Short reads: ACA, ATA, ATA, ATT, TAG, TAT, TTC
• Problems:
  2. Non-uniqueness due to heterogeneity: It is possible that the DNA sample contains two versions of the sequence, one with TATACATTAG and one with TATACATTCG
• Solution:
  – Use an assembly method that can consider multiple possible ways of assembling the reads

Practical difficulties and solutions
• Short reads: ACA, ATA, ATA, ATT, TAG, TAT, TTC
• Problems:
  3. Incomplete coverage: We do not know what is after ACA, since no reads start with CA
• Solution:
  – Produce more reads
    • The ratio between the total length of useful reads and the length of s is called the average read depth. For sequence alignment, it is now common to perform 30x-60x coverage. For sequence assembly, usually we need >100x.
      – Sometimes we do not know the actual length of s and need to estimate it
    • Costs more starting materials, reagents, money and computation
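The average read depth is a simple ratio, so a short sketch is enough to show how it is computed and how many reads a target depth implies. The 5,000,000 nt genome and 150 nt read length below are hypothetical numbers chosen for illustration, not values from the lecture.

```python
import math


def average_read_depth(read_lengths, genome_length):
    """Total length of the useful reads divided by the length of s."""
    return sum(read_lengths) / genome_length


def reads_needed(target_depth, read_length, genome_length):
    """Number of fixed-length reads needed to reach a target depth."""
    return math.ceil(target_depth * genome_length / read_length)


# Seven reads of length 3 over a sequence of length 10 (the running example)
print(average_read_depth([3] * 7, 10))       # 2.1
# Hypothetical: 100x depth, 150 nt reads, 5,000,000 nt genome
print(reads_needed(100, 150, 5_000_000))     # 3333334
```

When the length of s is unknown, `genome_length` has to be replaced by an estimate, so the computed depth is only as good as that estimate.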

Practical difficulties and solutions
• Short reads: ACA, ATA, ATA, ATT, TAG, TAT, TTC
• Problems:
  4. Sequencing errors: ATT seems to be followed by TTC, but s does not actually contain TTC as a substring (recall that s=TATACATTAG). The read TTC is due to a sequencing error (TTC should actually be TTA)
• Solutions:
  – Use longer reads/paired reads so that if a read contains an error, it has a low probability of being assembled
  – Use some statistics to estimate error probability

Practical difficulties and solutions
• Short reads: ACA, ATA, ATA, ATT, TAG, TAT, TTC
• Problems:
  5. Ambiguity due to repeats: Should ATA be put before or after TAT?
     • This problem is due to the occurrence of two TA’s in s=TATACATTAG
• Solution:
  – Use longer reads/paired reads
    • Still cannot handle the following cases:
      – Each unit of a repeat is very long
      – There are many copies of a repeating unit

Contigs and scaffolds
• Key terms:
  – Contig: A partially assembled sequence from some reads
  – Scaffold: An arrangement of the contigs with specified order and orientations

Image source: https://www.researchgate.net/profile/Sicheng_Xu3/post/what_is_the_principle_of_transcriptome_assembly_after_RNA-seq/attachment/5e7750c03843b0047b36b411/AS%3A871854671659009%401584877760694/download/GenomeAssembly.png

Quick quiz
• What information can we use to construct scaffolds from the contigs?

Quality of an assembly
• Usually the final output does not contain a single sequence, but just some contigs/scaffolds
• Some descriptive statistics of assembly outputs:
  – Length of longest contig
  – Average length of contigs
  – Total length of contigs
  – N50: Length of the contig such that it and longer contigs amount to 50% or more of the total length of all contigs
    • If the lengths are (in an arbitrary unit) 10, 8, 6, 5, 3, 3, 2, 1, 1, 1, then the N50 value would be 6: half of the total is (10+8+6+5+3+3+2+1+1+1)/2 = 40/2 = 20, and (10+8+6) = 24 is larger than 20 while (10+8) = 18 is smaller than 20
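The N50 definition translates directly into code. The sketch below follows the wording above (sort contigs from longest to shortest and stop once the running total reaches half of the overall total); the function name `n50` is just a label for this illustration.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total, taken from the
    longest contig downwards, first reaches 50% of the total length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


print(n50([10, 8, 6, 5, 3, 3, 2, 1, 1, 1]))  # 6
```

Comparing `2 * running` with `total` avoids dividing an odd total by two.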

General approaches
• Some other proposed approaches to sequence assembly:
  – Greedy extension: Extend the current contig until there is no possible extension or there are multiple possible extensions
    • Short reads: ACA, ATA, ATA, ATT, TAG, TAT, TTC
    • ATT → ATTC
  – Overlap/layout/consensus: Perform all-against-all read alignments to find overlaps, deduce approximate layout, and refine layout by multiple sequence alignment
    • The all-against-all part requires tremendous computational power
  – de Bruijn graph (More details later)
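Greedy extension can be sketched as below. This is a simplified illustration, not an actual assembler: it extends only to the right, requires a fixed overlap of `min_overlap` characters (a parameter made up for this sketch), and treats identical copies of a read as a single candidate extension.

```python
def extend_right(seed, reads, min_overlap=2):
    """Extend seed to the right while exactly one distinct read continues it."""
    contig = seed
    unused = list(reads)
    if seed in unused:
        unused.remove(seed)
    while True:
        suffix = contig[-min_overlap:]
        candidates = {r for r in unused if r.startswith(suffix)}
        if len(candidates) != 1:
            # Stop: no extension, or an ambiguous choice between several
            return contig
        nxt = candidates.pop()
        contig += nxt[min_overlap:]
        unused.remove(nxt)


reads = ["ACA", "ATA", "ATA", "ATT", "TAG", "TAT", "TTC"]
print(extend_right("ATT", reads))  # ATTC
print(extend_right("TAT", reads))  # TAT (both ATA and ATT could follow)
```

Starting from ATT, the only read beginning with TT is TTC, and nothing begins with TC, so the contig stops at ATTC.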

RELATED GRAPH PROBLEMS

Part 2

Graph formulations
• We can use a graph to provide an abstraction of the short read assembly scenario
• Formulation 1:
  – Each node is a short read (assuming no two reads are identical)
  – There is an edge from node X to node Y if a suffix of X substantially overlaps with a prefix of Y
    • E.g., overlapping at least |X|/2 characters
  – Goal: Find a path that visits every node exactly once
    • Because you want every short read to appear exactly once in the resulting sequence
    • This is a Hamiltonian path problem

Hamiltonian path formulation
• Short reads (new example): ACA, ATA, ATT, CAT, TAC, TAG, TAT, TTA
• Suppose a node is connected to another one if the length-2 suffix of the former equals the length-2 prefix of the latter:

(Figure: directed overlap graph with nodes TTA, TAT, ATA, TAC, ACA, CAT, ATT, TAG)

• Unfortunately, even the decision version (whether such a path exists) is very difficult (NP-complete)
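The overlap graph for this example is small enough to search exhaustively. The sketch below builds the graph from the length-2 suffix/prefix rule and looks for a Hamiltonian path by backtracking. It is an illustration only: the search takes exponential time in the worst case, which is consistent with the problem being NP-complete. The function names are made up for this sketch.

```python
def overlap_graph(reads, k=2):
    """Edge x -> y when the length-k suffix of x equals the length-k prefix of y."""
    return {x: [y for y in reads if x != y and x[-k:] == y[:k]] for x in reads}


def hamiltonian_path(graph):
    """Backtracking search for a path that visits every node exactly once."""
    nodes = list(graph)

    def extend(path, visited):
        if len(path) == len(nodes):
            return path
        for nxt in graph[path[-1]]:
            if nxt not in visited:
                found = extend(path + [nxt], visited | {nxt})
                if found:
                    return found
        return None

    for start in nodes:
        found = extend([start], {start})
        if found:
            return found
    return None


def spell(path, k=2):
    """Concatenate the reads on a path, writing each overlap only once."""
    return path[0] + "".join(read[k:] for read in path[1:])


reads = ["ACA", "ATA", "ATT", "CAT", "TAC", "TAG", "TAT", "TTA"]
path = hamiltonian_path(overlap_graph(reads))
print(path, spell(path))
```

Several Hamiltonian paths exist in this graph, so the sequence spelled by the first one found need not be the actual sequence.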

Graph formulations
• Formulation 2:
  – Each edge is a short read
  – The two nodes connected by an edge are derived from the prefix and suffix of the corresponding read
    • Different reads can share nodes
  – Goal: Find a path that visits every edge exactly once
    • The Eulerian path problem

Eulerian path formulation
• Short reads: ACA, ATA, ATT, CAT, TAC, TAG, TAT, TTA
• Suppose each read is decomposed into two nodes, one containing the length-2 prefix of it and the other the length-2 suffix of it. There is an edge pointing from the former to the latter.

(Figure: directed graph with nodes AC, AG, AT, CA, TA, TT and edges labelled ACA, ATA, ATT, CAT, TAC, TAG, TAT, TTA)

• Interestingly, it is much easier to find Eulerian paths

The Eulerian path problem
• Existence of an Eulerian path:
  – The in-degree of a node is the number of edges going into it
  – The out-degree of a node is the number of edges going out of it
  – If a connected directed graph has an Eulerian path, the following are true (why?):
    • At most one node has (out-degree – in-degree) = 1
    • At most one node has (out-degree – in-degree) = -1
    • All other nodes have (out-degree – in-degree) = 0
  – Surprisingly, if a connected graph satisfies these conditions, it must have an Eulerian path, i.e., the three form a set of both necessary and sufficient conditions

(Figure: the Eulerian path graph with nodes AC, AG, AT, CA, TA, TT; the edges are the reads ACA, ATA, ATT, CAT, TAC, TAG, TAT, TTA)
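The degree conditions are cheap to check. The sketch below builds one edge per read (prefix node to suffix node) and tests the three conditions. It assumes the graph is connected and does not verify that. The function names are made up for this sketch.

```python
from collections import defaultdict


def read_edges(reads):
    """One edge per read: all-but-last characters -> all-but-first characters."""
    return [(r[:-1], r[1:]) for r in reads]


def check_degree_conditions(edges):
    """Return (ok, start_nodes, end_nodes) for the Eulerian path conditions.
    Connectivity is assumed, not checked."""
    balance = defaultdict(int)  # out-degree minus in-degree
    for u, v in edges:
        balance[u] += 1
        balance[v] -= 1
    starts = [n for n, b in balance.items() if b == 1]
    ends = [n for n, b in balance.items() if b == -1]
    ok = (all(b in (-1, 0, 1) for b in balance.values())
          and len(starts) <= 1 and len(ends) <= 1)
    return ok, starts, ends


reads = ["ACA", "ATA", "ATT", "CAT", "TAC", "TAG", "TAT", "TTA"]
print(check_degree_conditions(read_edges(reads)))  # (True, ['TA'], ['AG'])
```

For the example reads, TA has one extra outgoing edge and AG one extra incoming edge, so an Eulerian path must start at TA and end at AG.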

The Eulerian path problem
• Finding an Eulerian path (Hierholzer’s algorithm):
  – Start with the node with an extra out-degree (or, if no such node exists, any node)
  – Follow any unused edges to visit other nodes, until getting stuck
    • Either back at the starting node, which has no more unused edges, or at the node with an extra in-degree (why are there no other possibilities?)
  – If any visited node has unused edges, repeat the above with this node as the starting node. The new path must end at the same node. Join this new path with the old one.

(Figure: the Eulerian path graph with nodes AC, AG, AT, CA, TA, TT; the edges are the reads ACA, ATA, ATT, CAT, TAC, TAG, TAT, TTA)

Example

(Figure: step-by-step example on the graph with nodes AC, AG, AT, CA, TA, TT. The edges of the first path are labelled 1-4, the edges of the second path i-iv, and the edges of the joined path 1-8.)

Final answer: TATTACATAG

Using stacks to find the paths

(Figure: the graph with nodes AC, AG, AT, CA, TA, TT; the edges of the first path are numbered 1-4.)

Current path stack: TA, AT, TT, TA, AG
Completed path stack: (empty)

Using stacks to find the paths

(Figure: the graph with nodes AC, AG, AT, CA, TA, TT; the edges of the first path are numbered 1-4.)

Current path stack: TA, AT, TT, TA
Completed path stack: AG

Using stacks to find the paths

(Figure: the graph with nodes AC, AG, AT, CA, TA, TT; the edges of the first path are numbered 1-4 and the edges of the second path i-iv.)

Current path stack: TA, AT, TT, TA, AC, CA, AT, TA
Completed path stack: AG

Using stacks to find the paths

(Figure: the graph with nodes AC, AG, AT, CA, TA, TT; the edges of the first path are numbered 1-4 and the edges of the second path i-iv.)

Current path stack: TA, AT, TT, TA, AC, CA, AT, TA
Completed path stack: AG

Final answer: TATTACATAG
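The two-stack procedure can be sketched as follows. This is one common way to implement Hierholzer's algorithm and is not necessarily how the slides animate it. It assumes that an Eulerian path exists and that the graph is connected. The function names are made up for this sketch.

```python
from collections import defaultdict


def eulerian_path(edges):
    """Hierholzer's algorithm using a current-path stack and a completed-path stack."""
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for u, v in edges:
        out_edges[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the node with an extra out-degree, or at any node if there is none
    start = next((n for n, b in balance.items() if b == 1), edges[0][0])
    current, completed = [start], []
    while current:
        node = current[-1]
        if out_edges[node]:
            # Follow an unused edge and push the node it leads to
            current.append(out_edges[node].pop())
        else:
            # Stuck: this node is finished, move it to the completed stack
            completed.append(current.pop())
    return completed[::-1]


def spell(nodes):
    """Spell the sequence of a node path whose neighbours overlap by all but one character."""
    return nodes[0] + "".join(n[-1] for n in nodes[1:])


reads = ["ACA", "ATA", "ATT", "CAT", "TAC", "TAG", "TAT", "TTA"]
edges = [(r[:-1], r[1:]) for r in reads]
path = eulerian_path(edges)
print(path)
print(spell(path))
```

Each edge is pushed and popped once, which is where the linear running time comes from. Which Eulerian path is returned depends on the order in which unused edges are followed; with the read order used here the result is TATTACATAG.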

Quick quiz
• What is the graph problem when each node corresponds to a read?
• What is the graph problem when each edge corresponds to a read?
• Which of these two problems is easier in terms of computational complexity?

Comparing the two formulations
• Hierholzer’s algorithm runs in linear time
  – Does it mean we have proved P=NP?
  – A general Hamiltonian path problem is NP-hard, but due to the equivalence to the Eulerian path formulation, Hamiltonian path problems *for short reads* can be efficiently solved

Hamiltonian path formulation:
(Figure: overlap graph with nodes TTA, TAT, ATA, TAC, ACA, CAT, ATT, TAG)

Eulerian path formulation:
(Figure: graph with nodes AC, AG, AT, CA, TA, TT and edges ACA, ATA, ATT, CAT, TAC, TAG, TAT, TTA)

Actual DNA sequence (which should be unknown): TATACATTAG

Why the big difference?
• Why is the Eulerian path problem much easier than the general Hamiltonian path problem?
  – Strict necessary and sufficient conditions for an Eulerian path to exist
  – The conditions are simple: They can be checked very efficiently
  – For the Eulerian path problem, the solution of a sub-problem (involving only a subset of the edges) can always contribute towards the solution of the original problem
    • Another example of reusing results

Back to sequence assembly
• While the Eulerian path formulation is elegant, we need to deal with many issues when applying it to perform sequence assembly
  – Non-uniqueness
  – Errors in data
  – Repeats
  – Heterogeneity
  – ...
• We now study some practical methods for solving these issues

DE BRUIJN GRAPHS

Part 3

Problems of the graph formulations
• So far we have assumed that each node (in the Hamiltonian path formulation) or each edge (in the Eulerian path formulation) is a short read.
• This is problematic when “short” reads are not really that short (100-250nt), because it is hard to determine which reads should be connected in the graph:
  – If the required overlap is too large, a single error/mutation would make two consecutive reads not connected
    • TATACATTA, ATAGATTAG
  – If the required overlap is too small, there will be too many connections and it is easy to have non-unique solutions
  – There can be a tremendous number of nodes/edges, so the graph may not fit into memory

de Bruijn graph
• In the original definition proposed by Dutch mathematician Nicolaas de Bruijn in 1946:
  – A de Bruijn graph is a graph that contains every k-mer of a certain alphabet as a node, and there is a directed edge from a node to another one if the length-(k-1) suffix of the former is the same as the length-(k-1) prefix of the latter.
• Example: k=3, alphabet={0, 1}

Image source: Wikipedia

de Bruijn graph for sequence assembly
• To use de Bruijn graph for sequence assembly:
  – We consider only k-mers that are substrings of some observed reads
    • In practice, only a very small fraction of the 4^k possible k-mers appear in the reads
  – Two nodes are connected only if there are reads that contain both at consecutive positions
    • So, there will not be an edge from ATA to TAG if ATAG is not observed in any read
  – The number of reads that support each edge is also recorded

de Bruijn graph for sequence assembly
• Example:
  – Sequencing reads: a:ACGGC, b:CGGCG, c:CGTGA, d:GACGT, e:GCGTG, f:GGCGT, g:GTGAC and h:TGACG
  – To construct a de Bruijn graph when each node corresponds to a 3-mer:

First 3-mer in a read | Next 3-mer in a read | Supporting reads
ACG                   | CGG                  | 1 (a)
ACG                   | CGT                  | 1 (d)
CGG                   | GGC                  | 2 (a, b)
CGT                   | GTG                  | 2 (c, e)
GAC                   | ACG                  | 2 (d, h)
GCG                   | CGT                  | 2 (e, f)
GGC                   | GCG                  | 2 (b, f)
GTG                   | TGA                  | 2 (c, g)
TGA                   | GAC                  | 2 (g, h)

(Figure: the resulting de Bruijn graph with nodes ACG, CGG, GCG, GGC, CGT, GTG, TGA, GAC; each edge is labelled with its number of supporting reads.)
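The edge table above can be reproduced with a short sketch. This is an illustration under the assumption that an edge is counted at most once per read; the function name `de_bruijn_edges` is made up for this sketch.

```python
from collections import Counter


def de_bruijn_edges(reads, k=3):
    """Map each (k-mer, next k-mer) edge to the number of reads supporting it."""
    support = Counter()
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        # A set makes each read support a given edge at most once
        for edge in set(zip(kmers, kmers[1:])):
            support[edge] += 1
    return support


reads = ["ACGGC", "CGGCG", "CGTGA", "GACGT", "GCGTG", "GGCGT", "GTGAC", "TGACG"]
for (first, nxt), count in sorted(de_bruijn_edges(reads).items()):
    print(first, nxt, count)
```

The printed rows match the nine edges in the table: the two edges leaving ACG have one supporting read each, and all other edges have two.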

Patterns of various issues
• The various issues we discussed before will form special patterns in a de Bruijn graph:
  – Non-uniqueness: frayed rope
    • s=TACCGGACCGC
    • Observed reads (not necessarily covering all k-mers of s): TACC, ACCG, CCGG, GACC, CCGC
    (Figure: TAC → ACC and GAC → ACC; ACC → CCG; CCG → CGG and CCG → CGC)
  – Incomplete coverage: possibly disconnected graph
    • s=TATACATTAG
    • Observed reads: TATA, ATAC, ACAT, CATT, ATTA
    (Figure: TAT → ATA → TAC and, separately, ACA → CAT → ATT → TTA)

Image credit: cuttingedgedjs.com

Patterns of various issues
  – Tandem repeats: cycle
    • s=TACCGACCGC
    • Observed reads: ACCG, CCGA, CGAC, GACC, CCGC
    (Figure: cycle ACC → CCG → CGA → GAC → ACC, with CCG → CGC leaving the cycle)
  – Errors at read ends: spur
    • s=TATACATTAG
    • Observed reads: TATA, TATT, ATAC, TACA, ACAT
    (Figure: TAT → ATA → TAC → ACA → CAT, with the spur TAT → ATT)

Image credit: Wikimedia

Patterns of various issues
  – Heterogeneity/errors at read centers: bubble
    • s=TATACATTAG; s’=TATAGATTAG
    • Observed reads: TATA, ATAC, TACA, ACAT, CATT, ATAG, TAGA, AGAT, GATT
    (Figure: TAT → ATA, then two branches ATA → TAC → ACA → CAT → ATT and ATA → TAG → AGA → GAT → ATT)

Resolving problems
• Some methods for resolving the problems (high-level ideas only):
  – Handling potential errors:
    • Pre-filtering k-mers supported by few reads
      – Bimodal distribution of k-mer frequencies: one peak corresponds to legitimate k-mers, the other (much lower) peak is due to errors
      – May also use base quality scores in filtering
    • Remove paths supported by few reads
    • Combine near-identical paths
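Pre-filtering by k-mer frequency can be sketched as below. The reads and the threshold `min_count=2` are hypothetical values chosen for illustration; in practice the cutoff would be picked from the observed k-mer frequency distribution.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurrence across all reads."""
    return Counter(read[i:i + k]
                   for read in reads
                   for i in range(len(read) - k + 1))


def solid_kmers(reads, k, min_count=2):
    """Keep only the k-mers seen at least min_count times."""
    return {kmer for kmer, n in kmer_counts(reads, k).items() if n >= min_count}


# Hypothetical reads: three error-free reads seen twice each, plus one read
# (TAGAC) carrying an error in the middle
reads = ["TATAC", "TATAC", "ATACA", "ATACA", "TACAT", "TACAT", "TAGAC"]
print(sorted(solid_kmers(reads, k=3)))
# ['ACA', 'ATA', 'CAT', 'TAC', 'TAT']
```

The 3-mers TAG, AGA and GAC occur only once, in the erroneous read, and are filtered out.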

Resolving problems
• Some methods for resolving the problems (high-level ideas only):
  – Non-uniqueness, heterogeneity:
    • Duplicate the shared part in frayed ropes and bubbles into separate paths, if supported by read counts
  – Repeats:
    • Use read counts to deduce number of copies
    • Usually not very accurate due to random fluctuations in read counts

Page 50: Lecture 5. Sequence Assemblykevinyip/csci3220/CSCI3220_2020Fall_05_Se… · has length m, then the reconstructed string has length £4m. (Note that mis different from n, the length

Strand issue
• In actual sequencing, we often get sequencing reads from both strands
• Example:
  – +ve strand: 5’ CATACATTAG 3’
    -ve strand: 3’ GTATGTAATC 5’
  – If each read is 6nt long, we can get the following reads:
    • From the +ve strand: CATACA, ATACAT, TACATT, ACATTA, CATTAG
    • From the -ve strand: CTAATG, TAATGT, AATGTA, ATGTAT, TGTATG
• The corresponding de Bruijn graph is also more complicated
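The reads in this example can be generated programmatically. The sketch below also shows one common convention for handling the two strands, which is to represent each read by the lexicographically smaller of itself and its reverse complement (a "canonical" form). This convention is an assumption added here for illustration and is not necessarily the approach intended in the lecture.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the opposite strand, read in its own 5' to 3' direction."""
    return seq.translate(COMPLEMENT)[::-1]

def reads_from(seq, length):
    """All substrings of the given length, in order of position."""
    return [seq[i:i + length] for i in range(len(seq) - length + 1)]

def canonical(read):
    """Strand-independent representative of a read."""
    return min(read, reverse_complement(read))

plus = "CATACATTAG"
minus = reverse_complement(plus)

plus_reads = reads_from(plus, 6)
minus_reads = reads_from(minus, 6)
print(plus_reads)
print(minus_reads)

# Each read from the -ve strand is the reverse complement of a +ve strand read,
# so the ten reads collapse to five canonical forms.
print(len({canonical(r) for r in plus_reads + minus_reads}))
```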


Scaffolding
• Main types of useful information:
  – Paired-end reads
  – Long reads (but more noisy)
  – Reference alignment


Image credit: Green, Nature Reviews Genetics 2(8):573-583, (2001)


CASE STUDY, SUMMARY AND FURTHER READINGS

Epilogue


Case study: “Synthetic cell”
• Topic of this lecture so far:
  – We have multiple copies of an unknown (biological) DNA sequence
  – We cut them down into small fragments
  – We sequence each of them to get short reads (text strings)
  – We assemble the short reads to get back the original (text) sequence
• Is it possible to do the opposite?
  – We have a long text string s
  – We cut it down into small strings
  – We biochemically synthesize each of them
  – We assemble them to get a DNA molecule with sequence s


Case study: “Synthetic cell”
• In 2008, a team reported such an experiment:
  – They took the DNA sequence (text string) of a bacterium called Mycoplasma genitalium
    • Total length: 582,970 base pairs
  – They synthesized the DNA molecule in a hierarchical manner, with some changes to the sequence
    • Cassettes of 5-7kb → intermediate assemblies of ~24kb → ~72kb → ~144kb → full sequence


Case study: “Synthetic cell”
• In 2010, the team reported something more:
  – They synthesized the DNA molecule of a bacterium, Mycoplasma mycoides (1.1Mb)
  – They then transplanted it into a cell from a closely related species, Mycoplasma capricolum
  – The cell did not divide
  – Was it “alive”?
• Later they found that there was a frameshift mutation in an important gene
  – After correcting it, cells receiving the sequence were able to divide
• The study stirred up a lot of heated debates


Image credit: Gibson et al., Science 329(5987):52-56, (2010)


Update
• A re-designed genome that led to viable bacteria, smaller than any natural bacterial genome
  – 531kb
  – 473 genes


Image credit: Hutchison et al., Science 351(6280), (2016)


Further update
• The synthetic yeast 2.0 project (http://syntheticyeast.org/)
  – Size of baker’s yeast (Saccharomyces cerevisiae) genome: ~12Mb

Last update: 20-Oct-2020 CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2020 57


Summary
• The sequence assembly problem is to assemble the original sequence from short sequencing reads without a reference
• Two graph formulations:
  – Reads as nodes: Hamiltonian path problem
    • Hamiltonian path problems are NP-hard in general
  – Reads as edges: Eulerian path problem
    • Eulerian path problems can be solved efficiently
• Current standard is to use k-mer-based de Bruijn graphs
  – Complications due to various issues:
    • Heterogeneity/heterozygosity
    • Sequencing errors
    • Non-uniqueness of sub-sequences, repeats
    • Incomplete coverage
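To make the "reads as edges" formulation concrete, the sketch below finds an Eulerian path with Hierholzer's algorithm, which runs in time linear in the number of edges. It is a minimal illustration under ideal assumptions: every k-mer of the sequence is observed exactly once, there are no errors or repeats, and an Eulerian path exists. The function names are hypothetical.

```python
from collections import defaultdict

def eulerian_path(edges):
    """Hierholzer's algorithm on a directed graph given as (u, v) pairs.
    Assumes an Eulerian path exists."""
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for u, v in edges:
        out_edges[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the node with one extra outgoing edge, if there is one.
    start = next((n for n, b in balance.items() if b == 1), edges[0][0])
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is final in the path
    return path[::-1]

def assemble(kmers):
    """Treat each k-mer as an edge between its (k-1)-prefix and (k-1)-suffix."""
    edges = [(kmer[:-1], kmer[1:]) for kmer in kmers]
    nodes = eulerian_path(edges)
    return nodes[0] + "".join(node[-1] for node in nodes[1:])

# All 4-mers of TATACATTAG, given in arbitrary order.
kmers = ["CATT", "TATA", "TTAG", "ACAT", "ATAC", "ATTA", "TACA"]
print(assemble(kmers))
```

With repeats or bubbles, as in the earlier examples, more than one Eulerian path may exist or none at all, which is why the complications listed above matter in practice.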


Further readings
• Review papers:
  – Flicek and Birney, Sense from Sequence Reads: Methods for Alignment and Assembly. Nature Methods 6(11s):S6-S12, (2009)
  – Miller et al., Assembly Algorithms for Next-Generation Sequencing Data. Genomics 95(6):315-327, (2010)
  – Compeau et al., How to Apply de Bruijn Graphs to Genome Assembly. Nature Biotechnology 29(11):987-991, (2011)
