On Genome Assembly

On Genome Assembly

Abhiram RanadeIIT Bombay

Genome

• Constituent of living cells that determines hereditary characteristics a.k.a. DNA

• Sequence of nucleobases: Adenine, Guanine, Thymine, Cytosine

ACCTGGA…• Human genome: 3 billion nucleobases• Knowledge of sequence is very useful

Genome Sequencing

Biochemical techniques can “read” genomes of length ~ 700 nucleobases

• Make many copies of the genome.• Break the copies randomly into pieces of

length ~ 700.• Read the pieces.• Try to infer what original genome must have

been. Genome Assembly

Assembly Example

• Input pieces “Reads”: abcd, cdefghi, hijkl• Assembly:• Input pieces “Reads”: abcd, cdefghi, hijkl, hicd• Assembly? abcdefghijklhicd abcdefghicdhijkl abcdefghicdefghijkl

abcdefghijkl

Strategy

• Characterize all possible assemblies of the given read set

• Assign a probability to each assembly• Pick the assembly with the highest probability

All possible assemblies

• A = valid assembly of reads if – each read appears in A at least once.– Nothing else appears, i.e. A is made by pasting

together the reads in possibly overlapping faction• Can we compute/represent the set

{A | A is a valid assembly}Overlap/String Graph!

Overlap Graph: Intuition

• Vertices = Reads.• Edge from read u to read v if read u, read v

likely to overlap in assembly.e.g. abcd, cdef => abcdef• Assembly = walk in the graph: will encourage

overlaps

Overlap graph

• Vertices: reads + empty read ϕ• Edges: (ri, rj) : if long suffix of ri = prefix of rj

abcd cdefghi• Long = ? Real genomes: 50? 100? Any value

that indicates overlap is not coincidental • Edge label: portion of rj not belonging to

overlap. abcd cdefghiefghi

(Long = 2)

Overlap graph, long=2

abcd cdefghi hijkl

hicd

ϕ

abcd

efghi jkl

cd

efghi

ϕ

ϕ

hicdϕ

Overlap graph, long=2

abcd cdefghi hijkl

hicd

ϕ

abcd

efghi jkl

cd

efghi

ϕ

ϕ

hicdϕ

Walk => Assembly

• Assembly = Walk in the overlap graph which– Starts at ϕ, Ends at ϕ– Passes through every vertex at least once.– Passes through every edge at least once?

• Assembled sequence = concatenation of labels along the walk.– Every read appears in the sequence

• Walk revisits ϕ: reconstruction is incomplete, in several pieces.

Assembly => Walk

• Input: Assembly A• Output: Walk which will generate A

– Visit vertices in the order of appearance in A

Overlap graph characterizes assemblies. Variations on graph also studied.

Approaches to assembly

• Occam’s Razor: Most likely = “Shortest”– Shortest walk that visits every vertex at least once:

NP-hard– Shortest walk that visits every edge at least once:

Chinese Postman problem. Polytime.– Pragmatic: Use some greedy approach to find

above.• Model probability more accurately

A Twist: pair constraints

• Sequencing process may give additional constraints: distance from ri to rj in assembly is about D

• Example: ri = abcd, rj = hijkl, D = 10. Which of the following assembiles is more likely?abcdefghijklhicd

abcdefghicdhijkl abcdefghicdefghijkl

Systematic estimation of probability of a given assembly

Algebraic representation of walks

• Walk is cyclic: number of times vertex entered = number of times it is exited.

• Walk = fluid flowTotal fluid coming in = Total fluid going outXij = fluid going from i to j.= number of times walk goes from i to j

Formulate conditions on Xij and solve

Algebraic representation: Xij,δj

Xij = Number of times walk goes from i to j.δj = Number of times walk goes over j

Lij = Length of label of edge (i,j)Length of genome L =L may be known. €

Xij = Xjkk∑

i∑ = δj > 0

€

Xij * Liji, j∑

Maximum likelihood reconstruction(Medvedev-Brudno 08)

Goal: Find assembly A most likely given the observations. maximize Pr(A | r1,r2,…rn)

=Pr(A, r1,r2,…,rn) / Pr(r1,r2,…,rn)=Pr(r1,r2,…,rn|A) * Pr(A) / Pr(r1,r2,…,rn)Standard assumption: Unconditional probability

Pr(A) same for all A.Maximize Pr(r1,r2,…,rn|A) and output A that

maximizes.

Computing Pr(r1,r2,…,rn|A)

• A = abcdefghicdefghijklr1 = abcd, r2 = cdefghi, r3 = hijkl, r4 = hicd

• Process of generating reads:– Pick a random starting point.– Pick a length at random

• Pr(r2) = ? = 2/Length * Pr(read length = 7)

= δ2/Length * Pr(read length = 7)


Generative model: • For i=1 to n

– Pick starting point for ri

– Pick length Li

• Probability of generating ri:= Number of times ri appears in A/Length of A * Probability of getting the correct length= δi/L * Pr(Li)


• Pr = Πi δi/L * Pr(Li)• We want to pick A for which this probability is

maximum• Score(A) = Πi δi/L * Pr(Li)

• Best A will have max value of Πi δi/L • So now we have a program

Finding the best assembly

• Maximize Πi δi/L• s.t.

• L known approximately. L = Lgeε ≈ Lg(1+ε)

• Lg, Lij : constants. Solve for Xij≥0, δj≥0, ε• Convex optimization €

Xij = Xjkk∑

i∑ = δj > 0

€

L = Xij *Liji, j∑

Concluding Remarks

• Experiments seem to indicate our approach works well.

www.cse.iitb.ac.in/~ranade/GraphAssembly.pdf• Computationally intensive, but well founded.• May not be useful for large genomes – linear

time algorithms only!• How to handle pair constraints: important

open problem.• Graphs are everywhere!

Documents

On Genome Assembly