On Genome Assembly Abhiram Ranade IIT Bombay

Page 1: On Genome Assembly

On Genome Assembly

Abhiram Ranade, IIT Bombay

Page 2: On Genome Assembly

Genome

• Constituent of living cells that determines hereditary characteristics a.k.a. DNA

• Sequence of nucleobases: Adenine, Guanine, Thymine, Cytosine

ACCTGGA…

• Human genome: 3 billion nucleobases

• Knowledge of sequence is very useful

Page 3: On Genome Assembly

Genome Sequencing

Biochemical techniques can “read” only pieces of length ~700 nucleobases.

• Make many copies of the genome.

• Break the copies randomly into pieces of length ~700.

• Read the pieces.

• Try to infer what the original genome must have been: Genome Assembly.
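The copy-and-break procedure above can be simulated in a few lines. This is only a sketch: the function name `sample_reads`, the Gaussian spread of piece lengths, and the toy genome of 5000 random bases are all assumptions made for illustration, not part of the talk.

```python
import random

def sample_reads(genome, copies, mean_len, rng):
    """Break `copies` copies of `genome` into consecutive random-length pieces."""
    reads = []
    for _ in range(copies):
        pos = 0
        while pos < len(genome):
            # Assumed length model: roughly mean_len, with 10% spread.
            length = max(1, int(rng.gauss(mean_len, 0.1 * mean_len)))
            reads.append(genome[pos:pos + length])
            pos += length
    return reads

rng = random.Random(0)
genome = "".join(rng.choice("ACGT") for _ in range(5000))
reads = sample_reads(genome, copies=3, mean_len=700, rng=rng)
print(len(reads), "reads; the assembler sees only these, not the genome")
```

Because every copy is cut at different random points, pieces from different copies overlap, and those overlaps are what assembly exploits.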

Page 4: On Genome Assembly

Assembly Example

• Input pieces “Reads”: abcd, cdefghi, hijkl

• Assembly: abcdefghijkl

• Input pieces “Reads”: abcd, cdefghi, hijkl, hicd

• Assembly? abcdefghijklhicd, abcdefghicdhijkl, abcdefghicdefghijkl

Page 5: On Genome Assembly

Strategy

• Characterize all possible assemblies of the given read set

• Assign a probability to each assembly

• Pick the assembly with the highest probability

Page 6: On Genome Assembly

All possible assemblies

• A = valid assembly of reads if
    – each read appears in A at least once.
    – nothing else appears, i.e. A is made by pasting together the reads in possibly overlapping fashion.

• Can we compute/represent the set {A | A is a valid assembly}?

Overlap/String Graph!
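The two validity conditions can be checked mechanically. The sketch below reads "nothing else appears" as "every position of A is covered by some occurrence of some read", which is one interpretation of the slide; the helper names are hypothetical.

```python
def occurrences(read, assembly):
    """Start positions of every (possibly overlapping) occurrence of read."""
    starts, pos = [], assembly.find(read)
    while pos != -1:
        starts.append(pos)
        pos = assembly.find(read, pos + 1)
    return starts

def is_valid_assembly(assembly, reads):
    covered = [False] * len(assembly)
    for read in reads:
        starts = occurrences(read, assembly)
        if not starts:          # condition 1: each read appears at least once
            return False
        for s in starts:
            for k in range(s, s + len(read)):
                covered[k] = True
    return all(covered)         # condition 2 (assumed reading): nothing else appears

reads = ["abcd", "cdefghi", "hijkl", "hicd"]
for candidate in ["abcdefghijklhicd", "abcdefghicdhijkl",
                  "abcdefghicdefghijkl", "abcdefghijkl"]:
    print(candidate, is_valid_assembly(candidate, reads))
```

The three candidate assemblies from the example pass; abcdefghijkl fails for the four-read input because hicd never appears in it.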

Page 7: On Genome Assembly

Overlap Graph: Intuition

• Vertices = Reads.

• Edge from read u to read v if read u, read v are likely to overlap in the assembly, e.g. abcd, cdef => abcdef

• Assembly = walk in the graph: will encourage overlaps

Page 8: On Genome Assembly

Overlap graph

• Vertices: reads + empty read ϕ

• Edges: (ri, rj) if a long suffix of ri = a prefix of rj, e.g. abcd → cdefghi

• Long = ? Real genomes: 50? 100? Any value that indicates the overlap is not coincidental

• Edge label: portion of rj not belonging to the overlap, e.g. the edge abcd → cdefghi has label efghi (Long = 2)
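A minimal construction of this graph, assuming a brute-force all-pairs comparison (fine for four reads, far too slow for real data). The slide does not say which reads are joined to ϕ; here every read is connected to and from ϕ, which is an assumption, as are the names `overlap` and `build_overlap_graph`.

```python
PHI = ""  # the empty read ϕ, represented here as the empty string

def overlap(u, v, min_len):
    """Longest proper suffix of u equal to a prefix of v, if at least min_len; else 0."""
    for k in range(min(len(u), len(v)) - 1, min_len - 1, -1):
        if u.endswith(v[:k]):
            return k
    return 0

def build_overlap_graph(reads, min_len):
    """Return {(u, v): label}, where label is the part of v outside the overlap."""
    edges = {}
    for u in reads:
        edges[(PHI, u)] = u    # assumed: ϕ -> u, label is the whole read
        edges[(u, PHI)] = PHI  # assumed: u -> ϕ, empty label
        for v in reads:
            k = overlap(u, v, min_len) if u != v else 0
            if k:
                edges[(u, v)] = v[k:]
    return edges

graph = build_overlap_graph(["abcd", "cdefghi", "hijkl", "hicd"], min_len=2)
for (u, v), label in graph.items():
    if u and v:
        print(f"{u} -> {v}  label {label}")
```

With long = 2 this yields four read-to-read edges, labelled efghi, jkl, cd and efghi.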

Page 9: On Genome Assembly

Overlap graph, long=2

[Figure: overlap graph on the vertices abcd, cdefghi, hijkl, hicd and ϕ, with edge labels abcd, efghi, jkl, cd, efghi, hicd and ϕ.]

Page 10: On Genome Assembly

Overlap graph, long=2

[Figure: the same overlap graph on the vertices abcd, cdefghi, hijkl, hicd and ϕ, repeated.]

Page 11: On Genome Assembly

Walk => Assembly

• Assembly = Walk in the overlap graph which
    – Starts at ϕ, ends at ϕ
    – Passes through every vertex at least once.
    – Passes through every edge at least once?

• Assembled sequence = concatenation of labels along the walk.
    – Every read appears in the sequence

• Walk revisits ϕ: reconstruction is incomplete, in several pieces.
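To make the walk-to-sequence step concrete, the sketch below spells out two walks on the four-read example. The label table is written out by hand for long = 2; the edges touching ϕ are an assumption about the figure, and `spell` is a hypothetical helper name.

```python
PHI = "ϕ"  # the empty read

# Edge labels for the example. Only the read-to-read labels follow from the
# overlap rule; which reads connect to ϕ is assumed for illustration.
LABELS = {
    (PHI, "abcd"): "abcd",
    (PHI, "hicd"): "hicd",
    ("abcd", "cdefghi"): "efghi",
    ("cdefghi", "hijkl"): "jkl",
    ("cdefghi", "hicd"): "cd",
    ("hicd", "cdefghi"): "efghi",
    ("hijkl", PHI): "",
    ("hicd", PHI): "",
}

def spell(walk):
    """Concatenate labels along the walk; each arrival at ϕ closes one piece."""
    pieces, current = [], ""
    for u, v in zip(walk, walk[1:]):
        current += LABELS[(u, v)]
        if v == PHI:
            pieces.append(current)
            current = ""
    return pieces

one_piece = spell([PHI, "abcd", "cdefghi", "hicd", "cdefghi", "hijkl", PHI])
two_pieces = spell([PHI, "abcd", "cdefghi", "hijkl", PHI, "hicd", PHI])
print(one_piece)   # a single sequence
print(two_pieces)  # the walk revisits ϕ, so the result is in two pieces
```

Joining the two pieces end to end gives abcdefghijklhicd, one of the candidate assemblies from the example.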

Page 12: On Genome Assembly

Assembly => Walk

• Input: Assembly A

• Output: Walk which will generate A
    – Visit vertices in the order of appearance in A

Overlap graph characterizes assemblies. Variations on the graph are also studied.
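The reverse direction can be sketched the same way: list every occurrence of every read in A, sort by position, and pass through ϕ whenever two consecutive occurrences overlap by less than "long". The function names and the handling of ϕ are assumptions, and the sketch ignores reads contained inside other reads.

```python
PHI = "ϕ"

def occurrences(read, assembly):
    """Start positions of every (possibly overlapping) occurrence of read."""
    starts, pos = [], assembly.find(read)
    while pos != -1:
        starts.append(pos)
        pos = assembly.find(read, pos + 1)
    return starts

def assembly_to_walk(assembly, reads, min_len):
    """Visit reads in order of appearance, detouring via ϕ where overlap is too short."""
    hits = sorted((s, r) for r in reads for s in occurrences(r, assembly))
    walk, prev_end = [PHI], None
    for start, read in hits:
        if prev_end is not None and prev_end - start < min_len:
            walk.append(PHI)  # no usable overlap edge: close this piece
        walk.append(read)
        prev_end = start + len(read)
    walk.append(PHI)
    return walk

reads = ["abcd", "cdefghi", "hijkl", "hicd"]
walk_a = assembly_to_walk("abcdefghicdefghijkl", reads, min_len=2)
walk_b = assembly_to_walk("abcdefghijklhicd", reads, min_len=2)
print(walk_a)
print(walk_b)
```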

Page 13: On Genome Assembly

Approaches to assembly

• Occam’s Razor: Most likely = “Shortest”
    – Shortest walk that visits every vertex at least once: NP-hard
    – Shortest walk that visits every edge at least once: Chinese Postman problem. Polytime.
    – Pragmatic: Use some greedy approach to find the above.

• Model probability more accurately
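The slide does not say which greedy approach is meant. One common heuristic, shown here purely as an illustration, repeatedly merges the two strings with the largest overlap until one string remains. It assumes no read is contained in another and makes no optimality claim.

```python
def overlap(u, v):
    """Length of the longest suffix of u that equals a prefix of v."""
    for k in range(min(len(u), len(v)), 0, -1):
        if u.endswith(v[:k]):
            return k
    return 0

def greedy_merge(reads):
    """Greedy largest-overlap merging (a standard heuristic, not the talk's method)."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        # Pick the ordered pair (i, j) with the largest overlap.
        k, i, j = max((overlap(u, v), i, j)
                      for i, u in enumerate(contigs)
                      for j, v in enumerate(contigs) if i != j)
        merged = contigs[i] + contigs[j][k:]
        contigs = [c for t, c in enumerate(contigs) if t not in (i, j)]
        contigs.append(merged)
    return contigs[0]

reads = ["abcd", "cdefghi", "hijkl", "hicd"]
result = greedy_merge(reads)
print(result, len(result))
```

Ties between equally good overlaps are broken arbitrarily here, and a different tie-break can produce a different assembly, which is one reason the slide goes on to ask for a better probability model.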

Page 14: On Genome Assembly

A Twist: pair constraints

• Sequencing process may give additional constraints: distance from ri to rj in assembly is about D

• Example: ri = abcd, rj = hijkl, D = 10. Which of the following assemblies is more likely?

abcdefghijklhicd

abcdefghicdhijkl

abcdefghicdefghijkl
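The slide leaves open how the distance from ri to rj is measured. The sketch below computes it under two assumed conventions, start-to-start and end-of-ri-to-start-of-rj, for each candidate; which one the sequencing process actually reports is not stated in the talk.

```python
def pair_distances(assembly, ri, rj):
    """Distance from ri to rj under two assumed conventions (first occurrences only)."""
    start_i, start_j = assembly.find(ri), assembly.find(rj)
    start_to_start = start_j - start_i
    gap = start_j - (start_i + len(ri))  # bases strictly between the two reads
    return start_to_start, gap

candidates = ["abcdefghijklhicd", "abcdefghicdhijkl", "abcdefghicdefghijkl"]
distances = [pair_distances(a, "abcd", "hijkl") for a in candidates]
for a, (s2s, gap) in zip(candidates, distances):
    print(f"{a}: start-to-start {s2s}, gap {gap}")
```

Under either convention the first candidate is the furthest from D = 10, so a pair constraint helps discriminate between assemblies that the reads alone admit.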

Page 15: On Genome Assembly

Systematic estimation of probability of a given assembly

Page 16: On Genome Assembly

Algebraic representation of walks

• Walk is cyclic: number of times vertex entered = number of times it is exited.

• Walk = fluid flow
    – Total fluid coming in = Total fluid going out
    – Xij = fluid going from i to j = number of times walk goes from i to j

Formulate conditions on Xij and solve

Page 17: On Genome Assembly

Algebraic representation: Xij,δj

Xij = Number of times walk goes from i to j.

δj = Number of times walk goes over j:

$\sum_i X_{ij} = \sum_k X_{jk} = \delta_j > 0$

Lij = Length of label of edge (i,j)

Length of genome $L = \sum_{i,j} X_{ij} L_{ij}$. L may be known.
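These quantities can be read off any concrete walk. The sketch below does so for one walk on the four-read example (the walk and the label lengths are written out by hand as assumptions), then checks flow conservation and the length formula.

```python
from collections import Counter

PHI = "ϕ"
# Label length Lij for each edge used by the example walk (long = 2).
LABEL_LEN = {
    (PHI, "abcd"): 4,
    ("abcd", "cdefghi"): 5,
    ("cdefghi", "hicd"): 2,
    ("hicd", "cdefghi"): 5,
    ("cdefghi", "hijkl"): 3,
    ("hijkl", PHI): 0,
}
walk = [PHI, "abcd", "cdefghi", "hicd", "cdefghi", "hijkl", PHI]

X = Counter(zip(walk, walk[1:]))       # Xij: times the walk uses edge (i, j)
inflow, outflow = Counter(), Counter()
for (i, j), count in X.items():
    outflow[i] += count
    inflow[j] += count

delta = dict(inflow)                   # δj: times the walk goes over j
conserved = all(inflow[v] == outflow[v] for v in set(walk))
L = sum(count * LABEL_LEN[edge] for edge, count in X.items())
print(delta, conserved, L)
```

The walk spells abcdefghicdefghijkl, and the computed L matches that string's length.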

Page 18: On Genome Assembly

Maximum likelihood reconstruction (Medvedev-Brudno 08)

Goal: Find assembly A most likely given the observations.

maximize Pr(A | r1,r2,…,rn)

= Pr(A, r1,r2,…,rn) / Pr(r1,r2,…,rn)

= Pr(r1,r2,…,rn | A) * Pr(A) / Pr(r1,r2,…,rn)

Standard assumption: Unconditional probability Pr(A) same for all A.

Maximize Pr(r1,r2,…,rn | A) and output the A that maximizes it.

Page 19: On Genome Assembly

Computing Pr(r1,r2,…,rn|A)

• A = abcdefghicdefghijkl

r1 = abcd, r2 = cdefghi, r3 = hijkl, r4 = hicd

• Process of generating reads:
    – Pick a random starting point.
    – Pick a length at random

• Pr(r2) = ?
    = 2/Length * Pr(read length = 7)
    = δ2/Length * Pr(read length = 7)

Page 20: On Genome Assembly

Computing Pr(r1,r2,…,rn|A)

Generative model:

• For i = 1 to n
    – Pick starting point for ri
    – Pick length Li

• Probability of generating ri
    = (Number of times ri appears in A / Length of A) * Probability of getting the correct length
    = δi/L * Pr(Li)

Page 21: On Genome Assembly

Computing Pr(r1,r2,…,rn|A)

• Pr = Πi δi/L * Pr(Li)

• We want to pick A for which this probability is maximum

• Score(A) = Πi δi/L * Pr(Li)

• The Pr(Li) factors depend only on the reads, not on A, so the best A will have the max value of Πi δi/L

• So now we have an optimization program
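As a worked instance of the score, the sketch below evaluates Πi δi/L for the three candidate assemblies of the running example, dropping the Pr(Li) factors since they are the same for every A. Exact fractions avoid rounding; the helper names are hypothetical.

```python
from fractions import Fraction

def count_occurrences(read, assembly):
    """δ for one read: number of (possibly overlapping) occurrences in the assembly."""
    count, pos = 0, assembly.find(read)
    while pos != -1:
        count += 1
        pos = assembly.find(read, pos + 1)
    return count

def score(assembly, reads):
    """Product over reads of δi / L, with L the length of the assembly."""
    result = Fraction(1)
    for read in reads:
        result *= Fraction(count_occurrences(read, assembly), len(assembly))
    return result

reads = ["abcd", "cdefghi", "hijkl", "hicd"]
candidates = ["abcdefghijklhicd", "abcdefghicdhijkl", "abcdefghicdefghijkl"]
scores = {a: score(a, reads) for a in candidates}
for a, s in scores.items():
    print(a, s, float(s))
```

The longest candidate scores slightly higher than the two shorter ones because cdefghi occurs twice in it, so under this model the most likely assembly need not be the shortest.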

Page 22: On Genome Assembly

Finding the best assembly

• Maximize Πi δi/L

• s.t. $\sum_i X_{ij} = \sum_k X_{jk} = \delta_j > 0$ and $L = \sum_{i,j} X_{ij} L_{ij}$

• L known approximately: $L = L_g e^{\varepsilon} \approx L_g(1+\varepsilon)$

• Lg, Lij: constants. Solve for Xij ≥ 0, δj ≥ 0, ε

• Convex optimization
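One way to see why this can be treated as a convex program is to take logarithms of the objective. The derivation below is a reading of the slide rather than a statement of the authors' exact formulation: it assumes n reads, real-valued (relaxed) variables, the exact form of L in the objective, and the linear approximation of L in the length constraint.

```latex
\begin{align*}
\log \prod_{i=1}^{n} \frac{\delta_i}{L}
  &= \sum_{i=1}^{n} \log \delta_i - n \log L \\
  &= \sum_{i=1}^{n} \log \delta_i - n \log L_g - n\varepsilon
  && \text{using } L = L_g e^{\varepsilon}, \\[4pt]
\text{subject to}\quad
  \sum_i X_{ij} &= \sum_k X_{jk} = \delta_j, \qquad
  \sum_{i,j} X_{ij} L_{ij} = L_g (1 + \varepsilon).
\end{align*}
```

The objective is a sum of concave terms in δ plus a linear term in ε, and under these assumptions the constraints are linear, so maximizing it is a convex optimization problem.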

Page 23: On Genome Assembly

Concluding Remarks

• Experiments seem to indicate our approach works well. www.cse.iitb.ac.in/~ranade/GraphAssembly.pdf

• Computationally intensive, but well founded.

• May not be useful for large genomes – linear time algorithms only!

• How to handle pair constraints: important open problem.

• Graphs are everywhere!