Upload
kale
View
57
Download
0
Embed Size (px)
DESCRIPTION
On Genome Assembly. Abhiram Ranade IIT Bombay. Genome. Constituent of living cells that determines hereditary characteristics a.k.a. DNA Sequence of nucleobases : A denine, G uanine, T hymine, C ytosine ACCTGGA… Human genome: 3 billion nucleobases - PowerPoint PPT Presentation
Citation preview
On Genome Assembly
Abhiram RanadeIIT Bombay
Genome
• Constituent of living cells that determines hereditary characteristics a.k.a. DNA
• Sequence of nucleobases: Adenine, Guanine, Thymine, Cytosine
ACCTGGA…• Human genome: 3 billion nucleobases• Knowledge of sequence is very useful
Genome Sequencing
Biochemical techniques can “read” genomes of length ~ 700 nucleobases
• Make many copies of the genome.• Break the copies randomly into pieces of
length ~ 700.• Read the pieces.• Try to infer what original genome must have
been. Genome Assembly
Assembly Example
• Input pieces “Reads”: abcd, cdefghi, hijkl• Assembly:• Input pieces “Reads”: abcd, cdefghi, hijkl, hicd• Assembly? abcdefghijklhicd abcdefghicdhijkl abcdefghicdefghijkl
abcdefghijkl
Strategy
• Characterize all possible assemblies of the given read set
• Assign a probability to each assembly• Pick the assembly with the highest probability
All possible assemblies
• A = valid assembly of reads if – each read appears in A at least once.– Nothing else appears, i.e. A is made by pasting
together the reads in possibly overlapping faction• Can we compute/represent the set
{A | A is a valid assembly}Overlap/String Graph!
Overlap Graph: Intuition
• Vertices = Reads.• Edge from read u to read v if read u, read v
likely to overlap in assembly.e.g. abcd, cdef => abcdef• Assembly = walk in the graph: will encourage
overlaps
Overlap graph
• Vertices: reads + empty read ϕ• Edges: (ri, rj) : if long suffix of ri = prefix of rj
abcd cdefghi• Long = ? Real genomes: 50? 100? Any value
that indicates overlap is not coincidental • Edge label: portion of rj not belonging to
overlap. abcd cdefghiefghi
(Long = 2)
Overlap graph, long=2
abcd cdefghi hijkl
hicd
ϕ
abcd
efghi jkl
cd
efghi
ϕ
ϕ
hicdϕ
Overlap graph, long=2
abcd cdefghi hijkl
hicd
ϕ
abcd
efghi jkl
cd
efghi
ϕ
ϕ
hicdϕ
Walk => Assembly
• Assembly = Walk in the overlap graph which– Starts at ϕ, Ends at ϕ– Passes through every vertex at least once.– Passes through every edge at least once?
• Assembled sequence = concatenation of labels along the walk.– Every read appears in the sequence
• Walk revisits ϕ: reconstruction is incomplete, in several pieces.
Assembly => Walk
• Input: Assembly A• Output: Walk which will generate A
– Visit vertices in the order of appearance in A
Overlap graph characterizes assemblies. Variations on graph also studied.
Approaches to assembly
• Occam’s Razor: Most likely = “Shortest”– Shortest walk that visits every vertex at least once:
NP-hard– Shortest walk that visits every edge at least once:
Chinese Postman problem. Polytime.– Pragmatic: Use some greedy approach to find
above.• Model probability more accurately
A Twist: pair constraints
• Sequencing process may give additional constraints: distance from ri to rj in assembly is about D
• Example: ri = abcd, rj = hijkl, D = 10. Which of the following assembiles is more likely?abcdefghijklhicd
abcdefghicdhijkl abcdefghicdefghijkl
Systematic estimation of probability of a given assembly
Algebraic representation of walks
• Walk is cyclic: number of times vertex entered = number of times it is exited.
• Walk = fluid flowTotal fluid coming in = Total fluid going outXij = fluid going from i to j.= number of times walk goes from i to j
Formulate conditions on Xij and solve
Algebraic representation: Xij,δj
Xij = Number of times walk goes from i to j.δj = Number of times walk goes over j
Lij = Length of label of edge (i,j)Length of genome L =L may be known. €
Xij = Xjkk∑
i∑ = δj > 0
€
Xij * Liji, j∑
Maximum likelihood reconstruction(Medvedev-Brudno 08)
Goal: Find assembly A most likely given the observations. maximize Pr(A | r1,r2,…rn)
=Pr(A, r1,r2,…,rn) / Pr(r1,r2,…,rn)=Pr(r1,r2,…,rn|A) * Pr(A) / Pr(r1,r2,…,rn)Standard assumption: Unconditional probability
Pr(A) same for all A.Maximize Pr(r1,r2,…,rn|A) and output A that
maximizes.
Computing Pr(r1,r2,…,rn|A)
• A = abcdefghicdefghijklr1 = abcd, r2 = cdefghi, r3 = hijkl, r4 = hicd
• Process of generating reads:– Pick a random starting point.– Pick a length at random
• Pr(r2) = ? = 2/Length * Pr(read length = 7)
= δ2/Length * Pr(read length = 7)
Computing Pr(r1,r2,…,rn|A)
Generative model: • For i=1 to n
– Pick starting point for ri
– Pick length Li
• Probability of generating ri:= Number of times ri appears in A/Length of A * Probability of getting the correct length= δi/L * Pr(Li)
Computing Pr(r1,r2,…,rn|A)
• Pr = Πi δi/L * Pr(Li)• We want to pick A for which this probability is
maximum• Score(A) = Πi δi/L * Pr(Li)
• Best A will have max value of Πi δi/L • So now we have a program
Finding the best assembly
• Maximize Πi δi/L• s.t.
• L known approximately. L = Lgeε ≈ Lg(1+ε)
• Lg, Lij : constants. Solve for Xij≥0, δj≥0, ε• Convex optimization €
Xij = Xjkk∑
i∑ = δj > 0
€
L = Xij *Liji, j∑
Concluding Remarks
• Experiments seem to indicate our approach works well.
www.cse.iitb.ac.in/~ranade/GraphAssembly.pdf• Computationally intensive, but well founded.• May not be useful for large genomes – linear
time algorithms only!• How to handle pair constraints: important
open problem.• Graphs are everywhere!