Comp. Genomics

Recitation 14: Exam preparation, biological networks

Exercise

• A large PPI network G was generated using high throughput technologies.

• A smaller network H is known in a different organism.

• Assume that there exists an efficient algorithm which determines whether there is a sub-network of G of size ≥k that is isomorphic to H

Exercise

• Two graphs (PPI networks) are said to be isomorphic if there is a bijection f between their vertex sets such that f(u) is adjacent to f(v) iff u is adjacent to v

Exercise

• Show that the same algorithm can solve the following problem in polynomial time:

• CLIQUE: Given a graph G’ and an integer k, is there a clique of size ≥ k in G’?

Solution

• Given a graph G’ and a number k, we create another graph H’ of size k in which there is an edge between every two vertices

• This takes polynomial time

• We run the original algorithm on (G’,H’) and answer the same
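The reduction above can be sketched in Python. The names `complete_graph`, `has_clique` and `brute_force_oracle` are hypothetical; in particular, `brute_force_oracle` is only an exponential-time stand-in for the efficient algorithm that the exercise assumes, included so that the sketch runs.

```python
from itertools import combinations, permutations

def complete_graph(k):
    """Edges of H': k vertices 0..k-1 with an edge between every two of them."""
    return set(combinations(range(k), 2))

def brute_force_oracle(g_vertices, g_edges, h_size, h_edges):
    """Stand-in for the assumed algorithm: does G contain a sub-network
    isomorphic to H? Tries every injective mapping, so it is exponential."""
    g_adj = {frozenset(e) for e in g_edges}
    h_adj = {frozenset(e) for e in h_edges}
    for image in permutations(g_vertices, h_size):
        if all((frozenset((image[u], image[v])) in g_adj)
               == (frozenset((u, v)) in h_adj)
               for u, v in combinations(range(h_size), 2)):
            return True
    return False

def has_clique(g_vertices, g_edges, k, oracle):
    """CLIQUE via the reduction: build H' in O(k^2) time, then ask the oracle."""
    return oracle(g_vertices, g_edges, k, complete_graph(k))

# A triangle 0-1-2 with a pendant vertex 3.
V = [0, 1, 2, 3]
E = [(0, 1), (1, 2), (0, 2), (2, 3)]
print(has_clique(V, E, 3, brute_force_oracle))  # True
print(has_clique(V, E, 4, brute_force_oracle))  # False
```

Only `complete_graph` and the single oracle call belong to the reduction itself; both are polynomial in the size of G’ and k.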

Exercise

• Show that the algorithm from the previous question can also solve the following problem:

• Input:
  • A set of elements X=(x1,x2,…,xn)
  • A distance function d(xi,xj)=1 if xi and xj are “close”, 0 otherwise

• Output: Can the set be divided into at most k clusters such that all the element pairs in every cluster are close?

Solution

• Build a graph G, edge (xi,xj) means d(xi,xj)=1

• Use the previous algorithm to find a clique of maximal size (decision problem → optimization problem)

• Find the clique and remove it from the graph

• Repeat at most k times. If the result is the empty graph, answer ‘Yes’. Otherwise answer ‘No’.

Moed B 26.2.2010

• You are given a set of strings S1,S2,…,Sk of length C each, and each string Si is associated with a positive score Bi.

• Si appears in an alignment if there is a sequence of gapless matches in the alignment that contains Si.

• We subtract Bi from the score of an alignment for every appearance of Si, including overlaps.

Describe a global alignment algorithm.

Question

• True or false: The following algorithm is a global alignment algorithm for the problem:

• For every cell [i,j] in the DP matrix we will save the number of consecutive matches that the optimal alignment between x1,…,xi and y1,…,yj has made since the last gap. If this value is ≥ C we will check for every Si and reduce Bi as needed.

Solution

• The suggested algorithm does not work. Counter-example:

S[G,G]=10, S[A,A]=1, indel=-1
S1=AAAG, B1=-100 (an appearance of S1 costs 100)

DP matrix computed by the suggested algorithm:

          A    A    A    G
     0   -1   -2   -3   -4
A   -1    1    0   -1   -2
A   -2    0    2    1    0
A   -3   -1    1    3    2
G   -4   -2    0    2    1

The cell [3,3] keeps only the optimal prefix alignment (score 3, three consecutive matches), so the diagonal move into [4,4] completes S1 and scores 3+10-100. The algorithm never sees the lower-scoring prefix alignments that end with fewer matches.

Alignment found (score 1):
AAA_G
AAAG_

Optimal alignment (score 10):
_AAAG
A_AAG

Question

• True or false: The algorithm that worked for positive bonuses will work here too: add terms of the following form to the recursive update rule:

\[ -I_{S_i} \cdot B_i + \sum_{k=0}^{3} S[i-k,\, j-k] \]

where I_{S_i} is 1 if the nucleotides i-3,…,i and j-3,…,j are the seed S_i and ∞ otherwise. The last component is the normal score for matching 4 nucleotides.

Solution

• It will not work here.

• Since -I_{S_i}·B_i ≤ 0, and since the option of four consecutive matches is also considered, the algorithm will never use the new update rule

• The score that the algorithm computes will not be consistent with the scoring scheme

Question

• What is the correct algorithm?

• Divide every cell of the DP matrix into C+1 cells

• The cell M[i,j,k] represents the optimal alignment between X and Y that ends with k matches

Solution

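\[
M[i,j,k] =
\begin{cases}
\max\left\{ \max\limits_{0 \le l \le C} M[i,j-1,l] + \sigma(-,y_j),\ \max\limits_{0 \le l \le C} M[i-1,j,l] + \sigma(x_i,-) \right\} & k = 0 \\[4pt]
M[i-1,j-1,k-1] + \sigma(x_i,y_j) & 0 < k < C \\[4pt]
\max\limits_{C-1 \le l \le C} M[i-1,j-1,l] + \sigma(x_i,y_j) - B(x_{i-C+1},\ldots,x_i) & k = C
\end{cases}
\]

Here σ is the scoring function and B(·) is the score Bi of the seed Si when its argument equals Si, and 0 otherwise.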
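A minimal Python sketch of the recurrence above, under assumptions that are not in the slides: the function name `seeded_global_alignment`, a constant gap score, seeds passed as a dict from seed string to its penalty Bi, and a mismatch treated as a diagonal move that resets the match counter to 0. The layer k=C stands for "at least C consecutive matches".

```python
from math import inf

def seeded_global_alignment(x, y, seeds, sigma, gap):
    """Global alignment score where every appearance of a seed (C consecutive
    matches spelling the seed) subtracts its penalty. Assumes seeds is a
    non-empty dict {seed: penalty} with all seeds of the same length C."""
    C = len(next(iter(seeds)))
    n, m = len(x), len(y)
    # M[i][j][k]: best score for x[:i], y[:j] ending with k matches (k capped at C)
    M = [[[-inf] * (C + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    M[0][0][0] = 0
    for i in range(1, n + 1):
        M[i][0][0] = i * gap
    for j in range(1, m + 1):
        M[0][j][0] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            a, b = x[i - 1], y[j - 1]
            cell, diag = M[i][j], M[i - 1][j - 1]
            # k = 0: the alignment ends with a gap ...
            cell[0] = max(max(M[i][j - 1]), max(M[i - 1][j])) + gap
            if a != b:
                # ... or with a mismatch (assumption: it resets the counter)
                cell[0] = max(cell[0], max(diag) + sigma(a, b))
                continue
            s = sigma(a, b)
            for k in range(1, C):             # 0 < k < C
                cell[k] = diag[k - 1] + s
            long_run = max(diag[C - 1], diag[C])
            if long_run > -inf:               # k = C: a seed may be completed
                cell[C] = long_run + s - seeds.get(x[i - C:i], 0)
    return max(M[n][m])

# Scoring of the counter-example; the mismatch score -1 is an assumption.
def score(a, b):
    if a != b:
        return -1
    return 10 if a == "G" else 1

print(seeded_global_alignment("AAAG", "AAAG", {"AAAG": 100}, score, -1))  # 10
```

On the counter-example the sketch returns 10, the score of the optimal alignment, because the layers k=1,…,C-1 keep the prefix alignments that the single-value matrix discards.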

Solution

• Correctness: Assume that we have the correct values for all cells M[i’,j’,k’] that precede M[i,j,k] and we want to compute the score at the cell M[i,j,k].

• If k<C, then we are not creating a sequence of C matches, and therefore by the inductive assumption and the defined operations M[i,j,k] will contain the optimal score.

Solution

• If k≥C, and the last C characters are not in {Si}, we are done for the same reasons.

• If k≥C and the last C characters are in {Si}, then there are several options:

• The optimal alignment contains the seed. Since we are checking the cells M[i-1,j-1,k-1], M[i-1,j-1,k], we will obtain the score of the optimal alignment.

Solution

• If k≥C and the last C characters are in {Si}, then the other option is:

• The seed is not in the optimal alignment. Since the alignment between [i-C+1,…,i] and [j-C+1,…,j] does not contain C consecutive matches, but ends with a match, its prefix which aligns [i-C+1,…,i-1] and [j-C+1,…,j-1] must end with 0,1,…,C-2 matches. Hence the optimal alignment between [1,…,i-1] and [1,…,j-1] ends with 0,1,…,C-2 matches.

• By the inductive assumption, we have the optimal score for all these alignments, and the update rule tests them all.

Moed B 26.2.2010

• An inverted repeat is an appearance of some sequence and its inverse in a string, without overlapping. For example, in the string abcdelmnedcblmnknm there is an inverted repeat of size 4, because bcde and its inverse edcb appear in it and do not overlap. The sequence lmn appears twice, but in the same order, and therefore it does not constitute an inverted repeat. The sequence mnk and its inverse knm both appear, but the two appearances overlap, and therefore they do not constitute an inverted repeat either. Describe a linear time algorithm for finding the longest inverted repeat in a string.
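The definition above can be pinned down with a brute-force reference in Python. This is not the linear-time algorithm the question asks for (it is polynomial but far from linear); the function name `longest_inverted_repeat_bruteforce` and the 1-based start indices it returns are choices made here for illustration.

```python
def longest_inverted_repeat_bruteforce(s):
    """Return (length, start of the sequence, start of its inverse), with
    1-based starts, for a longest non-overlapping inverted repeat in s."""
    n = len(s)
    for length in range(n // 2, 0, -1):          # try the longest first
        for a in range(n - length + 1):
            inverse = s[a:a + length][::-1]
            b = s.find(inverse)
            while b != -1:
                # the two appearances must not share any position
                if b >= a + length or b + length <= a:
                    return length, a + 1, b + 1
                b = s.find(inverse, b + 1)
    return 0, None, None

print(longest_inverted_repeat_bruteforce("abcdelmnedcblmnknm"))  # (4, 2, 9)
```

For the example string it reports bcde at position 2 and its inverse edcb at position 9; the palindrome mnknm is rejected because its only inverse appearance overlaps it.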

Solution

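• Build a generalized suffix tree for S and its reverse S^R

S = abcdelmnedcblmnknm, i = start index of the repeat in S (red in the figure) = 2

S^R = mnknmlbcdenmledcba, j = start index of the repeat in S^R (green in the figure) = 7 = |S|-|REP|-(start index of the green copy in S)+2

So there is no overlap if i+|REP|-1 < |S|-|REP|-j+2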
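The index relation can be checked on the example with a few lines of Python. The helper names `inverted_copy_start_in_s` and `occurrences_do_not_overlap` are hypothetical; all indices are 1-based as on the slide.

```python
def inverted_copy_start_in_s(s_len, rep_len, j):
    """Start index in S of the inverted copy, given its start index j in S^R."""
    return s_len - rep_len - j + 2

def occurrences_do_not_overlap(s_len, rep_len, i, j):
    """The forward copy ends at i+|REP|-1; it must end before the inverted copy starts."""
    return i + rep_len - 1 < inverted_copy_start_in_s(s_len, rep_len, j)

S = "abcdelmnedcblmnknm"
SR = S[::-1]
REP = "bcde"
i = S.index(REP) + 1    # 2
j = SR.index(REP) + 1   # 7
print(inverted_copy_start_in_s(len(S), len(REP), j))       # 9, where edcb starts in S
print(occurrences_do_not_overlap(len(S), len(REP), i, j))  # True
```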

Solution

• Each node is marked if it has leaves from both S and S^R below it. Initialize MAX ← 0

• The postorder search will proceed as follows:

• If v does not have marked descendants and v is marked, compare the indices of the leaves with minimal indices in S, SR.

• The indices give the maximal repeat with no overlap. If >MAX, update MAX.

Solution

• The midpoint between the end of the first appearance and the start of the inverted appearance is (i+|REP|-1+|S|-|REP|-j+2)/2=(|S|+1+i-j)/2

• The length of the non-overlapping string is (|S|+1+i-j)/2-i

Exercise from homework

• We want to align a gene x to a genome y

• x appears in y starting at position i

• We want to align x or part of it to y, such that neither x nor any part of it is aligned to itself

Solution

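[Figure: x aligned against y, with positions i and i+|x| marked on y. One alignment is labelled “Not good!” and two alignments are labelled “OK”.]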

Solution

• Global or local alignment?

Local, because alignments of parts of x are also acceptable solutions, and we need to find the highest scoring solution.

• Which solutions do we need to exclude?

Every solution in which some x[j] is aligned to y[i+j]
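One way to implement this exclusion rule, sketched in Python, is a local alignment in which only the diagonal (pairing) move of x[j] with y[i+j] is forbidden, while gap moves through those cells stay allowed. This is an assumption-laden illustration rather than the solution shown in the figures: the function name `local_align_excluding_self`, the 0-based indices and the unit scores are all choices made here.

```python
def local_align_excluding_self(x, y, i, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of x against y in which x[j] is never
    paired with y[i+j] (0-based), i.e. x is never aligned to its own copy."""
    n, m = len(x), len(y)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for r in range(1, n + 1):
        for c in range(1, m + 1):
            candidates = [0, H[r - 1][c] + gap, H[r][c - 1] + gap]
            if c - 1 != i + (r - 1):          # pairing x[r-1] with y[c-1] is allowed
                s = match if x[r - 1] == y[c - 1] else mismatch
                candidates.append(H[r - 1][c - 1] + s)
            H[r][c] = max(candidates)
            best = max(best, H[r][c])
    return best

# x occurs in y at position 2 only: the self-alignment (score 4) is excluded.
print(local_align_excluding_self("ACGT", "TTACGTTT", 2))      # 1
# A second copy of x at position 8 may still be aligned in full.
print(local_align_excluding_self("ACGT", "TTACGTTTACGT", 2))  # 4
```

Because only the pairing move is blocked, a path can still pass through a diagonal cell with a gap move, so alignments that cross the diagonal are not lost.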

Solution

[Figure: DP matrices of x against y, with positions i and i+|x| marked on y.]

Solution

• Can we set the diagonal to -∞ instead?

No, this will disregard solutions that cross the diagonal.

[Figure: DP matrix of x against y, with positions i and i+|x| marked on y, showing an example of such a solution.]

Another exam question

• Question from exam: Given K strings, denote by l(i) the length of the longest common (contiguous) substring of at least i of the input strings. Compute l(2),…,l(K).

This is the k-common substring problem, for all possible values of k.

Solution

• We have seen that for a specific k the problem can be solved in time linear in the sum of input string lengths

Solution

• Claim: After the update procedure is completed, every node contains exactly the number of distinct strings in its subtree

Proof: Induction on node height.

Base: Node v – all the children are leaves. All children are direct children. They appear consecutively in the DFS and sum in their LCA, which is v.

Solution

• Step: Let v be a node of height i>1. In all the subtrees that are rooted in its descendants, duplications are counted correctly. It remains to see what happens to duplications in subtrees of different descendants.

Solution

[Figure: a node v whose subtrees contain the nodes z and w.]

Solution

• Duplications in different subtrees are counted in v. Therefore all the duplications are counted correctly.

• How do we use the information that we computed in order to solve the question?

• Traverse the suffix tree; for every node v with j distinct strings, update l(j) if v’s string depth is larger than l(j).

Solution

• Are we done?

Not yet: a substring common to at least i strings is also common to at least i-1 of them. Traverse the array l from the last entry to the first and override l(i-1) with l(i) if l(i) is larger.
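The last two steps (updating l(j) per node, then the backward pass) can be sketched in Python. To stay short, the sketch replaces the generalized suffix tree by a brute-force table from every substring to the set of strings that contain it, so it is not linear time; the function name `common_substring_lengths` is hypothetical.

```python
def common_substring_lengths(strings):
    """Return {i: l(i)} for i = 2..K, where l(i) is the length of the longest
    substring common to at least i of the K input strings."""
    K = len(strings)
    # Brute-force stand-in for the suffix tree: substring -> strings containing it.
    owners = {}
    for idx, s in enumerate(strings):
        for a in range(len(s)):
            for b in range(a + 1, len(s) + 1):
                owners.setdefault(s[a:b], set()).add(idx)
    l = [0] * (K + 1)
    # A substring contained in exactly j distinct strings updates l(j).
    for sub, who in owners.items():
        j = len(who)
        l[j] = max(l[j], len(sub))
    # Backward pass: override l(i-1) with l(i) if l(i) is larger.
    for i in range(K, 1, -1):
        l[i - 1] = max(l[i - 1], l[i])
    return {i: l[i] for i in range(2, K + 1)}

print(common_substring_lengths(["sandollar", "sandlot", "handler", "grand", "pantry"]))
# {2: 4, 3: 3, 4: 3, 5: 2}
```

In this example the backward pass matters: the longest substring contained in exactly 3 strings has length 1, while "and" is common to 4 strings, so l(3) is raised to 3.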

Another homework exercise

• Longest Common Prefix: Given a set of k strings of length n each, give an algorithm that finds the longest common prefix for every pair of strings. The total time should be O(kn + p) where p is the number of pairs of strings having a common prefix of length > 0.

Trie

• A string is represented in one path only

[Figure: trie for the strings { aeef, ad, bbfe, bbfg, c }]

Solution

• We can construct a trie from the set of strings in the question. This will take O(k·n).

• Can we then find the LCA for every pair of leaves?

• We can, but it will take O(k²) and the total running time will be O(k·n+k²)

Solution

• A simple trick: We will add each leaf to a group of sequences that start with the same nucleotide

• Comparison of leaves in the generated groups will take O(p)
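A minimal Python sketch of this idea, assuming a dict-of-dicts trie and a hypothetical function name `pairwise_lcp`. Instead of comparing all pairs of leaves, each trie node of depth d ≥ 1 reports only the pairs of strings that separate at that node, so the root (depth 0) never enumerates pairs with no common prefix.

```python
def pairwise_lcp(strings):
    """Return {(a, b): LCP length} for every pair of string indices a < b
    whose longest common prefix is non-empty."""
    root = {}
    for idx, s in enumerate(strings):          # build the trie: O(k*n)
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node.setdefault(None, []).append(idx)  # indices of strings ending here

    result = {}

    def collect(node, depth):
        """Return the string indices below node, pairing those that split here."""
        groups = []
        for key, child in node.items():
            if key is None:
                groups.extend([i] for i in child)
            else:
                groups.append(collect(child, depth + 1))
        seen = []
        for group in groups:
            if depth > 0:                      # pairs splitting here have LCP = depth
                for a in seen:
                    for b in group:
                        result[(min(a, b), max(a, b))] = depth
            seen.extend(group)
        return seen

    collect(root, 0)
    return result

print(pairwise_lcp(["aeef", "ad", "bbfe", "bbfg", "c"]))
# {(0, 1): 1, (2, 3): 3}
```

Every reported pair is written exactly once, which accounts for the O(p) term. The recursion depth equals the string length n, so very long strings would need an iterative traversal instead.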
