Comp. Genomics Recitation 14 Exam preparation Biological networks

Page 1: Comp. Genomics

Comp. Genomics

Recitation 14: Exam preparation, Biological networks

Page 2: Comp. Genomics

Exercise

• A large PPI network G was generated using high throughput technologies.

• A smaller network H is known in a different organism.

• Assume that there exists an efficient algorithm that determines whether there is a sub-network of G of size ≥ k that is isomorphic to H.

Page 3: Comp. Genomics

Exercise

• Two graphs (PPI networks) are said to be isomorphic if there is a bijection f between their vertex sets such that f(u) is adjacent to f(v) iff u is adjacent to v.

Page 4: Comp. Genomics

Exercise

• Show that the same algorithm can solve the following problem in polynomial time:

• CLIQUE: Given a graph G’ and an integer k, is there a clique of size ≥ k in G’?

Page 5: Comp. Genomics

Solution

• Given a graph G’ and a number k, we create another graph H’ of size k in which there is an edge between every two vertices

• This takes polynomial time

• We run the original algorithm on (G’,H’) and answer the same
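The reduction above can be sketched in Python. The function `has_isomorphic_subnetwork` is only a brute-force stand-in for the efficient algorithm that the exercise assumes; its name, the graph representation (a vertex list plus a set of `frozenset` edges) and the helper `complete_graph` are hypothetical choices made for this illustration.

```python
from itertools import combinations, permutations

def has_isomorphic_subnetwork(G, H, k):
    """Brute-force stand-in for the assumed efficient algorithm."""
    (g_vertices, g_edges), (h_vertices, h_edges) = G, H
    h_vertices = list(h_vertices)
    if len(h_vertices) < k:
        return False
    # Try every injective mapping of H's vertices into G's vertices.
    for image in permutations(g_vertices, len(h_vertices)):
        f = dict(zip(h_vertices, image))
        if all((frozenset((f[u], f[v])) in g_edges) == (frozenset((u, v)) in h_edges)
               for u, v in combinations(h_vertices, 2)):
            return True
    return False

def complete_graph(k):
    """H': k vertices with an edge between every two of them (polynomial time)."""
    vertices = list(range(k))
    return vertices, {frozenset(pair) for pair in combinations(vertices, 2)}

def has_clique(G, k, oracle=has_isomorphic_subnetwork):
    """CLIQUE via one call to the sub-network isomorphism algorithm."""
    return oracle(G, complete_graph(k), k)

# Example graph: a triangle a-b-c with a pendant vertex d attached to c.
example = (["a", "b", "c", "d"],
           {frozenset(e) for e in [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]})
```

Only `complete_graph` and the single oracle call belong to the reduction itself; the brute-force search is there just to make the sketch runnable.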

Page 6: Comp. Genomics

Exercise

• Show that the algorithm from the previous question can also solve the following problem:

• Input:
    • A set of elements X=(x1,x2,…,xn)
    • A distance function d(xi,xj)=1 if xi and xj are “close”, 0 otherwise

• Output: Can the set be divided into at most k clusters such that all the element pairs in every cluster are close?

Page 7: Comp. Genomics

Solution

• Build a graph G in which an edge (xi,xj) means d(xi,xj)=1

• Use the previous algorithm to find a clique of maximal size (decision problem → optimization problem)

• Find the clique and remove it from the graph

• Repeat at most k times. If the result is the empty graph, answer ‘Yes’. Otherwise answer ‘No’.
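A minimal sketch of the procedure on this slide, assuming a brute-force search in place of the clique algorithm from the previous question. The names `max_clique`, `can_cluster` and the `close` predicate are hypothetical. The sketch follows the greedy removal exactly as described; greedy removal of a maximum clique is not guaranteed to use the fewest clusters on every input, so treat it as an illustration of the steps rather than a proven procedure.

```python
from itertools import combinations

def max_clique(vertices, close):
    """Brute-force stand-in: largest set of pairwise close elements."""
    for size in range(len(vertices), 0, -1):
        for candidate in combinations(vertices, size):
            if all(close(u, v) for u, v in combinations(candidate, 2)):
                return set(candidate)
    return set()

def can_cluster(elements, close, k):
    """Remove a maximum clique at most k times; 'Yes' iff nothing is left."""
    remaining = list(elements)
    for _ in range(k):
        if not remaining:
            break
        clique = max_clique(remaining, close)
        remaining = [x for x in remaining if x not in clique]
    return not remaining

# Example: elements 0..5, where two elements are close iff they share a block of three.
def same_block(u, v):
    return u // 3 == v // 3
```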

Page 8: Comp. Genomics

Moed B 26.2.2010

• You are given a set of strings S1, S2, …, Sk of length C each, and each string Si is associated with a positive score Bi.

• Si appears in an alignment if there is a sequence of gapless matches in the alignment that contains Si.

• We subtract Bi from the score of an alignment for every appearance of Si, including overlapping appearances.

Describe a global alignment algorithm.

Page 9: Comp. Genomics

Question

• True or false: The following algorithm is a global alignment algorithm for the problem:

• For every cell [i,j] in the DP matrix we will save the number of consecutive matches that the optimal alignment between x1,…,xi and y1,…,yj has made since the last gap. If this value is ≥ C we will check every Si and subtract Bi as needed.

Page 10: Comp. Genomics

Solution

• The suggested algorithm does not work. Counter-example:

S[G,G]=10, S[A,A]=1, indel=-1
S1=AAAG, B1=-100

         A    A    A    G
    0   -1   -2   -3   -4
A  -1
A  -2
A  -3
G  -4

Page 11: Comp. Genomics

Solution

         A    A    A    G
    0   -1   -2   -3   -4
A  -1    1    0   -1   -2
A  -2    0
A  -3   -1
G  -4   -2

         A    A    A    G
    0   -1   -2   -3   -4
A  -1    1    0   -1   -2
A  -2    0    2    1    0
A  -3   -1    1
G  -4   -2    0

Page 12: Comp. Genomics

Solution

         A    A    A    G
    0   -1   -2   -3   -4
A  -1    1    0   -1   -2
A  -2    0    2    1    0
A  -3   -1    1    3    2
G  -4   -2    0    2

         A    A    A    G
    0   -1   -2   -3   -4
A  -1    1    0   -1   -2
A  -2    0    2    1    0
A  -3   -1    1    3    2
G  -4   -2    0    2    1

Alignment found (score 1):
AAA_G
AAAG_

Optimal alignment (score 10):
_AAAG
A_AAG
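The two scores above can be checked with a small scorer. This is a sketch under stated assumptions: `score_alignment`, `sub` and the `seeds` dictionary are hypothetical names, the mismatch score of -1 is invented because the slide gives none, and the seed penalty is passed as a positive magnitude (100) that is subtracted, which is one reading of the slide's B1=-100.

```python
def score_alignment(top, bottom, sub, indel, seeds):
    """Score two gapped rows; subtract a penalty for every seed occurrence
    inside a gapless run of identical columns (overlaps included)."""
    total = 0
    run = ""  # characters of the current gapless run of matches
    for a, b in zip(top, bottom):
        if a == "_" or b == "_":
            total += indel
            run = ""
            continue
        total += sub(a, b)
        run = run + a if a == b else ""
        # A seed is counted once, at the column where its last character lands.
        for seed, penalty in seeds.items():
            if run.endswith(seed):
                total -= penalty
    return total

def sub(a, b):
    # Scores from the counter-example; the mismatch value is an assumption.
    return {("A", "A"): 1, ("G", "G"): 10}.get((a, b), -1)

seeds = {"AAAG": 100}
found = score_alignment("AAA_G", "AAAG_", sub, -1, seeds)
optimal = score_alignment("_AAAG", "A_AAG", sub, -1, seeds)
diagonal = score_alignment("AAAG", "AAAG", sub, -1, seeds)
```

The gapless diagonal alignment scores 13 before the penalty, so it is the one the seed penalty rules out.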

Page 13: Comp. Genomics

Question

• True or false: The algorithm that worked for positive bonuses will work here too: add terms of the following form to the recursive update rule:

$-I_{S_i} \cdot B_i + \sum_{k=0}^{3} S[i-k,\, j-k]$

where $I_{S_i}$ is 1 if the nucleotides i-3,…,i and j-3,…,j are the seed Si, and ∞ otherwise. The last component is the normal score for matching 4 nucleotides.

Page 14: Comp. Genomics

Solution

• It will not work here.

• Since $-I_{S_i} \cdot B_i \le 0$, and since the option of four consecutive matches is also considered, the algorithm will never use the new update rule

• The score that the algorithm computes will not be consistent with the scoring scheme

Page 15: Comp. Genomics

Question

• What is the correct algorithm?

• Divide every cell of the DP matrix into C+1 cells

• The cell M[i,j,k] represents the optimal alignment between X and Y that ends with k matches

Page 16: Comp. Genomics

Solution

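\[
M[i,j,k] =
\begin{cases}
\max\left\{ \max\limits_{0 \le l \le C} M[i,j-1,l] + \sigma(-,y_j),\ \max\limits_{0 \le l \le C} M[i-1,j,l] + \sigma(x_i,-) \right\} & k = 0 \\[4pt]
M[i-1,j-1,k-1] + \sigma(x_i,y_j) & 0 < k < C \\[4pt]
\max\limits_{C-1 \le l \le C} M[i-1,j-1,l] + \sigma(x_i,y_j) - B(x_{i-C+1},\ldots,x_i) & k = C
\end{cases}
\]

Here $\sigma$ is the usual match/indel score, and $B(x_{i-C+1},\ldots,x_i)$ is $B_i$ if the last C characters are the seed $S_i$ and 0 otherwise.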
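A minimal Python sketch of a DP with C+1 layers per cell, in the spirit of the recurrence on this slide. Several details are assumptions rather than part of the slide: a "match" is taken to mean identical characters, a mismatch column resets k to 0, the penalty is given as a positive magnitude that is subtracted, and the names `align_with_seed_penalties`, `sub` and `seeds` are hypothetical.

```python
NEG = float("-inf")

def align_with_seed_penalties(x, y, sub, indel, seeds, C):
    """Global alignment score where every occurrence of a length-C seed
    inside a gapless run of matches costs seeds[seed]."""
    n, m = len(x), len(y)
    # M[i][j][k]: best score for x[:i] vs y[:j] ending with k consecutive
    # matches, where k is capped at C (layer C means "C or more").
    M = [[[NEG] * (C + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    M[0][0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            cell = M[i][j]
            # A gap resets the run of matches, so it always lands in layer 0.
            if i > 0:
                cell[0] = max(cell[0], max(M[i - 1][j]) + indel)
            if j > 0:
                cell[0] = max(cell[0], max(M[i][j - 1]) + indel)
            if i > 0 and j > 0:
                s = sub(x[i - 1], y[j - 1])
                prev = M[i - 1][j - 1]
                if x[i - 1] != y[j - 1]:
                    # Assumption: a mismatch also resets the run.
                    cell[0] = max(cell[0], max(prev) + s)
                else:
                    for l in range(C + 1):
                        if prev[l] == NEG:
                            continue
                        k = min(l + 1, C)
                        score = prev[l] + s
                        if k == C:
                            # Exactly one new length-C window ends here.
                            score -= seeds.get(x[i - C:i], 0)
                        cell[k] = max(cell[k], score)
    return max(M[n][m])

def sub(a, b):
    # Scores from the earlier counter-example; the mismatch value is assumed.
    return {("A", "A"): 1, ("G", "G"): 10}.get((a, b), -1)
```

On the earlier counter-example (x = y = AAAG, seed AAAG with penalty 100) the sketch returns 10, the score of the optimal alignment shown there, and 13 when no seeds are given.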

Page 17: Comp. Genomics

Solution

• Correctness: Assume that we have the correct values for all cells M[i’,j’,k’] that precede M[i,j,k] and we want to compute the score at the cell M[i,j,k].

• If k<C, then we are not creating a sequence of C matches, and therefore by the inductive assumption and the defined operations M[i,j,k] will contain the optimal score.

Page 18: Comp. Genomics

Solution

• If k≥C, and the last C characters are not in {Si}, we are done for the same reasons.

• If k≥C and the last C characters are in {Si}, then there are several options:

• The optimal alignment contains the seed. Since we are checking the cells M[i-1,j-1,k-1] and M[i-1,j-1,k], we will obtain the score of the optimal alignment.

Page 19: Comp. Genomics

Solution

• If k≥C and the last C characters are in {Si}, then the other option is:

• The seed is not in the optimal alignment. Since the alignment between [i-C+1,…,i] and [j-C+1,…,j] does not contain C consecutive matches, but ends with a match, its prefix which aligns [i-C+1,…,i-1] with [j-C+1,…,j-1] must end with 0,1,…,C-2 matches. Hence the optimal alignment between [1,…,i-1] and [1,…,j-1] ends with 0,1,…,C-2 matches.

• By the inductive assumption, we have the optimal score for all these alignments, and the update rule tests them all.

Page 20: Comp. Genomics

Moed B 26.2.2010

• An inverted-repeat is an appearance of some sequence and its inverse in a string, without overlapping. For example, in the string abcdelmnedcblmnknm there is an inverted repeat of size 4, because bcde and its inverse appear in it, and do not overlap. The sequence lmn appears twice but in the same order and therefore it does not constitute an inverted repeat. The sequence mnk appears twice but the two appearances overlap, and therefore it does not constitute an inverted repeat either. Describe a linear time algorithm for finding the longest inverted repeat in a string.

Page 21: Comp. Genomics

Solution

• Build a suffix tree for S and S^R (the reverse of S)

S = abcdelmnedcblmnknm, i = start index of the repeat in S = 2
S^R = mnknmlbcdenmledcba, j = start index of the repeat in S^R = 7 = |S| - |REP| - (start index of the inverted copy in S) + 2

So there is no overlap if i+|REP|-1 < |S|-|REP|-j+2
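The index arithmetic above can be checked on the slide's own example. This is only a sketch of the index conversion with 1-based indices as on the slide, not of the suffix tree algorithm; the helper names `inverted_copy_start` and `no_overlap` are hypothetical.

```python
def inverted_copy_start(s_len, rep_len, j):
    """Start index in S (1-based) of the inverted copy, given the
    start index j (1-based) of the repeat in the reversed string."""
    return s_len - rep_len - j + 2

def no_overlap(s_len, rep_len, i, j):
    """True if the repeat starting at i ends before the inverted copy starts."""
    return i + rep_len - 1 < inverted_copy_start(s_len, rep_len, j)

S = "abcdelmnedcblmnknm"
S_rev = S[::-1]
rep = "bcde"
i = S.index(rep) + 1      # 1-based start of the repeat in S
j = S_rev.index(rep) + 1  # 1-based start of the repeat in the reversed string
start = inverted_copy_start(len(S), len(rep), j)
```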

Page 22: Comp. Genomics

Solution

• Each node is marked if it has children from both S and S^R. Initialize MAX ← 0

• The postorder search will proceed as follows:

• If v does not have marked descendants and v is marked, compare the indices of the leaves with minimal indices in S and S^R.

Page 23: Comp. Genomics

Solution


• The indices give the maximal repeat with no overlap. If >MAX, update MAX.

Page 24: Comp. Genomics

Solution

• The section between the end of the first appearance and the start of the inverted appearance is (i+|REP|-1+|S|-|REP|-j+2)/2=(|S|+1+i-j)/2

• The length of the non-overlapping string is (|S|+1+i-j)/2-i

Page 25: Comp. Genomics

Exercise from homework

• We want to align a gene x to a genome y

• x appears in y starting at position i

• We want to align x or part of it to y, but such that x or parts of it will not be aligned to themselves

Page 26: Comp. Genomics

Solution

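[Figure: x aligned against y, with positions i and i+|x| marked on y. One alignment is labeled "Not good!" and two alignments are labeled "OK".]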

Page 27: Comp. Genomics

Solution

• Global or local alignment?

Local, because alignments of parts of x are also acceptable solutions, and we need to find the highest-scoring solution

• Which solutions do we need to exclude?

Every solution in which some x[j] is aligned to y[i+j]
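One way to exclude exactly those solutions can be sketched as a local alignment in which only the diagonal step that pairs x[j] with its own copy in y is forbidden, while gap steps through the same cells stay allowed. This is an assumption about the intended solution, since the slides show it only as a figure. The function name, the 0-based `start` parameter (position i) and the scoring values are hypothetical.

```python
def local_align_excluding_self(x, y, start, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman score where x[a] may never be aligned to y[start + a]."""
    n, m = len(x), len(y)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for a in range(1, n + 1):
        for b in range(1, m + 1):
            candidates = [0, H[a - 1][b] + gap, H[a][b - 1] + gap]
            # Forbid only the diagonal step that pairs x[a-1] with its own copy.
            if b - 1 != start + (a - 1):
                s = match if x[a - 1] == y[b - 1] else mismatch
                candidates.append(H[a - 1][b - 1] + s)
            H[a][b] = max(candidates)
            best = max(best, H[a][b])
    return best
```

For example, with x = ACGT and y = ACGTACGT (x taken from position 0), the second copy of ACGT is still found with score 4, while with y = ACGT no positive-scoring alignment remains.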

Page 28: Comp. Genomics

Solution

[Figure: two diagrams of x against y, with positions i and i+|x| marked.]

Page 29: Comp. Genomics

Solution

• Can we set the diagonal to -∞ instead?

No, this will disregard solutions that cross the diagonal, e.g.:

[Figure: x against y with positions i and i+|x| marked, showing an alignment that crosses the diagonal.]

Page 30: Comp. Genomics

Another exam question

• Question from exam: Given K strings, denote by l(i) the length of the longest common (contiguous) substring of at least i of the input strings. Compute l(2),…,l(K)

This is the k-common substring problem, for all possible k values

Page 31: Comp. Genomics

Solution

• We have seen that for a specific k the problem can be solved in time linear in the sum of input string lengths

Page 32: Comp. Genomics

Solution

Page 33: Comp. Genomics

Solution


Page 34: Comp. Genomics

Solution

• Claim: After the update procedure is completed, every node contains exactly the number of distinct strings in its subtree

Proof: Induction on node height.

Base: Let v be a node whose children are all leaves. All of them are direct children of v. They appear consecutively in the DFS and sum in their LCA, which is v

Page 35: Comp. Genomics

Solution

Page 36: Comp. Genomics

Solution

• Step: Let v be a node of height i>1. In all the subtrees that are rooted in its descendants, duplications are counted correctly. It remains to see what happens to duplications in subtrees of different descendants.

Page 37: Comp. Genomics

Solution

[Figure: node v with descendants z and w; two counters with value 1 are shown.]

Page 38: Comp. Genomics

Solution

• Duplications in different subtrees are counted in v. Therefore all the duplications are counted correctly.

• How do we use the information that we computed in order to solve the question?

• Traverse the suffix tree; for every node v with j distinct strings, update l(j) if v’s string depth is larger than l(j)

Page 39: Comp. Genomics

Solution

• Are we done?

Traverse the array l from the last entry to the first and overwrite l(i-1) with l(i) if l(i) is larger
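The last two steps (updating l(j) per node, then the backward pass over l) can be illustrated without a suffix tree. The sketch below swaps in a brute-force enumeration of all substrings, so it is far from linear time; it only demonstrates how the array l is filled and then corrected. The function name is hypothetical.

```python
def longest_common_substring_lengths(strings):
    """Return {i: l(i)} for i = 2..K, using brute force instead of a suffix tree."""
    K = len(strings)
    owners = {}  # substring -> set of indices of the strings that contain it
    for idx, s in enumerate(strings):
        for a in range(len(s)):
            for b in range(a + 1, len(s) + 1):
                owners.setdefault(s[a:b], set()).add(idx)
    l = [0] * (K + 1)
    # Each substring plays the role of a node with j distinct strings.
    for substring, who in owners.items():
        j = len(who)
        l[j] = max(l[j], len(substring))
    # Backward pass: a substring common to i strings is also common to i-1.
    for i in range(K, 1, -1):
        l[i - 1] = max(l[i - 1], l[i])
    return {i: l[i] for i in range(2, K + 1)}
```

With three copies of "ab", no substring occurs in exactly two strings, so l(2) stays 0 until the backward pass raises it to 2.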

Page 40: Comp. Genomics

Another homework exercise

• Longest Common Prefix: Given a set of k strings of length n each, give an algorithm that finds the longest common prefix for every pair of strings. The total time should be O(kn + p) where p is the number of pairs of strings having a common prefix of length > 0.

Page 41: Comp. Genomics

Trie

• A string is represented in one path only

{ aeef ad bbfe bbfg c }

root
├─ a
│  ├─ e ─ e ─ f
│  └─ d
├─ b ─ b ─ f
│          ├─ e
│          └─ g
└─ c

Page 42: Comp. Genomics

Solution

• We can construct a trie from the set of strings in the question. This will take O(k·n).

• Can we then find the LCA for every pair of leaves?

• We can, but it will take O(k^2) and the total running time will be O(k·n+k^2)

Page 43: Comp. Genomics

Solution

• A simple trick: We will add each leaf to a group of sequences that start with the same nucleotide

• Comparison of leaves in the generated groups will take O(p)
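A minimal sketch of this idea in Python, assuming a dictionary-based trie. It generalizes the grouping trick to every depth: pairs are reported only at non-root nodes, so pairs with no common prefix are never touched. The function name and the node layout (`children`, `ends`) are hypothetical.

```python
def pairwise_lcp(strings):
    """Return {(a, b): LCP length} for all index pairs a < b whose
    strings share a common prefix of length > 0."""
    root = {"children": {}, "ends": []}
    for idx, s in enumerate(strings):
        node = root
        for ch in s:
            node = node["children"].setdefault(ch, {"children": {}, "ends": []})
        node["ends"].append(idx)

    result = {}

    def visit(node, depth):
        # Each string ending here, and each child subtree, is a separate group.
        groups = [[e] for e in node["ends"]]
        for child in node["children"].values():
            groups.append(visit(child, depth + 1))
        merged = []
        for group in groups:
            if depth > 0:
                # Strings from different groups first diverge at this node.
                for a in merged:
                    for b in group:
                        result[(min(a, b), max(a, b))] = depth
            merged.extend(group)
        return merged

    visit(root, 0)
    return result

lcp = pairwise_lcp(["aeef", "ad", "bbfe", "bbfg", "c"])
```

On the trie example above, only two pairs share a prefix: aeef/ad with length 1 and bbfe/bbfg with length 3.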