View
35
Download
2
Category
Preview:
DESCRIPTION
A simple, practical and complete O( n 3 /log(n))-time Algorithm for RNA folding using the Four-Russians Speedup. Yelena Frid and Dan Gusfield Department of Computer Science UC Davis. Background. - PowerPoint PPT Presentation
Citation preview
1
A simple, practical and complete O(n3/log(n))-time Algorithm
for RNA foldingusing the Four-Russians Speedup
Yelena Frid and Dan GusfieldDepartment of Computer Science UC Davis
2
Background
-Problem of computationally predicting the secondary structure of RNA molecules was first introduced more than thirty years ago by:
Nussinov et. Al. (1978, 1980)Waterman and Smith (1978)
Zuker M. and Stiegler (1981)
They presented Dynamic programming solutions with an asymptotic runtime of O(n3).
Where n is the length of the RNA sequence.
3
There have been several improvements to the running time of this problem:
• A complex worse case speed up O(n3*(loglogn)/(logn) 1/2)– (Akutsu 1999)
• Practical heuristic speed up O(nP) where P in [n, n2] (Wexler 2007, Backofen, 2009). This is still O(n3) in terms of n. It is important to note
that P is empirically much less then n2.
Our approach has an O(n3/log(n)) running time using the FOUR RUSSIANS speedup.
4
Four Russians is well known
- The Four Russians method is a general technique for speeding up Dynamic programs.
- The method involves extensive preprocessing of a wide-range of possible inputs, before any actual input is seen.
However, it has not been applied to RNA folding, despite an ‘expectation’ that it could be.
5
Possible reasons for difficulty in applying the FOUR RUSSIANS speedupI. Widely exposed version of the original dynamic-
programming algorithm does not lend itself to application of the Four-Russians technique. (Solution: choose a different order of evaluation.)
II. It doesn’t seem possible to separate the preprocessing and the computation. (Solution: interleave the two.)
We made use of the insights presented in Graham et. al. (1980) paper which gives a Four-Russians solution to the problem of Context-Free Language recognition, where similar problems were encountered.
6
Basic RNA-folding problem
- Find a maximum number of non-crossing matches of complimentary nucleotides in an RNA sequence of length n.
Enhancements:
-Richer scoring schemes
-Minimum energy calculations DP.
7
Input to the basic RNA-folding problem consists of a string K of length n over a four-letter alphabet {A,U,C,G}
A matching consists of non-crossing disjoint pairs of sites in K.
ii i’ j j’
B(i,j)
B(i,j) is the score given to the match between nucleotides in sites i and j of K.
In a simple scoring scheme a score of 1 is given for complementary pairs ( A, U or C, G), otherwise 0.
9
Introduction to the original O(n3) DP
Let S(i,j) represent the score for the optimal folding of the subsequence consisting of the sites in K from i to j>i inclusively.
The DP recurrence relations are based on the different possibilities of pairing nucleotides.
10
S(i,j) will equal the maximum of :
S(i+1,j-1)+B(i,j) S(i, j-1)
11
Head Tail
S(i+1, j)
12
As a result the recurrences for the optimal fold score S(i,j) are:
Head Tail
13
A U C G G C A U C A A
0 1 2 3 4 5 6 11 12
0 S[0,1] S[0,2] S[0,3] S[0,4] S[0,5] S[0,6] S[0,11]
1 S[1,2] S[1,3] S[0,4] S[0,5] S[1,6] S[1,11]
2 S[2,3] S[2,4] S[2,5] S[2,6] S[2,11]
3 S[3,4] S[3,5] S[3,6] S[3,11]
4 S[4,5] S[4,6] S[4,11]
5 S[5,6] S[5,11]
6 S[6,11]
7 S[7,11]
8 S[8,11]
9 S[9,11]
j
……iUsually evaluated
in increasing length of subsequence i.e. distance between i and j.
14
A U C G G C A U C A A
0 1 2 3 4 5 6 11 12
0 S[0,1] S[0,2] S[0,3] S[0,4] S[0,5] S[0,6] S[0,11]
1 S[1,2] S[1,3] S[0,4] S[0,5] S[1,6] S[1,11]
2 S[2,3] S[2,4] S[2,5] S[2,6] S[2,11]
3 S[3,4] S[3,5] S[3,6] S[3,11]
4 S[4,5] S[4,6] S[4,11]
5 S[5,6] S[5,11]
6 S[6,11]
7 S[7,11]
8 S[8,11]
9 S[9,11]
j
……i
-We choose an alternative permissible order.
-The algorithm will evaluate in order of increasing j and decreasing i.
15
for j= 2 to n
for i= 1 to j-1 S(i+1,j-1)+B(i,j) Rule a
S(i,j-1) Rule b
for i = j-1 to 1
S(i,j) = max( S(i,j), S(i+1,j) ) Rule c
for k=j-1 to i+1 {Rule d loop}
S(i,j)=max( S(i,j) , S(i,k-1)+S(k,j) ) Rule d
S(i,j)=maxO(n2)
16
for j= 2 to n
for i= 1 to j-1 S(i+1,j-1)+B(i,j) Rule a
S(i,j-1) Rule b
for i = j-1 to 1
S(i,j) = max( S(i,j), S(i+1,j) ) Rule c
for k=j-1 to i+1 {Rule d loop}
S(i,j)=max( S(i,j) , S(i,k-1)+S(k,j) )
S(i,j)=maxO(n)
O(n2)
O(n)
O(n3)
Bottleneck
17
for j= 2 to n
for i= 1 to j-1 S(i+1,j-1)+B(i,j) Rule a
S(i,j-1) Rule b
for i = j-1 to 1
S(i,j) = max( S(i,j), S(i+1,j) ) Rule c
for k=j-1 to i+1 {Rule d loop}
S(i,j)=max( S(i,j) , S(i,k-1)+S(k,j) )
S(i,j)=max
Idea: Change the decrement by one to a constant number of operations in each group of size q.
Four Russians speedup: first hint
What will Rule d loop look like then?
19
Computation components for speeding up Rule d loop
There are several components for speeding up Rule d loop.• Rgroup• Vg vector• Little vg vector• k*• R Table
At first these may seem unclear and extraneous but will lead to a speed-up.
20
Rgroup g
Conceptually divide each column in the S matrix into groups of q rows. Each is called an Rgroup, indexed by g.
21
A U C G G C A U C A A
0 1 2 3 4 5 6 11 12
0 S[0,1] S[0,2] S[0,3] S[0,4] S[0,5] S[0,6] S[0,11]
1 S[1,2] S[1,3] S[1,4] S[1,5] S[1,6] S[1,11]
2 S[2,3] S[2,4] S[2,5] S[2,6] S[2,11]
3 S[3,4] S[3,5] S[3,6] S[3,11]
4 S[4,5] S[4,6] S[4,11]
5 S[5,6] S[5,11]
6 S[6,11]
7 S[7,11]
8 S[8,11]
9 S[9,11]
j
……i
Rgroup example where q=3
Rgroup 0
Rgroup 1
Rgroup 2
22
Computation components for speeding up Rule d loop
• Rgroup
• Vg vector
• Little vg vector
• k*
• R Table
23
Vg vectorFor a fixed j, consider an Rgroup g consisting of rows
z, z-1, … z-q+1 for some z<j.
Let Vg = {S(z,j), S(z-1, j) ,… S(z-q+1,j)} for all Rgroups g.
V0
V1
24
Computation components for speeding up Rule d loop
• Rgroup
• Vg vector
• Little vg vector
• k*
25
Little vg vector
Observe that for the simple scoring scheme:
B(i,j)=1 when (i,j) is a permitted matchB(i,j)=0 when (i,j) is not a permitted match
S(z-1,j) – S(z,j) = {1 or 0}.
V0V15-4=1
5-5=0
Hence, the difference between consecutive values in any Vg vector are either 0 or 1.
26
Little vg vector
The changes in consecutive values of Vg could therefore be encoded into little vg ,a vector of length q-1.
V0V1
encode*
5-4=1
5-5=0 v0
Base of V0
27
Little vector vg
V1
V0
encode
3-3=0
4-3=1 v1
- Encoding each Vg into little vg takes O(q) time.
- Each column has j/q different Rgroups.
- O(n) time to do encoding for the entire column.
Decoding the offset- non formal definition
Decode: vg V`;
where
- V’ is a vector of length q-1
- if V`[l]=4 then Vg[l]=x+4 where x is the value in Vg(0) or the base of the Vg vector.
29
decode formal definition and exampleIt will be a running sum of the values in little vector vg.
decode: vg ->V’ ; where Example q=7
Vgencode
7-7=0
8-7=1
6-5=1
7-6=1
5-4=1
5-5=0
vg
3+0=3
3+1=4
1+1=2
2+1=3
1
1+0=1
V’decode
Offset from the base
base
30
Computation components for speeding up Rule d loop
• Rgroup
• Vg vector
• Little vg vector
• k*
• Table R
31
Defining k* for an Rgroup g
In column j, let k*(i,g,j)
be the index k such that the sum
S(i,k-1)+S(k,j)
is maximized over the indices in Rgroup g.
32
Example
S(3,4)+S(5,11)S(3,5)+S(6,11)S(3,6)+S(7,11)
For example, assuming column j=11 and row i=3:
To directly compute k*(i,g,j) for Rgroup1 (g=1) we would have to find max of
and store the k that corresponds to that max bipartition.
This would require O(q) time operations.
33
Computation components for speeding up Rule d loop
• Rgroup
• Vg vector
• Little vg vector
• k*
• Table R
Table R
Introducing Table R (currently a black box) that given (i,g,vg)
returns k*(i,g,j)
in O(1) operations.
35
for j= 2 to n
….
for i = j-1 to 1
for k=j-1 to i+1 get vg given Rgroup g
retrieve k*(i,g,j) from R(i,g,vg)
S(i,j)=max( S(i,j) , S(i,k*-1)+S(k*,j) )
Modified Rule d loop using Table R
for g=(j-1)/q to (i+1)/q {Rule d loop}
36
for j= 2 to n
….
for i = j-1 to 1for g=(j-1)/q to (i+1)/q {Rule d loop: where Rgroup g is fully
complete}
get vg given Rgroup g
retrieve k*(i,g,j) from R(i,g,vg)
S(i,j)=max( S(i,j) , S(i,k*-1)+S(k*,j) )
Run-time of Modified Rule d loop using Table R
The asymptotic run-time for Rule d loop is O(n2/q)
O(n)
O(n/q)
q*n2
37
for j= 2 to n
….
for i = j-1 to 1 for g=(j-1)/q to (i+1)/q { where Rgroup g is fully complete}
get vg given Rgroup g
retrieve k*(i,g,j) from R(i,g,vg)
S(i,j)=max( S(i,j) , S(i,k*-1)+S(k*,j) )
Modified Rule d loop using Table R Having k* speeds up the
Rule d loop. How can we know k* when we need it?Answer: Four-Russians preprocessing.
38
Preprocessing
39
Preprocessing Table R: Cgroups
We conceptually divide the
columns into groups of size q. Example with q=3.
40
Preprocessing Table RAssume that we run the ‘Second Dynamic Programming Algorithm’ until j=q. That means that all the S(i,j) values in Cgroup 0 have been computed.
At this point we can compute the following:
for each binary vector v of length q-1 V’=decode(v) for each i such that i < q-1 R(i, 0, v) is set to the index k in Rgroup 0
such that S(i,k-1) + V’[k] is maximized.
41
Preprocessing in general
In general, for Cgroup g > 0, we could do a similar preprocessing after all the entries in columns of Cgroup g have been computed.
That is, R(i,g,v) could be found and stored for all i < g*q.
once Cgroup g=(j/q) is completefor each binary vector v of length q-1 V’=decode(v)
for each i such that i < q-1 R(i, g, v) is set to the index k in Rgroup g such that
S(i,k-1) + V’[k] is maximized.
O(qn 2q-1)
Total for pre-processing for all Cgroups is O(qn*2q-1) *
O(n/q)=O(n2*2q-1) time
42
RNA folding algorithm with Four-Russians Speedup
43
RNA folding algorithm with Four-Russians Speedup for j= 2 to n for i= 1 to j-1
S(i,j) = max( S(i+1,j-1)+B(i,j) , S(i,j-1))
for i = j-1 to 1 for g=(j-1)/q to (i+1)/q {Rule loop d}
get vg given Rgroup g retrieve k*(i,g,j) from R(i,g,vg)
S(i,j)=max( S(i,j) , S(i,k*-1)+S(k*,j) )
once Cgroup g=(j/q) is complete for each binary vector v of length q-1 V’=decode(v)
for each i such that i < q-1 R(i, g, v) is set to the index k in Rgroup g such that S(i,k-1) + V’[k] is maximized.
O(n2*2q-1)
O(n3/q)
44
Run-Time
Setting q= log(n) the total run-time of
O(n2*2q-1) + O(n3/q)+ O(n2q) simplifies to
O(n3/log(n)) time total. Pre-processing Computation
45
Empirical Results
- We compare our Four-Russians algorithm to the original O(n3)-time algorithm.
- The empirical results shown in Table 1 give the average time for 50 tests of randomly generated RNA sequences and 25 downloaded sequences from genBank, for each size between 1,000bp and 6,000bp.
- The algorithm performs identically for randomly generated and gen-Bank sequences of equal length.
- This is to be expected because the algorithm's run time is sequence character independent.
46
*
*seconds
47
Variations on scoring scheme B
The Four Russians Speed-Up could be extended to any B(i,j) for which no differences between S(i,j) and S(i+1,j) depend on n.
Let C denote the size of the set of all possible differences. Then asymptotic time is:
when and
48
Parallel Computing
When running this algorithm in parallel on n processors we can reduce the total computation to O(log(n)*n2).
(Could be reduced to O(n2) time by a re-ordering method; space trade off)
49
Conclusion
- A practical and easy to implement Four-Russians Speed Up algorithm for prediction of RNA secondary structure, could be applied to a wide variety of scoring schemes.
-The asymptotic time of O(n3/log(n)) is achieved by interleaving computation and preprocessing.
-This time can be further lowered to O(log(n)*n2) when using n processors in parallel.
Recommended