CFG - PSA Algorithm

Sequence Alignment Guided By Common Motifs Described By Context Free Grammars

Shachar Langbeheim & Noa Lempel, Ben Gurion University

CFG PSA Algorithm - BGUtabio122/wiki.files/CFG psa.pdf · 2012-06-14 · CNF – Chomsky normal form A CNF is a CFG in which all of the production rules are of one of the forms: V→TS


Motivation

[Figure: RNA secondary structure elements: external base, internal loop, multi-loop, bulge, hairpin loops]

• Find motifs: conserved regions that indicate a biological function or signature. Other algorithms do not always align motif regions together.

• Incorporate knowledge about common structures into the alignment process, forcing the alignment to align such common motifs.


The goal

To align the sequences so that all or some of the motif matches are included, in an order that optimizes the resulting score.

Example

A(AA,AA) + A(GCA,GA) + A(UAU,UA) + A(GC,CGC) + β(G1) + β(G2) + β(G3)

$$\max \left[ \sum_{i=1}^{k+1} A(z_i, w_i) + \sum_{i=1}^{k} \max_{x_i, y_i \in L(G),\ G \in C} \beta(G) \right]$$


Problem definition

Let A(u, v) denote the maximum ordinary alignment score for strings u and v, which can be computed using the alignment algorithm.

Given:
• two sequences S1 and S2
• a set of context-free grammars C = {G1, . . . , Gf}, each of which represents a motif with all its possible variations
• a weight function β(G)

Compute:

$$\max \left[ \sum_{i=1}^{k+1} A(z_i, w_i) + \sum_{i=1}^{k} \max_{x_i, y_i \in L(G),\ G \in C} \beta(G) \right]$$

over all possible x1, x2, . . . , xk, y1, y2, . . . , yk such that

S1 = z1 x1 z2 x2 . . . zk xk zk+1
S2 = w1 y1 w2 y2 . . . wk yk wk+1

where:
• each zj or wj, 1 <= j <= k + 1, can be an empty string
• for all i, 1 <= i <= k, there exists a G ∈ C such that xi, yi ∈ L(G), where L(G) is the language generated by CFG G. Each such xi, yi is a motif-matching region.
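The objective can be evaluated for one fixed decomposition as a sanity check. The sketch below is only an illustration: the scoring scheme (match +1, mismatch -1, gap -1), the helper names `align_score` and `objective`, and the weight value 2 for every β(G) are assumptions, not values taken from the slides. It uses the segments from the example A(AA,AA) + A(GCA,GA) + A(UAU,UA) + A(GC,CGC).

```python
def align_score(u, v, match=1, mismatch=-1, gap=-1):
    """Maximum global alignment score A(u, v) (assumed scoring scheme)."""
    prev = [j * gap for j in range(len(v) + 1)]
    for i in range(1, len(u) + 1):
        curr = [i * gap]
        for j in range(1, len(v) + 1):
            sub = match if u[i - 1] == v[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def objective(z, w, motif_weights):
    """Sum of A(z_i, w_i) over the k+1 non-motif segments plus k motif weights."""
    assert len(z) == len(w) == len(motif_weights) + 1
    return sum(align_score(a, b) for a, b in zip(z, w)) + sum(motif_weights)


# Non-motif segments from the slide's example; the weights are hypothetical.
z = ["AA", "GCA", "UAU", "GC"]
w = ["AA", "GA", "UA", "CGC"]
total = objective(z, w, [2, 2, 2])
```

CFG-PSA itself maximizes this quantity over all valid decompositions; the sketch only scores a single one.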


C = {G1, G2, G3, G4}

G1 = (V, T, P, V0), a context-free grammar
• Set of variables V = {V0, V1, V2}; V0 is the start variable
• Terminal symbols T = {A, U, C, G}
• Set of rules P:
    V0 → CV1G | GV1C
    V1 → GV2C
    V2 → GAA

G2 = ({V0, V1, V2}, {A, U, C, G}, P, V0)
• Set of rules P:
    V0 → AV1U | UV1A
    V1 → CV2G
    V2 → CC | GCG

G3 = ({V0, V1, V2, V3}, {A, U, C, G}, P, V0)
• Set of rules P:
    V0 → AV1U | GV1C | CV1G
    V1 → AV2U
    V2 → AV3U | UV3A
    V3 → AAC | AA
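Because these example grammars are not recursive, each language L(G) is finite and can be listed by expanding the rules. The following sketch assumes a dictionary encoding of the rules and a helper named `language`; both are illustrative choices and not part of the slides. It would not terminate on a recursive grammar.

```python
from itertools import product

# Rules of G1 and G2, encoded as variable -> list of right-hand sides.
G1 = {
    "V0": [["C", "V1", "G"], ["G", "V1", "C"]],
    "V1": [["G", "V2", "C"]],
    "V2": [["G", "A", "A"]],
}
G2 = {
    "V0": [["A", "V1", "U"], ["U", "V1", "A"]],
    "V1": [["C", "V2", "G"]],
    "V2": [["C", "C"], ["G", "C", "G"]],
}


def language(grammar, symbol="V0"):
    """Set of all strings derivable from `symbol` (non-recursive grammars only)."""
    if symbol not in grammar:  # a terminal derives only itself
        return {symbol}
    strings = set()
    for rhs in grammar[symbol]:
        parts = [language(grammar, s) for s in rhs]
        for combo in product(*parts):
            strings.add("".join(combo))
    return strings


motifs_g1 = language(G1)
motifs_g2 = language(G2)
```

Under this encoding G1 describes two motif variants and G2 describes four.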


Reminder – CYK Algorithm

CYK is an algorithm that receives a grammar G in CNF and a string S, and determines whether S can be produced from G, and how.

ParseAll

ParseAll is a modification of CYK that receives a grammar G in CNF and a string S, and finds all of the substrings of S that can be produced from G.

CNF – Chomsky normal form

A grammar in CNF is a CFG in which every production rule has one of the forms: V → TS (two variables on the right-hand side), V → a (a single terminal), or S → ε (only when S is the start variable).

Every CFG can be converted to an equivalent grammar in CNF by following a simple algorithm.
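The three rule forms can be checked mechanically. The sketch below assumes a dictionary encoding of a grammar (variable mapped to a list of right-hand-side tuples) and a helper named `is_cnf`; both are illustrative. It only checks the rule shapes listed above and not the additional textbook condition that a start variable with an ε-rule does not appear on a right-hand side.

```python
def is_cnf(rules, start="V0"):
    """True if every rule is V -> TS, V -> a, or start -> epsilon."""
    variables = set(rules)
    for head, bodies in rules.items():
        for body in bodies:
            if len(body) == 2 and all(s in variables for s in body):
                continue  # two variables
            if len(body) == 1 and body[0] not in variables:
                continue  # a single terminal
            if len(body) == 0 and head == start:
                continue  # epsilon, start variable only
            return False
    return True


# A word-level grammar already in CNF, and the motif grammar G1, which is not:
# V0 -> C V1 G mixes terminals with a variable and has three symbols.
word_grammar = {
    "V0": [("A", "B")],
    "A": [("A1", "A2"), ("you",)],
    "A1": [("No",)],
    "A2": [("one",)],
    "B": [("B1", "B2")],
    "B1": [("can",)],
    "B2": [("B3", "C")],
    "B3": [("understand",)],
    "C": [("D", "P"), ("it",)],
    "D": [("Dynamic",)],
    "P": [("Programming",)],
}
g1 = {
    "V0": [("C", "V1", "G"), ("G", "V1", "C")],
    "V1": [("G", "V2", "C")],
    "V2": [("G", "A", "A")],
}
```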


Running Example:

String S = “Poor fool, do you think you can understand Dynamic Programming? No one can understand it.”

CNF G:
    V0 → AB
    A → A1A2 | you
    A1 → No
    A2 → one
    B → B1B2
    B1 → can
    B2 → B3C
    B3 → understand
    C → DP | it.
    D → Dynamic
    P → Programming


Running Example: the rules of the form X → a fill the diagonal of T (the words of S are numbered from 1):

    T[4,4] = {A}      (A → you)
    T[6,6] = {A}      (A → you)
    T[7,7] = {B1}     (B1 → can)
    T[8,8] = {B3}     (B3 → understand)
    T[9,9] = {D}      (D → Dynamic)
    T[10,10] = {P}    (P → Programming)
    T[11,11] = {A1}   (A1 → No)
    T[12,12] = {A2}   (A2 → one)
    T[13,13] = {B1}   (B1 → can)
    T[14,14] = {B3}   (B3 → understand)
    T[15,15] = {C}    (C → it)


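Running Example:

[Table T, rows indexed by start position i and columns by end position j, both from 4 to 15. Only the diagonal is filled: A at 4 and 6, B1 at 7 and 13, B3 at 8 and 14, D at 9, P at 10, A1 at 11, A2 at 12, C at 15.]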


Running Example:

[Table T, new entries for substrings of length 2:]
    T[9,10] = {C}     (C → DP)
    T[11,12] = {A}    (A → A1A2)
    T[14,15] = {B2}   (B2 → B3C)


Running Example:

[Table T, new entries for substrings of length 3:]
    T[8,10] = {B2}    (B2 → B3C)
    T[13,15] = {B}    (B → B1B2)


Running Example:

[Table T, new entry for substrings of length 4:]
    T[7,10] = {B}     (B → B1B2)


Running Example:

[Table T, new entries for substrings of length 5:]
    T[6,10] = {V0}    (V0 → AB)
    T[11,15] = {V0}   (V0 → AB)


Running Example:

[Final table T: V0 appears in T[6,10] and T[11,15].]

The substrings generated by G are S1 = [6,10] and S2 = [11,15].


Algorithm ParseAll (input string S, length n, CFG G)

Step 1. Find the substrings derived by the rules of the form X → a:
    for i = 1 to n do
        set T[i, i] = ∅
        for each variable X
            if X → S[i] is a rule
                then add X to T[i, i]

Step 2. Find the substrings derived by the rules of the form X → YZ:
    for l = 2 to n do
        for i = 1 to n - l + 1 do
            set j = i + l - 1
            set T[i, j] = ∅
            for k = i to j - 1 do
                for each rule X → YZ
                    if Y ∈ T[i, k] and Z ∈ T[k + 1, j]
                        then add X to T[i, j]

Step 3. Return the set of all substrings generated by G:
    P = ∅
    for i = 1 to n do
        for j = i to n do
            if V0 ∈ T[i, j]
                then add (i, j) to P
    return P

Runtime = O(|G|n^3)
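The three steps of ParseAll can be sketched in Python as follows. The encoding is an assumption made for this illustration: the grammar is passed as two lists, `unary` for rules X → a and `binary` for rules X → YZ, and the running example is tokenized into words with punctuation stripped. Positions are 1-based and inclusive, as in the pseudocode.

```python
def parse_all(tokens, unary, binary, start="V0"):
    """Return all (i, j) such that tokens i..j can be derived from `start`."""
    n = len(tokens)
    # T[i][j] is the set of variables deriving tokens i..j; row/column 0 unused.
    T = [[set() for _ in range(n + 1)] for _ in range(n + 1)]
    # Step 1: rules of the form X -> a.
    for i in range(1, n + 1):
        for X, a in unary:
            if a == tokens[i - 1]:
                T[i][i].add(X)
    # Step 2: rules of the form X -> YZ, by increasing substring length l.
    for l in range(2, n + 1):
        for i in range(1, n - l + 2):
            j = i + l - 1
            for k in range(i, j):
                for X, Y, Z in binary:
                    if Y in T[i][k] and Z in T[k + 1][j]:
                        T[i][j].add(X)
    # Step 3: collect every span that the start variable derives.
    return [(i, j) for i in range(1, n + 1)
            for j in range(i, n + 1) if start in T[i][j]]


unary = [("A", "you"), ("A1", "No"), ("A2", "one"), ("B1", "can"),
         ("B3", "understand"), ("C", "it"), ("D", "Dynamic"),
         ("P", "Programming")]
binary = [("V0", "A", "B"), ("A", "A1", "A2"), ("B", "B1", "B2"),
          ("B2", "B3", "C"), ("C", "D", "P")]
words = ("Poor fool do you think you can understand Dynamic Programming "
         "No one can understand it").split()
spans = parse_all(words, unary, binary)
```

On the running example this yields the two spans [6,10] and [11,15].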


Dynamic programming table (partially filled):

     _   A   A   C   G   A   A   C   G   G   C   A   A   A   A   A   A   C   U
_    0  -∞  -∞  -∞  -∞  -∞  -∞  -∞  -∞  -∞  -∞  -∞  -∞  -∞  -∞  -∞  -∞  -∞  -∞
A   -∞   1  -1  -2  -3  -4  -5  -6  -7  -8  -9 -10 -11 -12 -13 -14 -15 -16 -17
A   -∞  -1   2   1   0  -1  -2  -3  -4  -5  -6  -7  -8  -9 -10 -11 -12 -13 -14
G   -∞  -2   1   0   2   1   0  -1  -2  -3  -4  -5  -6  -7  -8  -9 -10 -11 -12
G   -∞  -3   0  -1   1   0  -1  -2   0  -1  -2  -3  -4  -5  -6  -7  -8  -9 -10
G   -∞  -4  -1  -2   0  -1  -2  -3  -1   1   0  -1  -2  -3  -4  -5  -6  -7  -8
A   -∞  -5  -2  -3  -1   1   0  -1  -2   0  -1   1   0  -1  -2  -3  -4  -5  -6
A   -∞  -6  -3  -4  -2   0   2   1   0  -1  -2   0   2   1   0  -1  -2  -3  -4
C   -∞  -7  -8  -2  -3  -1   1   3   2   1   0  -1   1   0  -1  -2  -3  -1  -2
C   -∞  -8  -9  -3  -4  -2   0   2
G   -∞
A   -∞
C   -∞
A   -∞
U   -∞
A   -∞
A   -∞
A   -∞

-1 -1


Preprocessing

For all Gk in C do:
    Xk ← ParseAll(S1, n1, Gk)
    Yk ← ParseAll(S2, n2, Gk)
    Move every (i', i) ∈ Xk to list X as (i', i, k)
    Move every (j', j) ∈ Yk to list Y as (j', j, k)
Sort X and Y in ascending order of the end points (i for X, j for Y).

Triples (start, end, grammar index) shown on the slide for the lists X and Y, with the markers j=7 and i=11:

(1,2,4) (2,5,1) (3,7,4) (5,7,1) (3,9,2) (7,11,4) (9,11,1) (8,18,3)
(2,6,4) (4,8,3) (3,8,2) (5,11,1) (2,11,4) (9,11,3) (1,5,2)
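The preprocessing step can be sketched as below. To keep the example self-contained, ParseAll is replaced by a hypothetical stand-in, `find_literal_motifs`, which treats each grammar as a finite set of motif strings and reports their occurrences; the function names and the input sequences are assumptions made for illustration.

```python
def find_literal_motifs(s, motifs):
    """Stand-in for ParseAll: 1-based inclusive (start, end) of each occurrence."""
    hits = []
    for m in motifs:
        start = s.find(m)
        while start != -1:
            hits.append((start + 1, start + len(m)))
            start = s.find(m, start + 1)
    return hits


def preprocess(s1, s2, grammars, parse_all=find_literal_motifs):
    """Build the lists X and Y of (start, end, k), sorted by end point."""
    X, Y = [], []
    for k, g in enumerate(grammars, start=1):
        X.extend((start, end, k) for start, end in parse_all(s1, g))
        Y.extend((start, end, k) for start, end in parse_all(s2, g))
    X.sort(key=lambda t: t[1])
    Y.sort(key=lambda t: t[1])
    return X, Y


# Finite languages of G1 and G2 from the earlier slide, written out as strings.
grammars = [{"CGGAACG", "GGGAACC"}, {"ACCCGU", "ACGCGGU", "UCCCGA", "UCGCGGA"}]
X, Y = preprocess("AACGGAACGA", "ACCCGUCGGAACG", grammars)
```

Sorting by end point lets the alignment step look up, for a table cell (i, j), exactly the motif occurrences that end at i in S1 and at j in S2.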


Time Complexity of CFG - PSA

Where N = max{N1, N2}:

• Running ParseAll for C and creating X, Y: O(|C|N^3)

• Sorting X, Y: O(|X|log|X| + |Y|log|Y|)

• Computing the max terms: O(|X||Y| + N^2), because of the usual N^2 table cells and the new max term

$$\max_{((i', i, k_1),\, (j', j, k_2)) \in X_i \times Y_j,\ G_{k_1} = G_{k_2}} \{ H(i', j') + \beta(G_{k_1}) \}$$

which passes over every entry in X and Y and checks for matching grammars.

Total time complexity: O(|C|N^3 + |X||Y|)
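The extra max term can be added to an ordinary alignment table as sketched below. Several details are assumptions made for illustration and are not taken from the slides: the scoring scheme (match +1, mismatch -1, gap -1), the standard global-alignment borders, and the reading of i' and j' as inclusive 1-based start positions, so that the score accumulated before the motif region is read from H[i'-1][j'-1] (the slide writes this term as H(i', j')).

```python
def cfg_psa_score(s1, s2, X, Y, beta, match=1, mismatch=-1, gap=-1):
    """Global alignment score where matching motif regions earn beta[k]."""
    n1, n2 = len(s1), len(s2)
    # Index motif occurrences by their end position.
    x_by_end, y_by_end = {}, {}
    for start, end, k in X:
        x_by_end.setdefault(end, []).append((start, k))
    for start, end, k in Y:
        y_by_end.setdefault(end, []).append((start, k))
    H = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i in range(1, n1 + 1):
        H[i][0] = i * gap
    for j in range(1, n2 + 1):
        H[0][j] = j * gap
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            best = max(H[i - 1][j - 1] + sub, H[i - 1][j] + gap, H[i][j - 1] + gap)
            # New max term: motif regions of the same grammar ending at (i, j).
            for si, k1 in x_by_end.get(i, []):
                for sj, k2 in y_by_end.get(j, []):
                    if k1 == k2:
                        best = max(best, H[si - 1][sj - 1] + beta[k1])
            H[i][j] = best
    return H[n1][n2]


# Hypothetical input: "GGG" and "CC" are both assumed to belong to L(G1).
with_motif = cfg_psa_score("AAGGG", "AACC", [(3, 5, 1)], [(3, 4, 1)], {1: 5})
without_motif = cfg_psa_score("AAGGG", "AACC", [], [], {})
```

The inner double loop is what contributes the |X||Y| term: summed over all cells, each pair of an X entry and a Y entry is examined at most once.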


Possible Modifications

• Affine gap penalties, local alignment computations, or more advanced algorithms can be used in the alignment of non-motif-matching regions, with the same or better complexity.

• Solve the general problem of optimally aligning multiple sequences guided by a given set of motifs described by CFGs.
