Yutu Liu -- CPSC 689 Algorithmic Techniques for Biology Spring 2005


A memory-efficient algorithm for multiple sequence alignment with constraints

Chin Lung Lu and Yen Pin Huang

National Chiao Tung University, Taiwan, Republic of China

Bioinformatics, Vol. 21 no. 1 2005


2

Motivation

Incorporate the biological structures and consensuses into sequence alignment

Memory efficient

3

Problem Formulation -- Constraints

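What is multiple sequence alignment with constraints?

Sequences:

    A T C T C G C T
    T G C A T A T

Constraints: AT, T (labeled $C_1$, $C_2$)

Two alignments of the sequences:

    A  T  -- C  -- T  C  G  C  T
    -- T  G  C  A  T  -- -- A  T

    -- -- -- A  T  C  T  C  G  C  T
    T  G  C  A  T  A  T  -- -- -- --

Constraints are conserved sites of a protein or DNA/RNA family.

$\Gamma = (C_1, C_2, \ldots, C_\gamma)$

$C_i = c^i_1 c^i_2 \cdots c^i_{\lambda_i}$

No overlapping between them; they appear in the order $C_1, C_2, \ldots, C_\gamma$.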

4

Problem Formulation -- Constraints

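Sequence $S_j$: T G C A T A T

Constraint $C_i$: G A, with $C_i = c^i_1 c^i_2$ and $\lambda_i = 2$

Hamming distance with error ratio $\varepsilon = 0.5$, taken between $C_i$ and a substring of $S_j$ of length $\lambda_i$:

$d_H(S_j, C_i) \le \varepsilon \lambda_i = 1$

When this holds, $C_i$ approximately appears in $S_j$.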
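The test for "approximately appears" can be sketched in a few lines of Python. This is only an illustration: the function names `hamming` and `approximate_occurrences` are made up here, and the slide does not say how windows are enumerated, so the sketch simply slides a window of the constraint's length over the sequence.

```python
def hamming(u: str, v: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(u) != len(v):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(u, v))


def approximate_occurrences(seq: str, constraint: str, eps: float) -> list:
    """Start positions (0-based) of windows within eps * len(constraint) mismatches."""
    lam = len(constraint)
    limit = eps * lam
    return [
        p
        for p in range(len(seq) - lam + 1)
        if hamming(seq[p:p + lam], constraint) <= limit
    ]


# Slide example: sequence TGCATAT, constraint GA, error ratio 0.5 (limit = 1).
hits = approximate_occurrences("TGCATAT", "GA", 0.5)
```

With an error ratio of 0 the same call returns an empty list, because GA never occurs exactly in TGCATAT.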

5

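Problem Formulation -- Constraints

Given $S = \{s_1, s_2, \ldots, s_x\}$ and a string $T = \{t_1, t_2, \ldots, t_k\}$, $T$ approximately appears in an alignment $L$ of $S$ if $L$ has a band $L'$ such that

$$d_H(\mathrm{subseq}(s_i, L'), T) \le \varepsilon \cdot k \quad \text{for } 1 \le i \le x$$

Alignment $L$:

    A  T  G  C  A  T  C  G  C  T
    -- T  G  C  A  T  -- -- A  T
    T  T  G  C  A  T  C  A  T  C

[Figure labels: band L'; Subseq(S2, L'); T G C C C]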
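A band check over an alignment can be sketched as follows. The alignment rows are the ones shown on this slide; the band position, the pattern `TGCCT` and the error ratios are hypothetical values chosen for illustration, and counting a gap inside the band as a mismatch is an assumption of this sketch, not something the slide states.

```python
def appears_in_band(alignment, start, end, pattern, eps):
    """True if every row of columns [start, end) is within eps * len(pattern)
    mismatches of `pattern`. A gap character never equals a letter, so it is
    counted as a mismatch (an assumption of this sketch)."""
    if end - start != len(pattern):
        return False
    limit = eps * len(pattern)
    return all(
        sum(x != y for x, y in zip(row[start:end], pattern)) <= limit
        for row in alignment
    )


# Rows of the alignment L from the slide, with "--" written as "-".
L = ["ATGCATCGCT", "-TGCAT--AT", "TTGCATCATC"]

# Columns 1..5 read TGCAT in every row; the hypothetical pattern TGCCT
# differs from that in one position.
ok = appears_in_band(L, 1, 6, "TGCCT", 0.2)       # limit = 1 mismatch
too_strict = appears_in_band(L, 1, 6, "TGCCT", 0.1)  # limit = 0.5 mismatch
```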

6

Problem Formulation

Let $S = \{S_1, S_2, \ldots, S_x\}$ over the alphabet $\Sigma$. Let $\Gamma = (C_1, C_2, \ldots, C_\gamma)$ be an ordered set of constraints. Then the CMSA of $S$ w.r.t. $\Gamma$ is an alignment $L$ of $S$ over $\Sigma \cup \{-\}$ with the optimal sum-of-pair score (SP score) in which all the constraints of $\Gamma$ approximately appear in the order of $C_1, C_2, \ldots, C_\gamma$ such that $d_H(\mathrm{subseq}(S_i, L'_j), C_j) \le \varepsilon \lambda_j$ for all $1 \le i \le x$ and $1 \le j \le \gamma$, where $L'_j$ is the band of $L$ whose induced consensus is $C_j$.

Constrained Multiple Sequence Alignment (CMSA)

[Figure: sequences S1, S2, S3 aligned with constraint bands C1, C2, C3; labels: Optimal Sum-of-Pair Score, CPSA]
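The sum-of-pair score named in the definition adds up a pairwise score over every pair of rows and every column. A minimal sketch follows; the scoring values (match +1, mismatch -1, gap -2, gap against gap 0) are placeholders assumed here, since the slides do not fix a scoring scheme.

```python
from itertools import combinations


def pair_score(x, y, match=1, mismatch=-1, gap=-2):
    """Score one aligned pair of characters (hypothetical values)."""
    if x == "-" and y == "-":
        return 0
    if x == "-" or y == "-":
        return gap
    return match if x == y else mismatch


def sp_score(alignment):
    """Sum pair_score over all pairs of rows and all columns."""
    if len({len(row) for row in alignment}) > 1:
        raise ValueError("all rows of an alignment must have the same length")
    return sum(
        pair_score(r[c], s[c])
        for r, s in combinations(alignment, 2)
        for c in range(len(r))
    )


score = sp_score(["AC", "AC", "A-"])
```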

7

CMSA

Pick two sequences

Find the CPSA

Use it as a kernel to progressively align more sequences

[1] Progressive Multiple Alignment with Constraints, Gene Myers et al.

[2] MuSiC: A Tool for Multiple Sequence Alignment with Constraints, Yin Te Tsai, Chin Lung Lu, Ching Ta Yu, Yen Pin Huang

8

Algorithm Overview

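Find recursive relationship

[Figure: dynamic-programming step from M(i-1, j-1), ending at characters $a_{i-1}$ and $b_{j-1}$, to M(i, j), ending at $a_i$ and $b_j$]

Divide-and-Conquer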

9

Notations

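$A = a_1 a_2 \cdots a_m$

$B = b_1 b_2 \cdots b_n$

$\mathrm{pref}(A, i) = a_1 a_2 \cdots a_i = A_i$

$\mathrm{suff}(A, i) = a_{i+1} a_{i+2} \cdots a_m = \overline{A}_i$

$A = \underbrace{a_1 a_2 \cdots a_i}_{A_i} \; \underbrace{a_{i+1} a_{i+2} \cdots a_m}_{\overline{A}_i}$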

10

Notations

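$\mathrm{pref}(B, j) = b_1 b_2 \cdots b_j = B_j$

$\mathrm{suff}(B, j) = b_{j+1} b_{j+2} \cdots b_n = \overline{B}_j$

$\mathrm{pref}(\Gamma, k) = C_1 C_2 \cdots C_k = \Gamma_k$

$\mathrm{suff}(\Gamma, k) = C_{k+1} C_{k+2} \cdots C_\gamma = \overline{\Gamma}_k$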

11

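Notation

[Figure: A is split into $A_i = a_1 a_2 \cdots a_i$ and $\overline{A}_i = a_{i+1} \cdots a_{m-1} a_m$, B into $B_j = b_1 b_2 \cdots b_j$ and $\overline{B}_j = b_{j+1} \cdots b_{n-1} b_n$, and the constraints $C_1 \ldots C_k \ldots C_\gamma$ at $C_k$.]

Prefix side: $M_k(i, j)$, $M^S_k(i, j)$, $M^D_k(i, j)$, $M^I_k(i, j)$

Suffix side: $\overline{M}_k(i, j)$, $\overline{M}^S_k(i, j)$, $\overline{M}^D_k(i, j)$, $\overline{M}^I_k(i, j)$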

12

Alignment Score

Let $M_k(i, j)$ be the score of an optimal constrained alignment of $A_i$ and $B_j$ w.r.t. $\Gamma_k$.

[Figure: A aligned with B, with bands C1, C2, ..., Ck]

13

Alignment Score - Substitution

Let $M^S_k(i, j)$ be the maximum score of all constrained alignments of $A_i$ and $B_j$ w.r.t. $\Gamma_k$ that end with a substitution pair $(a_i, b_j)$.

[Figure: A aligned with B, with bands C1, C2, ..., Ck and final column $(a_i, b_j)$]

14

Alignment Score -- Deletion

Let $M^D_k(i, j)$ be the maximum score of all constrained alignments of $A_i$ and $B_j$ w.r.t. $\Gamma_k$ that end with a deletion pair $(a_i, -)$.

[Figure: A aligned with B, with bands C1, C2, ..., Ck and final column $(a_i, -)$]

15

Alignment Score -- Insertion

Let $M^I_k(i, j)$ be the maximum score of all constrained alignments of $A_i$ and $B_j$ w.r.t. $\Gamma_k$ that end with an insertion pair $(-, b_j)$.

[Figure: A aligned with B, with bands C1, C2, ..., Ck and final column $(-, b_j)$]

16

Semi-Constrained Alignment

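A semi-constrained alignment of $A_i$ and $B_j$ w.r.t. $\Gamma_k$ is a constrained alignment of $A_i$ and $B_j$ w.r.t. $\Gamma_{k-1}$ that ends with a band which is a prefix of $C_k$.

[Figure: A aligned with B, with bands C1, C2, ..., Ck-1 followed by a final band of $h$ columns for $\mathrm{pref}(C_k, h)$; the score is labeled $N_k(i, j, h)$]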

17

Recurrence of Scores

If $k = 0$, then

$$M_k(i, j) = \max \begin{cases} M_k(i-1, j-1) + \sigma(a_i, b_j) \\ M^D_k(i, j) \\ M^I_k(i, j) \end{cases}$$

18

Recurrence of Scores

If $1 \le k \le \gamma$, then

$$M_k(i, j) = \max \begin{cases} M_k(i-1, j-1) + \sigma(a_i, b_j) \\ M^D_k(i, j) \\ M^I_k(i, j) \\ N_k(i, j, \lambda_k) \end{cases}$$

19

Recurrence of Scores

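If $d_H(\mathrm{suff}(A_i, \lambda_k), C_k) \le \varepsilon \lambda_k$ and $d_H(\mathrm{suff}(B_j, \lambda_k), C_k) \le \varepsilon \lambda_k$, where $\mathrm{suff}(A_i, \lambda_k)$ here denotes the last $\lambda_k$ characters of $A_i$, then

$$N_k(i, j, \lambda_k) = M_{k-1}(i - \lambda_k, j - \lambda_k) + \sum_{h=0}^{\lambda_k - 1} \sigma(a_{i-h}, b_{j-h})$$

else

$$N_k(i, j, \lambda_k) = -\infty$$

[Figure: the trailing characters $a_{i-h} \cdots a_{i-1} a_i$ of $a_1 a_2 \cdots a_i$ and $b_{j-h} \cdots b_{j-1} b_j$ of $b_1 b_2 \cdots b_j$ are aligned in the band for $C_k$]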
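The recurrences for $M_k$ and $N_k$ can be put together in a small dynamic program. The sketch below is a simplified illustration rather than the authors' implementation: it ignores the open-gap penalty (so a deletion or insertion costs a flat `gap` per column), stores the full table instead of using the memory-efficient scheme, and uses a made-up scoring function `sigma` (+2 match, -1 mismatch).

```python
NEG_INF = float("-inf")


def sigma(x, y):
    """Hypothetical substitution score: +2 for a match, -1 for a mismatch."""
    return 2 if x == y else -1


def hamming(u, v):
    return sum(p != q for p, q in zip(u, v))


def cpsa_score(a, b, constraints, eps, gap=1):
    """Score of an optimal constrained pairwise alignment, or -inf if none exists."""
    m, n, g = len(a), len(b), len(constraints)
    # M[k][i][j]: best score for a[:i], b[:j] with the first k constraints placed.
    M = [[[NEG_INF] * (n + 1) for _ in range(m + 1)] for _ in range(g + 1)]
    for k in range(g + 1):
        c = constraints[k - 1] if k else ""
        lam = len(c)
        for i in range(m + 1):
            for j in range(n + 1):
                best = 0 if (k == 0 and i == 0 and j == 0) else NEG_INF
                if i and j:  # substitution pair (a_i, b_j)
                    best = max(best, M[k][i - 1][j - 1] + sigma(a[i - 1], b[j - 1]))
                if i:  # deletion pair (a_i, -)
                    best = max(best, M[k][i - 1][j] - gap)
                if j:  # insertion pair (-, b_j)
                    best = max(best, M[k][i][j - 1] - gap)
                if k and i >= lam and j >= lam:  # band for C_k ends here
                    wa, wb = a[i - lam:i], b[j - lam:j]
                    limit = eps * lam
                    if hamming(wa, c) <= limit and hamming(wb, c) <= limit:
                        band = sum(sigma(x, y) for x, y in zip(wa, wb))
                        best = max(best, M[k - 1][i - lam][j - lam] + band)
                M[k][i][j] = best
    return M[g][m][n]


free = cpsa_score("ACGT", "CGTA", [], 0.0)
forced = cpsa_score("ACGT", "CGTA", ["A"], 0.0)
```

In the example, forcing the two A characters into one column lowers the score, and a constraint that cannot be placed at all yields negative infinity.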

20

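How is $M^D_k(i, j)$ obtained from the scores at $(i-1, j)$?

The alignment of $a_1 a_2 \cdots a_{i-1}$ with $b_1 b_2 \cdots b_j$ can end with a substitution $(a_{i-1}, b_j)$, a deletion $(a_{i-1}, -)$ or an insertion $(-, b_j)$. With open-gap penalty $w_o$ and gap-extension penalty $w_e$:

Substitution: $M^D_k(i, j) = M^S_k(i-1, j) - w_o - w_e$

Deletion: $M^D_k(i, j) = M^D_k(i-1, j) - w_e$

Insertion: $M^D_k(i, j) = M^I_k(i-1, j) - w_o - w_e$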

21

Recurrence

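$$M^D_k(i, j) = \max \begin{cases} M^S_k(i-1, j) - w_o - w_e \\ M^I_k(i-1, j) - w_o - w_e \\ M^D_k(i-1, j) - w_e \end{cases}$$

The first two cases, together with $M^D_k(i-1, j) - w_o - w_e$, combine into $M_k(i-1, j) - w_o - w_e$, so

$$M^D_k(i, j) = \max \begin{cases} M_k(i-1, j) - w_o - w_e \\ M^D_k(i-1, j) - w_e \end{cases}$$

$$M^I_k(i, j) = \max \begin{cases} M_k(i, j-1) - w_o - w_e \\ M^I_k(i, j-1) - w_e \end{cases}$$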
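For the unconstrained case ($k = 0$) the gap recurrences reduce to the familiar affine-gap dynamic program with three tables. The following sketch assumes positive penalties `w_open` and `w_ext` that are subtracted from the score and a made-up scoring function `sigma`; it is meant only to show how the deletion and insertion tables feed the main table.

```python
NEG_INF = float("-inf")


def sigma(x, y):
    """Hypothetical substitution score: +2 for a match, -1 for a mismatch."""
    return 2 if x == y else -1


def affine_score(a, b, w_open=2, w_ext=1):
    """Optimal global alignment score with gap cost w_open + length * w_ext."""
    m, n = len(a), len(b)
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # best overall
    D = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with (a_i, -)
    I = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # ends with (-, b_j)
    M[0][0] = 0
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            if i:  # open a new deletion gap or extend the current one
                D[i][j] = max(M[i - 1][j] - w_open - w_ext, D[i - 1][j] - w_ext)
            if j:  # open a new insertion gap or extend the current one
                I[i][j] = max(M[i][j - 1] - w_open - w_ext, I[i][j - 1] - w_ext)
            sub = M[i - 1][j - 1] + sigma(a[i - 1], b[j - 1]) if i and j else NEG_INF
            M[i][j] = max(sub, D[i][j], I[i][j])
    return M[m][n]


one_long_gap = affine_score("AAA", "A")
```

Aligning AAA with A gives one match (+2) and a single gap of length two (2 + 2 * 1), which is cheaper than two separate gaps of length one.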

22

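[Figure: three-dimensional dynamic-programming lattice with axes Sequence A, Sequence B and Constraints, running from (0, 0, 0) to (m, n, γ); cell (i, j, k) is shown with its neighbors (i, j-1, k), (i-1, j-1, k) and (i-1, j, k)]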

23

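$$N_k(i, j, \lambda_k) = M_{k-1}(i - \lambda_k, j - \lambda_k) + \sum_{h=0}^{\lambda_k - 1} \sigma(a_{i-h}, b_{j-h})$$

[Figure: the term $N_k$ shown in the lattice]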

24

Assignment

Design an algorithm to find the CPSA using the dynamic programming technique. Analyze the time and space complexity of your algorithm. For simplicity, you can ignore the open-gap penalty. Prove your algorithm is consistent with the constraint set $\Gamma$, i.e., it will find such a CPSA if there exists one.

Email: alinux@tamu.edu

25

Divide-and-Conquer

26

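[Figure: A is split into $A_i = a_1 a_2 \cdots a_i$ and $\overline{A}_i = a_{i+1} \cdots a_{m-1} a_m$, B into $B_j = b_1 b_2 \cdots b_j$ and $\overline{B}_j = b_{j+1} \cdots b_{n-1} b_n$. Among the constraints $C_1 \ldots C_k \ldots C_\gamma$, the band for $C_k$ is cut after $h$ columns into pref($C_k$, h) and suff($C_k$, $\lambda_k$ - h).]

$\Gamma_k(h) = [c_1, c_2, \ldots, c_{k-1}, \mathrm{pref}(c_k, h)]$

$\overline{\Gamma}_k(h) = [\mathrm{suff}(c_k, h), c_{k+1}, \ldots, c_\gamma]$

Prefix side: $M_k(i, j)$, $M^S_k(i, j)$, $M^D_k(i, j)$, $M^I_k(i, j)$, $N_k(i, j, h)$

Suffix side: $\overline{M}_k(i, j)$, $\overline{M}^S_k(i, j)$, $\overline{M}^D_k(i, j)$, $\overline{M}^I_k(i, j)$, $\overline{N}_k(i, j, h)$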

27

Divide-and-Conquer

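The two halves of the alignment: $L_{k_{mid}}(A_{i_{mid}}, B_{j_{mid}})$ and $\overline{L}_{k_{mid}}(\overline{A}_{i_{mid}}, \overline{B}_{j_{mid}})$

Case 1: if the last pair of $L_{k_{mid}}(A_{i_{mid}}, B_{j_{mid}})$ is a substitution

A. $L_{k_{mid}}(A_{i_{mid}}, B_{j_{mid}})$ and $\overline{L}_{k_{mid}}(\overline{A}_{i_{mid}}, \overline{B}_{j_{mid}})$ are optimal constrained:

$$M_\gamma(m, n) = M^S_{k_{mid}}(i_{mid}, j_{mid}) + \overline{M}_{k_{mid}}(i_{mid}, j_{mid})$$

B. $L_{k_{mid}}(A_{i_{mid}}, B_{j_{mid}})$ and $\overline{L}_{k_{mid}}(\overline{A}_{i_{mid}}, \overline{B}_{j_{mid}})$ are optimal semi-constrained:

$$M_\gamma(m, n) = N_{k_{mid}}(i_{mid}, j_{mid}, h_{mid}) + \overline{N}_{k_{mid}}(i_{mid}, j_{mid}, h_{mid})$$

C. if $h_{mid} = \lambda_{k_{mid}}$:

$$M_\gamma(m, n) = N_{k_{mid}}(i_{mid}, j_{mid}, \lambda_{k_{mid}}) + \overline{M}_{k_{mid}}(i_{mid}, j_{mid})$$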

28

Divide-and-Conquer

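The two halves of the alignment: $L_{k_{mid}}(A_{i_{mid}}, B_{j_{mid}})$ and $\overline{L}_{k_{mid}}(\overline{A}_{i_{mid}}, \overline{B}_{j_{mid}})$

Case 2: if the last pair of $L_{k_{mid}}(A_{i_{mid}}, B_{j_{mid}})$ is a deletion

A. If the first pair of $\overline{L}_{k_{mid}}(\overline{A}_{i_{mid}}, \overline{B}_{j_{mid}})$ is not a deletion pair:

$$M_\gamma(m, n) = \max \{ M^D_{k_{mid}}(i_{mid}, j_{mid}) + \overline{M}^S_{k_{mid}}(i_{mid}, j_{mid}),\; M^D_{k_{mid}}(i_{mid}, j_{mid}) + \overline{M}^I_{k_{mid}}(i_{mid}, j_{mid}) \}$$

B. If the first pair of $\overline{L}_{k_{mid}}(\overline{A}_{i_{mid}}, \overline{B}_{j_{mid}})$ is a deletion pair:

$$M_\gamma(m, n) = M^D_{k_{mid}}(i_{mid}, j_{mid}) + \overline{M}^D_{k_{mid}}(i_{mid}, j_{mid}) + w_o$$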

29

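Summary

$$M_\gamma(m, n) = \max \begin{cases}
M^D_{k_{mid}}(i_{mid}, j_{mid}) + \overline{M}^I_{k_{mid}}(i_{mid}, j_{mid}) \\
M^D_{k_{mid}}(i_{mid}, j_{mid}) + \overline{M}^S_{k_{mid}}(i_{mid}, j_{mid}) \\
M^S_{k_{mid}}(i_{mid}, j_{mid}) + \overline{M}_{k_{mid}}(i_{mid}, j_{mid}) \\
M^D_{k_{mid}}(i_{mid}, j_{mid}) + \overline{M}^D_{k_{mid}}(i_{mid}, j_{mid}) + w_o \\
N_{k_{mid}}(i_{mid}, j_{mid}, h_{mid}) + \overline{N}_{k_{mid}}(i_{mid}, j_{mid}, h_{mid}) \\
N_{k_{mid}}(i_{mid}, j_{mid}, \lambda_{k_{mid}}) + \overline{M}_{k_{mid}}(i_{mid}, j_{mid})
\end{cases}$$

The terms $M^D_{k_{mid}} + \overline{M}^I_{k_{mid}}$ and $M^D_{k_{mid}} + \overline{M}^S_{k_{mid}}$, together with $M^D_{k_{mid}}(i_{mid}, j_{mid}) + \overline{M}^D_{k_{mid}}(i_{mid}, j_{mid})$, are covered by the single term $M^D_{k_{mid}}(i_{mid}, j_{mid}) + \overline{M}_{k_{mid}}(i_{mid}, j_{mid})$.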

30

Summary

$i_{mid} = m / 2$

Take $j$, $k$, $h$ as indices $j_{mid}$, $k_{mid}$ and $h_{mid}$, where $1 \le j \le n$, $0 \le k \le \gamma$ and $1 \le h \le \lambda_k$, such that the following value is the maximum.

$$M_\gamma(m, n) = \max \begin{cases}
M^D_k(i_{mid}, j) + \overline{M}_k(i_{mid}, j) \\
M^S_k(i_{mid}, j) + \overline{M}_k(i_{mid}, j) \\
M^D_k(i_{mid}, j) + \overline{M}^D_k(i_{mid}, j) + w_o \\
N_k(i_{mid}, j, h) + \overline{N}_k(i_{mid}, j, h) \\
N_k(i_{mid}, j, \lambda_k) + \overline{M}_k(i_{mid}, j)
\end{cases}$$

$$M_\gamma(m, n) = \max \{ F(i_{mid}, j, k, h) \}$$
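The idea of picking the split point by adding a forward score and a reverse score can be illustrated on the simplest case: no constraints and a flat per-column gap cost, which is the classic Hirschberg split. The sketch below is that simplified case only, with a made-up scoring function `sigma`; the constrained algorithm on the slide additionally maximizes over $k$ and $h$ and handles the open-gap correction.

```python
def sigma(x, y):
    """Hypothetical substitution score: +2 for a match, -1 for a mismatch."""
    return 2 if x == y else -1


def last_row(a, b, gap=1):
    """Scores of aligning all of `a` with every prefix of `b`, in O(len(b)) space."""
    prev = [-gap * j for j in range(len(b) + 1)]
    for x in a:
        cur = [prev[0] - gap]
        for j, y in enumerate(b, 1):
            cur.append(max(prev[j - 1] + sigma(x, y), prev[j] - gap, cur[j - 1] - gap))
        prev = cur
    return prev


def best_split(a, b, gap=1):
    """Return (i_mid, j_mid, score): where an optimal alignment crosses row i_mid."""
    i_mid = len(a) // 2
    n = len(b)
    fwd = last_row(a[:i_mid], b, gap)              # prefix scores
    rev = last_row(a[i_mid:][::-1], b[::-1], gap)  # suffix scores, reversed
    totals = [fwd[j] + rev[n - j] for j in range(n + 1)]
    j_mid = max(range(n + 1), key=totals.__getitem__)
    return i_mid, j_mid, totals[j_mid]


split = best_split("ACGT", "ACGT")
```

The maximum of the summed rows always equals the score of the full alignment, which is what makes the split safe to recurse on.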

31

Implementation -- CPSA-DC()

Algorithm CPSA-DC($i_{start}$, $i_{end}$, $j_{start}$, $j_{end}$, $k_{start}$, $k_{end}$)

1. Divide A into 2 at $i_{mid}$, then call BestScore() and BestScoreRev(), where the sizes of B and the $\Gamma$'s are not changed.

2. BestScore() and BestScoreRev() return all the alignment scores of ($A_{i_{mid}}$, $B_j$, $\Gamma_k$) and ($\overline{A}_{i_{mid}}$, $\overline{B}_j$, $\overline{\Gamma}_k$).

3. Find the $\max\{F(i_{mid}, j, k, h)\}$, where the values of $j$, $k$, $h$ will be used as the middle point to divide the alignment for the recursive calls of CPSA-DC().
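The three steps have the same shape as Hirschberg's linear-space alignment. As a rough sketch of that shape, here is the recursion for the unconstrained case with a flat per-column gap cost; `last_row` plays the role of both BestScore() and BestScoreRev(), and `sigma`, `hirschberg` and `alignment_score` are names invented for this illustration. It is not the CPSA-DC algorithm itself, which also splits the constraint list.

```python
def sigma(x, y):
    """Hypothetical substitution score: +2 for a match, -1 for a mismatch."""
    return 2 if x == y else -1


def last_row(a, b, gap=1):
    """Scores of aligning all of `a` with every prefix of `b`."""
    prev = [-gap * j for j in range(len(b) + 1)]
    for x in a:
        cur = [prev[0] - gap]
        for j, y in enumerate(b, 1):
            cur.append(max(prev[j - 1] + sigma(x, y), prev[j] - gap, cur[j - 1] - gap))
        prev = cur
    return prev


def alignment_score(top, bottom, gap=1):
    """Score two aligned rows of equal length column by column."""
    return sum(-gap if "-" in (x, y) else sigma(x, y) for x, y in zip(top, bottom))


def hirschberg(a, b, gap=1):
    """Return an optimal alignment (top, bottom) using linear working space."""
    if not a:
        return "-" * len(b), b
    if not b:
        return a, "-" * len(a)
    if len(a) == 1:
        # Pair the single character with its best partner, unless gaps are cheaper.
        j = max(range(len(b)), key=lambda t: sigma(a, b[t]))
        if sigma(a, b[j]) >= -2 * gap:
            return "-" * j + a + "-" * (len(b) - j - 1), b
        return a + "-" * len(b), "-" + b
    # Step 1: divide A in two; B keeps its size.
    i_mid = len(a) // 2
    # Step 2: forward scores for the first half, reverse scores for the second.
    fwd = last_row(a[:i_mid], b, gap)
    rev = last_row(a[i_mid:][::-1], b[::-1], gap)
    # Step 3: the best j becomes the middle point for the two recursive calls.
    n = len(b)
    j_mid = max(range(n + 1), key=lambda j: fwd[j] + rev[n - j])
    left = hirschberg(a[:i_mid], b[:j_mid], gap)
    right = hirschberg(a[i_mid:], b[j_mid:], gap)
    return left[0] + right[0], left[1] + right[1]


top, bottom = hirschberg("ACGT", "CGTA")
```

Only two score rows are alive at any time, which is the source of the memory saving; the alignment itself is assembled from the recursive calls.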

32

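[Figure: A is cut at $A_{i_{mid}}$ within $a_1 a_2 \cdots a_i \, a_{i+1} \cdots a_{m-1} a_m$; B is split into $B_j$ and $\overline{B}_j$ within $b_1 b_2 \cdots b_j \, b_{j+1} \cdots b_{n-1} b_n$; the constraints are $C_1 \ldots C_k \ldots C_\gamma$; the score shown is $M_k(i, j)$]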

33

Complexity

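A single matrix $E$ of size $(\gamma + 1)(n + 1)$, with each entry holding $4 + \lambda_k$ values: $M_k(i_{mid}, j)$, $M^S_k(i_{mid}, j)$, $M^D_k(i_{mid}, j)$, $M^I_k(i_{mid}, j)$ and $N_k(i_{mid}, j, h)$

Temporary space: $V$ is the same size as $E$

Total space: $O(\alpha n)$, where $\alpha = \lambda_1 + \lambda_2 + \cdots + \lambda_\gamma$ is the total length of the constraints

Let $T$, proportional to $mn$, be the size of the original problem; then the total time complexity of the CPSA-DC algorithm is equal to

$$T + \frac{T}{2} + \frac{T}{4} + \frac{T}{8} + \cdots = 2T$$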
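The halving behind this series can be written out as a short worked derivation. The symbol $T$ for the size of the original problem is a label assumed here, and the argument is the standard one for Hirschberg-style recursion, ignoring the constraint dimension: after splitting A at $i_{mid} = m/2$ and B at some $j$, the two subproblems together cover half of the original area, whatever $j$ is.

```latex
\begin{align*}
\underbrace{\tfrac{m}{2} \cdot j}_{\text{first half}}
  + \underbrace{\tfrac{m}{2} \cdot (n - j)}_{\text{second half}}
  &= \tfrac{mn}{2}
  && \text{(area after one level of recursion)} \\
\sum_{t=0}^{\infty} \frac{T}{2^{t}}
  &= T \cdot \frac{1}{1 - \tfrac{1}{2}} = 2T
  && \text{(total over all levels)}
\end{align*}
```

So the divide-and-conquer version does at most about twice the work of a single pass over the full table.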

34

Experimental Results

35

Experimental Results

36

Discussion

Lack of proof of consistency of constraints

Optimal pair-wise alignments of subsequences do not guarantee that the overall alignment is optimal

37

Discussion

http://genome.life.nctu.edu.tw:8080/MUSICME/index.html

38

Assignment

Let $S = \{S_1, S_2\}$ over the alphabet $\Sigma$. Let $\Gamma = (C_1, C_2, \ldots, C_\gamma)$ be an ordered set of constraints, where $C_i = c^i_1 c^i_2 \cdots c^i_{\lambda_i}$. Then the Constrained Pair-wise Sequence Alignment (CPSA) of $S$ w.r.t. $\Gamma$ is an alignment $L$ of $S$ over $\Sigma \cup \{-\}$ with the optimal sum-of-pair score (SP score) in which all the constraints of $\Gamma$ approximately appear in the order of $C_1, C_2, \ldots, C_\gamma$ such that the Hamming distance $d_H(\mathrm{subseq}(S_i, L'_j), C_j) \le \varepsilon \lambda_j$ for all $1 \le i \le 2$ and $1 \le j \le \gamma$, where $0 \le \varepsilon < 1$, and $L'_j$ is the band of $L$ whose induced consensus is $C_j$. A band is a block of consecutive columns in $L$.

Design an efficient algorithm to find the CPSA using the dynamic programming technique or whatever method you prefer. For simplicity, you can ignore the open-gap penalty. Analyze the time and space complexity of your algorithm. Prove your algorithm is consistent with the constraint set $\Gamma$, i.e., it will find such a CPSA if there exists one.

39

Reference

Efficient Constrained Multiple Sequence Alignment with Performance Guarantee. Francis Y.L. Chin, N.L. Ho, T.W. Lam, Prudence W.H. Wong, M.Y. Chan

Divide-and-conquer multiple alignment with segment-based constraints. Michael Sammeth, Burkhard Morgenstern and Jens Stoye

Multiple sequence alignment with the divide-and-conquer method. Jens Stoye

MuSiC: A Tool for Multiple Sequence Alignment with Constraints. Yin Te Tsai, Chin Lung Lu, Ching Ta Yu, Yen Pin Huang
