
Approaches to Sequence Analysis


Page 1: Approaches to Sequence Analysis

Approaches to Sequence Analysis

Data: {GTCAT, GTTGGT, GTCA, CTCA}

Parsimony, similarity, optimisation.

GT-CAT
GTTGGT
GT-CA-
CT-CA-

[Figure: tree relating s1, s2, s3 and s4, labelled "statistics".]

Actual practice: 2-phase analysis.

Ideal practice: 1-phase analysis.

1. TKF91 - The combined substitution/indel process.

2. Acceleration of Basic Algorithm

3. Many Sequence Algorithm

4. MCMC Approaches

Page 2: Approaches to Sequence Analysis

Thorne-Kishino-Felsenstein (1991) Process

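λ (birth rate), μ (death rate)

[Figure: the sequence A # C G at T = 0 and its descendant links at T = t; # denotes a link, * the immortal link, - a gap.]

1. $P(s) = (1 - \lambda/\mu)\,(\lambda/\mu)^{l}\;\pi_A^{\#A} \cdots \pi_T^{\#T}$, where $l = \mathrm{length}(s)$.

2. Time reversible:

[Figure: s1 and s2 related through a root r, and s1 evolving directly into s2.]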
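The equilibrium distribution of the TKF91 process, $P(s) = (1 - \lambda/\mu)(\lambda/\mu)^{l}\prod_x \pi_x^{\#x}$ with birth rate λ and death rate μ, can be evaluated in log space as in the minimal sketch below. The function name `tkf91_equilibrium_logprob`, the uniform residue frequencies and the rate values are assumptions made for illustration; they do not come from the slides.

```python
import math
from collections import Counter


def tkf91_equilibrium_logprob(seq, lam, mu, pi):
    """Log of P(s) = (1 - lam/mu) * (lam/mu)**l * prod_x pi[x]**(#x in s)."""
    if not 0.0 < lam < mu:
        raise ValueError("the equilibrium distribution needs 0 < lam < mu")
    ratio = lam / mu
    # Geometric length distribution: (1 - lam/mu) * (lam/mu)**len(seq)
    logp = math.log(1.0 - ratio) + len(seq) * math.log(ratio)
    # Each residue contributes its equilibrium frequency.
    for residue, count in Counter(seq).items():
        logp += count * math.log(pi[residue])
    return logp


# Hypothetical parameters: uniform DNA frequencies and lam/mu = 0.5.
uniform = {base: 0.25 for base in "ACGT"}
for s in ["GTCAT", "GTTGGT", "GTCA", "CTCA"]:
    print(s, round(tkf91_equilibrium_logprob(s, 1.0, 2.0, uniform), 4))
```

Working with logarithms avoids underflow for protein-length sequences, where the probabilities themselves are far below floating-point range.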

Page 3: Approaches to Sequence Analysis

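λ & μ into Alignment Blocks

A. Amino Acids Ignored:

Link survives and has k descendants, itself included (# - - - over # # # #, k = 4):

$p_k(t) = e^{-\mu t}\,[1 - \lambda\beta]\,(\lambda\beta)^{k-1}$

Link dies and leaves k descendants (# - - - - over - # # # #, k = 4):

$p'_k(t) = [1 - e^{-\mu t} - \mu\beta]\,[1 - \lambda\beta]\,(\lambda\beta)^{k-1}$ for $k \ge 1$, and $p'_0(t) = \mu\beta(t)$

Immortal link has k descendants (* - - - - over * # # # #, k = 4):

$p''_k(t) = [1 - \lambda\beta]\,(\lambda\beta)^{k}$

where $\beta = [1 - e^{(\lambda-\mu)t}] / [\mu - \lambda e^{(\lambda-\mu)t}]$.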

B. Amino Acids Considered:

T - - - over R Q S W (k = 4): $P_t(T \to R)\,\pi_Q \cdots \pi_W\,p_4(t)$

T - - - - over - R Q S W (k = 4): $\pi_R\,\pi_Q \cdots \pi_W\,p'_4(t)$

Page 4: Approaches to Sequence Analysis

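Differential Equations for p-functions

[Figure: the three link fates. # - - ... - over # # # ... # ($p_k$); # - - - ... - over - # # # ... # ($p'_k$); * - - - ... - over * # # # ... # ($p''_k$).]

Initial conditions: $p_k(0) = p''_k(0) = p'_k(0) = 0$ for $k > 1$; $p_1(0) = p''_0(0) = 1$; $p'_0(0) = 0$.

$\Delta p_k = \Delta t\,[\lambda (k-1)\,p_{k-1} + \mu k\,p_{k+1} - (\lambda+\mu)\,k\,p_k]$

$\Delta p'_k = \Delta t\,[\lambda (k-1)\,p'_{k-1} + \mu (k+1)\,p'_{k+1} - (\lambda+\mu)\,k\,p'_k + \mu\,p_{k+1}]$

$\Delta p''_k = \Delta t\,[\lambda k\,p''_{k-1} + \mu (k+1)\,p''_{k+1} - (\lambda (k+1) + \mu k)\,p''_k]$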
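The standard TKF91 closed-form solutions for the three p-functions can be checked numerically against these equations. The sketch below is only an illustration: the function names and the values of λ, μ and t are assumptions, and the check is a central finite difference on $p_2$ rather than a proof.

```python
import math


def beta(lam, mu, t):
    """beta(t) = (1 - e^{(lam-mu)t}) / (mu - lam e^{(lam-mu)t}); needs lam != mu."""
    e = math.exp((lam - mu) * t)
    return (1.0 - e) / (mu - lam * e)


def p_survive(k, lam, mu, t):
    """p_k(t), k >= 1: the link survives and has k descendants, itself included."""
    b = beta(lam, mu, t)
    return math.exp(-mu * t) * (1.0 - lam * b) * (lam * b) ** (k - 1)


def p_die(k, lam, mu, t):
    """p'_k(t), k >= 0: the link dies and leaves k descendants."""
    b = beta(lam, mu, t)
    if k == 0:
        return mu * b
    return (1.0 - math.exp(-mu * t) - mu * b) * (1.0 - lam * b) * (lam * b) ** (k - 1)


def p_immortal(k, lam, mu, t):
    """p''_k(t), k >= 0: the immortal link has k descendants besides itself."""
    b = beta(lam, mu, t)
    return (1.0 - lam * b) * (lam * b) ** k


def rhs_p_survive(k, lam, mu, t):
    """Right-hand side of the differential equation for p_k (use k >= 2)."""
    return (lam * (k - 1) * p_survive(k - 1, lam, mu, t)
            + mu * k * p_survive(k + 1, lam, mu, t)
            - (lam + mu) * k * p_survive(k, lam, mu, t))


lam, mu, t, h = 0.05, 0.1, 1.0, 1e-5   # hypothetical values
numeric = (p_survive(2, lam, mu, t + h) - p_survive(2, lam, mu, t - h)) / (2 * h)
print(numeric, rhs_p_survive(2, lam, mu, t))
```

The two printed numbers should agree to within the finite-difference error. Summing `p_die(0)` with `p_survive(k)` and `p_die(k)` over k ≥ 1 gives a further sanity check, since the fates of one mortal link must have total probability 1.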

Page 5: Approaches to Sequence Analysis

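Basic Pairwise Recursion (O(length^3))

[Figure: cell (i, j) of the recursion table. Survives: link i has k = 1 … j descendants (j cases). Dies: link i leaves k = 0 … j descendants (j+1 cases).]

Example terms contributing to $P(s1_i \to s2_j)$:

Survives, k = 2: $P(s1_{i-1} \to s2_{j-2})\; p_2\; f(s1[i], s2[j-1])\; \pi(s2[j])$

Dies, k = 1: $P(s1_{i-1} \to s2_{j-1})\; p'_1\; \pi(s2[j])$

$p_k(t) = e^{-\mu t}\,[1 - \lambda\beta]\,(\lambda\beta)^{k-1}$, where $\beta = [1 - e^{(\lambda-\mu)t}] / [\mu - \lambda e^{(\lambda-\mu)t}]$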

Page 6: Approaches to Sequence Analysis

Basic Pairwise Recursion (O(length^3))

[Figure: recursion grid. Cell (i, j) is computed from cells (i-1, j), (i-1, j-1), …, (i-1, j-k), … through "survive" and "death" transitions.]

Initial condition: $P(s1_0 \to s2_j) = p''_j\,\pi(s2[1:j])$
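The pairwise recursion can be sketched as a table F[i][j] = P(s1[1:i] → s2[1:j]), filled row by row: the immortal link supplies row 0, and each further link of s1 either survives with k ≥ 1 descendants or dies leaving k ≥ 0. The code below is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy substitution function `f` and the parameter values are hypothetical, and the closed forms used for p, p' and p'' are the standard TKF91 ones.

```python
import math


def tkf91_block_probs(lam, mu, t, kmax):
    """Return lists (p, p1, p2) with p[k]=p_k, p1[k]=p'_k, p2[k]=p''_k for k <= kmax."""
    e = math.exp((lam - mu) * t)
    b = (1.0 - e) / (mu - lam * e)          # beta(t), requires lam != mu
    lb = lam * b
    surv = math.exp(-mu * t)
    p = [0.0] + [surv * (1 - lb) * lb ** (k - 1) for k in range(1, kmax + 1)]
    p1 = [mu * b] + [(1 - surv - mu * b) * (1 - lb) * lb ** (k - 1)
                     for k in range(1, kmax + 1)]
    p2 = [(1 - lb) * lb ** k for k in range(kmax + 1)]
    return p, p1, p2


def tkf91_pair_prob(s1, s2, lam, mu, t, pi, f):
    """P(s1 -> s2) by the O(len(s1) * len(s2)^2) recursion over link fates."""
    n, m = len(s1), len(s2)
    p, p1, p2 = tkf91_block_probs(lam, mu, t, m)
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    # Initial condition: the immortal link alone produces s2[1:j].
    ins = 1.0
    for j in range(m + 1):
        F[0][j] = p2[j] * ins
        if j < m:
            ins *= pi[s2[j]]
    for i in range(1, n + 1):
        for j in range(m + 1):
            total = F[i - 1][j] * p1[0]      # link i dies without descendants
            tail = 1.0                       # pi-product of the last k-1 residues
            for k in range(1, j + 1):
                first = s2[j - k]            # first residue of the block of size k
                prev = F[i - 1][j - k]
                # Survives: `first` is homologous to s1[i], the rest are insertions.
                total += prev * p[k] * f(s1[i - 1], first) * tail
                tail *= pi[first]
                # Dies: all k residues of the block are insertions.
                total += prev * p1[k] * tail
            F[i][j] = total
    return F[n][m]


# Hypothetical substitution model: uniform frequencies, 0.85 chance of no change.
PI = {x: 0.25 for x in "ACGT"}


def f(a, b):
    return 0.85 if a == b else 0.05


print(tkf91_pair_prob("GTCAT", "GTTGGT", lam=0.05, mu=0.1, t=1.0, pi=PI, f=f))
```

The three nested loops over i, j and k give the cubic cost quoted in the slide title. For sequences of realistic length the products underflow, so a practical version would work in log space or rescale each row.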

Page 7: Approaches to Sequence Analysis

Acceleration of Pairwise Algorithm (from Hein, Wiuf, Knudsen, Moeller & Wiebling, 2000)

Corner cutting: ~100-1000

Better numerical search: ~10-100 (e.g. good start guess, 28 evaluations, 3 iterations)

Simpler recursion: ~3-10

Faster computers: ~250

1991 --> 2000: ~10^6
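As a rough plausibility check on the overall figure, the individual ranges can be multiplied together. This assumes the four speed-ups combine multiplicatively and independently, which the slide does not state; the dictionary keys are just labels for the example.

```python
# Speed-up ranges quoted on the slide as (low, high) factors.
factors = {
    "corner cutting": (100, 1000),
    "better numerical search": (10, 100),
    "simpler recursion": (3, 10),
    "faster computers": (250, 250),
}

low = high = 1
for low_factor, high_factor in factors.values():
    low *= low_factor
    high *= high_factor

print(f"combined speed-up between {low:.1e} and {high:.1e}")
```

Under that assumption the quoted total of about 10^6 sits near the lower end of the combined range.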

Page 8: Approaches to Sequence Analysis

α-globin (141) and β-globin (146) (from Hein, Wiuf, Knudsen, Moeller & Wiebling, 2000)

430.108 : -log(α-globin)
327.320 : -log(α-globin --> β-globin)
747.428 : -log(α-globin, β-globin) = -log(l(sumalign))

λ*t: 0.0371805 +/- 0.0135899
μ*t: 0.0374396 +/- 0.0136846
s*t: 0.91701 +/- 0.119556

E(Length)   E(Insertions,Deletions)   E(Substitutions)
143.499     5.37255                   131.59

Maximum contributing alignment:

V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS--H---GSAQVKGHGKKVADALT
VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFS

NAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
DGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

Ratio l(maxalign)/l(sumalign) = 0.00565064

Page 9: Approaches to Sequence Analysis

The invasion of the immortal link

VLSPADNAL.....DLHAHKR 141 AA long

???????????????????? k AA long

2 × 10^7 years

2 × 10^8 years

2 × 10^9 years

*########### …. ### 141 AA long

*########### …. ###

*########### …. ###

10^9 years

Page 10: Approaches to Sequence Analysis

Algorithm for alignment on star tree (O(length^6)) (Steel & Hein, 2001)

* ()*######

P(S) = (1 - λ/μ) [ P*(S) + … P#(Tail) P(S - Tail) ]

[Figure: star tree with ancestor a and leaves s1, s2, s3.]

*ACGC *TT GT

*ACG GT

Page 11: Approaches to Sequence Analysis

Binary Tree Problem

The problem would be simpler if:

i. The ancestral sequences and their alignment were known.

ii. The alignment of ancestral alignment columns to leaf sequences was known.

[Figure: binary tree with leaves s1, s2, s3, s4, internal nodes a1 and a2, and the sequences ACCT, GTT, TGA, ACG; below it, an example alignment of a1 and a2 written with the symbols *, # and -.]

How to sum over all possible ancestral sequences and their alignments?

A Markov chain generating ancestral alignments can solve the problem!

Page 12: Approaches to Sequence Analysis

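Generating Ancestral Alignments

[Figure: Markov chain generating the alignment of the ancestral sequences a1 and a2. The states are alignment columns pairing a1 with a2: * with * (start), # with #, - with #, # with -, and the end state E.]

[Table: transition probabilities between these states.]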

Page 13: Approaches to Sequence Analysis

The Basic Recursion

[Figure: path from the start state S to the end state E.]

"Remove 1st step" recursion:

"Remove last step" recursion:

Last/First step removal are inequivalent, but have the same complexities.

First step algorithm is the simplest.

Page 14: Approaches to Sequence Analysis

Sequence Recursion: First Step Removal

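$P(S_k)$: probability of the epifixes (S[k+1:l]), given the state in which the Markov chain starts.

[Equation: recursion for $P(S_i)$ as a sum over $H \in C$ of terms $P'({}_kS_i, H)\,P(\ldots)\,P(S_k)$, terminating at the end state E.]

where

$P'({}_kS_i, H) = \Big(\prod_{j:H(j)=0} p'_{k}(t_j)\,\pi(s_j[i(j):k(j)])\Big)\Big(\prod_{j:H(j)=1} p_{k}(t_j)\,\pi(s_j[i(j)+1:k(j)])\Big)\,F({}_kS_i, H)$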

Page 15: Approaches to Sequence Analysis

Human alpha hemoglobin; Human beta hemoglobin; Human myoglobin; Bean leghemoglobin

Probability of data: e^{-1560.138}

Probability of data and alignment: e^{-1593.223}

Probability of alignment given data: 4.279 * 10^{-15} = e^{-33.085}

Ratio of insertion-deletions to substitutions: 0.0334
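The three probabilities above are linked by P(alignment | data) = P(data, alignment) / P(data). The short check below works on the log scale, which is presumably how such small numbers are handled in practice; the variable names are illustrative only.

```python
import math

log_p_data = -1560.138            # log P(data), from the slide
log_p_data_alignment = -1593.223  # log P(data, alignment), from the slide

# P(alignment | data) = P(data, alignment) / P(data)
log_p_alignment_given_data = log_p_data_alignment - log_p_data
p_alignment_given_data = math.exp(log_p_alignment_given_data)

print(round(log_p_alignment_given_data, 3), p_alignment_given_data)
```

The difference of the two exponents reproduces the exponent -33.085 quoted for the conditional probability.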

Maximum likelihood phylogeny and alignment

Gerton Lunter

Istvan Miklos

Alexei Drummond

Yun Song

Page 16: Approaches to Sequence Analysis

Metropolis-Hastings Statistical Alignment (Lunter, Drummond, Miklos, Jensen & Hein, 2005)