Overview ( stats.ox.ac.uk/people/hein/lectures.htm ) Pairwise Alignment Again

Overview(http://www.stats.ox.ac.uk/people/hein/lectures.htm)

Pairwise Alignment Again

Triple – Quadruple - Many

Similarity-Distance Conversion

Local Alignment

Statistical alignment: Pairwise, Multiple

Conclusion

Approaches to Data Analysis

Data {GTCAT,GTTGGT,GTCA,CTCA}

GT-CAT

GTTGGT

GT-CA-

CT-CA-

[Figure: phylogeny relating s1, s2, s3 and s4, with two boxes labelled "statistics"]

Parsimony, similarity, optimisation.

Actual Practice: 2 phase analysis. Ideal Practice: 1 phase analysis.

Parsimony Alignment of two strings.

Sequences: s1=CTAGG s2=TTGT.

Basic operations: transitions 2 (C-T & A-G), transversions 5, indels (g) 10.

{CTAG,TTG}AL = {CTA,TT}AL + GG  ∪  {CTA,TTG}AL + G-  ∪  {CTAG,TT}AL + -G

Initial condition: D_{0,0} = 0. (D_{i,j} = D(s1[1:i], s2[1:j]))

D_{i,j} = min{ D_{i-1,j-1} + d(s1[i],s2[j]), D_{i,j-1} + g, D_{i-1,j} + g }

D_{4,3} = D_{CTAG,TTG} = min{ D_{CTA,TT} + w(GG) = 12 + 0 = 12,
                              D_{CTA,TTG} + w(G-) = 4 + 10 = 14,
                              D_{CTAG,TT} + w(-G) = 22 + 10 = 32 } = 12

T   40  32  22  14   9  17
G   30  22  12   4  12  22
T   20  12   2  12  22  32
T   10   2  10  20  30  40
     0  10  20  30  40  50
         C   T   A   G   G

Alignment (i = transition, v = transversion):

CTAGG
TT-GT
i   v     Cost 17
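The recursion above can be checked with a short script. This is a minimal sketch, assuming the cost scheme stated on the slide (transition 2, transversion 5, indel 10); the function names `sub_cost` and `parsimony_matrix` are illustrative only.

```python
PURINES = {"A", "G"}  # A and G are purines; C and T are pyrimidines


def sub_cost(a, b):
    """Substitution cost d(a, b): 0 identity, 2 transition, 5 transversion."""
    if a == b:
        return 0
    # A transition stays within purines or within pyrimidines.
    return 2 if (a in PURINES) == (b in PURINES) else 5


def parsimony_matrix(s1, s2, gap=10):
    """Fill D[i][j] = cost of the best alignment of s1[:i] and s2[:j]."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap
    for j in range(1, m + 1):
        D[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + sub_cost(s1[i - 1], s2[j - 1]),
                D[i][j - 1] + gap,
                D[i - 1][j] + gap,
            )
    return D


D = parsimony_matrix("CTAGG", "TTGT")
print(D[4][3], D[5][4])  # prints 12 17
```

The two printed values are the worked entry D_{4,3} and the total alignment cost from the matrix.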

CA

A?

Alignment of three sequences.

s1=ATCG s2=ATGCC s3=CTCC

Alignment:

AT-CG
ATGCC
CT-CC

Consensus sequence: ATCC

Configurations in an alignment column:

- - n n n - n -
- n - n - n n -
n - - - n n n -

Recursion: D_{i,j,k} = min{ D_{i-i',j-j',k-k'} + d(i,i',j,j',k,k') }, (i',j',k') in {0,1}^3 \ {(0,0,0)}

Initial condition: D_{0,0,0} = 0.

Running time: l1*l2*l3*(2^3-1) Memory requirement: l1*l2*l3

New phenomena: ancestral sequence.

[Figure: tree with nucleotides G, G, C and C at its nodes]

Parsimony Alignment of four sequences

s1=ATCG s2=ATGCC s3=CTCC s4=ACGCG

Alignment:

AT-CG
ATGCC
CT-CC
ACGCG

Configurations in alignment columns:

- - - n - - - n n n - n n n n -
- - n - n n - n - - n - n n n -
- n - - n - n - n - n n - n n -
n - - - - n n - - n n n n - n -

Recursion: D_i = min{ D_{i-∆} + d(i,∆) }, ∆ in {0,1}^4 \ {0}^4

Initial condition: D_0 = 0.

Computation time: l1*l2*l3*l4*2^4 Memory: l1*l2*l3*l4

Alignment of many sequences.

s1=ATCG, s2=ATGCC, ......., sn=ACGCG

Alignment:

AT-CG
ATGCC
.....
.....
ACGCG

Tree:

s1   s3   s4
  \   |   /
 ----------
  /      \
s2        s5

Configurations in an alignment column: 2^n - 1

Recursion: D_i = min{ D_{i-∆} + d(i,∆) }, ∆ in {0,1}^n \ {0}^n

Initial condition: D_{0,0,..,0} = 0.

Computation time: l^n*(2^n-1)*n Memory requirement: l^n (l: sequence length, n: number of sequences)
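The growth in column configurations and computation time can be made concrete. This is a small sketch; `column_configurations` and `dp_work` are illustrative names, and `dp_work` simply evaluates the l^n*(2^n-1)*n expression given above.

```python
from itertools import product


def column_configurations(n):
    """All gap/nucleotide patterns for one column of n sequences,
    excluding the all-gap column."""
    return [c for c in product("-n", repeat=n) if "n" in c]


def dp_work(length, n):
    """Operation count l^n * (2^n - 1) * n for n sequences of length l."""
    return length ** n * (2 ** n - 1) * n


for n in (2, 3, 4):
    print(n, len(column_configurations(n)), dp_work(100, n))
```

For n = 3 and n = 4 the enumeration gives 7 and 15 patterns, the non-empty columns of the configuration tables for three and four sequences.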

Fitch-Hartigan-Sankoff Algorithm

Costs: Transition 2, Transversion 5, indel 10.

                (A,C,G,T)
                (9,7,7,7)
                 /     \
                /       \
        (A,C,G,T)        \
       (10,2,10,2)        \
         /     \           \
        /       \           \
 (A,C,G,T)   (A,C,G,T)   (A,C,G,T)
  * 0 * *     * * * 0     * * 0 *

Indel constraint: the nodes carrying nucleotides form a connected set.
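The cost vectors in the tree above can be reproduced with a short recursive routine. This is a minimal sketch of the Sankoff-style cost propagation for a single site, assuming substitutions only (no indels) and the transition/transversion costs from the slide; the nested-tuple tree encoding is an illustrative choice.

```python
import math

STATES = "ACGT"
PURINES = {"A", "G"}


def sub_cost(a, b):
    """0 for identity, 2 for a transition, 5 for a transversion."""
    if a == b:
        return 0
    return 2 if (a in PURINES) == (b in PURINES) else 5


def sankoff(node):
    """Return the cost vector (A, C, G, T) for a node.

    A leaf is a single nucleotide; an internal node is a tuple of children.
    """
    if isinstance(node, str):
        return [0 if s == node else math.inf for s in STATES]
    child_vectors = [sankoff(child) for child in node]
    vector = []
    for s in STATES:
        total = 0
        for costs in child_vectors:
            # Cheapest way for this child to hold state t below parent state s.
            total += min(c + sub_cost(s, t) for c, t in zip(costs, STATES))
        vector.append(total)
    return vector


print(sankoff(("C", "T")))         # [10, 2, 10, 2]
print(sankoff((("C", "T"), "G")))  # [9, 7, 7, 7]
```

The minimum of the root vector (7) is the parsimony cost of the site.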

Longer Indels

TCATGGTACCGTTAGCGT
GCA-----------GCAT

g_k : cost of indel of length k.

Initial condition: D_{0,0} = 0

D_{i,j} = min { D_{i-1,j-1} + d(s1[i],s2[j]),
                D_{i,j-1} + g_1, D_{i,j-2} + g_2, D_{i,j-3} + g_3, ...,
                D_{i-1,j} + g_1, D_{i-2,j} + g_2, D_{i-3,j} + g_3, ... }

Cubic running time. Quadratic memory.

If g_k = a + b*k, then quadratic running time.

Gotoh (1982): D_{i,j} is split into 3 types:

1. D0_{i,j} as D_{i,j}, except s1[i] must match s2[j].

2. D1_{i,j} as D_{i,j}, except s2[j] is matched with "-".

3. D2_{i,j} as D_{i,j}, except s1[i] is matched with "-".

Then:

D0_{i,j} = min(D0_{i-1,j-1}, D1_{i-1,j-1}, D2_{i-1,j-1}) + d(s1[i],s2[j])

D1_{i,j} = min(D1_{i,j-1} + b, D0_{i,j-1} + a + b)

D2_{i,j} = min(D2_{i-1,j} + b, D0_{i-1,j} + a + b)
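The three-matrix recursion can be sketched as follows. This is a hedged illustration rather than a reference implementation: the gap parameters a = 3 and b = 1 are assumed values (the slide gives none), the boundary rows are filled with a single gap of cost a + b*k, and, as in the recursion above, a gap is only opened from a match state.

```python
import math

INF = math.inf
PURINES = {"A", "G"}


def sub_cost(a, b):
    """0 for identity, 2 for a transition, 5 for a transversion."""
    if a == b:
        return 0
    return 2 if (a in PURINES) == (b in PURINES) else 5


def gotoh_distance(s1, s2, a=3, b=1):
    """Alignment cost with affine gap cost g_k = a + b*k (a, b assumed)."""
    n, m = len(s1), len(s2)
    D0 = [[INF] * (m + 1) for _ in range(n + 1)]  # last column: s1[i] over s2[j]
    D1 = [[INF] * (m + 1) for _ in range(n + 1)]  # last column: '-' over s2[j]
    D2 = [[INF] * (m + 1) for _ in range(n + 1)]  # last column: s1[i] over '-'
    D0[0][0] = 0
    for j in range(1, m + 1):
        D1[0][j] = a + b * j
    for i in range(1, n + 1):
        D2[i][0] = a + b * i
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D0[i][j] = min(D0[i - 1][j - 1], D1[i - 1][j - 1],
                           D2[i - 1][j - 1]) + sub_cost(s1[i - 1], s2[j - 1])
            D1[i][j] = min(D1[i][j - 1] + b, D0[i][j - 1] + a + b)
            D2[i][j] = min(D2[i - 1][j] + b, D0[i - 1][j] + a + b)
    return min(D0[n][m], D1[n][m], D2[n][m])


print(gotoh_distance("ACGT", "AT"))  # 5: one gap of length 2 (3 + 2*1)
```

Each cell is filled in constant time, which is where the quadratic running time for g_k = a + b*k comes from.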

Comment:

1. Evolutionary Consistency Condition: g_i + g_j > g_{i+j}

Distance-Similarity.(Smith-Waterman-Fitch,1982)

D_{i,j} = min{ D_{i-1,j-1} + d(s1[i],s2[j]), D_{i,j-1} + g, D_{i-1,j} + g }

S_{i,j} = max{ S_{i-1,j-1} + s(s1[i],s2[j]), S_{i,j-1} - w, S_{i-1,j} - w }

Distance: Transitions: 2 Transversions: 5 Indels: 10

M: largest distance between two nucleotides (5).

Similarity:

s(n1,n2) = M - d(n1,n2)
w_k = k/(2*M) + g_k
w = 1/(2*M) + g

Similarity Parameters: Transversions: 0 Transitions: 3 Identity: 5 Indels: 10 + 1/10

Each entry is distance/similarity:

T   40/-40.4   32/-27.3   22/-12.2   14/0.9     9/11.0    17/2.9
G   30/-30.3   22/-17.2   12/-2.1     4/11.0   12/2.9     22/-7.2
T   20/-20.2   12/-7.1     2/8.0     12/-2.1   22/-12.2   32/-22.3
T   10/-10.1    2/3.0     10/-7.1    20/-17.2  30/-27.3   40/-37.4
     0/0       10/-10.1   20/-20.2   30/-30.3  40/-40.4   50/-50.5
                C          T          A         G          G

Comments

1. The switch from Dist to Sim is highly analogous to maximizing {-f(x)} instead of minimizing {f(x)}.

2. Dist will be based on a metric: i. d(x,x) = 0, ii. d(x,y) >= 0, iii. d(x,y) = d(y,x) & iv. d(x,z) + d(z,y) >= d(x,y).

There are no analogous restrictions on Sim, giving it a larger parameter space.
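The similarity half of the distance/similarity table can be recomputed directly. This is a sketch under the slide's stated parameters (M = 5, s = M - d, w = g + 1/(2*M) = 10.1); the function names are illustrative.

```python
PURINES = {"A", "G"}


def sub_cost(a, b):
    """Distance d: 0 identity, 2 transition, 5 transversion."""
    if a == b:
        return 0
    return 2 if (a in PURINES) == (b in PURINES) else 5


def similarity_matrix(s1, s2, g=10, M=5):
    """Global similarity S with s(a, b) = M - d(a, b) and gap penalty w."""
    w = g + 1 / (2 * M)
    n, m = len(s1), len(s2)
    S = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        S[i][0] = -i * w
    for j in range(1, m + 1):
        S[0][j] = -j * w
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            S[i][j] = max(
                S[i - 1][j - 1] + (M - sub_cost(s1[i - 1], s2[j - 1])),
                S[i][j - 1] - w,
                S[i - 1][j] - w,
            )
    return S


S = similarity_matrix("CTAGG", "TTGT")
print(round(S[5][4], 1))  # 2.9, the similarity entry paired with distance 17
```

The best-scoring alignment is the same one found by the distance recursion (CTAGG over TT-GT): 3 + 5 - 10.1 + 5 + 0 = 2.9.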

Local alignment: Smith, Waterman (1981)

Global alignment: S_{i,j} = max{ S_{i-1,j-1} + s(s1[i],s2[j]), S_{i,j-1} - w, S_{i-1,j} - w }

Local: S_{i,j} = max{ S_{i-1,j-1} + s(s1[i],s2[j]), S_{i,j-1} - w, S_{i-1,j} - w, 0 }

Score Parameters: Match: 1, Mismatch: -1/3, Gap: 1 + k/3

Best local alignment:

GCC-UCG
GCCAUUG

     0   1   0   .6  1   2   .6  1.6 1.6 3   2.6
C    0   0   1   0   1   .3  .6  0.6 2   3   1.6
A    0   0   0   1.3 0   1   1   2   3.3 2   1.6
G    0   0   .3  .3  1.3 1   2.3 2.3 2   .6  1.6
C    0   0   .6  1.6 .3  1.3 2.6 2.3 1   .6  1.6
U    0   0   2   .6  .3  1.6 2.6 1.3 1   .6  1
A    0   1   .6  0   1   3   1.6 1.3 1   1.3 1.6
C    0   1   0   0   2   1.3 .3  1   .3  2   .6
C    0   0   0   1   .3  0   0   .6  1   0   0
G    0   0   0   .6  1   0   0   0   1   1   2
U    0   0   1   .6  0   0   0   0   0   0   0
A    0   0   1   0   0   0   0   0   0   0   0
A    0   0   0   0   0   0   0   0   0   0   0
     C   A   G   C   C   U   C   G   C   U   U
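The local recursion with the score parameters above can be sketched as follows. This is an illustrative version that allows gaps of any length k with penalty 1 + k/3, which makes it cubic in time; the function name and the choice to return both the best score and the matrix are assumptions made for the example.

```python
def smith_waterman(s1, s2, match=1.0, mismatch=-1 / 3,
                   gap=lambda k: 1 + k / 3):
    """Best local alignment score with a general gap penalty gap(k)."""
    n, m = len(s1), len(s2)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score = match if s1[i - 1] == s2[j - 1] else mismatch
            candidates = [0.0, H[i - 1][j - 1] + score]
            # Gap of length k ending here, in either sequence.
            candidates += [H[i - k][j] - gap(k) for k in range(1, i + 1)]
            candidates += [H[i][j - k] - gap(k) for k in range(1, j + 1)]
            H[i][j] = max(candidates)
            best = max(best, H[i][j])
    return best, H


best, _ = smith_waterman("GCCUCG", "GCCAUUG")
print(round(best, 2))  # 3.33
```

The score 3.33 is five matches, one mismatch and one gap of length 1: 5 - 1/3 - 4/3. The extra 0 in the maximum is what lets an alignment start anywhere.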

[Figure: guide tree with leaves Sodh, Sodb, Sodl, Sddm, Sdmz, Sods, Sdpb]

Progressive Alignment(Feng-Doolittle 1987 J.Mol.Evol.)

Alignments can themselves be aligned, so given a tree a multiple alignment can be built up progressively.

    *
alkmny-trwq
akkmdyftrwq
kkkmemftrwq

    *
acdeqrt
acdehrt

Score for aligning the two starred columns:

[ P(n,q) + P(n,h) + P(d,q) + P(d,h) + P(e,q) + P(e,h) ]/6
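The column-against-column score is an average over all residue pairs, one from each alignment. Here is a minimal sketch; the identity scoring function `toy_score` is a hypothetical stand-in for P, which in practice would be an amino acid scoring matrix.

```python
from itertools import product


def column_score(col_a, col_b, P):
    """Average of P(x, y) over all pairs with x in col_a and y in col_b."""
    pairs = list(product(col_a, col_b))
    return sum(P(x, y) for x, y in pairs) / len(pairs)


def toy_score(x, y):
    """Hypothetical stand-in for P: 1 for identical residues, else 0."""
    return 1.0 if x == y else 0.0


# Column (n, d, e) against column (q, h): six pairs, as in the formula above.
print(column_score("nde", "qh", toy_score))  # 0.0
```

With this averaged score in place of d(s1[i],s2[j]), the pairwise recursion can be reused unchanged to align two alignments.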

* * *** * * * * * *
Sodh atkavcvlkgdgpqvqgsinfeqkesdgpvkvwgsikglte-glhgfhvhqfg----ndtagct sagphfnp lsrk
Sodb atkavcvlkgdgpqvqgtinfeak-gdtvkvwgsikglte—-glhgfhvhqfg----ndtagct sagphfnp lsrk
Sodl atkavcvlkgdgpqvqgsinfeqkesdgpvkvwgsikglte-glhgfhvhqfg----ndtagct sagphfnp lsrk
Sddm atkavcvlkgdgpqvq -infeak-gdtvkvwgsikglte—-glhgfhvhqfg----ndtagct sagphfnp lsrk
Sdmz atkavcvlkgdgpqvq— infeqkesdgpvkvwgsikglte—glhgfhvhqfg----ndtagct sagphfnp Lsrk
Sods vatkavcvlkgdgpqvq— infeak-gdtvkvwgsikgltepnglhgfhvhqfg----ndtagct sagphfnp lsrk
Sdpb datkavcvlkgdgpqvq—-infeqkesdgpv----wgsikgltglhgfhvhqfgscasndtagctvlggssagphfnpehtnk

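Thorne-Kishino-Felsenstein Process

[Figure: a sequence of links (* and #, carrying nucleotides such as A, C, G) evolving from time 0 to time t]

P(s) = (1 - λ/μ) (λ/μ)^l π_A^{#A} * .. * π_T^{#T}

Time reversible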

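Time reversibility

[Figure: an ancestor a with branches of length t1 to s1 and t2 to s2, next to a single branch of length t1 + t2 joining s1 and s2]

P_{i,j}(t) = probability that i has evolved into j after time t.

π(i) = probability of i after infinitely long time - equilibrium distribution

π(i) P_{i,j}(t) = π(j) P_{j,i}(t)

[Figure: time axis t1 --- t2 --- t3]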

Diff. Equations for p-functions

# - - ... -
# # # ... #

Δp_k = Δt*[λ*(k-1)*p_{k-1} + μ*k*p_{k+1} - (λ+μ)*k*p_k]

# - - - ... -
- # # # ... #

Δp'_k = Δt*[λ*(k-1)*p'_{k-1} + μ*(k+1)*p'_{k+1} - (λ+μ)*k*p'_k + μ*p_{k+1}]

* - - - ... -
* # # # ... #

Δp''_k = Δt*[λ*k*p''_{k-1} + μ*(k+1)*p''_{k+1} - ((k+1)*λ + μ*k)*p''_k]

Initial conditions: p_1(0) = 1 and p_k(0) = 0 for k > 1; p'_k(0) = 0 for k >= 0; p''_0(0) = 1 and p''_k(0) = 0 for k >= 1.

& into Alignment Blocks

A. Amino Acids Ignored:

# - - -      # - - - -      * - - - -
# # # #      - # # # #      * # # # #
   k             k              k

p_k(t)   = e^{-μt} [1-λβ(t)] (λβ(t))^{k-1}
p'_k(t)  = [1-e^{-μt}-μβ(t)] [1-λβ(t)] (λβ(t))^{k-1}
p''_k(t) = [1-λβ(t)] (λβ(t))^k

p'_0(t) = μβ(t), where β(t) = [1-e^{(λ-μ)t}]/[μ-λe^{(λ-μ)t}]

B. AA Considered:

T - - -
R Q S W        P_t(T-->R)*π_Q*..*π_W*p_4(t)
   4

T - - - -
- R Q S W      π_R*π_Q*..*π_W*p'_4(t)
   4
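The closed forms of the p-functions are easy to evaluate numerically. This is a sketch assuming λ differs from μ (otherwise β(t) needs its limiting form) and using the indexing of the block formulas: p_k for k >= 1, p'_k and p''_k for k >= 0. The parameter values are the λ*t and μ*t estimates quoted for the globin example, with t set to 1.

```python
import math


def tkf_functions(lam, mu, t):
    """Return the functions p, p' and p'' of the TKF model (lam != mu)."""
    e = math.exp((lam - mu) * t)
    beta = (1 - e) / (mu - lam * e)
    lb, mb = lam * beta, mu * beta
    survive = math.exp(-mu * t)

    def p(k):  # link survives and has k descendants, itself included
        return survive * (1 - lb) * lb ** (k - 1) if k >= 1 else 0.0

    def p_dead(k):  # link dies and leaves k descendants
        if k == 0:
            return mb
        return (1 - survive - mb) * (1 - lb) * lb ** (k - 1)

    def p_immortal(k):  # immortal link gains k descendants
        return (1 - lb) * lb ** k

    return p, p_dead, p_immortal


p, p_dead, p_immortal = tkf_functions(0.0371805, 0.0374396, 1.0)
for k in range(4):
    print(k, p(k), p_dead(k), p_immortal(k))
```

For each link type the probabilities over k sum to one: p and p' together describe the fate of a mortal link, and p'' that of the immortal link.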

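Basic Pairwise Recursion (O(length^3))

[Figure: the last link s1[i] either survives or dies, and its descendants form the end of s2[1:j]. Example term for a surviving link with two descendants:]

P(s1_{i-1} -> s2_{j-2}) * p_2 * f(s1[i],s2[j-1]) * π(s2[j])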

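Fundamental Pairwise Recursion.

Notation: s1_i = s1[1:i], s2_j = s2[1:j]; π(x) is the product of the equilibrium frequencies of the letters in x.

P(s1_i -> s2_j) = p'_0 * P(s1_{i-1} -> s2_j)
    + Σ_{k=1..j} P(s1_{i-1} -> s2_{j-k}) * ( p_k * f(s1[i],s2[j-k+1]) * π(s2[j-k+2:j]) + p'_k * π(s2[j-k+1:j]) )

Initial condition: P(s1_0 -> s2_j) = p''_j * π(s2[1:j])

Simplification (using p_{k+1} = λβ*p_k and p'_{k+1} = λβ*p'_k for k >= 1):

R_{i,j} = ( p_1*f(s1[i],s2[j]) + p'_1*π(s2[j]) ) * P(s1_{i-1} -> s2_{j-1})

P(s1_i -> s2_j) = R_{i,j} + p'_0*P(s1_{i-1} -> s2_j) + λβ*π(s2[j]) * ( P(s1_i -> s2_{j-1}) - p'_0*P(s1_{i-1} -> s2_{j-1}) )

P(s1_i -> s2_j) = p'_0*P(s1_{i-1} -> s2_j) + λβ*π(s2[j])*P(s1_i -> s2_{j-1}) + ( p_1*f(s1[i],s2[j]) + p'_1*π(s2[j]) - λβ*p'_0*π(s2[j]) ) * P(s1_{i-1} -> s2_{j-1})

Probability of observation: P(s1, s2) = P(s1) * P(s1 -> s2)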
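The cubic recursion (a sum over the number k of descendants of s1[i]) and its quadratic simplification can be compared numerically. This is a sketch under loud assumptions: nucleotide sequences with uniform equilibrium frequencies of 0.25, a simple substitution model f in which a site is resampled from equilibrium with probability 1 - e^{-s*t}, and illustrative parameter values. It assumes that the p and p' functions are geometric in k with ratio λβ(t).

```python
import math

PI = 0.25  # assumed uniform equilibrium frequency of each nucleotide


def p_functions(lam, mu, t):
    e = math.exp((lam - mu) * t)
    beta = (1 - e) / (mu - lam * e)
    lb, mb, surv = lam * beta, mu * beta, math.exp(-mu * t)
    p = lambda k: surv * (1 - lb) * lb ** (k - 1)  # k >= 1
    pd = lambda k: mb if k == 0 else (1 - surv - mb) * (1 - lb) * lb ** (k - 1)
    pim = lambda k: (1 - lb) * lb ** k             # k >= 0
    return p, pd, pim, lb


def f(a, b, s, t):
    """Assumed substitution probability P_t(a -> b)."""
    stay = math.exp(-s * t)
    return stay * (a == b) + (1 - stay) * PI


def transition_prob(s1, s2, lam, mu, s, t, quadratic=True):
    """P(s1 -> s2) by the quadratic or the cubic recursion."""
    p, pd, pim, lb = p_functions(lam, mu, t)
    n, m = len(s1), len(s2)
    A = [[0.0] * (m + 1) for _ in range(n + 1)]
    for j in range(m + 1):
        A[0][j] = pim(j) * PI ** j  # immortal link produces s2[1:j]
    for i in range(1, n + 1):
        A[i][0] = pd(0) * A[i - 1][0]
        for j in range(1, m + 1):
            if quadratic:
                r = (p(1) * f(s1[i - 1], s2[j - 1], s, t) + pd(1) * PI) * A[i - 1][j - 1]
                A[i][j] = (pd(0) * A[i - 1][j] + r
                           + lb * PI * (A[i][j - 1] - pd(0) * A[i - 1][j - 1]))
            else:
                total = pd(0) * A[i - 1][j]
                for k in range(1, j + 1):  # s1[i] accounts for s2[j-k+1 : j]
                    survives = p(k) * f(s1[i - 1], s2[j - k], s, t) * PI ** (k - 1)
                    dies = pd(k) * PI ** k
                    total += A[i - 1][j - k] * (survives + dies)
                A[i][j] = total
    return A[n][m]


args = ("ACGT", "AGT", 0.03, 0.04, 0.9, 1.0)
print(transition_prob(*args, quadratic=True))
print(transition_prob(*args, quadratic=False))
```

Both calls should print the same probability up to rounding, which is a quick check that the simplification is consistent with the full sum.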

α-globin (141) and β-globin (146)

430.108 : -log(α-globin)
327.320 : -log(α-globin -> β-globin)
730.428 : -log(l(sumalign))

λ*t: 0.0371805 +/- 0.0135899
μ*t: 0.0374396 +/- 0.0136846
s*t: 0.91701 +/- 0.119556

E(Length)   E(Insertions,Deletions)   E(Substitutions)
143.499     5.37255                   131.59

Maximum contributing alignment:

V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS--H---GSAQVKGHGKKVADAL
VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAF

TNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
SDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH

Ratio l(maxalign)/l(sumalign) = 0.00565064

Probability for substitution: = 0.46

#children   p            p'           p''
0           0.0          3.60*10^-2   9.64*10^-1
1           9.28*10^-1   6.30*10^-4   3.45*10^-2
2           3.32*10^-2   2.26*10^-5   1.23*10^-3
3           1.19*10^-3   8.10*10^-7   4.43*10^-5
4           4.27*10^-5   2.90*10^-8   1.59*10^-6
5           1.53*10^-6   1.04*10^-9   5.70*10^-8

(t) = 9.64 * 10-1 (t) = 3.46 * 10-2

Length Evolution + Immortal Link

Accelerations of pairwise algorithm

1

2 - Better numerical search

3 - Simpler recursion

1991->2000: a 10^6-fold acceleration for 2 proteins 1500 long.

4 - Better computers

Likelihood Surface

Homology Test

W_{i,j} = -ln( π_i*P^{2.5}_{i,j} / (π_i*π_j) )

D(s1,s2) is evaluated against the distribution of D(s1,s2*), where s2* is a permuted copy of s2.

Real                     Random
s1 = ATWYFCAK-AC         s1  = ATWYFC-AKAC
s2 = ETWYKCALLAD         s2* = LTAYKADCWLE
      *** **  *                 * *

This test:

1. Tests the competing hypotheses that the 2 sequences are 2.5 events apart versus infinitely far apart.

2. It only handles substitutions "correctly". The rationale for indel costs is more arbitrary.

3. It samples from (π_i*π_j) by permuting the order of amino acids in the second sequence, i.e. it draws without replacement (a hypergeometric distribution).
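The permutation step of the test can be sketched as follows. This is an illustration only: the distance used here is a plain unit-cost alignment distance (mismatch 1, gap 2, both assumed values) rather than the W_{i,j} weights, and the names `simple_distance` and `permutation_test` are hypothetical.

```python
import random


def simple_distance(s1, s2, mismatch=1, gap=2):
    """Stand-in alignment distance (not the W weights of the slide)."""
    n, m = len(s1), len(s2)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * gap
    for j in range(1, m + 1):
        D[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if s1[i - 1] == s2[j - 1] else mismatch
            D[i][j] = min(D[i - 1][j - 1] + d, D[i][j - 1] + gap, D[i - 1][j] + gap)
    return D[n][m]


def permutation_test(s1, s2, distance, n_perm=1000, seed=0):
    """Compare D(s1, s2) with D(s1, s2*) for shuffled copies s2* of s2."""
    rng = random.Random(seed)
    observed = distance(s1, s2)
    letters = list(s2)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(letters)  # drawing without replacement
        if distance(s1, "".join(letters)) <= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


observed, p_value = permutation_test("ATWYFCAKAC", "ETWYKCALLAD", simple_distance)
print(observed, p_value)
```

Shuffling keeps the amino acid composition of the second sequence fixed, so only the order of residues is tested.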

Steel-Hein Algorithm

[Figure: star tree with ancestor a and leaves s1, s2, s3; ancestral sequence shown as *######]

P(S) (1

)[P*(S)

P# (Tail )P(S Tail)]

*ACGGT

*ACGC *TTGT

Binary Tree Problem

[Figure: binary tree with leaves s1, s2, s3, s4 and internal nodes a1, a2; the sequences shown are ACCT, GTT, TGA and ACG]

Markov Chains Generating the p-functions.

Generating Ancestral Sequences

1 Sequence # E * #

2 Sequence - # # E # # - E ** e- e-

-# e- e-

_# e- e-

#-

a1 * - # Ea2 * # # E e-

1 e

1 e

e

1 e

( )1 e

Fundamental Multiple Recursion I

s1   - C G C T A
s2   A G A A T T
a1   # -
a2   # #
s3   A G C G G
s4   G - C C T G

Sum over all:

String partitions - Anc. state survivals - Anc. state MC jumps

Fundamental Multiple Recursion II

iS

P(k Si ,H )H C

P( )P (Si)

P(Sk): Epifixes (S1[k+1:l1]) starting in given MC starts i state .

( p' kj:H ( j )0

(t j ) sj [i( j) : k( j)])( pkj:H( j )1

( t j ) sj [i( j) 1: k( j)])F(kSi,H)

Where P(kS i,H =

P(Sk) =

Fundamental 4 sequence Recursion

Not a proper recursion!

Initialisation: P_EE(Ø) = 1 and P(Ø) are directly calculable.

O(l^{2k}) shown, O(l^k) algorithm possible

Toy 4-Sequence Program almost ready.

This approach could analyse up to 6-7 sequences.

Jens Ledet and others are working on Gibbs sampler approach.

Statistical Alignment Summary

Motivation for statistical alignment: Data is sequences not alignment!

Problems ahead:

Longer Insertion Deletion Process

Position Heterogeneous Process

Simultaneous Comparative Gene Finding and Alignment.

Making a TKF-process with a given HMM as stationary distribution.

Explore non-TKF processes.
