Upload
shalin
View
29
Download
3
Embed Size (px)
DESCRIPTION
Overview ( http://www.stats.ox.ac.uk/people/hein/lectures.htm ) Pairwise Alignment Again Triple – Quadruple - Many Similarity-Distance Conversion Local Alignment Statistical alignment Pairwise Multiple Conclusion. Approaches to Data Analysis. s 1. s 2. s 3. s 4. - PowerPoint PPT Presentation
Citation preview
Overview(http://www.stats.ox.ac.uk/people/hein/lectures.htm)
Pairwise Alignment Again
Triple – Quadruple - Many
Similarity-Distance Conversion
Local Alignment
Statistical alignment Pairwise Multiple
Conclusion
Approaches to Data Analysis
Data {GTCAT,GTTGGT,GTCA,CTCA}
GT-CAT
GTTGGT
GT-CA-
CT-CA-
s2 s3 s4s1
statisticsstatistics
Parsimony, similarity, optimisation.
Actual Practice: 2 phase analysis. Ideal Practice: 1 phase analysis.
Parsimony Alignment of two strings.Sequences: s1=CTAGG s2=TTGT.
Basic operations: transitions 2 (C-T & A-G), transversions 5, indels (g) 10.
{CTA,TT}AL + GG
{CTAG,TTG}AL = {CTA,TTG}AL + G- {CTAG,TT}AL + -G
Initial condition: D0,0=0. (Di,j = D(s1[1:i], s2[1:j]))
Di,j=min { Di-1,j-1 + d(s1[i],s2[j]), Di,j-1 + g, Di-1,j +g}
DCTA,TT + w(GG) =12 + 0 = 12D4,3=DCTAG,TTG=min{DCTA,TTG + w(G-) = 4 + 10 = 14} DCTAG,TT + w(-G) = 22 + 10 = 32
40 32 22 14 9 17T / 30 22 12 4 12 22G / 20 12 2 - 12 22 32T / 10 2 10 20 30 40T / 0 10 20 30 40 50 C T A G G
CTAGG Alignment: i v Cost 17 TT-GT
CA
A?
Alignment of three sequences.
s1=ATCG s2=ATGCC s3=CTCC
Alignment: AT-CG ATGCC CT-CC Consensus sequence: ATCC
Configurations in an alignment column:
- - n n n - n -- n - n - n n -n - - - n n n -
Recursion: Di,j,k = min{Di-i',j-j',k-k' + d(i,i',j,j',k,k')}
Initial condition: D0,0,0 = 0.
Running time: l1*l2*l3*(23-1) Memory requirement: l1*l2*l3
New phenomena: ancestral sequence.
G
G
C
C
Parsimony Alignment of four sequences
s1=ATCG s2=ATGCC s3=CTCC s4=ACGCG
Alignment: AT-CG ATGCC CT-CC ACGCG
Configurations in alignment columns:
- - - n - - - n n n - n n n n -- - n - n n - n - - n - n n n -- n - - n - n - n - n n - n n -n - - - - n n - - n n n n - n -
Recursion: Di = min{Di-∆ + d(i,∆)} ∆ [{0,1}4\{0}4]
Initial condition: D0 = 0.
Computation time: l1*l2*l3*l4*24 Memory : l1*l2*l3*l4
Alignment of many sequences.
s1=ATCG, s2=ATGCC, ......., sn=ACGCG
Alignment: AT-CG s1 s3 s4 ATGCC \ ! / ..... ---------- ..... / \ ACGCG s2 s5
Configurations in an alignment column: 2n-1
Recursion: Di=min{Di-∆ + d(i,∆)} ∆ [{0,1}n\{0}n]
Initial condition: D0,0,..0 = 0.
Computation time: ln*(2n-1)*n Memory requirement: ln (l:sequence length, n:number of sequences)
Fitch-Hartigan-Sankoff Algorithm
(A,C,G,T) (9,7,7,7) / \ / \ Costs: Transition 2, / \ (A ,C,G, T) \ Transversion 5, indel 10. (10,2,10,2) \ / \ \ / \ \ / \ \ / \ \ / \ \ (A,C,G,T) (A,C,G,T) (A,C,G,T) * 0 * * * * * 0 * * 0 *
Indel Constraint: Nucleotides is connected set.
Longer Indels
TCATGGTACCGTTAGCGTGCA-----------GCAT
gk : cost of indel of length k.
Initial condition: D0,0=0
Di,j = min { Di-1,j-1 + d(s1[i],s2[j]), Di,j-1 + g1,Di,j-2 + g2,Di,j-3 + g3,, Di-1,j + g1,Di-2,j + g2,Di-3,j + g3,, }
Cubic running time. Quadratic memory.
If gk = a + b*k, then quadratic running time.
Gotoh (1982) Di,j is split into 3 types:
1. D0i,j as Di,j, except s1[i] must mactch s2[j].
2. D1i,j as Di,j, except s1[i] is matched with "-".
3. D2i,j as Di,j, except s2[i] is matched with "-".
Then:D0i,j = min(D0i-1,j-1, D1i-1,j-1, D2i-1,j-1) + d(s1[i],s2[j])
D1i,j = min(D1i,j-1 + b, D0i,j-1 + a + b)
D2i,j = min(D2i-1,j + b, D0i-1,j + a + b)
Comment:
1. Evolutionary Consistency Condition: gi + gj > gi+j
Distance-Similarity.(Smith-Waterman-Fitch,1982)
Di,j=min{Di-1,j-1 + d(s1[i],s2[j]), Di,j-1 +g, Di-1,j +g}
Si,j=max{Di-1,j-1 + s(s1[i],s2[j]), Si,j-1 -w, Si-1,j-w}
Distance: Transitions:2 Transversions 5 Indels:10
M largest distance between two nucleotides (5).
Similarity s(n1,n2) M - d(n1,n2) wk k/(2*M) + gk w 1/(2*M) + g
Similarity Parameters: Transversions:0 Transitions:3 Identity:5 Indels: 10 + 1/10
40/-40.4 32/-27.3 22/-12.2 14/0.9 9/11.0 17/2.9T 30/-30.3 22/-17.2 12/-2.1 4/11.0 12/2.9 22/-7.2G 20/-20.2 12/-7.1 2/8.0 12/-2.1 22/-12.2 32/-22.3T 10/-10.1 2/3.0 10/-7.1 20/-17.2 30/-27.3 40/-37.4T 0/0 10/-10.1 20/-20.2 30/-30.3 40/-40.4 50/-50.5
C T A G G
Comments1. The Switch from Dist to Sim is highly analogous to Maximizing {-f(x)} instead of Minimizing {f(x)}.
2. Dist will based on a metric: i. d(x,x) =0, ii. d(x,y) >=0, iii. d(x,y) = d(y,x) & iv. d(x,z) + d(z,y) >= d(x,y).
There are no analogous restrictions on Sim, giving it a larger parameter space.
Local alignment Smith,Waterman (1981
Global Alignment: Si,j=max{Di-1,j-1 + s(s1[i],s2[j]), Si,j-1 -w, Si-1,j-w}Local: Si,j=max{Di-1,j-1 + s(s1[i],s2[j]), Si,j-1 -w, Si-1,j-w,0}
0 1 0 .6 1 2 .6 1.6 1.6 3 2.6 Score Parameters: C 0 0 1 0 1 .3 .6 0.6 2 3 1.6 Match: 1 A 0 0 0 1.3 0 1 1 2 3.3 2 1.6 Mismatch -1/3 G / 0 0 .3 .3 1.3 1 2.3 2.3 2 .6 1.6 Gap 1 + k/3C / 0 0 .6 1.6 .3 1.3 2.6 2.3 1 .6 1.6 GCC-UCGU / GCCAUUG 0 0 2 .6 .3 1.6 2.6 1.3 1 .6 1 A ! 0 1 .6 0 1 3 1.6 1.3 1 1.3 1.6 C / 0 1 0 0 2 1.3 .3 1 .3 2 .6 C / 0 0 0 1 .3 0 0 .6 1 0 0 G / 0 0 0 .6 1 0 0 0 1 1 2 U 0 0 1 .6 0 0 0 0 0 0 0 A 0 0 1 0 0 0 0 0 0 0 0 A 0 0 0 0 0 0 0 0 0 0 0 C A G C C U C G C U U
SodhSodb Sodl
sddm
Sdmz
sods Sdpb
Progressive Alignment(Feng-Doolittle 1987 J.Mol.Evol.)
Can align alignments and given a tree make a multiple alignment.
* *alkmny-trwq acdeqrtakkmdyftrwq acdehrtkkkmemftrwq
[ P(n,q) + P(n,h) + P(d,q) + P(d,h) + P(e,q) + P(e,h)]/6
* * *** * * * * * *Sodh atkavcvlkgdgpqvqgsinfeqkesdgpvkvwgsikglte-glhgfhvhqfg----ndtagct sagphfnp lsrkSodb atkavcvlkgdgpqvqgtinfeak-gdtvkvwgsikglte—-glhgfhvhqfg----ndtagct sagphfnp lsrkSodl atkavcvlkgdgpqvqgsinfeqkesdgpvkvwgsikglte-glhgfhvhqfg----ndtagct sagphfnp lsrkSddm atkavcvlkgdgpqvq -infeak-gdtvkvwgsikglte—-glhgfhvhqfg----ndtagct sagphfnp lsrk Sdmz atkavcvlkgdgpqvq— infeqkesdgpvkvwgsikglte—glhgfhvhqfg----ndtagct sagphfnp LsrkSods vatkavcvlkgdgpqvq— infeak-gdtvkvwgsikgltepnglhgfhvhqfg----ndtagct sagphfnp lsrk Sdpb datkavcvlkgdgpqvq—-infeqkesdgpv----wgsikgltglhgfhvhqfgscasndtagctvlggssagphfnpehtnk
P(s) = (1-)()l pA #A* .. * pT
#T
Time reversible
## ##
#
* A # C G0
t
Thorne-Kishino-Felsenstein Process
Time reversibility
t1
a
t2
s2s1
t1 +t2s1 s2
Pi,j(t) = probability that i has evolved into j after time t.
(i) = probability of i after infinitely long time - equilibrium distribution
(i) Pi,j(t) = (j) Pj,i(t) t1-----------t2---------t3
# - - ... - # # # ... #
pk = t*[l*(k-1) pk-1 + m*k*pk+1 - (l+m)*k*pk]
# - - - ... - - # # # ... #
p’k=t*[l*(k-1) p’k-1+m*(k+1)*p’k+1-(l+m)*k*p’k+m*pk+1]
* - - - ... - * # # # ... #
p’’k=t*[l*k*p’’k-1+m*(k-1)*p’’k+1-((k+1)l+mk)*p’’k]
Initial Conditions: pk(0)= pk’’(0)= p’k (0)= 0 k>1 p0(0)= p0’’(0)= 1. p’0 (0)= 0
Diff. Equations for p-functions
& into Alignment Blocks
A. Amino Acids Ignored:
# - - - # - - - - * - - - -# # # # - # # # # * # # # # k k k
e-t[1-(t)]((t))k-1 [1-e-t-(t)][1-(t)]((t))k-1 [1-(t)]((t))k
pk(t) p’k(t) p’’k(t)
p’0(t)= (t) where (t)=[1-et]/[]
B. AA Considered: T - - - R Q S W Pt(T-->R)*Q*..*W*p4(t) 4
T - - - - - R Q S W R *Q*..*W*p’4(t) 4
Basic Pairwise Recursion (O(length3))
i
j
Survives Dies
i-1j
i-1 i-1i
j-2
j-1i
ijj
ij
P(s1i1 s2 j 2) * p2 * f (s1[i],s2[ j 1])
i-1
j-1
Fundamental Pairwise Recursion.
P(s1i->s2j) = p’0P(s1i-1->s2j) +
Initial Condition P(s10 ->s2j) = pj’’s2[1:j]
Simplification: Ri,j =(p1f(s1[i],s2[j]+p’1s2j[j])P(s1i-1->s2j-1)
P(s1i->s2j) = Ri,j + p’0 P(s1i->s2j-1)
P(s1i->s2j) = p’0P(s1i-1->s2j)+P(s1i->s2j-1) + (p1f(s1[i],s2[j]+p’1s2j[j]-
s2j[j] ))P(s1i-1->s2j-1)
Probability of observation P(s1 , s2) = P(s1) P(s1 ->s2)
P(s1i1 s2 j k)( pk f (s1[i],s2[ j k 1]) s2[ j k2: j ] p'k1kj
s2[ j k1: j ])
-globin (141) and -globin (146)
430.108 : -log(-globin) 327.320 : -log(-globin -globin) 730.428 : -log(l(sumalign))
*t: 0.0371805 +/- 0.0135899*t: 0.0374396 +/- 0.0136846s*t: 0.91701 +/- 0.119556
E(Length) E(Insertions,Deletions) E(Substitutions)
143.499 5.37255 131.59
Maximum contributing alignment:
V-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHF-DLS--H---GSAQVKGHGKKVADAL VHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGAF
TNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYRSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANALAHKYH
Ratio l(maxalign)/l(sumalign) = 0.00565064
Probability for substitution: = 0.46
#children: p p' p'':
0. 0.0 3.60 10-2 9.64 10-1
1. 9.28 10-1 6.30 10-4 3.45 10-2
2. 3.32 10-2 2.26 10-5 1.23 10-3
3. 1.19 10-3 8.10 10-7 4.43 10-5
4. 4.27 10-5 2.90 10-8 1.59 10-6
5. 1.53 10-6 1.04 10-9 5.70 10-8
(t) = 9.64 * 10-1 (t) = 3.46 * 10-2
Length Evolution + Immortal Link
Accelerations of pairwise algorithm
1
2 - Better numerical search
3 - Simpler recursion
1991->2000 : an 106 acceleration for 2 proteins 1500 long.
4 - Better computers
Likelihood Surface
Homology Test
Wi,j= -ln(i*P2.5i,j/(i*j))
D(s1,s2) is evaluated in D(s1,s2*)
Real s1 = ATWYFCAK-AC Random s1 = ATWYFC-AKAC s2 = ETWYKCALLAD s2* = LTAYKADCWLE *** ** * * *
This test:
1. Test the competing hypothesis that 2 sequences are 2.5 events apart versus infinitely far apart.
2. It only handles substitutions “correctly”. The rationale for indel costs are more arbitrary.
3. It samples in (i*j) by permuting the order of amino acids in the second. I.e. uses drawing without replacement – a hypergeometric distribution.
Steel-Hein Algorithm
a
s1 s2
s3
*######
P(S) (1
)[P*(S)
P# (Tail )P(S Tail)]
*ACGGT
*ACGC *TTGT
Binary Tree Problem
s1
s2
s3
s4
a1 a2
ACCT
GTT
TGA
ACG
Markov Chains Generating the p-functions.
Generating Ancestral Sequences
1 Sequence # E * #
2 Sequence - # # E # # - E ** e- e-
-# e- e-
_# e- e-
#-
a1 * - # Ea2 * # # E e-
1 e
1 e
e
1 e
( )1 e
Fundamental Multiple Recursion Is1 - C G C T As2 A G A A T Ta1 # - a2 # #s3 A G C G Gs4 G - C C T G
#
#=
#
#
-
#
Sum over all
String partitions - Anc. state survivals - Anc. state MC jumps
Fundamental Multiple Recursion II
iS
P(k Si ,H )H C
P( )P (Si)
P(Sk): Epifixes (S1[k+1:l1]) starting in given MC starts i state .
( p' kj:H ( j )0
(t j ) sj [i( j) : k( j)])( pkj:H( j )1
( t j ) sj [i( j) 1: k( j)])F(kSi,H)
Where P(kS i,H =
P(Sk) =
Fundamental 4 sequence Recursion
Not a proper recursion!
Initialisation: PEE(Ø) =1 and P(Ø) are directly calculatable.
O(l2k)shown, O(lk) Algorithm possible
Toy 4-Sequence Program almost ready.
This approach could analyse up to 6-7 sequences.
Jens Ledet and others are working on Gibbs sampler approach.
Statistical Alignment Summary
Motivation for statistical alignment: Data is sequences not alignment!
Problems ahead: Longer Insertion Deletion Process
Position Heterogeneous Process Simultaneous Comparative Gene Finding and Alignment.
Making an TKF-process with a given HMM as stationary distribution.
Explore non-TKF processes.
References