Effective hidden Markov models for detecting splicing junction sites in
DNA sequences
Authors: Michael M. Yin and Jason T. L. Wang
Sources: Information Sciences, 139(1-2), pp. 139-163, 2001.
Advisor: Min-Shiang Hwang
Speaker: Chun-Ta Li
Introduction (1/1)
• Key terms: codon, introns, exons (coding sequences), donor
Using HMMs to model splicing junction sites (1/3)
• The Donor Model
Using HMMs to model splicing junction sites (2/3)
• The Acceptor Model
Using HMMs to model splicing junction sites (3/3)
• Two modules for each model
  – The Donor Model:
    • true site module (true sites in the training data set)
    • false site module (false sites in the training data set)
  – The Acceptor Model:
    • true site module (true sites in the training data set)
    • false site module (false sites in the training data set)
Algorithm (1/3)
• Training algorithm (Donor Model)
  – Two training data sets
    • positive training data set, Et
    • negative training data set, Ef
  – Probability of a transition from base b_i to base b_{i+1}
  – True Donor Module
    • P(True | S, M(t)), where S is a sequence in set M
  – False Donor Module
    • P(False | S, M(f)), where S is a sequence in set M
• Testing data set, M
  – 200 true donor sites
  – 14000 false donor sites
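The transition probabilities used by each module can be estimated by counting base pairs at each position of the aligned training sequences. The sketch below is only an illustration: the function name `train_transitions` is hypothetical, and dividing each count by the number of sequences (a joint frequency of the pair at position i) is an assumption, chosen because it appears consistent with the factors in the worked example later in these slides.

```python
from collections import defaultdict


def train_transitions(sequences):
    """Estimate position-specific frequencies of (b_i, b_{i+1}) pairs.

    Assumption: all sequences are aligned and have the same length, and the
    frequency is the pair count divided by the number of sequences.
    """
    if not sequences:
        raise ValueError("need at least one training sequence")
    length = len(sequences[0])
    counts = [defaultdict(int) for _ in range(length - 1)]
    for seq in sequences:
        if len(seq) != length:
            raise ValueError("all sequences must have the same length")
        for i in range(length - 1):
            counts[i][(seq[i], seq[i + 1])] += 1
    n = len(sequences)
    # One dict per position i, mapping (b_i, b_{i+1}) to its relative frequency.
    return [{pair: c / n for pair, c in pos.items()} for pos in counts]
```

Running it once on the positive set Et and once on the negative set Ef would give one table per module (true and false).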
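Algorithm (2/3)
• Bayes' rule
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
• Probability of S being a donor sequence
$$P(S \mid \mathrm{True}, M^{(t)}) = \prod_{i=1}^{T_{states}-1} \mathrm{ftr}^{(t)}(i, b_i, b_{i+1}), \quad b_i \in \{A, C, G, T\}$$
$$P(\mathrm{True} \mid S, M^{(t)}) = \frac{P(S \mid \mathrm{True}, M^{(t)})\,P(\mathrm{True})}{P(S)}$$
$$P(S) = \prod_{i=1}^{T_{states}} P(b_i \mid \mathrm{True}, M^{(t)})$$
• Probability of S being a nondonor sequence
$$P(S \mid \mathrm{False}, M^{(f)}) = \prod_{i=1}^{T_{states}-1} \mathrm{ftr}^{(f)}(i, b_i, b_{i+1}), \quad b_i \in \{A, C, G, T\}$$
$$P(\mathrm{False} \mid S, M^{(f)}) = \frac{P(S \mid \mathrm{False}, M^{(f)})\,P(\mathrm{False})}{P(S)}$$
$$P(S) = \prod_{i=1}^{T_{states}} P(b_i \mid \mathrm{False}, M^{(f)})$$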
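As a rough illustration of the three quantities on this slide, the sketch below computes the sequence likelihood, P(S), and the posterior via Bayes' rule. The function names and the table layout (one dict per position) are assumptions made for the example, not part of the original slides; unseen pairs or bases are given probability 0 here, which is also an assumption.

```python
import math


def sequence_likelihood(seq, transitions):
    """P(S | class, M): product of position-specific transition frequencies.

    transitions[i] maps a pair (b_i, b_{i+1}) to its frequency at position i.
    """
    return math.prod(
        transitions[i].get((seq[i], seq[i + 1]), 0.0)
        for i in range(len(seq) - 1)
    )


def sequence_marginal(seq, base_freqs):
    """P(S): product of position-specific base frequencies under one module."""
    return math.prod(base_freqs[i].get(b, 0.0) for i, b in enumerate(seq))


def posterior(likelihood, prior, marginal):
    """Bayes' rule: P(class | S, M) = P(S | class, M) * P(class) / P(S)."""
    return likelihood * prior / marginal
```

For long sequences, summing logarithms instead of multiplying probabilities would avoid floating-point underflow; the plain product is kept here to mirror the formulas.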
Algorithm (3/3)
• The pratio is calculated for each sequence in set M
$$pratio = \frac{P(\mathrm{True} \mid S, M^{(t)})}{P(\mathrm{False} \mid S, M^{(f)})}$$
• Sort the pratio values in descending order
• Calculate the positive lower bound, denoted L_p
$$S_{emp} = \frac{T(TP)}{T(PP)} = \frac{180}{180 + X}$$
$$S_{emn} = \frac{T(TP)}{T(P+N)} = \frac{180}{200}$$
$$L_p = (N \times S_{emn})\text{th} = (200 \times 0.9)\text{th} = 180\text{th pratio value of the positive sequences in set } M$$
• A sequence S with pratio > L_p is assigned to set P
  – T(TP): positive sequences that belong to set P
  – T(P+N): positive sequences in set M
  – T(PP): sequences in set P
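The selection of L_p can be sketched as a rank lookup in the sorted pratio values of the positive sequences. This is a minimal sketch under assumptions: the function name `positive_lower_bound` and the parameter `emn` are hypothetical, and the rank N × S_emn is rounded to the nearest integer.

```python
def positive_lower_bound(positive_pratios, emn=0.9):
    """Return L_p: the (N * emn)-th largest pratio among N positive sequences.

    With N = 200 and emn = 0.9 this picks the 180th value in descending order.
    """
    ranked = sorted(positive_pratios, reverse=True)
    rank = round(len(ranked) * emn)
    if rank < 1:
        raise ValueError("rank must be at least 1")
    return ranked[rank - 1]  # ranks are 1-based, list indices are 0-based
```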
Algorithm for classifying splicing junction donor sequences
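$$P(Y=1 \mid S_{cand}, M^{(t)}) = \frac{P(S_{cand} \mid Y=1, M^{(t)})\,P(Y=1)}{P(S_{cand})}$$
$$P(Y=0 \mid S_{cand}, M^{(f)}) = \frac{P(S_{cand} \mid Y=0, M^{(f)})\,P(Y=0)}{P(S_{cand})}$$
$$sratio = \frac{P(Y=1 \mid S_{cand}, M^{(t)})}{P(Y=0 \mid S_{cand}, M^{(f)})}$$
$$KIND_i = \begin{cases} 1, & \text{if } sratio > L_p \\ 0, & \text{otherwise} \end{cases}$$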
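The decision rule above amounts to a single threshold comparison. The following sketch assumes the two posteriors have already been computed; the function names `sratio` and `kind` are illustrative, the strict comparison against L_p follows the "S > Lp" wording of the previous slide, and returning infinity for a zero denominator is an assumption.

```python
def sratio(post_true, post_false):
    """Ratio P(Y=1 | S_cand, M(t)) / P(Y=0 | S_cand, M(f))."""
    if post_false == 0:
        return float("inf")  # assumption: a zero denominator counts as "very large"
    return post_true / post_false


def kind(post_true, post_false, lower_bound):
    """Return 1 if sratio exceeds the lower bound L_p, otherwise 0."""
    return 1 if sratio(post_true, post_false) > lower_bound else 0
```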
Example (1/4)
• Training data set M
  – 200 true donor sites
  – 14000 false donor sites
Example (2/4)
• A sequence S (AGGGTCAGT)
$$P(S \mid \mathrm{True}, M^{(t)}) = \prod_{i=1}^{T_{states}-1} \mathrm{ftr}^{(t)}(i, b_i, b_{i+1}), \quad b_i \in \{A, C, G, T\}$$
$$P(\mathrm{True} \mid S, M^{(t)}) = \frac{P(S \mid \mathrm{True}, M^{(t)})\,P(\mathrm{True})}{P(S)}$$
$$P(S) = \prod_{i=1}^{T_{states}} P(b_i \mid \mathrm{True}, M^{(t)})$$
P(True) = 200/14200 = 0.014
P(S|True,M(t))
= 0.05*0.11*0.81*1*0.03*0.02*0.63*0.46
= 0.0000007746354
P(S)
= 0.32*0.13*0.81*1*1*0.03*0.72*0.83*0.51
= 0.0003081
P(True|S,M(t))
= (0.0000007746354*0.014)/0.0003081
= 0.0000352
Example (3/4)
• A sequence S (AGGGTCAGT)
$$P(S \mid \mathrm{False}, M^{(f)}) = \prod_{i=1}^{T_{states}-1} \mathrm{ftr}^{(f)}(i, b_i, b_{i+1}), \quad b_i \in \{A, C, G, T\}$$
$$P(\mathrm{False} \mid S, M^{(f)}) = \frac{P(S \mid \mathrm{False}, M^{(f)})\,P(\mathrm{False})}{P(S)}$$
$$P(S) = \prod_{i=1}^{T_{states}} P(b_i \mid \mathrm{False}, M^{(f)})$$
P(False) = 0.986
P(S|False,M(f))
= 0.07*0.08*0.27*1*0.22*0.06*0.07*0.07
= 0.00000009779616
P(S)
= 0.25*0.25*0.27*1*1*0.22*0.24*0.25*0.3
= 0.00006683
P(False|S,M(f))
= (0.00000009779616*0.986)/0.00006683
= 0.001443
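Example (4/4)
• pratio = 0.0000352 / 0.001443 = 0.0244
$$pratio = \frac{P(\mathrm{True} \mid S, M^{(t)})}{P(\mathrm{False} \mid S, M^{(f)})}$$
• 200 pratio values → descending order → L_p = 180th value
• Testing data, S_cand (True donor or False donor) → Table 1 & 2 → sratio
$$sratio = \frac{P(Y=1 \mid S_{cand}, M^{(t)})}{P(Y=0 \mid S_{cand}, M^{(f)})}$$
$$KIND_i = \begin{cases} 1, & \text{if } sratio > L_p \\ 0, & \text{otherwise} \end{cases}$$
• KIND_i = 1 (True donor), KIND_i = 0 (False donor)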
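The arithmetic of the worked example for S = AGGGTCAGT can be checked with a few lines of Python. The factor lists below are copied from the example slides; the variable names are illustrative, and the priors use the rounded values 0.014 and 0.986 shown on the slides rather than the exact fractions.

```python
import math

# Factors copied from the example slides for S = AGGGTCAGT.
true_transitions = [0.05, 0.11, 0.81, 1, 0.03, 0.02, 0.63, 0.46]
true_bases = [0.32, 0.13, 0.81, 1, 1, 0.03, 0.72, 0.83, 0.51]
false_transitions = [0.07, 0.08, 0.27, 1, 0.22, 0.06, 0.07, 0.07]
false_bases = [0.25, 0.25, 0.27, 1, 1, 0.22, 0.24, 0.25, 0.3]

# Rounded priors as given on the slides.
p_true = 0.014
p_false = 0.986

# Posterior under each module via Bayes' rule.
post_true = math.prod(true_transitions) * p_true / math.prod(true_bases)
post_false = math.prod(false_transitions) * p_false / math.prod(false_bases)

pratio = post_true / post_false
print(f"P(True|S)  = {post_true:.7f}")
print(f"P(False|S) = {post_false:.6f}")
print(f"pratio     = {pratio:.4f}")
```

The printed values should agree with the slide figures up to rounding. Using the exact priors 200/14200 and 14000/14200 instead of the rounded ones shifts pratio slightly in the third significant digit.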