HMM for CpG Islands
Parameter Estimation for HMM: Maximum Likelihood and the Information Inequality
Lecture #7
Background Readings: Chapter 3.3 in the textbook, Biological Sequence Analysis, Durbin et al., 2001. Shlomo Moran, following Danny Geiger and Nir Friedman
HMM for CpG Islands
Reminder: Hidden Markov Model

[Figure: a chain of hidden states s1, …, sL, each emitting an output symbol x1, …, xL.]

p(s, x) = p(s1,…,sL; x1,…,xL) = ∏_{i=1}^{L} m_{s_{i-1} s_i} · e_{s_i}(x_i)
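As a small illustration of this formula, here is a minimal Python sketch (not part of the original slides) that evaluates the joint log-probability of a given state path and output sequence. The dictionaries init, trans and emit are hypothetical containers for the initial, transition (mkl) and emission (ek(b)) probabilities.

```python
import math

def joint_log_prob(states, symbols, init, trans, emit):
    """log p(s, x) = log of prod_i m_{s_{i-1} s_i} * e_{s_i}(x_i).

    init[k]       - probability of the first state k (plays the role of m_{s0 s1})
    trans[(k, l)] - transition probability m_kl
    emit[(k, b)]  - emission probability e_k(b)
    """
    lp = math.log(init[states[0]]) + math.log(emit[(states[0], symbols[0])])
    for prev, cur, x in zip(states, states[1:], symbols[1:]):
        lp += math.log(trans[(prev, cur)]) + math.log(emit[(cur, x)])
    return lp
```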
Next we apply HMMs to the problem of recognizing CpG islands.
Hidden Markov Model for CpG Islands
The states: Domain(Si) = {+, −} × {A, C, T, G} (8 values)
In this representation P(xi | si) = 0 or 1, depending on whether xi is consistent with si. E.g., xi = G is consistent with si = (+, G) and with si = (−, G), but not with any other value of si.
[Figure: the eight states (+ or −, letter); each state emits its own letter, e.g. both G+ and G− emit G.]
Reminder: Most Probable state path

Given an output sequence x = (x1,…,xL), a most probable path s* = (s*1,…,s*L) is one which maximizes p(s|x):

s* = (s*1,…,s*L) = argmax_{(s1,…,sL)} p(s1,…,sL | x1,…,xL)
[Figure: a state path through the '+'/'−' states aligned with an output sequence; '+' states mark predicted CpG-island positions.]
Predicting CpG islands via most probable path:
Output symbols: A, C, G, T (4 letters). Markov chain states: 4 "−" states and 4 "+" states, two for each letter (8 states total). A most probable path (found by Viterbi's algorithm) predicts CpG islands.
The experiment in Durbin et al. (pp. 60-61) shows that the predicted islands are shorter than the assumed ones. In addition, quite a few "false negatives" are found.
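The slide only names Viterbi's algorithm, so here is a minimal Python sketch of it for this 8-state model. States are written as two-character strings such as '+G', and the probability tables init, trans (mkl) and emit (ek(b)) are hypothetical inputs that would in practice come from the parameter estimation discussed later in this lecture.

```python
import math

STATES = [sign + base for sign in "+-" for base in "ACGT"]   # '+A', ..., '-T'

def _log(p):
    return math.log(p) if p > 0 else float("-inf")   # emissions are 0/1 here

def viterbi(x, init, trans, emit):
    """Return a most probable state path for the output sequence x."""
    # V[i][k] = log-probability of the best path ending in state k at position i
    V = [{k: _log(init[k]) + _log(emit[(k, x[0])]) for k in STATES}]
    back = []                                        # back-pointers for traceback
    for i in range(1, len(x)):
        V.append({})
        back.append({})
        for l in STATES:
            best_k = max(STATES, key=lambda k: V[i - 1][k] + _log(trans[(k, l)]))
            back[-1][l] = best_k
            V[i][l] = V[i - 1][best_k] + _log(trans[(best_k, l)]) + _log(emit[(l, x[i])])
    last = max(STATES, key=lambda k: V[-1][k])       # best final state
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Positions whose state starts with '+' are predicted to lie in a CpG island.
```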
Reminder: Most probable state

Given an output sequence x = (x1,…,xL), si is a most probable state (at location i) if:
si = argmax_k p(Si = k | x).

p(Si = k | x) = p(Si = k, x) / p(x), which is proportional to p(Si = k, x).
Finding the probability that a letter is in a CpG island, via the algorithm for the most probable state:
The probability that the occurrence of G in the i-th location is in a CpG island (a '+' state) is:

∑_{s+} p(Si = s+ | x) = ∑_{s+} F(Si = s+) · B(Si = s+) / p(x)

where the summation is formally over the 4 "+" states, but actually only the state G+ needs to be considered (why?).
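A minimal Python sketch of this posterior computation (posterior decoding), assuming the usual definitions F(Si = k) = p(x1..xi, Si = k) and B(Si = k) = p(xi+1..xL | Si = k). The probability tables are the same hypothetical init, trans and emit dictionaries as in the Viterbi sketch above; no rescaling is done, so this is only meant for short sequences.

```python
def posteriors(x, states, init, trans, emit):
    """Return post[i][k] = p(S_i = k | x) via the forward-backward algorithm."""
    L = len(x)
    # Forward: F[i][k] = p(x_1..x_i, S_i = k)
    F = [{k: init[k] * emit[(k, x[0])] for k in states}]
    for i in range(1, L):
        F.append({l: emit[(l, x[i])] * sum(F[i - 1][k] * trans[(k, l)] for k in states)
                  for l in states})
    # Backward: B[i][k] = p(x_{i+1}..x_L | S_i = k)
    B = [dict.fromkeys(states, 1.0) for _ in range(L)]
    for i in range(L - 2, -1, -1):
        for k in states:
            B[i][k] = sum(trans[(k, l)] * emit[(l, x[i + 1])] * B[i + 1][l] for l in states)
    px = sum(F[L - 1][k] for k in states)             # p(x)
    return [{k: F[i][k] * B[i][k] / px for k in states} for i in range(L)]

# p(position i is inside a CpG island) = sum of post[i][k] over the four '+' states.
```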
Parameter Estimation for HMM
Defining the Parameters
An HMM is defined by the parameters mkl and ek(b), for all states k, l and all symbols b. Let θ denote the collection of these parameters:
[Figure: HMM with transition probability mkl from state k to state l, and emission probability ek(b) of symbol b from state k.]

θ = { mkl : k, l are states } ∪ { ek(b) : k is a state, b is a letter }
Training Sets
To determine the values of (the parameters in) θ, we use a training set = {x1,...,xn}, where each xj is a sequence which is assumed to fit the model. Given the parameters θ, each sequence xj has an assigned probability p(xj|θ).
Maximum Likelihood Parameter Estimation for HMM
The elements of the training set {x1,...,xn} are assumed to be independent, so p(x1,...,xn|θ) = ∏j p(xj|θ).
ML parameter estimation looks for θ which maximizes the above.
The exact method for finding or approximating this θ depends on the nature of the training set used.
Data for HMM
The training set is characterized by:
1. For each xj, the information on the states s_i^j (the symbols x_i^j are usually known).
2. Its size (the sum of the lengths of all sequences).
Case 1: ML when Sequences are fully known
We know the complete structure of each sequence in the training set {x1,...,xn}. We wish to estimate mkl and ek(b) for all pairs of states k, l and symbols b.
By the ML method, we look for parameters θ* which maximize the probability of the sample set: p(x1,...,xn | θ*) = max_θ p(x1,...,xn | θ).
Case 1: Sequences are fully known
For each xj we have:

p(xj | θ) = ∏_{i=1}^{L_j} m_{s_{i-1} s_i} · e_{s_i}(x_i^j)

Let Mkl = |{i : s_{i-1} = k, s_i = l}| (in xj) and Ek(b) = |{i : s_i = k, x_i = b}| (in xj). Then:

p(xj | θ) = ∏_{(k,l)} m_{kl}^{M_{kl}} · ∏_{(k,b)} e_k(b)^{E_k(b)}
Case 1 (cont)
By the independence of the xj's, p(x1,...,xn | θ) = ∏j p(xj|θ).
Thus, if Mkl = #(transitions from k to l) in the training set, and Ek(b) = #(emissions of symbol b from state k) in the training set, we have:
p(x1,..,xn | θ) = ∏_{(k,l)} m_{kl}^{M_{kl}} · ∏_{(k,b)} e_k(b)^{E_k(b)}
Case 1 (cont)
So we need to find mkl's and ek(b)'s which maximize:

∏_{(k,l)} m_{kl}^{M_{kl}} · ∏_{(k,b)} e_k(b)^{E_k(b)}

Subject to: for all states k, ∑_l m_{kl} = 1 and ∑_b e_k(b) = 1 [and m_{kl}, e_k(b) ≥ 0].
Case 1 (cont)
Rewriting, we need to maximize:

F = ∏_{(k,l)} m_{kl}^{M_{kl}} · ∏_{(k,b)} e_k(b)^{E_k(b)} = ∏_k [ ∏_l m_{kl}^{M_{kl}} ] · ∏_k [ ∏_b e_k(b)^{E_k(b)} ]

Subject to: for all k, ∑_l m_{kl} = 1 and ∑_b e_k(b) = 1.
Case 1 (cont)
If, for each k, we maximize ∏_l m_{kl}^{M_{kl}} subject to ∑_l m_{kl} = 1, and also ∏_b e_k(b)^{E_k(b)} subject to ∑_b e_k(b) = 1, then we also maximize F.
Each of the above is a simpler ML problem, similar to ML parameter estimation for a die, treated next.
ML Parameter Estimation for a Single Die
Defining The Problem
Let X be a random variable with 6 values x1,…,x6 denoting the six outcomes of a (possibly unfair) die. Here the parameters are θ = {θ1, θ2, θ3, θ4, θ5, θ6}, ∑ θi = 1.
Assume that the data is one sequence:
Data = (x6, x1, x1, x3, x2, x2, x3, x4, x5, x2, x6)
So we have to maximize

P(Data | θ) = θ1^2 · θ2^3 · θ3^2 · θ4 · θ5 · θ6^2

Subject to: θ1 + θ2 + θ3 + θ4 + θ5 + θ6 = 1 [and θi ≥ 0]

i.e., P(Data | θ) = θ1^2 · θ2^3 · θ3^2 · θ4 · θ5 · (1 - ∑_{i=1}^{5} θi)^2
Side Comment: Sufficient Statistics
To compute the probability of the data in the die example, we only need to record the number of times Ni the die fell on side i (namely N1, N2,…,N6); we do not need to recall the entire sequence of outcomes.

P(Data | θ) = θ1^{N1} · θ2^{N2} · θ3^{N3} · θ4^{N4} · θ5^{N5} · (1 - ∑_{i=1}^{5} θi)^{N6}

{Ni | i = 1,…,6} is called a sufficient statistic for the multinomial sampling.
Sufficient Statistics
A sufficient statistic is a function of the data that summarizes the relevant information for the likelihood.
Formally, s(Data) is a sufficient statistic if for any two datasets Data and Data′:
s(Data) = s(Data′)  ⇒  P(Data | θ) = P(Data′ | θ)

[Figure: many datasets mapping to the same statistic.]

Exercise: define a sufficient statistic for the HMM model.
Maximum Likelihood Estimate
By the ML approach, we look for parameters that maximize the probability of the data (i.e., the likelihood function). We will find the parameters by considering the corresponding log-likelihood function:

log P(Data | θ) = log [ θ1^{N1} · θ2^{N2} · θ3^{N3} · θ4^{N4} · θ5^{N5} · (1 - ∑_{i=1}^{5} θi)^{N6} ]
               = ∑_{i=1}^{5} Ni log θi + N6 log(1 - ∑_{i=1}^{5} θi)

A necessary condition for a (local) maximum is, for j = 1,…,5:

∂ log P(Data | θ) / ∂θj = Nj / θj - N6 / (1 - ∑_{i=1}^{5} θi) = 0
Finding the Maximum
Rearranging terms:

Nj / θj = N6 / (1 - ∑_{i=1}^{5} θi) = N6 / θ6   (for j = 1,…,5)

Divide the j-th equation by the i-th equation:

θj / θi = Nj / Ni

Sum from j = 1 to 6:

1 = ∑_{j=1}^{6} θj = (θi / Ni) · ∑_{j=1}^{6} Nj = (θi / Ni) · N,   where N = ∑_{j=1}^{6} Nj

So there is only one local, and hence global, maximum. Hence the MLE is given by:

θi = Ni / N,   i = 1,…,6
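For example, for the data on the "Defining the Problem" slide the counts are (N1,…,N6) = (2, 3, 2, 1, 1, 2) and N = 11, so the MLE is

θ = (2/11, 3/11, 2/11, 1/11, 1/11, 2/11) ≈ (0.18, 0.27, 0.18, 0.09, 0.09, 0.18).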
Note: Fractional Exponents are possible
Some models allow the Ni's to be fractions (e.g., if we are uncertain of a die outcome, we may count it as "6" with 20% confidence and "5" with 80%). Our analysis did not assume that the Ni are integers, so it also applies to fractional exponents.
Generalization to a distribution with any number k of outcomes
Let X be a random variable with k values x1,…,xk denoting the k outcomes of independent and identically distributed experiments, with parameters θ = {θ1, θ2,…,θk} (θi is the probability of xi). Again, the data is one sequence of length n, in which xi appears ni times.
Then we have to maximize
P(Data | θ) = θ1^{n1} · θ2^{n2} ··· θk^{nk},   (n = n1 + … + nk)

Subject to: θ1 + θ2 + … + θk = 1

i.e., P(Data | θ) = θ1^{n1} ··· θ_{k-1}^{n_{k-1}} · (1 - ∑_{i=1}^{k-1} θi)^{nk}
Generalization for k outcomes (cont.)
By treatment identical to the die case, the maximum is obtained when, for all i:

θi / θk = ni / nk

Hence the MLE is given by the relative frequencies:

θi = ni / n,   i = 1,…,k
ML for a Single Die, Normalized Version
Consider two experiments with a 3-sided die:
1. 10 tosses: 2 × x1, 3 × x2, 5 × x3.
2. 1000 tosses: 200 × x1, 300 × x2, 500 × x3.
Clearly, both imply the same ML parameters. In general, when formulating ML for a single die, we can ignore the actual number n of tosses, and just use the fraction of each outcome.
Normalized Version of ML (cont.)
Thus we can replace the number of outcomes ni by pi = ni / n, and get the following normalized setting of the ML problem for a single die:

Given positive numbers p1,…,pk s.t. p1 + … + pk = 1, find parameters θ1,…,θk which maximize:

P(Data | θ) = θ1^{p1} · θ2^{p2} ··· θk^{pk}

And the same analysis yields that a maximum is obtained when:

θi = pi,   i = 1,…,k
Implication: The Kullback-Leibler Information Inequality
Rephrasing the ML inequality
We can rephrase the "ML for a single die" inequality:

Let P = (p1,…,pk) be a probability distribution over a k-set. For any other such distribution Q = (q1,…,qk), let the likelihood be

P(Data | Q) = q1^{p1} · q2^{p2} ··· qk^{pk}

Then P(Data | Q) is maximized only when Q = P.

Taking logarithms, we get:

Let P = (p1,…,pk) be a probability distribution over a k-set. For any other such distribution Q = (q1,…,qk), consider the sum

R(Q) = log P(data | Q) = ∑_i pi log qi

Then R has a unique maximum at Q = P.
The Kullback-Leibler Information Inequality
Given probability distributions over a k-set, P = (p1,…,pk) and Q = (q1,…,qk), the relative entropy of P and Q is defined by:

D(P || Q) = ∑_i pi log (pi / qi)

Then D(P || Q) ≥ 0, with equality only when P = Q.
Proof of the information inequality
By the logarithmic version of the "normalized maximum likelihood" (two slides back):

D(P || Q) = ∑_i pi log (pi / qi) = ∑_i pi log pi - ∑_i pi log qi ≥ 0,

and equality holds only when pi = qi for all i. QED
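As a concrete check (not from the slides), a few lines of Python computing the relative entropy with natural logarithms; terms with pi = 0 are taken to contribute 0.

```python
import math

def kl_divergence(p, q):
    """Relative entropy D(P||Q) = sum_i p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

P = [0.5, 0.3, 0.2]
print(kl_divergence(P, [0.2, 0.5, 0.3]))   # positive, since Q != P
print(kl_divergence(P, P))                 # 0.0, equality only when Q = P
```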
Using the Solution for the "Die Maximum Likelihood" to Find Parameters for HMM When All States are Known
The Parameters
Let Mkl = #(transitions from k to l) in the training set, and Ek(b) = #(emissions of symbol b from state k) in the training set. We need to:
Maximize ∏_{(k,l)} m_{kl}^{M_{kl}} · ∏_{(k,b)} e_k(b)^{E_k(b)}

Subject to: for all states k, ∑_l m_{kl} = 1 and ∑_b e_k(b) = 1, with m_{kl}, e_k(b) ≥ 0.
Apply to HMM (cont.)
We apply the previous technique to get, for each state k, the parameters {mkl : l is a state} and {ek(b) : b ∈ Σ}:
m_{kl} = M_{kl} / ∑_{l'} M_{kl'},   and   e_k(b) = E_k(b) / ∑_{b'} E_k(b')

which gives the optimal ML parameters.
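A minimal Python sketch of this Case-1 estimation. The input format, a list of (state path, output sequence) pairs, is a hypothetical choice; the function just accumulates the counts Mkl and Ek(b) and normalizes them per state, exactly as in the formula above.

```python
from collections import Counter, defaultdict

def estimate_hmm_params(training_set):
    """ML estimation of m_kl and e_k(b) when all state paths are known.

    training_set: iterable of (states, symbols) pairs of equal-length sequences.
    Returns two dicts: m[(k, l)] and e[(k, b)].
    """
    M, E = Counter(), Counter()                  # M_kl and E_k(b)
    for states, symbols in training_set:
        M.update(zip(states, states[1:]))        # transitions k -> l
        E.update(zip(states, symbols))           # emissions of b from state k
    m_total, e_total = defaultdict(float), defaultdict(float)
    for (k, _), c in M.items():
        m_total[k] += c
    for (k, _), c in E.items():
        e_total[k] += c
    m = {(k, l): c / m_total[k] for (k, l), c in M.items()}
    e = {(k, b): c / e_total[k] for (k, b), c in E.items()}
    return m, e

# Toy usage (hypothetical data): states '+'/'-', symbols over {A, C, G, T}
# m, e = estimate_hmm_params([("++--", "GCAT"), ("+--+", "GATC")])
```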
Summary of Case 1: Sequences are fully known
We know the complete structure of each sequence in the training set {x1,...,xn}. We wish to estimate mkl and ek(b) for all pairs of states k, l and symbols b.
When everything is known, we can find the (unique set of) parameters θ* which maximizes
p(x1,...,xn | θ*) = max_θ p(x1,...,xn | θ).
Adding Pseudocounts in HMM
We may modify the actual counts by our prior knowledge/belief (e.g., when the sample set is too small): rkl is our prior belief on transitions from k to l, and rk(b) is our prior belief on emissions of b from state k.
Then:

m_{kl} = (M_{kl} + r_{kl}) / ∑_{l'} (M_{kl'} + r_{kl'}),   and   e_k(b) = (E_k(b) + r_k(b)) / ∑_{b'} (E_k(b') + r_k(b'))
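In the estimation sketch a few slides back, this amounts to seeding the counters with the prior counts before adding the observed ones; the normalization step is unchanged. A possible helper (r_trans and r_emit are hypothetical prior-count dictionaries keyed like M and E):

```python
from collections import Counter

def add_pseudocounts(M, E, r_trans, r_emit):
    """Add prior (pseudo) counts r_kl and r_k(b) to the observed counters M and E."""
    return M + Counter(r_trans), E + Counter(r_emit)
```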
Case 2: State Paths are Unknown. Here we use ML with Hidden Parameters.
Die likelihood with hidden parameters
Let X be a random variable with 3 values 0, 1, 2. Hence the parameters are θ = {θ0, θ1, θ2}, ∑ θi = 1.
Assume that the data is a sequence of 2 tosses which we don't see, but we know that the sum of the outcomes is 2.
The problem: Find parameters which maximize the likelihood (probability) of the observed data.
Basic fact: The probability of an event is the sum of the probabilities of the simple events it
contains.
The probability space here: all sequences of 2 tosses:
(0,0), (0,1), (0, 2), (1,0),..., (2, 2)
Defining The Problem
Thus, we need to find parameters θ which maximize:

Pr(sum = 2 | θ) = Pr{(1,1), (2,0), (0,2)} = θ1^2 + 2·θ0·θ2
Finding an optimal solution is in general a difficult task. Hence we use the following procedure:
1. "Guess" initial parameters.
2. Repeatedly improve the parameters using the EM algorithm (to be studied later in this course).
Next, we exemplify the EM algorithm on the above example.
E step: Average Counts:
Assume our initial parameters are: θ0 = 0.5, θ1 = θ2 = 0.25. Then:

Pr(1,1) = 0.25^2 = 0.0625,   Pr(2,0) = Pr(0,2) = 0.5 · 0.25 = 0.125,
Pr(sum = 2 | θ) = 0.0625 + 2 · 0.125 = 0.3125.

We use the probabilities of the events to generate "average counts" of the outcomes:
Average count of 0 is 2 · 0.125 = 0.25.
Average count of 1 is 2 · 0.0625 = 0.125.
Average count of 2 is 2 · 0.125 = 0.25.
M step: Updating probabilities by the average counts
The total of all average counts is 2 · 0.25 + 0.125 = 0.625. The new parameters λ0, λ1, λ2 are the relative frequencies of the average counts:

λ0 = λ2 = 0.25 / 0.625 = 0.4,   λ1 = 0.125 / 0.625 = 0.2.

The probabilities of the simple events according to the new parameters:

Pr(1,1) = 0.2^2 = 0.04,   Pr(2,0) = Pr(0,2) = 0.4 · 0.4 = 0.16.

The probability of the event by the new parameters:

Pr(sum = 2 | λ) = 0.04 + 2 · 0.16 = 0.36 > 0.3125 = Pr(sum = 2 | θ).
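A small Python sketch (not part of the original slides) reproducing the numbers above: the E step turns the current parameters into average counts, the M step renormalizes them, and the likelihood Pr(sum = 2 | θ) rises from 0.3125 to 0.36 after one iteration.

```python
def em_step(theta):
    """One E+M step for the hidden two-toss example (observation: sum = 2)."""
    t0, t1, t2 = theta
    p11 = t1 * t1                      # Pr(1, 1)
    p20 = p02 = t2 * t0                # Pr(2, 0) = Pr(0, 2)
    likelihood = p11 + p20 + p02       # Pr(sum = 2 | theta)
    # E step: average (expected, unnormalized) counts of each outcome.
    counts = [p20 + p02,               # outcome 0: once in (2,0), once in (0,2)
              2 * p11,                 # outcome 1: twice in (1,1)
              p20 + p02]               # outcome 2
    # M step: new parameters are the relative frequencies of the average counts.
    total = sum(counts)
    return [c / total for c in counts], likelihood

theta = [0.5, 0.25, 0.25]
theta, lik = em_step(theta)
print(theta, lik)    # [0.4, 0.2, 0.4]  0.3125
theta, lik = em_step(theta)
print(lik)           # 0.36, the likelihood of the data increased
```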
Summary of the algorithm:
• Start with some estimated parameters θ.
• Use these parameters to define average counts of the outcomes.
• Define new parameters λ by the relative frequencies of the average counts.

We will show that this algorithm never decreases, and usually increases, the likelihood of the data.
An application of this algorithm to HMMs is known as the Baum-Welch algorithm, which we will see next.