Computational Biology, Part 5: Hidden Markov Models
Robert F. Murphy
Copyright 2005-2009. All rights reserved.
Markov chains
If we can predict all of the properties of a sequence knowing only the conditional dinucleotide probabilities, then that sequence is an example of a Markov chain.
A Markov chain is defined as a sequence of states in which each state depends only on the previous state.
Formalism for Markov chains
M = (Q, π, P) is a Markov chain, where
Q = vector (1, .., n) is the list of states
Q(1)=A, Q(2)=C, Q(3)=G, Q(4)=T for DNA
π = vector ($p_1$, .., $p_n$) is the initial probability of each state
$\pi(i) = p_{Q(i)}$ (e.g., $\pi(1) = p_A$ for DNA)
P = n x n matrix where the entry in row i and column j is the probability of observing state j if the previous state is i, and the sum of entries in each row is 1 (dinucleotide probabilities)
$P(i,j) = p^{*}_{Q(i)Q(j)}$ (e.g., $P(1,2) = p^{*}_{AC}$ for DNA)
Generating Markov chains
Given Q, π, P (and a random number generator), we can generate sequences that are members of the Markov chain M.
If π, P are derived from a single sequence, the family of sequences generated by M will include that sequence as well as many others.
If π, P are derived from a sampled set of sequences, the family of sequences generated by M will approximate the population from which that set has been sampled.
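As a rough illustration of deriving π and P from a single sequence, the sketch below counts adjacent character pairs and normalises each row. The function name `estimate_markov_chain`, the example sequence, and the uniform fallback for characters never seen as a previous state are all assumptions made for this example; π is taken from how often each character appears as the first member of a pair.

```python
from collections import Counter

def estimate_markov_chain(seq, alphabet="ACGT"):
    """Estimate (pi, P) for a first-order Markov chain from one sequence."""
    # Count each adjacent pair (previous character, next character).
    pair_counts = Counter(zip(seq, seq[1:]))
    # Row totals: how often each character occurs as the previous state.
    row_totals = {a: sum(pair_counts[(a, b)] for b in alphabet) for a in alphabet}
    total = sum(row_totals.values())
    pi = {a: row_totals[a] / total for a in alphabet}
    P = {}
    for a in alphabet:
        if row_totals[a] == 0:
            # Assumption: fall back to a uniform row when there is no data.
            P[a] = {b: 1.0 / len(alphabet) for b in alphabet}
        else:
            P[a] = {b: pair_counts[(a, b)] / row_totals[a] for b in alphabet}
    return pi, P

# Hypothetical example sequence
pi, P = estimate_markov_chain("ACGCGCGTACGCGA")
print(pi)
print(P["G"])
```

Each row of `P` sums to 1, matching the requirement on the matrix P in the formalism.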
Interactive Demonstration
(A11 Markov chains)
Matlab code for generating Markov chains
    chars = ['a' 'c' 'g' 't'];
    % the dinucs array shows the frequency of observing the character in the
    % row followed by the character in the column
    % these values show strong preference for c-c
    dinucs = [2, 1, 2, 0; 0, 8, 0, 1; 2, 0, 2, 0; 1, 0, 0, 1];
    % these values restrict transitions more
    %dinucs = [2, 0, 2, 0; 0, 8, 0, 0; 2, 0, 2, 0; 1, 1, 0, 1];
    % calculate mononucleotide frequencies only as the probability of
    % starting with each nucleotide
    monocounts = sum(dinucs,2);
    monofreqs = monocounts/sum(monocounts);
    cmonofreqs = cumsum(monofreqs);
    % calculate dinucleotide frequencies and cumulative dinuc freqs
    freqs = dinucs./repmat(monocounts,1,4);
    cfreqs = cumsum(freqs,2);
    disp('Dinucleotide frequencies (transition probabilities)');
    fprintf(' %c %c %c %c\n',chars)
    for i=1:4
        fprintf('%c %f %f %f %f\n',chars(i),freqs(i,:))
    end
    nseq = 10;
    for ntries=1:20
        rnums = rand(nseq,1);
        % start sequence using mononucleotide frequencies
        seq(1) = min(find(cmonofreqs>=rnums(1)));
        for i=2:nseq
            % extend it using the appropriate row from the dinuc freqs
            seq(i) = min(find(cfreqs(seq(i-1),:)>=rnums(i)));
        end
        output=chars(seq);
        disp(strvcat(output));
    end
Discriminating between two states with Markov chains
To determine which of two states a sequence is more likely to have resulted from, we calculate
$$S(x) = \log \frac{P(x \mid \text{model}+)}{P(x \mid \text{model}-)} = \sum_{i=1}^{L} \log \frac{a^{+}_{x_{i-1}x_i}}{a^{-}_{x_{i-1}x_i}} = \sum_{i=1}^{L} \beta_{x_{i-1}x_i}$$
State probabilities for + and - models
Given example sequences that are from either the + model (CpG island) or the - model (not CpG island), we can calculate the probability that each nucleotide will occur after each preceding nucleotide for each model (the a values for each model). Rows give the previous nucleotide, columns the next nucleotide.

+    A      C      G      T        -    A      C      G      T
A  0.180  0.274  0.426  0.120     A  0.300  0.205  0.285  0.210
C  0.171  0.368  0.274  0.188     C  0.322  0.298  0.078  0.302
G  0.161  0.339  0.375  0.125     G  0.248  0.246  0.298  0.208
T  0.079  0.355  0.384  0.182     T  0.177  0.239  0.292  0.292
Transition probabilities converted to log likelihood ratios

β     A       C       G       T
A  -0.740   0.419   0.580  -0.803
C  -0.913   0.302   1.812  -0.685
G  -0.624   0.461   0.331  -0.730
T  -1.169   0.573   0.393  -0.679
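The conversion from the + and - transition tables to log likelihood ratios can be sketched as follows. This is only an illustration: the names `PLUS`, `MINUS`, `beta` and `log_odds` are chosen for this example, the logarithm is taken base 2 (so scores are in bits), and because the inputs are the rounded table entries the results may differ from the slide values in the third decimal place.

```python
import math

NUC = "ACGT"
# Transition probabilities from the + and - tables
# (key = previous nucleotide, list position = next nucleotide in NUC order).
PLUS = {
    "A": [0.180, 0.274, 0.426, 0.120],
    "C": [0.171, 0.368, 0.274, 0.188],
    "G": [0.161, 0.339, 0.375, 0.125],
    "T": [0.079, 0.355, 0.384, 0.182],
}
MINUS = {
    "A": [0.300, 0.205, 0.285, 0.210],
    "C": [0.322, 0.298, 0.078, 0.302],
    "G": [0.248, 0.246, 0.298, 0.208],
    "T": [0.177, 0.239, 0.292, 0.292],
}

# beta[prev][nxt] = log2(a+ / a-)
beta = {
    prev: {nxt: math.log2(PLUS[prev][j] / MINUS[prev][j]) for j, nxt in enumerate(NUC)}
    for prev in NUC
}

def log_odds(seq):
    """S(x): sum of beta over adjacent pairs (the first-base term is ignored)."""
    return sum(beta[prev][nxt] for prev, nxt in zip(seq, seq[1:]))

print(round(beta["C"]["G"], 3), round(log_odds("CGC"), 3))
```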
Example
What is the relative probability of C+G+C+ compared with C-G-C-?
First calculate the log-odds ratio:
S(CGC) = β(CG) + β(GC) = 1.812 + 0.461 = 2.273
Convert to relative probability:
$2^{2.273} = 4.833$
Relative probability is the ratio of (+) to (-):
P(+) = 4.833 P(-)
Example
Convert to percentage (assuming the two models are equally likely a priori):
P(+) + P(-) = 1
4.833 P(-) + P(-) = 1
P(-) = 1/5.833 = 17%
Conclusion:
P(+) = 83%, P(-) = 17%
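The arithmetic of this example can be checked with a few lines. The helper name `posterior_plus` and its `prior_plus` parameter are assumptions for this sketch; with the default prior of 0.5 it reduces to the equal-prior calculation above.

```python
def posterior_plus(log2_odds, prior_plus=0.5):
    """P(+ | x) from a base-2 log-odds score and a prior for the + model."""
    ratio = 2.0 ** log2_odds                        # P(x | +) / P(x | -)
    odds = ratio * prior_plus / (1.0 - prior_plus)  # posterior odds of + vs -
    return odds / (1.0 + odds)

p_plus = posterior_plus(2.273)
print(f"P(+) = {p_plus:.0%}, P(-) = {1 - p_plus:.0%}")
```

Passing a different `prior_plus` shows how strongly the conclusion depends on how common CpG islands are assumed to be.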
Hidden Markov models
"Hidden" connotes that the sequence is generated by two or more states that have different transition probability matrices.
More definitions
$\pi_i$ = state at position $i$ in a path
$a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k)$: probability of going from one state to another ("transition probability")
$e_k(b) = P(x_i = b \mid \pi_i = k)$: probability of emitting a $b$ when in state $k$ ("emission probability")
Generating sequences (see previous example code)
    % force emission to match state (normal Markov model, not hidden)
    emit = diag(repmat(1,4,1));
    [seq2,states]=hmmgenerate(10,freqs,emit)
    output2=chars(seq2);
    disp(strvcat(output2));
Decoding
The goal of using an HMM is often to determine (estimate) the sequence of underlying states that likely gave rise to an observed sequence.
This is called "decoding" in the jargon of speech recognition.
More definitions
Can calculate the joint probability of a sequence $x$ and a state sequence $\pi$
$$P(x, \pi) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}$$
requiring $\pi_{L+1} = 0$
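A direct evaluation of this product might look like the sketch below. The two-state model (states "+" and "-", with begin probabilities `START`, transitions `TRANS`, end probabilities `END` and emissions `EMIT`) is hypothetical and its numbers are purely illustrative; they are not taken from the slides.

```python
# Hypothetical two-state HMM; all numbers are illustrative only.
STATES = ["+", "-"]
START = {"+": 0.5, "-": 0.5}                      # a_0k
TRANS = {"+": {"+": 0.8, "-": 0.1},               # a_kl
         "-": {"+": 0.1, "-": 0.8}}
END = {"+": 0.1, "-": 0.1}                        # a_k0
EMIT = {"+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},   # e_k(b)
        "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def joint_probability(x, path):
    """P(x, pi) = a_{0 pi_1} * prod_i e_{pi_i}(x_i) a_{pi_i pi_{i+1}}, pi_{L+1} = 0."""
    if len(x) != len(path) or len(x) == 0:
        raise ValueError("x and path must be non-empty and of equal length")
    p = START[path[0]]
    for i, (symbol, state) in enumerate(zip(x, path)):
        p *= EMIT[state][symbol]
        # The final factor is the transition into the end state 0.
        p *= TRANS[state][path[i + 1]] if i + 1 < len(path) else END[state]
    return p

print(joint_probability("CGA", "++-"))
```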
Determining the optimal path: the Viterbi algorithm
The Viterbi algorithm is a form of dynamic programming.
Definition: Let $v_k(i)$ be the probability of the most probable path ending in state $k$ with observation $i$.
Initialisation ($i=0$): $v_0(0)=1$, $v_k(0)=0$ for $k>0$
Recursion ($i=1..L$): $v_l(i) = e_l(x_i) \max_k \left( v_k(i-1)\, a_{kl} \right)$
$\mathrm{ptr}_i(l) = \operatorname{argmax}_k \left( v_k(i-1)\, a_{kl} \right)$
Termination: $P(x, \pi^*) = \max_k \left( v_k(L)\, a_{k0} \right)$
$\pi^*_L = \operatorname{argmax}_k \left( v_k(L)\, a_{k0} \right)$
Traceback ($i=L..1$): $\pi^*_{i-1} = \mathrm{ptr}_i(\pi^*_i)$
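The recursion, termination and traceback steps can be sketched in a few lines. The model below is a hypothetical two-state HMM with illustrative numbers, and the begin-state step ($v_0(0)=1$ followed by the first recursion) is folded into the first column. Plain probabilities are used for readability; for long sequences one would normally work with log probabilities to avoid underflow.

```python
# Hypothetical two-state HMM; all numbers are illustrative only.
STATES = ["+", "-"]
START = {"+": 0.5, "-": 0.5}                      # a_0k
TRANS = {"+": {"+": 0.8, "-": 0.1},               # a_kl
         "-": {"+": 0.1, "-": 0.8}}
END = {"+": 0.1, "-": 0.1}                        # a_k0
EMIT = {"+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},   # e_k(b)
        "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def viterbi(x):
    """Return (P(x, pi*), pi*) for a non-empty observed sequence x."""
    # v[i][k]: probability of the best path ending in state k at 0-based position i.
    v = [{k: START[k] * EMIT[k][x[0]] for k in STATES}]
    ptr = [{}]
    for i in range(1, len(x)):
        v.append({})
        ptr.append({})
        for l in STATES:
            # Recursion: best predecessor k for state l at position i.
            best_k = max(STATES, key=lambda k: v[i - 1][k] * TRANS[k][l])
            ptr[i][l] = best_k
            v[i][l] = EMIT[l][x[i]] * v[i - 1][best_k] * TRANS[best_k][l]
    # Termination: include the transition into the end state.
    last = max(STATES, key=lambda k: v[-1][k] * END[k])
    best_p = v[-1][last] * END[last]
    # Traceback: follow the pointers from the last position to the first.
    path = [last]
    for i in range(len(x) - 1, 0, -1):
        path.append(ptr[i][path[-1]])
    return best_p, "".join(reversed(path))

print(viterbi("CGCGATAT"))
```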
Block Diagram for Viterbi Algorithm
[Figure: block diagram of the Viterbi algorithm. Inputs: alphabet, sequence, transition probabilities, emission probabilities. Output: most probable state sequence.]
Multiple paths can give the same sequence
The Viterbi algorithm finds the most likely path given a sequence.
Other paths could also give rise to the same sequence.
How do we calculate the probability of a sequence given an HMM?
Probability of a sequence
Sum the probabilities of all possible paths that give that sequence.
Let $P(x)$ be the probability of observing sequence $x$ given an HMM:
$$P(x) = \sum_{\pi} P(x, \pi)$$
Can find $P(x)$ using a variation on the Viterbi algorithm, using sum instead of max.
This is called the forward algorithm.
Replace $v_k(i)$ with $f_k(i) = P(x_1 \ldots x_i, \pi_i = k)$
Forward algorithm
Initialisation ($i=0$): $f_0(0)=1$, $f_k(0)=0$ for $k>0$
Recursion ($i=1..L$):
$$f_l(i) = e_l(x_i) \sum_k f_k(i-1)\, a_{kl}$$
Termination:
$$P(x) = \sum_k f_k(L)\, a_{k0}$$
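A minimal sketch of the forward recursion is shown below, using a hypothetical two-state model whose numbers are illustrative only. Indexing is 0-based, so `f[i]` holds the values written $f_k(i+1)$ in the slides, and the begin-state step is folded into the first column.

```python
# Hypothetical two-state HMM; all numbers are illustrative only.
STATES = ["+", "-"]
START = {"+": 0.5, "-": 0.5}                      # a_0k
TRANS = {"+": {"+": 0.8, "-": 0.1},               # a_kl
         "-": {"+": 0.1, "-": 0.8}}
END = {"+": 0.1, "-": 0.1}                        # a_k0
EMIT = {"+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},   # e_k(b)
        "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def forward(x):
    """Return (f, P(x)) for a non-empty sequence x."""
    # First column: begin-state transition times emission of x_1.
    f = [{k: START[k] * EMIT[k][x[0]] for k in STATES}]
    for i in range(1, len(x)):
        # Recursion: sum over predecessors instead of taking the max.
        f.append({l: EMIT[l][x[i]] * sum(f[i - 1][k] * TRANS[k][l] for k in STATES)
                  for l in STATES})
    # Termination: transition into the end state.
    px = sum(f[-1][k] * END[k] for k in STATES)
    return f, px

f, px = forward("CGCGATAT")
print(px)
```

As with Viterbi, the raw probabilities shrink quickly with sequence length, so practical implementations usually rescale each column or work in log space.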
Backward algorithm
We may need to know the probability that a particular observation $x_i$ came from a particular state $k$ given a sequence $x$, $P(\pi_i = k \mid x)$.
Use an algorithm analogous to the forward algorithm but starting from the end.
Backward algorithm
Initialisation ($i=L$): $b_k(L) = a_{k0}$ for all $k$
Recursion ($i=L-1,\ldots,1$):
$$b_k(i) = \sum_l a_{kl}\, e_l(x_{i+1})\, b_l(i+1)$$
Termination:
$$P(x) = \sum_l a_{0l}\, e_l(x_1)\, b_l(1)$$
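The backward recursion can be sketched in the same style. The two-state model is hypothetical, with illustrative numbers only, and indexing is 0-based so `b[i]` holds the values written $b_k(i+1)$ in the slides.

```python
# Hypothetical two-state HMM; all numbers are illustrative only.
STATES = ["+", "-"]
START = {"+": 0.5, "-": 0.5}                      # a_0k
TRANS = {"+": {"+": 0.8, "-": 0.1},               # a_kl
         "-": {"+": 0.1, "-": 0.8}}
END = {"+": 0.1, "-": 0.1}                        # a_k0
EMIT = {"+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},   # e_k(b)
        "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def backward(x):
    """Return (b, P(x)) for a non-empty sequence x."""
    L = len(x)
    b = [None] * L
    # Initialisation at the last position: b_k(L) = a_k0.
    b[L - 1] = {k: END[k] for k in STATES}
    # Recursion from position L-1 down to 1 (0-based: L-2 down to 0).
    for i in range(L - 2, -1, -1):
        b[i] = {k: sum(TRANS[k][l] * EMIT[l][x[i + 1]] * b[i + 1][l] for l in STATES)
                for k in STATES}
    # Termination: begin-state transition, first emission, first backward column.
    px = sum(START[l] * EMIT[l][x[0]] * b[0][l] for l in STATES)
    return b, px

b, px = backward("CGCGATAT")
print(px)
```

The value of `px` returned here should agree with the one obtained from the forward algorithm for the same model and sequence, which is a convenient consistency check.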
Estimating probability of state at particular position
Combine the forward and backward probabilities to estimate the posterior probability of the sequence being in a particular state at a particular position
$$P(\pi_i = k \mid x) = \frac{f_k(i)\, b_k(i)}{P(x)}$$
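Putting the forward and backward tables together gives the posterior state probabilities. The sketch below repeats compact forward and backward routines so that it runs on its own; the two-state model and its numbers are hypothetical, and the function name `posterior` is chosen for this example.

```python
# Hypothetical two-state HMM; all numbers are illustrative only.
STATES = ["+", "-"]
START = {"+": 0.5, "-": 0.5}                      # a_0k
TRANS = {"+": {"+": 0.8, "-": 0.1},               # a_kl
         "-": {"+": 0.1, "-": 0.8}}
END = {"+": 0.1, "-": 0.1}                        # a_k0
EMIT = {"+": {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1},   # e_k(b)
        "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}

def forward(x):
    f = [{k: START[k] * EMIT[k][x[0]] for k in STATES}]
    for i in range(1, len(x)):
        f.append({l: EMIT[l][x[i]] * sum(f[i - 1][k] * TRANS[k][l] for k in STATES)
                  for l in STATES})
    return f, sum(f[-1][k] * END[k] for k in STATES)

def backward(x):
    b = [None] * len(x)
    b[-1] = {k: END[k] for k in STATES}
    for i in range(len(x) - 2, -1, -1):
        b[i] = {k: sum(TRANS[k][l] * EMIT[l][x[i + 1]] * b[i + 1][l] for l in STATES)
                for k in STATES}
    return b

def posterior(x):
    """List over positions of {state: P(pi_i = state | x)}."""
    f, px = forward(x)
    b = backward(x)
    return [{k: f[i][k] * b[i][k] / px for k in STATES} for i in range(len(x))]

for symbol, post in zip("CGCGATAT", posterior("CGCGATAT")):
    print(symbol, {k: round(p, 3) for k, p in post.items()})
```

At every position the posteriors over states sum to 1, since each path passes through exactly one state there.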
Parameter estimation for HMMs
Simple when state sequence is known for training examples
Can be very complex for unknown paths
Estimation when state sequence known
Count number of times each transition occurs, $A_{kl}$
Count number of times each emission occurs from each state, $E_k(b)$
Convert to probabilities
$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}} \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$$
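The counting procedure can be sketched as below. The function name `estimate_from_labelled`, the training pairs, and the uniform fallback for states with no observations are assumptions for this example; transitions from the begin state and into the end state are not counted here.

```python
from collections import Counter

def estimate_from_labelled(examples, states, alphabet):
    """Estimate (a, e) from (sequence, state path) pairs with known paths."""
    A = Counter()   # A[(k, l)]: number of k -> l transitions
    E = Counter()   # E[(k, b)]: number of times state k emits b
    for x, path in examples:
        for i, (symbol, state) in enumerate(zip(x, path)):
            E[(state, symbol)] += 1
            if i + 1 < len(path):
                A[(state, path[i + 1])] += 1
    a, e = {}, {}
    for k in states:
        a_total = sum(A[(k, l)] for l in states)
        e_total = sum(E[(k, b)] for b in alphabet)
        # Assumption: use a uniform row when a state was never observed.
        a[k] = {l: A[(k, l)] / a_total if a_total else 1.0 / len(states)
                for l in states}
        e[k] = {b: E[(k, b)] / e_total if e_total else 1.0 / len(alphabet)
                for b in alphabet}
    return a, e

# Hypothetical labelled training data: sequence and its known state path
examples = [("CGCGAT", "++++--"), ("ATCG", "--++")]
a, e = estimate_from_labelled(examples, states="+-", alphabet="ACGT")
print(a)
print(e)
```

With so little data many emissions get probability zero; adding pseudocounts to the counts before normalising is a common remedy, though it is not part of the slide formulas.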
Baum-Welch
Make initial parameter estimates
Use forward algorithm and backward algorithm to calculate probability of each sequence according to the model
Calculate new model parameters
Repeat until termination criteria met (change in log likelihood < threshold)
Estimating transition frequencies
Probability that $a_{kl}$ is used at position $i$ in sequence $x$
$$P(\pi_i = k, \pi_{i+1} = l \mid x, \theta) = \frac{f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1)}{P(x)}$$
Sum over all positions ($i$) and all sequences ($j$) to get the expected number of times $a_{kl}$ is used
$$A_{kl} = \sum_j \frac{1}{P(x^j)} \sum_i f_k^j(i)\, a_{kl}\, e_l(x^j_{i+1})\, b_l^j(i+1)$$
Estimating emission frequencies
Sum over all positions for which the emitted character is $b$ and all sequences
$$E_k(b) = \sum_j \frac{1}{P(x^j)} \sum_{\{i \mid x_i^j = b\}} f_k^j(i)\, b_k^j(i)$$
Updating model parameters
Convert expected numbers to probabilities as if expected numbers were actual counts
$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}} \qquad e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$$
Test for termination
Calculate the log likelihood of the model for all of the sequences using the new parameters
$$\sum_{j=1}^{n} \log P(x^j \mid \theta)$$
If the change in log likelihood exceeds some threshold, go back and make new estimates of $a$ and $e$
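The whole Baum-Welch loop (expected counts, parameter update, termination test) might be sketched as follows. Several simplifications are assumptions of this example rather than part of the slides: there is no modelled end state (so $a_{k0}=1$), the begin probabilities `start` are held fixed, every training sequence has length at least 2, and plain probabilities are used without scaling, which is only adequate for short sequences. The state names, training sequences and initial parameters are hypothetical.

```python
import math

STATES = ["+", "-"]
ALPHABET = "ACGT"

def forward(x, start, a, e):
    f = [{k: start[k] * e[k][x[0]] for k in STATES}]
    for i in range(1, len(x)):
        f.append({l: e[l][x[i]] * sum(f[i - 1][k] * a[k][l] for k in STATES)
                  for l in STATES})
    return f, sum(f[-1].values())          # a_k0 = 1: no modelled end state

def backward(x, a, e):
    b = [None] * len(x)
    b[-1] = dict.fromkeys(STATES, 1.0)     # b_k(L) = a_k0 = 1
    for i in range(len(x) - 2, -1, -1):
        b[i] = {k: sum(a[k][l] * e[l][x[i + 1]] * b[i + 1][l] for l in STATES)
                for k in STATES}
    return b

def baum_welch_step(seqs, start, a, e):
    """One re-estimation; returns (new_a, new_e, log likelihood under a, e)."""
    A = {k: dict.fromkeys(STATES, 0.0) for k in STATES}     # expected A_kl
    E = {k: dict.fromkeys(ALPHABET, 0.0) for k in STATES}   # expected E_k(b)
    loglik = 0.0
    for x in seqs:
        f, px = forward(x, start, a, e)
        b = backward(x, a, e)
        loglik += math.log(px)
        for i in range(len(x)):
            for k in STATES:
                E[k][x[i]] += f[i][k] * b[i][k] / px
                if i + 1 < len(x):
                    for l in STATES:
                        A[k][l] += f[i][k] * a[k][l] * e[l][x[i + 1]] * b[i + 1][l] / px
    # Convert expected counts to probabilities as if they were actual counts.
    new_a = {k: {l: A[k][l] / sum(A[k].values()) for l in STATES} for k in STATES}
    new_e = {k: {c: E[k][c] / sum(E[k].values()) for c in ALPHABET} for k in STATES}
    return new_a, new_e, loglik

def baum_welch(seqs, start, a, e, threshold=1e-6, max_iter=100):
    previous = -math.inf
    for _ in range(max_iter):
        new_a, new_e, loglik = baum_welch_step(seqs, start, a, e)
        if loglik - previous < threshold:
            break                          # improvement too small: keep (a, e)
        previous = loglik
        a, e = new_a, new_e
    return a, e

# Hypothetical training data and initial parameter estimates
seqs = ["CGCGCGATATAT", "ATATCGCGCG", "GCGCATTA"]
start = {"+": 0.5, "-": 0.5}
a0 = {"+": {"+": 0.7, "-": 0.3}, "-": {"+": 0.4, "-": 0.6}}
e0 = {"+": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
      "-": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}}
a1, e1 = baum_welch(seqs, start, a0, e0)
print(a1)
print(e1)
```

Each iteration should not decrease the log likelihood, but the procedure only finds a local optimum, so the result depends on the initial estimates.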