
Computational Biology, Part 5: Hidden Markov Models

Robert F. Murphy. Copyright 2005-2009. All rights reserved.

Markov chains

If we can predict all of the properties of a sequence knowing only the conditional dinucleotide probabilities, then that sequence is an example of a Markov chain.

A Markov chain is defined as a sequence of states in which each state depends only on the previous state.

Formalism for Markov chains

M = (Q, π, P) is a Markov chain, where

Q = vector (1,..,n) is the list of states
    Q(1)=A, Q(2)=C, Q(3)=G, Q(4)=T for DNA

π = vector ($p_1,..,p_n$) is the initial probability of each state
    $\pi(i) = p_{Q(i)}$ (e.g., $\pi(1) = p_A$ for DNA)

P = n x n matrix where the entry in row i and column j is the probability of observing state j if the previous state is i, and the sum of entries in each row is 1 (equivalent to the conditional dinucleotide probabilities)
    $P(i,j) = p^*_{Q(i)Q(j)}$ (e.g., $P(1,2) = p^*_{AC}$ for DNA)

Generating Markov chains

Given Q, π, P (and a random number generator), we can generate sequences that are members of the Markov chain M.

If π, P are derived from a single sequence, the family of sequences generated by M will include that sequence as well as many others.

If π, P are derived from a sampled set of sequences, the family of sequences generated by M approximates the population from which that set was sampled.
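Deriving π and P from a single sequence can be sketched as follows. This is a minimal Python illustration, not the lecture's own code: the function names `estimate_markov_chain` and `sequence_probability` are hypothetical, π is taken from mononucleotide frequencies, and P from counts of adjacent pairs normalized by row.

```python
from collections import Counter

def estimate_markov_chain(seq, alphabet="acgt"):
    """Estimate (pi, P) from one sequence by counting (illustrative helper)."""
    mono = Counter(seq)
    pi = {a: mono[a] / len(seq) for a in alphabet}
    # Count each adjacent pair (previous character, next character).
    pair_counts = Counter(zip(seq, seq[1:]))
    P = {}
    for a in alphabet:
        row_total = sum(pair_counts[(a, b)] for b in alphabet)
        # A character that is never followed by anything gets an all-zero row.
        P[a] = {b: (pair_counts[(a, b)] / row_total if row_total else 0.0)
                for b in alphabet}
    return pi, P

def sequence_probability(seq, pi, P):
    """Probability of seq under the chain: pi(x1) times the product of P(x[i-1], x[i])."""
    prob = pi[seq[0]]
    for prev, nxt in zip(seq, seq[1:]):
        prob *= P[prev][nxt]
    return prob

pi, P = estimate_markov_chain("acgcgt")
```

With the toy sequence "acgcgt", g is followed once by c and once by t, so the g row of P assigns 0.5 to each.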

Interactive Demonstration

(A11 Markov chains)

Matlab code for generating Markov chains

    chars = ['a' 'c' 'g' 't'];
    % the dinucs array shows the frequency of observing the character in the
    % row followed by the character in the column
    % these values show strong preference for c-c
    dinucs = [2, 1, 2, 0; 0, 8, 0, 1; 2, 0, 2, 0; 1, 0, 0, 1];
    % these values restrict transitions more
    %dinucs = [2, 0, 2, 0; 0, 8, 0, 0; 2, 0, 2, 0; 1, 1, 0, 1];

    % calculate mononucleotide frequencies only as the probability of
    % starting with each nucleotide
    monocounts = sum(dinucs,2);
    monofreqs = monocounts/sum(monocounts);
    cmonofreqs = cumsum(monofreqs);

    % calculate dinucleotide frequencies and cumulative dinuc freqs
    freqs = dinucs./repmat(monocounts,1,4);
    cfreqs = cumsum(freqs,2);

    disp('Dinucleotide frequencies (transition probabilities)');
    fprintf(' %c %c %c %c\n',chars)
    for i=1:4
        fprintf('%c %f %f %f %f\n',chars(i),freqs(i,:))
    end

    nseq = 10;
    for ntries=1:20
        rnums = rand(nseq,1);
        % start sequence using mononucleotide frequencies
        seq(1) = min(find(cmonofreqs>=rnums(1)));
        for i=2:nseq
            % extend it using the appropriate row from the dinuc freqs
            seq(i) = min(find(cfreqs(seq(i-1),:)>=rnums(i)));
        end
        output=chars(seq);
        disp(strvcat(output));
    end

Discriminating between two states with Markov chains

To determine which of two states a sequence is more likely to have resulted from, we calculate

$$S(x) = \log \frac{P(x \mid \text{model}+)}{P(x \mid \text{model}-)} = \sum_{i=1}^{L} \log \frac{a^{+}_{x_{i-1}x_i}}{a^{-}_{x_{i-1}x_i}}$$

$$S(x) = \sum_{i=1}^{L} \beta_{x_{i-1}x_i}$$

State probabilities for + and - models

Given example sequences that are from either the + model (CpG island) or the - model (not CpG island), we can calculate the probability that each nucleotide (column) will occur after each preceding nucleotide (row) for each model (the a values for each model).

    +    A      C      G      T          -    A      C      G      T
    A  0.180  0.274  0.426  0.120        A  0.300  0.205  0.285  0.210
    C  0.171  0.368  0.274  0.188        C  0.322  0.298  0.078  0.302
    G  0.161  0.339  0.375  0.125        G  0.248  0.246  0.298  0.208
    T  0.079  0.355  0.384  0.182        T  0.177  0.239  0.292  0.292

Transition probabilities converted to log likelihood ratios

    β     A       C       G       T
    A  -0.740   0.419   0.580  -0.803
    C  -0.913   0.302   1.812  -0.685
    G  -0.624   0.461   0.331  -0.730
    T  -1.169   0.573   0.393  -0.679
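The conversion from the two transition tables to log likelihood ratios can be sketched in Python as below. The sketch assumes base-2 logarithms, which is what the published values imply (for example log2(0.274/0.078) is about 1.81); the names `PLUS`, `MINUS` and `log_odds_table` are illustrative. Because the inputs are the rounded three-decimal probabilities, some entries differ from the slide values in the third decimal place.

```python
import math

NUCS = "ACGT"

# Transition probabilities from the + (CpG island) and - (not CpG island)
# tables; the outer key is the previous nucleotide, the inner key the next.
PLUS = {
    "A": dict(zip(NUCS, (0.180, 0.274, 0.426, 0.120))),
    "C": dict(zip(NUCS, (0.171, 0.368, 0.274, 0.188))),
    "G": dict(zip(NUCS, (0.161, 0.339, 0.375, 0.125))),
    "T": dict(zip(NUCS, (0.079, 0.355, 0.384, 0.182))),
}
MINUS = {
    "A": dict(zip(NUCS, (0.300, 0.205, 0.285, 0.210))),
    "C": dict(zip(NUCS, (0.322, 0.298, 0.078, 0.302))),
    "G": dict(zip(NUCS, (0.248, 0.246, 0.298, 0.208))),
    "T": dict(zip(NUCS, (0.177, 0.239, 0.292, 0.292))),
}

def log_odds_table(plus, minus):
    """beta[prev][next] = log2(a+ / a-) for every dinucleotide."""
    return {prev: {nxt: math.log2(plus[prev][nxt] / minus[prev][nxt])
                   for nxt in NUCS}
            for prev in NUCS}

BETA = log_odds_table(PLUS, MINUS)
```

Positive entries mark dinucleotides that are more likely under the + model, negative entries those more likely under the - model.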

Example

What is the relative probability of C+G+C+ compared with C-G-C-?

First calculate the log-odds ratio:
S(CGC) = β(CG) + β(GC) = 1.812 + 0.461 = 2.273

Convert to relative probability:
$2^{2.273} = 4.833$

Relative probability is the ratio of (+) to (-):
P(+) = 4.833 P(-)

Example

Convert to percentage:
P(+) + P(-) = 1
4.833 P(-) + P(-) = 1
P(-) = 1/5.833 = 17%

Conclusion:
P(+) = 83%, P(-) = 17%
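The worked example can be checked with a short Python sketch. It uses the β values from the log likelihood ratio table and, like the example, assumes that the + and - models are the only two alternatives and are equally likely beforehand; the function names are illustrative.

```python
NUCS = "ACGT"

# Log likelihood ratios (base 2) copied from the table in the lecture;
# the outer key is the previous nucleotide, the inner key the next.
BETA = {
    "A": dict(zip(NUCS, (-0.740, 0.419, 0.580, -0.803))),
    "C": dict(zip(NUCS, (-0.913, 0.302, 1.812, -0.685))),
    "G": dict(zip(NUCS, (-0.624, 0.461, 0.331, -0.730))),
    "T": dict(zip(NUCS, (-1.169, 0.573, 0.393, -0.679))),
}

def log_odds_score(seq):
    """S(x): sum of beta over each adjacent pair of nucleotides."""
    return sum(BETA[prev][nxt] for prev, nxt in zip(seq, seq[1:]))

def probability_plus(score):
    """Convert a base-2 log-odds score to P(+), given P(+) + P(-) = 1."""
    ratio = 2 ** score          # P(+) / P(-)
    return ratio / (1 + ratio)

score = log_odds_score("CGC")   # beta(CG) + beta(GC)
p_plus = probability_plus(score)
```

As in the example, only the two transitions CG and GC contribute; the first nucleotide's starting probability is not included in the score.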

Hidden Markov models

“Hidden” connotes that the sequence is generated by two or more states that have different transition probability matrices.

More definitions

$\pi_i$ = state at position i in a path

$a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k)$
    probability of going from one state to another
    “transition probability”

$e_k(b) = P(x_i = b \mid \pi_i = k)$
    probability of emitting a b when in state k
    “emission probability”

Generating sequences (see previous example code)

    % force emission to match state (normal Markov model, not hidden)
    emit = diag(repmat(1,4,1));
    [seq2,states]=hmmgenerate(10,freqs,emit)
    output2=chars(seq2);
    disp(strvcat(output2));

Decoding

The goal of using an HMM is often to determine (estimate) the sequence of underlying states that likely gave rise to an observed sequence.

This is called “decoding” in the jargon of speech recognition.

More definitions

We can calculate the joint probability of a sequence x and a state sequence π:

$$P(x, \pi) = a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}$$

requiring $\pi_{L+1} = 0$
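The joint probability formula can be illustrated with a small Python sketch. The two-state model below (states "+" and "-" emitting nucleotides) and all of its numbers are hypothetical, chosen only for illustration. State 0 of the lecture's notation is represented by the separate `START` (a_0k) and `END` (a_k0) tables, so each row of `TRANS` plus the matching `END` entry sums to 1.

```python
# Hypothetical toy HMM (all values are assumptions, not from the lecture).
STATES = "+-"
START = {"+": 0.5, "-": 0.5}                       # a_0k
END = {"+": 0.1, "-": 0.1}                         # a_k0
TRANS = {"+": {"+": 0.8, "-": 0.1},                # a_kl
         "-": {"+": 0.1, "-": 0.8}}
EMIT = {"+": {"a": 0.1, "c": 0.4, "g": 0.4, "t": 0.1},   # e_k(b)
        "-": {"a": 0.3, "c": 0.2, "g": 0.2, "t": 0.3}}

def joint_probability(x, path):
    """P(x, pi) = a_0,pi1 * product over i of e_pi_i(x_i) * a_pi_i,pi_i+1."""
    assert len(x) == len(path)
    prob = START[path[0]]
    for i, (symbol, state) in enumerate(zip(x, path)):
        prob *= EMIT[state][symbol]
        if i + 1 < len(path):
            prob *= TRANS[state][path[i + 1]]
        else:
            prob *= END[state]      # pi_{L+1} = 0: leave via the end state
    return prob

p = joint_probability("cg", "++")
```

For x = "cg" and path "++" the product is 0.5 × 0.4 × 0.8 × 0.4 × 0.1 = 0.0064.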

Determining the optimal path: the Viterbi algorithm

The Viterbi algorithm is a form of dynamic programming.

Definition: Let $v_k(i)$ be the probability of the most probable path ending in state k with observation i.

Determining the optimal path: the Viterbi algorithm

Initialisation (i=0): $v_0(0) = 1$, $v_k(0) = 0$ for k>0

Recursion (i=1..L):
$$v_l(i) = e_l(x_i) \max_k \left( v_k(i-1)\, a_{kl} \right)$$
$$\mathrm{ptr}_i(l) = \arg\max_k \left( v_k(i-1)\, a_{kl} \right)$$

Termination:
$$P(x, \pi^*) = \max_k \left( v_k(L)\, a_{k0} \right)$$
$$\pi^*_L = \arg\max_k \left( v_k(L)\, a_{k0} \right)$$

Traceback (i=L..1):
$$\pi^*_{i-1} = \mathrm{ptr}_i(\pi^*_i)$$
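The four steps above can be sketched in Python as follows. The two-state model and its parameters are hypothetical and exist only to make the example runnable. The begin state is folded into `START` (a_0k), so the first column already holds v_k(1) rather than the i=0 initialisation. Probabilities are multiplied directly, which is adequate for short sequences; longer ones would normally be handled in log space.

```python
# Hypothetical toy HMM (all values are assumptions, not from the lecture).
STATES = "+-"
START = {"+": 0.5, "-": 0.5}                       # a_0k
END = {"+": 0.1, "-": 0.1}                         # a_k0
TRANS = {"+": {"+": 0.8, "-": 0.1},                # a_kl
         "-": {"+": 0.1, "-": 0.8}}
EMIT = {"+": {"a": 0.1, "c": 0.4, "g": 0.4, "t": 0.1},   # e_k(b)
        "-": {"a": 0.3, "c": 0.2, "g": 0.2, "t": 0.3}}

def viterbi(x):
    """Return (most probable state path, P(x, path)) for observed string x."""
    # First position: v_k(1) = a_0k * e_k(x_1).
    v = [{k: START[k] * EMIT[k][x[0]] for k in STATES}]
    pointers = []
    # Recursion: keep the best predecessor of every state at every position.
    for symbol in x[1:]:
        column, back = {}, {}
        for l in STATES:
            best = max(STATES, key=lambda k: v[-1][k] * TRANS[k][l])
            back[l] = best
            column[l] = EMIT[l][symbol] * v[-1][best] * TRANS[best][l]
        v.append(column)
        pointers.append(back)
    # Termination: include the transition into the end state.
    last = max(STATES, key=lambda k: v[-1][k] * END[k])
    prob = v[-1][last] * END[last]
    # Traceback: follow the stored pointers from the last position to the first.
    path = [last]
    for back in reversed(pointers):
        path.append(back[path[-1]])
    return "".join(reversed(path)), prob

path, prob = viterbi("cgcgcgatatat")
```

With these toy parameters the c/g-rich half decodes as "+" and the a/t-rich half as "-", because one state switch costs less than emitting six poorly matched characters.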

Block Diagram for Viterbi Algorithm

[Figure: block diagram of the Viterbi algorithm. Labels: alphabet, sequence, transition probabilities, emission probabilities, most probable state sequence.]

Multiple paths can give the same sequence

The Viterbi algorithm finds the most likely path given a sequence.

Other paths could also give rise to the same sequence.

How do we calculate the probability of a sequence given an HMM?

Probability of a sequence

Sum the probabilities of all possible paths that give that sequence.

Let P(x) be the probability of observing sequence x given an HMM:

$$P(x) = \sum_{\pi} P(x, \pi)$$

Probability of a sequence

We can find P(x) using a variation on the Viterbi algorithm that uses sum instead of max.

This is called the forward algorithm.

Replace $v_k(i)$ with $f_k(i) = P(x_1 \ldots x_i, \pi_i = k)$

Forward algorithm

Initialisation (i=0): $f_0(0) = 1$, $f_k(0) = 0$ for k>0

Recursion (i=1..L):
$$f_l(i) = e_l(x_i) \sum_k f_k(i-1)\, a_{kl}$$

Termination:
$$P(x) = \sum_k f_k(L)\, a_{k0}$$
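A minimal Python sketch of the forward algorithm follows, using a hypothetical two-state model whose parameters are assumptions made for illustration. The begin state is folded into `START` (a_0k), so the first column computed is f_k(1).

```python
# Hypothetical toy HMM (all values are assumptions, not from the lecture).
STATES = "+-"
START = {"+": 0.5, "-": 0.5}                       # a_0k
END = {"+": 0.1, "-": 0.1}                         # a_k0
TRANS = {"+": {"+": 0.8, "-": 0.1},                # a_kl
         "-": {"+": 0.1, "-": 0.8}}
EMIT = {"+": {"a": 0.1, "c": 0.4, "g": 0.4, "t": 0.1},   # e_k(b)
        "-": {"a": 0.3, "c": 0.2, "g": 0.2, "t": 0.3}}

def forward_table(x):
    """Return the list of columns f[i][k] (i is 0-based, so f[0] is f_k(1))."""
    f = [{k: START[k] * EMIT[k][x[0]] for k in STATES}]
    for symbol in x[1:]:
        prev = f[-1]
        # Same recursion as Viterbi, with a sum in place of the max.
        f.append({l: EMIT[l][symbol] * sum(prev[k] * TRANS[k][l] for k in STATES)
                  for l in STATES})
    return f

def sequence_probability(x):
    """P(x): sum over final states of f_k(L) * a_k0."""
    f = forward_table(x)
    return sum(f[-1][k] * END[k] for k in STATES)

p = sequence_probability("cgcg")
```

Because every state in this toy model ends with probability 0.1, the probabilities of all sequences of length 2 add up to 0.9 × 0.1 = 0.09, which is a convenient sanity check.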

Backward algorithm

We may need to know the probability that a particular observation $x_i$ came from a particular state k given a sequence x, $P(\pi_i = k \mid x)$.

Use an algorithm analogous to the forward algorithm but starting from the end.

Backward algorithm

Initialisation (i=L): $b_k(L) = a_{k0}$ for all k

Recursion (i=L-1,…,1):
$$b_k(i) = \sum_l a_{kl}\, e_l(x_{i+1})\, b_l(i+1)$$

Termination:
$$P(x) = \sum_l a_{0l}\, e_l(x_1)\, b_l(1)$$
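The backward recursion can be sketched in Python in the same way. The two-state model below is hypothetical, with parameters chosen only for illustration; `START` holds a_0k and `END` holds a_k0.

```python
# Hypothetical toy HMM (all values are assumptions, not from the lecture).
STATES = "+-"
START = {"+": 0.5, "-": 0.5}                       # a_0k
END = {"+": 0.1, "-": 0.1}                         # a_k0
TRANS = {"+": {"+": 0.8, "-": 0.1},                # a_kl
         "-": {"+": 0.1, "-": 0.8}}
EMIT = {"+": {"a": 0.1, "c": 0.4, "g": 0.4, "t": 0.1},   # e_k(b)
        "-": {"a": 0.3, "c": 0.2, "g": 0.2, "t": 0.3}}

def backward_table(x):
    """Return the list of columns b[i][k] (i is 0-based, so b[-1] is b_k(L))."""
    b = [None] * len(x)
    b[-1] = {k: END[k] for k in STATES}            # initialisation at i = L
    for i in range(len(x) - 2, -1, -1):            # i = L-1, ..., 1
        b[i] = {k: sum(TRANS[k][l] * EMIT[l][x[i + 1]] * b[i + 1][l]
                       for l in STATES)
                for k in STATES}
    return b

def sequence_probability_backward(x):
    """P(x) = sum over l of a_0l * e_l(x_1) * b_l(1)."""
    b = backward_table(x)
    return sum(START[l] * EMIT[l][x[0]] * b[0][l] for l in STATES)

p = sequence_probability_backward("cgcg")
```

The value returned here should agree with the forward algorithm's P(x) for the same model and sequence, which is a useful check on either implementation.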

Estimating probability of state at particular position

Combine the forward and backward probabilities to estimate the posterior probability of the sequence being in a particular state at a particular position:

$$P(\pi_i = k \mid x) = \frac{f_k(i)\, b_k(i)}{P(x)}$$
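Combining the two tables can be sketched in Python as below. The model is a hypothetical two-state HMM with made-up parameters, and the helper names are illustrative; the forward and backward functions are repeated so that the block runs on its own.

```python
# Hypothetical toy HMM (all values are assumptions, not from the lecture).
STATES = "+-"
START = {"+": 0.5, "-": 0.5}                       # a_0k
END = {"+": 0.1, "-": 0.1}                         # a_k0
TRANS = {"+": {"+": 0.8, "-": 0.1},                # a_kl
         "-": {"+": 0.1, "-": 0.8}}
EMIT = {"+": {"a": 0.1, "c": 0.4, "g": 0.4, "t": 0.1},   # e_k(b)
        "-": {"a": 0.3, "c": 0.2, "g": 0.2, "t": 0.3}}

def forward_table(x):
    f = [{k: START[k] * EMIT[k][x[0]] for k in STATES}]
    for symbol in x[1:]:
        prev = f[-1]
        f.append({l: EMIT[l][symbol] * sum(prev[k] * TRANS[k][l] for k in STATES)
                  for l in STATES})
    return f

def backward_table(x):
    b = [None] * len(x)
    b[-1] = {k: END[k] for k in STATES}
    for i in range(len(x) - 2, -1, -1):
        b[i] = {k: sum(TRANS[k][l] * EMIT[l][x[i + 1]] * b[i + 1][l]
                       for l in STATES)
                for k in STATES}
    return b

def posterior(x):
    """Return, for each position i, the dict P(pi_i = k | x) over states k."""
    f = forward_table(x)
    b = backward_table(x)
    px = sum(f[-1][k] * END[k] for k in STATES)    # P(x) from the forward pass
    return [{k: f[i][k] * b[i][k] / px for k in STATES} for i in range(len(x))]

post = posterior("cgcgat")
```

At every position the posterior probabilities over the states sum to 1, since the product f_k(i) b_k(i) summed over k equals P(x).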

Parameter estimation for HMMs

Simple when the state sequence is known for the training examples.

Can be very complex for unknown paths.

Estimation when state sequence known

Count the number of times each transition occurs, $A_{kl}$.

Count the number of times each emission occurs from each state, $E_k(b)$.

Convert to probabilities:

$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}$$

$$e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$$
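Counting and normalizing can be sketched in Python as follows. The training pair and the function name `estimate_from_labeled` are hypothetical. For brevity the sketch ignores the begin and end states and uses raw counts without pseudocounts, so a state that never appears in the training data gets an all-zero row.

```python
from collections import Counter

def estimate_from_labeled(pairs, states="+-", alphabet="acgt"):
    """Estimate a_kl and e_k(b) from (sequence, state path) training pairs."""
    A = Counter()   # A[(k, l)]: number of k -> l transitions
    E = Counter()   # E[(k, b)]: number of times state k emitted b
    for x, path in pairs:
        A.update(zip(path, path[1:]))
        E.update(zip(path, x))
    trans, emit = {}, {}
    for k in states:
        a_total = sum(A[(k, l)] for l in states)
        e_total = sum(E[(k, b)] for b in alphabet)
        # Divide each count by its row total (zero rows are left as zeros).
        trans[k] = {l: (A[(k, l)] / a_total if a_total else 0.0) for l in states}
        emit[k] = {b: (E[(k, b)] / e_total if e_total else 0.0) for b in alphabet}
    return trans, emit

# Hypothetical labeled example: "cg" emitted from "+", "at" from "-".
trans, emit = estimate_from_labeled([("cgat", "++--")])
```

In this example the path "++--" contains one each of the transitions + to +, + to -, and - to -, so the + row becomes (0.5, 0.5) and the - row becomes (0, 1).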

Baum-Welch

Make initial parameter estimates.

Use the forward algorithm and backward algorithm to calculate the probability of each sequence according to the model.

Calculate new model parameters.

Repeat until the termination criteria are met (change in log likelihood < threshold).

Estimating transition frequencies

Probability that $a_{kl}$ is used at position i in sequence x:

$$P(\pi_i = k, \pi_{i+1} = l \mid x, \theta) = \frac{f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1)}{P(x)}$$

Sum over all positions (i) and all sequences (j) to get the expected number of times $a_{kl}$ is used:

$$A_{kl} = \sum_j \frac{1}{P(x^j)} \sum_i f_k^j(i)\, a_{kl}\, e_l(x^j_{i+1})\, b_l^j(i+1)$$

Estimating emission frequencies

Sum over all positions for which the emitted character is b and all sequences:

$$E_k(b) = \sum_j \frac{1}{P(x^j)} \sum_{\{i \mid x_i^j = b\}} f_k^j(i)\, b_k^j(i)$$

Updating model parameters

Convert expected numbers to probabilities as if the expected numbers were actual counts:

$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}$$

$$e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}$$

Test for termination

Calculate the log likelihood of the model for all of the sequences using the new parameters:

$$\sum_{j=1}^{n} \log P(x^j \mid \theta)$$

If the change in log likelihood exceeds some threshold, go back and make new estimates of a and e.
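The whole Baum-Welch loop, from expected counts to the termination test, can be sketched in Python as below. This is an illustration under stated assumptions rather than a reference implementation: the training sequences and initial parameters are hypothetical, the begin (`START`) and end (`END`) probabilities are held fixed so each transition row is rescaled to sum to 1 minus the end probability, and no scaling or log-space arithmetic is used, so it is only suitable for short sequences.

```python
import math

STATES, ALPHABET = "+-", "acgt"
START = {"+": 0.5, "-": 0.5}          # a_0k, held fixed (assumption)
END = {"+": 0.1, "-": 0.1}            # a_k0, held fixed (assumption)

def forward_table(x, trans, emit):
    f = [{k: START[k] * emit[k][x[0]] for k in STATES}]
    for s in x[1:]:
        prev = f[-1]
        f.append({l: emit[l][s] * sum(prev[k] * trans[k][l] for k in STATES)
                  for l in STATES})
    return f

def backward_table(x, trans, emit):
    b = [None] * len(x)
    b[-1] = dict(END)
    for i in range(len(x) - 2, -1, -1):
        b[i] = {k: sum(trans[k][l] * emit[l][x[i + 1]] * b[i + 1][l]
                       for l in STATES) for k in STATES}
    return b

def baum_welch_step(seqs, trans, emit):
    """One re-estimation; also returns the log likelihood of the input parameters."""
    A = {k: {l: 0.0 for l in STATES} for k in STATES}
    E = {k: {c: 0.0 for c in ALPHABET} for k in STATES}
    loglik = 0.0
    for x in seqs:
        f, b = forward_table(x, trans, emit), backward_table(x, trans, emit)
        px = sum(f[-1][k] * END[k] for k in STATES)
        loglik += math.log(px)
        for i in range(len(x)):
            for k in STATES:
                E[k][x[i]] += f[i][k] * b[i][k] / px          # expected emissions
                if i + 1 < len(x):
                    for l in STATES:                          # expected transitions
                        A[k][l] += (f[i][k] * trans[k][l]
                                    * emit[l][x[i + 1]] * b[i + 1][l] / px)
    new_trans = {k: {l: (1 - END[k]) * A[k][l] / (sum(A[k].values()) or 1.0)
                     for l in STATES} for k in STATES}
    new_emit = {k: {c: E[k][c] / (sum(E[k].values()) or 1.0) for c in ALPHABET}
                for k in STATES}
    return new_trans, new_emit, loglik

def baum_welch(seqs, trans, emit, threshold=1e-6, max_iter=200):
    """Iterate until the log likelihood improves by less than threshold."""
    history = []
    for _ in range(max_iter):
        new_trans, new_emit, loglik = baum_welch_step(seqs, trans, emit)
        converged = bool(history) and loglik - history[-1] < threshold
        history.append(loglik)
        if converged:
            break
        trans, emit = new_trans, new_emit
    return trans, emit, history

# Hypothetical starting parameters and training sequences.
TRANS0 = {"+": {"+": 0.8, "-": 0.1}, "-": {"+": 0.1, "-": 0.8}}
EMIT0 = {"+": {"a": 0.1, "c": 0.4, "g": 0.4, "t": 0.1},
         "-": {"a": 0.3, "c": 0.2, "g": 0.2, "t": 0.3}}
SEQS = ["cgcgcgatat", "atatcgcg", "ccggcgtata"]
trans, emit, history = baum_welch(SEQS, TRANS0, EMIT0)
```

The `history` list records the log likelihood at the start of each iteration; it should never decrease, which is the property the termination test relies on. Baum-Welch finds a local optimum only, so a different starting point can give different final parameters.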
