Hidden Markov Models
Hsin-Min Wang
[email protected]

References:
1. L. R. Rabiner and B. H. Juang (1993), Fundamentals of Speech Recognition, Chapter 6.
2. X. Huang et al. (2001), Spoken Language Processing, Chapter 8.
3. L. R. Rabiner (1989), "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, vol. 77, no. 2, February 1989.

8. Hidden Markov Models


Observable Markov Model

The parameters of a Markov chain, with N states labeled {1, ..., N} and the state at time t denoted q_t, can be described as
- State transition probabilities: a_ij = P(q_t = j | q_t-1 = i), 1 ≤ i, j ≤ N
- Initial state probabilities: π_i = P(q_1 = i), 1 ≤ i ≤ N

The output of the process is the sequence of states at each time instant t, where each state corresponds to an observable event X_i. There is a one-to-one correspondence between the observable sequence and the Markov-chain state sequence.

(Rabiner 1989)

Hidden Markov Model (HMM)

History
- Published in Baum's papers in the late 1960s and early 1970s
- Introduced to speech processing by Baker (CMU) and Jelinek (IBM) in the 1970s
- Introduced to computational biology in the late 1980s
  - Lander and Green (1987) used HMMs in the construction of genetic linkage maps
  - Churchill (1989) employed HMMs to distinguish coding from noncoding regions in DNA

Hidden Markov Model (HMM)

Assumption
- A speech signal (or DNA sequence) can be characterized as a parametric random process
- The parameters of that process can be estimated in a precise, well-defined manner

Three fundamental problems
- Evaluation of the probability (likelihood) of a sequence of observations given a specific HMM
- Determination of a best sequence of model states
- Adjustment of model parameters so as to best account for the observed signal/sequence

Probability Theory

Consider the simple scenario of rolling two dice, labeled die 1 and die 2. Define the following three events:
- A: Die 1 lands on 3.
- B: Die 2 lands on 1.
- C: The dice sum to 8.

Prior probability: P(A)=P(B)=1/6, P(C)=5/36.

Joint probability: P(A,B) (or P(A∩B)) = 1/36. Two events A and B are statistically independent if and only if P(A,B) = P(A)×P(B).

P(B,C) = 0. Two events B and C are mutually exclusive if and only if B∩C = ∅, i.e., P(B∩C) = 0.

Conditional probability: P(B|A) = P(A,B)/P(A). Here P(B|A) = P(B) because A and B are independent, and P(C|B) = 0 because B and C are mutually exclusive.

Bayes' rule: P(A|B) = P(B|A)P(A)/P(B) (the posterior probability, which underlies the maximum-likelihood principle).

In this example, C = {(2,6), (3,5), (4,4), (5,3), (6,2)}, A∩B = {(3,1)}, and B∩C = ∅.

Hidden Markov Model (HMM)

[Figure: a 3-state HMM with states S1, S2, S3, each emitting symbols A, B, C with nearly uniform probabilities ({A:.34, B:.33, C:.33}, {A:.33, B:.34, C:.33}, {A:.33, B:.33, C:.34}) and nearly uniform transition probabilities (0.33/0.34).]

Given an initial model as above, we can train one HMM for each of the following two classes using their respective training data.

Training set for class 1:
1. ABBCABCAABC  2. ABCABC  3. ABCA ABC  4. BBABCAB  5. BCAABCCAB
6. CACCABCA  7. CABCABCA  8. CABCA  9. CABCA

Training set for class 2:
1. BBBCCBC  2. CCBABB  3. AACCBBB  4. BBABBAC  5. CCAABBAB
6. BBBCCBAA  7. ABBBBABA  8. CCCCC  9. BBAAA

We can then decide which class the following testing sequences belong to: ABCABCCABAABABCCCCBBB

The Markov Chain - Ex 1

A 3-state Markov chain in which state 1 generates symbol A only, state 2 generates symbol B only, and state 3 generates symbol C only.

Given a sequence of observed symbols O = {CABBCABC}, the only corresponding state sequence is Q = {S3 S1 S2 S2 S3 S1 S2 S3}, and the corresponding probability is

P(O|λ) = P(CABBCABC|λ) = P(Q|λ) = P(S3 S1 S2 S2 S3 S1 S2 S3 | λ)
= π(S3) P(S1|S3) P(S2|S1) P(S2|S2) P(S3|S2) P(S1|S3) P(S2|S1) P(S3|S2)
= 0.1 × 0.3 × 0.3 × 0.7 × 0.2 × 0.3 × 0.3 × 0.2 = 0.00002268
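The same computation can be carried out mechanically. Below is a minimal Python sketch; the transition matrix and π(S3) = 0.1 are read off the Ex 1 figure, and the exact row/column layout of the matrix is an assumption consistent with the factors used above.

```python
# Minimal sketch: P(O|lambda) for an *observable* Markov model, where each
# state emits exactly one symbol, so O determines the state sequence Q uniquely.
A = {  # A[i][j] = P(q_t = S_j | q_{t-1} = S_i), assumed layout from the figure
    "S1": {"S1": 0.6, "S2": 0.3, "S3": 0.1},
    "S2": {"S1": 0.1, "S2": 0.7, "S3": 0.2},
    "S3": {"S1": 0.3, "S2": 0.2, "S3": 0.5},
}
pi = {"S3": 0.1}                                # only pi(S3) is needed here
state_of = {"A": "S1", "B": "S2", "C": "S3"}    # one-to-one symbol-to-state map

O = "CABBCABC"
Q = [state_of[o] for o in O]                    # S3 S1 S2 S2 S3 S1 S2 S3

prob = pi[Q[0]]
for prev, curr in zip(Q[:-1], Q[1:]):
    prob *= A[prev][curr]

print(Q)      # ['S3', 'S1', 'S2', 'S2', 'S3', 'S1', 'S2', 'S3']
print(prob)   # 0.00002268 (up to floating-point rounding)
```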

[Figure: state-transition diagram of the 3-state Markov chain, with transition probabilities 0.6, 0.7, 0.3, 0.1, 0.2, 0.2, 0.1, 0.3, 0.5 among states S1, S2, S3.]

The Markov Chain - Ex 2

A three-state Markov chain for the Dow Jones Industrial average

The probability of 5 consecutive up days
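A worked sketch, assuming (as on the later Dow Jones slides) that state 1 corresponds to "up" with π_1 = 0.5 and a_11 = 0.6:

P(up, up, up, up, up) = π_1 · a_11 · a_11 · a_11 · a_11 = 0.5 × 0.6^4 ≈ 0.0648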

(Huang et al., 2001)

Extension to Hidden Markov Models
- An HMM is an extended version of the observable Markov model: the observation is a probabilistic function (discrete or continuous) of a state, rather than being in one-to-one correspondence with a state
- The model is a doubly embedded stochastic process with an underlying stochastic process that is not directly observable (hidden)
- What is hidden? The state sequence! Given the observation sequence, we are not sure which state sequence generated it

Hidden Markov Models - Ex 1

A 3-state discrete HMM

Given an observation sequence O = {ABC}, there are 27 (= 3^3) possible corresponding state sequences, so the probability P(O|λ) is the sum over all of them: P(O|λ) = Σ_Q P(O, Q|λ) = Σ_Q P(Q|λ) P(O|Q, λ).

[Figure: a 3-state discrete HMM with the same transition probabilities as the Markov chain of Ex 1 and per-state symbol-emission distributions {A:.3, B:.2, C:.5}, {A:.7, B:.1, C:.2}, {A:.3, B:.6, C:.1}.]

Initial model

Hidden Markov Models - Ex 2

(Huang et al., 2001)

Given a three-state hidden Markov model for the Dow Jones Industrial average as follows:
- How to find the probability P(up, up, up, up, up | λ)?
- How to find the optimal state sequence of the model that generates the observation sequence up, up, up, up, up?

cf. the Markov chain (3^5 = 243 state sequences can generate up, up, up, up, up)

Elements of an HMM

An HMM is characterized by the following:
- N, the number of states in the model
- M, the number of distinct observation symbols per state
- The state transition probability distribution A = {a_ij}, where a_ij = P(q_t+1 = j | q_t = i), 1 ≤ i, j ≤ N
- The observation symbol probability distribution in state j, B = {b_j(v_k)}, where b_j(v_k) = P(o_t = v_k | q_t = j), 1 ≤ j ≤ N, 1 ≤ k ≤ M
- The initial state distribution π = {π_i}, where π_i = P(q_1 = i), 1 ≤ i ≤ N

For convenience, we usually use the compact notation λ = (A, B, π) to indicate the complete parameter set of an HMM; this requires specification of the two model parameters N and M. (A minimal parameter container along these lines is sketched after the two assumptions below.)

Two Major Assumptions for HMM
- First-order Markov assumption: the state transition depends only on the origin and destination states

  - The state transition probability is time-invariant: a_ij = P(q_t+1 = j | q_t = i), 1 ≤ i, j ≤ N
- Output-independence assumption: the observation depends only on the state that generates it, not on its neighboring observations
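To make λ = (A, B, π) concrete, here is a minimal Python container sketch with basic sanity checks; the class and field names are illustrative, not from the slides.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class HMM:
    """Compact parameter set lambda = (A, B, pi) of a discrete HMM."""
    A: np.ndarray    # N x N state-transition probabilities, A[i, j] = P(q_{t+1}=j | q_t=i)
    B: np.ndarray    # N x M observation probabilities,      B[j, k] = P(o_t = v_k | q_t = j)
    pi: np.ndarray   # length-N initial state distribution,  pi[i]   = P(q_1 = i)

    def validate(self) -> None:
        # Every row of A and B, and pi itself, must be a probability distribution.
        assert np.allclose(self.A.sum(axis=1), 1.0)
        assert np.allclose(self.B.sum(axis=1), 1.0)
        assert np.isclose(self.pi.sum(), 1.0)
```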

Three Basic Problems for HMMs

Given an observation sequence O = (o_1, o_2, ..., o_T) and an HMM λ = (A, B, π):
- Problem 1: How to compute P(O|λ) efficiently? (Evaluation problem)
- Problem 2: How to choose an optimal state sequence Q = (q_1, q_2, ..., q_T) which best explains the observations? (Decoding problem)
- Problem 3: How to adjust the model parameters λ = (A, B, π) to maximize P(O|λ)? (Learning/training problem)

P(up, up, up, up, up | λ)?

Solution to Problem 1

Solution to Problem 1 - Direct Evaluation

Given O and λ, find P(O|λ) = Pr{observing O given λ} by evaluating all possible state sequences Q of length T that could generate the observation sequence O:

P(O|λ) = Σ_Q P(Q|λ) P(O|Q, λ)

P(Q|λ) = π_q1 · a_q1q2 · a_q2q3 · ... · a_q(T-1)qT : the probability of the path Q (by the first-order Markov assumption)

P(O|Q, λ) = b_q1(o_1) · b_q2(o_2) · ... · b_qT(o_T) : the joint output probability along the path Q (by the output-independence assumption)

Solution to Problem 1 - Direct Evaluation (cont'd)

[Figure: trellis of states S1-S3 (vertical) against time 1, 2, 3, ..., T-1, T and observations o_1 ... o_T (horizontal); a shaded node S_j means b_j(o_t) has been computed, and a shaded arc means a_ij has been computed.]

Solution to Problem 1 - Direct Evaluation (cont'd)

A huge computation requirement: O(N^T) (there are N^T possible state sequences), i.e., exponential computational complexity.

A more efficient algorithm can be used to evaluate P(O|λ): the Forward Procedure/Algorithm.

Solution to Problem 1 - The Forward Procedure

Based on the HMM assumptions, the calculation of P(q_t | q_t-1) and P(o_t | q_t) involves only q_t-1, q_t, and o_t, so it is possible to compute the likelihood with a recursion on t.

Forward variable α_t(i) = P(o_1, o_2, ..., o_t, q_t = i | λ): the probability of the joint event that o_1, o_2, ..., o_t are observed and the state at time t is i, given the model λ.


Solution to Problem 1 - The Forward Procedure (cont'd)

α_t(j) = P(o_1, ..., o_t, q_t = j | λ) = [ Σ_i α_t-1(i) · a_ij ] · b_j(o_t)
(the derivation uses the first-order Markov assumption and the output-independence assumption)

Solution to Problem 1 - The Forward Procedure (cont'd)

α_3(2) = P(o_1, o_2, o_3, q_3 = 2 | λ) = [α_2(1)·a_12 + α_2(2)·a_22 + α_2(3)·a_32] · b_2(o_3)

[Figure: trellis illustrating the computation of α_3(2) from α_2(1), α_2(2), α_2(3) via a_12, a_22, a_32 and b_2(o_3); a shaded node S_j means b_j(o_t) has been computed, and a shaded arc means a_ij has been computed.]

Solution to Problem 1 - The Forward Procedure (cont'd)

Algorithm
- Initialization: α_1(i) = π_i · b_i(o_1), 1 ≤ i ≤ N
- Induction: α_t+1(j) = [ Σ_i α_t(i) · a_ij ] · b_j(o_t+1), 1 ≤ t ≤ T-1, 1 ≤ j ≤ N
- Termination: P(O|λ) = Σ_i α_T(i)

Complexity: O(N²T)

Based on the lattice (trellis) structure:
- Computed in a time-synchronous fashion from left to right, where each cell for time t is completely computed before proceeding to time t+1
- All state sequences, regardless of how long they were previously, merge into N nodes (states) at each time instant t

cf. O(N^T) for direct evaluation

Solution to Problem 1 - The Forward Procedure (cont'd)

A three-state hidden Markov model for the Dow Jones Industrial average

(Huang et al., 2001)

Model parameters: π_1 = 0.5, π_2 = 0.2, π_3 = 0.3; a_11 = 0.6, a_12 = 0.2, a_13 = 0.2, a_21 = 0.5, a_22 = 0.3, a_23 = 0.2, a_31 = 0.4, a_32 = 0.1, a_33 = 0.5; b_1(up) = 0.7, b_2(up) = 0.1, b_3(up) = 0.3.

Forward computation for the observation "up, up":
α_1(1) = 0.5 × 0.7 = 0.35
α_1(2) = 0.2 × 0.1 = 0.02
α_1(3) = 0.3 × 0.3 = 0.09
α_2(1) = (0.35×0.6 + 0.02×0.5 + 0.09×0.4) × 0.7
α_2(2) = (0.35×0.2 + 0.02×0.3 + 0.09×0.1) × 0.1
α_2(3) = (0.35×0.2 + 0.02×0.2 + 0.09×0.5) × 0.3
P(up, up | λ) = α_2(1) + α_2(2) + α_2(3)
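The numbers above can be reproduced with a short script. This is a minimal forward-procedure sketch under the parameter values just listed; since the slide only gives b_i(up), the observation probabilities are represented by that single column.

```python
import numpy as np

pi = np.array([0.5, 0.2, 0.3])
A = np.array([[0.6, 0.2, 0.2],
              [0.5, 0.3, 0.2],
              [0.4, 0.1, 0.5]])
b_up = np.array([0.7, 0.1, 0.3])          # b_1(up), b_2(up), b_3(up)

def forward(pi, A, b_of_obs):
    """b_of_obs[t][i] = b_i(o_t); returns (P(O|lambda), alpha matrix)."""
    T, N = len(b_of_obs), len(pi)
    alpha = np.zeros((T, N))
    alpha[0] = pi * b_of_obs[0]                      # initialization
    for t in range(1, T):                            # induction
        alpha[t] = (alpha[t - 1] @ A) * b_of_obs[t]
    return alpha[-1].sum(), alpha                    # termination

p, alpha = forward(pi, A, [b_up, b_up])              # observation: up, up
print(alpha[0])   # [0.35 0.02 0.09]
print(p)          # 0.2234 = alpha_2(1) + alpha_2(2) + alpha_2(3)
```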

Solution to Problem 2

Solution to Problem 2 - The Viterbi Algorithm
- The Viterbi algorithm can be regarded as dynamic programming applied to the HMM, or as a modified forward algorithm
- Instead of summing the probabilities of different paths coming into the same destination state, the Viterbi algorithm picks and remembers the best path
- It finds a single optimal state sequence Q*
- The Viterbi algorithm can also be illustrated in a trellis framework similar to the one used for the forward algorithm

Solution to Problem 2 - The Viterbi Algorithm (cont'd)

[Figure: Viterbi trellis of states S1-S3 against time 1, 2, 3, ..., T-1, T and observations o_1 ... o_T.]

Solution to Problem 2 - The Viterbi Algorithm (cont'd)

Algorithm
- Initialization: δ_1(i) = π_i · b_i(o_1), ψ_1(i) = 0, 1 ≤ i ≤ N
- Induction: δ_t(j) = max_i [δ_t-1(i) · a_ij] · b_j(o_t), ψ_t(j) = argmax_i [δ_t-1(i) · a_ij], 2 ≤ t ≤ T, 1 ≤ j ≤ N
- Termination: P* = max_i δ_T(i), q*_T = argmax_i δ_T(i)
- Backtracking: q*_t = ψ_t+1(q*_t+1), t = T-1, ..., 1; Q* = (q*_1, q*_2, ..., q*_T) is the best state sequence
- Complexity: O(N²T)


Solution to Problem 2 - The Viterbi Algorithm (cont'd)

A three-state hidden Markov model for the Dow Jones Industrial average (Huang et al., 2001)

Model parameters (as before): π_1 = 0.5, π_2 = 0.2, π_3 = 0.3; a_11 = 0.6, a_12 = 0.2, a_13 = 0.2, a_21 = 0.5, a_22 = 0.3, a_23 = 0.2, a_31 = 0.4, a_32 = 0.1, a_33 = 0.5; b_1(up) = 0.7, b_2(up) = 0.1, b_3(up) = 0.3.

Viterbi computation for the observation "up, up":
δ_1(1) = 0.5 × 0.7 = 0.35, δ_1(2) = 0.2 × 0.1 = 0.02, δ_1(3) = 0.3 × 0.3 = 0.09
δ_2(1) = max(0.35×0.6, 0.02×0.5, 0.09×0.4) × 0.7 = 0.35×0.6×0.7 = 0.147, ψ_2(1) = 1
δ_2(2) = max(0.35×0.2, 0.02×0.3, 0.09×0.1) × 0.1 = 0.35×0.2×0.1 = 0.007, ψ_2(2) = 1
δ_2(3) = max(0.35×0.2, 0.02×0.2, 0.09×0.5) × 0.3 = 0.35×0.2×0.3 = 0.021, ψ_2(3) = 1

The most likely state sequence that generates "up, up" is 1 1.
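The same best-path computation can be scripted. This is a minimal Viterbi sketch using the parameter values above (states are reported 1-based to match the slide); it is an illustration, not the lecture's own code.

```python
import numpy as np

pi = np.array([0.5, 0.2, 0.3])
A = np.array([[0.6, 0.2, 0.2],
              [0.5, 0.3, 0.2],
              [0.4, 0.1, 0.5]])
b_up = np.array([0.7, 0.1, 0.3])

def viterbi(pi, A, b_of_obs):
    """Return (best-path probability, best state sequence, 1-based states)."""
    T, N = len(b_of_obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * b_of_obs[0]                    # initialization
    for t in range(1, T):                          # induction
        scores = delta[t - 1][:, None] * A         # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * b_of_obs[t]
    path = [int(delta[-1].argmax())]               # termination
    for t in range(T - 1, 0, -1):                  # backtracking
        path.append(int(psi[t][path[-1]]))
    path.reverse()
    return delta[-1].max(), [s + 1 for s in path]

p_star, q_star = viterbi(pi, A, [b_up, b_up])
print(p_star)   # 0.147
print(q_star)   # [1, 1] -- the most likely state sequence for "up, up"
```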

Some Examples

Isolated Digit Recognition

[Figure: trellis over time 1 ... T in which each digit HMM (a 3-state model with states S1-S3) is matched against the whole observation sequence o_1 ... o_T.]


Continuous Digit Recognition

[Figure: trellis over time 1 ... T in which the digit HMMs (states S1-S3 and S4-S6) are connected, allowing transitions between digit models.]



Continuous Digit Recognition (cont'd)

[Figure: a 9-frame trellis over the two connected digit models (states S1-S3 and S4-S6), with the best state sequence traced through it.]

CpG Islands

Two Questions
- Q1: Given a short sequence, does it come from a CpG island?
- Q2: Given a long sequence, how would we find the CpG islands in it?

CpG Islands

Answer to Q1:
- Given a sequence x, a probabilistic model M1 of CpG islands, and a probabilistic model M2 of non-CpG-island regions
- Compute p1 = P(x|M1) and p2 = P(x|M2)
- If p1 > p2, then x comes from a CpG island (CpG+)
- If p2 > p1, then x does not come from a CpG island (CpG-)

[Figure: a 4-state Markov chain with states S1:A, S2:C, S3:T, S4:G; one chain is trained on CpG+ regions and one on CpG- regions, with the transition tables below.]

CpG+ transition probabilities (row = current base, column = next base):

      A      C      G      T
A   0.180  0.274  0.426  0.120
C   0.171  0.368  0.274  0.188
G   0.161  0.339  0.375  0.125
T   0.079  0.355  0.384  0.182

CpG- transition probabilities:

      A      C      G      T
A   0.300  0.205  0.285  0.210
C   0.322  0.298  0.078  0.302
G   0.248  0.246  0.298  0.208
T   0.177  0.239  0.292  0.292

Note the large C→G transition probability in the CpG+ model vs. the small C→G transition probability in the CpG- model (a sketch of the decision rule follows below).
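A minimal sketch of this decision rule, using the two transition tables above. It compares log-likelihoods of the transitions only (the slide does not give initial-base probabilities), and the test sequences are made-up examples.

```python
import math

BASES = "ACGT"   # rows = current base, columns = next base
CPG_PLUS = [[0.180, 0.274, 0.426, 0.120],
            [0.171, 0.368, 0.274, 0.188],
            [0.161, 0.339, 0.375, 0.125],
            [0.079, 0.355, 0.384, 0.182]]
CPG_MINUS = [[0.300, 0.205, 0.285, 0.210],
             [0.322, 0.298, 0.078, 0.302],
             [0.248, 0.246, 0.298, 0.208],
             [0.177, 0.239, 0.292, 0.292]]

def log_likelihood(x, table):
    """Sum of log transition probabilities along sequence x (initial base ignored)."""
    idx = {b: i for i, b in enumerate(BASES)}
    return sum(math.log(table[idx[a]][idx[b]]) for a, b in zip(x[:-1], x[1:]))

def classify(x):
    # Decision rule from the slide: pick the model with the higher likelihood.
    score = log_likelihood(x, CPG_PLUS) - log_likelihood(x, CPG_MINUS)
    return ("CpG+" if score > 0 else "CpG-"), score

print(classify("CGCGCGCG"))   # CpG+ (many C->G transitions)
print(classify("ATATATAT"))   # CpG-
```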

CpG Islands

Answer to Q2: use a hidden Markov model.

[Figure: a two-state HMM in which one state models CpG+ regions (emissions A: 0.2, C: 0.3, G: 0.3, T: 0.2) and the other models CpG- regions (emissions A: 0.3, C: 0.2, G: 0.2, T: 0.3), with transition probabilities p11 = 0.99999, p12 = 0.00001, p21 = 0.0001, p22 = 0.9999. For an observable sequence such as A C T C G A G T A, the hidden state sequence (S1/S2) marks which positions lie in a CpG island.]

A Toy Example: 5' Splice Site Recognition
- A 5' splice site indicates the switch from an exon to an intron
- Assumptions:
  - Uniform base composition on average in exons (25% each base)
  - Introns are A/T rich (40% each A and T, 10% each C and G)
  - The 5'SS consensus nucleotide is almost always a G (say, 95% G and 5% A)

From "What is a hidden Markov model?", by Sean R. Eddy

A Toy Example: 5' Splice Site Recognition

[Figure: the toy splice-site HMM from Eddy's article, with an exon state, a 5' splice-site state, and an intron state using the emission probabilities listed above.]

Solution to Problem 3

Solution to Problem 3 - Maximum Likelihood Estimation of Model Parameters
- How to adjust (re-estimate) the model parameters λ = (A, B, π) to maximize P(O|λ)?
- This is the most difficult of the three problems: there is no known analytical method that maximizes the joint probability of the training data in closed form, because the data are incomplete (the state sequence is hidden)
- The problem can be solved by the iterative Baum-Welch algorithm, also known as the forward-backward algorithm; the EM (Expectation-Maximization) algorithm is perfectly suited to this problem
- Alternatively, it can be solved by the iterative segmental K-means algorithm, in which the model parameters are adjusted to maximize P(O, Q*|λ), where Q* is the state sequence given by the Viterbi algorithm; this provides a good initialization for Baum-Welch training

Solution to Problem 3 - The Segmental K-means Algorithm

Assume that we have a training set of observations and an initial estimate of the model parameters.
- Step 1: Segment the training data. The set of training observation sequences is segmented into states, based on the current model, by the Viterbi algorithm.
- Step 2: Re-estimate the model parameters.

- Step 3: Evaluate the model. If the difference between the new and current model scores exceeds a threshold, go back to Step 1; otherwise, return. (A sketch of the counting in Step 2 follows below.)
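The counting in Step 2 can be made concrete with a small sketch. It assumes the training sequences have already been segmented into states by Step 1 (or are labeled, as in the question on the example that follows); the toy data and function names are illustrative, not from the lecture.

```python
from collections import Counter

def reestimate(segmented):
    """segmented: list of (observations, states) pairs of equal length."""
    init, trans, emit, state_count = Counter(), Counter(), Counter(), Counter()
    for obs, states in segmented:
        init[states[0]] += 1
        for s, o in zip(states, obs):
            emit[(s, o)] += 1
            state_count[s] += 1
        for s, s_next in zip(states[:-1], states[1:]):
            trans[(s, s_next)] += 1
    pi = {s: c / len(segmented) for s, c in init.items()}          # relative frequency at t = 1
    out_total = Counter()
    for (s, _t), c in trans.items():
        out_total[s] += c
    A = {(s, t): c / out_total[s] for (s, t), c in trans.items()}  # transition counts, normalized
    B = {(s, o): c / state_count[s] for (s, o), c in emit.items()} # emission counts, normalized
    return pi, A, B

# Toy usage with a hypothetical segmentation:
data = [("AAB", ["s1", "s1", "s2"]), ("ABB", ["s1", "s2", "s2"])]
pi, A, B = reestimate(data)
print(pi)   # {'s1': 1.0}
print(A)    # {('s1', 's1'): 0.333..., ('s1', 's2'): 0.666..., ('s2', 's2'): 1.0}
print(B)    # {('s1', 'A'): 1.0, ('s2', 'B'): 1.0}
```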

Solution to Problem 3 - The Segmental K-means Algorithm (cont'd)

Example: 3 states and 2 codewords (A, B)

[Figure: ten training observation sequences O1-O10 (symbols A/B over 10 frames) segmented into states s1-s3 by the Viterbi algorithm.]

Re-estimated parameters:
π_1 = 1, π_2 = π_3 = 0
a_11 = 3/4, a_12 = 1/4; a_22 = 2/3, a_23 = 1/3; a_33 = 1
b_1(A) = 3/4, b_1(B) = 1/4; b_2(A) = 1/3, b_2(B) = 2/3; b_3(A) = 2/3, b_3(B) = 1/3

What if the training data is labeled?

S2S3S1o1S2S3S1S2S3S1S2S3S1o2o3oT1 2 3 T-1 T TimeS2S3S3oT-1S2S3S1State

3(1)b1(o3)a314344Solution to Problem 3 The Backward Procedure (contd)Algorithm

cf.4445Solution to Problem 3 The Forward-Backward AlgorithmRelation between the forward and backward variables

(Huang et al., 2001)

4546Solution to Problem 3 The Forward-Backward Algorithm (contd)

Solution to Problem 3 - The Forward-Backward Algorithm (cont'd)

t( i, j )=P(qt = i, qt+1 = j | O, ) Probability of being in state i at time t and state j at time t+1, given O and

4748Solution to Problem 3 The Intuitive View (contd)P(q3 = 1, O | )=3(1)*3(1)

o1s2s1s3s2s1s3S2S3S1Stateo2o3oT1 2 3 4 T-1 T TimeoT-1S2S3S1S2S3S1S2S3S1S2S3S1S2S3S13(1)3(1)4849Solution to Problem 3 The Intuitive View (contd)P(q3 = 1, q4 = 3, O | )=3(1)*a13*b3(o4)*4(3)o1s2s1s3s2s1s3S2S3S1Stateo2o3oT1 2 3 4 T-1 T TimeoT-1S2S3S1S2S3S1S3S2S1S2S3S1S2S3S13(1)4(3)a13b3(o4)4950Solution to Problem 3 The Intuitive View (contd)t( i, j )=P(qt = i, qt+1 = j | O, )

t(i)= P(qt = i | O, )

5051Solution to Problem 3 The Intuitive View (contd)Re-estimation formulae for , A, and B are

51