
1

Hidden Markov Models

Hauptseminar Machine Learning

Speaker: Nikolas Dörfler

18.11.2003

2

Overview

Markov Models

Hidden Markov Models

Types of Hidden Markov Models

Applications using HMMs

Three central problems:
- Evaluation: Forward algorithm
- Decoding: Viterbi algorithm
- Learning: Forward-Backward algorithm
- Backward algorithm

Speech recognition with HMMs:
- Isolated word recognizer
- Measured performance

Conclusion

3

Markov Models

- A system of n states: at time t the system is in state ω(t)

- Changes of state are nondeterministic but depend on the previous states

- Transition probabilities: often shown as a state transition matrix

- In a first-order, ..., n-th-order model, changes depend on the previous 1, ..., n states

- Vector of initial probabilities π

4

Example: Weather system

Three states: sunny, cloudy, rainy

Transition matrix A

$$A = \begin{pmatrix} 0.5 & 0.375 & 0.125 \\ 0.25 & 0.125 & 0.625 \\ 0.25 & 0.375 & 0.375 \end{pmatrix}$$

State vector $p = (1, 0, 0)^T$
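As a concrete illustration (this sketch is not part of the original slides; it assumes Python with NumPy), the chain above can be sampled directly from A and the initial state vector:

```python
import numpy as np

# Weather Markov chain from the slide: rows of A are P(next state | current state).
A = np.array([[0.5,  0.375, 0.125],   # sunny  -> sunny, cloudy, rainy
              [0.25, 0.125, 0.625],   # cloudy -> ...
              [0.25, 0.375, 0.375]])  # rainy  -> ...
p = np.array([1.0, 0.0, 0.0])          # start in "sunny" with certainty
states = ["sunny", "cloudy", "rainy"]

rng = np.random.default_rng(0)
seq = [rng.choice(3, p=p)]              # draw the initial state from p
for _ in range(6):                       # then follow the transition matrix
    seq.append(rng.choice(3, p=A[seq[-1]]))
print([states[s] for s in seq])
```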

5

Hidden Markov Models

- n invisible states

- every state emits at time t a visible symbol/state v(t)

- The system generates a sequence of visible symbols $V^T = \{v(1), v(2), v(3), \ldots, v(T)\}$

- Probability of emitting $v(t) = k$ in state $\omega(t) = j$: $P(v(t) \mid \omega(t) = j) = b_j(k)$ ⇒ confusion matrix B

- Transition probability: $P(\omega(t+1) = j \mid \omega(t) = i) = a_{ij}$ ⇒ transition matrix A

- Normalization conditions: $\sum_j a_{ij} = 1$ for all $i$, and $\sum_k b_j(k) = 1$ for all $j$

- Vector of initial probabilities π

6

Example of a hidden Markov model:

Indirect observation of the weather using a piece of seaweed

7

Transition matrix A (rows: today's weather, columns: tomorrow's weather):

          sun     clouds   rain
sun       0.5     0.375    0.125
clouds    0.25    0.125    0.625
rain      0.25    0.375    0.375

Confusion matrix B (rows: hidden weather state, columns: observed seaweed state):

          Dry     Dryish   Damp    Soggy
sun       0.6     0.2      0.15    0.05
clouds    0.25    0.25     0.25    0.25
rain      0.05    0.1      0.35    0.5

State vector $p = (1, 0, 0)^T$
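A small sketch (not from the slides, assuming Python/NumPy) that generates a visible seaweed sequence from the hidden weather states using exactly these A, B and p:

```python
import numpy as np

# Hidden weather chain (A) plus emission of observable seaweed states (B).
A = np.array([[0.5,  0.375, 0.125],
              [0.25, 0.125, 0.625],
              [0.25, 0.375, 0.375]])
B = np.array([[0.60, 0.20, 0.15, 0.05],   # sun    -> Dry, Dryish, Damp, Soggy
              [0.25, 0.25, 0.25, 0.25],   # clouds -> ...
              [0.05, 0.10, 0.35, 0.50]])  # rain   -> ...
p = np.array([1.0, 0.0, 0.0])
hidden, visible = ["sun", "clouds", "rain"], ["Dry", "Dryish", "Damp", "Soggy"]

rng = np.random.default_rng(1)
state = rng.choice(3, p=p)                 # initial hidden state
for t in range(5):
    symbol = rng.choice(4, p=B[state])     # emit a visible symbol with probability b_state(k)
    print(t, hidden[state], visible[symbol])
    state = rng.choice(3, p=A[state])      # hidden transition
```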

8

Types of Hidden Markov Models

Ergodic

All states can be reached within one step from everywhere

Transition matrix A entries are never zero

$$A = \begin{pmatrix} a_{11} & a_{12} & a_{13} & a_{14} \\ a_{21} & a_{22} & a_{23} & a_{24} \\ a_{31} & a_{32} & a_{33} & a_{34} \\ a_{41} & a_{42} & a_{43} & a_{44} \end{pmatrix}$$

9

Left-right Models

$$A = \begin{pmatrix} a_{11} & a_{12} & a_{13} & 0 \\ 0 & a_{22} & a_{23} & a_{24} \\ 0 & 0 & a_{33} & a_{34} \\ 0 & 0 & 0 & a_{44} \end{pmatrix}$$

No backward transitions:

$a_{ij} = 0$ for $j < i$

Additional constraints:

$a_{ij} = 0$ for $j > i + \Delta$

(e.g. $\Delta = 2$ ⇒ no jumps of more than 2 states)
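A brief sketch (my own helper, not from the slides) of how a left-right transition matrix with such a band constraint could be generated and normalized:

```python
import numpy as np

def left_right_matrix(n_states, delta=2, seed=0):
    """Random left-right transition matrix: a_ij = 0 for j < i and for j > i + delta."""
    rng = np.random.default_rng(seed)
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        j_max = min(i + delta, n_states - 1)
        row = rng.random(j_max - i + 1)      # only states i .. i+delta are reachable
        A[i, i:j_max + 1] = row / row.sum()  # each row must sum to 1
    return A

print(left_right_matrix(4))  # zeros below the diagonal and outside the band
```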

10

Applications using HMM

Speech recognition

Language modelling

Handwriting recognition

Protein sequence analysis

Financial/Economic Models

11

Three central problems

1. Evaluation: Given an HMM (A, B, π), find the probability that a sequence of visible states $V^T$ was generated by that model.

2. Decoding: Given an HMM (A, B, π) and a sequence of observations, find the most probable sequence of hidden states that led to those observations.

3. Learning: Given the number of visible and hidden states and some sequences of training observations, find the parameters $a_{ij}$ and $b_j(k)$.

12

Evaluation:

$$P(V^T) = \sum_{r=1}^{r_{\max}} P(V^T \mid \omega_r^T)\, P(\omega_r^T), \qquad r_{\max} = N^T$$

Probability that the system M produces the sequence $V^T$, summed over all possible hidden-state sequences $\omega_r^T = \{\omega(1), \omega(2), \ldots, \omega(T)\}$.

That means: take all sequences of hidden states, calculate the probability that they generated $V^T$, and add them up.

$$P(V^T) = \sum_{r=1}^{N^T} \prod_{t=1}^{T} P(v(t) \mid \omega(t))\, P(\omega(t) \mid \omega(t-1))$$

N is the number of states, T is the number of visible symbols / steps to go.

Problem: the complexity is $O(N^T \cdot T)$.
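To make the complexity problem concrete, here is a naive evaluation sketch (function name and layout are my own, assuming Python/NumPy) that literally sums over all $N^T$ hidden paths; the forward algorithm on the next slide avoids this blow-up:

```python
import itertools
import numpy as np

def evaluate_brute_force(A, B, pi, obs):
    """P(V^T) by enumerating every hidden-state path -- O(N^T * T), tiny examples only."""
    N = A.shape[0]
    total = 0.0
    for path in itertools.product(range(N), repeat=len(obs)):
        prob = pi[path[0]] * B[path[0], obs[0]]              # P(w(1)) * P(v(1)|w(1))
        for t in range(1, len(obs)):
            prob *= A[path[t - 1], path[t]] * B[path[t], obs[t]]
        total += prob
    return total
```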

13

The Forward algorithm:

Calculate the probability recursively:

$$\alpha_j(t) = \begin{cases} \pi_j\, b_j(v(0)) & t = 0 \\ \left[\sum_{i=1}^{N} \alpha_i(t-1)\, a_{ij}\right] b_j(v(t)) & \text{else} \end{cases}$$

$b_j(v(t))$ is the probability of emitting the symbol selected by $v(t)$ in state $j$.

$\alpha_j(t)$ is the probability that the model is in state $j$ and has produced the first $t$ elements of $V^T$.

14

Forward algorithm

initialize: $\alpha_j(0) = \pi(j)\, b_j(v(0))$, $t = 0$; given $a_{ij}$, $b_j$, visible sequence $V^T$
for $t \leftarrow t + 1$
    $\alpha_j(t) \leftarrow \left[\sum_{i=1}^{N} \alpha_i(t-1)\, a_{ij}\right] b_j(v(t))$  for all $j \le N$
until $t = T$
return $P(V^T) = \alpha_{final}(T)$
end

Complexity of this algorithm: $O(N^2 T)$
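A compact Python/NumPy sketch of this recursion (the function name and array layout are my choices; the slide's $\alpha_{final}(T)$ becomes a sum over all states, which is the general case when no single final state is designated):

```python
import numpy as np

def forward(A, B, pi, obs):
    """alpha[t, j] = P(v(0..t), state j at time t); returns alpha and P(V^T)."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                          # alpha_j(0) = pi_j b_j(v(0))
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]      # [sum_i alpha_i(t-1) a_ij] b_j(v(t))
    return alpha, alpha[-1].sum()                         # P(V^T), cost O(N^2 T)

# Example with the weather HMM from slide 7 (symbols: 0=Dry, 1=Dryish, 2=Damp, 3=Soggy):
# alpha, p_obs = forward(A, B, p, [0, 2, 3])
```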

15

Example: Forward algorithm

[Trellis diagram: states 1-3 unrolled over the time steps t = 1, 2, 3]

16

Link to Java applet example

17

Backward algorithm

initialize: $\beta_i(T) = 1$, $t = T$; given $a_{ij}$, $b_j(k)$, visible sequence $V^T$
for $t \leftarrow t - 1$
    $\beta_i(t) \leftarrow \sum_{j=1}^{N} a_{ij}\, b_j(v(t+1))\, \beta_j(t+1)$  for all $i \le N$
until $t = 1$
return $P(V^T) = \beta_i(0)$
end

$\beta_i(t)$ is the probability that the model is in state $i$ at time $t$ and will produce the last $T - t$ elements of $V^T$.
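The matching backward sketch (again my own naming, assuming Python/NumPy); for a general model $P(V^T)$ is recovered by combining $\beta_i(0)$ with $\pi_i$ and $b_i(v(0))$, as the last line shows:

```python
import numpy as np

def backward(A, B, pi, obs):
    """beta[t, i] = P(v(t+1..T-1) | state i at time t); returns beta and P(V^T)."""
    N, T = A.shape[0], len(obs)
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                          # beta_i(T) = 1
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])      # sum_j a_ij b_j(v(t+1)) beta_j(t+1)
    return beta, (pi * B[:, obs[0]] * beta[0]).sum()        # same P(V^T) as the forward pass
```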

18

Decoding (Viterbi Algorithm)

Finds the sequence of hidden states in an N-state model M that most probably generated a sequence $V^T = \{v(0), v(1), \ldots, v(T)\}$ of visible states.

Can be calculated recursively with the Viterbi algorithm:

$\delta_i(t) = \max P(\omega(1), \ldots, \omega(t) = i,\; v(1), \ldots, v(t) \mid M)$ over all paths $\omega$

recursively: $\delta_j(t) = \max_i \left[\delta_i(t-1)\, a_{ij}\right] b_j(v(t))$

with $\delta_j(0) = \pi(j)\, b_j(v(0))$

$\delta_i(t)$ is the maximal probability along a path $\omega(1), \ldots, \omega(t) = i$ to generate the sequence $v(1), \ldots, v(t)$ (partial best path).

To keep track of the path maximizing $\delta_i(t)$, the array $\psi_j(t)$ is used.

19

initialize: $\delta_i(0) = \pi(i)\, b_i(v(0))$, $\psi_i(0) = 0$, $t = 0$ for all $i$

Sequence calculation:
for $t \leftarrow t + 1$
    for all states $j$:
        $\delta_j(t) = \max_{1 \le i \le N} \left[\delta_i(t-1)\, a_{ij}\right] b_j(v(t))$
        $\psi_j(t) = \arg\max_{1 \le i \le N} \left[\delta_i(t-1)\, a_{ij}\right]$
until $t = T$

Sequence termination:
$\omega(T) = \arg\max_{1 \le i \le N} \left[\delta_i(T)\right]$

Sequence backtracking:
for $t = T-1$, $t \leftarrow t - 1$
    $\omega(t) = \psi_{\omega(t+1)}(t+1)$
until $t = 0$
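A Python/NumPy sketch of the recursion, termination and backtracking steps above (array layout and names are mine):

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Most probable hidden-state path for the observation sequence, plus its probability."""
    N, T = A.shape[0], len(obs)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                      # delta_i(0) = pi_i b_i(v(0))
    for t in range(1, T):
        scores = delta[t - 1][:, None] * A            # delta_i(t-1) a_ij for every (i, j)
        psi[t] = scores.argmax(axis=0)                # best predecessor of each state j
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    path = [int(delta[-1].argmax())]                  # termination: best final state
    for t in range(T - 1, 0, -1):                     # backtrack through psi
        path.append(int(psi[t, path[-1]]))
    return path[::-1], delta[-1].max()
```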

20

Example:

21

Link to Java applet example

22

Positive aspects of the Viterbi Algorithm

Reduction of computational complexity by recursion

Takes the entire context into account to find an optimal solution, therefore lower error rates with noisy data

23

Learning (Forward Backward algorithm)

Determine the N-state model parameters $a_{ij}$ and $b_j(k)$ based on a training sequence, by iteratively calculating better values.

Definition:

$$\xi_{ij}(t) = P(\omega_i(t), \omega_j(t+1) \mid V^T, M)$$

Probability of a transition from $\omega_i$ at time $t$ to $\omega_j$ at time $t+1$, given that the model M generated $V^T$ by any path.

$$\xi_{ij}(t) = \frac{\alpha_i(t)\, a_{ij}\, b_j(v(t+1))\, \beta_j(t+1)}{P(V^T \mid M)} = \frac{\alpha_i(t)\, a_{ij}\, b_j(v(t+1))\, \beta_j(t+1)}{\sum_{k=1}^{N} \sum_{l=1}^{N} \alpha_k(t)\, a_{kl}\, b_l(v(t+1))\, \beta_l(t+1)}$$

$$\gamma_i(t) = \sum_{j=1}^{N} \xi_{ij}(t)$$

Probability of being in state $\omega_i$ at time $t$.

24

25

$$a'_{ij} = \frac{\sum_{t=1}^{T-1} \xi_{ij}(t)}{\sum_{t=1}^{T-1} \gamma_i(t)}$$

which gives us better values for $a_{ij}$.

$\sum_{t=1}^{T-1} \xi_{ij}(t)$: expected number of transitions from $\omega_i$ to $\omega_j$

$\sum_{t=1}^{T-1} \gamma_i(t)$: expected number of times in $\omega_i$, i.e. expected number of transitions from $\omega_i$ to any other state

Written out with $\alpha$ and $\beta$:

$$a'_{ij} = \frac{\sum_{t=1}^{T-1} \alpha_i(t)\, a_{ij}\, b_j(v(t+1))\, \beta_j(t+1)}{\sum_{t=1}^{T-1} \sum_{k=1}^{N} \alpha_i(t)\, a_{ik}\, b_k(v(t+1))\, \beta_k(t+1)}$$

26

Calculation of the $b_j'$:

$$b'_j(v_k) = \frac{\sum_{t=1,\ v(t)=v_k}^{T} \gamma_j(t)}{\sum_{t=1}^{T} \gamma_j(t)}$$

Numerator: expected number of times in $\omega_j$ emitting $v_k$

Denominator: expected number of times in $\omega_j$

$\pi'(i) = \gamma_i(0)$: probability of being in state $i$ at time 0
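One full re-estimation step, combining the $\xi$ and $\gamma$ definitions with the update formulas for $a'_{ij}$, $b'_j(v_k)$ and $\pi'$. This is a sketch under the assumption that the forward() and backward() helpers sketched earlier are available; all names are my own:

```python
import numpy as np

def reestimate(A, B, pi, obs):
    """One Baum-Welch (forward-backward) update of A, B and pi from one training sequence."""
    alpha, p_obs = forward(A, B, pi, obs)
    beta, _ = backward(A, B, pi, obs)
    obs = np.asarray(obs)
    # xi[t, i, j] = P(state i at t, state j at t+1 | V^T, M)
    xi = (alpha[:-1, :, None] * A[None, :, :]
          * B[:, obs[1:]].T[:, None, :] * beta[1:, None, :]) / p_obs
    gamma = alpha * beta / p_obs                              # gamma_i(t) = P(state i at t | V^T, M)
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]  # a'_ij
    B_new = np.stack([gamma[obs == k].sum(axis=0) for k in range(B.shape[1])],
                     axis=1) / gamma.sum(axis=0)[:, None]     # b'_j(v_k)
    pi_new = gamma[0]                                         # pi'(i) = gamma_i(0)
    return A_new, B_new, pi_new
```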

27

Positive aspect

- arbitrary precision of estimation

Problems

How to choose the initial parameters $a_{ij}$ and $b_j$?

- For $\pi$ and $a_{ij}$: either random or uniform values
- For the $b_j$: better initial estimates are very useful for fast convergence; estimate these parameters by:
  - Manual segmentation
  - Maximum likelihood segmentation

28

What is the appropriate model?

Must be decided based on the kind of signal being modeled.

In speech recognition, often a left-right model is used to model the advancing of time,

e.g. every syllable gets a state, plus a final silent state.

Scaling

Problem: the $\alpha_i$ and $\beta_i$ are always smaller than 1, so the calculations converge towards zero.

This exceeds the precision range even in double precision.

Solution: multiply the $\alpha_i$ and $\beta_i$ by a scaling coefficient $c_t$ that is independent of $i$ but depends on $t$.

For example:

$$c_t = \frac{1}{\sum_{i=1}^{N} \alpha_i(t)}$$

These factors cancel out in the calculation of the $a_{ij}'$ and $b_j'$.
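A sketch of a scaled forward pass under this scheme (my own naming, assuming Python/NumPy): each $\alpha(t)$ is renormalized with $c_t$, and $\log P(V^T)$ is recovered from the scaling coefficients instead of an underflowing product:

```python
import numpy as np

def forward_scaled(A, B, pi, obs):
    """Forward pass with per-step scaling; returns scaled alphas and log P(V^T)."""
    N, T = A.shape[0], len(obs)
    alpha = np.zeros((T, N))
    log_p = 0.0
    for t in range(T):
        alpha[t] = pi * B[:, obs[0]] if t == 0 else (alpha[t - 1] @ A) * B[:, obs[t]]
        c_t = 1.0 / alpha[t].sum()      # scaling coefficient: depends on t, not on i
        alpha[t] *= c_t                 # scaled alphas stay in a numerically safe range
        log_p -= np.log(c_t)            # log P(V^T) = -sum_t log c_t
    return alpha, log_p
```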

30

Speech recognizers using HMMs

Isolated word recognizer

Build an HMM for each word in a vocabulary and calculate the (A, B, π) parameters (train the model).

For each word to be recognized:

- Feature analysis (vector quantization): generates observation vectors from the signal

- Run Viterbi on all models to find the most probable model for that observation sequence
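A minimal sketch of the recognition step (names are mine, reusing the viterbi() helper sketched earlier): every word model is scored on the quantized observation sequence and the best-scoring word wins. The forward probability could be used in place of the Viterbi score.

```python
def recognize(word_models, obs):
    """word_models: {word: (A, B, pi)}; obs: quantized observation sequence."""
    scores = {}
    for word, (A, B, pi) in word_models.items():
        _, score = viterbi(A, B, pi, obs)   # likelihood of the best hidden-state path
        scores[word] = score
    return max(scores, key=scores.get), scores
```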

31

Block diagram of an isolated word recognizer

32

(a) log energy and (b) state assignment for the word 'six'

33

Measured performance of an isolated word recognizer

100 digits by 100 talkers (50 female / 50 male)

Original training: the original training set was used

TS2: the same speakers as in the training set

TS3: a completely new set of speakers

TS4: another new set of speakers

34

Conclusion

There are various processes where the real activity is invisible and only a generated pattern can be observed. These can be modeled by HMMs.

Advantages:

- Acceptable calculation complexity

- Low error rates

Limitations of HMMs:

- Need enough training data

- The Markov assumption that each state only depends on the previous state is not always true

HMMs are the predominant method for current automatic speech recognition and play a great role in other recognition systems.

35

Bibliography

Rabiner, L.R.: A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition. Proceedings of the IEEE, Vol. 77, Iss. 2, Feb 1989, pp. 257-286.

Richard O. Duda, Peter E. Hart, David G. Stork: Pattern Classification, Chapter 3.10.

Eric Keller (Editor): Fundamentals of Speech Synthesis and Speech Recognition, Chapter 8 by Kari Torkkola.

Used websites

http://www.comp.leeds.ac.uk/roger/HiddenMarkovModels/html_dev/main.html — Introduction to Hidden Markov Models, University of Leeds

http://www.cnel.ufl.edu/~yadu/report1.html — Speech Recognition Using Hidden Markov Models, Yadunandana Nagaraja Rao, University of Florida