Ch 9. Markov Models
Kyung-Soo Han, Natural Language Processing Lab., Korea University
2000. 3. 25.
Contents
Markov models
Hidden Markov models
The three fundamental questions for HMMs
Finding the probability of an observation
Finding the best state sequence
Parameter estimation
HMMs: implementation, properties, and variants
Markov Models
$X = (X_1, \ldots, X_T)$: a sequence of random variables taking values in some finite set $S = \{s_1, \ldots, s_N\}$, the state space.
Markov properties
Limited horizon: $P(X_{t+1} = s_k \mid X_1, \ldots, X_t) = P(X_{t+1} = s_k \mid X_t)$
Time invariant (stationary): $P(X_{t+1} = s_k \mid X_t) = P(X_2 = s_k \mid X_1)$
If X has these Markov properties, X is said to be a Markov chain.
Stochastic transition matrix $A$: $a_{ij} = P(X_{t+1} = s_j \mid X_t = s_i)$, where $a_{ij} \ge 0$ for all $i, j$ and $\sum_{j=1}^{N} a_{ij} = 1$ for all $i$.
Probabilities of different initial states: $\pi_i = P(X_1 = s_i)$, where $\sum_{i=1}^{N} \pi_i = 1$.
Markov Models (Cont.)
Markov models can be used whenever one wants to model the probability of a linear sequence of events
word n-gram models, modeling valid phone sequences in speech recognition, sequences of speech acts in dialog systems
A Markov model can be thought of as a probabilistic finite-state automaton.
Probability of a sequence of states
mth order Markov model
m: # of previous states that we are using to predict the next state
n-gram model is equivalent to an (n-1)th order Markov model.
$P(X_1, \ldots, X_T) = P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_1, X_2) \cdots P(X_T \mid X_1, \ldots, X_{T-1})$
$= P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_2) \cdots P(X_T \mid X_{T-1})$
$= \pi_{X_1} \prod_{t=1}^{T-1} a_{X_t X_{t+1}}$
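As a small illustration of this product (not from the original slides), the sketch below scores a state sequence under a hypothetical two-state chain; the state names, $\pi$, and $A$ values are made up for the example.

```python
# Minimal sketch (assumed Python): probability of a visible Markov chain
# state sequence as pi[X_1] * prod_t a[X_t -> X_{t+1}].

def sequence_probability(pi, a, states):
    """pi: state -> initial prob; a: (state, state) -> transition prob."""
    p = pi[states[0]]
    for prev, nxt in zip(states, states[1:]):
        p *= a[(prev, nxt)]
    return p

# Hypothetical two-state chain, for illustration only.
pi = {"x": 1.0, "y": 0.0}
a = {("x", "x"): 0.4, ("x", "y"): 0.6,
     ("y", "x"): 0.9, ("y", "y"): 0.1}
print(sequence_probability(pi, a, ["x", "y", "x"]))  # 1.0 * 0.6 * 0.9 = 0.54
```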
Markov Models (Cont.)
P.319 Figure 9.1
$P(t, i, p) = P(X_1 = t)\, P(X_2 = i \mid X_1 = t)\, P(X_3 = p \mid X_2 = i) = 1.0 \times 0.3 \times 0.6 = 0.18$
Hidden Markov Models
In an HMM, you don't know the state sequence that the model passes through, but only some probabilistic function of it.
Emission probability for the observations: $b_{ijk} = P(O_t = k \mid X_t = s_i, X_{t+1} = s_j)$
Example: the crazy soft drink machine
Q: What is the probability of seeing the output sequence {lem, ice_t} if the machine always starts off in the cola-preferring state?
A: Consider all paths that might be taken through the HMM, and then sum over them:
$P(\text{lem, ice\_t}) = 0.3 \times 0.7 \times 0.1 \times 0.7 + 0.3 \times 0.7 \times 0.1 \times 0.3 + 0.3 \times 0.3 \times 0.7 \times 0.5 + 0.3 \times 0.3 \times 0.7 \times 0.5 = 0.084$
The Crazy Soft Drink Machine
State transition diagram: start → CP; CP → CP: 0.7, CP → IP: 0.3, IP → CP: 0.5, IP → IP: 0.5
Output probabilities:
      cola   iced tea (ice_t)   lemonade (lem)
CP    0.6    0.1                0.3
IP    0.1    0.7                0.2
Hidden Markov Models (Cont.)
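As a sanity check on the 0.084 path sum above, here is a brute-force sketch (assumed Python, not part of the original slides) that enumerates every hidden state path of the machine, emitting from the current state and then transitioning; the trailing transition after the last symbol mirrors the four-term sum and drops out of the total.

```python
# Brute-force path sum for P(lem, ice_t) when the machine starts in CP.
# Assumed sketch: at each time step the current state emits a symbol and the
# machine then moves on; the final transition sums to 1 over the last state,
# so it does not change the total.
from itertools import product

states = ["CP", "IP"]
trans = {("CP", "CP"): 0.7, ("CP", "IP"): 0.3,
         ("IP", "CP"): 0.5, ("IP", "IP"): 0.5}
emit = {"CP": {"cola": 0.6, "ice_t": 0.1, "lem": 0.3},
        "IP": {"cola": 0.1, "ice_t": 0.7, "lem": 0.2}}

obs = ["lem", "ice_t"]
total = 0.0
for tail in product(states, repeat=len(obs)):   # X_2, ..., X_{T+1}; X_1 = CP
    path = ("CP",) + tail
    p = 1.0
    for t, symbol in enumerate(obs):
        p *= emit[path[t]][symbol]              # emission from the state at time t
        p *= trans[(path[t], path[t + 1])]      # transition to the next state
    total += p
print(total)  # 0.084 (up to float rounding)
```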
Why use HMMs?
HMMs are useful when one can think of underlying events probabilistically generating surface events.
POS tagging (Chap. 10)
There exist efficient methods of training through use of the EM algorithm.
Given plenty of data that we assume to be generated by some HMM, this algorithm allows us to automatically learn the model parameters that best account for the observed data.
Linear interpolation of n-gram models
We can build an HMM with hidden states that represent the choice of whether to use the unigram, bigram, or trigram probabilities.
$P_{li}(w_n \mid w_{n-1}, w_{n-2}) = \lambda_1 P_1(w_n) + \lambda_2 P_2(w_n \mid w_{n-1}) + \lambda_3 P_3(w_n \mid w_{n-1}, w_{n-2})$, where $\sum_i \lambda_i = 1$.
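For concreteness, a minimal sketch of the interpolation formula itself (assumed Python; the component probabilities and weights below are hypothetical stand-ins for unigram, bigram, and trigram estimates trained elsewhere):

```python
# Sketch of P_li = l1 * P1(w) + l2 * P2(w | w1) + l3 * P3(w | w1, w2).
def interpolate(p_uni, p_bi, p_tri, lambdas):
    l1, l2, l3 = lambdas               # weights; must sum to 1
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Hypothetical component estimates for a single word in context.
print(interpolate(0.01, 0.10, 0.40, (0.2, 0.3, 0.5)))  # 0.232
```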
Hidden Markov Models (Cont.)
Linear interpolation of n-gram models
P.323 Figure 9.3
Hidden Markov Models (Cont.)
General form of an HMM
An HMM is specified by a five-tuple $(S, K, \Pi, A, B)$:
$S = \{s_1, \ldots, s_N\}$: set of states
$K = \{k_1, \ldots, k_M\} = \{1, \ldots, M\}$: output alphabet
$\Pi = \{\pi_i\},\ i \in S$: initial state probabilities
$A = \{a_{ij}\},\ i, j \in S$: state transition probabilities
$B = \{b_{ijk}\},\ i, j \in S,\ k \in K$: symbol emission probabilities
$X = (X_1, \ldots, X_{T+1})$: state sequence, $X_t \in S$
$O = (o_1, \ldots, o_T)$: output sequence, $o_t \in K$
arc-emission HMM vs. state-emission HMM
arc-emission HMM: the symbol emitted at time t depends on both the state at time t and the state at time t+1.
state-emission HMM: the symbol emitted at time t depends just on the state at time t.
Hidden Markov Models (Cont.)
The Three Fundamental Questions for HMMs
1. Given a model $\mu = (A, B, \Pi)$, how do we efficiently compute how likely a certain observation is, that is, $P(O \mid \mu)$?
Used to decide which of competing models is best.
2. Given the observation sequence O and a model $\mu$, how do we choose a state sequence $(X_1, \ldots, X_{T+1})$ that best explains the observations?
Guess what path was probably followed through the Markov chain; used for classification (e.g. POS tagging).
3. Given an observation sequence O and a space of possible models found by varying the model parameters $\mu = (A, B, \Pi)$, how do we find the model that best explains the observed data?
Estimate model parameters from data.
Finding the probability of an observation
Decoding:
$P(O \mid \mu) = \sum_X P(O, X \mid \mu) = \sum_X P(O \mid X, \mu)\, P(X \mid \mu) = \sum_{X_1 \cdots X_{T+1}} \pi_{X_1} \prod_{t=1}^{T} a_{X_t X_{t+1}}\, b_{X_t X_{t+1} o_t}$
where
$P(O \mid X, \mu) = \prod_{t=1}^{T} P(o_t \mid X_t, X_{t+1}, \mu) = b_{X_1 X_2 o_1}\, b_{X_2 X_3 o_2} \cdots b_{X_T X_{T+1} o_T}$
$P(X \mid \mu) = \pi_{X_1}\, a_{X_1 X_2}\, a_{X_2 X_3} \cdots a_{X_T X_{T+1}}$
Direct evaluation requires $(2T+1) \cdot N^{T+1}$ multiplications.
The Three Fundamental Questions for HMMs (Cont.)
Trellis algorithms
The secret to avoiding this complexity is the general technique of dynamic programming.
Remember partial results rather than recomputing them.
Trellis algorithms
Make a square array of states versus time.
Compute the probabilities of being at each state at each time in terms of the probabilities for being in each state at the preceding time instant.
Finding the probability of an observation (Cont.)
Trellis algorithms (Cont.)
P.328 Figure 9.5
Finding the probability of an observation (Cont.)
The forward procedure
Forward variables: $\alpha_i(t) = P(o_1 o_2 \cdots o_{t-1}, X_t = i \mid \mu)$
$\alpha_i(t)$ is stored at $(s_i, t)$ in the trellis.
It expresses the total probability of ending up in state $s_i$ at time $t$.
It is calculated by summing probabilities for all incoming arcs at a trellis node.
1. Initialization: $\alpha_i(1) = \pi_i$, $1 \le i \le N$
2. Induction: $\alpha_j(t+1) = \sum_{i=1}^{N} \alpha_i(t)\, a_{ij}\, b_{ij o_t}$, $1 \le t \le T$, $1 \le j \le N$
3. Total: $P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(T+1)$
Requires $2N^2 T$ multiplications.
Finding the probability of an observation (Cont.)
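A compact sketch of this procedure (assumed Python, not from the slides). For simplicity it lets the emission depend only on the state the arc leaves from, i.e. $b_{ijk} = b_{ik}$, a special case of the arc-emission form above; the toy parameters are the crazy soft drink machine, and the result matches the 0.084 computed earlier.

```python
# Forward procedure sketch: alpha_i(1) = pi_i, then for each observation
# alpha_j(t+1) = sum_i alpha_i(t) * b[i][o_t] * a[i][j]; P(O | mu) is the
# sum of the final column of the trellis.

def forward(pi, a, b, obs):
    """pi[i]: initial probs, a[i][j]: transition probs, b[i][o]: emission probs."""
    states = list(pi)
    alpha = [dict(pi)]                                   # alpha_i(1) = pi_i
    for o in obs:                                        # induction, t = 1..T
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * b[i][o] * a[i][j] for i in states)
                      for j in states})
    return sum(alpha[-1].values()), alpha                # P(O | mu), trellis

pi = {"CP": 1.0, "IP": 0.0}
a = {"CP": {"CP": 0.7, "IP": 0.3}, "IP": {"CP": 0.5, "IP": 0.5}}
b = {"CP": {"cola": 0.6, "ice_t": 0.1, "lem": 0.3},
     "IP": {"cola": 0.1, "ice_t": 0.7, "lem": 0.2}}
prob, _ = forward(pi, a, b, ["lem", "ice_t"])
print(prob)  # ~0.084
```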
The forward procedure
P.329 Figure 9.6
Finding the probability of an observation (Cont.)
The backward procedure
Backward variables: $\beta_i(t) = P(o_t \cdots o_T \mid X_t = i, \mu)$
$\beta_i(t)$ is the total probability of seeing the rest of the observation sequence given that we were in state $s_i$ at time $t$.
Combination of forward and backward probabilities is vital for solving the third problem of parameter reestimation.
1. Initialization: $\beta_i(T+1) = 1$, $1 \le i \le N$
2. Induction: $\beta_i(t) = \sum_{j=1}^{N} a_{ij}\, b_{ij o_t}\, \beta_j(t+1)$, $1 \le t \le T$, $1 \le i \le N$
3. Total: $P(O \mid \mu) = \sum_{i=1}^{N} \pi_i\, \beta_i(1)$
Finding the probability of an observation (Cont.)
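A matching backward sketch (assumed Python, same source-state emission simplification and toy machine as the forward sketch above); summing $\pi_i \beta_i(1)$ recovers the same 0.084.

```python
# Backward procedure sketch: beta_i(T+1) = 1, then
# beta_i(t) = b[i][o_t] * sum_j a[i][j] * beta_j(t+1),
# and P(O | mu) = sum_i pi_i * beta_i(1).

def backward(pi, a, b, obs):
    states = list(pi)
    beta = [{i: 1.0 for i in states}]                 # beta_i(T+1) = 1
    for o in reversed(obs):                           # induction, t = T..1
        nxt = beta[0]
        beta.insert(0, {i: b[i][o] * sum(a[i][j] * nxt[j] for j in states)
                        for i in states})
    return sum(pi[i] * beta[0][i] for i in states), beta

pi = {"CP": 1.0, "IP": 0.0}
a = {"CP": {"CP": 0.7, "IP": 0.3}, "IP": {"CP": 0.5, "IP": 0.5}}
b = {"CP": {"cola": 0.6, "ice_t": 0.1, "lem": 0.3},
     "IP": {"cola": 0.1, "ice_t": 0.7, "lem": 0.2}}
prob, _ = backward(pi, a, b, ["lem", "ice_t"])
print(prob)  # ~0.084, matching the forward procedure
```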
Variable calculations
P.330 Table 9.2
Finding the probability of an observation (Cont.)
Combining them
$P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(t)\, \beta_i(t)$, for $1 \le t \le T+1$
since
$P(O, X_t = i \mid \mu) = P(o_1 \cdots o_T, X_t = i \mid \mu)$
$= P(o_1 \cdots o_{t-1}, X_t = i \mid \mu)\, P(o_t \cdots o_T \mid o_1 \cdots o_{t-1}, X_t = i, \mu)$
$= P(o_1 \cdots o_{t-1}, X_t = i \mid \mu)\, P(o_t \cdots o_T \mid X_t = i, \mu)$
$= \alpha_i(t)\, \beta_i(t)$
Finding the probability of an observation (Cont.)
Finding the best state sequence
Choosing the states individually
For each $t$, we would find the $X_t$ that maximizes $P(X_t \mid O, \mu)$.
The individually most likely state is
$\hat{X}_t = \arg\max_{1 \le i \le N} \gamma_i(t)$, $1 \le t \le T+1$, where
$\gamma_i(t) = P(X_t = i \mid O, \mu) = \frac{P(X_t = i, O \mid \mu)}{P(O \mid \mu)} = \frac{\alpha_i(t)\, \beta_i(t)}{\sum_{j=1}^{N} \alpha_j(t)\, \beta_j(t)}$
This quantity maximizes the expected number of states that will be guessed correctly.
However, it may yield a quite unlikely state sequence.
This is not the method that is normally used.
The Three Fundamental Questions for HMMs (Cont.)
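A self-contained sketch of this posterior choice (assumed Python, same simplified toy machine as in the earlier sketches): it runs the forward and backward passes, forms $\gamma_i(t)$, and picks the argmax state at each time.

```python
# Posterior ("individually most likely") state choice:
# gamma_i(t) = alpha_i(t) * beta_i(t) / P(O | mu), argmax over i for each t.

def posterior_states(pi, a, b, obs):
    states = list(pi)
    alpha = [dict(pi)]                                   # forward pass
    for o in obs:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * b[i][o] * a[i][j] for i in states)
                      for j in states})
    beta = [{i: 1.0 for i in states}]                    # backward pass
    for o in reversed(obs):
        nxt = beta[0]
        beta.insert(0, {i: b[i][o] * sum(a[i][j] * nxt[j] for j in states)
                        for i in states})
    total = sum(alpha[-1].values())                      # P(O | mu)
    gamma = [{i: alpha[t][i] * beta[t][i] / total for i in states}
             for t in range(len(obs) + 1)]
    return [max(g, key=g.get) for g in gamma]

pi = {"CP": 1.0, "IP": 0.0}
a = {"CP": {"CP": 0.7, "IP": 0.3}, "IP": {"CP": 0.5, "IP": 0.5}}
b = {"CP": {"cola": 0.6, "ice_t": 0.1, "lem": 0.3},
     "IP": {"cola": 0.1, "ice_t": 0.7, "lem": 0.2}}
print(posterior_states(pi, a, b, ["lem", "ice_t"]))  # ['CP', 'IP', 'CP']
```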
Viterbi algorithm
We want to find the most likely complete path, that is:
$\arg\max_X P(X \mid O, \mu) = \arg\max_X P(X, O \mid \mu)$
Define $\delta_j(t) = \max_{X_1 \cdots X_{t-1}} P(X_1 \cdots X_{t-1}, o_1 \cdots o_{t-1}, X_t = j \mid \mu)$.
This variable stores, for each point in the trellis, the probability of the most probable path that leads to that node.
$\psi_j(t)$ records the node of the incoming arc that led to this most probable path.
Finding the best state sequence (Cont.)
Viterbi algorithm (Cont.)
1. Initialization: $\delta_j(1) = \pi_j$, $1 \le j \le N$
2. Induction: $\delta_j(t+1) = \max_{1 \le i \le N} \delta_i(t)\, a_{ij}\, b_{ij o_t}$, $1 \le j \le N$
Store backtrace: $\psi_j(t+1) = \arg\max_{1 \le i \le N} \delta_i(t)\, a_{ij}\, b_{ij o_t}$, $1 \le j \le N$
3. Termination and path readout (by backtracking):
$\hat{X}_{T+1} = \arg\max_{1 \le i \le N} \delta_i(T+1)$
$\hat{X}_t = \psi_{\hat{X}_{t+1}}(t+1)$
$P(\hat{X}) = \max_{1 \le i \le N} \delta_i(T+1)$
Finding the best state sequence (Cont.)
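A compact sketch of these steps (assumed Python, with the same source-state emission simplification and toy machine used in the earlier sketches):

```python
# Viterbi sketch: delta_j(1) = pi_j; delta_j(t+1) = max_i delta_i(t)*b[i][o]*a[i][j];
# psi stores the best predecessor for backtracking.

def viterbi(pi, a, b, obs):
    states = list(pi)
    delta, psi = [dict(pi)], []
    for o in obs:                                        # induction, t = 1..T
        prev, d, back = delta[-1], {}, {}
        for j in states:
            best = max(states, key=lambda i: prev[i] * b[i][o] * a[i][j])
            d[j] = prev[best] * b[best][o] * a[best][j]
            back[j] = best
        delta.append(d)
        psi.append(back)
    last = max(states, key=lambda i: delta[-1][i])       # termination
    path = [last]
    for back in reversed(psi):                           # path readout
        path.insert(0, back[path[0]])
    return path, delta[-1][last]

pi = {"CP": 1.0, "IP": 0.0}
a = {"CP": {"CP": 0.7, "IP": 0.3}, "IP": {"CP": 0.5, "IP": 0.5}}
b = {"CP": {"cola": 0.6, "ice_t": 0.1, "lem": 0.3},
     "IP": {"cola": 0.1, "ice_t": 0.7, "lem": 0.2}}
print(viterbi(pi, a, b, ["lem", "ice_t"]))  # one best path: ['CP', 'IP', 'CP'], prob ~0.0315
```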
The third problem: Parameter estimation
There is no known analytic method to choose $\mu$ to maximize $P(O_{\text{training}} \mid \mu)$, but we can locally maximize it by an iterative hill-climbing algorithm.
Baum-Welch or Forward-Backward algorithm
Work out the probability of the observation sequence using some (perhaps randomly chosen) model.
We can see which state transitions and symbol emissions were probably used the most.
By increasing the probability of those, we can choose a revised model which gives a higher probability to the observation sequence.
Training!
$\arg\max_{\mu} P(O_{\text{training}} \mid \mu)$
The Three Fundamental Questions for HMMs (Cont.)
Baum-Welch algorithm
Probability of traversing a certain arc at time $t$ given observation sequence $O$:
$p_t(i, j) = P(X_t = i, X_{t+1} = j \mid O, \mu) = \frac{P(X_t = i, X_{t+1} = j, O \mid \mu)}{P(O \mid \mu)} = \frac{\alpha_i(t)\, a_{ij}\, b_{ij o_t}\, \beta_j(t+1)}{\sum_{m=1}^{N} \alpha_m(t)\, \beta_m(t)} = \frac{\alpha_i(t)\, a_{ij}\, b_{ij o_t}\, \beta_j(t+1)}{\sum_{m=1}^{N} \sum_{n=1}^{N} \alpha_m(t)\, a_{mn}\, b_{mn o_t}\, \beta_n(t+1)}$
Let $\gamma_i(t) = \sum_{j=1}^{N} p_t(i, j)$. Then
$\sum_{t=1}^{T} \gamma_i(t)$ = expected number of transitions from state $i$ in $O$
$\sum_{t=1}^{T} p_t(i, j)$ = expected number of transitions from state $i$ to $j$ in $O$
The third problem: Parameter estimation (Cont.)
Baum-Welch algorithm
P.334 Figure 9.7
The third problem: Parameter estimation (Cont.)
Baum-Welch algorithm (Cont.)
Begin with some model (perhaps preselected, perhaps just chosen randomly)
Run O through the current model to estimate the expectations of each model parameter.
Change the model to maximize the values of the paths that are used a lot.
Repeat this process, hoping to converge on optimal values for the model parameters $\mu$.
The third problem: Parameter estimation (Cont.)
Baum-Welch algorithm (Cont.)
Reestimation: from $\mu = (A, B, \Pi)$, derive $\hat{\mu} = (\hat{A}, \hat{B}, \hat{\Pi})$:
$\hat{\pi}_i$ = expected frequency in state $i$ at time $t = 1$ = $\gamma_i(1)$
$\hat{a}_{ij}$ = (expected # of transitions from state $i$ to $j$) / (expected # of transitions from state $i$) = $\frac{\sum_{t=1}^{T} p_t(i, j)}{\sum_{t=1}^{T} \gamma_i(t)}$
$\hat{b}_{ijk}$ = (expected # of transitions from $i$ to $j$ with $k$ observed) / (expected # of transitions from $i$ to $j$) = $\frac{\sum_{\{t:\, o_t = k,\ 1 \le t \le T\}} p_t(i, j)}{\sum_{t=1}^{T} p_t(i, j)}$
The reestimated model satisfies $P(O \mid \hat{\mu}) \ge P(O \mid \mu)$.
Continue reestimating the parameters until the results are no longer improving significantly.
This does not guarantee that we will find the best model: we may end up at a local maximum or a saddle point.
The third problem: Parameter estimation (Cont.)
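A single reestimation step, as a compact sketch (assumed Python). It keeps the source-state emission simplification of the earlier sketches, so the emission update is the state-emission analogue of the $\hat{b}_{ijk}$ formula above; the starting parameters below are an arbitrary guess, purely for illustration.

```python
# One Baum-Welch step: forward/backward passes, expected counts p_t(i,j) and
# gamma_i(t), then the reestimated pi, a, b.

def baum_welch_step(pi, a, b, obs):
    states, T = list(pi), len(obs)
    alpha = [dict(pi)]                                    # forward pass
    for o in obs:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * b[i][o] * a[i][j] for i in states)
                      for j in states})
    beta = [{i: 1.0 for i in states}]                     # backward pass
    for o in reversed(obs):
        nxt = beta[0]
        beta.insert(0, {i: b[i][o] * sum(a[i][j] * nxt[j] for j in states)
                        for i in states})
    total = sum(alpha[-1].values())                       # P(O | mu)
    p = [{(i, j): alpha[t][i] * b[i][obs[t]] * a[i][j] * beta[t + 1][j] / total
          for i in states for j in states} for t in range(T)]
    gamma = [{i: sum(p[t][(i, j)] for j in states) for i in states}
             for t in range(T)]
    new_pi = dict(gamma[0])                               # pi_i = gamma_i(1)
    new_a = {i: {j: sum(p[t][(i, j)] for t in range(T)) /
                    sum(gamma[t][i] for t in range(T)) for j in states}
             for i in states}
    new_b = {i: {k: sum(gamma[t][i] for t in range(T) if obs[t] == k) /
                    sum(gamma[t][i] for t in range(T)) for k in b[i]}
             for i in states}
    return new_pi, new_a, new_b

# Arbitrary starting guess; one step shifts emission mass toward lem and ice_t.
pi = {"CP": 0.6, "IP": 0.4}
a = {"CP": {"CP": 0.6, "IP": 0.4}, "IP": {"CP": 0.4, "IP": 0.6}}
b = {"CP": {"cola": 0.5, "ice_t": 0.2, "lem": 0.3},
     "IP": {"cola": 0.3, "ice_t": 0.5, "lem": 0.2}}
print(baum_welch_step(pi, a, b, ["lem", "ice_t"]))
```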
Baum-Welch algorithm (Cont.)
P.336
The third problem: Parameter estimation (Cont.)
Implementation
Floating point underflow
The probabilities we are calculating are products of many very small numbers.
Work with logarithms
This also speeds up the computation.
Employ auxiliary scaling coefficients
Their values grow with the time t so that the probabilities multiplied by the scaling coefficient remain within the floating point range of the computer.
When the parameter values are reestimated, these scaling factors cancel out.
HMMs: Implementation, Properties, and Variants (Cont.)
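A minimal sketch of the scaling idea (assumed Python, same toy machine and simplified emission form as before): each forward column is renormalized to sum to 1 and the scaling factors are accumulated in log space, so $\log P(O \mid \mu)$ is recovered without underflow.

```python
# Scaled forward pass: after each induction step the column is renormalized;
# the logs of the scaling factors sum to log P(O | mu).
import math

def scaled_forward_logprob(pi, a, b, obs):
    states = list(pi)
    alpha = dict(pi)
    log_prob = 0.0
    for o in obs:
        alpha = {j: sum(alpha[i] * b[i][o] * a[i][j] for i in states)
                 for j in states}
        scale = sum(alpha.values())              # scaling coefficient for this t
        alpha = {j: alpha[j] / scale for j in states}
        log_prob += math.log(scale)
    return log_prob

pi = {"CP": 1.0, "IP": 0.0}
a = {"CP": {"CP": 0.7, "IP": 0.3}, "IP": {"CP": 0.5, "IP": 0.5}}
b = {"CP": {"cola": 0.6, "ice_t": 0.1, "lem": 0.3},
     "IP": {"cola": 0.1, "ice_t": 0.7, "lem": 0.2}}
print(math.exp(scaled_forward_logprob(pi, a, b, ["lem", "ice_t"])))  # ~0.084
```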
Variants
Epsilon or null transitions
State-emission model
Make the output distribution dependent just on a single state.
Large number of parameters that need to be estimated
Parameter tying: assume that the probability distributions at certain arcs or at certain states are the same as each other.
Structural zero
Decide that certain things are impossible (probability zero).
HMMs: Implementation, Properties, and Variants (Cont.)
Multiple input observations
Ergodic model
Every state is connected to every other state.
We simply concatenate all the observation sequences and train on them as one long input.
We do not get sufficient data to be able to reestimate the initial probabilities $\pi_i$ successfully.
Feed forward model
Not fully connected.
There is an ordered set of states.
One can only proceed at each time instant to the same or a higher numbered state.
We need to extend the reestimation formulae to work with a sequence of inputs.
HMMs: Implementation, Properties, and Variants (Cont.)
Initialization of parameter values
If we would rather find the global maximum, try to start the HMM in a region of the parameter space that is near the global maximum.
Good initial estimates for the output parameters $B = \{b_{ijk}\}$ turn out to be particularly important, while random initial estimates for the parameters $A$ and $\Pi$ are normally satisfactory.
HMMs: Implementation, Properties, and Variants (Cont.)