
Ch 9. Markov Models, Natural Language Processing Lab., Korea University, Han Kyung-soo (한경수), 2000. 3. 25


Contents

Markov models

Hidden Markov models

The three fundamental questions for HMMs

Finding the probability of an observation

Finding the best state sequence

Parameter estimation

HMMs: implementation, properties, and variants


Markov Models

$X = (X_1, \ldots, X_T)$: a sequence of random variables taking values in some finite set $S = \{s_1, \ldots, s_N\}$, the state space.

Markov properties

Limited horizon: $P(X_{t+1} = s_k \mid X_1, \ldots, X_t) = P(X_{t+1} = s_k \mid X_t)$

Time invariant (stationary): $P(X_{t+1} = s_k \mid X_t) = P(X_2 = s_k \mid X_1)$

If X has the Markov property, X is said to be a Markov chain.

Stochastic transition matrix A: $a_{ij} = P(X_{t+1} = s_j \mid X_t = s_i)$, with $a_{ij} \ge 0 \;\; \forall i, j$ and $\sum_{j=1}^{N} a_{ij} = 1 \;\; \forall i$

Probabilities of different initial states: $\pi_i = P(X_1 = s_i)$, with $\sum_{i=1}^{N} \pi_i = 1$


Markov Models (Cont.)

Markov models can be used whenever one wants to model the probability of a linear sequence of events

word n-gram models, modeling valid phone sequences in speech recognition, sequences of speech acts in dialog systems

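A Markov model can be thought of as a probabilistic finite-state automaton.

Probability of a sequence of states

$$\begin{aligned}
P(X_1, \ldots, X_T) &= P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_1, X_2) \cdots P(X_T \mid X_1, \ldots, X_{T-1}) \\
&= P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_2) \cdots P(X_T \mid X_{T-1}) \\
&= \pi_{X_1} \prod_{t=1}^{T-1} a_{X_t X_{t+1}}
\end{aligned}$$

mth order Markov model

m: the number of previous states that we are using to predict the next state

An n-gram model is equivalent to an (n-1)th order Markov model.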


Markov Models (Cont.)

P.319 Figure 9.1

$$P(t, i, p) = P(X_1 = t)\, P(X_2 = i \mid X_1 = t)\, P(X_3 = p \mid X_2 = i) = 1.0 \times 0.3 \times 0.6 = 0.18$$
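The product of an initial probability and successive transition probabilities can be sketched in Python. The dictionaries hold only the three parameters quoted in this example (the start probability of t, and the transitions t to i and i to p); the rest of Figure 9.1 is not reproduced here, and the names `sequence_probability`, `initial`, and `transitions` are illustrative assumptions.

```python
def sequence_probability(states, initial, transitions):
    """P(X_1..X_T) = pi_{X_1} * prod_t a_{X_t X_{t+1}} for a first-order chain."""
    prob = initial[states[0]]
    for prev, nxt in zip(states, states[1:]):
        # Direct indexing raises KeyError for parameters we were not given,
        # instead of silently treating them as zero.
        prob *= transitions[(prev, nxt)]
    return prob


# Only the parameters used in the worked example above.
initial = {"t": 1.0}
transitions = {("t", "i"): 0.3, ("i", "p"): 0.6}

p_tip = sequence_probability(["t", "i", "p"], initial, transitions)
print(round(p_tip, 2))
```

Because of the limited-horizon property, each factor looks only at the immediately preceding state, so the loop needs nothing but adjacent pairs.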


Hidden Markov Models

In an HMM, you don't know the state sequence that the model passes through, but only some probabilistic function of it.

Emission probability for the observations: $b_{ijk} = P(O_t = k \mid X_t = s_i, X_{t+1} = s_j)$

Example: the crazy soft drink machine

Q: What is the probability of seeing the output sequence {lem, ice_t} if the machine always starts off in the cola preferring state?

A: Consider all paths that might be taken through the HMM, and then sum over them.

$$\begin{aligned}
P(\text{lem}, \text{ice\_t}) &= 0.7 \times 0.3 \times 0.7 \times 0.1 + 0.7 \times 0.3 \times 0.3 \times 0.1 \\
&\quad + 0.3 \times 0.3 \times 0.5 \times 0.7 + 0.3 \times 0.3 \times 0.5 \times 0.7 \\
&= 0.084
\end{aligned}$$


The Crazy Soft Drink Machine

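States: CP (cola preferring) and IP (iced tea preferring); the machine starts in CP.

Transition probabilities

        to CP   to IP
CP      0.7     0.3
IP      0.5     0.5

Output probabilities

        cola    iced tea (ice_t)    lemonade (lem)
CP      0.6     0.1                 0.3
IP      0.1     0.7                 0.2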
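As a check on the path-summing answer for {lem, ice_t}, here is a minimal brute-force sketch. It assumes, as in the worked sum, that the symbol at time t is emitted from the state occupied at time t before the machine moves on; the function and variable names are illustrative, not part of the slides.

```python
from itertools import product

STATES = ["CP", "IP"]
START = "CP"  # the machine always starts in the cola preferring state
TRANS = {"CP": {"CP": 0.7, "IP": 0.3}, "IP": {"CP": 0.5, "IP": 0.5}}
EMIT = {"CP": {"cola": 0.6, "ice_t": 0.1, "lem": 0.3},
        "IP": {"cola": 0.1, "ice_t": 0.7, "lem": 0.2}}


def brute_force_probability(observations):
    """Sum P(O, X) over every state path X_1..X_{T+1} with X_1 = START."""
    total = 0.0
    for tail in product(STATES, repeat=len(observations)):
        path = (START,) + tail
        p = 1.0
        for t, symbol in enumerate(observations):
            # Emit from the current state, then take the transition.
            p *= EMIT[path[t]][symbol] * TRANS[path[t]][path[t + 1]]
        total += p
    return total


p_lem_ice_t = brute_force_probability(["lem", "ice_t"])
print(round(p_lem_ice_t, 3))
```

The loop visits every one of the N^T continuations of the start state, which is fine for two symbols but grows exponentially with the length of the observation.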


Why use HMMs?

HMMs are useful when one can think of underlying events probabilistically generating surface events.

POS tagging (Chap. 10)

There exist efficient methods of training through use of the EM algorithm.

Given plenty of data that we assume to be generated by some HMM, this algorithm allows us to automatically learn the model parameters that best account for the observed data.

Linear interpolation of n-gram models

We can build an HMM with hidden states that represent the choice of whether to use the unigram, bigram, or trigram probabilities.

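$$P_{li}(w_n \mid w_{n-1}, w_{n-2}) = \lambda_1 P_1(w_n) + \lambda_2 P_2(w_n \mid w_{n-1}) + \lambda_3 P_3(w_n \mid w_{n-1}, w_{n-2}), \qquad \sum_i \lambda_i = 1$$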
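The interpolation formula is a weighted average of the three component estimates. A minimal sketch follows; the component probabilities and weights are hypothetical values chosen only for illustration, and `interpolate` is an assumed helper name.

```python
def interpolate(p_unigram, p_bigram, p_trigram, lambdas):
    """P_li = l1*P1 + l2*P2 + l3*P3, with the weights summing to 1."""
    if abs(sum(lambdas) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    l1, l2, l3 = lambdas
    return l1 * p_unigram + l2 * p_bigram + l3 * p_trigram


# Hypothetical component probabilities and weights, for illustration only.
p_li = interpolate(0.01, 0.2, 0.5, (0.1, 0.3, 0.6))
print(round(p_li, 3))
```

In the HMM view, each weight plays the role of the probability of choosing the hidden state that emits from the unigram, bigram, or trigram distribution.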


Linear interpolation of n-gram models

P.323 Figure 9.3


General form of an HMM

An HMM is specified by a five-tuple $(S, K, \Pi, A, B)$

$S = \{s_1, \ldots, s_N\}$: set of states

$K = \{k_1, \ldots, k_M\} = \{1, \ldots, M\}$: output alphabet

$\Pi = \{\pi_i\},\; i \in S$: initial state probabilities

$A = \{a_{ij}\},\; i, j \in S$: state transition probabilities

$B = \{b_{ijk}\},\; i, j \in S,\; k \in K$: symbol emission probabilities

$X = (X_1, \ldots, X_{T+1}),\; X_t : S \mapsto \{1, \ldots, N\}$: state sequence

$O = (o_1, \ldots, o_T),\; o_t \in K$: output sequence

arc-emission HMM vs. state-emission HMM

arc-emission HMM: the symbol emitted at time t depends on both the state at time t and at time t+1.

state-emission HMM: the symbol emitted at time t depends just on the state at time t.
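One way to hold these parameters in code is as nested lists indexed by state and symbol numbers. The sketch below is an assumption about representation, not something the slides prescribe: it turns a state-emission table (the soft drink machine from the earlier slide) into the arc-emission form $b_{ijk}$ by copying each state's distribution onto all of its outgoing arcs, and checks the usual stochastic constraints.

```python
def arc_emission_from_state_emission(A, E):
    """Build B with b[i][j][k] = E[i][k]: the symbol depends only on source state i."""
    n = len(A)
    return [[list(E[i]) for _ in range(n)] for i in range(n)]


def is_valid_hmm(pi, A, B, tol=1e-9):
    """Check that Pi, each row of A, and each arc distribution in B sum to 1."""
    sums = [sum(pi)]
    sums += [sum(row) for row in A]
    sums += [sum(dist) for row in B for dist in row]
    return all(abs(s - 1.0) <= tol for s in sums)


# Soft drink machine: states 0 = CP, 1 = IP; symbols 0 = cola, 1 = ice_t, 2 = lem.
PI = [1.0, 0.0]
A = [[0.7, 0.3], [0.5, 0.5]]
E = [[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]]
B = arc_emission_from_state_emission(A, E)

print(is_valid_hmm(PI, A, B))
```

Seen this way, a state-emission HMM is the special case of an arc-emission HMM in which $b_{ijk}$ does not vary with j.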


The Three Fundamental Questions for HMMs

1. Given a model $\mu = (A, B, \Pi)$, how do we efficiently compute how likely a certain observation is, that is $P(O \mid \mu)$?

Used to decide which of several models is best.

2. Given the observation sequence O and a model $\mu$, how do we choose a state sequence $(X_1, \ldots, X_{T+1})$ that best explains the observations?

Guess what path was probably followed through the Markov chain; used for classification (e.g. POS tagging).

3. Given an observation sequence O, and a space of possible models found by varying the model parameters $\mu = (A, B, \Pi)$, how do we find the model that best explains the observed data?

Estimate model parameters from data.


Finding the probability of an observation

Decoding

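$$P(O \mid X, \mu) = \prod_{t=1}^{T} P(o_t \mid X_t, X_{t+1}, \mu) = b_{X_1 X_2 o_1}\, b_{X_2 X_3 o_2} \cdots b_{X_T X_{T+1} o_T}$$

$$P(X \mid \mu) = \pi_{X_1}\, a_{X_1 X_2}\, a_{X_2 X_3} \cdots a_{X_T X_{T+1}}$$

$$P(O, X \mid \mu) = P(O \mid X, \mu)\, P(X \mid \mu)$$

$$P(O \mid \mu) = \sum_{X} P(O \mid X, \mu)\, P(X \mid \mu) = \sum_{X_1 \cdots X_{T+1}} \pi_{X_1} \prod_{t=1}^{T} a_{X_t X_{t+1}}\, b_{X_t X_{t+1} o_t}$$

Requires $(2T + 1) \cdot N^{T+1}$ multiplications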


Trellis algorithms

The secret to avoiding this complexity is the general technique of dynamic programming.

Remember partial results rather than recomputing them.

Trellis algorithms

Make a square array of states versus time

Compute the probabilities of being at each state at each time in terms of the probabilities for being in each state at the preceding time instant.


Trellis algorithms (Cont.)

P.328 Figure 9.5


The forward procedure

Forward variables: $\alpha_i(t) = P(o_1 o_2 \cdots o_{t-1}, X_t = i \mid \mu)$

$\alpha_i(t)$ is stored at $(s_i, t)$ in the trellis

expresses the total probability of ending up in state $s_i$ at time t

is calculated by summing probabilities for all incoming arcs at a trellis node

1. Initialization: $\alpha_i(1) = \pi_i, \quad 1 \le i \le N$

2. Induction: $\alpha_j(t+1) = \sum_{i=1}^{N} \alpha_i(t)\, a_{ij}\, b_{ij o_t}, \quad 1 \le t \le T,\; 1 \le j \le N$

3. Total: $P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(T+1)$

Requires $2N^2T$ multiplications
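The three steps of the forward procedure can be sketched as below. The list-of-lists layout, the 0-based indices, and the use of the soft drink machine as test data (with its state emissions copied onto arcs) are assumptions made for this illustration.

```python
def forward(pi, A, B, obs):
    """Return the trellis alpha (rows are times 1..T+1) and P(O | mu)."""
    n = len(pi)
    alpha = [list(pi)]                      # 1. initialization: alpha_i(1) = pi_i
    for o in obs:                           # 2. induction over t = 1..T
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] * B[i][j][o] for i in range(n))
                      for j in range(n)])
    return alpha, sum(alpha[-1])            # 3. total: sum_i alpha_i(T+1)


# Soft drink machine: states 0 = CP, 1 = IP; symbols 0 = cola, 1 = ice_t, 2 = lem.
COLA, ICE_T, LEM = 0, 1, 2
PI = [1.0, 0.0]
A = [[0.7, 0.3], [0.5, 0.5]]
E = [[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]]
B = [[E[i] for _ in range(2)] for i in range(2)]  # b[i][j][k] = E[i][k]

alpha, p_obs = forward(PI, A, B, [LEM, ICE_T, COLA])
print(round(p_obs, 4))
```

Each trellis column is computed from the previous one only, so the work is proportional to N squared per time step rather than to the number of paths.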


The forward procedure

P.329 Figure 9.6


The backward procedure

Backward variables: $\beta_i(t) = P(o_t \cdots o_T \mid X_t = i, \mu)$

The total probability of seeing the rest of the observation sequence given that we were in state $s_i$ at time t

Combination of forward and backward probabilities is vital for solving the third problem of parameter reestimation

1. Initialization: $\beta_i(T+1) = 1, \quad 1 \le i \le N$

2. Induction: $\beta_i(t) = \sum_{j=1}^{N} a_{ij}\, b_{ij o_t}\, \beta_j(t+1), \quad 1 \le t \le T,\; 1 \le i \le N$

3. Total: $P(O \mid \mu) = \sum_{i=1}^{N} \pi_i\, \beta_i(1)$
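The backward procedure mirrors the forward one but fills the trellis from the right. A minimal sketch, again assuming 0-based list indices and the soft drink machine parameters as test data:

```python
def backward(pi, A, B, obs):
    """Return the trellis beta (rows are times 1..T+1) and P(O | mu)."""
    n = len(pi)
    beta = [[1.0] * n]                      # 1. initialization: beta_i(T+1) = 1
    for o in reversed(obs):                 # 2. induction, t = T down to 1
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[i][j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta, sum(pi[i] * beta[0][i] for i in range(n))  # 3. total


# Soft drink machine: states 0 = CP, 1 = IP; symbols 0 = cola, 1 = ice_t, 2 = lem.
COLA, ICE_T, LEM = 0, 1, 2
PI = [1.0, 0.0]
A = [[0.7, 0.3], [0.5, 0.5]]
E = [[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]]
B = [[E[i] for _ in range(2)] for i in range(2)]  # b[i][j][k] = E[i][k]

beta, p_obs = backward(PI, A, B, [LEM, ICE_T, COLA])
print(round(p_obs, 4))
```

For the same observation sequence, the total should agree with the value obtained from the forward procedure, which makes a convenient sanity check on an implementation.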


Variable calculations

P.330 Table 9.2


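Combining them

$$\begin{aligned}
P(O, X_t = i \mid \mu) &= P(o_1 \cdots o_T, X_t = i \mid \mu) \\
&= P(o_1 \cdots o_{t-1}, X_t = i, o_t \cdots o_T \mid \mu) \\
&= P(o_1 \cdots o_{t-1}, X_t = i \mid \mu)\, P(o_t \cdots o_T \mid o_1 \cdots o_{t-1}, X_t = i, \mu) \\
&= P(o_1 \cdots o_{t-1}, X_t = i \mid \mu)\, P(o_t \cdots o_T \mid X_t = i, \mu) \\
&= \alpha_i(t)\, \beta_i(t)
\end{aligned}$$

$$P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(t)\, \beta_i(t), \quad 1 \le t \le T+1$$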


Finding the best state sequence

Choosing the states individually

For each t, $1 \le t \le T+1$, we would find $X_t$ that maximizes $P(X_t \mid O, \mu)$

$$\gamma_i(t) = P(X_t = i \mid O, \mu) = \frac{P(X_t = i, O \mid \mu)}{P(O \mid \mu)} = \frac{\alpha_i(t)\, \beta_i(t)}{\sum_{j=1}^{N} \alpha_j(t)\, \beta_j(t)}$$

The individually most likely state

$$\hat{X}_t = \arg\max_{1 \le i \le N} \gamma_i(t), \quad 1 \le t \le T+1$$

This quantity maximizes the expected number of states that will be guessed correctly.

However, it may yield a quite unlikely state sequence.

This is not the method that is normally used.
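Choosing each state individually needs both trellises. The sketch below recomputes the forward and backward variables and normalizes their product at every time step; the helper names and the 0-based indexing are assumptions, and the soft drink machine serves as test data.

```python
def forward(pi, A, B, obs):
    n = len(pi)
    alpha = [list(pi)]
    for o in obs:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] * B[i][j][o] for i in range(n))
                      for j in range(n)])
    return alpha


def backward(A, B, obs):
    n = len(A)
    beta = [[1.0] * n]
    for o in reversed(obs):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[i][j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


def state_posteriors(pi, A, B, obs):
    """gamma[t][i] = alpha_i(t) beta_i(t) / sum_j alpha_j(t) beta_j(t)."""
    gamma = []
    for a_t, b_t in zip(forward(pi, A, B, obs), backward(A, B, obs)):
        joint = [a * b for a, b in zip(a_t, b_t)]
        norm = sum(joint)  # equals P(O | mu) at every time step
        gamma.append([x / norm for x in joint])
    return gamma


# Soft drink machine: states 0 = CP, 1 = IP; symbols 0 = cola, 1 = ice_t, 2 = lem.
COLA, ICE_T, LEM = 0, 1, 2
PI = [1.0, 0.0]
A = [[0.7, 0.3], [0.5, 0.5]]
E = [[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]]
B = [[E[i] for _ in range(2)] for i in range(2)]  # b[i][j][k] = E[i][k]

gamma = state_posteriors(PI, A, B, [LEM, ICE_T, COLA])
best = [max(range(len(g)), key=g.__getitem__) for g in gamma]
print(best)
```

Nothing in this computation forces consecutive choices to be connected by a likely transition, which is why the resulting sequence can be improbable as a whole.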


Viterbi algorithm

We want to find the most likely complete path

$$\arg\max_{X} P(X \mid O, \mu) = \arg\max_{X} P(X, O \mid \mu)$$

$$\delta_j(t) = \max_{X_1 \cdots X_{t-1}} P(X_1 \cdots X_{t-1}, o_1 \cdots o_{t-1}, X_t = j \mid \mu)$$

This variable stores for each point in the trellis the probability of the most probable path that leads to that node.

$\psi_j(t)$: records the node of the incoming arc that led to this most probable path.


Viterbi algorithm (Cont.)

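1. Initialization: $\delta_j(1) = \pi_j, \quad 1 \le j \le N$

2. Induction: $\delta_j(t+1) = \max_{1 \le i \le N} \delta_i(t)\, a_{ij}\, b_{ij o_t}, \quad 1 \le j \le N$

Store backtrace: $\psi_j(t+1) = \arg\max_{1 \le i \le N} \delta_i(t)\, a_{ij}\, b_{ij o_t}, \quad 1 \le j \le N$

3. Termination and path readout (by backtracking)

$$\hat{X}_{T+1} = \arg\max_{1 \le i \le N} \delta_i(T+1)$$

$$\hat{X}_t = \psi_{\hat{X}_{t+1}}(t+1)$$

$$P(\hat{X}) = \max_{1 \le i \le N} \delta_i(T+1)$$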
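The Viterbi steps can be sketched as follows. Ties are broken in favor of the lowest state index here, which is an arbitrary choice of this illustration; the function name, list layout, and soft drink machine test data are likewise assumptions.

```python
def viterbi(pi, A, B, obs):
    """Return the most likely state path X_1..X_{T+1} and its probability."""
    n = len(pi)
    delta = list(pi)                         # 1. initialization
    backptrs = []                            # psi_j(t+1) for t = 1..T
    for o in obs:                            # 2. induction
        scores = [[delta[i] * A[i][j] * B[i][j][o] for i in range(n)]
                  for j in range(n)]
        psi = [max(range(n), key=scores[j].__getitem__) for j in range(n)]
        delta = [scores[j][psi[j]] for j in range(n)]
        backptrs.append(psi)
    last = max(range(n), key=delta.__getitem__)   # 3. termination
    path = [last]
    for psi in reversed(backptrs):           # path readout by backtracking
        path.append(psi[path[-1]])
    path.reverse()
    return path, delta[last]


# Soft drink machine: states 0 = CP, 1 = IP; symbols 0 = cola, 1 = ice_t, 2 = lem.
COLA, ICE_T, LEM = 0, 1, 2
PI = [1.0, 0.0]
A = [[0.7, 0.3], [0.5, 0.5]]
E = [[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]]
B = [[E[i] for _ in range(2)] for i in range(2)]  # b[i][j][k] = E[i][k]

path, prob = viterbi(PI, A, B, [LEM, ICE_T, COLA])
print(path, round(prob, 5))
```

The only difference from the forward procedure's induction is that the sum over incoming arcs is replaced by a maximum, plus the bookkeeping of which arc achieved it.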


The third problem: Parameter estimation

We want the model parameters that maximize the probability of the training data: $\arg\max_{\mu} P(O_{\text{training}} \mid \mu)$

There is no known analytic method to choose $\mu$ to maximize $P(O \mid \mu)$.

We can locally maximize it by an iterative hill-climbing algorithm

Baum-Welch or Forward-Backward algorithm

Work out the probability of the observation sequence using some (perhaps randomly chosen) model.

We can see which state transitions and symbol emissions were probably used the most.

By increasing the probability of those, we can choose a revised model which gives a higher probability to the observation sequence.

Training!


Baum-Welch algorithm

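Probability of traversing a certain arc at time t given observation sequence O

$$\begin{aligned}
p_t(i, j) &= P(X_t = i, X_{t+1} = j \mid O, \mu) = \frac{P(X_t = i, X_{t+1} = j, O \mid \mu)}{P(O \mid \mu)} \\
&= \frac{\alpha_i(t)\, a_{ij}\, b_{ij o_t}\, \beta_j(t+1)}{\sum_{m=1}^{N} \alpha_m(t)\, \beta_m(t)} \\
&= \frac{\alpha_i(t)\, a_{ij}\, b_{ij o_t}\, \beta_j(t+1)}{\sum_{m=1}^{N} \sum_{n=1}^{N} \alpha_m(t)\, a_{mn}\, b_{mn o_t}\, \beta_n(t+1)}
\end{aligned}$$

$$\gamma_i(t) = \sum_{j=1}^{N} p_t(i, j)$$

$\sum_{t=1}^{T} \gamma_i(t)$ = expected number of transitions from state i in O

$\sum_{t=1}^{T} p_t(i, j)$ = expected number of transitions from state i to j in O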


Baum-Welch algorithm

P.334 Figure 9.7


Baum-Welch algorithm (Cont.)

Begin with some model (perhaps preselected, perhaps just chosen randomly)

Run O through the current model to estimate the expectations of each model parameter.

Change the model to maximize the values of the paths that are used a lot.

Repeat this process, hoping to converge on optimal values for the model parameters $\mu$.


Baum-Welch algorithm (Cont.)

Reestimation: from $\mu = (A, B, \Pi)$, derive $\hat{\mu} = (\hat{A}, \hat{B}, \hat{\Pi})$

$$\hat{\pi}_i = \text{expected frequency in state } i \text{ at time } t = 1 = \gamma_i(1)$$

$$\hat{a}_{ij} = \frac{\text{expected \# of transitions from state } i \text{ to } j}{\text{expected \# of transitions from state } i} = \frac{\sum_{t=1}^{T} p_t(i, j)}{\sum_{t=1}^{T} \gamma_i(t)}$$

$$\hat{b}_{ijk} = \frac{\text{expected \# of transitions from state } i \text{ to } j \text{ with } k \text{ observed}}{\text{expected \# of transitions from state } i \text{ to } j} = \frac{\sum_{\{t :\, o_t = k,\, 1 \le t \le T\}} p_t(i, j)}{\sum_{t=1}^{T} p_t(i, j)}$$

$$P(O \mid \hat{\mu}) \ge P(O \mid \mu)$$

Continues reestimating the parameters until results are no longer improving significantly

Does not guarantee that we will find the best model: the process may end at a local maximum or a saddle point.
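One reestimation pass can be sketched as below, following the arc-emission formulae above for a single observation sequence. Keeping the old value when an expected count is zero is an assumption of this sketch (the slides do not say how to handle that case), as are the helper names and the soft drink machine starting point.

```python
def forward(pi, A, B, obs):
    n = len(pi)
    alpha = [list(pi)]
    for o in obs:
        prev = alpha[-1]
        alpha.append([sum(prev[i] * A[i][j] * B[i][j][o] for i in range(n))
                      for j in range(n)])
    return alpha


def backward(A, B, obs):
    n = len(A)
    beta = [[1.0] * n]
    for o in reversed(obs):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[i][j][o] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta


def reestimate(pi, A, B, obs):
    """One Baum-Welch step; assumes obs is non-empty and P(O | mu) > 0."""
    n, m = len(pi), len(B[0][0])
    alpha, beta = forward(pi, A, B, obs), backward(A, B, obs)
    p_obs = sum(alpha[-1])
    # p[t][i][j] = P(X_t = i, X_{t+1} = j | O, mu)
    p = [[[alpha[t][i] * A[i][j] * B[i][j][o] * beta[t + 1][j] / p_obs
           for j in range(n)] for i in range(n)] for t, o in enumerate(obs)]
    gamma = [[sum(p_t[i]) for i in range(n)] for p_t in p]
    new_pi = list(gamma[0])
    new_A = [row[:] for row in A]
    new_B = [[dist[:] for dist in row] for row in B]
    for i in range(n):
        from_i = sum(g[i] for g in gamma)
        for j in range(n):
            i_to_j = sum(p_t[i][j] for p_t in p)
            if from_i > 0:   # assumption: keep the old value if state i is never used
                new_A[i][j] = i_to_j / from_i
            if i_to_j > 0:   # assumption: keep the old distribution on unused arcs
                for k in range(m):
                    with_k = sum(p_t[i][j] for p_t, o in zip(p, obs) if o == k)
                    new_B[i][j][k] = with_k / i_to_j
    return new_pi, new_A, new_B


# Soft drink machine: states 0 = CP, 1 = IP; symbols 0 = cola, 1 = ice_t, 2 = lem.
COLA, ICE_T, LEM = 0, 1, 2
PI = [1.0, 0.0]
A = [[0.7, 0.3], [0.5, 0.5]]
E = [[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]]
B = [[list(E[i]) for _ in range(2)] for i in range(2)]  # b[i][j][k] = E[i][k]
OBS = [LEM, ICE_T, COLA]

new_pi, new_A, new_B = reestimate(PI, A, B, OBS)
before = sum(forward(PI, A, B, OBS)[-1])
after = sum(forward(new_pi, new_A, new_B, OBS)[-1])
print(round(before, 4), round(after, 4))
```

Calling `reestimate` repeatedly on its own output, and stopping when the likelihood no longer improves significantly, gives the iterative training loop described on these slides.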


Baum-Welch algorithm (Cont.)

P.336


HMMs: Implementation, Properties, and Variants

Implementation

Floating point underflow

The probabilities we are calculating are products of many very small numbers.

Work with logarithms

It also speeds up the computation.

Employ auxiliary scaling coefficients

Their values grow with the time t so that the probabilities multiplied by the scaling coefficient remain within the floating point range of the computer.

When the parameter values are reestimated, these scaling factors cancel out.
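To illustrate the underflow problem and the scaling remedy, the sketch below compares a plain forward pass with one that rescales the forward variables at every step and accumulates the logarithm of the scale factors. Normalizing each trellis column to sum to 1 is one common choice of scaling coefficient and is an assumption here, not necessarily the scheme the slides have in mind.

```python
import math


def forward_naive(pi, A, B, obs):
    """Plain forward procedure; the result underflows for long inputs."""
    n = len(pi)
    alpha = list(pi)
    for o in obs:
        alpha = [sum(alpha[i] * A[i][j] * B[i][j][o] for i in range(n))
                 for j in range(n)]
    return sum(alpha)


def forward_scaled_log(pi, A, B, obs):
    """Return log P(O | mu), rescaling alpha to sum to 1 after every step."""
    n = len(pi)
    alpha = list(pi)
    log_p = 0.0
    for o in obs:
        alpha = [sum(alpha[i] * A[i][j] * B[i][j][o] for i in range(n))
                 for j in range(n)]
        scale = sum(alpha)      # assumed > 0, i.e. the observation is possible
        log_p += math.log(scale)
        alpha = [a / scale for a in alpha]
    return log_p


# Soft drink machine: states 0 = CP, 1 = IP; symbols 0 = cola, 1 = ice_t, 2 = lem.
COLA, ICE_T, LEM = 0, 1, 2
PI = [1.0, 0.0]
A = [[0.7, 0.3], [0.5, 0.5]]
E = [[0.6, 0.1, 0.3], [0.1, 0.7, 0.2]]
B = [[E[i] for _ in range(2)] for i in range(2)]  # b[i][j][k] = E[i][k]

long_obs = [COLA] * 2000
naive_p = forward_naive(PI, A, B, long_obs)
long_log_p = forward_scaled_log(PI, A, B, long_obs)
print(naive_p, long_log_p)
```

The product of the per-step scale factors telescopes to the unscaled total, so the log-probability is recovered exactly while every intermediate value stays in a comfortable floating point range.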


Variants

Epsilon or null transitions

State-emission model

Make the output distribution dependent just on a single state.

Large number of parameters that need to be estimated

Parameter tying

Assume that the probability distributions on certain arcs or at certain states are the same as each other.

Structural zero

Decide that certain things are impossible (probability zero).


Multiple input observations

Ergodic model

Every state is connected to every other state.

We simply concatenate all the observation sequences and train on them as one long input.

We do not get sufficient data to be able to reestimate the initial probabilities $\pi_i$ successfully.

Feed forward model

Not fully connected.

There is an ordered set of states.

One can only proceed at each time instant to the same or a higher numbered state.

We need to extend the reestimation formulae to work with a sequence of inputs.


Initialization of parameter values

If we would rather find the global maximum, try to start the HMM in a region of the parameter space that is near the global maximum.

Good initial estimates for the output parameters $B = \{b_{ijk}\}$ turn out to be particularly important, while random initial estimates for the parameters A and $\Pi$ are normally satisfactory.