Ch 9. Markov Models
Kyung-Soo Han, Natural Language Processing Lab., Korea University
2000. 3. 25.
Contents
Markov models
Hidden Markov models
The three fundamental questions for HMMs
Finding the probability of an observation
Finding the best state sequence
Parameter estimation
HMMs: implementation, properties, and variants
Markov Models
$X = (X_1, \ldots, X_T)$: a sequence of random variables taking values in some finite set $S = \{s_1, \ldots, s_N\}$, the state space.
Markov properties
Limited horizon: $P(X_{t+1} = s_k \mid X_1, \ldots, X_t) = P(X_{t+1} = s_k \mid X_t)$
Time invariant (stationary): $P(X_{t+1} = s_k \mid X_t) = P(X_2 = s_k \mid X_1)$
If X has these Markov properties, X is said to be a Markov chain.
Stochastic transition matrix $A$: $a_{ij} = P(X_{t+1} = s_j \mid X_t = s_i)$, where $a_{ij} \ge 0$ for all $i, j$ and $\sum_{j=1}^{N} a_{ij} = 1$ for all $i$.
Probabilities of different initial states: $\pi_i = P(X_1 = s_i)$, where $\sum_{i=1}^{N} \pi_i = 1$.
Markov Models (Cont.)
Markov models can be used whenever one wants to model the probability of a linear sequence of events
word n-gram models, modeling valid phone sequences in speech recognition, sequences of speech acts in dialog systems
A Markov model can be thought of as a probabilistic finite-state automaton.
Probability of a sequence of states
mth order Markov model
m: # of previous states that we are using to predict the next state
n-gram model is equivalent to an (n-1)th order Markov model.
$P(X_1, \ldots, X_T) = P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_1, X_2) \cdots P(X_T \mid X_1, \ldots, X_{T-1})$
$= P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_2) \cdots P(X_T \mid X_{T-1})$
$= \pi_{X_1} \prod_{t=1}^{T-1} a_{X_t X_{t+1}}$
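As a small illustration of this product (not from the original slides), the sketch below scores a state sequence under a hypothetical two-state chain; the state names, $\pi$, and $A$ values are made up for the example.

```python
# Minimal sketch (assumed Python): probability of a visible Markov chain
# state sequence as pi[X_1] * prod_t a[X_t -> X_{t+1}].

def sequence_probability(pi, a, states):
    """pi: state -> initial prob; a: (state, state) -> transition prob."""
    p = pi[states[0]]
    for prev, nxt in zip(states, states[1:]):
        p *= a[(prev, nxt)]
    return p

# Hypothetical two-state chain, for illustration only.
pi = {"x": 1.0, "y": 0.0}
a = {("x", "x"): 0.4, ("x", "y"): 0.6,
     ("y", "x"): 0.9, ("y", "y"): 0.1}
print(sequence_probability(pi, a, ["x", "y", "x"]))  # 1.0 * 0.6 * 0.9 = 0.54
```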
Markov Models (Cont.)
P.319 Figure 9.1
$P(t, i, p) = P(X_1 = t)\, P(X_2 = i \mid X_1 = t)\, P(X_3 = p \mid X_2 = i) = 1.0 \times 0.3 \times 0.6 = 0.18$
Hidden Markov Models
In an HMM, you don't know the state sequence that the model passes through, but only some probabilistic function of it.
Emission probability for the observations: $b_{ijk} = P(O_t = k \mid X_t = s_i, X_{t+1} = s_j)$
Example: the crazy soft drink machine
Q: What is the probability of seeing the output sequence {lem, ice_t} if the machine always starts off in the cola-preferring state?
A: Consider all paths that might be taken through the HMM, and then sum over them:
$P(\text{lem, ice\_t}) = 0.3 \times 0.7 \times 0.1 \times 0.7 + 0.3 \times 0.7 \times 0.1 \times 0.3 + 0.3 \times 0.3 \times 0.7 \times 0.5 + 0.3 \times 0.3 \times 0.7 \times 0.5 = 0.084$
The Crazy Soft Drink Machine
State transition diagram: start → CP; CP → CP: 0.7, CP → IP: 0.3, IP → CP: 0.5, IP → IP: 0.5
Output probabilities:
      cola   iced tea (ice_t)   lemonade (lem)
CP    0.6    0.1                0.3
IP    0.1    0.7                0.2
Hidden Markov Models (Cont.)
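As a sanity check on the 0.084 path sum above, here is a brute-force sketch (assumed Python, not part of the original slides) that enumerates every hidden state path of the machine, emitting from the current state and then transitioning; the trailing transition after the last symbol mirrors the four-term sum and drops out of the total.

```python
# Brute-force path sum for P(lem, ice_t) when the machine starts in CP.
# Assumed sketch: at each time step the current state emits a symbol and the
# machine then moves on; the final transition sums to 1 over the last state,
# so it does not change the total.
from itertools import product

states = ["CP", "IP"]
trans = {("CP", "CP"): 0.7, ("CP", "IP"): 0.3,
         ("IP", "CP"): 0.5, ("IP", "IP"): 0.5}
emit = {"CP": {"cola": 0.6, "ice_t": 0.1, "lem": 0.3},
        "IP": {"cola": 0.1, "ice_t": 0.7, "lem": 0.2}}

obs = ["lem", "ice_t"]
total = 0.0
for tail in product(states, repeat=len(obs)):   # X_2, ..., X_{T+1}; X_1 = CP
    path = ("CP",) + tail
    p = 1.0
    for t, symbol in enumerate(obs):
        p *= emit[path[t]][symbol]              # emission from the state at time t
        p *= trans[(path[t], path[t + 1])]      # transition to the next state
    total += p
print(total)  # 0.084 (up to float rounding)
```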
Why use HMMs?
HMMs are useful when one can think of underlying events probabilistically generating surface events.
POS tagging (Chap. 10)
There exist efficient methods of training through use of the EM algorithm.
Given plenty of data that we assume to be generated by some HMM, this algorithm allows us to automatically learn the model parameters that best account for the observed data.
Linear interpolation of n-gram models
We can build an HMM with hidden states that represent the choice of whether to use the unigram, bigram, or trigram probabilities.
$P_{li}(w_n \mid w_{n-1}, w_{n-2}) = \lambda_1 P_1(w_n) + \lambda_2 P_2(w_n \mid w_{n-1}) + \lambda_3 P_3(w_n \mid w_{n-1}, w_{n-2})$, where $\sum_i \lambda_i = 1$.
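For concreteness, a minimal sketch of the interpolation formula itself (assumed Python; the component probabilities and weights below are hypothetical stand-ins for unigram, bigram, and trigram estimates trained elsewhere):

```python
# Sketch of P_li = l1 * P1(w) + l2 * P2(w | w1) + l3 * P3(w | w1, w2).
def interpolate(p_uni, p_bi, p_tri, lambdas):
    l1, l2, l3 = lambdas               # weights; must sum to 1
    return l1 * p_uni + l2 * p_bi + l3 * p_tri

# Hypothetical component estimates for a single word in context.
print(interpolate(0.01, 0.10, 0.40, (0.2, 0.3, 0.5)))  # 0.232
```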
Hidden Markov Models (Cont.)
Linear interpolation of n-gram models
P.323 Figure 9.3
Hidden Markov Models (Cont.)
General form of an HMM
An HMM is specified by a five-tuple $(S, K, \Pi, A, B)$:
$S = \{s_1, \ldots, s_N\}$: set of states
$K = \{k_1, \ldots, k_M\} = \{1, \ldots, M\}$: output alphabet
$\Pi = \{\pi_i\},\ i \in S$: initial state probabilities
$A = \{a_{ij}\},\ i, j \in S$: state transition probabilities
$B = \{b_{ijk}\},\ i, j \in S,\ k \in K$: symbol emission probabilities
$X = (X_1, \ldots, X_{T+1})$: state sequence, $X_t \in S$
$O = (o_1, \ldots, o_T)$: output sequence, $o_t \in K$
arc-emission HMM vs. state-emission HMM
arc-emission HMM: the symbol emitted at time t depends on both the state at time t and the state at time t+1.
state-emission HMM: the symbol emitted at time t depends just on the state at time t.
Hidden Markov Models (Cont.)
The Three Fundamental Questions for HMMs
1. Given a model $\mu = (A, B, \Pi)$, how do we efficiently compute how likely a certain observation is, that is, $P(O \mid \mu)$?
Used to decide which of competing models is best.
2. Given the observation sequence O and a model $\mu$, how do we choose a state sequence $(X_1, \ldots, X_{T+1})$ that best explains the observations?
Guess what path was probably followed through the Markov chain; used for classification (e.g. POS tagging).
3. Given an observation sequence O and a space of possible models found by varying the model parameters $\mu = (A, B, \Pi)$, how do we find the model that best explains the observed data?
Estimate model parameters from data.
Finding the probability of an observation
Decoding:
$P(O \mid \mu) = \sum_X P(O, X \mid \mu) = \sum_X P(O \mid X, \mu)\, P(X \mid \mu) = \sum_{X_1 \cdots X_{T+1}} \pi_{X_1} \prod_{t=1}^{T} a_{X_t X_{t+1}}\, b_{X_t X_{t+1} o_t}$
where
$P(O \mid X, \mu) = \prod_{t=1}^{T} P(o_t \mid X_t, X_{t+1}, \mu) = b_{X_1 X_2 o_1}\, b_{X_2 X_3 o_2} \cdots b_{X_T X_{T+1} o_T}$
$P(X \mid \mu) = \pi_{X_1}\, a_{X_1 X_2}\, a_{X_2 X_3} \cdots a_{X_T X_{T+1}}$
Direct evaluation requires $(2T+1) \cdot N^{T+1}$ multiplications.
The Three Fundamental Questions for HMMs (Cont.)
Trellis algorithms
The secret to avoiding this complexity is the general technique of dynamic programming.
Remember partial results rather than recomputing them.
Trellis algorithms
Make a square array of states versus time.
Compute the probabilities of being at each state at each time in terms of the probabilities for being in each state at the preceding time instant.
Finding the probability of an observation (Cont.)
Trellis algorithms (Cont.)
P.328 Figure 9.5
Finding the probability of an observation (Cont.)
The forward procedure
Forward variables: $\alpha_i(t) = P(o_1 o_2 \cdots o_{t-1}, X_t = i \mid \mu)$
$\alpha_i(t)$ is stored at $(s_i, t)$ in the trellis.
It expresses the total probability of ending up in state $s_i$ at time $t$.
It is calculated by summing probabilities for all incoming arcs at a trellis node.
1. Initialization: $\alpha_i(1) = \pi_i$, $1 \le i \le N$
2. Induction: $\alpha_j(t+1) = \sum_{i=1}^{N} \alpha_i(t)\, a_{ij}\, b_{ij o_t}$, $1 \le t \le T$, $1 \le j \le N$
3. Total: $P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(T+1)$
Requires $2N^2 T$ multiplications.
Finding the probability of an observation (Cont.)
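A compact sketch of this procedure (assumed Python, not from the slides). For simplicity it lets the emission depend only on the state the arc leaves from, i.e. $b_{ijk} = b_{ik}$, a special case of the arc-emission form above; the toy parameters are the crazy soft drink machine, and the result matches the 0.084 computed earlier.

```python
# Forward procedure sketch: alpha_i(1) = pi_i, then for each observation
# alpha_j(t+1) = sum_i alpha_i(t) * b[i][o_t] * a[i][j]; P(O | mu) is the
# sum of the final column of the trellis.

def forward(pi, a, b, obs):
    """pi[i]: initial probs, a[i][j]: transition probs, b[i][o]: emission probs."""
    states = list(pi)
    alpha = [dict(pi)]                                   # alpha_i(1) = pi_i
    for o in obs:                                        # induction, t = 1..T
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * b[i][o] * a[i][j] for i in states)
                      for j in states})
    return sum(alpha[-1].values()), alpha                # P(O | mu), trellis

pi = {"CP": 1.0, "IP": 0.0}
a = {"CP": {"CP": 0.7, "IP": 0.3}, "IP": {"CP": 0.5, "IP": 0.5}}
b = {"CP": {"cola": 0.6, "ice_t": 0.1, "lem": 0.3},
     "IP": {"cola": 0.1, "ice_t": 0.7, "lem": 0.2}}
prob, _ = forward(pi, a, b, ["lem", "ice_t"])
print(prob)  # ~0.084
```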
The forward procedure
P.329 Figure 9.6
Finding the probability of an observation (Cont.)
The backward procedure
Backward variables: $\beta_i(t) = P(o_t \cdots o_T \mid X_t = i, \mu)$
$\beta_i(t)$ is the total probability of seeing the rest of the observation sequence given that we were in state $s_i$ at time $t$.
Combination of forward and backward probabilities is vital for solving the third problem of parameter reestimation.
1. Initialization: $\beta_i(T+1) = 1$, $1 \le i \le N$
2. Induction: $\beta_i(t) = \sum_{j=1}^{N} a_{ij}\, b_{ij o_t}\, \beta_j(t+1)$, $1 \le t \le T$, $1 \le i \le N$
3. Total: $P(O \mid \mu) = \sum_{i=1}^{N} \pi_i\, \beta_i(1)$
Finding the probability of an observation (Cont.)
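A matching backward sketch (assumed Python, same source-state emission simplification and toy machine as the forward sketch above); summing $\pi_i \beta_i(1)$ recovers the same 0.084.

```python
# Backward procedure sketch: beta_i(T+1) = 1, then
# beta_i(t) = b[i][o_t] * sum_j a[i][j] * beta_j(t+1),
# and P(O | mu) = sum_i pi_i * beta_i(1).

def backward(pi, a, b, obs):
    states = list(pi)
    beta = [{i: 1.0 for i in states}]                 # beta_i(T+1) = 1
    for o in reversed(obs):                           # induction, t = T..1
        nxt = beta[0]
        beta.insert(0, {i: b[i][o] * sum(a[i][j] * nxt[j] for j in states)
                        for i in states})
    return sum(pi[i] * beta[0][i] for i in states), beta

pi = {"CP": 1.0, "IP": 0.0}
a = {"CP": {"CP": 0.7, "IP": 0.3}, "IP": {"CP": 0.5, "IP": 0.5}}
b = {"CP": {"cola": 0.6, "ice_t": 0.1, "lem": 0.3},
     "IP": {"cola": 0.1, "ice_t": 0.7, "lem": 0.2}}
prob, _ = backward(pi, a, b, ["lem", "ice_t"])
print(prob)  # ~0.084, matching the forward procedure
```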
Variable calculations
P.330 Table 9.2
Finding the probability of an observation (Cont.)
Combining them
$P(O \mid \mu) = \sum_{i=1}^{N} \alpha_i(t)\, \beta_i(t)$, for $1 \le t \le T+1$
since
$P(O, X_t = i \mid \mu) = P(o_1 \cdots o_T, X_t = i \mid \mu)$
$= P(o_1 \cdots o_{t-1}, X_t = i \mid \mu)\, P(o_t \cdots o_T \mid o_1 \cdots o_{t-1}, X_t = i, \mu)$
$= P(o_1 \cdots o_{t-1}, X_t = i \mid \mu)\, P(o_t \cdots o_T \mid X_t = i, \mu)$
$= \alpha_i(t)\, \beta_i(t)$
Finding the probability of an observation (Cont.)
Finding the best state sequence
Choosing the states individually
For each $t$, we would find the $X_t$ that maximizes $P(X_t \mid O, \mu)$.
The individually most likely state is
$\hat{X}_t = \arg\max_{1 \le i \le N} \gamma_i(t)$, $1 \le t \le T+1$, where
$\gamma_i(t) = P(X_t = i \mid O, \mu) = \frac{P(X_t = i, O \mid \mu)}{P(O \mid \mu)} = \frac{\alpha_i(t)\, \beta_i(t)}{\sum_{j=1}^{N} \alpha_j(t)\, \beta_j(t)}$
This quantity maximizes the expected number of states that will be guessed correctly.
However, it may yield a quite unlikely state sequence.
This is not the method that is normally used.
The Three Fundamental Questions for HMMs (Cont.)
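A self-contained sketch of this posterior choice (assumed Python, same simplified toy machine as in the earlier sketches): it runs the forward and backward passes, forms $\gamma_i(t)$, and picks the argmax state at each time.

```python
# Posterior ("individually most likely") state choice:
# gamma_i(t) = alpha_i(t) * beta_i(t) / P(O | mu), argmax over i for each t.

def posterior_states(pi, a, b, obs):
    states = list(pi)
    alpha = [dict(pi)]                                   # forward pass
    for o in obs:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * b[i][o] * a[i][j] for i in states)
                      for j in states})
    beta = [{i: 1.0 for i in states}]                    # backward pass
    for o in reversed(obs):
        nxt = beta[0]
        beta.insert(0, {i: b[i][o] * sum(a[i][j] * nxt[j] for j in states)
                        for i in states})
    total = sum(alpha[-1].values())                      # P(O | mu)
    gamma = [{i: alpha[t][i] * beta[t][i] / total for i in states}
             for t in range(len(obs) + 1)]
    return [max(g, key=g.get) for g in gamma]

pi = {"CP": 1.0, "IP": 0.0}
a = {"CP": {"CP": 0.7, "IP": 0.3}, "IP": {"CP": 0.5, "IP": 0.5}}
b = {"CP": {"cola": 0.6, "ice_t": 0.1, "lem": 0.3},
     "IP": {"cola": 0.1, "ice_t": 0.7, "lem": 0.2}}
print(posterior_states(pi, a, b, ["lem", "ice_t"]))  # ['CP', 'IP', 'CP']
```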
Viterbi algorithm
We want to find the most likely complete path, that is:
$\arg\max_X P(X \mid O, \mu) = \arg\max_X P(X, O \mid \mu)$
Define $\delta_j(t) = \max_{X_1 \cdots X_{t-1}} P(X_1 \cdots X_{t-1}, o_1 \cdots o_{t-1}, X_t = j \mid \mu)$.
This variable stores, for each point in the trellis, the probability of the most probable path that leads to that node.
$\psi_j(t)$ records the node of the incoming arc that led to this most probable path.
Finding the best state sequence (Cont.)
Viterbi algorithm (Cont.)
1. Initialization: $\delta_j(1) = \pi_j$, $1 \le j \le N$
2. Induction: $\delta_j(t+1) = \max_{1 \le i \le N} \delta_i(t)\, a_{ij}\, b_{ij o_t}$, $1 \le j \le N$
Store backtrace: $\psi_j(t+1) = \arg\max_{1 \le i \le N} \delta_i(t)\, a_{ij}\, b_{ij o_t}$, $1 \le j \le N$
3. Termination and path readout (by backtracking):
$\hat{X}_{T+1} = \arg\max_{1 \le i \le N} \delta_i(T+1)$
$\hat{X}_t = \psi_{\hat{X}_{t+1}}(t+1)$
$P(\hat{X}) = \max_{1 \le i \le N} \delta_i(T+1)$
Finding the best state sequence (Cont.)
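A compact sketch of these steps (assumed Python, with the same source-state emission simplification and toy machine used in the earlier sketches):

```python
# Viterbi sketch: delta_j(1) = pi_j; delta_j(t+1) = max_i delta_i(t)*b[i][o]*a[i][j];
# psi stores the best predecessor for backtracking.

def viterbi(pi, a, b, obs):
    states = list(pi)
    delta, psi = [dict(pi)], []
    for o in obs:                                        # induction, t = 1..T
        prev, d, back = delta[-1], {}, {}
        for j in states:
            best = max(states, key=lambda i: prev[i] * b[i][o] * a[i][j])
            d[j] = prev[best] * b[best][o] * a[best][j]
            back[j] = best
        delta.append(d)
        psi.append(back)
    last = max(states, key=lambda i: delta[-1][i])       # termination
    path = [last]
    for back in reversed(psi):                           # path readout
        path.insert(0, back[path[0]])
    return path, delta[-1][last]

pi = {"CP": 1.0, "IP": 0.0}
a = {"CP": {"CP": 0.7, "IP": 0.3}, "IP": {"CP": 0.5, "IP": 0.5}}
b = {"CP": {"cola": 0.6, "ice_t": 0.1, "lem": 0.3},
     "IP": {"cola": 0.1, "ice_t": 0.7, "lem": 0.2}}
print(viterbi(pi, a, b, ["lem", "ice_t"]))  # one best path: ['CP', 'IP', 'CP'], prob ~0.0315
```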
The third problem: Parameter estimation
There is no known analytic method to choose $\mu$ to maximize $P(O_{\text{training}} \mid \mu)$, but we can locally maximize it by an iterative hill-climbing algorithm.
Baum-Welch or Forward-Backward algorithm
Work out the probability of the observation sequence using some (perhaps randomly chosen) model.
We can see which state transitions and symbol emissions were probably used the most.
By increasing the probability of those, we can choose a revised model which gives a higher probability to the observation sequence.
Training!
$\arg\max_{\mu} P(O_{\text{training}} \mid \mu)$
The Three Fundamental Questions for HMMs (Cont.)
Baum-Welch algorithm
Probability of traversing a certain arc at time $t$ given observation sequence $O$:
$p_t(i, j) = P(X_t = i, X_{t+1} = j \mid O, \mu) = \frac{P(X_t = i, X_{t+1} = j, O \mid \mu)}{P(O \mid \mu)} = \frac{\alpha_i(t)\, a_{ij}\, b_{ij o_t}\, \beta_j(t+1)}{\sum_{m=1}^{N} \alpha_m(t)\, \beta_m(t)} = \frac{\alpha_i(t)\, a_{ij}\, b_{ij o_t}\, \beta_j(t+1)}{\sum_{m=1}^{N} \sum_{n=1}^{N} \alpha_m(t)\, a_{mn}\, b_{mn o_t}\, \beta_n(t+1)}$
Let $\gamma_i(t) = \sum_{j=1}^{N} p_t(i, j)$. Then
$\sum_{t=1}^{T} \gamma_i(t)$ = expected number of transitions from state $i$ in $O$
$\sum_{t=1}^{T} p_t(i, j)$ = expected number of transitions from state $i$ to $j$ in $O$
The third problem: Parameter estimation (Cont.)
Baum-Welch algorithm
P.334 Figure 9.7
The third problem: Parameter estimation (Cont.)
Baum-Welch algorithm (Cont.)
Begin with some model (perhaps preselected, perhaps just chosen randomly)
Run O through the current model to estimate the expectations of each model parameter.
Change the model to maximize the values of the paths that are used a lot.
Repeat this process, hoping to converge on optimal values for the model parameters $\mu$.
The third problem: Parameter estimation (Cont.)
Baum-Welch algorithm (Cont.)
Reestimation: from $\mu = (A, B, \Pi)$, derive $\hat{\mu} = (\hat{A}, \hat{B}, \hat{\Pi})$:
$\hat{\pi}_i$ = expected frequency in state $i$ at time $t = 1$ = $\gamma_i(1)$
$\hat{a}_{ij}$ = (expected # of transitions from state $i$ to $j$) / (expected # of transitions from state $i$) = $\frac{\sum_{t=1}^{T} p_t(i, j)}{\sum_{t=1}^{T} \gamma_i(t)}$
$\hat{b}_{ijk}$ = (expected # of transitions from $i$ to $j$ with $k$ observed) / (expected # of transitions from $i$ to $j$) = $\frac{\sum_{\{t:\, o_t = k,\ 1 \le t \le T\}} p_t(i, j)}{\sum_{t=1}^{T} p_t(i, j)}$
The reestimated model satisfies $P(O \mid \hat{\mu}) \ge P(O \mid \mu)$.
Continue reestimating the parameters until the results are no longer improving significantly.
This does not guarantee that we will find the best model: we may end up at a local maximum or a saddle point.
The third problem: Parameter estimation (Cont.)
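A single reestimation step, as a compact sketch (assumed Python). It keeps the source-state emission simplification of the earlier sketches, so the emission update is the state-emission analogue of the $\hat{b}_{ijk}$ formula above; the starting parameters below are an arbitrary guess, purely for illustration.

```python
# One Baum-Welch step: forward/backward passes, expected counts p_t(i,j) and
# gamma_i(t), then the reestimated pi, a, b.

def baum_welch_step(pi, a, b, obs):
    states, T = list(pi), len(obs)
    alpha = [dict(pi)]                                    # forward pass
    for o in obs:
        prev = alpha[-1]
        alpha.append({j: sum(prev[i] * b[i][o] * a[i][j] for i in states)
                      for j in states})
    beta = [{i: 1.0 for i in states}]                     # backward pass
    for o in reversed(obs):
        nxt = beta[0]
        beta.insert(0, {i: b[i][o] * sum(a[i][j] * nxt[j] for j in states)
                        for i in states})
    total = sum(alpha[-1].values())                       # P(O | mu)
    p = [{(i, j): alpha[t][i] * b[i][obs[t]] * a[i][j] * beta[t + 1][j] / total
          for i in states for j in states} for t in range(T)]
    gamma = [{i: sum(p[t][(i, j)] for j in states) for i in states}
             for t in range(T)]
    new_pi = dict(gamma[0])                               # pi_i = gamma_i(1)
    new_a = {i: {j: sum(p[t][(i, j)] for t in range(T)) /
                    sum(gamma[t][i] for t in range(T)) for j in states}
             for i in states}
    new_b = {i: {k: sum(gamma[t][i] for t in range(T) if obs[t] == k) /
                    sum(gamma[t][i] for t in range(T)) for k in b[i]}
             for i in states}
    return new_pi, new_a, new_b

# Arbitrary starting guess; one step shifts emission mass toward lem and ice_t.
pi = {"CP": 0.6, "IP": 0.4}
a = {"CP": {"CP": 0.6, "IP": 0.4}, "IP": {"CP": 0.4, "IP": 0.6}}
b = {"CP": {"cola": 0.5, "ice_t": 0.2, "lem": 0.3},
     "IP": {"cola": 0.3, "ice_t": 0.5, "lem": 0.2}}
print(baum_welch_step(pi, a, b, ["lem", "ice_t"]))
```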
Baum-Welch algorithm (Cont.)
P.336
The third problem: Parameter estimation (Cont.)
Implementation
Floating point underflow
The probabilities we are calculating are products of many very small numbers.
Work with logarithms
This also speeds up the computation.
Employ auxiliary scaling coefficients
Their values grow with the time t so that the probabilities multiplied by the scaling coefficient remain within the floating point range of the computer.
When the parameter values are reestimated, these scaling factors cancel out.
HMMs: Implementation, Properties, and Variants (Cont.)
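A minimal sketch of the scaling idea (assumed Python, same toy machine and simplified emission form as before): each forward column is renormalized to sum to 1 and the scaling factors are accumulated in log space, so $\log P(O \mid \mu)$ is recovered without underflow.

```python
# Scaled forward pass: after each induction step the column is renormalized;
# the logs of the scaling factors sum to log P(O | mu).
import math

def scaled_forward_logprob(pi, a, b, obs):
    states = list(pi)
    alpha = dict(pi)
    log_prob = 0.0
    for o in obs:
        alpha = {j: sum(alpha[i] * b[i][o] * a[i][j] for i in states)
                 for j in states}
        scale = sum(alpha.values())              # scaling coefficient for this t
        alpha = {j: alpha[j] / scale for j in states}
        log_prob += math.log(scale)
    return log_prob

pi = {"CP": 1.0, "IP": 0.0}
a = {"CP": {"CP": 0.7, "IP": 0.3}, "IP": {"CP": 0.5, "IP": 0.5}}
b = {"CP": {"cola": 0.6, "ice_t": 0.1, "lem": 0.3},
     "IP": {"cola": 0.1, "ice_t": 0.7, "lem": 0.2}}
print(math.exp(scaled_forward_logprob(pi, a, b, ["lem", "ice_t"])))  # ~0.084
```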
Variants
Epsilon or null transitions
State-emission model
Make the output distribution dependent just on a single state.
Large number of parameters that need to be estimated
Parameter tying: assume that the probability distributions at certain arcs or at certain states are the same as each other.
Structural zero
Decide that certain things are impossible (probability zero).
HMMs: Implementation, Properties, and Variants (Cont.)
Multiple input observations
Ergodic model
Every state is connected to every other state.
We simply concatenate all the observation sequences and train on them as one long input.
We do not get sufficient data to be able to reestimate the initial probabilities $\pi_i$ successfully.
Feed forward model
Not fully connected.
There is an ordered set of states.
One can only proceed at each time instant to the same or a higher numbered state.
We need to extend the reestimation formulae to work with a sequence of inputs.
HMMs: Implementation, Properties, and Variants (Cont.)
Initialization of parameter values
If we would rather find the global maximum, try to start the HMM in a region of the parameter space that is near the global maximum.
Good initial estimates for the output parameters $B = \{b_{ijk}\}$ turn out to be particularly important, while random initial estimates for the parameters $A$ and $\Pi$ are normally satisfactory.
HMMs: Implementation, Properties, and Variants (Cont.)