Introduction to Hidden Markov Models
Introduction
• A Hidden Markov Model (HMM) is a simple temporal Bayesian
network.
• The structure of the network dictates how a system that varies over
time moves from one state to the next.
• Generally, we start at an initial state at time zero and the process
moves to each of its possible new states with a probability based
solely on the current state.
• In Markov models, the states are directly observable – they are the
values being measured.
• In Hidden Markov Models, the states are hidden, but lead to
observable values with certain probabilities.
• Set of states: {s_1, s_2, ..., s_N}
• Process moves from one state to another generating a sequence of states:
s_i1, s_i2, ..., s_ik, ...
• Markov chain property: probability of each subsequent state depends only on
what was the previous state:
P(s_ik | s_i1, s_i2, ..., s_ik-1) = P(s_ik | s_ik-1)
• To define a Markov model, the following probabilities have to be specified:
transition probabilities a_ij = P(s_i | s_j) and initial probabilities π_i = P(s_i)
Firstly, Markov Models
[Diagram: two-state Markov chain over ‘Rain’ and ‘Dry’ with the transition probabilities listed below]
• Two states : ‘Rain’ and ‘Dry’.
• Transition probabilities:
P(‘Rain’|‘Rain’)=0.3 , P(‘Dry’|‘Rain’)=0.7 ,
P(‘Rain’|‘Dry’)=0.2, P(‘Dry’|‘Dry’)=0.8
• Initial probabilities: say P(‘Rain’)=0.4 , P(‘Dry’)=0.6 .
Example of Markov Model
• By the Markov chain property, the probability of a state sequence can be found by the
formula:
• Suppose we want to calculate the probability of a sequence of states in our
example, {‘Dry’,’Dry’,’Rain’,’Rain’}.
P({‘Dry’,’Dry’,’Rain’,’Rain’})
= P(‘Rain’|’Rain’) P(‘Rain’|’Dry’) P(‘Dry’|’Dry’) P(‘Dry’)
= 0.3*0.2*0.8*0.6 = 0.0288
Calculation of sequence probability
P(s_i1, s_i2, ..., s_ik)
= P(s_ik | s_i1, s_i2, ..., s_ik-1) P(s_i1, s_i2, ..., s_ik-1)
= P(s_ik | s_ik-1) P(s_i1, s_i2, ..., s_ik-1)
= ...
= P(s_ik | s_ik-1) P(s_ik-1 | s_ik-2) ... P(s_i2 | s_i1) P(s_i1)
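To make the formula concrete, here is a minimal Python sketch of the Rain/Dry chain from the example above; the names (initial, transition, sequence_probability) are illustrative, not part of the original slides.

```python
# Rain/Dry Markov chain from the example above.
initial = {"Rain": 0.4, "Dry": 0.6}
transition = {                      # transition[prev][next] = P(next | prev)
    "Rain": {"Rain": 0.3, "Dry": 0.7},
    "Dry":  {"Rain": 0.2, "Dry": 0.8},
}

def sequence_probability(states):
    """P(s1, ..., sk) = P(s1) * product of P(s_t | s_t-1), by the Markov chain property."""
    p = initial[states[0]]
    for prev, nxt in zip(states, states[1:]):
        p *= transition[prev][nxt]
    return p

print(sequence_probability(["Dry", "Dry", "Rain", "Rain"]))  # 0.0288
```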
[Diagram: HMM with hidden states ‘Low’ and ‘High’ and observations ‘Rain’ and ‘Dry’; transition and observation probabilities as listed below]
• Two states : ‘Low’ and ‘High’ atmospheric pressure.
• Two observations : ‘Rain’ and ‘Dry’.
• Transition probabilities:
P(‘Low’|‘Low’)=0.3 , P(‘High’|‘Low’)=0.7 ,
P(‘Low’|‘High’)=0.2, P(‘High’|‘High’)=0.8
• Observation probabilities :
P(‘Rain’|‘Low’)=0.6 , P(‘Dry’|‘Low’)=0.4 , P(‘Rain’|‘High’)=0.4 ,
P(‘Dry’|‘High’)=0.6 .
• Initial probabilities: say P(‘Low’)=0.4 , P(‘High’)=0.6 .
Example of Hidden Markov Model
• Suppose we want to calculate the probability of a sequence of observations in
our example, {‘Dry’,’Rain’}.
• Consider all possible hidden state sequences:
P({‘Dry’,’Rain’})
= P({‘Dry’,’Rain’} , {‘Low’,’Low’}) + P({‘Dry’,’Rain’} ,
{‘Low’,’High’}) + P({‘Dry’,’Rain’} , {‘High’,’Low’}) +
P({‘Dry’,’Rain’} , {‘High’,’High’})
where the first term is:
P({‘Dry’,’Rain’} , {‘Low’,’Low’})
= P({‘Dry’,’Rain’} | {‘Low’,’Low’}) P({‘Low’,’Low’})
= P(‘Dry’|’Low’)P(‘Rain’|’Low’) P(‘Low’)P(‘Low’|’Low’)
= 0.4*0.6*0.4*0.3 = 0.0288
Calculation of observation sequence probability
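To make the summation over hidden state sequences concrete, here is a minimal brute-force Python sketch of the Low/High model above (using P(‘Dry’|‘High’)=0.6 so the observation probabilities sum to one); all names are illustrative.

```python
from itertools import product

# HMM from the example: hidden states Low/High, observations Rain/Dry.
initial    = {"Low": 0.4, "High": 0.6}
transition = {"Low":  {"Low": 0.3, "High": 0.7},
              "High": {"Low": 0.2, "High": 0.8}}
emission   = {"Low":  {"Rain": 0.6, "Dry": 0.4},
              "High": {"Rain": 0.4, "Dry": 0.6}}

def observation_probability(obs):
    """P(obs) = sum over all hidden state sequences of P(obs, states)."""
    total = 0.0
    for states in product(initial, repeat=len(obs)):
        p = initial[states[0]] * emission[states[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= transition[states[t - 1]][states[t]] * emission[states[t]][obs[t]]
        total += p
    return total

print(observation_probability(["Dry", "Rain"]))
```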
How HMM Works
• HMM is a stochastic generative
model for sequences
• Defined by
– finite set of states S
– finite alphabet A
– transition prob matrix T
– emission prob matrix E
• Move from state to state according to T while emitting
symbols according to E
[Diagram: states s1, s2, ..., sk emitting symbols a1, a2, ...]
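A small Python sketch of this generative view, under the dictionary-based representation used in the earlier examples; the function name generate is illustrative only.

```python
import random

def generate(initial, transition, emission, length):
    """Sample a hidden state path and the symbols it emits from an HMM."""
    states, symbols = [], []
    state = random.choices(list(initial), weights=list(initial.values()))[0]
    for _ in range(length):
        states.append(state)
        # Emit a symbol according to E, then move to the next state according to T.
        symbols.append(random.choices(list(emission[state]),
                                      weights=list(emission[state].values()))[0])
        state = random.choices(list(transition[state]),
                               weights=list(transition[state].values()))[0]
    return states, symbols

# e.g. generate(initial, transition, emission, 5) with the Low/High model above
```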
• Game:
– You bet £1
– You roll your die
– Casino rolls their die
– Highest number wins £2
• Question: Suppose we played 2
games, and the sequence of rolls
was 1, 6, 2, 6. Have we been
cheated?
Example: Dishonest Casino
• Casino has two dice:
– Fair die
• P(i) = 1/6, i = 1..6
– Loaded die
• P(i) = 1/10, i = 1..5
• P(i) = 1/2, i = 6
• Casino switches between fair &
loaded die with probability 1/2.
• Initially, the die is always fair
“Visualisation” of Dishonest Casino
1, 6, 2, 6?
We were probably cheated...
P(1, 6, 2, 6 along the path Fair, Loaded, Fair, Loaded) =
E(1|Fair) * T(?, Fair) *
E(6|Loaded) * T(Fair, Loaded) *
E(2|Fair) * T(Loaded, Fair) *
E(6|Loaded) * T(Fair, Loaded)
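A quick check of this intuition, as a hedged Python sketch: it assumes, as the slides state, that the die starts fair and that staying and switching each have probability 1/2, and it compares the suspicious alternating path against the all-Fair path.

```python
# Dishonest casino: fair and loaded dice, switch probability 1/2, start fair.
emission   = {"Fair":   {i: 1 / 6 for i in range(1, 7)},
              "Loaded": {**{i: 1 / 10 for i in range(1, 6)}, 6: 1 / 2}}
transition = {"Fair":   {"Fair": 0.5, "Loaded": 0.5},
              "Loaded": {"Fair": 0.5, "Loaded": 0.5}}

def path_probability(rolls, path):
    """Joint probability of the observed rolls along a given hidden die path."""
    p = emission[path[0]][rolls[0]]               # die is always fair initially
    for t in range(1, len(rolls)):
        p *= transition[path[t - 1]][path[t]] * emission[path[t]][rolls[t]]
    return p

rolls = [1, 6, 2, 6]
print(path_probability(rolls, ["Fair", "Loaded", "Fair", "Loaded"]))  # ~8.7e-4
print(path_probability(rolls, ["Fair", "Fair", "Fair", "Fair"]))      # ~9.6e-5
```

Under these assumptions the alternating Fair/Loaded path is roughly nine times more likely than the all-Fair explanation, which is why the sequence 1, 6, 2, 6 looks like cheating.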
• Evaluation problem. Given the HMM M=(A, B, π) and the observation sequence O=o1
o2 ... oK , calculate the probability that model M has generated sequence O.
• Decoding problem. Given the HMM M=(A, B, π) and the observation sequence O=o1
o2 ... oK , calculate the most likely sequence of hidden states si that produced this observation
sequence O.
• Learning problem. Given some training observation sequences O=o1 o2 ... oK and the general
structure of the HMM (numbers of hidden and visible states), determine the HMM parameters
M=(A, B, π) that best fit the training data.
O=o1...oK denotes a sequence of observations ok ∈ {v1,…,vM}.
Main uses of HMMs:
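The Evaluation problem is normally solved without enumerating every hidden path: the forward algorithm computes the same sum by dynamic programming. Below is a minimal Python sketch using the same dictionary-based model representation as the earlier examples; names are illustrative, not part of the slides.

```python
def forward_probability(obs, initial, transition, emission):
    """Evaluation problem: P(obs | model) via the forward algorithm."""
    # alpha[s] = P(o1..ot, state at time t is s)
    alpha = {s: initial[s] * emission[s][obs[0]] for s in initial}
    for o in obs[1:]:
        alpha = {s: sum(alpha[prev] * transition[prev][s] for prev in alpha)
                    * emission[s][o]
                 for s in initial}
    return sum(alpha.values())

# With the Low/High weather model it agrees with the brute-force enumeration above.
```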
• Typed word recognition; assume all characters are separated.
• Character recognizer outputs the probability of the image being a
particular character, P(image|character).
[Figure: example recognizer output probabilities for characters ‘a’ … ‘z’, e.g. 0.5, 0.03, 0.005, …, 0.31]
Word recognition example(1).
[Diagram: hidden state = character, observation = character image]
• If a lexicon is given, we can construct a separate HMM model
for each lexicon word.
Amherst: a → m → h → e → r → s → t
Buffalo: b → u → f → f → a → l → o
• Here recognition of the word image is equivalent to the problem
of evaluating a few HMM models.
• This is an application of the Evaluation problem.
Word recognition example(3).
• We can construct a single HMM for all words.
• Hidden states = all characters in the alphabet.
• Transition probabilities and initial probabilities are calculated
from the language model.
• Observations and observation probabilities are as before.
[Diagram: a single HMM whose hidden states are the characters of the lexicon words (a, m, h, e, r, s, t, b, u, f, l, o), with transitions between them]
• Here we have to determine the best sequence of hidden states,
the one that most likely produced the word image.
• This is an application of the Decoding problem.
Word recognition example(4).
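Decoding is typically done with the Viterbi algorithm rather than by scoring every possible path. Here is a minimal Python sketch in the same dictionary-based representation used earlier; it is an illustrative sketch, not the slides' own implementation.

```python
def viterbi(obs, initial, transition, emission):
    """Decoding problem: most likely hidden state sequence for obs."""
    # best[s] = (probability, path) of the best state path ending in state s
    best = {s: (initial[s] * emission[s][obs[0]], [s]) for s in initial}
    for o in obs[1:]:
        new = {}
        for s in initial:
            p, path = max((best[prev][0] * transition[prev][s], best[prev][1])
                          for prev in best)
            new[s] = (p * emission[s][o], path + [s])
        best = new
    return max(best.values())   # (probability of best path, best path)
```

For the word-recognition HMM, obs would be the sequence of character images and the returned path the most likely character sequence.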
What makes a good HMM problem space?
Characteristics:
• Classification problems
There are two main types of output from an HMM:
– Scoring of sequences
• For example, protein family modelling
– Labeling of observations within a sequence
• For example, identifying genes in a particular sequence
HMM Requirements
So you’ve decided you want to build an HMM,
here’s what you need:
• An architecture
– Probably the hardest part
– Should be sound & easy to interpret
• A well-defined success measure
– Necessary for any form of machine learning
HMM Requirements
Continued
• Training data
– Labeled or unlabeled – it depends
• You do not always need a labeled training set to do
observation labeling, but it helps
– Amount of training data needed is:
• Directly proportional to the number of free parameters in the
model
• Inversely proportional to the size of the training sequences
HMM Advantages
• Statistical Grounding
– Statisticians are comfortable with the theory behind hidden
Markov models
– Freedom to manipulate the training and verification processes
– Mathematical / theoretical analysis of the results and processes
– HMMs are still very powerful modeling tools – far more powerful
than many statistical methods
HMM Advantages continued
• Modularity
– HMMs can be combined into larger HMMs
• Transparency of the Model
– Assuming an architecture with a good design
– People can read the model and make sense of it
– The model itself can help increase understanding
HMM Advantages continued
• Incorporation of Prior Knowledge
– Incorporate prior knowledge into the architecture
– Initialize the model close to something believed to be correct
– Use prior knowledge to constrain training process
HMM Disadvantages
• Markov Chains
– States are supposed to be independent
– P(y) must be independent of P(x), and vice versa
– This usually isn’t true
– Can get around it when relationships are local
HMM Disadvantages
continued
• …and then there are the standard Machine Learning Problems
• Watch out for local maxima
– Model may not converge to a truly optimal parameter set for a
given training set
• Avoid over-fitting
– You’re only as good as your training set
– More training is not always good
HMM Disadvantages
continued
• Speed!!!
– Almost everything one does in an HMM involves: “enumerating
all possible paths through the model”
– There are efficient (dynamic programming) ways to do this
– Still slow in comparison to other methods
Conclusions
• HMMs have problems where they excel, and problems where they
do not
• You should consider using one if:
– Problem can be phrased as classification
– Observations are ordered
– The observations follow some sort of grammatical structure
(optional)
Conclusions
Advantages:
– Statistics
– Modularity
– Transparency
– Prior Knowledge
Disadvantages:
– State independence
– Over-fitting
– Local maxima
– Speed