Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Albert Gatt
Corpora and Statistical Methods
Lecture 8
Markov and Hidden Markov Models: Conceptual
Introduction
Part 2
In this lecture
We focus on (Hidden) Markov Models
conceptual intro to Markov Models
relevance to NLP
Hidden Markov Models
algorithms
Acknowledgement
Some of the examples in this lecture are taken from a tutorial
on HMMs by Wolgang Maass
Talking about the weather Suppose we want to predict tomorrow’s weather. The
possible predictions are: sunny foggy rainy
We might decide to predict tomorrow’s outcome based on earlier weather if it’s been sunny all week, it’s likelier to be sunny tomorrow than if it
had been rainy all week how far back do we want to go to predict tomorrow’s weather?
Statistical weather model Notation:
S: the state space, a set of possible values for the weather: {sunny, foggy, rainy}
(each state is identifiable by an integer i)
X: a sequence of random variables, each taking a value from S
these model weather over a sequence of days
t is an integer standing for time
(X1, X2, X3, ... XT) models the value of a series of random variables
each takes a value from S with a certain probability P(X=si)
the entire sequence tells us the weather over T days
Statistical weather model
If we want to predict the weather for day t+1, our model might
look like this:
E.g. P(weather tomorrow = sunny), conditional on the weather
in the past t days.
Problem: the larger t gets, the more calculations we have to
make.
)...|( 11 tkt XXsXP
Markov Properties I: Limited horizon
The probability that we’re in state si at time t+1 only
depends on where we were at time t:
Given this assumption, the probability of any sequence is
just:
)|()...|( 111 tittit XsXPXXsXP
)|(),...,( 1
1
1
ii
T
i
T XXPXXP
Markov Properties II: Time invariance
The probability of being in state si given the previous state does
not change over time:
)|()|( 121 XsXPXsXP itit
Concrete instantiation
Day t Day t+1
sunny rainy foggy
sunny 0.8 0.05 0.15
rainy 0.2 0.6 0.2
foggy 0.2 0.3 0.5
This is essentially a transition matrix, which gives us probabilities of going from one state to the other.
We can denote state transition probabilities as aij (prob. of going from state i to state j)
Graphical view
Components of the model:
1. states (s)
2. transitions
3. transition probabilities
4. initial probability distribution
for states
Essentially, a non-deterministic finite
state automaton.
Example continued
If the weather today (Xt) is sunny, what’s the probability that
tomorrow (Xt+1) is sunny and the day after (Xt+2) is rainy?
04.0
8.005.0
)|()|(
)|(),|(
)|,(
112
112
21
sXsXPsXrXP
sXsXPsXsXrXP
sXrXsXP
tttt
ttttt
ttt
Markov assumption
Formal definition
A Markov Model is a triple (S, , A) where:
S is the set of states
are the probabilities of being initially in some state
A are the transition probabilities
Hidden Markov Models
A slight variation on the example You’re locked in a room with no windows
You can’t observe the weather directly
You only observe whether the guy who brings you food is carrying an umbrella or not
Need a model telling you the probability of seeing the umbrella, given the weather
distinction between observations and their underlying emitting state.
Define:
Ot as an observation at time t
K = {+umbrella, -umbrella} as the possible outputs
We’re interested in P(Ot=k|Xt=si) i.e. p. of a given observation at t given that the underlying weather state at t is si
Symbol emission probabilities
weather Probability of umbrella
sunny 0.1
rainy 0.8
foggy 0.3
This is the hidden model, telling us the probability that Ot = k given that Xt = si
We assume that each underlying state Xt = si emits an observation with a given probability.
Using the hidden model
Model gives:P(Ot=k|Xt=si)
Then, by Bayes’ Rule we can compute: P(Xt=si|Ot=k)
Generalises easily to an entire sequence
)(
)()|()|(
kOP
sXPsXkOPkOsXP
t
ititttit
HMM in graphics
Circles indicate states
Arrows indicate probabilistic dependencies between states
HMM in graphics
Green nodes are hidden states
Each hidden state depends only on the previous state (Markov assumption)
Why HMMs?
HMMs are a way of thinking of underlying events
probabilistically generating surface events.
Example: Parts of speech
a POS is a class or set of words
we can think of language as an underlying Markov Chain of
parts of speech from which actual words are generated
(“emitted”)
So what are our hidden states here, and what are the
observations?
HMMs in POS Tagging
ADJ N VDET
Hidden layer (constructed through training)
Models the sequence of POSs in the training corpus
HMMs in POS Tagging
ADJ
tall
N
lady
V
is
DET
the
Observations are words.
They are “emitted” by their corresponding hidden state.
The state depends on its previous state.
Why HMMs
There are efficient algorithms to train HMMs using
Expectation Maximisation
General idea:
training data is assumed to have been generated by some HMM
(parameters unknown)
try and learn the unknown parameters in the data
Similar idea is used in finding the parameters of some n-gram
models, especially those that use interpolation.
Formalisation of a Hidden Markov model
Crucial ingredients (familiar) Underlying states: S = {s1,…,sN}
Output alphabet (observations): K = {k1,…,kM}
State transition probabilities:
A = {aij}, i,j Є S
State sequence: X = (X1,…,XT+1)
+ a function mapping each Xt to a state s
Output sequence: O = (O1,…,OT) where each ot Є K
Crucial ingredients (additional) Initial state probabilities:
Π = {πi}, i Є S
(tell us the initial probability of each state)
Symbol emission probabilities:
B = {bijk}, i,j Є S, k Є K
(tell us the probability b of seeing observation Ot=k, given that Xt=si and Xt+1 = sj)
Trellis diagram of an HMM
s1
s2
s3
a1,1
a1,2
a1,3
Trellis diagram of an HMM
s1
s2
s3
a1,1
a1,2
a1,3
o1 o2 o3Obs. seq:
time: t1 t2t3
Trellis diagram of an HMM
s1
s2
s3
a1,1
a1,2
a1,3
o1 o2 o3Obs. seq:
time: t1 t2t3
b1,1,k b1,1,k
b1,2,k
b1,3,k
The fundamental questions for HMMs
1. Given a model μ = (A, B, Π), how do we compute the likelihood of an observation P(O| μ)?
2. Given an observation sequence O, and model μ, which is the state sequence (X1,…,Xt+1) that best explains the observations?
This is the decoding problem
3. Given an observation sequence O, and a space of possible models μ = (A, B, Π), which model best explains the observed data?
Application of question 1 (ASR)
Given a model μ = (A, B, Π), how do we compute the
likelihood of an observation P(O| μ)?
Input of an ASR system: a continuous stream of sound
waves, which is ambiguous
Need to decode it into a sequence of phones.
is the input the sequence [n iy d] or [n iy]?
which sequence is the most probable?
Application of question 2 (POS Tagging) Given an observation sequence O, and model μ, which is the state sequence
(X1,…,Xt+1) that best explains the observations?
this is the decoding problem
Consider a POS Tagger
Input observation sequence:
I can read
need to find the most likely sequence of underlying POS tags:
e.g. is can a modal verb, or the noun?
how likely is it that can is a noun, given that the previous word is a pronoun?
Summary
HMMs are a way of representing:
sequences of observations arising from
sequences of states
states are the variables of interest, giving rise to the
observations
Next up:
algorithms for answering the fundamental questions about
HMMs