Hidden Markov Models (II): Log-Likelihood (Forward Algorithm)
CMSC 473/673, UMBC
October 11th, 2017
Course Announcement: Assignment 2
Due next Saturday, 10/21 at 11:59 PM (~10 days)
Any questions?
Recap from last time…
Expectation Maximization (EM)
A two-step, iterative algorithm:
0. Assume some value for your parameters
1. E-step: count under uncertainty (estimated counts), assuming these parameters
2. M-step: maximize log-likelihood, assuming these uncertain counts
Counting Requires Marginalizing
$p(w) = p(z_1, w) + p(z_2, w) + p(z_3, w) + p(z_4, w) = \sum_{i=1}^{4} p(z_i, w)$
E-step: count under uncertainty, assuming these parameters
break into 4 disjoint pieces
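As a two-line Python sketch of this marginalization (the joint values below are hypothetical, chosen only for illustration):

# p(w) = sum over the 4 disjoint pieces p(z_i, w)
joint = {("z1", "w"): 0.10, ("z2", "w"): 0.05,   # hypothetical joint probabilities
         ("z3", "w"): 0.20, ("z4", "w"): 0.01}
p_w = sum(joint[(z, "w")] for z in ("z1", "z2", "z3", "z4"))
print(p_w)  # 0.36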
Agenda
HMM Motivation (Part of Speech) and Brief Definition
What is Part of Speech?
HMM Detailed Definition
HMM Tasks
Parts of Speech
Adapted from Luke Zettlemoyer

Open class words:
- Nouns: milk, cat, cats, bread, UMBC, Baltimore
- Verbs: speak, give, run (intransitive, transitive, ditransitive); modals and auxiliaries: can, do, may
- Adjectives: happy, large, red, wettest; subsective vs. non-subsective (e.g., fake, would-be), Kamp & Partee (1995)
- Adverbs: recently, happily, then, there (location)

Closed class words:
- Pronouns: I, you
- Determiners: a, the, every, what
- Prepositions: in, under, top
- Particles: (set) up, (call) off, so (far), not
- Conjunctions: and, or, if, because
- Numbers: one, 1,324
- Existential "there"
Penn Treebank Part of Speech
SLP: Chapter 10
http://universaldependencies.org/
Hidden Markov Models: Part of Speech

p(British Left Waffles on Falkland Islands)
(i):  Adjective Noun Verb Prep Noun Noun
(ii): Noun Verb Noun Prep Noun Noun

Class-based model: $p(w_i \mid z_i)$
Bigram model of the classes: $p(z_i \mid z_{i-1})$
Model all class sequences: $\sum_{z_1, \dots, z_N} p(z_1, w_1, z_2, w_2, \dots, z_N, w_N)$
Hidden Markov Model

Goal: maximize the (log-)likelihood

$p(z_1, w_1, z_2, w_2, \dots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$

In practice, we don't actually observe these z values; we just see the words w.
If we knew the probability parameters, then we could estimate z and evaluate the likelihood... but we don't! :(
If we did observe z, estimating the probability parameters would be easy... but we don't! :(
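To make the factorization concrete, here is a minimal Python sketch (my illustration, not the slides'): it scores one tagged sequence. The table values are the ones used in the worked example later in the deck; the names trans, emit, and joint_prob are mine.

# Joint probability of one tagged sequence:
# p(z, w) = prod_i p(z_i | z_{i-1}) * p(w_i | z_i), with z_0 = "start"
trans = {("start", "N"): 0.7, ("start", "V"): 0.2, ("start", "end"): 0.1,
         ("N", "N"): 0.15, ("N", "V"): 0.8,  ("N", "end"): 0.05,
         ("V", "N"): 0.6,  ("V", "V"): 0.35, ("V", "end"): 0.05}
emit = {"N": {"w1": 0.7, "w2": 0.2, "w3": 0.05, "w4": 0.05},
        "V": {"w1": 0.2, "w2": 0.6, "w3": 0.1,  "w4": 0.1}}

def joint_prob(states, words):
    prob, prev = 1.0, "start"
    for z, w in zip(states, words):
        prob *= trans[(prev, z)] * emit[z][w]   # transition, then emission
        prev = z
    return prob

print(joint_prob(["N", "V"], ["w1", "w2"]))  # 0.7*0.7 * 0.8*0.6 = 0.2352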
Hidden Markov Model Terminology

Each $z_i$ can take the value of one of K latent states.
Transition and emission distributions do not change.

Q: How many different probability values are there with K states and V vocab items?
A: $VK$ emission values and $K^2$ transition values (e.g., K = 2 states and V = 4 words give 8 emission and 4 transition values)

$p(z_1, w_1, z_2, w_2, \dots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i \underbrace{p(w_i \mid z_i)}_{\text{emission}}\, \underbrace{p(z_i \mid z_{i-1})}_{\text{transition}}$
Hidden Markov Model Representation

$p(z_1, w_1, z_2, w_2, \dots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i \underbrace{p(w_i \mid z_i)}_{\text{emission}}\, \underbrace{p(z_i \mid z_{i-1})}_{\text{transition}}$

[Graphical model: a chain z1 -> z2 -> z3 -> z4 -> ..., with each z_i emitting w_i;
the transition edges carry $p(z_i \mid z_{i-1})$, the emission edges carry $p(w_i \mid z_i)$,
and $p(z_1 \mid z_0)$ is the initial starting distribution ("BOS").]

Each $z_i$ can take the value of one of K latent states.
Transition and emission distributions do not change.

Graphical Models: see 478/678.
Example: 2-state Hidden Markov Model as a Lattice

[Trellis: rows of states z_i = N and z_i = V at each time step i = 1..4, emitting w_1..w_4;
start edges $p(N \mid \text{start})$ and $p(V \mid \text{start})$; transition edges $p(N \mid N)$,
$p(V \mid N)$, $p(N \mid V)$, $p(V \mid V)$ between adjacent columns; emission edges
$p(w_i \mid N)$ and $p(w_i \mid V)$ at each step.]
Comparison of Joint Probabilities

Unigram Language Model:
$p(w_1, w_2, \dots, w_N) = p(w_1)\, p(w_2) \cdots p(w_N) = \prod_i p(w_i)$

Unigram Class-based Language Model ("K" coins):
$p(z_1, w_1, z_2, w_2, \dots, z_N, w_N) = p(z_1)\, p(w_1 \mid z_1) \cdots p(z_N)\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i)$

Hidden Markov Model:
$p(z_1, w_1, z_2, w_2, \dots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$
Agenda
HMM Motivation (Part of Speech) and Brief Definition
What is Part of Speech?
HMM Detailed Definition
HMM Tasks
Hidden Markov Model Tasks

Calculate the (log-)likelihood of an observed sequence w1, ..., wN
Calculate the most likely sequence of states (for an observed sequence)
Learn the emission and transition parameters

$p(z_1, w_1, z_2, w_2, \dots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i \underbrace{p(w_i \mid z_i)}_{\text{emission}}\, \underbrace{p(z_i \mid z_{i-1})}_{\text{transition}}$
HMM Likelihood Task

Marginalize over all latent-sequence joint likelihoods:

$p(w_1, w_2, \dots, w_N) = \sum_{z_1, \dots, z_N} p(z_1, w_1, z_2, w_2, \dots, z_N, w_N)$

Q: In a K-state HMM for a length-N observation sequence, how many summands (different latent sequences) are there?
A: $K^N$

Goal: find a way to compute this exponential sum efficiently (in polynomial time).
Like in language modeling, you need to model when to stop generating.
This ending state is generally not included in “K.”
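A brute-force sketch (again my illustration, not the slides') makes the $K^N$ blow-up concrete: it enumerates every latent sequence with itertools.product and sums the joint probabilities, reusing the hypothetical trans, emit, and joint_prob from the sketch above (EOS ignored here).

from itertools import product

def brute_force_likelihood(words, states=("N", "V")):
    # Sum p(z, w) over all K^N latent sequences -- exponential in N.
    return sum(joint_prob(list(z_seq), words)
               for z_seq in product(states, repeat=len(words)))

print(brute_force_likelihood(["w1", "w2", "w3", "w4"]))  # 2^4 = 16 summands

Feasible here, but hopeless for realistic N; the forward algorithm derived below computes the same sum in polynomial time.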
2 (3)-State HMM Likelihood
[Same two-state N/V trellis as above.]
Q: What are the latent sequences here (EOS excluded)?
2 (3)-State HMM Likelihood

[Trellis as above.]

A: all $2^4 = 16$ assignments of N and V to the four positions:
(N, w1), (N, w2), (N, w3), (N, w4)
(N, w1), (N, w2), (N, w3), (V, w4)
(N, w1), (N, w2), (V, w3), (N, w4)
(N, w1), (N, w2), (V, w3), (V, w4)
(N, w1), (V, w2), (N, w3), (N, w4)
(N, w1), (V, w2), (N, w3), (V, w4)
(N, w1), (V, w2), (V, w3), (N, w4)
(N, w1), (V, w2), (V, w3), (V, w4)
(V, w1), (N, w2), (N, w3), (N, w4)
(V, w1), (N, w2), (N, w3), (V, w4)
… (six more)
2 (3)-State HMM Likelihood

[Trellis as above, now with concrete parameters.]

Transitions:          N      V    end
  start              .7     .2     .1
  N                 .15     .8    .05
  V                  .6    .35    .05

Emissions:           w1     w2     w3     w4
  N                  .7     .2    .05    .05
  V                  .2     .6     .1     .1
2 (3)-State HMM Likelihood

[Trellis and tables as above.]

Q: What's the probability of (N, w1), (V, w2), (V, w3), (N, w4)?
2 (3)-State HMM Likelihood

Q: What's the probability of (N, w1), (V, w2), (V, w3), (N, w4)?
A: (.7*.7) * (.8*.6) * (.35*.1) * (.6*.05) ≈ 0.000247
(each factor pairs a transition probability with the corresponding emission probability)
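Under the same assumed tables, the joint_prob sketch from earlier reproduces this number:

print(joint_prob(["N", "V", "V", "N"], ["w1", "w2", "w3", "w4"]))
# .49 * .48 * .035 * .03 = 0.00024696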
2 (3)-State HMM Likelihood

Q: What's the probability of (N, w1), (V, w2), (V, w3), (N, w4) with ending included (unique ending symbol "#")?
A: (.7*.7) * (.8*.6) * (.35*.1) * (.6*.05) * (.05*1) ≈ 0.0000123

Transitions: as above.

Emissions (extended):   w1     w2     w3     w4     #
  N                     .7     .2    .05    .05     0
  V                     .2     .6     .1     .1     0
  end                    0      0      0      0     1
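In the sketch, the ending just multiplies in the transition into "end" (the "#" emission from "end" has probability 1):

print(joint_prob(["N", "V", "V", "N"], ["w1", "w2", "w3", "w4"])
      * trans[("N", "end")])   # ≈ 0.0000123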
2 (3)-State HMM Likelihood

[Trellis and tables as above.]

Q: What's the probability of (N, w1), (V, w2), (N, w3), (N, w4)?
2 (3)-State HMM Likelihood

Q: What's the probability of (N, w1), (V, w2), (N, w3), (N, w4)?
A: (.7*.7) * (.8*.6) * (.6*.05) * (.15*.05) ≈ 0.0000529
2 (3)-State HMM Likelihood

Q: What's the probability of (N, w1), (V, w2), (N, w3), (N, w4) with ending (unique symbol "#")?
A: (.7*.7) * (.8*.6) * (.6*.05) * (.15*.05) * (.05*1) = 0.000002646

Tables as above (emissions extended with "#").
2 (3)-State HMM Likelihood

[The trellis paths for (N, V, N, N) and (N, V, V, N), shown side by side. Both begin with
$p(N \mid \text{start})\, p(w_1 \mid N)\, p(V \mid N)\, p(w_2 \mid V)$ and only then diverge.]

Up until here, all the computation was the same.
Let's reuse what computations we can.
Reusing Computation: Attempt #1

[The same two paths in the trellis, with running products:]

(.7*.7)
(.7*.7) * (.8*.6)
(.7*.7) * (.8*.6) * (.35*.1)    continuing to (N, V, V, ...)
(.7*.7) * (.8*.6) * (.6*.05)    continuing to (N, V, N, ...)

Tables as above.
Reusing Computation: Attempt #1

Name the shared prefix: aN[2, V] = (.7*.7) * (.8*.6).
Then the two paths continue as
  aN[2, V] * (.35*.1)
  aN[2, V] * (.6*.05)

Remember: these are only two of the 16 paths through the trellis.

Tables as above.
Reusing Computation: Attempt #1

Pushing the naming one step earlier: a[1, N] = (.7*.7), and aN[2, V] = a[1, N] * (.8*.6).

These aP(t, s) values represent the probability of a state s at time t in a particular
shared path P ("NV" is the path prefix here).

Remember: these are only two of the 16 paths through the trellis.
Reusing Computation: Attempt #1

How do we handle the other 14 paths?
Marginalize at each time step.
Reusing Computation: Attempt #1

So far, we've only considered one particular shared path: "(AB) *"

aAB(i, *) = aA(i-1, B) * p(* | B) * p(obs at i | *)

[Grid of states A, B, C at times i-2, i-1, i, with the A -> B -> * path highlighted.]

What about any of the other paths (AAB, ACB, etc.)?
Reusing Computation: Attempt #2

[Grid of states A, B, C at times i-2, i-1, i.]

Let's first consider "any shared path ending with B (AB, BB, or CB), then B",
marginalizing across the previous hidden state values:

$\alpha(i, B) = \sum_{s} \alpha(i-1, s) \cdot p(B \mid s) \cdot p(\text{obs at } i \mid B)$

(Compare attempt #1, which tracked a single path: aAB(i, B) = aA(i-1, B) * p(B | B) * p(obs at i | B).)

Computing α at time i-1 will correctly incorporate paths through time i-2:
we correctly obey the Markov property.
Forward Probability

$\alpha(i, B) = \sum_{s'} \alpha(i-1, s') \cdot p(B \mid s') \cdot p(\text{obs at } i \mid B)$

α(i, B) is the total probability of all paths to that state B from the beginning.
Forward Probability

α(i, s) is the total probability of all paths:
1. that start from the beginning
2. that end (currently) in s at step i
3. that emit the observation obs at i

$\alpha(i, s) = \sum_{s'} \alpha(i-1, s') \cdot p(s \mid s') \cdot p(\text{obs at } i \mid s)$

α(i-1, s'): what's the total probability up until now?
p(s | s'): what are the immediate ways to get into state s?
each summand: how likely is it to get into state s this way?
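Instantiated with the concrete tables of the running example (the same numbers appear in the next slide):

$\alpha(2, V) = \alpha(1, N)\, p(V \mid N)\, p(w_2 \mid V) + \alpha(1, V)\, p(V \mid V)\, p(w_2 \mid V) = .49 \cdot .8 \cdot .6 + .04 \cdot .35 \cdot .6 = .2352 + .0084 = .2436$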
2 (3)-State HMM Likelihood with Forward Probabilities

[Trellis as above; tables as above.]

α[1, N] = (.7*.7) = .49
α[1, V] = (.2*.2) = .04
α[2, N] = α[1, N] * (.15*.2) + α[1, V] * (.6*.2)  = .0195
α[2, V] = α[1, N] * (.8*.6)  + α[1, V] * (.35*.6) = .2436
α[3, N] = α[2, V] * (.6*.05) + α[2, N] * (.15*.05)
α[3, V] = α[2, V] * (.35*.1) + α[2, N] * (.8*.1)
...
Forward Algorithm

Use dynamic programming to build the α table left-to-right.

α: a 2D table, (N+2) × K*
  N+2: number of observations (+2 for the BOS & EOS symbols)
  K*: number of states
α = double[N+2][K*]
α[0][*] = 1.0
for(i = 1; i ≤ N+1; ++i) {
  for(state = 0; state < K*; ++state) {
    pobs = pemission(obsi | state)
    for(old = 0; old < K*; ++old) {
      pmove = ptransition(state | old)
      α[i][state] += α[i-1][old] * pobs * pmove
    }
  }
}
we still need to learn these emission and transition parameters (via EM)
Q: What do we return? (How do we return the likelihood of the sequence?)
A: α[N+1][end]
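A runnable Python rendering of this pseudocode (my sketch, using the example tables from the earlier snippets; function and variable names are mine):

def forward_likelihood(words, states, trans, emit):
    # alpha[i][s]: total probability of all paths that end in state s
    # after emitting words[:i]; position 0 plays the role of BOS ("start").
    alpha = [{s: 0.0 for s in states} for _ in range(len(words) + 1)]
    for s in states:
        alpha[1][s] = trans[("start", s)] * emit[s][words[0]]
    for i in range(2, len(words) + 1):
        for s in states:
            alpha[i][s] = emit[s][words[i - 1]] * sum(
                alpha[i - 1][old] * trans[(old, s)] for old in states)
    # EOS step: every surviving path transitions into "end".
    return sum(alpha[len(words)][old] * trans[(old, "end")] for old in states)

print(forward_likelihood(["w1", "w2", "w3", "w4"], ("N", "V"), trans, emit))
# intermediate values match the slides: alpha[1]["N"]=.49, alpha[1]["V"]=.04,
# alpha[2]["N"]=.0195, alpha[2]["V"]=.2436

Up to the EOS factor handled here, this computes exactly the sum that brute_force_likelihood enumerates, but in O(N·K²) time instead of O(K^N).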
Interactive HMM Example
https://goo.gl/rbHEoc (Jason Eisner, 2002)
Original: http://www.cs.jhu.edu/~jason/465/PowerPoint/lect24-hmm.xls
Forward Algorithm in Log-Space

α = double[N+2][K*]
α[0][*] = 0.0
(all other cells start at log 0, i.e., -∞, since logadd accumulates into them)
for(i = 1; i ≤ N+1; ++i) {
  for(state = 0; state < K*; ++state) {
    pobs = log pemission(obsi | state)
    for(old = 0; old < K*; ++old) {
      pmove = log ptransition(state | old)
      α[i][state] = logadd(α[i][state], α[i-1][old] + pobs + pmove)
    }
  }
}

A first attempt:
$\operatorname{logadd}(lp, lq) = \log\big(\exp(lp) + \exp(lq)\big)$
this can still overflow! (why?)

The numerically stable version:
$\operatorname{logadd}(lp, lq) = \begin{cases} lp + \log\big(1 + \exp(lq - lp)\big), & lp \geq lq \\ lq + \log\big(1 + \exp(lp - lq)\big), & lq > lp \end{cases}$

scipy.misc.logsumexp
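As a final sketch (mine, not the slides'): the same recurrence in log space. Note that in current SciPy this helper lives at scipy.special.logsumexp; it assumes every table entry passed to math.log is nonzero.

import math
from scipy.special import logsumexp  # scipy.misc.logsumexp in older SciPy

def forward_log_likelihood(words, states, trans, emit):
    # log_alpha[i][s] = log alpha[i][s]; cells start at log 0 = -inf.
    log_alpha = [{s: -math.inf for s in states} for _ in range(len(words) + 1)]
    for s in states:
        log_alpha[1][s] = math.log(trans[("start", s)]) + math.log(emit[s][words[0]])
    for i in range(2, len(words) + 1):
        for s in states:
            log_alpha[i][s] = math.log(emit[s][words[i - 1]]) + logsumexp(
                [log_alpha[i - 1][old] + math.log(trans[(old, s)]) for old in states])
    return logsumexp([log_alpha[len(words)][old] + math.log(trans[(old, "end")])
                      for old in states])

print(forward_log_likelihood(["w1", "w2", "w3", "w4"], ("N", "V"), trans, emit))

For long sequences, the products of many small probabilities underflow to 0 in probability space; working in log space keeps the computation stable.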