Hidden Markov Models (II): Log-Likelihood (Forward Algorithm)
CMSC 473/673, UMBC
October 11th, 2017
Course Announcement: Assignment 2
Due next Saturday, 10/21 at 11:59 PM (~10 days)
Any questions?
Recap from last time…
Expectation Maximization (EM)
A two-step, iterative algorithm:
0. Assume some value for your parameters
1. E-step: count under uncertainty (estimated counts), assuming these parameters
2. M-step: maximize log-likelihood, assuming these uncertain counts
Counting Requires Marginalizing
$p(w) = p(z_1, w) + p(z_2, w) + p(z_3, w) + p(z_4, w) = \sum_{i=1}^{4} p(z_i, w)$
E-step: count under uncertainty, assuming these parameters
break into 4 disjoint pieces
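As a two-line Python sketch of this marginalization (the joint values below are hypothetical, chosen only for illustration):

# p(w) = sum over the 4 disjoint pieces p(z_i, w)
joint = {("z1", "w"): 0.10, ("z2", "w"): 0.05,   # hypothetical joint probabilities
         ("z3", "w"): 0.20, ("z4", "w"): 0.01}
p_w = sum(joint[(z, "w")] for z in ("z1", "z2", "z3", "z4"))
print(p_w)  # 0.36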
Agenda
HMM Motivation (Part of Speech) and Brief Definition
What is Part of Speech?
HMM Detailed Definition
HMM Tasks
Parts of Speech
Adapted from Luke Zettlemoyer

Open class words:
- Nouns: milk, cat, cats, bread, UMBC, Baltimore
- Verbs: speak, give, run (intransitive, transitive, ditransitive); modals and auxiliaries: can, do, may
- Adjectives: happy, large, red, wettest; subsective vs. non-subsective (e.g., fake, would-be), Kamp & Partee (1995)
- Adverbs: recently, happily, then, there (location)

Closed class words:
- Pronouns: I, you
- Determiners: a, the, every, what
- Prepositions: in, under, top
- Particles: (set) up, (call) off, so (far), not
- Conjunctions: and, or, if, because
- Numbers: one, 1,324
- Existential "there"
Penn Treebank Part of Speech
SLP: Chapter 10
http://universaldependencies.org/
Hidden Markov Models: Part of Speech

p(British Left Waffles on Falkland Islands)
(i):  Adjective Noun Verb Prep Noun Noun
(ii): Noun Verb Noun Prep Noun Noun

Class-based model: $p(w_i \mid z_i)$
Bigram model of the classes: $p(z_i \mid z_{i-1})$
Model all class sequences: $\sum_{z_1, \dots, z_N} p(z_1, w_1, z_2, w_2, \dots, z_N, w_N)$
Hidden Markov Model

Goal: maximize the (log-)likelihood

$p(z_1, w_1, z_2, w_2, \dots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$

In practice, we don't actually observe these z values; we just see the words w.
If we knew the probability parameters, then we could estimate z and evaluate the likelihood... but we don't! :(
If we did observe z, estimating the probability parameters would be easy... but we don't! :(
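To make the factorization concrete, here is a minimal Python sketch (my illustration, not the slides'): it scores one tagged sequence. The table values are the ones used in the worked example later in the deck; the names trans, emit, and joint_prob are mine.

# Joint probability of one tagged sequence:
# p(z, w) = prod_i p(z_i | z_{i-1}) * p(w_i | z_i), with z_0 = "start"
trans = {("start", "N"): 0.7, ("start", "V"): 0.2, ("start", "end"): 0.1,
         ("N", "N"): 0.15, ("N", "V"): 0.8,  ("N", "end"): 0.05,
         ("V", "N"): 0.6,  ("V", "V"): 0.35, ("V", "end"): 0.05}
emit = {"N": {"w1": 0.7, "w2": 0.2, "w3": 0.05, "w4": 0.05},
        "V": {"w1": 0.2, "w2": 0.6, "w3": 0.1,  "w4": 0.1}}

def joint_prob(states, words):
    prob, prev = 1.0, "start"
    for z, w in zip(states, words):
        prob *= trans[(prev, z)] * emit[z][w]   # transition, then emission
        prev = z
    return prob

print(joint_prob(["N", "V"], ["w1", "w2"]))  # 0.7*0.7 * 0.8*0.6 = 0.2352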
Hidden Markov Model Terminology

Each $z_i$ can take the value of one of K latent states.
Transition and emission distributions do not change.

Q: How many different probability values are there with K states and V vocab items?
A: $VK$ emission values and $K^2$ transition values (e.g., K = 2 states and V = 4 words give 8 emission and 4 transition values)

$p(z_1, w_1, z_2, w_2, \dots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i \underbrace{p(w_i \mid z_i)}_{\text{emission}}\, \underbrace{p(z_i \mid z_{i-1})}_{\text{transition}}$
Hidden Markov Model Representation

$p(z_1, w_1, z_2, w_2, \dots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i \underbrace{p(w_i \mid z_i)}_{\text{emission}}\, \underbrace{p(z_i \mid z_{i-1})}_{\text{transition}}$

[Graphical model: a chain z1 -> z2 -> z3 -> z4 -> ..., with each z_i emitting w_i;
the transition edges carry $p(z_i \mid z_{i-1})$, the emission edges carry $p(w_i \mid z_i)$,
and $p(z_1 \mid z_0)$ is the initial starting distribution ("BOS").]

Each $z_i$ can take the value of one of K latent states.
Transition and emission distributions do not change.

Graphical Models: see 478/678.
Example: 2-state Hidden Markov Model as a Lattice

[Trellis: rows of states z_i = N and z_i = V at each time step i = 1..4, emitting w_1..w_4;
start edges $p(N \mid \text{start})$ and $p(V \mid \text{start})$; transition edges $p(N \mid N)$,
$p(V \mid N)$, $p(N \mid V)$, $p(V \mid V)$ between adjacent columns; emission edges
$p(w_i \mid N)$ and $p(w_i \mid V)$ at each step.]
Comparison of Joint Probabilities

Unigram Language Model:
$p(w_1, w_2, \dots, w_N) = p(w_1)\, p(w_2) \cdots p(w_N) = \prod_i p(w_i)$

Unigram Class-based Language Model ("K" coins):
$p(z_1, w_1, z_2, w_2, \dots, z_N, w_N) = p(z_1)\, p(w_1 \mid z_1) \cdots p(z_N)\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i)$

Hidden Markov Model:
$p(z_1, w_1, z_2, w_2, \dots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i p(w_i \mid z_i)\, p(z_i \mid z_{i-1})$
Agenda
HMM Motivation (Part of Speech) and Brief Definition
What is Part of Speech?
HMM Detailed Definition
HMM Tasks
Hidden Markov Model Tasks

Calculate the (log-)likelihood of an observed sequence w1, ..., wN
Calculate the most likely sequence of states (for an observed sequence)
Learn the emission and transition parameters

$p(z_1, w_1, z_2, w_2, \dots, z_N, w_N) = p(z_1 \mid z_0)\, p(w_1 \mid z_1) \cdots p(z_N \mid z_{N-1})\, p(w_N \mid z_N) = \prod_i \underbrace{p(w_i \mid z_i)}_{\text{emission}}\, \underbrace{p(z_i \mid z_{i-1})}_{\text{transition}}$
HMM Likelihood Task

Marginalize over all latent-sequence joint likelihoods:

$p(w_1, w_2, \dots, w_N) = \sum_{z_1, \dots, z_N} p(z_1, w_1, z_2, w_2, \dots, z_N, w_N)$

Q: In a K-state HMM for a length-N observation sequence, how many summands (different latent sequences) are there?
A: $K^N$

Goal: find a way to compute this exponential sum efficiently (in polynomial time).
Like in language modeling, you need to model when to stop generating.
This ending state is generally not included in “K.”
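A brute-force sketch (again my illustration, not the slides') makes the $K^N$ blow-up concrete: it enumerates every latent sequence with itertools.product and sums the joint probabilities, reusing the hypothetical trans, emit, and joint_prob from the sketch above (EOS ignored here).

from itertools import product

def brute_force_likelihood(words, states=("N", "V")):
    # Sum p(z, w) over all K^N latent sequences -- exponential in N.
    return sum(joint_prob(list(z_seq), words)
               for z_seq in product(states, repeat=len(words)))

print(brute_force_likelihood(["w1", "w2", "w3", "w4"]))  # 2^4 = 16 summands

Feasible here, but hopeless for realistic N; the forward algorithm derived below computes the same sum in polynomial time.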
2 (3)-State HMM Likelihood
[Same two-state N/V trellis as above.]
Q: What are the latent sequences here (EOS excluded)?
2 (3)-State HMM Likelihood

[Trellis as above.]

A: all $2^4 = 16$ assignments of N and V to the four positions:
(N, w1), (N, w2), (N, w3), (N, w4)
(N, w1), (N, w2), (N, w3), (V, w4)
(N, w1), (N, w2), (V, w3), (N, w4)
(N, w1), (N, w2), (V, w3), (V, w4)
(N, w1), (V, w2), (N, w3), (N, w4)
(N, w1), (V, w2), (N, w3), (V, w4)
(N, w1), (V, w2), (V, w3), (N, w4)
(N, w1), (V, w2), (V, w3), (V, w4)
(V, w1), (N, w2), (N, w3), (N, w4)
(V, w1), (N, w2), (N, w3), (V, w4)
… (six more)
2 (3)-State HMM Likelihood

[Trellis as above, now with concrete parameters.]

Transitions:          N      V    end
  start              .7     .2     .1
  N                 .15     .8    .05
  V                  .6    .35    .05

Emissions:           w1     w2     w3     w4
  N                  .7     .2    .05    .05
  V                  .2     .6     .1     .1
2 (3)-State HMM Likelihood

[Trellis and tables as above.]

Q: What's the probability of (N, w1), (V, w2), (V, w3), (N, w4)?
2 (3)-State HMM Likelihood

Q: What's the probability of (N, w1), (V, w2), (V, w3), (N, w4)?
A: (.7*.7) * (.8*.6) * (.35*.1) * (.6*.05) ≈ 0.000247
(each factor pairs a transition probability with the corresponding emission probability)
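Under the same assumed tables, the joint_prob sketch from earlier reproduces this number:

print(joint_prob(["N", "V", "V", "N"], ["w1", "w2", "w3", "w4"]))
# .49 * .48 * .035 * .03 = 0.00024696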
2 (3)-State HMM Likelihood

Q: What's the probability of (N, w1), (V, w2), (V, w3), (N, w4) with ending included (unique ending symbol "#")?
A: (.7*.7) * (.8*.6) * (.35*.1) * (.6*.05) * (.05*1) ≈ 0.0000123

Transitions: as above.

Emissions (extended):   w1     w2     w3     w4     #
  N                     .7     .2    .05    .05     0
  V                     .2     .6     .1     .1     0
  end                    0      0      0      0     1
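In the sketch, the ending just multiplies in the transition into "end" (the "#" emission from "end" has probability 1):

print(joint_prob(["N", "V", "V", "N"], ["w1", "w2", "w3", "w4"])
      * trans[("N", "end")])   # ≈ 0.0000123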
2 (3)-State HMM Likelihood

[Trellis and tables as above.]

Q: What's the probability of (N, w1), (V, w2), (N, w3), (N, w4)?
2 (3)-State HMM Likelihood

Q: What's the probability of (N, w1), (V, w2), (N, w3), (N, w4)?
A: (.7*.7) * (.8*.6) * (.6*.05) * (.15*.05) ≈ 0.0000529
2 (3)-State HMM Likelihood

Q: What's the probability of (N, w1), (V, w2), (N, w3), (N, w4) with ending (unique symbol "#")?
A: (.7*.7) * (.8*.6) * (.6*.05) * (.15*.05) * (.05*1) = 0.000002646

Tables as above (emissions extended with "#").
2 (3)-State HMM Likelihood

[The trellis paths for (N, V, N, N) and (N, V, V, N), shown side by side. Both begin with
$p(N \mid \text{start})\, p(w_1 \mid N)\, p(V \mid N)\, p(w_2 \mid V)$ and only then diverge.]

Up until here, all the computation was the same.
Let's reuse what computations we can.
Reusing Computation: Attempt #1

[The same two paths in the trellis, with running products:]

(.7*.7)
(.7*.7) * (.8*.6)
(.7*.7) * (.8*.6) * (.35*.1)    continuing to (N, V, V, ...)
(.7*.7) * (.8*.6) * (.6*.05)    continuing to (N, V, N, ...)

Tables as above.
Reusing Computation: Attempt #1

Name the shared prefix: aN[2, V] = (.7*.7) * (.8*.6).
Then the two paths continue as
  aN[2, V] * (.35*.1)
  aN[2, V] * (.6*.05)

Remember: these are only two of the 16 paths through the trellis.

Tables as above.
Reusing Computation: Attempt #1

Pushing the naming one step earlier: a[1, N] = (.7*.7), and aN[2, V] = a[1, N] * (.8*.6).

These aP(t, s) values represent the probability of a state s at time t in a particular
shared path P ("NV" is the path prefix here).

Remember: these are only two of the 16 paths through the trellis.
Reusing Computation: Attempt #1

How do we handle the other 14 paths?
Marginalize at each time step.
Reusing Computation: Attempt #1

So far, we've only considered one particular shared path: "(AB) *"

aAB(i, *) = aA(i-1, B) * p(* | B) * p(obs at i | *)

[Grid of states A, B, C at times i-2, i-1, i, with the A -> B -> * path highlighted.]

What about any of the other paths (AAB, ACB, etc.)?
Reusing Computation: Attempt #2

[Grid of states A, B, C at times i-2, i-1, i.]

Let's first consider "any shared path ending with B (AB, BB, or CB), then B",
marginalizing across the previous hidden state values:

$\alpha(i, B) = \sum_{s} \alpha(i-1, s) \cdot p(B \mid s) \cdot p(\text{obs at } i \mid B)$

(Compare attempt #1, which tracked a single path: aAB(i, B) = aA(i-1, B) * p(B | B) * p(obs at i | B).)

Computing α at time i-1 will correctly incorporate paths through time i-2:
we correctly obey the Markov property.
Forward Probability

$\alpha(i, B) = \sum_{s'} \alpha(i-1, s') \cdot p(B \mid s') \cdot p(\text{obs at } i \mid B)$

α(i, B) is the total probability of all paths to that state B from the beginning.
Forward Probability

α(i, s) is the total probability of all paths:
1. that start from the beginning
2. that end (currently) in s at step i
3. that emit the observation obs at i

$\alpha(i, s) = \sum_{s'} \alpha(i-1, s') \cdot p(s \mid s') \cdot p(\text{obs at } i \mid s)$

α(i-1, s'): what's the total probability up until now?
p(s | s'): what are the immediate ways to get into state s?
each summand: how likely is it to get into state s this way?
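Instantiated with the concrete tables of the running example (the same numbers appear in the next slide):

$\alpha(2, V) = \alpha(1, N)\, p(V \mid N)\, p(w_2 \mid V) + \alpha(1, V)\, p(V \mid V)\, p(w_2 \mid V) = .49 \cdot .8 \cdot .6 + .04 \cdot .35 \cdot .6 = .2352 + .0084 = .2436$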
2 (3)-State HMM Likelihood with Forward Probabilities

[Trellis as above; tables as above.]

α[1, N] = (.7*.7) = .49
α[1, V] = (.2*.2) = .04
α[2, N] = α[1, N] * (.15*.2) + α[1, V] * (.6*.2)  = .0195
α[2, V] = α[1, N] * (.8*.6)  + α[1, V] * (.35*.6) = .2436
α[3, N] = α[2, V] * (.6*.05) + α[2, N] * (.15*.05)
α[3, V] = α[2, V] * (.35*.1) + α[2, N] * (.8*.1)
...
Forward Algorithm

Use dynamic programming to build the α table left-to-right.

α: a 2D table, (N+2) × K*
  N+2: number of observations (+2 for the BOS & EOS symbols)
  K*: number of states
α = double[N+2][K*]
α[0][*] = 1.0
for(i = 1; i ≤ N+1; ++i) {
  for(state = 0; state < K*; ++state) {
    pobs = pemission(obsi | state)
    for(old = 0; old < K*; ++old) {
      pmove = ptransition(state | old)
      α[i][state] += α[i-1][old] * pobs * pmove
    }
  }
}
we still need to learn these emission and transition parameters (via EM)
Q: What do we return? (How do we return the likelihood of the sequence?)
A: α[N+1][end]
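A runnable Python rendering of this pseudocode (my sketch, using the example tables from the earlier snippets; function and variable names are mine):

def forward_likelihood(words, states, trans, emit):
    # alpha[i][s]: total probability of all paths that end in state s
    # after emitting words[:i]; position 0 plays the role of BOS ("start").
    alpha = [{s: 0.0 for s in states} for _ in range(len(words) + 1)]
    for s in states:
        alpha[1][s] = trans[("start", s)] * emit[s][words[0]]
    for i in range(2, len(words) + 1):
        for s in states:
            alpha[i][s] = emit[s][words[i - 1]] * sum(
                alpha[i - 1][old] * trans[(old, s)] for old in states)
    # EOS step: every surviving path transitions into "end".
    return sum(alpha[len(words)][old] * trans[(old, "end")] for old in states)

print(forward_likelihood(["w1", "w2", "w3", "w4"], ("N", "V"), trans, emit))
# intermediate values match the slides: alpha[1]["N"]=.49, alpha[1]["V"]=.04,
# alpha[2]["N"]=.0195, alpha[2]["V"]=.2436

Up to the EOS factor handled here, this computes exactly the sum that brute_force_likelihood enumerates, but in O(N·K²) time instead of O(K^N).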
Interactive HMM Example
https://goo.gl/rbHEoc (Jason Eisner, 2002)
Original: http://www.cs.jhu.edu/~jason/465/PowerPoint/lect24-hmm.xls
Forward Algorithm in Log-Space

α = double[N+2][K*]
α[0][*] = 0.0
(all other cells start at log 0, i.e., -∞, since logadd accumulates into them)
for(i = 1; i ≤ N+1; ++i) {
  for(state = 0; state < K*; ++state) {
    pobs = log pemission(obsi | state)
    for(old = 0; old < K*; ++old) {
      pmove = log ptransition(state | old)
      α[i][state] = logadd(α[i][state], α[i-1][old] + pobs + pmove)
    }
  }
}

A first attempt:
$\operatorname{logadd}(lp, lq) = \log\big(\exp(lp) + \exp(lq)\big)$
this can still overflow! (why?)

The numerically stable version:
$\operatorname{logadd}(lp, lq) = \begin{cases} lp + \log\big(1 + \exp(lq - lp)\big), & lp \geq lq \\ lq + \log\big(1 + \exp(lp - lq)\big), & lq > lp \end{cases}$

scipy.misc.logsumexp
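As a final sketch (mine, not the slides'): the same recurrence in log space. Note that in current SciPy this helper lives at scipy.special.logsumexp; it assumes every table entry passed to math.log is nonzero.

import math
from scipy.special import logsumexp  # scipy.misc.logsumexp in older SciPy

def forward_log_likelihood(words, states, trans, emit):
    # log_alpha[i][s] = log alpha[i][s]; cells start at log 0 = -inf.
    log_alpha = [{s: -math.inf for s in states} for _ in range(len(words) + 1)]
    for s in states:
        log_alpha[1][s] = math.log(trans[("start", s)]) + math.log(emit[s][words[0]])
    for i in range(2, len(words) + 1):
        for s in states:
            log_alpha[i][s] = math.log(emit[s][words[i - 1]]) + logsumexp(
                [log_alpha[i - 1][old] + math.log(trans[(old, s)]) for old in states])
    return logsumexp([log_alpha[len(words)][old] + math.log(trans[(old, "end")])
                      for old in states])

print(forward_log_likelihood(["w1", "w2", "w3", "w4"], ("N", "V"), trans, emit))

For long sequences, the products of many small probabilities underflow to 0 in probability space; working in log space keeps the computation stable.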