Lecture 16: Hidden Markov Models
Sanjeev Arora, Elad Hazan
COS 402 – Machine Learning and Artificial Intelligence, Fall 2016
Course progress
• Learning from examples
  • Definition + fundamental theorem of statistical learning; motivated efficient algorithms/optimization
  • Convexity, greedy optimization – gradient descent
  • Neural networks
• Knowledge representation
  • NLP
  • Logic
  • Bayes nets
  • Optimization: MCMC
  • HMM (today) (a special case of Bayes nets)
• Next: reinforcement learning
Admin
• (Written) ex 4 – announced today
• Due after Thanksgiving (Thu)
Markov Chain
Markov Chains Example
A Markov chain with three states (s = 3).
[Figure: transition graph and the corresponding transition matrix]
A directed graph, and a transition matrix giving, for each i, j, the probability of stepping to j when at i.
Ergodic theorem
Every irreducible and aperiodic Markov chain has a unique stationary distribution, and every random walk starting from any node converges to it!
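A quick numerical check, not from the slides: for a made-up irreducible, aperiodic three-state chain, repeatedly applying the transition matrix to an arbitrary starting distribution converges to the unique stationary distribution.

```python
import numpy as np

# A made-up irreducible, aperiodic 3-state transition matrix:
# T[i, j] = probability of stepping to state j when at state i.
T = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])

# Start from an arbitrary distribution and repeatedly take one step.
p = np.array([1.0, 0.0, 0.0])
for _ in range(100):
    p = p @ T

print(p)                      # the stationary distribution
print(np.allclose(p, p @ T))  # stationary means p = pT -> True
```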
Non-stationary Markov chains
[Figure: two states 1 and 2, with 1 → 2 and 2 → 1 each taken with probability 1 – a “periodic” chain]
Notice: a self-loop → not periodic anymore.
Non-stationary Markov chains
[Figure: two states, with transition probabilities 0.5, 0.5 and 1, where one state can never be left – a “reducible” chain]
Irreducible → for any pair of vertices $i, j$, $(T^k)_{ij} > 0$ after finitely many iterations $k$.
This lecture: temporal models – Hidden Markov Models
[Figure: a chain of hidden variables X0 → X1 → X2 → ..., each Xt emitting an observed evidence variable Et]
Hidden variables: X0, X1, X2, ...
Evidence variables (observed): E1, E2, E3, ...
Applications
• Time-dependent variables/problems (e.g. treating patients with changing biometrics over time)
• Natural sequential data (speech, text, etc.)
• Example – text tagging:
  the dog saw a cat
   D   N   V  D  N
Hidden Markov Models: definitions
[Figure: the chain X0 → X1 → X2 → ... with emissions E1, E2, E3, ...]
• $X_t$ = state at time t
• $E_t$ = evidence at time t
• $P(X_0)$ = initial state distribution
• $P(X_t \mid X_{t-1})$ – transition model (= Markov chain)
• $P(E_t \mid X_t)$ – sensor/observation model, random
• Assumptions:
  • Future is independent of the past given the present (1st order): $P(X_t \mid X_{0:t-1}) = P(X_t \mid X_{t-1})$
  • Current evidence only depends on the current state: $P(E_t \mid X_{0:t}, E_{1:t-1}) = P(E_t \mid X_t)$
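To make the generative story concrete, here is a minimal sampling sketch, not from the lecture: it borrows the two-state, three-symbol parameters introduced later in these slides, and assumes the walk starts at X0 = 0.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two-state HMM from the example later in this lecture.
T = np.array([[0.2, 0.8],    # T[i, j] = P(X_{t+1}=j | X_t=i)
              [0.3, 0.7]])
# Emission distributions over the alphabet ("a", "b", "c").
O = np.array([[0.2, 0.0, 0.8],   # P(E | X=0)
              [0.7, 0.3, 0.0]])  # P(E | X=1)
symbols = ["a", "b", "c"]

x = 0                            # initial state X0 (an assumption here)
states, evidence = [], []
for t in range(10):
    x = rng.choice(2, p=T[x])    # transition model P(X_t | X_{t-1})
    e = rng.choice(3, p=O[x])    # sensor model P(E_t | X_t)
    states.append(x)
    evidence.append(symbols[e])

print(states)
print(evidence)
```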
Hidden Markov Models – 2nd order dependencies, a natural extension
[Figure: the same chain, but each Xt now has arcs from the two previous states]
$P(X_t \mid X_{0:t-1}) = P(X_t \mid X_{t-1}, X_{t-2})$
Hidden Markov Models – translation
[Figure: the hidden chain X0, X1, ..., X5 are the English words; the observed variables A1, A2, A3, A4, A5 are the Hebrew words]
HMMs – questions we want to solve
1. Filtering: what's the current state? $P(X_t \mid E_{1:t}) = ?$
2. Prediction: where will I be in k steps? $P(X_{t+k} \mid E_{1:t}) = ?$
3. Smoothing: where was I in the past? $P(X_k \mid E_{1:t}) = ?$
4. Most likely sequence given the data: $\arg\max_{X_{0:t}} P(X_{0:t} \mid E_{1:t}) = ?$
Example – word tagging by trigram HMM
the dog saw a cat
 D   N   V  D  N
• Let K = {V, N, D, Adv, …, *, STOP} be a set of labels. These are going to be our states.
• V = dictionary words; these are the observations.
• Model = HMM with 3 arcs back. Trigram assumption:
$P(X_k \mid X_{0:k-1}) = P(X_k \mid X_{k-1}, X_{k-2}), \quad P(E_t \mid X_{0:t}, E_{1:t-1}) = P(E_t \mid X_t)$
Example – word tagging by trigram HMM
the dog saw a cat
 D   N   V  D  N
• We will see:
$P(\text{the dog laughs},\, D\,N\,V\,STOP) = P(D \mid *, *) \times P(N \mid *, D) \times P(V \mid D, N) \times P(STOP \mid N, V) \times P(\text{the} \mid D) \times P(\text{dog} \mid N) \times P(\text{laughs} \mid V)$
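As a sketch, not from the lecture, this factorization turns directly into code; the probability tables q and e below are made up for illustration.

```python
# Trigram transition probabilities q[(w, u, v)] = P(v | w, u)
# and emission probabilities e[(word, tag)] = P(word | tag).
# All numbers here are made up for illustration.
q = {("*", "*", "D"): 0.5, ("*", "D", "N"): 0.6,
     ("D", "N", "V"): 0.4, ("N", "V", "STOP"): 0.3}
e = {("the", "D"): 0.4, ("dog", "N"): 0.01, ("laughs", "V"): 0.02}

def joint_prob(words, tags):
    """P(words, tags) under the trigram HMM factorization."""
    padded = ["*", "*"] + tags + ["STOP"]
    p = 1.0
    # Product of transition terms P(x_i | x_{i-2}, x_{i-1}),
    # including the final STOP transition.
    for i in range(2, len(padded)):
        p *= q[(padded[i - 2], padded[i - 1], padded[i])]
    # Product of emission terms P(e_i | x_i).
    for word, tag in zip(words, tags):
        p *= e[(word, tag)]
    return p

print(joint_prob(["the", "dog", "laughs"], ["D", "N", "V"]))
```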
Example – word tagging by trigram HMM
• Assume we know the transition probabilities $P(X_k \mid X_{k-1}, X_{k-2})$.
• Assume we know the observation frequencies $P(E_k \mid X_k)$.
• (How can we estimate these from labelled data? See the counting sketch below.)
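One standard answer, sketched here since the slide leaves it as a question: estimate both tables by maximum-likelihood counting over a tagged corpus. The function name estimate and the data format are my own.

```python
from collections import Counter

def estimate(tagged_sentences):
    """MLE estimates from sentences given as lists of (word, tag) pairs."""
    tri, bi, emit, tag_count = Counter(), Counter(), Counter(), Counter()
    for sent in tagged_sentences:
        tags = ["*", "*"] + [t for _, t in sent] + ["STOP"]
        for i in range(2, len(tags)):
            tri[(tags[i - 2], tags[i - 1], tags[i])] += 1
            bi[(tags[i - 2], tags[i - 1])] += 1
        for word, tag in sent:
            emit[(word, tag)] += 1
            tag_count[tag] += 1
    # P(v | w, u) = count(w, u, v) / count(w, u)
    q = {(w, u, v): c / bi[(w, u)] for (w, u, v), c in tri.items()}
    # P(word | tag) = count(word, tag) / count(tag)
    e = {(word, tag): c / tag_count[tag] for (word, tag), c in emit.items()}
    return q, e

q, e = estimate([[("the", "D"), ("dog", "N"), ("laughs", "V")]])
print(q[("*", "*", "D")], e[("dog", "N")])
```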
Decoding HMMs
Input: sentence $E_1, E_2, \ldots, E_t$
Output: tagging according to labels in K (N, V, …), i.e. the states $X_1, \ldots, X_t$:
$\arg\max_{x_{0:t}} P(X_{0:t} = x_{0:t} \mid E_{1:t} = e_{1:t})$
$= \arg\max_{x_{0:t}} P(X_{0:t} = x_{0:t},\, E_{1:t} = e_{1:t}) \times \frac{1}{P(E_{1:t} = e_{1:t})}$
By the trigram Markov assumption, we have:
$P(x_{0:t}, e_{1:t}) = \prod_{i=1}^{t} P(x_i \mid x_{i-1}, x_{i-2}) \prod_{i=1}^{t} P(e_i \mid x_i)$
Why?
Decoding HMMs
$P(x_{0:t}, e_{1:t})$
$= P(x_{0:t}) \times P(e_{1:t} \mid x_{0:t})$ (complete probability)
$= \prod_{i=1}^{t} P(x_i \mid x_{0:i-1}) \times \prod_{i=1}^{t} P(e_i \mid x_{0:t}, e_{1:i-1})$ (chain rule)
$= \prod_{i=1}^{t} P(x_i \mid x_{i-1}, x_{i-2}) \times \prod_{i=1}^{t} P(e_i \mid x_{0:t}, e_{1:i-1})$ (2nd order MC)
$= \prod_{i=1}^{t} P(x_i \mid x_{i-1}, x_{i-2}) \times \prod_{i=1}^{t} P(e_i \mid x_i)$ (cond. independence)
Decoding HMMs – Viterbi algorithm
Let
$f(X_{0:k}) = \prod_{i=1}^{k} P(X_i \mid X_{i-1}, X_{i-2}) \prod_{i=1}^{k} P(e_i \mid X_i)$
and define
$\pi_k(u, v) = \max_{x_{0:k-2}} f(x_{0:k-2}, u, v)$
Recall: we want to compute
$\arg\max_{x_{0:t}} P(x_{0:t}, e_{1:t}) = \arg\max_{x_{0:t}} f(x_{0:t})$
Decoding HMMs – Viterbi algorithm
Let
$f(X_{0:k}) = \prod_{i=1}^{k} P(X_i \mid X_{i-1}, X_{i-2}) \prod_{i=1}^{k} P(e_i \mid X_i)$
and define
$\pi_k(u, v) = \max_{x_{0:k-2}} f(x_{0:k-2}, u, v)$
Main lemma:
$\pi_k(u, v) = \max_{w} \{\pi_{k-1}(w, u) \times P(v \mid w, u) \times P(e_k \mid v)\}$
Now the algorithm is straightforward: compute this recursively! (a.k.a. dynamic programming)
Viterbi: explicit pseudocode
Input: observations $e_1, \ldots, e_t$
Output: most likely variable assignment $x_0, \ldots, x_t$
Initialize: set $x_0, x_{-1}$ to be “*”
For k = 1, 2, …, t do:
  For u, v in K do:
    1. $\pi_k(u, v) = \max_{w} \{\pi_{k-1}(w, u) \times P(v \mid w, u) \times P(e_k \mid v)\}$
    2. Save the $\pi_k(u, v)$ value and the assignment which attains it
  end
Return $\max_{u,v} \{\pi_t(u, v) \times P(STOP \mid u, v)\}$ and the assignment which attains it.
Computational complexity?
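A minimal runnable version of this pseudocode, assuming the dictionary tables q[(w, u, v)] = P(v | w, u) and e[(word, tag)] = P(word | tag) in the format sketched earlier. The triple loop over u, v, w gives O(t·|K|³) time, which answers the question above.

```python
def viterbi(words, K, q, e):
    """Most likely tag sequence under a trigram HMM (O(t * |K|^3) time)."""
    t = len(words)
    # pi[k][(u, v)] = max prob of a length-k prefix ending in tags u, v
    pi = [{("*", "*"): 1.0}]
    bp = [{}]  # backpointers: best w for each (u, v) at step k
    for k in range(1, t + 1):
        pi_k, bp_k = {}, {}
        tags_w = ["*"] if k <= 2 else K
        tags_u = ["*"] if k == 1 else K
        for u in tags_u:
            for v in K:
                best_w, best_p = None, 0.0
                for w in tags_w:
                    p = (pi[k - 1].get((w, u), 0.0)
                         * q.get((w, u, v), 0.0)
                         * e.get((words[k - 1], v), 0.0))
                    if p > best_p:
                        best_w, best_p = w, p
                if best_w is not None:
                    pi_k[(u, v)], bp_k[(u, v)] = best_p, best_w
        pi.append(pi_k)
        bp.append(bp_k)
    # Final STOP transition, then follow backpointers.
    (u, v), _ = max((((uu, vv), pi[t][(uu, vv)] * q.get((uu, vv, "STOP"), 0.0))
                     for (uu, vv) in pi[t]), key=lambda kv: kv[1])
    tags = [u, v]
    for k in range(t, 2, -1):
        tags.insert(0, bp[k][(tags[0], tags[1])])
    return tags[-t:]
```

Plugged into the q and e tables from the counting sketch above, viterbi(["the", "dog", "laughs"], ["D", "N", "V"], q, e) recovers the tags D N V. In practice one sums log-probabilities instead of multiplying, to avoid underflow on long sentences.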
Hidden Markov Models – another view
[Figure: the chain X0 → X1 → X2 → ... with emissions E1, E2, E3, ...]
Hidden variables: 2 states
Evidence variables (observed, 2 options)
Hidden Markov Models – another view
A Markov chain with:
1. Transition probabilities that govern state change
2. A distribution over signals/observations from each state
[Figure: two states X=0 and X=1, with transitions P00 = 0.2, P01 = 0.8, P10 = 0.3, P11 = 0.7, emitting signals “a”, “b”, “c”]
Transition matrix:
$T = \begin{pmatrix} 0.2 & 0.8 \\ 0.3 & 0.7 \end{pmatrix}$
Observation matrices, one per signal (diagonal entries are $P(e \mid x_t)$ for $x_t = 0, 1$):
$O_a = \begin{pmatrix} 0.2 & 0 \\ 0 & 0.7 \end{pmatrix}, \quad O_b = \begin{pmatrix} 0 & 0 \\ 0 & 0.3 \end{pmatrix}, \quad O_c = \begin{pmatrix} 0.8 & 0 \\ 0 & 0 \end{pmatrix}$
“Forward algorithm”
[Figure: the two-state HMM from the previous slide]
To compute $P(X_{t+1} \mid e_{1:t+1})$, a recursive formula (similar to what we did):
$P(X_{t+1} \mid e_{1:t+1}) = \alpha\, P(e_{t+1} \mid X_{t+1}) \sum_{x_t} P(X_{t+1} \mid x_t)\, P(x_t \mid e_{1:t})$
“Forward algorithm”
[Figure: the two-state HMM from the previous slide]
To compute $P(X_{t+1} \mid e_{1:t+1})$, a recursive formula (similar to what we did):
$P(X_{t+1} \mid e_{1:t+1}) = \alpha\, P(e_{t+1} \mid X_{t+1}) \sum_{x_t} P(X_{t+1} \mid x_t)\, P(x_t \mid e_{1:t})$
Derivation:
$P(X_{t+1} \mid e_{1:t+1}) = P(X_{t+1} \mid e_{1:t}, e_{t+1})$
$= \frac{1}{P(e_{t+1} \mid e_{1:t})}\, P(e_{t+1} \mid X_{t+1}, e_{1:t})\, P(X_{t+1} \mid e_{1:t})$ (Bayes)
$= \alpha\, P(e_{t+1} \mid X_{t+1})\, P(X_{t+1} \mid e_{1:t})$ (Markov assumption)
$= \alpha\, P(e_{t+1} \mid X_{t+1}) \sum_{x_t} P(X_{t+1} \mid x_t)\, P(x_t \mid e_{1:t})$
“Forward algorithm”
[Figure: the two-state HMM with transition matrix T and observation matrices O_a, O_b, O_c from above]
To compute $P(X_{t+1} \mid e_{1:t+1})$, a recursive formula (similar to what we did):
$P(X_{t+1} \mid e_{1:t+1}) = \alpha\, P(e_{t+1} \mid X_{t+1}) \sum_{x_t} P(X_{t+1} \mid x_t)\, P(x_t \mid e_{1:t})$
Or in matrix form, if $f_t$ is the vector of filtered probabilities $f_t(x) = P(X_t = x \mid e_{1:t})$:
$f_{t+1} = \alpha\, O_{t+1} T^{\top} f_t$
$O_t$ – the observation matrix corresponding to $E_t$; $\alpha$ – the normalizing constant that makes the entries sum to 1 (equal to $\frac{1}{P(e_{t+1} \mid e_{1:t})}$).
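A minimal sketch of this matrix-form recursion, using the transition and observation matrices of the two-state example; the observation sequence and the uniform prior over X0 are assumptions.

```python
import numpy as np

T = np.array([[0.2, 0.8],   # T[i, j] = P(X_{t+1}=j | X_t=i)
              [0.3, 0.7]])
# Diagonal observation matrices, one per signal.
O = {"a": np.diag([0.2, 0.7]),
     "b": np.diag([0.0, 0.3]),
     "c": np.diag([0.8, 0.0])}

def forward(evidence, f0):
    """Filtering: returns P(X_t | e_{1:t}) for the final t."""
    f = f0
    for e in evidence:
        f = O[e] @ T.T @ f      # f_{t+1} = alpha * O_{t+1} T^T f_t
        f = f / f.sum()         # alpha: renormalize to sum to 1
    return f

f0 = np.array([0.5, 0.5])       # assumed uniform prior over X0
print(forward(["a", "b", "a"], f0))
```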
“Backward algorithm”
[Figure: the same two-state HMM with its transition and observation matrices]
Let $b_{k:t}$ be the vector of $b_{k:t}(x) = P(e_{k:t} \mid X_{k-1} = x)$:
$b_{k+1:t} = T\, O_{k+1}\, b_{k+2:t}$
$O_t$ – the observation matrix corresponding to $E_t$.
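A matching sketch of the backward recursion, combined with the forward pass to compute the smoothed estimate $P(X_k \mid e_{1:t})$; same example matrices as above, with a made-up evidence sequence.

```python
import numpy as np

T = np.array([[0.2, 0.8], [0.3, 0.7]])
O = {"a": np.diag([0.2, 0.7]),
     "b": np.diag([0.0, 0.3]),
     "c": np.diag([0.8, 0.0])}

def smooth(evidence, f0, k):
    """P(X_k | e_{1:t}) via forward-backward (k in 1..t)."""
    f = f0
    for e in evidence[:k]:            # forward pass up to time k
        f = O[e] @ T.T @ f
        f = f / f.sum()
    b = np.ones(2)                    # base case: b_{t+1:t} = all-ones
    for e in reversed(evidence[k:]):  # backward: b = T O_{j} b
        b = T @ O[e] @ b
    s = f * b                         # combine filtered and backward terms
    return s / s.sum()

f0 = np.array([0.5, 0.5])             # assumed uniform prior over X0
print(smooth(["a", "b", "a"], f0, k=2))
```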
Summary
• HMMs – useful to model time-dependent variables/problems (e.g. treating patients with changing biometrics over time)
• Example – text tagging
• Viterbi algorithm (dynamic programming) to find the most likely assignment to the hidden variables (assuming the transition probabilities are known)
• Independence assumptions allow “forward” + “backward” computations of conditional probabilities