Lecture 9: Hidden Markov Model
Kai-Wei Chang, CS @ University of Virginia
Course webpage: http://kwchang.net/teaching/NLP16
This lecture
- Hidden Markov Model
- Different views of HMM
- HMM in the supervised learning setting
Recap: Parts of Speech
- Traditional parts of speech
- ~8 of them
Recap: Tagset
- Penn Treebank tagset, 45 tags:
  PRP$, WRB, WP$, VBG, …
- Penn POS annotations:
  The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
- Universal tagset, 12 tags:
  NOUN, VERB, ADJ, ADV, PRON, DET, ADP, NUM, CONJ, PRT, ".", X
Recap: POS Tagging vs. Word Clustering

- Words often have more than one POS: back
  - The back door = JJ
  - On my back = NN
  - Win the voters back = RB
  - Promised to back the bill = VB
- Syntax vs. semantics (details later)

These examples are from Dekang Lin
Recap: POS tag sequences
- Some tag sequences occur more often than others
- POS N-gram view:
  https://books.google.com/ngrams/graph?content=_ADJ_+_NOUN_%2C_ADV_+_NOUN_%2C+_ADV_+_VERB_
Existing methods often model POS tagging as a sequence tagging problem.
Evaluation
- How many words in the unseen test data can be tagged correctly?
- Usually evaluated on the Penn Treebank
  - State of the art: ~97%
  - Trivial baseline (most likely tag): ~94%
  - Human performance: ~97%
Building a POS tagger
- Supervised learning
- Assume linguists have annotated several examples
The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.

Tag set: DT, JJ, NN, VBD, …

POS Tagger
POS induction
- Unsupervised learning
- Assume we only have an unannotated corpus
The grand jury commented on a number of other topics.

Tag set: DT, JJ, NN, VBD, …

POS Tagger
TODAY: Hidden Markov Model
- We focus on the supervised learning setting
- What is the most likely sequence of tags for the given sequence of words w?
- We will talk about other ML models for this type of prediction task later.
Let’s try
a/DT d6g/NN 0s/VBZ chas05g/VBG a/DT cat/NN ./.
a/DT f6x/NN 0s/VBZ 9u5505g/VBG ./.
a/DT b6y/NN 0s/VBZ s05g05g/VBG ./.
a/DT ha77y/JJ b09d/NN
What is the POS tag sequence of the following sentence?
a ha77y cat was s05g05g .
Don't worry! There is no problem with your eyes or computer.
Let’s try
- a/DT d6g/NN 0s/VBZ chas05g/VBG a/DT cat/NN ./.
  a/DT dog/NN is/VBZ chasing/VBG a/DT cat/NN ./.
- a/DT f6x/NN 0s/VBZ 9u5505g/VBG ./.
  a/DT fox/NN is/VBZ running/VBG ./.
- a/DT b6y/NN 0s/VBZ s05g05g/VBG ./.
  a/DT boy/NN is/VBZ singing/VBG ./.
- a/DT ha77y/JJ b09d/NN
  a/DT happy/JJ bird/NN
- a ha77y cat was s05g05g .
  a happy cat was singing .
How do you predict the tags?
- Two types of information are useful
  - Relations between words and tags
  - Relations between tags and tags
    - e.g., DT NN, DT JJ NN, …
Statistical POS tagging
- What is the most likely sequence of tags for the given sequence of words w?
P(DT JJ NN | a smart dog)
  = P(DT JJ NN, a smart dog) / P(a smart dog)
  ∝ P(DT JJ NN, a smart dog)
  = P(DT JJ NN) P(a smart dog | DT JJ NN)
Transition Probability
- Joint probability: P(t, w) = P(t) P(w | t)
- P(t) = P(t_1, t_2, …, t_n)
    = P(t_1) P(t_2 | t_1) P(t_3 | t_2, t_1) … P(t_n | t_1 … t_{n-1})
    ≈ P(t_1) P(t_2 | t_1) P(t_3 | t_2) … P(t_n | t_{n-1})
    = ∏_{i=1}^{n} P(t_i | t_{i-1})
- Bigram model over POS tags! (Similarly, we can define an n-gram model over POS tags; this is usually called a higher-order HMM.)
Markov assumption
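As a rough illustration (not from the slides), here is a minimal Python sketch of this factorization; the tag names, the `<s>` start symbol, and the probability values are all invented for the example.

```python
# Sketch: P(t) = prod_i P(t_i | t_{i-1}) under the bigram Markov assumption.
# Transition probabilities below are illustrative, not estimated from data.
transition = {
    ("<s>", "DT"): 0.5,   # P(DT | start)
    ("DT", "JJ"): 0.3,    # P(JJ | DT)
    ("JJ", "NN"): 0.6,    # P(NN | JJ)
}

def tag_seq_prob(tags, transition, start="<s>"):
    prob, prev = 1.0, start
    for t in tags:
        prob *= transition.get((prev, t), 0.0)
        prev = t
    return prob

print(tag_seq_prob(["DT", "JJ", "NN"], transition))  # 0.5 * 0.3 * 0.6 = 0.09
```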
Emission Probability
- Joint probability: P(t, w) = P(t) P(w | t)
- Assume words only depend on their POS tags
- P(w | t) ≈ P(w_1 | t_1) P(w_2 | t_2) … P(w_n | t_n) = ∏_{i=1}^{n} P(w_i | t_i)
  i.e., P(a smart dog | DT JJ NN) = P(a | DT) P(smart | JJ) P(dog | NN)
Independence assumption
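A companion sketch for the emission side (again with invented numbers); the `emission[(tag, word)]` lookup table is an assumption for illustration, not something defined in the lecture.

```python
# Sketch: P(w | t) = prod_i P(w_i | t_i) under the independence assumption.
# Emission probabilities are illustrative placeholders.
emission = {("DT", "a"): 0.4, ("JJ", "smart"): 0.01, ("NN", "dog"): 0.02}

def word_seq_prob(words, tags, emission):
    prob = 1.0
    for w, t in zip(words, tags):
        prob *= emission.get((t, w), 0.0)
    return prob

# P(a smart dog | DT JJ NN) = P(a|DT) P(smart|JJ) P(dog|NN)
print(word_seq_prob(["a", "smart", "dog"], ["DT", "JJ", "NN"], emission))
```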
Put them together
- Joint probability: P(t, w) = P(t) P(w | t)
- P(t, w)
    = P(t_1) P(t_2 | t_1) P(t_3 | t_2) … P(t_n | t_{n-1}) · P(w_1 | t_1) P(w_2 | t_2) … P(w_n | t_n)
    = ∏_{i=1}^{n} P(w_i | t_i) P(t_i | t_{i-1})
- e.g., P(a smart dog, DT JJ NN)
    = P(a | DT) P(smart | JJ) P(dog | NN) · P(DT | start) P(JJ | DT) P(NN | JJ)
Put them together
- Two independence assumptions
  - Approximate P(t) by a bigram (or n-gram) model
  - Assume each word depends only on its POS tag
initial probability p(t_1)
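Putting both factors together, a minimal sketch of the joint probability; the `<s>` start symbol stands in for the initial probability p(t_1), and all values are again invented for illustration.

```python
# Sketch: P(t, w) = prod_i P(w_i | t_i) * P(t_i | t_{i-1}), with t_0 = <s>.
transition = {("<s>", "DT"): 0.5, ("DT", "JJ"): 0.3, ("JJ", "NN"): 0.6}
emission = {("DT", "a"): 0.4, ("JJ", "smart"): 0.01, ("NN", "dog"): 0.02}

def joint_prob(words, tags, transition, emission, start="<s>"):
    prob, prev = 1.0, start
    for w, t in zip(words, tags):
        prob *= transition.get((prev, t), 0.0) * emission.get((t, w), 0.0)
        prev = t
    return prob

# P(a smart dog, DT JJ NN)
print(joint_prob(["a", "smart", "dog"], ["DT", "JJ", "NN"], transition, emission))
```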
Hidden Markov Models (formal)

- States T = t_1, t_2, …, t_N
- Observations W = w_1, w_2, …, w_N
  - Each observation is a symbol from a vocabulary V = {v_1, v_2, …, v_V}
- Transition probabilities
  - Transition probability matrix A = {a_ij}, where
    a_ij = P(t_i = j | t_{i-1} = i),  1 ≤ i, j ≤ N
- Observation likelihoods
  - Output probability matrix B = {b_i(k)}, where
    b_i(k) = P(w_i = v_k | t_i = i)
- Special initial probability vector π
    π_i = P(t_1 = i),  1 ≤ i ≤ N
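As a sketch of this parameterization, the three parameter sets can be stored as NumPy arrays; the state names, vocabulary, and every probability value below are invented for illustration.

```python
import numpy as np

# Sketch of the formal HMM parameters with N = 2 hidden states and |V| = 3 symbols.
# All values are illustrative, not estimated.
states = ["DT", "NN"]            # hidden states (POS tags)
vocab = ["a", "dog", "cat"]      # observation symbols

A = np.array([[0.1, 0.9],        # A[i, j] = P(next tag = states[j] | current tag = states[i])
              [0.4, 0.6]])
B = np.array([[0.9, 0.05, 0.05], # B[i, k] = P(word = vocab[k] | tag = states[i])
              [0.0, 0.5, 0.5]])
pi = np.array([0.8, 0.2])        # pi[i] = P(first tag = states[i])

# Each row of A and B, and pi itself, is a probability distribution.
assert np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1) and np.isclose(pi.sum(), 1)
```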
How to build a second-order HMM?
- Second-order HMM
  - Trigram model over POS tags
  - P(t) = ∏_{i=1}^{n} P(t_i | t_{i-1}, t_{i-2})
  - P(w, t) = ∏_{i=1}^{n} P(t_i | t_{i-1}, t_{i-2}) P(w_i | t_i)
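A minimal sketch of the trigram factorization (tags and numbers invented); the only change from the bigram version is that transitions are keyed by the two previous tags, with `<s>` padding the two positions before the sentence.

```python
# Sketch: second-order HMM, P(t) = prod_i P(t_i | t_{i-1}, t_{i-2}).
# Values are illustrative only.
trigram_transition = {
    ("<s>", "<s>", "DT"): 0.5,
    ("<s>", "DT", "JJ"): 0.3,
    ("DT", "JJ", "NN"): 0.7,
}

def tag_seq_prob_2nd_order(tags, transition, start="<s>"):
    prob, prev2, prev1 = 1.0, start, start
    for t in tags:
        prob *= transition.get((prev2, prev1, t), 0.0)
        prev2, prev1 = prev1, t
    return prob

print(tag_seq_prob_2nd_order(["DT", "JJ", "NN"], trigram_transition))  # 0.5 * 0.3 * 0.7
```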
Probabilistic FSA for second-order HMM
Julia Hockenmaier: Intro to NLP
Prediction in generative model
- Inference: What is the most likely sequence of tags for the given sequence of words w?
- What are the latent states that most likely generate the sequence of words w?
initial probability p(t_1)
Example: The Verb “race”
- Secretariat/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NR
- People/NNS continue/VB to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN
- How do we pick the right tag?
Disambiguating “race”
- P(NN | TO) = .00047
- P(VB | TO) = .83
- P(race | NN) = .00057
- P(race | VB) = .00012
- P(NR | VB) = .0027
- P(NR | NN) = .0012
- P(VB | TO) P(NR | VB) P(race | VB) = .00000027
- P(NN | TO) P(NR | NN) P(race | NN) = .00000000032
- So we (correctly) choose the verb reading.
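A quick check of the two products above, multiplying out the probabilities listed on the slide:

```python
# Multiply out the two readings of "race" after TO, using the slide's numbers.
p_vb = 0.83 * 0.0027 * 0.00012      # P(VB|TO) * P(NR|VB) * P(race|VB)
p_nn = 0.00047 * 0.0012 * 0.00057   # P(NN|TO) * P(NR|NN) * P(race|NN)
print(p_vb)  # ~2.7e-07
print(p_nn)  # ~3.2e-10, so the VB reading wins
```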
Jason and his Ice Creams
- You are a climatologist in the year 2799
- Studying global warming
- You can't find any records of the weather in Baltimore, MD for the summer of 2007
- But you find Jason Eisner's diary
- Which lists how many ice creams Jason ate every day that summer
- Our job: figure out how hot it was
http://videolectures.net/hltss2010_eisner_plm/
http://www.cs.jhu.edu/~jason/papers/eisner.hmm.xls
(C)old day vs. (H)ot day
            p(…|C)   p(…|H)   p(…|START)
p(1|…)       0.7      0.1
p(2|…)       0.2      0.2
p(3|…)       0.1      0.7
p(C|…)       0.8      0.1      0.5
p(H|…)       0.1      0.8      0.5
p(STOP|…)    0.1      0.1      0
Note (from the linked Eisner spreadsheet): scroll to the bottom to see a graph of what states and transitions the model thinks are likely on each day. Those likely states and transitions can be used to re-estimate the red probabilities (this is the "forward-backward", or Baum-Welch, algorithm), increasing the likelihood of the observed data.
[Chart: Weather States that Best Explain Ice Cream Consumption — ice creams eaten (#cones) and p(H) plotted against diary day 1–33]
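The table above maps directly onto HMM parameters. A minimal sketch follows; the probabilities are the ones in the table, while the three-day cone and weather sequences in the usage example are invented.

```python
# The ice-cream HMM from the table: hidden states = weather (C/H),
# observations = number of cones eaten (1/2/3).
start = {"C": 0.5, "H": 0.5}
transition = {("C", "C"): 0.8, ("C", "H"): 0.1, ("C", "STOP"): 0.1,
              ("H", "C"): 0.1, ("H", "H"): 0.8, ("H", "STOP"): 0.1}
emission = {("C", 1): 0.7, ("C", 2): 0.2, ("C", 3): 0.1,
            ("H", 1): 0.1, ("H", 2): 0.2, ("H", 3): 0.7}

def joint(cones, weather):
    """P(weather sequence, cone sequence), including the STOP transition."""
    prob, prev = 1.0, None
    for c, w in zip(cones, weather):
        prob *= (start[w] if prev is None else transition[(prev, w)]) * emission[(w, c)]
        prev = w
    return prob * transition[(prev, "STOP")]

# e.g. three hot days on which Jason ate 2, 3, 3 cones (illustrative sequence):
print(joint([2, 3, 3], ["H", "H", "H"]))  # 0.5*0.2 * 0.8*0.7 * 0.8*0.7 * 0.1
```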
Three basic problems for HMMs
- Likelihood of the input:
  - Compute P(w | λ) for the input w and HMM λ
- Decoding (tagging) the input:
  - Find the best tag sequence argmax_t P(t | w, λ)
- Estimation (learning):
  - Find the best model parameters
    - Case 1: supervised – tags are annotated
    - Case 2: unsupervised – only unannotated text
How likely does the sentence "I love cat" occur?
What are the POS tags of "I love cat"?
How to learn the model?
Three basic problems for HMMs
- Likelihood of the input:
  - Forward algorithm
- Decoding (tagging) the input:
  - Viterbi algorithm
- Estimation (learning):
  - Find the best model parameters
    - Case 1: supervised – tags are annotated → Maximum likelihood estimation (MLE)
    - Case 2: unsupervised – only unannotated text → Forward-backward algorithm
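A minimal sketch of the forward algorithm on the ice-cream HMM from the earlier table (the cone sequence is again an invented example); alpha[s] accumulates the total probability of all weather sequences that end in state s after the days seen so far.

```python
# Forward algorithm: compute P(w | lambda) by summing over all hidden state sequences.
# Parameters are the ice-cream HMM from the earlier table.
start = {"C": 0.5, "H": 0.5}
transition = {("C", "C"): 0.8, ("C", "H"): 0.1, ("H", "C"): 0.1, ("H", "H"): 0.8}
stop = {"C": 0.1, "H": 0.1}
emission = {("C", 1): 0.7, ("C", 2): 0.2, ("C", 3): 0.1,
            ("H", 1): 0.1, ("H", 2): 0.2, ("H", 3): 0.7}
states = ["C", "H"]

def forward(obs):
    # alpha[s] = P(observations so far, current state = s)
    alpha = {s: start[s] * emission[(s, obs[0])] for s in states}
    for o in obs[1:]:
        alpha = {s: sum(alpha[p] * transition[(p, s)] for p in states) * emission[(s, o)]
                 for s in states}
    return sum(alpha[s] * stop[s] for s in states)   # total likelihood of the sequence

print(forward([2, 3, 3]))  # likelihood of eating 2, 3, 3 cones on three days
```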
Learning from Labeled Data
- Let's play a game!
- We count how often we see t_{i-1} t_i and w_i t_i, and then normalize.
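A minimal sketch of this count-and-normalize idea; the tiny tagged "corpus" below is invented, and the `<s>`/`</s>` boundary symbols are an assumption for handling sentence starts and ends.

```python
from collections import Counter

# Sketch of supervised MLE for an HMM from a tiny made-up tagged corpus.
corpus = [[("a", "DT"), ("dog", "NN"), ("barks", "VBZ")],
          [("a", "DT"), ("happy", "JJ"), ("dog", "NN")]]

tag_bigram, tag_count, emit_count = Counter(), Counter(), Counter()
for sent in corpus:
    prev = "<s>"
    for word, tag in sent:
        tag_bigram[(prev, tag)] += 1
        emit_count[(tag, word)] += 1
        tag_count[prev] += 1
        prev = tag
    tag_bigram[(prev, "</s>")] += 1   # end-of-sentence transition
    tag_count[prev] += 1

# MLE: P(t_i | t_{i-1}) = count(t_{i-1}, t_i) / count(t_{i-1}),
#      P(w_i | t_i)     = count(t_i, w_i)    / count(t_i)
p_trans = {bg: c / tag_count[bg[0]] for bg, c in tag_bigram.items()}
p_emit = {tw: c / tag_count[tw[0]] for tw, c in emit_count.items()}
print(p_trans[("DT", "NN")], p_emit[("NN", "dog")])
```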
Three basic problems for HMMs
We need dynamic programming for the other problems.
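To make the dynamic programming concrete, here is a minimal Viterbi sketch for the bigram HMM; the toy tag set, transition, and emission values are invented for illustration, not estimated from real data.

```python
# Viterbi: argmax_t P(t, w) for a bigram HMM, by dynamic programming.
# Toy parameters (illustrative values only).
states = ["DT", "JJ", "NN"]
start = {"DT": 0.8, "JJ": 0.1, "NN": 0.1}
transition = {("DT", "DT"): 0.05, ("DT", "JJ"): 0.35, ("DT", "NN"): 0.6,
              ("JJ", "DT"): 0.05, ("JJ", "JJ"): 0.15, ("JJ", "NN"): 0.8,
              ("NN", "DT"): 0.3, ("NN", "JJ"): 0.2, ("NN", "NN"): 0.5}
emission = {("DT", "a"): 0.5, ("JJ", "smart"): 0.1, ("NN", "dog"): 0.05,
            ("NN", "smart"): 0.01}

def viterbi(words):
    # best[s] = (probability, tag sequence) of the best path ending in state s
    best = {s: (start[s] * emission.get((s, words[0]), 0.0), [s]) for s in states}
    for w in words[1:]:
        best = {s: max(((p * transition[(prev, s)] * emission.get((s, w), 0.0), path + [s])
                        for prev, (p, path) in best.items()), key=lambda x: x[0])
                for s in states}
    return max(best.values(), key=lambda x: x[0])

print(viterbi(["a", "smart", "dog"]))  # best path should be ['DT', 'JJ', 'NN']
```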