
Corpora and Statistical Methods Lecture 8


Albert Gatt
Corpora and Statistical Methods
Lecture 8, Part 2: Markov and Hidden Markov Models: Conceptual Introduction

In this lecture
We focus on (Hidden) Markov Models:
- a conceptual introduction to Markov Models
- their relevance to NLP
- Hidden Markov Models
- algorithms

Acknowledgement
Some of the examples in this lecture are taken from a tutorial on HMMs by Wolfgang Maass.

Talking about the weather
Suppose we want to predict tomorrow's weather. The possible predictions are:
- sunny
- foggy
- rainy

We might decide to predict tomorrow's outcome based on earlier weather:
- if it's been sunny all week, it's likelier to be sunny tomorrow than if it had been rainy all week
- how far back do we want to go to predict tomorrow's weather?

Statistical weather model
Notation:
- S: the state space, a set of possible values for the weather: {sunny, foggy, rainy} (each state is identifiable by an integer i)
- X: a sequence of random variables, each taking a value from S; these model the weather over a sequence of days
- t is an integer standing for time

(X1, X2, X3, ..., XT) models a series of random variables:
- each takes a value from S with a certain probability P(X = si)
- the entire sequence tells us the weather over T days

Statistical weather model
If we want to predict the weather for day t+1, our model might look like this:

P(Xt+1 = sk | X1, X2, ..., Xt)

E.g. P(weather tomorrow = sunny), conditional on the weather in the past t days.
Problem: the larger t gets, the more calculations we have to make (with 3 states and t previous days, there are 3^t possible histories to condition on).

Markov Properties I: Limited horizon
The probability that we're in state si at time t+1 only depends on where we were at time t:

P(Xt+1 = si | X1, ..., Xt) = P(Xt+1 = si | Xt)

Given this assumption, the probability of any sequence is just:

P(X1, ..., XT) = P(X1) P(X2 | X1) P(X3 | X2) ... P(XT | XT-1)

Markov Properties II: Time invariance
The probability of being in state si given the previous state sj does not change over time:

P(Xt+1 = si | Xt = sj) = P(X2 = si | X1 = sj), for all t

Concrete instantiation

Day t \ Day t+1   sunny   rainy   foggy
sunny             0.8     0.05    0.15
rainy             0.2     0.6     0.2
foggy             0.2     0.3     0.5

This is essentially a transition matrix, which gives us the probabilities of going from one state to the other.

We can denote state transition probabilities as aij (the probability of going from state i to state j).

Graphical view
Components of the model:
- states (s)
- transitions
- transition probabilities
- an initial probability distribution for states

Essentially, a non-deterministic finite state automaton.
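
As a concrete illustration, here is a minimal Python sketch of this weather Markov chain; the dictionary layout and function names are my own choices, not part of the lecture.

import random

# Transition probabilities aij from the table above: P(weather tomorrow | weather today).
TRANSITIONS = {
    "sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
    "rainy": {"sunny": 0.2, "rainy": 0.6, "foggy": 0.2},
    "foggy": {"sunny": 0.2, "rainy": 0.3, "foggy": 0.5},
}

def sequence_probability(states):
    """Probability of the rest of a weather sequence, given its first state."""
    prob = 1.0
    for today, tomorrow in zip(states, states[1:]):
        prob *= TRANSITIONS[today][tomorrow]
    return prob

def sample_next(today):
    """Sample tomorrow's weather given today's: the non-deterministic automaton in action."""
    outcomes, weights = zip(*TRANSITIONS[today].items())
    return random.choices(outcomes, weights=weights)[0]

print(sequence_probability(["sunny", "foggy", "foggy", "rainy"]))  # 0.15 * 0.5 * 0.3 = 0.0225
print(sample_next("foggy"))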

Example continued
If the weather today (Xt) is sunny, what's the probability that tomorrow (Xt+1) is sunny and the day after (Xt+2) is rainy?
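
Using the transition table above, the answer works out as follows:

P(Xt+1 = sunny, Xt+2 = rainy | Xt = sunny)
  = P(Xt+1 = sunny | Xt = sunny) * P(Xt+2 = rainy | Xt+1 = sunny)
  = 0.8 * 0.05
  = 0.04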

This calculation relies on the Markov assumption: each day's weather depends only on the previous day's, so the answer is just the product of the two one-step transition probabilities.

Formal definition
A Markov Model is a triple (S, Π, A) where:
- S is the set of states
- Π are the probabilities of being initially in some state
- A are the transition probabilities

Hidden Markov Models
A slight variation on the example:
- You're locked in a room with no windows
- You can't observe the weather directly
- You only observe whether the guy who brings you food is carrying an umbrella or not

You need a model telling you the probability of seeing the umbrella, given the weather: a distinction between observations and their underlying emitting state.

Define:
- Ot as an observation at time t
- K = {+umbrella, -umbrella} as the set of possible outputs

We're interested in P(Ot = k | Xt = si), i.e. the probability of a given observation at time t, given that the underlying weather state at t is si.

Symbol emission probabilities

weather   Probability of umbrella
sunny     0.1
rainy     0.8
foggy     0.3

This is the hidden model, telling us the probability that Ot = k given that Xt = si. We assume that each underlying state Xt = si emits an observation with a given probability.

Using the hidden model
The model gives us P(Ot = k | Xt = si). Then, by Bayes' Rule, we can compute P(Xt = si | Ot = k).

Generalises easily to an entire sequence
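
To make the Bayes' Rule step concrete, here is a rough Python sketch. The uniform prior over weather states is an assumption made purely for illustration; the lecture does not specify one.

EMISSION = {"sunny": 0.1, "rainy": 0.8, "foggy": 0.3}     # P(+umbrella | weather), from the table
PRIOR = {"sunny": 1 / 3, "rainy": 1 / 3, "foggy": 1 / 3}  # assumed uniform P(weather)

def posterior_given_umbrella():
    """P(Xt = si | Ot = +umbrella) for each state, via Bayes' Rule."""
    evidence = sum(EMISSION[s] * PRIOR[s] for s in PRIOR)  # P(Ot = +umbrella)
    return {s: EMISSION[s] * PRIOR[s] / evidence for s in PRIOR}

print(posterior_given_umbrella())
# roughly {'sunny': 0.083, 'rainy': 0.667, 'foggy': 0.25}: seeing the umbrella makes rain most likely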

HMM in graphics
- Circles indicate states
- Arrows indicate probabilistic dependencies between states

HMM in graphics (continued)
- Green nodes are hidden states
- Each hidden state depends only on the previous state (Markov assumption)

Why HMMs?
HMMs are a way of thinking of underlying events as probabilistically generating surface events.

Example: Parts of speech
- a POS is a class or set of words
- we can think of language as an underlying Markov Chain of parts of speech from which actual words are generated (emitted)
- so what are our hidden states here, and what are the observations?

HMMs in POS Tagging
[Figure: a hidden layer of POS states such as ADJ, N, V and DET, linked by transitions]
The hidden layer (constructed through training) models the sequence of POSs in the training corpus.

HMMs in POS Tagging (continued)
[Figure: the hidden states DET, ADJ, N and V emitting the words "the", "tall", "lady" and "is"]
- Observations are words.
- They are emitted by their corresponding hidden state.
- Each state depends on its previous state.

Why HMMs?
There are efficient algorithms to train HMMs using Expectation Maximisation. The general idea: the training data is assumed to have been generated by some HMM with unknown parameters, and we try to learn those unknown parameters from the data (a small illustrative sketch follows below).
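
The lecture points to Expectation Maximisation for learning these parameters from unlabelled data. As a simpler illustration of what the parameters look like, the sketch below instead estimates transition and emission probabilities by relative-frequency counting from a tiny hand-tagged toy corpus; both the corpus and the supervised setting are my own additions, not the EM procedure the lecture refers to.

from collections import Counter, defaultdict

# A hypothetical, hand-tagged toy corpus of (word, tag) pairs, invented purely for illustration.
TAGGED = [
    [("the", "DET"), ("tall", "ADJ"), ("lady", "N"), ("is", "V")],
    [("the", "DET"), ("lady", "N"), ("is", "V"), ("tall", "ADJ")],
]

transition_counts = defaultdict(Counter)  # counts of tag -> next tag
emission_counts = defaultdict(Counter)    # counts of tag -> emitted word

for sentence in TAGGED:
    for word, tag in sentence:
        emission_counts[tag][word] += 1
    for (_, t1), (_, t2) in zip(sentence, sentence[1:]):
        transition_counts[t1][t2] += 1

def relative_freqs(counter):
    total = sum(counter.values())
    return {key: count / total for key, count in counter.items()}

A = {tag: relative_freqs(nexts) for tag, nexts in transition_counts.items()}  # transition probabilities
B = {tag: relative_freqs(words) for tag, words in emission_counts.items()}    # emission probabilities

print(A["DET"])  # {'ADJ': 0.5, 'N': 0.5}
print(B["N"])    # {'lady': 1.0}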

A similar idea (learning unknown parameters from the data) is used in finding the parameters of some n-gram models, especially those that use interpolation.

Formalisation of a Hidden Markov Model
Crucial ingredients (familiar):
- Underlying states: S = {s1, ..., sN}
- Output alphabet (observations): K = {k1, ..., kM}
- State transition probabilities: A = {aij}, i, j ∈ S
- State sequence: X = (X1, ..., XT+1), together with a function mapping each Xt to a state s
- Output sequence: O = (O1, ..., OT), where each Ot ∈ K

Crucial ingredients (additional):
- Initial state probabilities: Π = {πi}, i ∈ S (these tell us the initial probability of each state)
- Symbol emission probabilities: B = {bijk}, i, j ∈ S, k ∈ K (these tell us the probability b of seeing observation Ot = k, given that Xt = si and Xt+1 = sj)

Trellis diagram of an HMM
[Figure, built up over three slides: states s1, s2, s3 unfolded over time steps t1, t2, t3; transition arcs a1,1, a1,2, a1,3 link states across time; the observation sequence o1, o2, o3 runs along the bottom; emission arcs b1,1,k, b1,2,k, b1,3,k connect states to observations.]

The fundamental questions for HMMs
1. Given a model μ = (A, B, Π), how do we compute the likelihood of an observation, P(O | μ)? (A brute-force sketch of what this question asks is given below.)
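
Before turning to questions 2 and 3 below, here is a rough Python sketch that computes P(O | μ) for the weather/umbrella example by brute force, summing over every possible hidden state sequence. The uniform initial distribution PI is an assumption (the lecture never specifies one), and emissions here depend only on the current state, a simplification of the bijk formulation above; the efficient forward algorithm is left for the next lecture.

from itertools import product

STATES = ["sunny", "rainy", "foggy"]
PI = {s: 1 / 3 for s in STATES}  # assumed uniform initial distribution
A = {
    "sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
    "rainy": {"sunny": 0.2, "rainy": 0.6, "foggy": 0.2},
    "foggy": {"sunny": 0.2, "rainy": 0.3, "foggy": 0.5},
}
B = {"sunny": 0.1, "rainy": 0.8, "foggy": 0.3}  # P(+umbrella | state)

def emission(state, obs):
    """P(Ot = obs | Xt = state), for obs in {'+umbrella', '-umbrella'}."""
    return B[state] if obs == "+umbrella" else 1 - B[state]

def likelihood(observations):
    """Brute-force P(O | mu): sum the joint probability over every hidden state sequence."""
    total = 0.0
    for seq in product(STATES, repeat=len(observations)):
        p = PI[seq[0]] * emission(seq[0], observations[0])
        for t in range(1, len(observations)):
            p *= A[seq[t - 1]][seq[t]] * emission(seq[t], observations[t])
        total += p
    return total

print(likelihood(["+umbrella", "+umbrella", "-umbrella"]))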

2. Given an observation sequence O and a model μ, which is the state sequence (X1, ..., XT+1) that best explains the observations? This is the decoding problem.

3. Given an observation sequence O and a space of possible models μ = (A, B, Π), which model best explains the observed data?

Application of question 1 (ASR)
Given a model μ = (A, B, Π), how do we compute the likelihood of an observation, P(O | μ)?

The input to an ASR system is a continuous stream of sound waves, which is ambiguous. We need to decode it into a sequence of phones:
- is the input the sequence [n iy d] or [n iy]?
- which sequence is the most probable?

Application of question 2 (POS tagging)
Given an observation sequence O and a model μ, which is the state sequence (X1, ..., XT+1) that best explains the observations? This is the decoding problem.

Consider a POS tagger whose input observation sequence is: "I can read"

We need to find the most likely sequence of underlying POS tags:
- e.g. is "can" a modal verb, or the noun?
- how likely is it that "can" is a noun, given that the previous word is a pronoun?

Summary
HMMs are a way of representing:
- sequences of observations arising from
- sequences of states
where the states are the variables of interest, giving rise to the observations.

Next up: algorithms for answering the fundamental questions about HMMs.