Albert Gatt Corpora and Statistical Methods Lecture 8



Page 1

Albert Gatt

Corpora and Statistical Methods

Lecture 8

Page 2

Markov and Hidden Markov Models: Conceptual Introduction

Part 2

Page 3

In this lecture

We focus on (Hidden) Markov Models

conceptual intro to Markov Models

relevance to NLP

Hidden Markov Models

algorithms

Page 4

Acknowledgement

Some of the examples in this lecture are taken from a tutorial on HMMs by Wolfgang Maass.

Page 5

Talking about the weather

Suppose we want to predict tomorrow's weather. The possible predictions are: sunny, foggy, rainy.

We might decide to predict tomorrow's outcome based on earlier weather: if it's been sunny all week, it's likelier to be sunny tomorrow than if it had been rainy all week.

How far back do we want to go to predict tomorrow's weather?

Page 6

Statistical weather model

Notation:

S: the state space, a set of possible values for the weather: {sunny, foggy, rainy} (each state is identifiable by an integer i)

X: a sequence of random variables, each taking a value from S; these model the weather over a sequence of days

t is an integer standing for time

(X_1, X_2, X_3, ..., X_T) models the values of a series of random variables; each takes a value from S with a certain probability P(X_t = s_i)

The entire sequence tells us the weather over T days.

Page 7

Statistical weather model

If we want to predict the weather for day t+1, our model might look like this:

$P(X_{t+1} = s_k \mid X_1, \ldots, X_t)$

E.g. P(weather tomorrow = sunny), conditional on the weather in the past t days.

Problem: the larger t gets, the more calculations we have to make; with three states, conditioning on the past t days means distinguishing up to 3^t possible histories.

Page 8

Markov Properties I: Limited horizon

The probability that we're in state s_i at time t+1 depends only on where we were at time t:

$P(X_{t+1} = s_i \mid X_1, \ldots, X_t) = P(X_{t+1} = s_i \mid X_t)$

Given this assumption, the probability of any sequence is just:

$P(X_1, \ldots, X_T) = \prod_{i=1}^{T-1} P(X_{i+1} \mid X_i)$
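To make the factorisation concrete, here is a minimal Python sketch. The two-state chain and all its probabilities are assumptions for illustration, not values from the slides; the initial distribution π used for the first factor is part of the formal definition given a few slides below.

```python
# A minimal sketch of what the limited-horizon factorisation buys us:
# the probability of a whole sequence is an initial-state probability
# times a product of one-step transition probabilities.
pi = {"hot": 0.6, "cold": 0.4}              # assumed initial probabilities
A = {"hot":  {"hot": 0.7, "cold": 0.3},     # assumed transition probabilities
     "cold": {"hot": 0.4, "cold": 0.6}}

def sequence_probability(states):
    # P(X_1, ..., X_T) = P(X_1) * prod_t P(X_{t+1} | X_t)
    p = pi[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= A[prev][cur]
    return p

print(sequence_probability(["hot", "hot", "cold"]))  # 0.6 * 0.7 * 0.3 = 0.126
```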

Page 9

Markov Properties II: Time invariance

The probability of being in state s_i given the previous state does not change over time:

$P(X_{t+1} = s_i \mid X_t) = P(X_2 = s_i \mid X_1)$

Page 10

Concrete instantiation

                  Day t+1
Day t      sunny    rainy    foggy
sunny      0.8      0.05     0.15
rainy      0.2      0.6      0.2
foggy      0.2      0.3      0.5

This is essentially a transition matrix, which gives us the probabilities of going from one state to the other.

We can denote state transition probabilities as a_ij (the probability of going from state i to state j).
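As a data structure, the matrix is just a nested mapping. A sketch in Python, with the slide's values, where A["sunny"]["rainy"] plays the role of a_ij:

```python
# The transition matrix from this slide as a nested dict.
# Each row sums to 1, since from any state we must go somewhere.
A = {
    "sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
    "rainy": {"sunny": 0.2, "rainy": 0.6,  "foggy": 0.2},
    "foggy": {"sunny": 0.2, "rainy": 0.3,  "foggy": 0.5},
}
assert all(abs(sum(row.values()) - 1.0) < 1e-9 for row in A.values())
```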

Page 11

Graphical view

Components of the model:

1. states (s)

2. transitions

3. transition probabilities

4. initial probability distribution for states

Essentially, a non-deterministic finite state automaton.

Page 12

Example continued

If the weather today (X_t) is sunny, what's the probability that tomorrow (X_{t+1}) is sunny and the day after (X_{t+2}) is rainy?

$$
\begin{aligned}
P(X_{t+1} = s, X_{t+2} = r \mid X_t = s)
&= P(X_{t+2} = r \mid X_{t+1} = s, X_t = s)\, P(X_{t+1} = s \mid X_t = s) \\
&= P(X_{t+2} = r \mid X_{t+1} = s)\, P(X_{t+1} = s \mid X_t = s) \quad \text{(Markov assumption)} \\
&= 0.05 \times 0.8 = 0.04
\end{aligned}
$$
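A two-line check of the arithmetic, using the relevant entries of the transition matrix from the earlier slide:

```python
# Checking the worked example: two applications of a_ij,
# multiplied thanks to the Markov assumption.
a = {("sunny", "sunny"): 0.8, ("sunny", "rainy"): 0.05}
p = a[("sunny", "sunny")] * a[("sunny", "rainy")]
print(p)  # 0.04 (up to float rounding): 0.8 * 0.05
```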

Page 13

Formal definition

A Markov Model is a triple (S, Π, A) where:

S is the set of states

Π are the probabilities of being initially in some state

A are the transition probabilities
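A direct transcription of this triple as a Python sketch (the field names are mine; the structure is the slide's):

```python
# A Markov Model as the triple (S, Pi, A) from the definition above.
from dataclasses import dataclass

@dataclass
class MarkovModel:
    S: list    # the set of states
    Pi: dict   # Pi[s]   = probability of starting in state s
    A: dict    # A[i][j] = probability of a transition from state i to j
```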

Page 14

Hidden Markov Models

Page 15

A slight variation on the example

You're locked in a room with no windows.

You can't observe the weather directly; you only observe whether the guy who brings you food is carrying an umbrella or not.

You need a model telling you the probability of seeing the umbrella, given the weather: this is the distinction between observations and their underlying emitting state.

Define:

O_t as an observation at time t

K = {+umbrella, -umbrella} as the possible outputs

We're interested in P(O_t = k | X_t = s_i), i.e. the probability of a given observation at t, given that the underlying weather state at t is s_i.

Page 16

Symbol emission probabilities

weather    Probability of umbrella
sunny      0.1
rainy      0.8
foggy      0.3

This is the hidden model, telling us the probability that O_t = k given that X_t = s_i.

We assume that each underlying state X_t = s_i emits an observation with a given probability.

Page 17

Using the hidden model

The model gives: P(O_t = k | X_t = s_i)

Then, by Bayes' Rule, we can compute: P(X_t = s_i | O_t = k)

$P(X_t = s_i \mid O_t = k) = \dfrac{P(O_t = k \mid X_t = s_i)\, P(X_t = s_i)}{P(O_t = k)}$

This generalises easily to an entire sequence.
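A numerical sketch of this inversion, using the emission probabilities from the previous slide. The prior over weather states is an assumption (uniform) made purely for illustration; in the full model it would come from the Markov chain itself.

```python
# Bayes' Rule on the umbrella example.
emit = {"sunny": 0.1, "rainy": 0.8, "foggy": 0.3}   # P(umbrella | weather), from the slide
prior = {w: 1 / 3 for w in emit}                    # assumed P(weather)

p_umbrella = sum(emit[w] * prior[w] for w in emit)  # P(O_t = +umbrella), by marginalisation
posterior = {w: emit[w] * prior[w] / p_umbrella for w in emit}
print(posterior)  # rainy ~ 0.67, foggy 0.25, sunny ~ 0.08
```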

Page 18

HMM in graphics

Circles indicate states

Arrows indicate probabilistic dependencies between states

Page 19

HMM in graphics

Green nodes are hidden states

Each hidden state depends only on the previous state (Markov assumption)

Page 20

Why HMMs?

HMMs are a way of thinking of underlying events probabilistically generating surface events.

Example: Parts of speech

a POS is a class or set of words

we can think of language as an underlying Markov Chain of parts of speech from which actual words are generated ("emitted")

So what are our hidden states here, and what are the observations?

Page 21

HMMs in POS Tagging

[Figure: hidden layer of POS states DET, ADJ, N, V]

Hidden layer (constructed through training)

Models the sequence of POSs in the training corpus

Page 22

HMMs in POS Tagging

[Figure: hidden states DET, ADJ, N, V emitting the words "the", "tall", "lady", "is" respectively]

Observations are words.

They are “emitted” by their corresponding hidden state.

The state depends on its previous state.
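A toy Python sketch of this generative story. All transition and emission probabilities below are assumptions for illustration, not trained values:

```python
# Hidden POS states form a Markov chain; each state emits a word.
import random

trans = {"DET": {"ADJ": 0.5, "N": 0.5},   # assumed transition probabilities
         "ADJ": {"N": 1.0},
         "N":   {"V": 1.0},
         "V":   {}}                       # V ends the sequence here
emit  = {"DET": {"the": 1.0},             # assumed emission probabilities
         "ADJ": {"tall": 1.0},
         "N":   {"lady": 1.0},
         "V":   {"is": 1.0}}

def generate(state="DET"):
    """Walk the hidden chain, emitting one word per state."""
    words = []
    while True:
        word, = random.choices(list(emit[state]), list(emit[state].values()))
        words.append(word)
        if not trans[state]:
            return words
        state, = random.choices(list(trans[state]), list(trans[state].values()))

print(" ".join(generate()))  # e.g. "the tall lady is" or "the lady is"
```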

Page 23

Why HMMs

There are efficient algorithms to train HMMs using Expectation Maximisation.

General idea:

the training data is assumed to have been generated by some HMM (parameters unknown)

we try to learn the unknown parameters from the data

A similar idea is used in finding the parameters of some n-gram models, especially those that use interpolation.

Page 24

Formalisation of a Hidden Markov model

Page 25

Crucial ingredients (familiar)

Underlying states: S = {s_1, ..., s_N}

Output alphabet (observations): K = {k_1, ..., k_M}

State transition probabilities: A = {a_ij}, i, j ∈ S

State sequence: X = (X_1, ..., X_{T+1}), plus a function mapping each X_t to a state s

Output sequence: O = (o_1, ..., o_T), where each o_t ∈ K

Page 26

Crucial ingredients (additional)

Initial state probabilities: Π = {π_i}, i ∈ S

(these tell us the initial probability of each state)

Symbol emission probabilities: B = {b_ijk}, i, j ∈ S, k ∈ K

(these tell us the probability b of seeing observation O_t = k, given that X_t = s_i and X_{t+1} = s_j)
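Putting the familiar and additional ingredients together, a sketch of the full specification as a Python structure (the field names are mine; the components are the slides'):

```python
# An HMM as specified on these two slides. Note the arc-emission
# convention: b_ijk conditions the emission on both X_t = s_i
# and X_{t+1} = s_j.
from dataclasses import dataclass

@dataclass
class HMM:
    S: list    # hidden states {s_1, ..., s_N}
    K: list    # output alphabet {k_1, ..., k_M}
    Pi: dict   # Pi[i]        = initial probability of state i
    A: dict    # A[i][j]      = transition probability a_ij
    B: dict    # B[(i, j)][k] = emission probability b_ijk
```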

Page 27

Trellis diagram of an HMM

[Figure: three states s_1, s_2, s_3, with the transitions a_{1,1}, a_{1,2}, a_{1,3} out of s_1]

Page 28

Trellis diagram of an HMM

[Figure: the same trellis unrolled over time, with observation sequence o_1, o_2, o_3 at times t_1, t_2, t_3]

Page 29

Trellis diagram of an HMM

[Figure: the unrolled trellis with emission probabilities b_{1,1,k}, b_{1,2,k}, b_{1,3,k} added alongside the transitions]

Page 30

The fundamental questions for HMMs

1. Given a model μ = (A, B, Π), how do we compute the likelihood of an observation sequence, P(O | μ)?

2. Given an observation sequence O and a model μ, which state sequence (X_1, ..., X_{T+1}) best explains the observations? This is the decoding problem.

3. Given an observation sequence O and a space of possible models μ = (A, B, Π), which model best explains the observed data?
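As a preview of the standard answer to question 1, here is a sketch of the forward algorithm. Two simplifying assumptions are flagged in the comments: it uses state-emission probabilities B[s][k] rather than the arc-emission b_ijk above, and the demo assumes a uniform initial distribution.

```python
# The forward algorithm: computes P(O | mu) in O(T * N^2) time instead of
# summing over all N^T state sequences explicitly.
def forward_likelihood(O, S, Pi, A, B):
    # alpha[s] = P(o_1 ... o_t, X_t = s) after processing t observations
    alpha = {s: Pi[s] * B[s][O[0]] for s in S}
    for o in O[1:]:
        alpha = {j: sum(alpha[i] * A[i][j] for i in S) * B[j][o] for j in S}
    return sum(alpha.values())

# Demo with the weather/umbrella numbers from the earlier slides:
A = {"sunny": {"sunny": 0.8, "rainy": 0.05, "foggy": 0.15},
     "rainy": {"sunny": 0.2, "rainy": 0.6,  "foggy": 0.2},
     "foggy": {"sunny": 0.2, "rainy": 0.3,  "foggy": 0.5}}
B = {"sunny": {"+umbrella": 0.1, "-umbrella": 0.9},
     "rainy": {"+umbrella": 0.8, "-umbrella": 0.2},
     "foggy": {"+umbrella": 0.3, "-umbrella": 0.7}}
Pi = {s: 1 / 3 for s in A}  # assumed uniform start
print(forward_likelihood(["+umbrella", "+umbrella", "-umbrella"], list(A), Pi, A, B))
```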

Page 31

Application of question 1 (ASR)

Given a model μ = (A, B, Π), how do we compute the likelihood of an observation sequence, P(O | μ)?

The input to an ASR system is a continuous stream of sound waves, which is ambiguous. We need to decode it into a sequence of phones:

is the input the sequence [n iy d] or [n iy]?

which sequence is the most probable?

Page 32

Application of question 2 (POS Tagging)

Given an observation sequence O and a model μ, which state sequence (X_1, ..., X_{T+1}) best explains the observations? This is the decoding problem.

Consider a POS tagger with the input observation sequence:

I can read

We need to find the most likely sequence of underlying POS tags:

e.g. is can a modal verb, or the noun?

how likely is it that can is a noun, given that the previous word is a pronoun?
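A sketch of the decoding step itself (the Viterbi algorithm, covered in the next lecture), under the same simplified state-emission assumption as the forward sketch above. In a real tagger, Pi, A and B would be estimated from a training corpus.

```python
# Viterbi decoding: the single most probable hidden state sequence for O.
def viterbi(O, S, Pi, A, B):
    # delta[s] = probability of the best path ending in state s
    delta = {s: Pi[s] * B[s][O[0]] for s in S}
    backpointers = []
    for o in O[1:]:
        back, new_delta = {}, {}
        for j in S:
            best = max(S, key=lambda i: delta[i] * A[i][j])
            back[j] = best
            new_delta[j] = delta[best] * A[best][j] * B[j][o]
        backpointers.append(back)
        delta = new_delta
    # Follow the back-pointers from the best final state to recover the path.
    path = [max(delta, key=delta.get)]
    for back in reversed(backpointers):
        path.append(back[path[-1]])
    return list(reversed(path))
```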

Page 33

Summary

HMMs are a way of representing sequences of observations arising from sequences of states; the states are the variables of interest, giving rise to the observations.

Next up: algorithms for answering the fundamental questions about HMMs.