
© Eric Xing @ CMU, 2006-2010

Machine Learning

Mixture Model, HMM, and Expectation Maximization

Eric Xing

Lecture 9, August 14, 2010

Reading:


Gaussian Discriminative Analysis
(Graphical model: z_i → x_i, repeated for i = 1, ..., N.)

Data log-likelihood:

\ell(\theta; D) = \log \prod_n p(z_n, x_n) = \log \prod_n p(z_n \mid \pi)\, p(x_n \mid z_n, \mu, \sigma)
               = \sum_n \log \prod_k \pi_k^{z_n^k} + \sum_n \log \prod_k N(x_n; \mu_k, \sigma)^{z_n^k}
               = \sum_n \sum_k z_n^k \log \pi_k - \sum_n \sum_k z_n^k \frac{1}{2\sigma^2}(x_n - \mu_k)^2 + C

MLE:

\hat{\pi}_{k,MLE} = \arg\max_\pi \ell(\theta; D)
\hat{\mu}_{k,MLE} = \arg\max_\mu \ell(\theta; D) \;\Rightarrow\; \hat{\mu}_{k,MLE} = \frac{\sum_n z_n^k x_n}{\sum_n z_n^k}
\hat{\sigma}_{k,MLE} = \arg\max_\sigma \ell(\theta; D)

What if we do not know z_n? The class posterior follows from Bayes' rule:

p(y_n^k = 1 \mid x_n, \mu, \sigma) = \frac{\pi_k\, (2\pi\sigma^2)^{-1/2} \exp\{-\frac{1}{2\sigma^2}(x_n - \mu_k)^2\}}{\sum_{k'} \pi_{k'}\, (2\pi\sigma^2)^{-1/2} \exp\{-\frac{1}{2\sigma^2}(x_n - \mu_{k'})^2\}}


Clustering


Unobserved Variables A variable can be unobserved (latent) because:

it is an imaginary quantity meant to provide some simplified and abstractive view of the data generation process e.g., speech recognition models, mixture models …

it is a real-world object and/or phenomena, but difficult or impossible to measure e.g., the temperature of a star, causes of a disease, evolutionary ancestors …

it is a real-world object and/or phenomenon that sometimes wasn't measured because of faulty sensors, or was measured with a noisy channel, etc.; e.g., traffic radio, aircraft signal on a radar screen.

Discrete latent variables can be used to partition/cluster data into sub-groups (mixture models, forthcoming).

Continuous latent variables (factors) can be used for dimensionality reduction (factor analysis, etc., later lectures).


Mixture Models A density model p(x) may be multi-modal. We may be able to model it as a mixture of uni-modal

distributions (e.g., Gaussians). Each mode may correspond to a different sub-population

(e.g., male and female).


Gaussian Mixture Models (GMMs) Consider a mixture of K Gaussian components:

Z is a latent class indicator vector:

X is a conditional Gaussian variable with a class-specific mean/covariance

The likelihood of a sample:

p(z_n) = \mathrm{multi}(z_n : \pi) = \prod_k (\pi_k)^{z_n^k}

p(x_n \mid z_n^k = 1, \mu, \Sigma) = \frac{1}{(2\pi)^{m/2} |\Sigma_k|^{1/2}} \exp\{-\tfrac{1}{2}(x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k)\}

p(x_n \mid \mu, \Sigma) = \sum_{z_n} p(z_n \mid \pi)\, p(x_n \mid z_n, \mu, \Sigma) = \sum_k \underbrace{\pi_k}_{\text{mixture proportion}}\; \underbrace{N(x_n \mid \mu_k, \Sigma_k)}_{\text{mixture component}}

(Graphical model: Z → X.)


Gaussian Mixture Models (GMMs) Consider a mixture of K Gaussian components:

This model can be used for unsupervised clustering. This model (fit by AutoClass) has been used to discover new kinds of stars in

astronomical data, etc.

p(x_n \mid \pi, \mu, \Sigma) = \sum_k \underbrace{\pi_k}_{\text{mixture proportion}}\; \underbrace{N(x_n \mid \mu_k, \Sigma_k)}_{\text{mixture component}}


Learning mixture models Given data

Likelihood:

L(\pi, \mu, \Sigma; D) = \prod_n p(x_n \mid \pi, \mu, \Sigma) = \prod_n \sum_k \pi_k\, N(x_n \mid \mu_k, \Sigma_k)

(\pi^*, \mu^*, \Sigma^*) = \arg\max L(\pi, \mu, \Sigma; D)
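As a concrete illustration (not from the slides), here is a minimal numpy/scipy sketch that evaluates this likelihood, i.e. the GMM log-likelihood \sum_n \log \sum_k \pi_k N(x_n \mid \mu_k, \sigma_k^2), on made-up 1-D data and parameters:

```python
import numpy as np
from scipy.stats import norm

# Toy 1-D data and made-up parameters (K = 2 components).
x = np.array([-1.2, -0.8, 0.1, 3.9, 4.2, 4.5])
pi = np.array([0.5, 0.5])      # mixture proportions
mu = np.array([-1.0, 4.0])     # component means
sigma = np.array([0.8, 0.5])   # component standard deviations

# p(x_n) = sum_k pi_k N(x_n | mu_k, sigma_k^2); then sum the logs over n.
component_densities = pi * norm.pdf(x[:, None], loc=mu, scale=sigma)  # shape (N, K)
log_likelihood = np.sum(np.log(component_densities.sum(axis=1)))
print(log_likelihood)
```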


Why is Learning Harder? In fully observed iid settings, the log likelihood decomposes

into a sum of local terms.

With latent variables, all the parameters become coupled together via marginalization

\ell_c(\theta; D) = \log p(x, z \mid \theta) = \log p(z \mid \theta_z) + \log p(x \mid z, \theta_x)

\ell(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(z \mid \theta_z)\, p(x \mid z, \theta_x)


Recall MLE for completely observed data

Data log-likelihood:

\ell(\theta; D) = \log \prod_n p(z_n, x_n) = \log \prod_n p(z_n \mid \pi)\, p(x_n \mid z_n, \mu, \sigma)
               = \sum_n \log \prod_k \pi_k^{z_n^k} + \sum_n \log \prod_k N(x_n; \mu_k, \sigma)^{z_n^k}
               = \sum_n \sum_k z_n^k \log \pi_k - \sum_n \sum_k z_n^k \frac{1}{2\sigma^2}(x_n - \mu_k)^2 + C

MLE:

\hat{\pi}_{k,MLE} = \arg\max_\pi \ell(\theta; D)
\hat{\mu}_{k,MLE} = \arg\max_\mu \ell(\theta; D) \;\Rightarrow\; \hat{\mu}_{k,MLE} = \frac{\sum_n z_n^k x_n}{\sum_n z_n^k}
\hat{\sigma}_{k,MLE} = \arg\max_\sigma \ell(\theta; D)

What if we do not know z_n? Toward the EM algorithm ...
(Graphical model: z_i → x_i, repeated for i = 1, ..., N.)


Recall K-means Start:

"Guess" the centroid µk and coveriance Σk of each of the K clusters

Loop For each point n=1 to N,

compute its cluster label:

For each cluster k=1:K

z_n^{(t)} = \arg\min_k\, (x_n - \mu_k^{(t)})^T (\Sigma_k^{(t)})^{-1} (x_n - \mu_k^{(t)})

\mu_k^{(t+1)} = \frac{\sum_n \delta(z_n^{(t)}, k)\, x_n}{\sum_n \delta(z_n^{(t)}, k)}

\Sigma_k^{(t+1)} = \dots
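A minimal sketch of this loop (my own illustration, assuming plain Euclidean distance, i.e. Σ_k = I, and toy data):

```python
import numpy as np

def kmeans(X, K, n_iters=20, seed=0):
    """Plain K-means: hard assignment, then mean update (Sigma_k = I assumed)."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), size=K, replace=False)]   # "guess" the centroids
    for _ in range(n_iters):
        # Assignment step: label each point with its nearest centroid.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # (N, K)
        z = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its points.
        mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                       for k in range(K)])
    return z, mu

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + [5, 5]])
labels, centers = kmeans(X, K=2)
print(centers)
```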


Expectation-Maximization Start:

"Guess" the centroid µk and coveriance Σk of each of the K clusters

Loop


─ Expectation step: computing the expected value of the sufficient statistics of the hidden variables (i.e., z) given current est. of the parameters (i.e., π and µ).

Here we are essentially doing inference

E-step:

\tau_n^{k(t)} = \langle z_n^k \rangle_{q^{(t)}} = p(z_n^k = 1 \mid x_n, \mu^{(t)}, \Sigma^{(t)})
             = \frac{\pi_k^{(t)}\, N(x_n \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_i \pi_i^{(t)}\, N(x_n \mid \mu_i^{(t)}, \Sigma_i^{(t)})}


─ Maximization step: compute the parameters under current results of the expected value of the hidden variables

This is isomorphic to MLE except that the variables that are hidden are replaced by their expectations (in general they will be replaced by their corresponding "sufficient statistics").

M-step:

\pi_k^* = \arg\max_\pi \langle \ell_c(\theta) \rangle \ \text{ s.t. } \sum_k \pi_k = 1
  \;\Rightarrow\; \pi_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)}}{N} = \frac{\langle n_k \rangle}{N}

\mu_k^* = \arg\max_\mu \langle \ell_c(\theta) \rangle
  \;\Rightarrow\; \mu_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)} x_n}{\sum_n \tau_n^{k(t)}}

\Sigma_k^* = \arg\max_\Sigma \langle \ell_c(\theta) \rangle
  \;\Rightarrow\; \Sigma_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)} (x_n - \mu_k^{(t+1)})(x_n - \mu_k^{(t+1)})^T}{\sum_n \tau_n^{k(t)}}
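Putting the E-step and M-step together, here is a minimal EM-for-GMM sketch (my own illustration: toy data, random initialization, and a small ridge added to the covariances for numerical stability):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, size=K, replace=False)]
    Sigma = np.array([np.cov(X.T) + 1e-6 * np.eye(D) for _ in range(K)])
    for _ in range(n_iters):
        # E-step: responsibilities tau[n, k] = p(z_n^k = 1 | x_n, theta^(t)).
        dens = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                                for k in range(K)])
        tau = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted MLE with tau in place of the hidden indicators.
        Nk = tau.sum(axis=0)
        pi = Nk / N
        mu = (tau.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (tau[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(D)
    return pi, mu, Sigma, tau

X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + [4, 4]])
pi, mu, Sigma, tau = em_gmm(X, K=2)
print(pi, mu, sep="\n")
```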


How is EM derived? A mixture of K Gaussians:

Z is a latent class indicator vector

X is a conditional Gaussian variable with a class-specific mean/covariance

The likelihood of a sample:

The “complete” likelihood

(Graphical model: Z_n → X_n, repeated for n = 1, ..., N.)

p(z_n) = \mathrm{multi}(z_n : \pi) = \prod_k (\pi_k)^{z_n^k}

p(x_n \mid z_n^k = 1, \mu, \Sigma) = \frac{1}{(2\pi)^{m/2} |\Sigma_k|^{1/2}} \exp\{-\tfrac{1}{2}(x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k)\}

p(x_n \mid \mu, \Sigma) = \sum_{z_n} p(z_n \mid \pi)\, p(x_n \mid z_n, \mu, \Sigma) = \sum_k \pi_k\, N(x_n \mid \mu_k, \Sigma_k)

p(z_n^k = 1, x_n \mid \mu, \Sigma, \pi) = p(z_n^k = 1 \mid \pi)\, p(x_n \mid z_n^k = 1, \mu, \Sigma) = \pi_k\, N(x_n \mid \mu_k, \Sigma_k)

But this is itself a random variable! Not good as an objective function. Instead:

p(x_n, z_n \mid \mu, \Sigma, \pi) = \prod_k \left[ \pi_k\, N(x_n \mid \mu_k, \Sigma_k) \right]^{z_n^k}


How is EM derived? The complete log likelihood:
(Graphical model: Z_n → X_n, repeated for n = 1, ..., N.)

\ell_c(\theta; D) = \log \prod_n p(x_n, z_n) = \log \prod_n p(z_n \mid \pi)\, p(x_n \mid z_n, \mu, \sigma)
                 = \sum_n \sum_k z_n^k \log \pi_k - \sum_n \sum_k z_n^k \frac{1}{2\sigma^2}(x_n - \mu_k)^2 + C

The expected complete log likelihood:

\langle \ell_c(\theta; x, z) \rangle = \sum_n \langle \log p(z_n \mid \pi) \rangle_{p(z|x)} + \sum_n \langle \log p(x_n \mid z_n, \mu, \Sigma) \rangle_{p(z|x)}
  = \sum_n \sum_k \langle z_n^k \rangle \log \pi_k - \frac{1}{2} \sum_n \sum_k \langle z_n^k \rangle \left( (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) + \log |\Sigma_k| + C \right)

We maximize \langle \ell_c(\theta) \rangle iteratively using the above procedure.


Compare: K-means The EM algorithm for mixtures of Gaussians is like a "soft

version" of the K-means algorithm. In the K-means “E-step” we do hard assignment:

In the K-means “M-step” we update the means as the weighted sum of the data, but now the weights are 0 or 1:

z_n^{(t)} = \arg\min_k\, (x_n - \mu_k^{(t)})^T \Sigma_k^{-1} (x_n - \mu_k^{(t)})

\mu_k^{(t+1)} = \frac{\sum_n \delta(z_n^{(t)}, k)\, x_n}{\sum_n \delta(z_n^{(t)}, k)}

whereas in the EM "M-step" the weights are the soft responsibilities \tau_n^{k(t)} = \langle z_n^k \rangle_{q^{(t)}}:

\mu_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)} x_n}{\sum_n \tau_n^{k(t)}}


Theory underlying EM What are we doing?

Recall that according to MLE, we intend to learn the model parameters that would maximize the likelihood of the data.

But we do not observe z, so computing

is difficult!

What shall we do?

\ell(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(z \mid \theta_z)\, p(x \mid z, \theta_x)


Complete & Incomplete Log Likelihoods Complete log likelihood

Let X denote the observable variable(s), and Z denote the latent variable(s). If Z could be observed, then

Usually, optimizing lc() given both z and x is straightforward (c.f. MLE for fully observed models).

Recall that in this case the objective for, e.g., MLE decomposes into a sum of factors, and the parameter for each factor can be estimated separately.

But given that Z is not observed, lc() is a random quantity, cannot be maximized directly.

Incomplete log likelihood: With z unobserved, our objective becomes the log of a marginal probability:

This objective won't decouple

\ell_c(x, z; \theta) \overset{\text{def}}{=} \log p(x, z \mid \theta)

\ell(\theta; x) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta)


Expected Complete Log Likelihood

For any distribution q(z), define the expected complete log likelihood:

\langle \ell_c(\theta; x, z) \rangle_q \overset{\text{def}}{=} \sum_z q(z \mid x) \log p(x, z \mid \theta)

A deterministic function of θ. Linear in \ell_c(\cdot), so it inherits its factorizability. Does maximizing this surrogate yield a maximizer of the likelihood?

Jensen's inequality:

\ell(\theta; x) = \log p(x \mid \theta)
              = \log \sum_z p(x, z \mid \theta)
              = \log \sum_z q(z \mid x) \frac{p(x, z \mid \theta)}{q(z \mid x)}
              \geq \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)}

\Rightarrow \ell(\theta; x) \geq \langle \ell_c(\theta; x, z) \rangle_q + H_q


Lower Bounds and Free Energy For fixed data x, define a functional called the free energy:

F(q, \theta) \overset{\text{def}}{=} \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \leq \ell(\theta; x)

The EM algorithm is coordinate-ascent on F:

E-step: q^{t+1} = \arg\max_q F(q, \theta^t)

M-step: \theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta^t)


E-step: maximization of expected lc w.r.t. q Claim:

q^{t+1} = \arg\max_q F(q, \theta^t) = p(z \mid x, \theta^t)

This is the posterior distribution over the latent variables given the data and the parameters. Often we need this at test time anyway (e.g. to perform classification).

Proof (easy): this setting attains the bound \ell(\theta; x) \geq F(q, \theta):

F(p(z \mid x, \theta^t), \theta^t) = \sum_z p(z \mid x, \theta^t) \log \frac{p(x, z \mid \theta^t)}{p(z \mid x, \theta^t)}
  = \sum_z p(z \mid x, \theta^t) \log p(x \mid \theta^t)
  = \log p(x \mid \theta^t) = \ell(\theta^t; x)

Can also show this result using variational calculus or the fact that

\ell(\theta; x) - F(q, \theta) = \mathrm{KL}\big( q \,\|\, p(z \mid x, \theta) \big)


E-step ≡ plug in posterior expectation of latent variables. Without loss of generality, assume that p(x, z | θ) is a generalized exponential family distribution:

p(x, z \mid \theta) = \frac{1}{Z(\theta)}\, h(x, z) \exp\Big\{ \sum_i \theta_i f_i(x, z) \Big\}

Special case: if p(X | Z) are GLIMs, then f_i(x, z) = \eta_i^T(z)\, \xi_i(x).

The expected complete log likelihood under q^{t+1} = p(z \mid x, \theta^t) is

\langle \ell_c(\theta; x, z) \rangle_{q^{t+1}} = \sum_z q(z \mid x, \theta^t) \log p(x, z \mid \theta)
  = \sum_i \langle f_i(x, z) \rangle_{q(z|x,\theta^t)}\, \theta_i - A(\theta)
  \overset{\text{GLIM}}{=} \sum_i \eta_i^T\big( \langle z \rangle_{q(z|x,\theta^t)} \big)\, \xi_i(x) - A(\theta)


M-step: maximization of expected lc w.r.t. θ Note that the free energy breaks into two terms:

The first term is the expected complete log likelihood (energy) and the second term, which does not depend on θ, is the entropy.

Thus, in the M-step, maximizing with respect to θ for fixed q, we only need to consider the first term:

Under optimal qt+1, this is equivalent to solving a standard MLE of fully observed model p(x,z|θ), with the sufficient statistics involving z replaced by their expectations w.r.t. p(z|x,θ).

F(q, \theta) = \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)}
            = \sum_z q(z \mid x) \log p(x, z \mid \theta) - \sum_z q(z \mid x) \log q(z \mid x)
            = \langle \ell_c(\theta; x, z) \rangle_q + H_q

\theta^{t+1} = \arg\max_\theta \langle \ell_c(\theta; x, z) \rangle_{q^{t+1}} = \arg\max_\theta \sum_z q^{t+1}(z \mid x) \log p(x, z \mid \theta)


Summary: EM Algorithm A way of maximizing likelihood function for latent variable

models. Finds MLE of parameters when the original (hard) problem can be broken up into two (easy) pieces:

1. Estimate some “missing” or “unobserved” data from observed data and current parameters.

2. Using this “complete” data, find the maximum likelihood parameter estimates.

Alternate between filling in the latent variables using the best guess (posterior) and updating the parameters based on this guess:

E-step: q^{t+1} = \arg\max_q F(q, \theta^t)

M-step: \theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta^t)

In the M-step we optimize a lower bound on the likelihood. In the E-step we close the gap, making bound = likelihood.


From static to dynamic mixture models

Dynamic mixture (graphical model: hidden chain Y1 → Y2 → ... → YT, each Yt emitting an observation Xt)

Static mixture (graphical model: Y1 → X1, repeated N times)

The sequence (observations): speech signal, sequence of rolls, ...
The underlying source (hidden states): phonemes, dice, ...


Predicting Tumor Cell States

Chromosomes of tumor cell: (figure)


DNA copy number aberration types in breast cancer (figure captions):
- Copy number profile for chromosome 1 from the 600 MPE cell line
- Copy number profile for chromosome 8 from the COLO320 cell line: 60-70 fold amplification of the CMYC region
- Copy number profile for chromosome 8 in the MDA-MB-231 cell line: deletion


A real CGH run


Hidden Markov Model
(Graphical model: hidden chain y_1 → y_2 → ... → y_T with emissions x_1, x_2, ..., x_T; equivalently a state automaton over the hidden states.)

Observation space: alphabetic set C = \{c_1, c_2, \ldots, c_K\}, or Euclidean space R^d

Index set of hidden states: I = \{1, 2, \ldots, M\}

Transition probabilities between any two states:
p(y_t^j = 1 \mid y_{t-1}^i = 1) = a_{i,j}, \quad \text{or} \quad p(y_t \mid y_{t-1}^i = 1) \sim \mathrm{Multinomial}(a_{i,1}, a_{i,2}, \ldots, a_{i,M}), \ \forall i \in I

Start probabilities: p(y_1) \sim \mathrm{Multinomial}(\pi_1, \pi_2, \ldots, \pi_M)

Emission probabilities associated with each state:
p(x_t \mid y_t^i = 1) \sim \mathrm{Multinomial}(b_{i,1}, b_{i,2}, \ldots, b_{i,K}), \ \forall i \in I
or in general: p(x_t \mid y_t^i = 1) \sim f(\cdot \mid \theta_i), \ \forall i \in I


The Dishonest Casino

A casino has two dice:
Fair die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
Loaded die: P(1) = P(2) = P(3) = P(4) = P(5) = 1/10, P(6) = 1/2

Casino player switches back-&-forth between fair and loaded die once every 20 turns

Game:
1. You bet $1
2. You roll (always with a fair die)
3. Casino player rolls (maybe with fair die, maybe with loaded die)
4. Highest number wins $2


The Dishonest Casino Model

States: FAIR, LOADED
Transitions: P(LOADED | FAIR) = 0.05, P(FAIR | FAIR) = 0.95, P(FAIR | LOADED) = 0.05, P(LOADED | LOADED) = 0.95

Emissions:
P(1|F) = P(2|F) = P(3|F) = P(4|F) = P(5|F) = P(6|F) = 1/6
P(1|L) = P(2|L) = P(3|L) = P(4|L) = P(5|L) = 1/10, P(6|L) = 1/2


Puzzles Regarding the Dishonest Casino

GIVEN: A sequence of rolls by the casino player

1245526462146146136136661664661636616366163616515615115146123562344

QUESTION How likely is this sequence, given our model of how the casino

works? This is the EVALUATION problem in HMMs

What portion of the sequence was generated with the fair die, and what portion with the loaded die? This is the DECODING question in HMMs

How “loaded” is the loaded die? How “fair” is the fair die? How often does the casino player change from fair to loaded, and back? This is the LEARNING question in HMMs


Joint Probability

1245526462146146136136661664661636616366163616515615115146123562344


Probability of a Parse Given a sequence x = x1……xT

and a parse y = y1, ……, yT, we want to find how likely the parse is:

(given our HMM and the sequence)

p(x, y) = p(x_1 \ldots x_T, y_1, \ldots, y_T)   (joint probability)
        = p(y_1)\, p(x_1 \mid y_1)\, p(y_2 \mid y_1)\, p(x_2 \mid y_2) \cdots p(y_T \mid y_{T-1})\, p(x_T \mid y_T)
        = p(y_1)\, p(y_2 \mid y_1) \cdots p(y_T \mid y_{T-1}) \times p(x_1 \mid y_1)\, p(x_2 \mid y_2) \cdots p(x_T \mid y_T)

Marginal probability:

p(x) = \sum_y p(x, y) = \sum_{y_1} \sum_{y_2} \cdots \sum_{y_T} \pi_{y_1} \prod_{t=2}^{T} a_{y_{t-1}, y_t} \prod_{t=1}^{T} p(x_t \mid y_t)

Posterior probability:

p(y \mid x) = p(x, y) / p(x)


Example: the Dishonest Casino Let the sequence of rolls be:

x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4

Then, what is the likelihood of y = Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair, Fair?

(say initial probs a_{0,Fair} = ½, a_{0,Loaded} = ½)

½ × P(1 | Fair) P(Fair | Fair) P(2 | Fair) P(Fair | Fair) … P(4 | Fair)
= ½ × (1/6)^10 × (0.95)^9 = 0.00000000521158647211 ≈ 5.21 × 10^-9
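A quick check of this arithmetic in Python (just re-multiplying the factors on the slide):

```python
# ½ · (1/6)^10 · (0.95)^9: the all-fair parse of the 10 rolls above.
p_fair = 0.5 * (1/6)**10 * 0.95**9
print(p_fair)   # ≈ 5.21e-09
```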


Example: the Dishonest Casino So, the likelihood the die is fair in all this run

is just 5.21 × 10^-9

OK, but what is the likelihood of π = Loaded, Loaded, Loaded, Loaded, Loaded, Loaded, Loaded,

Loaded, Loaded, Loaded?

½ × P(1 | Loaded) P(Loaded | Loaded) … P(4 | Loaded)
= ½ × (1/10)^8 × (1/2)^2 × (0.95)^9 = 0.00000000078781176215 ≈ 0.79 × 10^-9

Therefore, it is after all 6.59 times more likely that the die is fair all the way, than that it is loaded all the way


Example: the Dishonest Casino Let the sequence of rolls be:

x = 1, 6, 6, 5, 6, 2, 6, 6, 3, 6

Now, what is the likelihood of π = F, F, …, F?
½ × (1/6)^10 × (0.95)^9 ≈ 5.21 × 10^-9, same as before

What is the likelihood of y = L, L, …, L?

½ × (1/10)^4 × (1/2)^6 × (0.95)^9 = 0.00000049238235134735 ≈ 5 × 10^-7

So, it is 100 times more likely the die is loaded


Three Main Questions on HMMs

1. Evaluation
   GIVEN an HMM M, and a sequence x,
   FIND Prob(x | M)
   ALGO. Forward

2. Decoding
   GIVEN an HMM M, and a sequence x,
   FIND the sequence y of states that maximizes, e.g., P(y | x, M), or the most probable subsequence of states
   ALGO. Viterbi, Forward-backward

3. Learning
   GIVEN an HMM M, with unspecified transition/emission probs., and a sequence x,
   FIND parameters θ = (π_i, a_ij, η_ik) that maximize P(x | θ)
   ALGO. Baum-Welch (EM)


Applications of HMMs

Some early applications of HMMs:
- finance, but we never saw them
- speech recognition
- modelling ion channels

In the mid-late 1980s HMMs entered genetics and molecular biology, and they are now firmly entrenched.

Some current applications of HMMs to biology:
- mapping chromosomes
- aligning biological sequences
- predicting sequence structure
- inferring evolutionary relationships
- finding genes in DNA sequence


The Forward Algorithm We want to calculate P(x), the likelihood of x, given the HMM.

Sum over all possible ways of generating x:

p(x) = \sum_y p(x, y) = \sum_{y_1} \sum_{y_2} \cdots \sum_{y_T} \pi_{y_1} \prod_{t=2}^{T} a_{y_{t-1}, y_t} \prod_{t=1}^{T} p(x_t \mid y_t)

To avoid summing over an exponential number of paths y, define the forward probability

\alpha_t^k \overset{\text{def}}{=} \alpha_t(y_t^k = 1) = P(x_1, \ldots, x_t, y_t^k = 1)

The recursion:

\alpha_t^k = p(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k} \qquad P(x) = \sum_k \alpha_T^k


The Forward Algorithm – derivation. Compute the forward probability:

\alpha_t^k = P(x_1, \ldots, x_{t-1}, x_t, y_t^k = 1)
  = \sum_{y_{t-1}} P(x_1, \ldots, x_{t-1}, y_{t-1}, x_t, y_t^k = 1)
  = \sum_{y_{t-1}} P(x_1, \ldots, x_{t-1}, y_{t-1})\, P(y_t^k = 1 \mid y_{t-1}, x_1, \ldots, x_{t-1})\, P(x_t \mid y_t^k = 1, x_1, \ldots, x_{t-1}, y_{t-1})
  = \sum_{y_{t-1}} P(x_1, \ldots, x_{t-1}, y_{t-1})\, P(y_t^k = 1 \mid y_{t-1})\, P(x_t \mid y_t^k = 1)
  = \sum_i P(x_1, \ldots, x_{t-1}, y_{t-1}^i = 1)\, P(y_t^k = 1 \mid y_{t-1}^i = 1)\, P(x_t \mid y_t^k = 1)
  = P(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k}

(Chain rule: P(A, B, C) = P(A)\, P(B \mid A)\, P(C \mid A, B).)


The Forward Algorithm We can compute \alpha_t^k for all k, t, using dynamic programming!

Initialization:

\alpha_1^k = P(x_1, y_1^k = 1) = P(x_1 \mid y_1^k = 1)\, P(y_1^k = 1) = P(x_1 \mid y_1^k = 1)\, \pi_k

Iteration:

\alpha_t^k = P(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k}

Termination:

P(x) = \sum_k \alpha_T^k
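A minimal numpy sketch of this dynamic program (the function name, 0-based indexing, and the casino parameters below are my own illustration, not part of the slides):

```python
import numpy as np

def forward(pi, A, B, x):
    """alpha[t, k] = P(x_1..x_{t+1}, y_{t+1} = k) with 0-based t; returns alpha and P(x)."""
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K))
    alpha[0] = pi * B[:, x[0]]                      # initialization
    for t in range(1, T):
        alpha[t] = B[:, x[t]] * (alpha[t - 1] @ A)  # iteration
    return alpha, alpha[-1].sum()                   # termination: P(x) = sum_k alpha_T^k

# Dishonest-casino HMM: states 0 = FAIR, 1 = LOADED; observations are die faces 1..6.
pi = np.array([0.5, 0.5])
A = np.array([[0.95, 0.05],
              [0.05, 0.95]])
B = np.array([[1/6] * 6,
              [1/10] * 5 + [1/2]])
rolls = np.array([1, 2, 1, 5, 6, 2, 1, 6, 2, 4]) - 1    # 0-based symbols
alpha, px = forward(pi, A, B, rolls)
print(alpha[:3], px)
```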


The Backward Algorithm We want to compute P(y_t^k = 1 \mid x), the posterior probability distribution on the t-th position, given x.

We start by computing

P(y_t^k = 1, x) = P(x_1, \ldots, x_t, y_t^k = 1, x_{t+1}, \ldots, x_T)
  = P(x_1, \ldots, x_t, y_t^k = 1)\, P(x_{t+1}, \ldots, x_T \mid x_1, \ldots, x_t, y_t^k = 1)
  = P(x_1, \ldots, x_t, y_t^k = 1)\, P(x_{t+1}, \ldots, x_T \mid y_t^k = 1)

Forward: \alpha_t^k = P(x_1, \ldots, x_t, y_t^k = 1). Backward: \beta_t^k \overset{\text{def}}{=} P(x_{t+1}, \ldots, x_T \mid y_t^k = 1)

The recursion:

\beta_t^k = \sum_i a_{k,i}\, p(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i
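A matching sketch of the backward recursion (same illustrative casino setup as in the forward sketch above):

```python
import numpy as np

def backward(A, B, x):
    """beta[t, k] = P(x_{t+2}..x_T | y_{t+1} = k), 0-based; beta[T-1] = 1."""
    T, K = len(x), A.shape[0]
    beta = np.ones((T, K))
    for t in range(T - 2, -1, -1):
        # beta_t^k = sum_i a_{k,i} p(x_{t+1} | y_{t+1} = i) beta_{t+1}^i
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
    return beta

A = np.array([[0.95, 0.05], [0.05, 0.95]])
B = np.array([[1/6] * 6, [1/10] * 5 + [1/2]])
rolls = np.array([1, 2, 1, 5, 6, 2, 1, 6, 2, 4]) - 1
beta = backward(A, B, rolls)
print(beta[-3:])   # the last rows approach 1, matching the beta table two slides below
```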


Example:

(Casino HMM as before: FAIR ↔ LOADED with switch probability 0.05 and stay probability 0.95; P(i|F) = 1/6 for i = 1..6; P(i|L) = 1/10 for i = 1..5, P(6|L) = 1/2.)

x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4

\alpha_t^k = p(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k} \qquad \beta_t^k = \sum_i a_{k,i}\, p(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i


Alpha (actual)              Beta (actual)
 t    FAIR    LOADED         t    FAIR    LOADED
 1    0.0833  0.0500         1    0.0000  0.0000
 2    0.0136  0.0052         2    0.0000  0.0000
 3    0.0022  0.0006         3    0.0000  0.0000
 4    0.0004  0.0001         4    0.0000  0.0000
 5    0.0001  0.0000         5    0.0001  0.0001
 6    0.0000  0.0000         6    0.0007  0.0006
 7    0.0000  0.0000         7    0.0045  0.0055
 8    0.0000  0.0000         8    0.0264  0.0112
 9    0.0000  0.0000         9    0.1633  0.1033
10    0.0000  0.0000        10    1.0000  1.0000

(Casino HMM as before; x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4.)

\alpha_t^k = p(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k} \qquad \beta_t^k = \sum_i a_{k,i}\, p(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i


Alpha (logs)                   Beta (logs)
 t     FAIR      LOADED         t     FAIR      LOADED
 1    -2.4849   -2.9957         1   -16.2439  -17.2014
 2    -4.2969   -5.2655         2   -14.4185  -14.9922
 3    -6.1201   -7.4896         3   -12.6028  -12.7337
 4    -7.9499   -9.6553         4   -10.8042  -10.4389
 5    -9.7834  -10.1454         5    -9.0373   -9.7289
 6   -11.5905  -12.4264         6    -7.2181   -7.4833
 7   -13.4110  -14.6657         7    -5.4135   -5.1977
 8   -15.2391  -15.2407         8    -3.6352   -4.4938
 9   -17.0310  -17.5432         9    -1.8120   -2.2698
10   -18.8430  -19.8129        10     0         0

(Casino HMM as before; x = 1, 2, 1, 5, 6, 2, 1, 6, 2, 4.)

\alpha_t^k = p(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k} \qquad \beta_t^k = \sum_i a_{k,i}\, p(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i


What is the probability of a hidden state prediction?


Posterior decoding We can now calculate

P(y_t^k = 1 \mid x) = \frac{P(y_t^k = 1, x)}{P(x)} = \frac{\alpha_t^k \beta_t^k}{P(x)}

Then, we can ask: What is the most likely state at position t of sequence x?

k_t^* = \arg\max_k P(y_t^k = 1 \mid x)

Note that this is the MPA of a single hidden state; what if we want an MPA of the whole hidden state sequence?

Posterior Decoding: \{ y_t^{k_t^*} = 1 : t = 1 \ldots T \}

This is different from the MPA of a whole sequence of hidden states. This can be understood as bit error rate vs. word error rate.

Example: MPA of X? MPA of (X, Y)?

 x  y  P(x,y)
 0  0  0.35
 0  1  0.05
 1  0  0.3
 1  1  0.3

(Here the MPA of X alone is x = 1, since P(x=1) = 0.6 > 0.4, while the MPA of the pair (X, Y) is (0, 0).)
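Returning to posterior decoding itself, a small sketch that combines the forward and backward passes as above (the casino parameters are again my own illustration):

```python
import numpy as np

def posterior_decode(pi, A, B, x):
    """Return gamma[t, k] = P(y_t = k | x) and the per-position argmax states."""
    T, K = len(x), len(pi)
    alpha = np.zeros((T, K)); beta = np.ones((T, K))
    alpha[0] = pi * B[:, x[0]]
    for t in range(1, T):
        alpha[t] = B[:, x[t]] * (alpha[t - 1] @ A)
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
    gamma = alpha * beta / alpha[-1].sum()    # P(y_t = k | x) = alpha_t^k beta_t^k / P(x)
    return gamma, gamma.argmax(axis=1)

pi = np.array([0.5, 0.5])
A = np.array([[0.95, 0.05], [0.05, 0.95]])
B = np.array([[1/6] * 6, [1/10] * 5 + [1/2]])
rolls = np.array([1, 2, 1, 5, 6, 2, 1, 6, 2, 4]) - 1
gamma, states = posterior_decode(pi, A, B, rolls)
print(states)    # 0 = FAIR, 1 = LOADED at each position
```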


Viterbi decoding GIVEN x = x_1, …, x_T, we want to find y = y_1, …, y_T, such that P(y|x) is maximized:

y^* = \arg\max_y P(y \mid x) = \arg\max_y P(y, x)

Let

V_t^k = \max_{y_1, \ldots, y_{t-1}} P(x_1, \ldots, x_{t-1}, y_1, \ldots, y_{t-1}, x_t, y_t^k = 1)

= the probability of the most likely sequence of states ending at state y_t = k. (Recall p(x_1, \ldots, x_t, y_1, \ldots, y_t) = \pi_{y_1} a_{y_1, y_2} \cdots a_{y_{t-1}, y_t} b_{y_1, x_1} \cdots b_{y_t, x_t}.)

The recursion:

V_t^k = p(x_t \mid y_t^k = 1) \max_i a_{i,k} V_{t-1}^i

Underflows are a significant problem: these numbers become extremely small. Solution: take the logs of all values:

\log V_t^k = \log p(x_t \mid y_t^k = 1) + \max_i \left( \log a_{i,k} + \log V_{t-1}^i \right)

(Trellis figure: states 1, 2, ..., K against positions x_1, x_2, x_3, ..., x_N.)
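A log-space Viterbi sketch (again with the illustrative casino parameters; the backtracking pointers are the standard bookkeeping, not spelled out on the slide):

```python
import numpy as np

def viterbi(pi, A, B, x):
    """Most probable state path argmax_y P(y, x), computed with log V to avoid underflow."""
    T, K = len(x), len(pi)
    logV = np.zeros((T, K)); back = np.zeros((T, K), dtype=int)
    logV[0] = np.log(pi) + np.log(B[:, x[0]])
    for t in range(1, T):
        # log V_t^k = log p(x_t | y_t = k) + max_i (log a_{i,k} + log V_{t-1}^i)
        scores = logV[t - 1][:, None] + np.log(A)      # entry (i, k)
        back[t] = scores.argmax(axis=0)
        logV[t] = np.log(B[:, x[t]]) + scores.max(axis=0)
    # Backtrack the most likely path.
    path = np.zeros(T, dtype=int)
    path[-1] = logV[-1].argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path, logV[-1].max()

pi = np.array([0.5, 0.5])
A = np.array([[0.95, 0.05], [0.05, 0.95]])
B = np.array([[1/6] * 6, [1/10] * 5 + [1/2]])
rolls = np.array([1, 2, 1, 5, 6, 2, 1, 6, 2, 4]) - 1
path, logp = viterbi(pi, A, B, rolls)
print(path, logp)
```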


Computational Complexity and implementation details What is the running time, and space required, for Forward,

and Backward?

Time: O(K²N); Space: O(KN).

Useful implementation techniques to avoid underflows:
- Viterbi: sum of logs
- Forward/Backward: rescaling at each position by multiplying by a constant

\alpha_t^k = p(x_t \mid y_t^k = 1) \sum_i \alpha_{t-1}^i a_{i,k}

\beta_t^k = \sum_i a_{k,i}\, p(x_{t+1} \mid y_{t+1}^i = 1)\, \beta_{t+1}^i

V_t^k = p(x_t \mid y_t^k = 1) \max_i a_{i,k} V_{t-1}^i
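One way to see the rescaling trick: divide each α_t by a constant c_t (e.g., its sum) and accumulate log c_t; the recursion is unchanged, the numbers stay in range, and \sum_t \log c_t recovers log P(x). A minimal sketch (my own illustration):

```python
import numpy as np

def forward_scaled(pi, A, B, x):
    """Scaled forward pass: returns normalized alpha-hat and log P(x) = sum_t log c_t."""
    T, K = len(x), len(pi)
    alpha_hat = np.zeros((T, K))
    log_px = 0.0
    a = pi * B[:, x[0]]
    for t in range(T):
        if t > 0:
            a = B[:, x[t]] * (alpha_hat[t - 1] @ A)
        c = a.sum()                 # rescaling constant at position t
        alpha_hat[t] = a / c
        log_px += np.log(c)
    return alpha_hat, log_px

pi = np.array([0.5, 0.5])
A = np.array([[0.95, 0.05], [0.05, 0.95]])
B = np.array([[1/6] * 6, [1/10] * 5 + [1/2]])
rolls = np.array([1, 2, 1, 5, 6, 2, 1, 6, 2, 4]) - 1
print(forward_scaled(pi, A, B, rolls)[1])   # log P(x), safe even for long sequences
```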


(Homework!)

Learning HMM Given x = x1…xN for which the true state path y = y1…yN is

known,

Define:
A_ij = # times state transition i→j occurs in y
B_ik = # times state i in y emits k in x

We can show that the maximum likelihood parameters θ are:

What if the observations are continuous? We can treat {x_{n,t}} as N×T observations of, e.g., a Gaussian, and apply the learning rules for the Gaussian …

a_{ij}^{ML} = \frac{\#(i \to j)}{\#(i \to \cdot)} = \frac{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i\, y_{n,t}^j}{\sum_n \sum_{t=2}^{T} y_{n,t-1}^i} = \frac{A_{ij}}{\sum_{j'} A_{ij'}}

b_{ik}^{ML} = \frac{\#(i \to k)}{\#(i \to \cdot)} = \frac{\sum_n \sum_{t=1}^{T} y_{n,t}^i\, x_{n,t}^k}{\sum_n \sum_{t=1}^{T} y_{n,t}^i} = \frac{B_{ik}}{\sum_{k'} B_{ik'}}

where the data are \{ (x_{n,t}, y_{n,t}) : t = 1, \ldots, T,\ n = 1, \ldots, N \}.

(Homework!)

\ell(\theta; x, y) = \log p(x, y) = \log \prod_n \Big( p(y_{n,1}) \prod_{t=2}^{T} p(y_{n,t} \mid y_{n,t-1}) \prod_{t=1}^{T} p(x_{n,t} \mid y_{n,t}) \Big)
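A sketch of these counting estimates for fully observed paths (toy sequences of my own; zero counts would need smoothing in practice, which is not done here):

```python
import numpy as np

def supervised_hmm_mle(xs, ys, M, K):
    """MLE from observed state paths: a_ij = A_ij / sum_j' A_ij', b_ik = B_ik / sum_k' B_ik'."""
    A = np.zeros((M, M))   # A[i, j] = # times transition i -> j occurs in y
    B = np.zeros((M, K))   # B[i, k] = # times state i emits symbol k in x
    pi = np.zeros(M)
    for x, y in zip(xs, ys):
        pi[y[0]] += 1
        for t in range(1, len(y)):
            A[y[t - 1], y[t]] += 1
        for t in range(len(y)):
            B[y[t], x[t]] += 1
    a = A / A.sum(axis=1, keepdims=True)
    b = B / B.sum(axis=1, keepdims=True)
    return pi / pi.sum(), a, b

# Two toy labelled sequences over 2 states and 3 symbols.
xs = [np.array([0, 1, 2, 2]), np.array([2, 0, 1])]
ys = [np.array([0, 0, 1, 1]), np.array([1, 0, 0])]
print(supervised_hmm_mle(xs, ys, M=2, K=3))
```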


Unsupervised ML estimation Given x = x1…xN for which the true state path y = y1…yN is

unknown,

EXPECTATION MAXIMIZATION

0. Start with our best guess of a model M, parameters θ.
1. Estimate A_ij, B_ik on the training data:
   A_ij = \sum_{n,t} \langle y_{n,t-1}^i y_{n,t}^j \rangle,  B_ik = \sum_{n,t} \langle y_{n,t}^i \rangle x_{n,t}^k
   How? (homework)
2. Update θ according to A_ij, B_ik: now a "supervised learning" problem.
3. Repeat 1 & 2 until convergence.

This is called the Baum-Welch Algorithm.

We can get to a provably more (or equally) likely parameter set θ each iteration.


The Baum Welch algorithm The complete log likelihood:

\ell_c(\theta; x, y) = \log p(x, y) = \log \prod_n \Big( p(y_{n,1}) \prod_{t=2}^{T} p(y_{n,t} \mid y_{n,t-1}) \prod_{t=1}^{T} p(x_{n,t} \mid y_{n,t}) \Big)

The expected complete log likelihood:

\langle \ell_c(\theta; x, y) \rangle = \sum_n \sum_i \langle y_{n,1}^i \rangle_{p(y_{n,1} \mid x_n)} \log \pi_i
  + \sum_n \sum_{t=2}^{T} \sum_{i,j} \langle y_{n,t-1}^i y_{n,t}^j \rangle_{p(y_{n,t-1}, y_{n,t} \mid x_n)} \log a_{i,j}
  + \sum_n \sum_{t=1}^{T} \sum_{i,k} x_{n,t}^k \langle y_{n,t}^i \rangle_{p(y_{n,t} \mid x_n)} \log b_{i,k}

EM. The E step:

\gamma_{n,t}^i = \langle y_{n,t}^i \rangle = p(y_{n,t}^i = 1 \mid x_n)
\xi_{n,t}^{i,j} = \langle y_{n,t-1}^i y_{n,t}^j \rangle = p(y_{n,t-1}^i = 1, y_{n,t}^j = 1 \mid x_n)

The M step ("symbolically" identical to MLE):

\pi_i^{ML} = \frac{\sum_n \gamma_{n,1}^i}{N}
a_{ij}^{ML} = \frac{\sum_n \sum_{t=2}^{T} \xi_{n,t}^{i,j}}{\sum_n \sum_{t=1}^{T-1} \gamma_{n,t}^i}
b_{ik}^{ML} = \frac{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i\, x_{n,t}^k}{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i}
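A compact Baum-Welch sketch tying the E-step (γ and ξ from forward/backward) to the M-step updates above (a single training sequence, made-up random initialization, no convergence check):

```python
import numpy as np

def baum_welch(x, M, K, n_iters=30, seed=0):
    rng = np.random.default_rng(seed)
    pi = np.full(M, 1.0 / M)
    A = rng.dirichlet(np.ones(M), size=M)   # random row-stochastic transition matrix
    B = rng.dirichlet(np.ones(K), size=M)   # random emission matrix
    T = len(x)
    for _ in range(n_iters):
        # E-step: forward/backward, then gamma[t, i] and xi[t, i, j].
        alpha = np.zeros((T, M)); beta = np.ones((T, M))
        alpha[0] = pi * B[:, x[0]]
        for t in range(1, T):
            alpha[t] = B[:, x[t]] * (alpha[t - 1] @ A)
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
        px = alpha[-1].sum()
        gamma = alpha * beta / px                                  # p(y_t = i | x)
        xi = (alpha[:-1, :, None] * A[None, :, :] *
              (B[:, x[1:]].T * beta[1:])[:, None, :]) / px         # p(y_t = i, y_{t+1} = j | x)
        # M-step: "symbolically" the same as the fully observed MLE.
        pi = gamma[0]
        A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B = np.array([gamma[x == k].sum(axis=0) for k in range(K)]).T
        B = B / gamma.sum(axis=0)[:, None]
    return pi, A, B

rolls = np.array([1, 2, 1, 5, 6, 2, 1, 6, 6, 6, 4, 6, 6, 1, 6]) - 1
print(baum_welch(rolls, M=2, K=6))
```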


Summary Modeling hidden transitional trajectories (in discrete state

space, such as cluster label, DNA copy number, dice-choice, etc.) underlying observed sequence data (discrete, such as dice outcomes; or continuous, such as CGH signals)

Useful for parsing, segmenting sequential data. Important HMM computations:

- The joint likelihood of a parse and data can be written as a product of local terms (i.e., initial prob, transition prob, emission prob.)
- Computing the marginal likelihood of the observed sequence: forward algorithm
- Predicting a single hidden state: forward-backward
- Predicting an entire sequence of hidden states: Viterbi
- Learning HMM parameters: an EM algorithm known as Baum-Welch