
Page 1: Title

Machine Learning 10-701/15-781, Fall 2012

Expectation Maximization: Mixture Models and HMMs

Eric Xing
Lecture 12, October 22, 2012
Reading: Chap. 9 and 13, C. Bishop book

© Eric Xing @ CMU, 2006-2012

Page 2: Clustering and partially observable probabilistic models

Page 3: Unobserved Variables

A variable can be unobserved (latent) because:

- It is an imaginary quantity meant to provide a simplified and abstract view of the data generation process (e.g., speech recognition models, mixture models ...).
- It is a real-world object and/or phenomenon, but difficult or impossible to measure (e.g., the temperature of a star, causes of a disease, evolutionary ancestors ...).
- It is a real-world object and/or phenomenon, but sometimes wasn't measured because of faulty sensors, or was measured with a noisy channel, etc. (e.g., traffic radio, aircraft signal on a radar screen).

Discrete latent variables can be used to partition/cluster data into sub-groups (mixture models, forthcoming).

Continuous latent variables (factors) can be used for dimensionality reduction (factor analysis, etc., later lectures).

Page 4: Mixture Models

A density model p(x) may be multi-modal. We may be able to model it as a mixture of uni-modal distributions (e.g., Gaussians). Each mode may correspond to a different sub-population (e.g., male and female).

[Figure: a multi-modal density decomposed into a mixture of Gaussian components.]

Page 5: Gaussian Mixture Models (GMMs)

Consider a mixture of K Gaussian components:

p(x_n \mid \mu, \Sigma) = \sum_k \pi_k \, N(x_n \mid \mu_k, \Sigma_k)

where \pi_k is the mixture proportion and N(x_n \mid \mu_k, \Sigma_k) is the mixture component.

This model can be used for unsupervised clustering. This model (fit by AutoClass) has been used to discover new kinds of stars in astronomical data, etc.
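As a concrete illustration (not part of the original slides), here is a minimal NumPy/SciPy sketch of the mixture density above; the component parameters below are made-up example values.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_density(x, pis, mus, sigmas):
    """Evaluate p(x) = sum_k pi_k N(x | mu_k, Sigma_k) at a single point x."""
    return sum(pi * multivariate_normal.pdf(x, mean=mu, cov=sigma)
               for pi, mu, sigma in zip(pis, mus, sigmas))

# Hypothetical 2-component mixture in 2-D (illustrative values only).
pis    = [0.4, 0.6]
mus    = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
sigmas = [np.eye(2), 2.0 * np.eye(2)]

print(gmm_density(np.array([1.0, 1.0]), pis, mus, sigmas))
```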

Page 6: GMM derivations

Consider a mixture of K Gaussian components.

Z is a latent class indicator vector:

p(z_n) = \mathrm{multi}(z_n : \pi) = \prod_k (\pi_k)^{z_n^k}

X is a conditional Gaussian variable with a class-specific mean/covariance:

p(x_n \mid z_n^k = 1, \mu, \Sigma) = \frac{1}{(2\pi)^{m/2} |\Sigma_k|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) \right\}

The likelihood of a sample:

p(x_n \mid \mu, \Sigma) = \sum_k p(z^k = 1 \mid \pi)\, p(x_n \mid z^k = 1, \mu, \Sigma)
                       = \sum_{z_n} \prod_k \left( \pi_k N(x_n : \mu_k, \Sigma_k) \right)^{z_n^k}
                       = \sum_k \pi_k N(x_n \mid \mu_k, \Sigma_k)

(mixture proportions \pi_k; mixture components N(x_n \mid \mu_k, \Sigma_k))

[Graphical model: Z → X.]
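A small sketch of the generative process implied by this model (draw the latent indicator z from the multinomial, then x from the selected Gaussian); the parameter values are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gmm(n, pis, mus, sigmas):
    """Sample n points by first drawing the latent indicator z, then x | z."""
    pis = np.asarray(pis)
    zs = rng.choice(len(pis), size=n, p=pis)          # latent class indicators
    xs = np.stack([rng.multivariate_normal(mus[z], sigmas[z]) for z in zs])
    return zs, xs

# Hypothetical 2-component, 2-D mixture.
pis    = [0.3, 0.7]
mus    = [np.zeros(2), np.array([4.0, 0.0])]
sigmas = [np.eye(2), np.diag([0.5, 2.0])]

zs, xs = sample_gmm(5, pis, mus, sigmas)
print(zs)
print(xs)
```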

Page 7: Learning mixture models

Page 8: Why is Learning Harder?

In fully observed iid settings, the log likelihood decomposes into a sum of local terms:

\ell_c(\theta; D) = \log p(x, z \mid \theta) = \log p(z \mid \theta_z) + \log p(x \mid z, \theta_x)

With latent variables, all the parameters become coupled together via marginalization:

\ell(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(z \mid \theta_z)\, p(x \mid z, \theta_x)

Page 9: Gradient Learning for Mixture Models

We can learn mixture densities using gradient descent on the log likelihood. The gradients are quite interesting:

\ell(\theta) = \log p(x \mid \theta) = \log \sum_k \pi_k p_k(x \mid \theta_k)

\frac{\partial \ell}{\partial \theta} = \frac{1}{p(x \mid \theta)} \sum_k \pi_k \frac{\partial p_k(x \mid \theta_k)}{\partial \theta}
  = \sum_k \frac{\pi_k}{p(x \mid \theta)}\, p_k(x \mid \theta_k)\, \frac{\partial \log p_k(x \mid \theta_k)}{\partial \theta_k}
  = \sum_k \pi_k \frac{p_k(x \mid \theta_k)}{p(x \mid \theta)} \frac{\partial \ell_k}{\partial \theta_k}
  = \sum_k r_k \frac{\partial \ell_k}{\partial \theta_k}

In other words, the gradient is the responsibility-weighted sum of the individual log likelihood gradients. One can pass this to a conjugate gradient routine.
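A minimal sketch of this responsibility-weighted gradient for the means of a univariate Gaussian mixture with known variances; the function name and the example values are my own illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def grad_means(x, pis, mus, sigmas):
    """Gradient of log p(x) w.r.t. the component means mu_k,
    written as responsibilities times per-component gradients."""
    comp = np.array([pi * norm.pdf(x, mu, s) for pi, mu, s in zip(pis, mus, sigmas)])
    r = comp / comp.sum()                                # responsibilities r_k
    # d/dmu_k log N(x | mu_k, sigma_k^2) = (x - mu_k) / sigma_k^2
    return r * (x - np.asarray(mus)) / np.asarray(sigmas) ** 2

# Illustrative one-dimensional, two-component mixture.
print(grad_means(x=1.0, pis=[0.5, 0.5], mus=[0.0, 4.0], sigmas=[1.0, 1.0]))
```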

Page 10: Parameter Constraints

Often we have constraints on the parameters, e.g., \sum_k \pi_k = 1, and \Sigma being symmetric positive definite (hence \Sigma_{ii} > 0). We can use constrained optimization, or we can reparameterize in terms of unconstrained values.

For normalized weights, use the softmax transform:

\pi_i = \frac{\exp(\gamma_i)}{\sum_j \exp(\gamma_j)}

For covariance matrices, use the Cholesky decomposition:

\Sigma^{-1} = A A^T

where A is upper triangular with positive diagonal:

A_{ii} = \exp(\lambda_i) > 0, \quad A_{ij} = \eta_{ij} \ (j > i), \quad A_{ij} = 0 \ (j < i)

The parameters \gamma_i, \lambda_i, \eta_{ij} \in \mathbb{R} are unconstrained. Use the chain rule to compute \partial \ell / \partial \pi and \partial \ell / \partial A.
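A short sketch of these two reparameterizations (softmax for the mixing weights, a Cholesky-style upper-triangular factor for a precision matrix); purely illustrative, with made-up unconstrained values.

```python
import numpy as np

def softmax(gamma):
    """Map unconstrained gamma_i to weights pi_i that are positive and sum to 1."""
    g = np.asarray(gamma) - np.max(gamma)      # subtract max for numerical stability
    e = np.exp(g)
    return e / e.sum()

def precision_from_cholesky(lam, eta):
    """Build Sigma^{-1} = A A^T with A upper triangular,
    diag(A) = exp(lam) > 0 and strict upper entries given by eta."""
    d = len(lam)
    A = np.zeros((d, d))
    A[np.diag_indices(d)] = np.exp(lam)
    A[np.triu_indices(d, k=1)] = eta
    return A @ A.T

pi = softmax([0.2, -1.0, 3.0])
P  = precision_from_cholesky(lam=[0.1, -0.3], eta=[0.5])
print(pi, pi.sum())
print(np.linalg.eigvalsh(P))    # all eigenvalues positive => valid precision matrix
```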

Page 11: The Expectation-Maximization (EM) Algorithm

Page 12: EM Algorithm for GMM

E.g., a mixture of K Gaussians.

Z is a latent class indicator vector:

p(z_n) = \mathrm{multi}(z_n : \pi) = \prod_k (\pi_k)^{z_n^k}

X is a conditional Gaussian variable with a class-specific mean/covariance:

p(x_n \mid z_n^k = 1, \mu, \Sigma) = \frac{1}{(2\pi)^{m/2} |\Sigma_k|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) \right\}

The likelihood of a sample:

p(x_n \mid \mu, \Sigma) = \sum_k p(z^k = 1 \mid \pi)\, p(x_n \mid z^k = 1, \mu, \Sigma)
                       = \sum_{z_n} \prod_k \left( \pi_k N(x_n : \mu_k, \Sigma_k) \right)^{z_n^k}
                       = \sum_k \pi_k N(x_n \mid \mu_k, \Sigma_k)

[Graphical model: plate over n = 1..N, with Z_n → X_n.]

Page 13: EM Algorithm for GMM

Recall MLE for completely observed data.

Data log-likelihood:

\ell(\theta; D) = \log \prod_n p(z_n, x_n) = \log \prod_n p(z_n \mid \pi)\, p(x_n \mid z_n, \mu, \sigma)
               = \sum_n \log \prod_k \pi_k^{z_n^k} + \sum_n \log \prod_k N(x_n; \mu_k, \sigma)^{z_n^k}
               = \sum_n \sum_k z_n^k \log \pi_k \; - \; \sum_n \sum_k z_n^k \frac{1}{2\sigma^2}(x_n - \mu_k)^2 + C

MLE:

\hat{\pi}_{k,\mathrm{MLE}} = \arg\max_\pi \ell(\theta; D)
\hat{\mu}_{k,\mathrm{MLE}} = \arg\max_\mu \ell(\theta; D) \;\Rightarrow\; \hat{\mu}_{k,\mathrm{MLE}} = \frac{\sum_n z_n^k x_n}{\sum_n z_n^k}
\hat{\sigma}_{k,\mathrm{MLE}} = \arg\max_\sigma \ell(\theta; D)

What if we do not know z_n?

\Rightarrow replace z_n^k by its posterior expectation p(z_n^k = 1 \mid x_n, \mu^{(t)}, \Sigma^{(t)}).

Page 14: EM Algorithm for GMM

Start: "Guess" the centroid \mu_k and covariance \Sigma_k of each of the K clusters.

Loop: alternate the E-step and M-step described on the following slides.

Page 15: Comparing to K-means

Start: "Guess" the centroid \mu_k and covariance \Sigma_k of each of the K clusters.

Loop:

For each point n = 1 to N, compute its cluster label:

z_n^{(t)} = \arg\min_k \, (x_n - \mu_k^{(t)})^T \Sigma_k^{-1(t)} (x_n - \mu_k^{(t)})

For each cluster k = 1:K, update the centroid (and similarly \Sigma_k):

\mu_k^{(t+1)} = \frac{\sum_n \delta(z_n^{(t)}, k)\, x_n}{\sum_n \delta(z_n^{(t)}, k)}, \qquad \Sigma_k^{(t+1)} = \ldots

Page 16: Notes on EM Algorithm

EM is an optimization strategy for objective functions that can be interpreted as likelihoods in the presence of missing data. It is much simpler than gradient methods:

- No need to choose a step size.
- Enforces constraints automatically.
- Calls inference and fully observed learning as subroutines.

EM is an iterative algorithm with two linked steps:

- E-step: fill in hidden values using inference, p(z | x, \theta^t).
- M-step: update parameters \theta^{t+1} using standard MLE/MAP methods applied to the completed data.

We will prove that this procedure monotonically improves the likelihood (or leaves it unchanged). Thus it always converges to a local optimum of the likelihood.

Page 17: Identifiability

A mixture model induces a multi-modal likelihood. Hence gradient ascent can only find a local maximum. Mixture models are unidentifiable, since we can always switch the hidden labels without affecting the likelihood. Hence we should be careful in trying to interpret the "meaning" of latent variables.

Page 18: How is EM Derived?

A mixture of K Gaussians.

Z is a latent class indicator vector:

p(z_n) = \mathrm{multi}(z_n : \pi) = \prod_k (\pi_k)^{z_n^k}

X is a conditional Gaussian variable with a class-specific mean/covariance:

p(x_n \mid z_n^k = 1, \mu, \Sigma) = \frac{1}{(2\pi)^{m/2} |\Sigma_k|^{1/2}} \exp\left\{ -\tfrac{1}{2} (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) \right\}

The likelihood of a sample:

p(x_n \mid \mu, \Sigma) = \sum_k p(z^k = 1 \mid \pi)\, p(x_n \mid z^k = 1, \mu, \Sigma) = \sum_k \pi_k N(x_n \mid \mu_k, \Sigma_k)

The "complete" likelihood:

p(x_n, z_n^k = 1 \mid \mu, \Sigma) = p(z_n^k = 1 \mid \pi)\, p(x_n \mid z_n^k = 1, \mu, \Sigma) = \pi_k N(x_n \mid \mu_k, \Sigma_k)

p(x_n, z_n \mid \mu, \Sigma) = \prod_k \left[ \pi_k N(x_n \mid \mu_k, \Sigma_k) \right]^{z_n^k}

But this is itself a random variable! Not good as an objective function.

Page 19: How is EM Derived?

The complete log likelihood:

\ell_c(\theta; x, z) = \log \prod_n p(x_n, z_n) = \log \prod_n p(z_n \mid \pi)\, p(x_n \mid z_n, \mu, \sigma)
                     = \sum_n \sum_k z_n^k \log \pi_k \; - \; \sum_n \sum_k z_n^k \frac{1}{2\sigma^2}(x_n - \mu_k)^2 + C

The expected complete log likelihood:

\langle \ell_c(\theta; x, z) \rangle = \sum_n \langle \log p(z_n \mid \pi) \rangle_{p(z \mid x)} + \sum_n \langle \log p(x_n \mid z_n, \mu, \Sigma) \rangle_{p(z \mid x)}
  = \sum_n \sum_k \langle z_n^k \rangle \log \pi_k \; - \; \tfrac{1}{2} \sum_n \sum_k \langle z_n^k \rangle \left( (x_n - \mu_k)^T \Sigma_k^{-1} (x_n - \mu_k) + \log |\Sigma_k| + C \right)

Page 20: E-step

We maximize \langle \ell_c(\theta) \rangle iteratively using the following procedure.

Expectation step: compute the expected value of the sufficient statistics of the hidden variables (i.e., z) given the current estimate of the parameters (i.e., \pi and \mu):

\tau_n^{k(t)} = \langle z_n^k \rangle_{q^{(t)}} = p(z_n^k = 1 \mid x_n, \mu^{(t)}, \Sigma^{(t)})
             = \frac{\pi_k^{(t)} N(x_n \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_i \pi_i^{(t)} N(x_n \mid \mu_i^{(t)}, \Sigma_i^{(t)})}

Here we are essentially doing inference.
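A minimal NumPy/SciPy sketch of this E-step (computing the responsibilities tau for all points at once); the variable names are my own, and the example data and parameters are made up.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, pis, mus, sigmas):
    """Return an (N, K) matrix of responsibilities tau[n, k] = p(z_n^k = 1 | x_n)."""
    weighted = np.column_stack([
        pi * multivariate_normal.pdf(X, mean=mu, cov=sigma)
        for pi, mu, sigma in zip(pis, mus, sigmas)
    ])                                   # weighted[n, k] = pi_k N(x_n | mu_k, Sigma_k)
    return weighted / weighted.sum(axis=1, keepdims=True)

# Illustrative data and parameters.
X = np.array([[0.1, 0.2], [3.0, 2.9], [1.5, 1.5]])
tau = e_step(X, pis=[0.5, 0.5],
             mus=[np.zeros(2), np.array([3.0, 3.0])],
             sigmas=[np.eye(2), np.eye(2)])
print(tau)           # each row sums to 1
```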

Page 21: M-step

We maximize \langle \ell_c(\theta) \rangle iteratively using the following procedure.

Maximization step: compute the parameters using the current expected values of the hidden variables.

\pi_k^* = \arg\max_\pi \langle \ell_c(\theta) \rangle \text{ s.t. } \sum_k \pi_k = 1
  \;\Rightarrow\; \frac{\partial \langle \ell_c(\theta) \rangle}{\partial \pi_k} = 0, \; \forall k
  \;\Rightarrow\; \pi_k^{(t+1)} = \frac{\sum_n \langle z_n^k \rangle_{q^{(t)}}}{N} = \frac{\sum_n \tau_n^{k(t)}}{N} = \frac{\langle n_k \rangle}{N}

\mu_k^* = \arg\max_\mu \langle \ell_c(\theta) \rangle \;\Rightarrow\; \mu_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)} x_n}{\sum_n \tau_n^{k(t)}}

\Sigma_k^* = \arg\max_\Sigma \langle \ell_c(\theta) \rangle \;\Rightarrow\; \Sigma_k^{(t+1)} = \frac{\sum_n \tau_n^{k(t)} (x_n - \mu_k^{(t+1)})(x_n - \mu_k^{(t+1)})^T}{\sum_n \tau_n^{k(t)}}

Facts used: \; \frac{\partial \log |A|}{\partial A} = A^{-T}, \qquad \frac{\partial\, x^T A x}{\partial A} = x x^T

This is isomorphic to MLE except that the variables that are hidden are replaced by their expectations (in general they will be replaced by their corresponding "sufficient statistics").
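Putting the E-step and M-step together, here is a compact EM-for-GMM sketch following the update formulas above; it is an illustration rather than the lecture's reference implementation, and the initialization and convergence handling are kept deliberately simple.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iter=50, seed=0):
    """Fit a K-component Gaussian mixture to X (N x d) with plain EM."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    pis = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, K, replace=False)]          # random data points as initial means
    sigmas = np.array([np.cov(X.T) + 1e-6 * np.eye(d) for _ in range(K)])

    for _ in range(n_iter):
        # E-step: responsibilities tau[n, k]
        weighted = np.column_stack([
            pis[k] * multivariate_normal.pdf(X, mean=mus[k], cov=sigmas[k])
            for k in range(K)
        ])
        tau = weighted / weighted.sum(axis=1, keepdims=True)

        # M-step: re-estimate pi, mu, Sigma from the expected counts
        Nk = tau.sum(axis=0)                           # effective number of points per component
        pis = Nk / N
        mus = (tau.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mus[k]
            sigmas[k] = (tau[:, k, None] * diff).T @ diff / Nk[k] + 1e-6 * np.eye(d)
    return pis, mus, sigmas

# Illustrative run on synthetic data from two Gaussians.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
pis, mus, sigmas = em_gmm(X, K=2)
print(pis)
print(mus)
```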

Page 22: Compare: K-means

The EM algorithm for mixtures of Gaussians is like a "soft version" of the K-means algorithm.

In the K-means "E-step" we do hard assignment:

z_n^{(t)} = \arg\min_k \, (x_n - \mu_k^{(t)})^T \Sigma_k^{-1(t)} (x_n - \mu_k^{(t)})

In the K-means "M-step" we update the means as the weighted sum of the data, but now the weights are 0 or 1:

\mu_k^{(t+1)} = \frac{\sum_n \delta(z_n^{(t)}, k)\, x_n}{\sum_n \delta(z_n^{(t)}, k)}
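For comparison, a small K-means sketch (hard assignments with Euclidean distance, i.e., Sigma_k fixed to the identity); again an illustrative sketch with assumed names and synthetic data.

```python
import numpy as np

def kmeans(X, K, n_iter=20, seed=0):
    """Plain K-means: hard 'E-step' (nearest centroid) and hard 'M-step' (cluster means)."""
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iter):
        # Hard assignment: index of the nearest centroid for each point.
        d2 = ((X[:, None, :] - mus[None, :, :]) ** 2).sum(axis=2)   # (N, K) squared distances
        z = d2.argmin(axis=1)
        # Update each centroid as the mean of the points assigned to it.
        mus = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mus[k]
                        for k in range(K)])
    return z, mus

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
labels, centers = kmeans(X, K=2)
print(centers)
```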

Page 23: Theory Underlying EM

What are we doing?

Recall that according to MLE, we intend to learn the model parameters that would maximize the likelihood of the data. But we do not observe z, so computing

\ell(\theta; D) = \log \sum_z p(x, z \mid \theta) = \log \sum_z p(z \mid \theta_z)\, p(x \mid z, \theta_x)

is difficult! What shall we do?

Page 24: Complete & Incomplete Log Likelihoods

Complete log likelihood: Let X denote the observable variable(s) and Z denote the latent variable(s). If Z could be observed, then

\ell_c(\theta; x, z) \stackrel{\mathrm{def}}{=} \log p(x, z \mid \theta)

Usually, optimizing \ell_c() given both z and x is straightforward (c.f. MLE for fully observed models). Recall that in this case the objective for, e.g., MLE decomposes into a sum of factors, and the parameter for each factor can be estimated separately. But given that Z is not observed, \ell_c() is a random quantity and cannot be maximized directly.

Incomplete log likelihood: With z unobserved, our objective becomes the log of a marginal probability:

\ell(\theta; x) = \log p(x \mid \theta) = \log \sum_z p(x, z \mid \theta)

This objective won't decouple.

Page 25: Expected Complete Log Likelihood

For any distribution q(z), define the expected complete log likelihood:

\langle \ell_c(\theta; x, z) \rangle_q \stackrel{\mathrm{def}}{=} \sum_z q(z \mid x)\, \log p(x, z \mid \theta)

- A deterministic function of \theta.
- Linear in \ell_c() --- inherits its factorizability.
- Does maximizing this surrogate yield a maximizer of the likelihood?

Jensen's inequality:

\ell(\theta; x) = \log p(x \mid \theta)
              = \log \sum_z p(x, z \mid \theta)
              = \log \sum_z q(z \mid x) \frac{p(x, z \mid \theta)}{q(z \mid x)}
              \ge \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)}

\Rightarrow \; \ell(\theta; x) \ge \langle \ell_c(\theta; x, z) \rangle_q + H_q

Page 26: Lower Bounds and Free Energy

For fixed data x, define a functional called the free energy:

F(q, \theta) \stackrel{\mathrm{def}}{=} \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)} \le \ell(\theta; x)

The EM algorithm is coordinate ascent on F:

E-step: q^{t+1} = \arg\max_q F(q, \theta^t)

M-step: \theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta)
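A tiny numerical check of this bound for a two-component univariate mixture with a single observation: with the exact posterior as q the free energy equals the log likelihood, and with any other q it is strictly below. The mixture parameters here are illustrative.

```python
import numpy as np
from scipy.stats import norm

# Illustrative 2-component univariate mixture and a single observation x.
pis, mus, sigs, x = np.array([0.3, 0.7]), np.array([0.0, 4.0]), np.array([1.0, 1.0]), 1.0

joint = pis * norm.pdf(x, mus, sigs)        # p(x, z = k | theta)
loglik = np.log(joint.sum())                # l(theta; x) = log p(x | theta)

def free_energy(q):
    """F(q, theta) = sum_z q(z) log [ p(x, z | theta) / q(z) ]."""
    return np.sum(q * (np.log(joint) - np.log(q)))

posterior = joint / joint.sum()
print(loglik, free_energy(posterior))            # equal: the bound is tight at q = p(z | x, theta)
print(free_energy(np.array([0.5, 0.5])))         # any other q gives a strictly smaller value
```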

Page 27: E-step: Maximization of Expected \ell_c w.r.t. q

Claim:

q^{t+1} = \arg\max_q F(q, \theta^t) = p(z \mid x, \theta^t)

This is the posterior distribution over the latent variables given the data and the parameters. Often we need this at test time anyway (e.g., to perform classification).

Proof (easy): this setting attains the bound \ell(\theta; x) \ge F(q, \theta):

F(p(z \mid x, \theta^t), \theta^t) = \sum_z p(z \mid x, \theta^t) \log \frac{p(x, z \mid \theta^t)}{p(z \mid x, \theta^t)}
                                   = \sum_z p(z \mid x, \theta^t) \log p(x \mid \theta^t)
                                   = \log p(x \mid \theta^t) = \ell(\theta^t; x)

Can also show this result using variational calculus or the fact that

\ell(\theta; x) - F(q, \theta) = \mathrm{KL}\big(q \,\|\, p(z \mid x, \theta)\big)

Page 28: E-step ≡ Plug in Posterior Expectation of Latent Variables

Without loss of generality, assume that p(x, z \mid \theta) is a generalized exponential family distribution:

p(x, z \mid \theta) = \frac{1}{Z(\theta)} h(x, z) \exp\left\{ \sum_i \theta_i f_i(x, z) \right\}

Special case: if p(X \mid Z) are GLIMs, then f_i(x, z) = \eta_i^T(z)\, \xi_i(x).

The expected complete log likelihood under q^{t+1} = p(z \mid x, \theta^t) is

\langle \ell_c(\theta; x, z) \rangle_{q^{t+1}} = \sum_z q(z \mid x, \theta^t) \log p(x, z \mid \theta) - A(\theta)
    = \sum_i \theta_i \langle f_i(x, z) \rangle_{q(z \mid x, \theta^t)} - A(\theta)
    \stackrel{\mathrm{GLIM}}{=} \sum_i \theta_i \langle \eta_i(z) \rangle_{q(z \mid x, \theta^t)}\, \xi_i(x) - A(\theta)

Page 29: M-step: Maximization of Expected \ell_c w.r.t. \theta

Note that the free energy breaks into two terms:

F(q, \theta) = \sum_z q(z \mid x) \log \frac{p(x, z \mid \theta)}{q(z \mid x)}
             = \sum_z q(z \mid x) \log p(x, z \mid \theta) - \sum_z q(z \mid x) \log q(z \mid x)
             = \langle \ell_c(\theta; x, z) \rangle_q + H_q

The first term is the expected complete log likelihood (energy), and the second term, which does not depend on \theta, is the entropy.

Thus, in the M-step, maximizing with respect to \theta for fixed q we only need to consider the first term:

\theta^{t+1} = \arg\max_\theta \langle \ell_c(\theta; x, z) \rangle_{q^{t+1}} = \arg\max_\theta \sum_z q^{t+1}(z \mid x) \log p(x, z \mid \theta)

Under the optimal q^{t+1}, this is equivalent to solving a standard MLE of the fully observed model p(x, z \mid \theta), with the sufficient statistics involving z replaced by their expectations w.r.t. p(z \mid x, \theta).

Page 30: Summary: EM Algorithm

A way of maximizing the likelihood function for latent variable models. Finds the MLE of parameters when the original (hard) problem can be broken up into two (easy) pieces:

1. Estimate some "missing" or "unobserved" data from observed data and current parameters.
2. Using this "complete" data, find the maximum likelihood parameter estimates.

Alternate between filling in the latent variables using the best guess (posterior) and updating the parameters based on this guess:

E-step: q^{t+1} = \arg\max_q F(q, \theta^t)

M-step: \theta^{t+1} = \arg\max_\theta F(q^{t+1}, \theta)

In the M-step we optimize a lower bound on the likelihood. In the E-step we close the gap, making bound = likelihood.

Page 31: EM Variants

Sparse EM: Do not re-compute the posterior probability on each data point under all models exactly, because it is almost zero. Instead keep an "active list" which you update every once in a while.

Generalized (Incomplete) EM: It might be hard to find the ML parameters in the M-step, even given the completed data. We can still make progress by doing an M-step that improves the likelihood a bit (e.g., a gradient step). Recall the IRLS step in the mixture of experts model.

Page 32: A Report Card for EM

Some good things about EM:

- no learning rate (step-size) parameter
- automatically enforces parameter constraints
- very fast for low dimensions
- each iteration guaranteed to improve the likelihood

Some bad things about EM:

- can get stuck in local minima
- can be slower than conjugate gradient (especially near convergence)
- requires an expensive inference step
- is a maximum likelihood/MAP method

Page 33: From Static to Dynamic Mixture Models

[Figure: a static mixture (a single latent Y_1 generating X_1, repeated over a plate of size N) versus a dynamic mixture (a chain of latent states Y_1 → Y_2 → Y_3 → ... → Y_T, each emitting an observation X_t).]

Page 34: Hidden Markov Models

[Figure: the HMM graphical model, a chain Y_1 → Y_2 → ... → Y_T of hidden states, each emitting an observation X_t.]

The underlying source: genomic entities, dice, ...

The sequence: CGH signal, sequence of rolls, ...

Markov property: the next hidden state depends only on the current hidden state.

Page 35: An Experience in a Casino

This problem is IMPORTANT!

Game:
1. You bet $1.
2. You roll (always with a fair die).
3. Casino player rolls (maybe with a fair die, maybe with a loaded die).
4. Highest number wins $2.

Observed sequence of the casino player's rolls:

1245526462146146136136661664661636616366163616515615115146123562344

Question: Which die is being used in each play?

Page 36: A More Serious Question ...

Naturally, data points arrive one at a time. Does the ordering index carry (additional) clustering information besides the data value itself?

Example: chromosomes of a tumor cell: copy number measurements (known as CGH).

Page 37: Array CGH (Comparative Genomic Hybridization)

The basic assumption of a CGH experiment is that the ratio of the binding of test and control DNA is proportional to the ratio of the copy numbers of sequences in the two samples. But various kinds of noise make the true observations less easy to interpret ...

Page 38: DNA Copy Number Aberration Types in Breast Cancer

[Figure panels: copy number profiles for chromosome 1 from the 600 MPE cell line, chromosome 8 from the COLO320 cell line, and chromosome 8 in the MDA-MB-231 cell line; annotations indicate a 60-70 fold amplification of the CMYC region and a deletion.]

Page 39: The Dishonest Casino !!!

Suppose you were told the following story before heading to Vegas ...

A casino has two dice:

- Fair die: P(1) = P(2) = P(3) = P(4) = P(5) = P(6) = 1/6
- Loaded die: P(1) = P(2) = P(3) = P(4) = P(5) = 1/10, P(6) = 1/2

The casino player switches back and forth between the fair and loaded die once every 20 turns.
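A sketch of this two-state HMM as data plus a sampler, under the stated die probabilities and an assumed symmetric switching probability of 1/20 per turn with a uniform start distribution; the code structure and names are my own illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

states = ["F", "L"]                       # Fair, Loaded
pi = np.array([0.5, 0.5])                 # assumed uniform start distribution
A = np.array([[0.95, 0.05],               # switch with probability 1/20 per turn
              [0.05, 0.95]])
B = np.array([[1/6] * 6,                            # fair die over faces 1..6
              [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])      # loaded die

def sample_casino(T):
    """Sample T (state, roll) pairs from the dishonest-casino HMM."""
    y = rng.choice(2, p=pi)
    ys, xs = [], []
    for _ in range(T):
        ys.append(states[y])
        xs.append(rng.choice(6, p=B[y]) + 1)   # die faces are 1..6
        y = rng.choice(2, p=A[y])
    return ys, xs

ys, xs = sample_casino(20)
print("".join(ys))
print("".join(map(str, xs)))
```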

Page 40: Puzzles Regarding the Dishonest Casino

GIVEN: a sequence of rolls by the casino player

1245526462146146136136661664661636616366163616515615115146123562344

QUESTIONS:

- How likely is this sequence, given our model of how the casino works? This is the EVALUATION problem.
- What portion of the sequence was generated with the fair die, and what portion with the loaded die? This is the DECODING question.
- How "loaded" is the loaded die? How "fair" is the fair die? How often does the casino player change from fair to loaded, and back? This is the LEARNING question.

Page 41: Definition (of HMM)

Observation space:
  Alphabetic set C = {c_1, c_2, ..., c_K}, or Euclidean space R^d

Index set of hidden states:
  I = {1, 2, ..., M}

Transition probabilities between any two states:
  p(y_t^j = 1 \mid y_{t-1}^i = 1) = a_{i,j}, or
  p(y_t \mid y_{t-1}^i = 1) \sim \mathrm{Multinomial}(a_{i,1}, a_{i,2}, \ldots, a_{i,M}), \; \forall i \in I

Start probabilities:
  p(y_1) \sim \mathrm{Multinomial}(\pi_1, \pi_2, \ldots, \pi_M)

Emission probabilities associated with each state:
  p(x_t \mid y_t^i = 1) \sim \mathrm{Multinomial}(b_{i,1}, b_{i,2}, \ldots, b_{i,K}), \; \forall i \in I,
  or in general: p(x_t \mid y_t^i = 1) \sim f(\cdot \mid \theta_i), \; \forall i \in I

[Graphical model: chain y_1 → y_2 → ... → y_T with emissions x_1, ..., x_T; equivalently, a state automaton over the M states.]

Page 42: Three Main Questions on HMMs

1. Evaluation
   GIVEN an HMM M and a sequence x,
   FIND Prob(x | M).
   ALGO: Forward (see the sketch after this list).

2. Decoding
   GIVEN an HMM M and a sequence x,
   FIND the sequence y of states that maximizes, e.g., P(y | x, M), or the most probable subsequence of states.
   ALGO: Viterbi, Forward-backward.

3. Learning
   GIVEN an HMM M with unspecified transition/emission probabilities, and a sequence x,
   FIND parameters θ = (π_i, a_ij, η_ik) that maximize P(x | θ).
   ALGO: Baum-Welch (EM).
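A minimal sketch of the forward algorithm for the evaluation problem, written for the dishonest-casino parameters assumed earlier (pi, A, B as NumPy arrays); the log-space/scaling trick is omitted for brevity, so this is only suitable for short sequences.

```python
import numpy as np

def forward_prob(x, pi, A, B):
    """Evaluation: return P(x | model) by the forward recursion.
    x is a sequence of die faces 1..6; pi, A, B as defined for the casino HMM."""
    alpha = pi * B[:, x[0] - 1]                 # alpha_1(i) = pi_i * b_i(x_1)
    for obs in x[1:]:
        alpha = (alpha @ A) * B[:, obs - 1]     # alpha_t(j) = sum_i alpha_{t-1}(i) a_ij * b_j(x_t)
        # note: prefer log-space or scaling in practice to avoid underflow
    return alpha.sum()

pi = np.array([0.5, 0.5])
A = np.array([[0.95, 0.05], [0.05, 0.95]])
B = np.array([[1/6] * 6, [0.1, 0.1, 0.1, 0.1, 0.1, 0.5]])

rolls = [int(c) for c in "1245526462146146136136661664661636616366163616515"]
print(forward_prob(rolls, pi, A, B))
```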

Page 43: Learning HMM: Two Scenarios

Supervised learning: estimation when the "right answer" is known.
Examples:
- GIVEN: a genomic region x = x_1 ... x_1,000,000 where we have good (experimental) annotations of the CpG islands.
- GIVEN: the casino player allows us to observe him one evening, as he changes dice and produces 10,000 rolls.

Unsupervised learning: estimation when the "right answer" is unknown.
Examples:
- GIVEN: the porcupine genome; we don't know how frequent the CpG islands are there, nor do we know their composition.
- GIVEN: 10,000 rolls of the casino player, but we don't see when he changes dice.

QUESTION: Update the parameters θ of the model to maximize P(x | θ) --- maximum likelihood (ML) estimation.

Page 44: Supervised ML Estimation, ctd.

Intuition: when we know the underlying states, the best estimate of θ is the average frequency of transitions & emissions that occur in the training data.

Drawback: given little data, there may be overfitting: P(x | θ) is maximized, but θ is unreasonable. Zero probabilities are VERY BAD.

Example: given 10 casino rolls, we observe

x = 2, 1, 5, 6, 1, 2, 3, 6, 2, 3
y = F, F, F, F, F, F, F, F, F, F

Then: a_FF = 1; a_FL = 0;
b_F1 = b_F3 = b_F6 = .2; b_F2 = .3; b_F4 = 0; b_F5 = .1

Page 45: Pseudocounts

Solution for small training sets: add pseudocounts.

A_ij = (# times state transition i → j occurs in y) + R_ij
B_ik = (# times state i in y emits k in x) + S_ik

R_ij and S_ik are pseudocounts representing our prior belief. The total pseudocounts R_i = Σ_j R_ij and S_i = Σ_k S_ik give the "strength" of the prior belief --- the total number of imaginary instances in the prior.

Larger total pseudocounts ⇒ strong prior belief.

Small total pseudocounts: just to avoid 0 probabilities --- smoothing.
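A sketch of supervised estimation by counting, with optional pseudocounts added before normalizing; the data below are the 10-roll example from the previous slide, and the pseudocount value is an arbitrary choice for illustration.

```python
import numpy as np

def supervised_hmm_mle(xs, ys, n_states=2, n_symbols=6, pseudo=0.0):
    """Estimate a_ij and b_ik by counting transitions/emissions in a labeled sequence,
    adding `pseudo` to every count before normalizing (pseudo=0 gives plain MLE)."""
    A = np.full((n_states, n_states), pseudo)
    B = np.full((n_states, n_symbols), pseudo)
    for t in range(len(ys)):
        B[ys[t], xs[t] - 1] += 1
        if t > 0:
            A[ys[t - 1], ys[t]] += 1
    # Normalize rows; guard against all-zero rows (states never visited).
    A = A / np.where(A.sum(axis=1, keepdims=True) == 0, 1, A.sum(axis=1, keepdims=True))
    B = B / np.where(B.sum(axis=1, keepdims=True) == 0, 1, B.sum(axis=1, keepdims=True))
    return A, B

xs = [2, 1, 5, 6, 1, 2, 3, 6, 2, 3]
ys = [0] * 10                       # all rolls labeled Fair (state 0)
print(supervised_hmm_mle(xs, ys))               # plain MLE: zero probabilities appear
print(supervised_hmm_mle(xs, ys, pseudo=1.0))   # pseudocounts smooth the zeros away
```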

Page 46: Unsupervised ML Estimation

Given x = x_1 ... x_N for which the true state path y = y_1 ... y_N is unknown:

EXPECTATION MAXIMIZATION

0. Start with our best guess of a model M, parameters θ.
1. Estimate the expected counts A_ij and B_ik in the training data:

   A_{ij} = \sum_{n,t} \langle y_{n,t-1}^i\, y_{n,t}^j \rangle, \qquad B_{ik} = \sum_{n,t} \langle y_{n,t}^i \rangle\, x_{n,t}^k

   How? By inference over the hidden states (the Baum-Welch E-step on the next slide).
2. Update θ according to A_ij and B_ik --- now a "supervised learning" problem.
3. Repeat 1 & 2 until convergence.

This is called the Baum-Welch algorithm. We can get to a provably more (or equally) likely parameter set θ each iteration.

Page 47: The Baum-Welch Algorithm

The complete log likelihood:

\ell_c(\theta; x, y) = \log p(x, y) = \log \prod_n \left( p(y_{n,1}) \prod_{t=2}^{T} p(y_{n,t} \mid y_{n,t-1}) \prod_{t=1}^{T} p(x_{n,t} \mid y_{n,t}) \right)

The expected complete log likelihood (summing over the state indices i, j and the symbol index k):

\langle \ell_c(\theta; x, y) \rangle = \sum_n \langle y_{n,1}^i \rangle_{p(y_{n,1} \mid x_n)} \log \pi_i
  + \sum_n \sum_{t=2}^{T} \langle y_{n,t-1}^i\, y_{n,t}^j \rangle_{p(y_{n,t-1}, y_{n,t} \mid x_n)} \log a_{i,j}
  + \sum_n \sum_{t=1}^{T} x_{n,t}^k \langle y_{n,t}^i \rangle_{p(y_{n,t} \mid x_n)} \log b_{i,k}

EM

The E step:

\gamma_{n,t}^i = \langle y_{n,t}^i \rangle = p(y_{n,t}^i = 1 \mid x_n)

\xi_{n,t}^{i,j} = \langle y_{n,t-1}^i\, y_{n,t}^j \rangle = p(y_{n,t-1}^i = 1, y_{n,t}^j = 1 \mid x_n)

The M step ("symbolically" identical to MLE):

\pi_i^{\mathrm{ML}} = \frac{\sum_n \gamma_{n,1}^i}{N}

a_{ij}^{\mathrm{ML}} = \frac{\sum_n \sum_{t=2}^{T} \xi_{n,t}^{i,j}}{\sum_n \sum_{t=1}^{T-1} \gamma_{n,t}^i}

b_{ik}^{\mathrm{ML}} = \frac{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i\, x_{n,t}^k}{\sum_n \sum_{t=1}^{T} \gamma_{n,t}^i}
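The following is a compact, scaled forward-backward plus Baum-Welch sketch for a single discrete sequence, matching the update formulas above; it is a didactic illustration (random initialization, fixed iteration count), not production code, and the roll string is the one from the earlier casino slides.

```python
import numpy as np

def forward_backward(x, pi, A, B):
    """Scaled forward-backward for one observation sequence x (symbols 0..K-1).
    Returns gamma[t, i] = p(y_t = i | x), xi[t, i, j] = p(y_{t-1}=i, y_t=j | x), and log p(x)."""
    T, M = len(x), len(pi)
    alpha = np.zeros((T, M)); beta = np.zeros((T, M)); c = np.zeros(T)
    alpha[0] = pi * B[:, x[0]]; c[0] = alpha[0].sum(); alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
        c[t] = alpha[t].sum(); alpha[t] /= c[t]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, x[t + 1]] * beta[t + 1])) / c[t + 1]
    gamma = alpha * beta
    xi = np.zeros((T, M, M))
    for t in range(1, T):
        xi[t] = alpha[t - 1][:, None] * A * (B[:, x[t]] * beta[t])[None, :] / c[t]
    return gamma, xi, np.log(c).sum()

def baum_welch(x, M, K, n_iter=30, seed=0):
    """EM (Baum-Welch) for a single discrete-emission sequence; a didactic sketch."""
    rng = np.random.default_rng(seed)
    pi = np.full(M, 1.0 / M)
    A = rng.dirichlet(np.ones(M), size=M)
    B = rng.dirichlet(np.ones(K), size=M)
    for _ in range(n_iter):
        gamma, xi, ll = forward_backward(x, pi, A, B)       # E-step
        pi = gamma[0] / gamma[0].sum()                      # M-step
        A = xi[1:].sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
        B = np.zeros_like(B)
        for t, obs in enumerate(x):
            B[:, obs] += gamma[t]
        B /= gamma.sum(axis=0)[:, None]
    return pi, A, B, ll

rolls = [int(c) - 1 for c in "1245526462146146136136661664661636616366163616515615115146123562344"]
pi, A, B, ll = baum_welch(rolls, M=2, K=6)
print(np.round(A, 2)); print(np.round(B, 2)); print(ll)
```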

Page 48: The Baum-Welch Algorithm --- Comments

Time complexity: (# iterations) × O(K^2 N)

- Guaranteed to increase the log likelihood of the model.
- Not guaranteed to find the globally best parameters.
- Converges to a local optimum, depending on initial conditions.
- Too many parameters / too large a model: over-fitting.

© Eric Xing @ CMU, 2006-2012