Beam Sampling for the Infinite Hidden Markov Model
by
Jurgen Van Gael, Yunus Saatci, Yee Whye Teh and Zoubin Ghahramani
(ICML 2008)
Presented by Lihan He
ECE, Duke University
Nov 14, 2008
Outline
Introduction
Infinite HMM
Beam sampler
Experimental results
Conclusion
Introduction: HMM
HMM: hidden Markov model
[Graphical model: Markov chain s0 → s1 → s2 → … → sT, each st emitting yt; initial distribution π0, transition matrix π]

Model parameters {π0, π, φ1, …, φK}
Hidden state sequence s = {s1, s2, …, sT}, st ∈ {1, 2, …, K}
Observation sequence y = {y1, y2, …, yT}
π0i = p(s1=i)
πij = p(st=j|st-1=i)
yt | st ~ F(φst)
Complete likelihood: p(s, y) = π0,s1 F(y1; φs1) ∏t=2:T πst-1,st F(yt; φst)
K: number of states
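As a concrete sketch of these definitions, the complete likelihood can be evaluated directly (a minimal illustration with made-up numbers; `complete_log_likelihood` is my own name, and discrete emissions are assumed):

```python
import numpy as np

def complete_log_likelihood(s, y, pi0, pi, B):
    """log p(s, y | pi0, pi, phi) for a discrete-emission HMM.

    s  : state sequence (length T, values in {0, ..., K-1})
    y  : observation sequence (length T)
    pi0: initial distribution, pi0[i] = p(s1 = i)
    pi : transition matrix, pi[i, j] = p(st = j | st-1 = i)
    B  : emission matrix, B[k, o] = p(yt = o | st = k)
    """
    ll = np.log(pi0[s[0]]) + np.log(B[s[0], y[0]])
    for t in range(1, len(s)):
        ll += np.log(pi[s[t - 1], s[t]]) + np.log(B[s[t], y[t]])
    return ll

# Toy K = 2 example
pi0 = np.array([0.5, 0.5])
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])
ll = complete_log_likelihood([0, 0, 1], [0, 0, 1], pi0, pi, B)
```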
Introduction: HMM Inference
Inference of HMM: forward-backward algorithm
Maximum likelihood: overfitting problem
Bayesian learning: VB or MCMC
If we don’t know K a priori
Model selection: inference for all K; computationally expensive.
Nonparametric Bayesian model: iHMM (HMM with an infinite number of states)
With iHMM framework
The forward-backward algorithm cannot be applied since the number of states K
is infinite.
Gibbs sampling can be used, but convergence is very slow due to the strong
dependencies between consecutive time steps.
Beam sampling = slice sampling + dynamic programming
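For a known, finite K, the forward pass underlying the forward-backward algorithm can be sketched as follows (an illustrative sketch, not the paper's code; the function name and the discrete-emission assumption are mine):

```python
import numpy as np

def forward_filter(y, pi0, pi, B):
    """Forward pass of the forward-backward algorithm for a finite-K HMM.

    Returns the filtered marginals p(st | y1:t) (rows of alpha) and the
    log marginal likelihood log p(y1:T). B[k, o] = p(yt = o | st = k).
    """
    T, K = len(y), len(pi0)
    alpha = np.zeros((T, K))
    log_lik = 0.0
    pred = pi0                       # predictive distribution p(s1)
    for t in range(T):
        unnorm = pred * B[:, y[t]]   # joint of state and new observation
        z = unnorm.sum()             # p(yt | y1:t-1)
        alpha[t] = unnorm / z
        log_lik += np.log(z)
        pred = alpha[t] @ pi         # predict p(st+1 | y1:t)
    return alpha, log_lik
```

The cost is O(T K^2) per sweep, which is exactly what becomes unusable when K is infinite.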
Introduction: Beam Sampling
Slice sampling: limit the number of states considered at each time step to a
finite number
Dynamic programming: sample whole state trajectory efficiently
Advantages:
Converges in far fewer iterations than Gibbs sampling
Actual complexity per iteration is only marginally higher than that of Gibbs sampling
Mixes well regardless of strong correlations in the data
More robust with respect to varying initialization and prior distribution
Implemented via HDP
Infinite HMM

Implemented via the hierarchical Dirichlet process:
G0 ~ DP(γ, H)
Gk ~ DP(α, G0)

In the stick-breaking representation:
β ~ GEM(γ)
πk | β ~ DP(α, β)  (transition probability)
φk ~ H  (emission distribution parameter)
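The stick-breaking construction can be sketched under a finite truncation (an illustrative sketch with truncation level K; the function names are mine, not the paper's, and the Dirichlet draw for each row is the standard truncated approximation of DP(α, β)):

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(gamma, K):
    """Truncated draw of beta ~ GEM(gamma): the shared base weights."""
    v = rng.beta(1.0, gamma, size=K)                       # stick proportions
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining                                   # beta_k = v_k * prod_{l<k} (1 - v_l)

def transition_rows(alpha, beta):
    """Each row pi_k ~ DP(alpha, beta); under truncation this is
    approximately a Dirichlet(alpha * beta) draw."""
    return np.vstack([rng.dirichlet(alpha * beta) for _ in range(len(beta))])

beta = stick_breaking(gamma=2.0, K=10)
pi = transition_rows(alpha=1.0, beta=beta)
```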
Beam Sampler
Intuitive thought: only consider the states with large transition probabilities so that the number of possible states in each time step is finite.
Problems with this approximation:
How to define a “large” transition probability?
Truncation might change the distributions of other variables
Idea: introduce auxiliary variable u such that conditioned on u the number of trajectories with positive probability is finite.
The auxiliary variables do not change the marginal distribution over the other variables, so MCMC sampling will still converge to the true posterior.
Beam Sampler
Sampling u: for each t we introduce an auxiliary variable ut with conditional distribution (conditional on π, st-1 and st): ut ~ Uniform(0, πst-1,st)
Sampling s: we sample the whole trajectory s given u and other variables using a form of forward filtering-backward sampling.
Forward filtering: compute p(st | y1:t, u1:t) sequentially for t = 1, 2, …, T
Backward sampling: sample st sequentially for t = T, T-1, …, 2, 1
Only trajectories s with ut < πst-1,st for all t will have non-zero probability given u
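The slice step itself is trivial to implement (a hedged illustration; `sample_slices` is a hypothetical name, and I treat the first slice u1 as conditioned on the initial distribution π0):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_slices(s, pi0, pi):
    """Auxiliary slice variables ut ~ Uniform(0, pi[s_{t-1}, s_t]).

    The first slice uses the initial distribution pi0.
    """
    T = len(s)
    u = np.empty(T)
    u[0] = rng.uniform(0.0, pi0[s[0]])
    for t in range(1, T):
        u[t] = rng.uniform(0.0, pi[s[t - 1], s[t]])
    return u
```

Conditioned on u, only finitely many transitions satisfy πij > ut, which is what makes dynamic programming over the infinite state space possible.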
Beam Sampler
• Computing p(st | -) only needs to sum up a finite part of p(st-1 | -)
• We only need to compute p(st | y1:t, u1:t) for the finitely many st values belonging to some trajectory with positive probability
Forward filtering: p(st | y1:t, u1:t) ∝ p(yt | st) Σ{st-1: ut < πst-1,st} p(st-1 | y1:t-1, u1:t-1)
Backward sampling:
Sample sT from p(sT | y1:T, u1:T)
Sample st given the sample for st+1: p(st | st+1, y1:T, u1:T) ∝ p(st | y1:t, u1:t) 1(ut+1 < πst,st+1)
Sampling φ, π, β: directly from the theory of HDPs
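Putting the two passes together, the conditional trajectory update can be sketched for a truncated representation (an illustrative sketch, not the authors' code; discrete emissions B and the name `beam_ffbs` are my own assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

def beam_ffbs(y, u, pi0, pi, B):
    """Forward filtering-backward sampling restricted to the slice:
    only transitions with pi[i, j] > u_t contribute.

    B[k, o] = p(yt = o | st = k); states are 0, ..., K-1.
    """
    T, K = len(y), len(pi0)
    alpha = np.zeros((T, K))
    # forward filtering: p(st | y1:t, u1:t)
    alpha[0] = (pi0 > u[0]) * B[:, y[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        allowed = (pi > u[t]).astype(float)       # indicator: pi_ij > u_t
        alpha[t] = (alpha[t - 1] @ allowed) * B[:, y[t]]
        alpha[t] /= alpha[t].sum()
    # backward sampling: sT from the last filtered marginal, then st | st+1
    s = np.empty(T, dtype=int)
    s[T - 1] = rng.choice(K, p=alpha[T - 1])
    for t in range(T - 2, -1, -1):
        w = alpha[t] * (pi[:, s[t + 1]] > u[t + 1])
        s[t] = rng.choice(K, p=w / w.sum())
    return s
```

Because each slice keeps only transitions above ut, the inner sums run over a small allowed set rather than all K states.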
Experiments
Toy example 1: examining convergence speed & sensitivity to the prior setting
Transition: 1-2-3-4-1-2-3-…, p=0.01 self-transition
Observation: discrete, yt | st ~ F(φst)
Strong / vague / fixed prior settings for α and γ
[Figure: # states summed up, under the three prior settings]
Experiments
Toy example 2: examining performance on positively correlated data
Self transition =
[Two figures: emission distributions F(φ) for the two settings]
Experiments
Real example 1: changepoint detection (Well data)
State partition from one beam sampling iteration
Probability that two datapoints are in one segment
Gibbs sampling: slow convergence; harder decisions
Beam sampling: fast convergence; softer decisions
Experiments
Real example 2: text prediction (Alice’s Adventures in Wonderland)
iHMM by Gibbs sampling & beam sampling:
• have similar results
• converge to around K = 16 states
VB HMM:
• model selection: around K = 16
• worse than iHMM
Conclusion
The beam sampler is introduced for iHMM inference
Beam sampler combines slice sampling and dynamic programming
Slice sampling limits the number of states considered at each time step to a
finite number
Dynamic programming samples whole hidden state trajectories efficiently
Advantages of beam sampler:
converges faster than Gibbs sampler
mixes well regardless of strong correlations in the data
more robust with respect to varying initialization and prior
distribution