Beam Sampling for the Infinite Hidden Markov Model
by
Jurgen Van Gael, Yunus Saatci, Yee Whye Teh and Zoubin Ghahramani
(ICML 2008)
Presented by Lihan He
ECE, Duke University
Nov 14, 2008
Outline
Introduction
Infinite HMM
Beam sampler
Experimental results
Conclusion
Introduction: HMM
HMM: hidden Markov model
[Graphical model: Markov chain s0 → s1 → s2 → … → sT, each st emitting yt; initial distribution π0, transition matrix π]

Model parameters {π0, π, φ1, …, φK}
Hidden state sequence s = {s1, s2, …, sT}, st ∈ {1, 2, …, K}
Observation sequence y = {y1, y2, …, yT}
π0i = p(s1=i)
πij = p(st=j|st-1=i)
yt | st ~ F(φst)
Complete likelihood: p(s, y) = π0,s1 F(y1; φs1) ∏t=2:T πst-1,st F(yt; φst)
K: number of states
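As a concrete sketch of these definitions, the complete likelihood can be evaluated directly (a minimal illustration with made-up numbers; `complete_log_likelihood` is my own name, and discrete emissions are assumed):

```python
import numpy as np

def complete_log_likelihood(s, y, pi0, pi, B):
    """log p(s, y | pi0, pi, phi) for a discrete-emission HMM.

    s  : state sequence (length T, values in {0, ..., K-1})
    y  : observation sequence (length T)
    pi0: initial distribution, pi0[i] = p(s1 = i)
    pi : transition matrix, pi[i, j] = p(st = j | st-1 = i)
    B  : emission matrix, B[k, o] = p(yt = o | st = k)
    """
    ll = np.log(pi0[s[0]]) + np.log(B[s[0], y[0]])
    for t in range(1, len(s)):
        ll += np.log(pi[s[t - 1], s[t]]) + np.log(B[s[t], y[t]])
    return ll

# Toy K = 2 example
pi0 = np.array([0.5, 0.5])
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])
B = np.array([[0.9, 0.1],
              [0.2, 0.8]])
ll = complete_log_likelihood([0, 0, 1], [0, 0, 1], pi0, pi, B)
```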
Introduction: HMM Inference
Inference of HMM: forward-backward algorithm
Maximum likelihood: overfitting problem
Bayesian learning: VB or MCMC
If we don’t know K a priori
Model selection: inference for all K; computationally expensive.
Nonparametric Bayesian model: iHMM (HMM with an infinite number of states)
With iHMM framework
The forward-backward algorithm cannot be applied since the number of states K
is infinite.
Gibbs sampling can be used, but convergence is very slow due to the strong
dependencies between consecutive time steps.
Beam sampling = slice sampling + dynamic programming
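For a known, finite K, the forward pass underlying the forward-backward algorithm can be sketched as follows (an illustrative sketch, not the paper's code; the function name and the discrete-emission assumption are mine):

```python
import numpy as np

def forward_filter(y, pi0, pi, B):
    """Forward pass of the forward-backward algorithm for a finite-K HMM.

    Returns the filtered marginals p(st | y1:t) (rows of alpha) and the
    log marginal likelihood log p(y1:T). B[k, o] = p(yt = o | st = k).
    """
    T, K = len(y), len(pi0)
    alpha = np.zeros((T, K))
    log_lik = 0.0
    pred = pi0                       # predictive distribution p(s1)
    for t in range(T):
        unnorm = pred * B[:, y[t]]   # joint of state and new observation
        z = unnorm.sum()             # p(yt | y1:t-1)
        alpha[t] = unnorm / z
        log_lik += np.log(z)
        pred = alpha[t] @ pi         # predict p(st+1 | y1:t)
    return alpha, log_lik
```

The cost is O(T K^2) per sweep, which is exactly what becomes unusable when K is infinite.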
Introduction: Beam Sampling
Slice sampling: limit the number of states considered at each time step to a
finite number
Dynamic programming: sample whole state trajectory efficiently
Advantages:
Converges in far fewer iterations than Gibbs sampling
Actual complexity per iteration is only marginally higher than that of Gibbs sampling
Mixes well regardless of strong correlations in the data
More robust with respect to varying initialization and prior distribution
Implemented via HDP
Infinite HMM

Implemented via the hierarchical Dirichlet process:
G0 ~ DP(γ, H)
Gk ~ DP(α, G0)

In the stick-breaking representation:
β ~ GEM(γ)
πk | β ~ DP(α, β)  (transition probability)
φk ~ H  (emission distribution parameter)
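The stick-breaking construction can be sketched under a finite truncation (an illustrative sketch with truncation level K; the function names are mine, not the paper's, and the Dirichlet draw for each row is the standard truncated approximation of DP(α, β)):

```python
import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(gamma, K):
    """Truncated draw of beta ~ GEM(gamma): the shared base weights."""
    v = rng.beta(1.0, gamma, size=K)                       # stick proportions
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v[:-1])))
    return v * remaining                                   # beta_k = v_k * prod_{l<k} (1 - v_l)

def transition_rows(alpha, beta):
    """Each row pi_k ~ DP(alpha, beta); under truncation this is
    approximately a Dirichlet(alpha * beta) draw."""
    return np.vstack([rng.dirichlet(alpha * beta) for _ in range(len(beta))])

beta = stick_breaking(gamma=2.0, K=10)
pi = transition_rows(alpha=1.0, beta=beta)
```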
Beam Sampler
Intuitive thought: only consider the states with large transition probabilities so that the number of possible states in each time step is finite.
Problems with this approximation:
How to define a “large” transition probability?
Truncation might change the distributions of other variables
Idea: introduce auxiliary variable u such that conditioned on u the number of trajectories with positive probability is finite.
The auxiliary variables do not change the marginal distribution over the other variables, so MCMC sampling will still converge to the true posterior.
Beam Sampler
Sampling u: for each t we introduce an auxiliary variable ut with conditional distribution (conditional on π, st-1 and st): ut ~ Uniform(0, πst-1,st)
Sampling s: we sample the whole trajectory s given u and other variables using a form of forward filtering-backward sampling.
Forward filtering: compute p(st | y1:t, u1:t) sequentially for t = 1, 2, …, T
Backward sampling: sample st sequentially for t = T, T-1, …, 2, 1
Only trajectories s with ut < πst-1,st for all t will have non-zero probability given u
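The slice step itself is trivial to implement (a hedged illustration; `sample_slices` is a hypothetical name, and I treat the first slice u1 as conditioned on the initial distribution π0):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_slices(s, pi0, pi):
    """Auxiliary slice variables ut ~ Uniform(0, pi[s_{t-1}, s_t]).

    The first slice uses the initial distribution pi0.
    """
    T = len(s)
    u = np.empty(T)
    u[0] = rng.uniform(0.0, pi0[s[0]])
    for t in range(1, T):
        u[t] = rng.uniform(0.0, pi[s[t - 1], s[t]])
    return u
```

Conditioned on u, only finitely many transitions satisfy πij > ut, which is what makes dynamic programming over the infinite state space possible.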
Beam Sampler
• Computing p(st | -) only needs to sum up a finite part of p(st-1 | -)
• We only need to compute p(st | y1:t, u1:t) for the finitely many st values belonging to some trajectory with positive probability
Forward filtering: p(st | y1:t, u1:t) ∝ p(yt | st) Σ{st-1: ut < πst-1,st} p(st-1 | y1:t-1, u1:t-1)
Backward sampling:
Sample sT from p(sT | y1:T, u1:T)
Sample st given the sample for st+1: p(st | st+1, y1:T, u1:T) ∝ p(st | y1:t, u1:t) 1(ut+1 < πst,st+1)
Sampling φ, π, β: directly from the theory of HDPs
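Putting the two passes together, the conditional trajectory update can be sketched for a truncated representation (an illustrative sketch, not the authors' code; discrete emissions B and the name `beam_ffbs` are my own assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

def beam_ffbs(y, u, pi0, pi, B):
    """Forward filtering-backward sampling restricted to the slice:
    only transitions with pi[i, j] > u_t contribute.

    B[k, o] = p(yt = o | st = k); states are 0, ..., K-1.
    """
    T, K = len(y), len(pi0)
    alpha = np.zeros((T, K))
    # forward filtering: p(st | y1:t, u1:t)
    alpha[0] = (pi0 > u[0]) * B[:, y[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):
        allowed = (pi > u[t]).astype(float)       # indicator: pi_ij > u_t
        alpha[t] = (alpha[t - 1] @ allowed) * B[:, y[t]]
        alpha[t] /= alpha[t].sum()
    # backward sampling: sT from the last filtered marginal, then st | st+1
    s = np.empty(T, dtype=int)
    s[T - 1] = rng.choice(K, p=alpha[T - 1])
    for t in range(T - 2, -1, -1):
        w = alpha[t] * (pi[:, s[t + 1]] > u[t + 1])
        s[t] = rng.choice(K, p=w / w.sum())
    return s
```

Because each slice keeps only transitions above ut, the inner sums run over a small allowed set rather than all K states.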
Experiments
Toy example 1: examining convergence speed & sensitivity to the prior setting
Transition: 1-2-3-4-1-2-3-…, p=0.01 self-transition
Observation: discrete, yt | st ~ F(φst)
Strong / vague / fixed prior settings for α and γ
[Figure: # states summed up, under the three prior settings]
Experiments
Toy example 2: examining performance on positively correlated data
Self transition =
[Two figures: emission distributions F(φ) for the two settings]
Experiments
Real example 1: changepoint detection (Well data)
State partition from one beam sampling iteration
Probability that two datapoints are in one segment
Gibbs sampling: slow convergence; harder decisions
Beam sampling: fast convergence; softer decisions
Experiments
Real example 2: text prediction (Alice’s Adventures in Wonderland)
iHMM by Gibbs sampling & beam sampling:
• have similar results
• converge to around K = 16 states
VB HMM:
• model selection: around K = 16
• worse than iHMM
Conclusion
The beam sampler is introduced for iHMM inference
Beam sampler combines slice sampling and dynamic programming
Slice sampling limits the number of states considered at each time step to a
finite number
Dynamic programming samples whole hidden state trajectories efficiently
Advantages of beam sampler:
converges faster than Gibbs sampler
mixes well regardless of strong correlations in the data
more robust with respect to varying initialization and prior
distribution