
Page 1: An Online Spectral Learning Algorithm for Partially Observable Nonlinear Dynamical Systems

Byron Boots and Geoffrey J. Gordon

Carnegie Mellon University, Select Lab

AAAI 2011

Page 2: What is out there?

. . . o_{t-2}  o_{t-1}  o_t  o_{t+1}  o_{t+2} . . .

Page 3: What is out there? Dynamical Systems

. . . o_{t-2}  o_{t-1}  o_t  o_{t+1}  o_{t+2} . . .

Past | State | Future

Dynamical System = a recursive rule for updating state based on observations

Page 4: Learning a Dynamical System

. . . o_{t-2}  o_{t-1}  o_t  o_{t+1}  o_{t+2} . . .

Past | State | Future

We would like to learn a model of a dynamical system.

Page 5: Learning a Dynamical System

. . . o_{t-2}  o_{t-1}  o_t  o_{t+1}  o_{t+2} . . .

Past | State | Future

We would like to learn a model of a dynamical system.

Today I will focus on spectral learning algorithms for Predictive State Representations.

Page 6: Predictive State Representations (PSRs)

A PSR is comprised of:

• A: set of actions
• O: set of observations
• x_1 ∈ R^d: initial state
• M_ao ∈ R^{d×d}: set of transition matrices
• e ∈ R^d: normalization vector

Page 7: Predictive State Representations (PSRs)

A PSR is comprised of:

• A: set of actions
• O: set of observations
• x_1 ∈ R^d: initial state
• M_ao ∈ R^{d×d}: set of transition matrices
• e ∈ R^d: normalization vector

At each time step the PSR can predict:

P(o | x_t, do(a_t)) = e^⊤ M_{a_t,o} x_t

Page 10: Predictive State Representations (PSRs)

A PSR is comprised of:

• A: set of actions
• O: set of observations
• x_1 ∈ R^d: initial state
• M_ao ∈ R^{d×d}: set of transition matrices
• e ∈ R^d: normalization vector

At each time step the PSR can predict:

P(o | x_t, do(a_t)) = e^⊤ M_{a_t,o} x_t

and update its state:

x_{t+1} = M_{a_t,o_t} x_t / P(o_t | x_t, do(a_t))
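To make the predict and update rules concrete, here is a minimal NumPy sketch of a PSR filter. The names x1, M, and e mirror the slide's parameters, but the toy dimensions and matrix values are made-up assumptions for illustration only.

```python
import numpy as np

def psr_predict(x, M_ao, e):
    """Predicted probability of observation o after doing action a: e^T M_ao x."""
    return float(e @ M_ao @ x)

def psr_update(x, M_ao, p):
    """Condition the state on the executed (a, o) pair and renormalize."""
    return (M_ao @ x) / p

# Toy PSR with latent dimension d = 2, one action, two observations (illustrative only).
d = 2
x = np.array([1.0, 0.0])              # initial state x_1
e = np.ones(d)                        # normalization vector
M = {("a", "o0"): np.array([[0.6, 0.1],
                            [0.1, 0.2]]),
     ("a", "o1"): np.array([[0.1, 0.2],
                            [0.2, 0.5]])}

for o in ["o0", "o1", "o0"]:          # filter along an arbitrary observation sequence
    p = psr_predict(x, M[("a", o)], e)
    x = psr_update(x, M[("a", o)], p)
    print(f"P({o} | x_t, do(a)) = {p:.3f}")
```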

Page 13: Predictive State Representations

PSR parameters are only determined up to a similarity transform: if, for any invertible S ∈ R^{d×d}, we replace

x_1 → S^{-1} x_1,   M_ao → S^{-1} M_ao S,   e → S^⊤ e

the resulting PSR makes exactly the same predictions as the original one, e.g.

P(o | x_t, do(a_t)) = e^⊤ S S^{-1} M_{a_t,o} S S^{-1} x_t
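As a quick numerical check of this invariance, the following sketch applies a random similarity transform to a toy PSR and verifies that the one-step prediction is unchanged; the specific matrices are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3

# A toy PSR (illustrative values only).
x1 = rng.random(d)
M_ao = rng.random((d, d))
e = np.ones(d)

# Apply a random invertible similarity transform S.
S = rng.random((d, d)) + np.eye(d)
x1_s = np.linalg.solve(S, x1)          # S^{-1} x_1
M_s = np.linalg.solve(S, M_ao @ S)     # S^{-1} M_ao S
e_s = S.T @ e                          # S^T e

# Both parameterizations give the same one-step prediction e^T M_ao x_1.
print(e @ M_ao @ x1, e_s @ M_s @ x1_s)
```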

Page 14: PSRs Are Very Expressive

For a fixed latent dimension d, PSRs subsume:

• Reduced-Rank HMMs & Reduced-Rank POMDPs
• HMMs & POMDPs

Page 15: Learning PSRs

We can use fast, statistically consistent spectral methods to learn PSR parameters.

Page 16: A General Principle

data about past (many samples)  →  compress  →  state (bottleneck)  →  expand / predict  →  data about future (many samples)

Page 17: A General Principle

data about past (many samples)  →  compress  →  state (bottleneck)  →  expand / predict  →  data about future (many samples)

If the bottleneck is a rank constraint, then we get a spectral method.

Page 18: Why Spectral Methods?

There are many ways to learn a dynamical system:

• Maximum Likelihood via Expectation Maximization, Gradient Descent, ...
• Bayesian inference via Gibbs sampling, Metropolis-Hastings, ...

In contrast to these methods, spectral learning algorithms give:

• No local optima
  ‣ Huge gain in computational efficiency
• Slight loss in statistical efficiency

Page 19: Spectral Learning for PSRs

Moments of directly observable features:

Σ_{T,AO,H}: "trivariance" tensor of features of the future, present, and past
Σ_{T,H}: covariance matrix of features of the future and past
Σ_{AO,AO}: covariance matrix of features of the present

Page 20: Spectral Learning for PSRs

Moments of directly observable features:

Σ_{T,AO,H}: "trivariance" tensor of features of the future, present, and past
Σ_{T,H}: covariance matrix of features of the future and past
Σ_{AO,AO}: covariance matrix of features of the present

U: left d singular vectors of Σ_{T,H}

Page 21: Spectral Learning for PSRs

Moments of directly observable features:

Σ_{T,AO,H}: "trivariance" tensor of features of the future, present, and past
Σ_{T,H}: covariance matrix of features of the future and past
Σ_{AO,AO}: covariance matrix of features of the present

U: left d singular vectors of Σ_{T,H}

S^{-1} M_ao S := Σ_{T,AO,H} ×_1 U^⊤ ×_2 φ(ao)^⊤ (Σ_{AO,AO})^{-1} ×_3 (Σ_{T,H}^⊤ U)^†

The other parameters can be found analogously.

Page 22: Spectral Learning for PSRs

Spectral Learning Algorithm:
• Estimate Σ_{T,AO,H}, Σ_{T,H}, and Σ_{AO,AO} from data
• Find U by SVD
• Plug in to recover PSR parameters

• Learning is statistically consistent
• Only requires linear algebra

For details, see:
B. Boots, S. M. Siddiqi, and G. Gordon. Closing the learning-planning loop with predictive state representations. RSS, 2010.
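The batch algorithm amounts to a few covariance estimates, one SVD, and a plug-in formula. Below is a rough NumPy sketch of that pipeline for the simpler uncontrolled, discrete case (indicator features, no actions, so the Σ_{AO,AO}^{-1} normalization from the general formula is not needed); all array names and the toy data are assumptions for illustration, not the paper's exact estimator.

```python
import numpy as np

def spectral_learn_operators(obs, n_obs, d):
    """Sketch: spectral learning of observable operators for an uncontrolled,
    discrete model (an HMM-style special case of the slide's formula).

    obs   : 1-D integer array of observations in {0, ..., n_obs-1}
    n_obs : number of observation symbols
    d     : latent dimension (rank of the bottleneck)
    """
    # Empirical moments with indicator features:
    #   Sigma_TH  ~ E[phi(o_{t+1}) phi(o_t)^T]                      (future x past)
    #   Sigma_TOH ~ E[phi(o_{t+2}) (x) phi(o_{t+1}) (x) phi(o_t)]   (future x present x past)
    Sigma_TH = np.zeros((n_obs, n_obs))
    Sigma_TOH = np.zeros((n_obs, n_obs, n_obs))
    for t in range(len(obs) - 2):
        h, ao, f = obs[t], obs[t + 1], obs[t + 2]
        Sigma_TH[ao, h] += 1.0
        Sigma_TOH[f, ao, h] += 1.0
    Sigma_TH /= len(obs) - 2
    Sigma_TOH /= len(obs) - 2

    # Bottleneck: U = top-d left singular vectors of Sigma_TH.
    U = np.linalg.svd(Sigma_TH)[0][:, :d]

    # Plug-in: B_o = (U^T Sigma_{T,o,H}) (U^T Sigma_{T,H})^+
    pinv = np.linalg.pinv(U.T @ Sigma_TH)
    return [U.T @ Sigma_TOH[:, o, :] @ pinv for o in range(n_obs)]

# Toy usage on a random binary sequence (illustrative only).
rng = np.random.default_rng(0)
B = spectral_learn_operators(rng.integers(0, 2, size=20_000), n_obs=2, d=2)
print(B[0].shape)  # (2, 2)
```

The initial state and normalization vector can be recovered analogously, as the slide notes for the general case.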

Page 23: Infinite Features

• Can extend the learning algorithm to infinite feature spaces
  ‣ Kernels
• The learning algorithm we have seen is just linear algebra
  ‣ works just fine in an arbitrary RKHS
  ‣ can rewrite all of the formulas in terms of Gram matrices
  ‣ uses kernel SVD instead of SVD

Result: Hilbert Space Embeddings of Dynamical Systems

• handles nearly arbitrary observation distributions
• good prediction performance

For details, see:
L. Song, B. Boots, S. M. Siddiqi, G. Gordon, and A. J. Smola. Hilbert space embeddings of hidden Markov models. ICML, 2010.
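The "kernel SVD via Gram matrices" step can be sketched as follows. This is not the full Hilbert space embedding algorithm, just a minimal illustration of the Gram-matrix trick it relies on (essentially uncentered kernel PCA); the RBF kernel, bandwidth, and toy data are assumptions.

```python
import numpy as np

def rbf_gram(X, Y, bandwidth):
    """Gaussian RBF Gram matrix K[i, j] = exp(-||x_i - y_j||^2 / (2 * bandwidth^2))."""
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * bandwidth**2))

# "Kernel SVD" via the Gram matrix: the top-d singular directions of the
# (infinite-dimensional) feature covariance come from an eigendecomposition
# of the finite Gram matrix, so explicit features are never formed.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))           # toy observations (illustrative only)
d = 5
K = rbf_gram(X, X, bandwidth=1.0)
eigvals, eigvecs = np.linalg.eigh(K)    # ascending order
alpha = eigvecs[:, -d:] / np.sqrt(np.maximum(eigvals[-d:], 1e-12))
# Columns of `alpha` are expansion coefficients: the j-th direction in the RKHS is
# sum_i alpha[i, j] * k(x_i, .), and projections of new points are K_new @ alpha.
K_new = rbf_gram(rng.normal(size=(10, 3)), X, bandwidth=1.0)
print((K_new @ alpha).shape)            # (10, 5)
```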

Page 24: An Experiment

[Figure from L. Song et al., "Hilbert Space Embeddings of Hidden Markov Models" (ICML 2010), Figure 4: slot car inertial measurement data. (A) The slot car platform with IMU and the racetrack. (B) Average prediction error (MSE) versus prediction horizon for the embedded (HSE-HMM, here labeled PSR), RR-HMM, LDS, HMM, Mean, and Last models; the embedded model yields the lowest error consistently across the prediction horizon.]

thanks to Dieter Fox's lab

For details, see:
L. Song, B. Boots, S. M. Siddiqi, G. Gordon, and A. J. Smola. Hilbert space embeddings of hidden Markov models. ICML, 2010.

Page 25: Batch Methods

• Bottleneck: SVD of Gram or covariance matrix
  ‣ G: (# time steps)²
  ‣ C: (# features × window length) × (# time steps)
• E.g., 1 hr video, 24 fps, 300×300 pixels, features of past and future are all pixels in 2 s windows
  ‣ G: (3600 × 24) × (3600 × 24) ≈ 10^10 entries (3600 s × 24 fps = 86,400 time steps; 86,400² ≈ 7.5 × 10^9)

Page 26: Making it Fast

• Two techniques:
  ‣ online learning
  ‣ random projections
• Neither one is new, but the combination with spectral learning for PSRs is, and it makes a huge difference in practice.

Page 27: Online Learning

S^{-1} M_ao S := Σ_{T,AO,H} ×_1 U^⊤ ×_2 φ(ao)^⊤ (Σ_{AO,AO})^{-1} ×_3 (Σ_{T,H}^⊤ U)^†

U: left d singular vectors of Σ_{T,H}

Page 28: Online Learning

S^{-1} M_ao S := Σ_{T,AO,H} ×_1 U^⊤ ×_2 φ(ao)^⊤ (Σ_{AO,AO})^{-1} ×_3 (Σ_{T,H}^⊤ U)^†

U: left d singular vectors of Σ_{T,H}

• With each new observation, rank-1 update of:
  ‣ the SVD (Brand)
  ‣ the inverse (Sherman-Morrison), as sketched below
• n features; latent dimension d; T steps
  ‣ space = O(nd): may fit in cache!
  ‣ time = O(nd²T): bounded time per example
• Problem: no rank-1 update of kernel SVD!
  ‣ can use random projections [Rahimi & Recht, 2007]
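For the inverse part of the update, here is a small NumPy sketch of the Sherman-Morrison identity, which refreshes A^{-1} after a rank-1 change A + u v^T in O(n²) instead of recomputing the inverse from scratch; the matrices are random placeholders.

```python
import numpy as np

def sherman_morrison(A_inv, u, v):
    """Return (A + u v^T)^{-1} given A^{-1}, via the Sherman-Morrison identity."""
    Au = A_inv @ u
    vA = v @ A_inv
    return A_inv - np.outer(Au, vA) / (1.0 + v @ Au)

rng = np.random.default_rng(0)
n = 5
A = rng.normal(size=(n, n)) + n * np.eye(n)   # well-conditioned placeholder matrix
A_inv = np.linalg.inv(A)
u, v = rng.normal(size=n), rng.normal(size=n)

updated = sherman_morrison(A_inv, u, v)
print(np.allclose(updated, np.linalg.inv(A + np.outer(u, v))))  # True
```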

Page 29: Random Projections

S^{-1} M_ao S := Σ_{T,AO,H} ×_1 U^⊤ ×_2 φ(ao)^⊤ (Σ_{AO,AO})^{-1} ×_3 (Σ_{T,H}^⊤ U)^†

U: left d singular vectors of Σ_{T,H}

• Problem: no rank-1 update of the kernel SVD!
  ‣ can use random projections [Rahimi & Recht, 2007] to approximate the kernel with a finite set of random features, as sketched below
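As a sketch of the random-projection idea, the following generates Rahimi-Recht random Fourier features whose inner products approximate a Gaussian RBF kernel, so the finite-dimensional online updates above can stand in for the kernel computation; the dimensions and bandwidth are illustrative assumptions.

```python
import numpy as np

def random_fourier_features(X, n_features, bandwidth, rng):
    """Random Fourier features z(x) with z(x).z(y) ~ exp(-||x - y||^2 / (2*bandwidth^2))."""
    dim = X.shape[1]
    W = rng.normal(scale=1.0 / bandwidth, size=(dim, n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))                     # toy observations (illustrative)
Z = random_fourier_features(X, n_features=5000, bandwidth=1.0, rng=rng)

# Compare the feature-space inner product to the exact RBF kernel value.
approx = Z[0] @ Z[1]
exact = np.exp(-np.sum((X[0] - X[1]) ** 2) / 2.0)
print(approx, exact)   # the two values should be close
```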

Page 30: Experiment (Revisited)

[Plot: average squared prediction error (MSE, ×10^6) versus prediction horizon on the slot car IMU data, comparing an HMM, the mean-observation baseline, the batch Embedded (HSE-HMM) model, and the Online model learned with random features; the underlying figure and experimental setup are from L. Song et al., "Hilbert space embeddings of hidden Markov models," ICML 2010.]

thanks to Dieter Fox's lab

Page 31: Online Learning Example

Conference Room

[Figure: panels A, B, and C of the conference room example, with views labeled "Table", "After 100 samples", and "After 600 samples".]

Page 32: Online Learning Example

Conference Room

[Figure: panels A, B, and C of the conference room example, labeled "Table", "After 100 samples", and "After 600 samples".]

‣ online + random features: 100k features, 11k frames, limit = available data
‣ offline: 2k frames, compressed & subsampled, compute-limited

Page 33: Online Learning Example

Conference Room

[Figure: learned result after 100 steps, alongside the panels from the previous slide.]

‣ online + random features: 100k features, 11k frames, limit = available data
‣ offline: 2k frames, compressed & subsampled, compute-limited

Page 34: Online Learning Example

Conference Room

[Figure: learned result after 350 steps, alongside the panels from the previous slides.]

‣ online + random features: 100k features, 11k frames, limit = available data
‣ offline: 2k frames, compressed & subsampled, compute-limited

Page 35: Online Learning Example

Conference Room

[Figure: learned result after 600 steps, alongside the panels from the previous slides.]

‣ online + random features: 100k features, 11k frames, limit = available data
‣ offline: 2k frames, compressed & subsampled, compute-limited

Page 36: Online Learning Example

Conference Room

[Figure: final embedding after 600 steps (colors = 3rd dimension), alongside the panels from the previous slides.]

‣ online + random features: 100k features, 11k frames, limit = available data
‣ offline: 2k frames, compressed & subsampled, compute-limited

Page 37: Paper Summary

• We present spectral learning algorithms for PSR models of partially observable nonlinear dynamical systems.
• We show how to update the parameters of the estimated PSR model given new data
  ‣ an efficient online spectral learning algorithm
• We show how to use random projections to approximate kernel-based learning algorithms.