
Scalable GP-LSTMs with Semi-Stochastic Gradients
Maruan Al-Shedivat, Andrew Gordon Wilson, Yunus Saatchi, Zhiting Hu, Eric P. Xing
SAILING LAB

Summary

• GP-LSTM: a Gaussian process model which fully encapsulates the structural properties of LSTMs.

• Semi-stochastic gradient descent: a new, provably convergent optimization procedure for learning recurrent kernels.

• State-of-the-art results on sequence-to-reals regression.
• Predictive uncertainty for autonomous driving and other tasks.

Full paper: https://arxiv.org/abs/1610.08936
Code: https://github.com/alshedivat/kgp

1. Background & Motivation

Regression on sequences

[Figure: schematic of regression on sequences.]

• Input sequences: $X = \{\bar{x}_i\}_{i=1}^n$, where $\bar{x}_i = [x_i^1, x_i^2, \cdots, x_i^{l_i}]$ and $x_i^j \in \mathcal{X}$.
• Real-valued targets: $\mathbf{y} = \{y_i\}_{i=1}^n$, $y_i \in \mathbb{R}$.
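For concreteness, a minimal NumPy sketch of this data layout; the toy shapes, the fixed common length L (i.e., sequences padded or truncated), and the synthetic values are assumptions for illustration only:

```python
import numpy as np

# Toy sequence-to-reals dataset: n sequences of length L with d-dimensional
# inputs, and one real-valued target per sequence.
rng = np.random.default_rng(0)
n, L, d = 100, 10, 3
X = rng.normal(size=(n, L, d))   # x_i = [x_i^1, ..., x_i^L], each x_i^j in R^d
y = rng.normal(size=n)           # y_i in R
```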

Applications:
• Prediction of expensive sensory data in robotic systems.
• Temporal modeling of medical biomarkers for predictive healthcare.
• Energy / power forecasting.
• Financial markets.

Recurrent neural networks
Advantages:
• A flexible framework for composing complex models.
• Accounts for long-range nonlinear correlations.
• Trainable by backprop.
Drawbacks:
• Easily overfits to noise.

Gaussian processes
Advantages:
• Enjoys Bayesian nonparametric flexibility of the GPs.
• Robust to overfitting.
Drawbacks:
• Accounts only for pairwise linear correlations between the time points.

Our proposal: Combine GPs with LSTMs to get flexible recurrent Bayesian nonparametric models that are robust to noise and able to provide predictive uncertainty estimates.


2. Gaussian Processes with Recurrent Kernels

[Figure: (a) LSTM, (b) GP-LSTM, (c) graphical model view.]

LSTM-structured kernel

Let $\phi_w : \mathcal{X}^L \to \mathcal{H}$ be a recurrent transformation. Define the kernel as:

$$\tilde{k}_\theta(\bar{x}, \bar{x}') = k_\gamma(\phi_w(\bar{x}), \phi_w(\bar{x}')),$$

where $k_\gamma : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is the base kernel, $\bar{x}, \bar{x}' \in \mathcal{X}^L$ are $L$-length sequences, $\tilde{k}_\theta : \mathcal{X}^L \times \mathcal{X}^L \to \mathbb{R}$, and $\theta = \{\gamma, w\}$.
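A minimal NumPy sketch of this composition, assuming an RBF base kernel for $k_\gamma$ and treating $\phi_w$ as an arbitrary callable (on the poster it is the LSTM hidden-state map); the helper names `rbf_base_kernel` and `recurrent_kernel` are illustrative, not from the released code:

```python
import numpy as np

def rbf_base_kernel(H1, H2, gamma=1.0):
    """Base kernel k_gamma evaluated between rows of H1 and H2 (embeddings)."""
    sq_dists = ((H1[:, None, :] - H2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def recurrent_kernel(X1, X2, phi_w, gamma=1.0):
    """k_tilde_theta(x, x') = k_gamma(phi_w(x), phi_w(x')).

    phi_w is any callable mapping a batch of length-L sequences to
    fixed-size embeddings, e.g. the final hidden state of an LSTM.
    """
    return rbf_base_kernel(phi_w(X1), phi_w(X2), gamma=gamma)

# Stand-in phi_w (mean over time steps) just to make the sketch runnable:
# K = recurrent_kernel(X, X, phi_w=lambda A: A.mean(axis=1))
```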

Negative log marginal likelihood objective

$$\mathcal{L}(K_\theta) = -\overbrace{\mathbf{y}^\top (K_\theta + \sigma^2 I)^{-1}\mathbf{y}}^{\text{model fit}} \;-\; \overbrace{\log\det(K_\theta + \sigma^2 I)}^{\text{complexity penalty}} \;+\; \text{const}$$

$$\nabla_\theta \mathcal{L} = \frac{1}{2}\,\mathrm{tr}\!\left[\left(K_\theta^{-1}\mathbf{y}\mathbf{y}^\top K_\theta^{-1} - K_\theta^{-1}\right)\nabla_\theta K_\theta\right]$$

Advantages:
• Enjoys Bayesian nonparametric flexibility of the GPs.
• Robust to overfitting.
Drawbacks:
• Does not factorize over the data.
• Involves kernel matrix inverses.
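For reference, a minimal NumPy sketch of the two terms of the objective and the trace-form gradient above; the Cholesky-based solve, the explicit inverse, and folding the noise term $\sigma^2 I$ into $K_\theta$ are implementation choices for the sketch, not taken from the poster:

```python
import numpy as np

def objective_terms(K, y, sigma2=0.1):
    """Model-fit and complexity terms, with the noise sigma^2 * I folded into K."""
    Ky = K + sigma2 * np.eye(len(y))
    L = np.linalg.cholesky(Ky)                           # Ky = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # alpha = Ky^{-1} y
    fit = float(y @ alpha)                               # y^T Ky^{-1} y
    complexity = 2.0 * float(np.log(np.diag(L)).sum())   # log det(Ky)
    return fit, complexity

def objective_grad(K, y, dK_dtheta, sigma2=0.1):
    """Trace formula: 0.5 * tr[(a a^T - Ky^{-1}) dK/dtheta], with a = Ky^{-1} y."""
    Ky = K + sigma2 * np.eye(len(y))
    Ky_inv = np.linalg.inv(Ky)
    a = Ky_inv @ y
    return 0.5 * float(np.trace((np.outer(a, a) - Ky_inv) @ dK_dtheta))
```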

Algorithm: semi-stochastic asynchronous gradient descent

Input: data (X, y), base kernel k_γ(·, ·), recurrent transformation φ_w(·)
1. Initialize γ and w; compute the initial K
2. repeat
3.     γ ← γ + update_γ(X, w, K)
4.     for all mini-batches X_b in X
5.         w ← w + update_w(X_b, w, K_stale)
6.     end for
7. until convergence
Output: optimal γ* and w*
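A minimal Python sketch of this loop; the callables `compute_K`, `update_gamma`, `update_w`, and `batches`, the extra target arguments, and the fixed epoch count standing in for the convergence check are placeholders rather than the released implementation:

```python
def semi_stochastic_descent(X, y, gamma, w, compute_K, update_gamma, update_w,
                            batches, n_epochs=50):
    """Alternate a full-batch step on the base-kernel hyperparameters gamma with
    mini-batch steps on the recurrent weights w against a stale (delayed) kernel."""
    K = compute_K(X, gamma, w)                    # initial kernel matrix
    for _ in range(n_epochs):                     # stands in for `repeat ... until convergence`
        gamma = gamma + update_gamma(X, y, w, K)  # full-batch update of gamma
        K_stale = K                               # kernel kept fixed during the inner loop
        for Xb, yb in batches(X, y):              # mini-batch updates of w
            w = w + update_w(Xb, yb, w, K_stale)
        K = compute_K(X, gamma, w)                # refresh K with the new parameters
    return gamma, w
```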

Theorem [Al-Shedivat et al., 2016]. Semi-stochastic asynchronous gradient descent with $\tau$-delayed kernel updates converges to a fixed point when the learning rate, $\lambda_t$, decays as $\Theta\!\left(1/(\tau\, t^{\frac{1+\delta}{2}})\right)$ for any $\delta \in (0, 1]$.
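As one concrete instance of a schedule meeting the theorem's decay condition, a small sketch; the constant `lam0` and the default δ = 1 are arbitrary illustrative choices:

```python
def learning_rate(t, tau, lam0=0.1, delta=1.0):
    """lambda_t = lam0 / (tau * t**((1 + delta) / 2)) for t = 1, 2, ...,
    a Theta(1 / (tau * t^((1+delta)/2))) decay as required by the theorem."""
    assert 0.0 < delta <= 1.0
    return lam0 / (tau * t ** ((1.0 + delta) / 2.0))
```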


3. Experiments

Qualitative: Lane prediction for autonomous driving

Figure 2: Left: Train and test routes of the autonomous car (East/North in mi; speed color-coded in mi/s). Right: Point-wise estimation of the lanes. Dashed – ground truth, blue – LSTM, red – GP-LSTM.

Quantitative performance evaluation

Figure 3: Test RMSE and train NLML vs. the number of points/parameters per layer, for LSTM and GP-LSTM with 1 and 2 hidden layers.

Data      Task               NARX    RNN     LSTM    RGP     GP-NARX  GP-RNN  GP-LSTM
Drives    system ident.      0.423   0.408   0.382   0.249   0.403    0.332   0.225
Actuator  system ident.      0.482   0.771   0.381   0.368   0.891    0.492   0.347
GEF       power pred.        0.529   0.622   0.465   —       0.780    0.242   0.158
GEF       wind est.          0.837   0.818   0.781   —       0.835    0.792   0.764
Car       lane seq.          0.128   0.331   0.078   —       0.101    0.472   0.055
Car       lead vehicle pos.  0.410   0.452   0.400   —       0.341    0.412   0.312

Table 1: Performance of the models in terms of RMSE.

Convergence of the optimization

Figure 4: Convergence of the optimization with full-/mini-batches (test RMSE and train NLML vs. epoch; test RMSE vs. number of hidden units and vs. number of batches per epoch for lr = 0.1, 0.01, 0.001).

Scalability

Figure 5: Scalability of learning and inference of GP-LSTM (time per epoch and time per test point vs. the number of training points and the number of inducing points).

Workshop on Bayesian Deep Learning, NIPS 2016, Barcelona, Spain Correspondence: {[email protected], [email protected]}