
Scalable GP-LSTMs with Semi-Stochastic Gradients
Maruan Al-Shedivat, Andrew Gordon Wilson, Yunus Saatchi, Zhiting Hu, Eric P. Xing
SAILING LAB

Summary

• GP-LSTM: a Gaussian process model which fully encapsulates the structural properties of LSTMs.

• Semi-stochastic gradient descent: a new, provably convergent optimization procedure for learning recurrent kernels.

• State-of-the-art results on sequence-to-reals regression.
• Predictive uncertainty for autonomous driving and other tasks.

Full paper: https://arxiv.org/abs/1610.08936
Code: https://github.com/alshedivat/kgp

1. Background & Motivation

Regression on sequences

[Figure: schematic of regression on sequences.]

• Input sequences: $X = \{\bar{x}_i\}_{i=1}^n$, where $\bar{x}_i = [x_i^1, x_i^2, \cdots, x_i^{l_i}]$ and $x_i^j \in \mathcal{X}$.
• Real-valued targets: $\mathbf{y} = \{y_i\}_{i=1}^n$, $y_i \in \mathbb{R}$.
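For concreteness, a minimal NumPy sketch of this data layout; the toy shapes, the fixed common length L (i.e., sequences padded or truncated), and the synthetic values are assumptions for illustration only:

```python
import numpy as np

# Toy sequence-to-reals dataset: n sequences of length L with d-dimensional
# inputs, and one real-valued target per sequence.
rng = np.random.default_rng(0)
n, L, d = 100, 10, 3
X = rng.normal(size=(n, L, d))   # x_i = [x_i^1, ..., x_i^L], each x_i^j in R^d
y = rng.normal(size=n)           # y_i in R
```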

Applications:
• Prediction of expensive sensory data in robotic systems.
• Temporal modeling of medical biomarkers for predictive healthcare.
• Energy / power forecasting.
• Financial markets.

Recurrent neural networks
Advantages:
• A flexible framework for composing complex models.
• Accounts for long-range nonlinear correlations.
• Trainable by backprop.
Drawbacks:
• Easily overfits to noise.

Gaussian processes
Advantages:
• Enjoys Bayesian nonparametric flexibility of the GPs.
• Robust to overfitting.
Drawbacks:
• Accounts only for pairwise linear correlations between the time points.

Our proposal: Combine GPs with LSTMs to get flexible recurrent Bayesian nonparametric models that are robust to noise and able to provide predictive uncertainty estimates.


2. Gaussian Processes with Recurrent Kernels

[Figure: (a) LSTM, (b) GP-LSTM, (c) graphical model view.]

LSTM-structured kernel

Let $\phi_w : \mathcal{X}^L \to \mathcal{H}$ be a recurrent transformation. Define the kernel as:

$$\tilde{k}_\theta(\bar{x}, \bar{x}') = k_\gamma(\phi_w(\bar{x}), \phi_w(\bar{x}')),$$

where $k_\gamma : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is the base kernel, $\bar{x}, \bar{x}' \in \mathcal{X}^L$ are $L$-length sequences, $\tilde{k}_\theta : \mathcal{X}^L \times \mathcal{X}^L \to \mathbb{R}$, and $\theta = \{\gamma, w\}$.
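A minimal NumPy sketch of this composition, assuming an RBF base kernel for $k_\gamma$ and treating $\phi_w$ as an arbitrary callable (on the poster it is the LSTM hidden-state map); the helper names `rbf_base_kernel` and `recurrent_kernel` are illustrative, not from the released code:

```python
import numpy as np

def rbf_base_kernel(H1, H2, gamma=1.0):
    """Base kernel k_gamma evaluated between rows of H1 and H2 (embeddings)."""
    sq_dists = ((H1[:, None, :] - H2[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq_dists)

def recurrent_kernel(X1, X2, phi_w, gamma=1.0):
    """k_tilde_theta(x, x') = k_gamma(phi_w(x), phi_w(x')).

    phi_w is any callable mapping a batch of length-L sequences to
    fixed-size embeddings, e.g. the final hidden state of an LSTM.
    """
    return rbf_base_kernel(phi_w(X1), phi_w(X2), gamma=gamma)

# Stand-in phi_w (mean over time steps) just to make the sketch runnable:
# K = recurrent_kernel(X, X, phi_w=lambda A: A.mean(axis=1))
```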

Negative log marginal likelihood objective

$$\mathcal{L}(K_\theta) = -\overbrace{\mathbf{y}^\top (K_\theta + \sigma^2 I)^{-1}\mathbf{y}}^{\text{model fit}} \;-\; \overbrace{\log\det(K_\theta + \sigma^2 I)}^{\text{complexity penalty}} \;+\; \text{const}$$

$$\nabla_\theta \mathcal{L} = \frac{1}{2}\,\mathrm{tr}\!\left[\left(K_\theta^{-1}\mathbf{y}\mathbf{y}^\top K_\theta^{-1} - K_\theta^{-1}\right)\nabla_\theta K_\theta\right]$$

Advantages:
• Enjoys Bayesian nonparametric flexibility of the GPs.
• Robust to overfitting.
Drawbacks:
• Does not factorize over the data.
• Involves kernel matrix inverses.
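For reference, a minimal NumPy sketch of the two terms of the objective and the trace-form gradient above; the Cholesky-based solve, the explicit inverse, and folding the noise term $\sigma^2 I$ into $K_\theta$ are implementation choices for the sketch, not taken from the poster:

```python
import numpy as np

def objective_terms(K, y, sigma2=0.1):
    """Model-fit and complexity terms, with the noise sigma^2 * I folded into K."""
    Ky = K + sigma2 * np.eye(len(y))
    L = np.linalg.cholesky(Ky)                           # Ky = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # alpha = Ky^{-1} y
    fit = float(y @ alpha)                               # y^T Ky^{-1} y
    complexity = 2.0 * float(np.log(np.diag(L)).sum())   # log det(Ky)
    return fit, complexity

def objective_grad(K, y, dK_dtheta, sigma2=0.1):
    """Trace formula: 0.5 * tr[(a a^T - Ky^{-1}) dK/dtheta], with a = Ky^{-1} y."""
    Ky = K + sigma2 * np.eye(len(y))
    Ky_inv = np.linalg.inv(Ky)
    a = Ky_inv @ y
    return 0.5 * float(np.trace((np.outer(a, a) - Ky_inv) @ dK_dtheta))
```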

Algorithm: semi-stochastic asynchronous gradient descent

Input: data (X, y), base kernel k_γ(·, ·), recurrent transformation φ_w(·)
1. Initialize γ and w; compute the initial K
2. repeat
3.     γ ← γ + update_γ(X, w, K)
4.     for all mini-batches X_b in X
5.         w ← w + update_w(X_b, w, K_stale)
6.     end for
7. until convergence
Output: optimal γ* and w*
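A minimal Python sketch of this loop; the callables `compute_K`, `update_gamma`, `update_w`, and `batches`, the extra target arguments, and the fixed epoch count standing in for the convergence check are placeholders rather than the released implementation:

```python
def semi_stochastic_descent(X, y, gamma, w, compute_K, update_gamma, update_w,
                            batches, n_epochs=50):
    """Alternate a full-batch step on the base-kernel hyperparameters gamma with
    mini-batch steps on the recurrent weights w against a stale (delayed) kernel."""
    K = compute_K(X, gamma, w)                    # initial kernel matrix
    for _ in range(n_epochs):                     # stands in for `repeat ... until convergence`
        gamma = gamma + update_gamma(X, y, w, K)  # full-batch update of gamma
        K_stale = K                               # kernel kept fixed during the inner loop
        for Xb, yb in batches(X, y):              # mini-batch updates of w
            w = w + update_w(Xb, yb, w, K_stale)
        K = compute_K(X, gamma, w)                # refresh K with the new parameters
    return gamma, w
```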

Theorem [Al-Shedivat et al., 2016]. Semi-stochastic asynchronous gradient descent with $\tau$-delayed kernel updates converges to a fixed point when the learning rate, $\lambda_t$, decays as $\Theta\!\left(1/(\tau\, t^{\frac{1+\delta}{2}})\right)$ for any $\delta \in (0, 1]$.
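As one concrete instance of a schedule meeting the theorem's decay condition, a small sketch; the constant `lam0` and the default δ = 1 are arbitrary illustrative choices:

```python
def learning_rate(t, tau, lam0=0.1, delta=1.0):
    """lambda_t = lam0 / (tau * t**((1 + delta) / 2)) for t = 1, 2, ...,
    a Theta(1 / (tau * t^((1+delta)/2))) decay as required by the theorem."""
    assert 0.0 < delta <= 1.0
    return lam0 / (tau * t ** ((1.0 + delta) / 2.0))
```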


3. Experiments

Qualitative: Lane prediction for autonomous driving

Figure 2: Left: Train and test routes of the autonomous car (East/North in mi; speed color-coded in mi/s). Right: Point-wise estimation of the lanes. Dashed – ground truth, blue – LSTM, red – GP-LSTM.

Quantitative performance evaluation

Figure 3: Test RMSE and train NLML vs. the number of points/parameters per layer, for LSTM and GP-LSTM with 1 and 2 hidden layers.

Data      Task               NARX    RNN     LSTM    RGP     GP-NARX  GP-RNN  GP-LSTM
Drives    system ident.      0.423   0.408   0.382   0.249   0.403    0.332   0.225
Actuator  system ident.      0.482   0.771   0.381   0.368   0.891    0.492   0.347
GEF       power pred.        0.529   0.622   0.465   —       0.780    0.242   0.158
GEF       wind est.          0.837   0.818   0.781   —       0.835    0.792   0.764
Car       lane seq.          0.128   0.331   0.078   —       0.101    0.472   0.055
Car       lead vehicle pos.  0.410   0.452   0.400   —       0.341    0.412   0.312

Table 1: Performance of the models in terms of RMSE.

Convergence of the optimization

Figure 4: Convergence of the optimization with full-/mini-batches (test RMSE and train NLML vs. epoch; test RMSE vs. number of hidden units and vs. number of batches per epoch for lr = 0.1, 0.01, 0.001).

Scalability

Figure 5: Scalability of learning and inference of GP-LSTM (time per epoch and time per test point vs. the number of training points and the number of inducing points).

Workshop on Bayesian Deep Learning, NIPS 2016, Barcelona, Spain Correspondence: {[email protected], [email protected]}