Scalable GP-LSTMs with Semi-Stochastic Gradients
Maruan Al-Shedivat, Andrew Gordon Wilson, Yunus Saatchi, Zhiting Hu, Eric P. Xing
SAILING LAB
Summary
• GP-LSTM: a Gaussian process model that fully encapsulates the structural properties of LSTMs.
• Semi-stochastic gradient descent: a new, provably convergent optimization procedure for learning recurrent kernels.
• State-of-the-art results on sequence-to-reals regression.
• Predictive uncertainty for autonomous driving and other tasks.
Full paper: https://arxiv.org/abs/1610.08936
Code: https://github.com/alshedivat/kgp
1. Background & Motivation
Regression on sequences
• Input sequences: $X = \{\bar{x}_i\}_{i=1}^{n}$, $\bar{x}_i = [x_i^1, x_i^2, \dots, x_i^{l_i}]$, $x_i^j \in \mathcal{X}$.
• Real-valued targets: $\mathbf{y} = \{y_i\}_{i=1}^{n}$, $y_i \in \mathbb{R}$.
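For concreteness, a minimal sketch of how such sequence-to-reals data might be arranged in code; the array names and the zero-padding convention are illustrative assumptions, not part of the paper:

```python
import numpy as np

# Hypothetical example: n variable-length sequences of d-dimensional inputs,
# each paired with a single real-valued target.
n, d, L = 4, 3, 5                        # sequences, input dim, max length
rng = np.random.default_rng(0)

sequences = [rng.normal(size=(rng.integers(2, L + 1), d)) for _ in range(n)]
targets = rng.normal(size=n)             # y_i in R

# Zero-pad every sequence to a common length L so it can be fed to an LSTM.
X = np.zeros((n, L, d))
for i, seq in enumerate(sequences):
    X[i, -len(seq):, :] = seq            # left-pad: most recent steps at the end
```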
Applications:
• Prediction of expensive sensory data in robotic systems.
• Temporal modeling of medical biomarkers for predictive healthcare.
• Energy and power forecasting.
• Financial markets.
Recurrent neural networks
Advantages:
• A flexible framework for composing complex models.
• Accounts for long-range nonlinear correlations.
• Trainable by backprop.
Drawbacks:
• Easily overfits to noise.
Gaussian processes
Advantages:
• Bayesian nonparametric flexibility.
• Robust to overfitting.
Drawbacks:
• Accounts only for pairwise linear correlations between the time points.
Our proposal: Combine GPs with LSTMs to obtain flexible recurrent Bayesian nonparametric models that are robust to noise and can provide predictive uncertainty estimates.
2. Gaussian Processes with Recurrent Kernels
Figure 1: (a) LSTM; (b) GP-LSTM; (c) graphical model view of the GP-LSTM.
LSTM-structured kernel
Let $\phi_{\mathbf{w}} : \mathcal{X}^L \to \mathcal{H}$ be a recurrent transformation. Define the kernel as

$$\tilde{k}_\theta(\bar{x}, \bar{x}') = k_\gamma\big(\phi_{\mathbf{w}}(\bar{x}),\, \phi_{\mathbf{w}}(\bar{x}')\big),$$

where $k_\gamma : \mathcal{H} \times \mathcal{H} \to \mathbb{R}$ is the base kernel, $\bar{x}, \bar{x}' \in \mathcal{X}^L$ are length-$L$ sequences, $\tilde{k}_\theta : \mathcal{X}^L \times \mathcal{X}^L \to \mathbb{R}$, and $\theta = \{\gamma, \mathbf{w}\}$.
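A minimal sketch of this kernel composition, where a plain tanh RNN stands in for the LSTM encoder $\phi_{\mathbf{w}}$ and an RBF kernel plays the role of the base kernel $k_\gamma$; all names and shapes are illustrative assumptions, not the released kgp implementation:

```python
import numpy as np

def phi(X_seq, W_in, W_rec):
    """Recurrent transformation phi_w: encode an (L, d) sequence into R^H.
    A plain tanh RNN is used here as a stand-in for the LSTM encoder."""
    h = np.zeros(W_rec.shape[0])
    for x_t in X_seq:
        h = np.tanh(W_in @ x_t + W_rec @ h)
    return h

def rbf(a, b, lengthscale=1.0):
    """Base kernel k_gamma on the hidden space H (RBF with one lengthscale)."""
    return np.exp(-0.5 * np.sum((a - b) ** 2) / lengthscale ** 2)

def recurrent_kernel(x1, x2, W_in, W_rec, lengthscale=1.0):
    """k_tilde(x, x') = k_gamma(phi_w(x), phi_w(x'))."""
    return rbf(phi(x1, W_in, W_rec), phi(x2, W_in, W_rec), lengthscale)

# Example: kernel matrix over a batch of sequences of shape (n, L, d).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 5, 3))
W_in, W_rec = rng.normal(size=(8, 3)), 0.1 * rng.normal(size=(8, 8))
K = np.array([[recurrent_kernel(a, b, W_in, W_rec) for b in X] for a in X])
```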
Negative log marginal likelihood objective

$$\mathcal{L}(\theta) = \underbrace{\tfrac{1}{2}\,\mathbf{y}^\top (K_\theta + \sigma^2 I)^{-1}\mathbf{y}}_{\text{model fit}} \;+\; \underbrace{\tfrac{1}{2}\log\det(K_\theta + \sigma^2 I)}_{\text{complexity penalty}} \;+\; \text{const},$$

$$\nabla_\theta \mathcal{L} = -\tfrac{1}{2}\,\mathrm{tr}\!\left[\big(\tilde{K}_\theta^{-1}\mathbf{y}\mathbf{y}^\top \tilde{K}_\theta^{-1} - \tilde{K}_\theta^{-1}\big)\,\nabla_\theta K_\theta\right], \qquad \tilde{K}_\theta = K_\theta + \sigma^2 I.$$

Advantages:
• Enjoys the Bayesian nonparametric flexibility of GPs.
• Robust to overfitting.
Drawbacks:
• Does not factorize over the data.
• Involves kernel matrix inverses.
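For reference, a minimal numpy/scipy sketch of evaluating this objective and its gradient with respect to the kernel matrix via a Cholesky factorization; this is a generic GP computation under the formulas above, not the authors' code:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def nlml_and_grad(K, y, sigma2=0.1):
    """Negative log marginal likelihood and its gradient w.r.t. K.
    L      = 1/2 y^T (K + s2 I)^-1 y + 1/2 log det(K + s2 I) + const
    dL/dK  = -1/2 (a a^T - (K + s2 I)^-1),  where a = (K + s2 I)^-1 y
    """
    n = y.shape[0]
    Kn = K + sigma2 * np.eye(n)
    c, low = cho_factor(Kn)
    alpha = cho_solve((c, low), y)
    logdet = 2.0 * np.sum(np.log(np.diag(c)))       # log det from Cholesky factor
    nlml = 0.5 * y @ alpha + 0.5 * logdet + 0.5 * n * np.log(2 * np.pi)
    Kn_inv = cho_solve((c, low), np.eye(n))
    grad_K = -0.5 * (np.outer(alpha, alpha) - Kn_inv)
    return nlml, grad_K
```

The gradient with respect to $\theta$ then follows by the chain rule through $\nabla_\theta K_\theta$, which is where the non-factorizing, matrix-inverse cost enters.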
Algorithm: semi-stochastic asynchronous gradient descent
Input: data (X, y), base kernel kγ(·, ·), recurrent transformation φw(·)
1. Initialize γ and w; compute the initial kernel matrix K
2. repeat
3.   γ ← γ + update_γ(X, w, K)
4.   for each mini-batch X_b in X
5.     w ← w + update_w(X_b, w, K_stale)
6.   end for
7. until convergence
Output: optimal γ* and w*
Theorem [Al-Shedivat et al., 2016]. Semi-stochastic asynchronous gradient descent with $\tau$-delayed kernel updates converges to a fixed point when the learning rate $\lambda_t$ decays as $\Theta\big(1 / (\tau\, t^{\frac{1+\delta}{2}})\big)$ for any $\delta \in (0, 1]$.
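A minimal sketch of the semi-stochastic loop with a learning-rate decay of the form stated in the theorem: the kernel hyperparameters γ are updated on the full data with the current kernel matrix, while the recurrent weights w are updated on mini-batches against a stale kernel. The callbacks (`compute_K`, `grad_gamma`, `grad_w`) are hypothetical placeholders, not the released implementation:

```python
import numpy as np

def lr_schedule(t, tau, delta=0.5, lr0=0.01):
    """Learning rate decaying as Theta(1 / (tau * t^((1+delta)/2)))."""
    return lr0 / (tau * t ** ((1.0 + delta) / 2.0))

def train(X, y, gamma, w, compute_K, grad_gamma, grad_w,
          n_epochs=50, batch_size=64, tau=1):
    """Semi-stochastic gradient descent (illustrative sketch).
    compute_K(X, w, gamma)              -> kernel matrix under current parameters
    grad_gamma(X, y, w, gamma, K)       -> full-batch gradient w.r.t. gamma
    grad_w(Xb, yb, w, gamma, K_stale)   -> mini-batch gradient w.r.t. w
    """
    n = len(y)
    K = compute_K(X, w, gamma)                     # initial kernel matrix
    for t in range(1, n_epochs + 1):
        lr = lr_schedule(t, tau)
        # Full-batch update of the kernel hyperparameters gamma.
        gamma = gamma - lr * grad_gamma(X, y, w, gamma, K)
        # Stochastic updates of the recurrent weights w with a stale kernel.
        K_stale = K
        for b in range(0, n, batch_size):
            idx = slice(b, b + batch_size)
            w = w - lr * grad_w(X[idx], y[idx], w, gamma, K_stale)
        # Refresh the kernel matrix only after the sweep (a tau-delayed update).
        K = compute_K(X, w, gamma)
    return gamma, w
```

Refreshing K only once per sweep is what keeps the weight updates cheap; the convergence statement above covers exactly this kind of delayed update.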
3. Experiments
Qualitative: Lane prediction for autonomous driving
Figure 2: Left: train and test routes of the autonomous car (East/North, mi; color: speed, mi/s). Right: point-wise estimation of the lanes. Dashed: ground truth; blue: LSTM; red: GP-LSTM.
Quantitative performance evaluation
Figure 3: Test RMSE and train NLML vs. the number of training epochs, training points, and hidden units per layer, for LSTM and GP-LSTM with 1 and 2 hidden layers.
Data      Task                NARX   RNN    LSTM   RGP    GP-NARX  GP-RNN  GP-LSTM
Drives    system ident.       0.423  0.408  0.382  0.249  0.403    0.332   0.225
Actuator  system ident.       0.482  0.771  0.381  0.368  0.891    0.492   0.347
GEF       power pred.         0.529  0.622  0.465  —      0.780    0.242   0.158
GEF       wind est.           0.837  0.818  0.781  —      0.835    0.792   0.764
Car       lane seq.           0.128  0.331  0.078  —      0.101    0.472   0.055
Car       lead vehicle pos.   0.410  0.452  0.400  —      0.341    0.412   0.312

Table 1: Performance of the models in terms of RMSE.
Convergence of the optimization
Figure 4: Convergence of the optimization with full-/mini-batch updates: test RMSE and train NLML vs. epoch, number of hidden units, number of batches per epoch, and learning rate.
Scalability
Figure 5: Scalability of learning and inference of GP-LSTM: time per epoch and time per test point vs. the number of training and inducing points.
Workshop on Bayesian Deep Learning, NIPS 2016, Barcelona, Spain. Correspondence: {[email protected], [email protected]}