
Probabilistic ODE Solvers with Runge-Kutta Means

Michael Schober∗, David Duvenaud‡, Philipp Hennig∗

∗ Research Group Elementary Intelligence, Department of Empirical Inference, Max Planck Institute for Intelligent Systems, Tübingen, Germany

‡ Computational and Biological Learning Lab, Department of Engineering, Cambridge University

Can we assign a probability distribution over the solution to an ordinary differential equation (initial value problem)?

\[
x(t_0) = x_0, \qquad x'(t) = f(x(t), t)
\]


The Probabilistic View on Computation
computing as the collection of information [Poincaré, 1896; Diaconis, 1988; O’Hagan, 1992]

A numerical method estimates a function’s latent property given the result of computations.

quadrature estimates $\int_a^b f(x)\,dx$ given $\{f(x_i)\}$
linear algebra estimates $x$ s.t. $Ax = b$ given $\{As = y\}$
optimization estimates $x$ s.t. $\nabla f(x) = 0$ given $\{\nabla f(x_i)\}$
analysis estimates $x(t)$ s.t. $x' = f(x, t)$ given $\{f(x_i, t_i)\}$

▸ computations yield “data” / “observations”
▸ non-analytic quantities are “latent”
▸ even deterministic quantities can be uncertain.


Numerical Methods and Statistical Estimators
several classic numerical algorithms identified precisely as maximum a-posteriori estimators

quadrature [Diaconis, 1988; O’Hagan, 1991]: Gaussian quadrature as Gaussian process regression
linear algebra [Hennig, 2015]: conjugate gradients as Gaussian conditioning
nonlinear optimization [Hennig & Kiefel, 2013]: BFGS as autoregressive filtering
ordinary differential equations [Schober et al., 2014]: Runge-Kutta as Gauss-Markov extrapolation


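Runge-Kutta methods
are linear extrapolators of high convergence order [Hairer et al., 1987]

[Figure: the solution x(t) plotted against t over one step from t0 to t0 + h, with evaluation nodes at t0 + c1 and t0 + c2.]

Butcher tableau:

\[
\begin{array}{c|cccc}
0   & 1 \\
c_1 & 1 & w_{11} \\
c_2 & 1 & w_{21} & w_{22} \\
\hline
h   & 1 & b_1 & b_2 & b_3
\end{array}
\]

\[
\begin{aligned}
Y_1 &= f(1 \cdot x_0,\; t_0 + 0) \\
Y_2 &= f(1 \cdot x_0 + w_{11} Y_1,\; t_0 + c_1) \\
Y_{s+1} &= f\Big(1 \cdot x_0 + \sum_{i=1}^{s} w_{si} Y_i,\; t_0 + c_s\Big) \\
\hat{x}(t_0 + h) &= 1 \cdot x_0 + \sum_{i} b_i Y_i
\end{aligned}
\]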
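The extrapolation scheme above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `explicit_rk_step` and the choice of Heun's method as the example tableau are not from the talk, and the code follows the common textbook convention in which nodes and weights are given per unit step and multiplied by h, whereas the slide folds h into c, w and b.

```python
import math


def explicit_rk_step(f, x0, t0, h, c, W, b):
    """One explicit Runge-Kutta step for x' = f(x, t), given a tableau (c, W, b).

    c: nodes per unit step (c[0] == 0)
    W: strictly lower-triangular stage weights
    b: output weights
    """
    Y = []
    for i in range(len(b)):
        # Linear extrapolation from x0 using the evaluations collected so far.
        xi = x0 + h * sum(W[i][j] * Y[j] for j in range(i))
        Y.append(f(xi, t0 + c[i] * h))
    # The final estimate is again a linear combination of x0 and the Y_i.
    return x0 + h * sum(bi * Yi for bi, Yi in zip(b, Y))


# Assumed example tableau: Heun's method (second order).
c = [0.0, 1.0]
W = [[0.0, 0.0],
     [1.0, 0.0]]
b = [0.5, 0.5]

# One step of size h = 0.1 on x' = x, x(0) = 1.
x1 = explicit_rk_step(lambda x, t: x, 1.0, 0.0, 0.1, c, W, b)
```

For this linear test problem the step reproduces the Taylor expansion of exp(h) up to second order, 1 + h + h^2/2.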


Gaussian process solvers
are also linear extrapolators

▸ Linear extrapolation suggests a Gaussian process model
▸ Gaussian process solvers have been studied previously [Skilling (1991), Chkrebtii et al. (2014), Hennig & Hauberg (2014)]


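Some properties of Gaussian measures
The only two equations you really need (in this group)

▸ closure under affine transformations ($x \in \mathbb{R}^N$, $y \in \mathbb{R}^M$)

\[
p(x) \sim \mathcal{N}(m, P), \quad p(y \mid x) \sim \mathcal{N}(Hx + \nu, R)
\;\Rightarrow\;
p\left(\begin{bmatrix} x \\ y \end{bmatrix}\right) \sim \mathcal{N}\left(
\begin{bmatrix} m \\ Hm + \nu \end{bmatrix},
\begin{bmatrix} P & PH^\top \\ HP & HPH^\top + R \end{bmatrix}\right)
\]

▸ inference involves only linear algebra operations

\[
p\left(\begin{bmatrix} x \\ y \end{bmatrix}\right) \sim \mathcal{N}\left(
\begin{bmatrix} m_1 \\ m_2 \end{bmatrix},
\begin{bmatrix} P_1 & C \\ C^\top & P_2 \end{bmatrix}\right)
\;\Rightarrow\;
p(x \mid y) \sim \mathcal{N}\left(m_1 + C P_2^{-1}(y - m_2),\; P_1 - C P_2^{-1} C^\top\right)
\]

⇒ sequential Gaussian inference at linear cost (‘filtering’)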
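Both identities translate directly into NumPy. The sketch below is a minimal illustration rather than code from the talk; the function names `joint_from_affine` and `condition` and the scalar example are assumptions made for this example.

```python
import numpy as np


def joint_from_affine(m, P, H, nu, R):
    """Joint Gaussian over [x, y] for x ~ N(m, P), y | x ~ N(Hx + nu, R)."""
    mean = np.concatenate([m, H @ m + nu])
    cov = np.block([[P, P @ H.T],
                    [H @ P, H @ P @ H.T + R]])
    return mean, cov


def condition(m1, m2, P1, C, P2, y):
    """Mean and covariance of x | y for a joint Gaussian with blocks P1, C, P2."""
    # gain = C P2^{-1}; P2 is symmetric, so a linear solve replaces the inverse.
    gain = np.linalg.solve(P2, C.T).T
    mean = m1 + gain @ (y - m2)
    cov = P1 - gain @ C.T
    return mean, cov


# Scalar example: x ~ N(0, 1), y = x + noise with variance 1, observed y = 2.
m, P = np.array([0.0]), np.array([[1.0]])
H, nu, R = np.array([[1.0]]), np.array([0.0]), np.array([[1.0]])
mean, cov = joint_from_affine(m, P, H, nu, R)
post_mean, post_cov = condition(mean[:1], mean[1:], cov[:1, :1],
                                cov[:1, 1:], cov[1:, 1:], np.array([2.0]))
```

In this example the posterior mean lies halfway between the prior mean and the observation, and the variance is halved.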


Gaussian process solvers
implicitly define a Butcher tableau

[Figure: the solution x(t) plotted against t over one step from t0 to t0 + h, with evaluation nodes at t0 + c1 and t0 + c2.]

\[
\begin{array}{c|cccc}
0   & 1 \\
c_1 & 1 & w_{11} \\
c_2 & 1 & w_{21} & w_{22} \\
\hline
h   & 1 & b_1 & b_2 & b_3
\end{array}
\]

\[
\begin{aligned}
y_1 &= f\big(\mu_{\mid x_0}(t_0 + 0),\; t_0 + 0\big) \\
y_2 &= f\big(\mu_{\mid x_0, y_1}(t_0 + c_1),\; t_0 + c_1\big) \\
y_{s+1} &= f\big(\mu_{\mid x_0, y_i}(t_0 + c_s),\; t_0 + c_s\big) \\
\hat{x}(t_0 + h) &= \mu_{\mid x_0, y_i}(t_0 + h)
\end{aligned}
\]

where each posterior mean is a linear function of the observations:

\[
\mu_{\mid x_0}(t_0) := \big[k(t_0, t_0)\big]\big[k(t_0, t_0)\big]^{-1} x_0 = 1 \cdot x_0
\]

\[
\mu_{\mid x_0, y_1}(t_0 + c_1) :=
\begin{bmatrix} k(t_0 + c_1, t_0) & k^{\partial}(t_0 + c_1, t_0) \end{bmatrix}
\begin{bmatrix} k(t_0, t_0) & k^{\partial}(t_0, t_0) \\ {}^{\partial}k(t_0, t_0) & {}^{\partial}k^{\partial}(t_0, t_0) \end{bmatrix}^{-1}
\begin{bmatrix} x_0 \\ y_1 \end{bmatrix}
= w_{10} x_0 + w_{11} y_1
\]

\[
\mu_{\mid x_0, y_i}(t_0 + c_s) :=
\begin{bmatrix} k(t_0 + c_s, t_0) & k^{\partial}(t_0 + c_s, t_0 + c_i) \end{bmatrix}
K^{-1}
\begin{bmatrix} x_0 \\ y_i \end{bmatrix}
= w_{s0} x_0 + \sum_{i=1}^{s} w_{si} y_i
\]

\[
\mu_{\mid x_0, y_i}(t_0 + h) :=
\begin{bmatrix} k(t_0 + h, t_0) & k^{\partial}(t_0 + h, t_0 + c_i) \end{bmatrix}
K^{-1}
\begin{bmatrix} x_0 \\ y_i \end{bmatrix}
= b_0 x_0 + \sum_{i=1}^{s} b_i y_i
\]
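To make the "implicit tableau" concrete, the weights of the posterior mean can be computed numerically. The sketch below is an illustration under assumptions: it uses a squared-exponential kernel with length scale `lam` because its derivatives are easy to write down, whereas the talk goes on to use an integrated Wiener process; the names `se_kernels` and `mean_weights` are hypothetical; and the superscript ∂ is taken to mean a derivative in the second argument when written after k and in the first argument when written before it.

```python
import numpy as np


def se_kernels(lam):
    """Squared-exponential kernel and its derivatives (an assumed kernel choice)."""
    k = lambda t, u: np.exp(-(t - u) ** 2 / (2 * lam ** 2))
    k_d = lambda t, u: (t - u) / lam ** 2 * k(t, u)    # derivative in u
    d_k = lambda t, u: -(t - u) / lam ** 2 * k(t, u)   # derivative in t
    d_k_d = lambda t, u: (1 / lam ** 2 - (t - u) ** 2 / lam ** 4) * k(t, u)
    return k, k_d, d_k, d_k_d


def mean_weights(t_star, t0, t_obs, lam):
    """Weights on x0 and on each y_i in the GP posterior mean at t_star.

    x0 is a function value observed at t0; the y_i are derivative
    values observed at the times in t_obs.
    """
    k, k_d, d_k, d_k_d = se_kernels(lam)
    t_obs = np.asarray(t_obs, dtype=float)
    n = len(t_obs)
    # Gram matrix of the observations (x0, y_1, ..., y_n).
    K = np.empty((n + 1, n + 1))
    K[0, 0] = k(t0, t0)
    K[0, 1:] = k_d(t0, t_obs)
    K[1:, 0] = d_k(t_obs, t0)
    K[1:, 1:] = d_k_d(t_obs[:, None], t_obs[None, :])
    # Covariance between x(t_star) and the observations.
    cross = np.concatenate([[k(t_star, t0)], k_d(t_star, t_obs)])
    # K is symmetric, so solving K w = cross gives the row vector cross K^{-1}.
    return np.linalg.solve(K, cross)


# Conditioning on x0 alone and predicting at t0 gives weight 1 on x0.
w_first = mean_weights(0.0, 0.0, [], lam=1.0)

# With one derivative observation at t0 and a very long length scale,
# the weights at t0 + c approach (1, c), which is the explicit Euler step.
w_euler = mean_weights(0.1, 0.0, [0.0], lam=1e3)
```

For this kernel the second set of weights has the closed form (exp(-c^2 / (2 lam^2)), c exp(-c^2 / (2 lam^2))), so the Euler weights appear only in the limit of a large length scale.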

Gauss-Markov-Runge-Kutta methods
a GP solver whose mean matches RK exactly

▸ RK chooses $(c, w, b)$ such that $\lVert \hat{x}(t_0 + h) - x(t_0 + h) \rVert = O(h^p)$
▸ polynomial form suggests an integrated Wiener (polynomial spline) process

\[
p(x(t)) = \mathcal{GP}(x(t);\, 0,\, k_s(t, t')) \quad \text{where} \quad
k_s(t, t') = \int_{\tau}^{t} \int_{\tau}^{t'} \min(\tilde{t}, \tilde{t}')\, d\tilde{t}\, d\tilde{t}'
\]

▸ $\tau \to -\infty$: improper prior $p(x(t))$, proper posterior after $s$ observations.
▸ the $k$-times integrated Wiener process gives an RK solver of order $k$!
▸ Inherits RK guarantees. Gives a closed-form solution for the tableau (these used to require numerical search!)
▸ a Markov (state-space) model, so inference is $O(s)$ (as opposed to the usual $O(s^3)$ cost)
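The double integral defining the kernel has a simple closed form when the lower limit is finite. The sketch below assumes τ = 0 purely for illustration (the slide takes τ → −∞) and checks the closed form against a midpoint-rule approximation; the function names are hypothetical.

```python
import numpy as np


def iwp_kernel(t, u):
    """Closed form of the integral of min(a, b) over [0, t] x [0, u]."""
    lo, hi = min(t, u), max(t, u)
    return lo ** 3 / 3 + (hi - lo) * lo ** 2 / 2


def iwp_kernel_numeric(t, u, n=400):
    """Midpoint-rule approximation of the same double integral."""
    ts = (np.arange(n) + 0.5) * t / n
    us = (np.arange(n) + 0.5) * u / n
    cell_area = (t / n) * (u / n)
    return np.minimum(ts[:, None], us[None, :]).sum() * cell_area


closed = iwp_kernel(1.0, 2.0)
numeric = iwp_kernel_numeric(1.0, 2.0)
```

The closed form is a cubic polynomial in the smaller argument, which is the "polynomial spline" structure mentioned above.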


Calibrating Uncertainty
within the parametrized class

▸ posterior mean $\mu_{\mid y} = kK^{-1}y$ is invariant under $k \to \theta^2 k$
▸ posterior covariance $k_{\mid y} = k - kK^{-1}k$ is scaled by $\theta^2$
▸ initial ideas for uncertainty calibration in the paper (more to come)
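The two scaling statements are easy to check numerically. The following sketch assumes a squared-exponential kernel, noise-free observations and made-up data values; none of these choices come from the talk.

```python
import numpy as np


def gp_posterior(k, t_obs, y, t_star):
    """Posterior mean and variance at t_star from noise-free observations y."""
    K = k(t_obs[:, None], t_obs[None, :])
    k_star = k(t_star, t_obs)
    mean = k_star @ np.linalg.solve(K, y)
    var = k(t_star, t_star) - k_star @ np.linalg.solve(K, k_star)
    return mean, var


# Assumed kernel and a rescaled copy of it.
k = lambda t, u: np.exp(-0.5 * (t - u) ** 2)
theta = 3.0
k_scaled = lambda t, u: theta ** 2 * k(t, u)

# Made-up observations, for illustration only.
t_obs = np.array([0.0, 0.5, 1.0])
y = np.array([1.0, 0.7, 0.2])

mean_1, var_1 = gp_posterior(k, t_obs, y, 0.75)
mean_2, var_2 = gp_posterior(k_scaled, t_obs, y, 0.75)
```

The factor θ² cancels between k and K⁻¹ in the mean but survives once in the covariance, so θ controls only the width of the posterior.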


Multi-Step Extension
[A. Nordsieck, 1962]

[Figure: three panels titled "Naïve chaining", "Smoothing" and "Probabilistic continuation". Each panel shows the estimate x over t from t0 to t0 + 4h, with the error x(t) − f(t) plotted below on a scale of 10^-2.]

▸ probabilistic interpretation questions RK beyond s steps
▸ ‘obvious’ solution is to continue filtering process
▸ result very similar, though not identical, to multi-step methods
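For reference, the first of the three strategies, naive chaining, restarts a single-step solver at every interval and passes on only the point estimate. The sketch below assumes Heun's method as the single-step solver and x' = x as the test problem; both are illustrative choices, not taken from the talk.

```python
import math


def heun_step(f, x, t, h):
    """One step of Heun's method, an assumed second-order single-step solver."""
    y1 = f(x, t)
    y2 = f(x + h * y1, t + h)
    return x + h * (y1 + y2) / 2


def chain(f, x0, t0, h, n_steps):
    """Naive chaining: each step starts from the previous point estimate.

    Only the estimate is carried forward; earlier evaluations of f and any
    uncertainty about the estimate are discarded at each restart.
    """
    xs = [x0]
    for n in range(n_steps):
        xs.append(heun_step(f, xs[-1], t0 + n * h, h))
    return xs


# Four steps of size h = 0.25 on x' = x, x(0) = 1; the exact solution is exp(t).
xs = chain(lambda x, t: x, 1.0, 0.0, 0.25, 4)
error_at_end = abs(xs[-1] - math.e)
```

Continuing the filtering process instead keeps the earlier evaluations in the model, which is what brings the result close to a multi-step method.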


Some Conceptual Open Questions
precise interpretation of posterior measure still evolving

How precise can the connection to multi-step methods be?
▸ order / stability conditions currently not fully understood
▸ flexibility is also a design criterion
▸ what about stiff problems?

What, precisely, does the posterior mean?
▸ width of Gaussian posterior should be inferred from regularity of ‘observed’ gradients. How, precisely, should this be done? (We have one particular solution)
▸ is the Gaussian family enough? How expensive is it to move beyond Gauss?


What we’ve done so far:
▸ Numerical methods can be interpreted as performing statistical inference from noise-free data
▸ in some cases, e.g. Runge-Kutta, this link can be made precise
▸ the probabilistic solver inherits convergence guarantees, and also gains extensibility & uncertainty estimates

What we’re working on next:
▸ understand the connection to multi-step methods
▸ construct a robust probabilistic IVP solver
▸ continue finding model-based interpretations of numerical solvers.


Bibliography

P. Diaconis. Bayesian numerical analysis. Statistical decision theory and related topics, IV(1):163–175, 1988.

E. Hairer, S.P. Nørsett, and G. Wanner. Solving Ordinary Differential Equations I – Nonstiff Problems. Springer, 1987.

S. Hauberg, M. Schober, M. Liptrot, P. Hennig, and A. Feragen. A random Riemannian metric for probabilistic shortest-path tractography. In Medical Image Computing and Computer Assisted Intervention–MICCAI 2015. Springer, 2015.

P. Hennig. Probabilistic interpretation of linear solvers. SIAM J on Optimization, 25(1):210–233, 2015.

P. Hennig and M. Kiefel. Quasi-Newton Methods – a new direction. Journal of Machine Learning Research, 14:834–865, March 2013.

A. O’Hagan. Bayes–Hermite quadrature. J of Statistical Planning and Inference, 29(3):245–260, 1991.

A. O’Hagan. Some Bayesian Numerical Analysis. Bayesian Statistics, 4:345–363, 1992.

H. Poincaré. Calcul des probabilités. Gauthier-Villars, Paris, 1896.

S. Särkkä. Recursive Bayesian Inference on Stochastic Differential Equations. PhD thesis, Helsinki University of Technology, 2006.

M. Schober, D. Duvenaud, and P. Hennig. Probabilistic ODE Solvers with Runge-Kutta Means. Advances in Neural Information Processing Systems (NIPS), 2014.

M. Schober, N. Kasenburg, A. Feragen, P. Hennig, and S. Hauberg. Probabilistic shortest path tractography in DTI using Gaussian Process ODE solvers. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2014. Springer, 2014.
