
Stochastic optimal control theory

Bert Kappen, SNN, Radboud University, Nijmegen, the Netherlands

July 5, 2008


Introduction

Optimal control theory: optimize the sum of a path cost and an end cost. The result is an optimal control sequence and an optimal trajectory.

Input: Cost function. Output: Optimal trajectory and controls.

Classical control theory: what control signal should I give to move a plant to a desired state?

Input: Desired state trajectory. Output: Optimal control trajectory.


Types of optimal control problems

Finite horizon (fixed horizon time)

[Figure: controlled trajectory of the state x through the environment, from time t to the horizon time tf.]

Dynamics and environment may depend explicitly on time. Optimal control depends explicitly on time.


Types of optimal control problems

Finite horizon (moving horizon)


Dynamics and environment are static. Optimal control is time independent.


Types of optimal control problems

Finite horizon (moving horizon)


Dynamics and environment are static. Optimal control is time independent. Similar to RL.


Types of optimal control problems

Other types of control problems:
- minimum time
- infinite horizon, average reward
- infinite horizon, absorbing states

In addition one should distinguish:
- discrete vs. continuous state
- discrete vs. continuous time
- observable vs. partial observable


Overview

Deterministic optimal control (Kappen, 30 min.)
- Introduction of delayed reward problem in discrete time;
- Dynamic programming solution and deterministic Bellman equations;
- Solution in continuous time and states;
- Example: Mass on a spring
- Pontryagin maximum principle; Notion of an optimal (particle) trajectory
- Again Mass on a spring


Overview

Stochastic optimal control, discrete case (Toussaint, 40 min.)
- Stochastic Bellman equation (discrete state and time) and Dynamic Programming
- Reinforcement learning (exact solution, value iteration, policy improvement); Actor critic networks;
- Markov decision problems and probabilistic inference;
- Example: robotic motion control and planning


Overview

Stochastic optimal control, continuous case (Kappen, 40 min.)
- Stochastic differential equations
- Hamilton-Jacobi-Bellman equation (continuous state and time)
- LQ control, Riccati equation;
- Example of LQ control
- Learning; Partial observability: Inference and control;
- Certainty equivalence
- Path integral control; the role of noise and symmetry breaking; efficient approximate computation (MC, MF, BP, ...)
- Examples: Double slit, delayed choice, n joint arm


Overview

Research issues (Toussaint, 30 min.)
- Learning;
- Efficient methods to compute value functions/cost-to-go
- Control under partial observability (POMDPs)


Discrete time control

Consider the control of a discrete time deterministic dynamical system:

x_{t+1} = x_t + f(t, x_t, u_t),    t = 0, 1, . . . , T − 1

x_t describes the state and u_t specifies the control or action at time t.

Given x_{t=0} = x_0 and u_{0:T−1} = u_0, u_1, . . . , u_{T−1}, we can compute x_{1:T}.

Define a cost for each sequence of controls:

C(x_0, u_{0:T−1}) = φ(x_T) + ∑_{t=0}^{T−1} R(t, x_t, u_t)

The problem of optimal control is to find the sequence u_{0:T−1} that minimizes C(x_0, u_{0:T−1}).


Dynamic programming

Find the minimal cost path from A to J.

C(J) = 0, C(H) = 3, C(I) = 4

C(F) = min(6 + C(H), 3 + C(I))
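A minimal backward dynamic-programming sketch of this route problem. Only the values above appear on the slide, so the remaining nodes and edge costs below are hypothetical, chosen to be consistent with C(J) = 0, C(H) = 3, C(I) = 4 and C(F) = min(6 + C(H), 3 + C(I)):

# backward dynamic programming on a hypothetical stage graph (Python)
edges = {                     # node -> {successor: edge cost}
    "A": {"B": 2, "C": 4},
    "B": {"D": 7, "E": 4},
    "C": {"E": 3, "F": 2},
    "D": {"H": 1},
    "E": {"H": 4, "I": 6},
    "F": {"H": 6, "I": 3},
    "H": {"J": 3},
    "I": {"J": 4},
}

cost, policy = {"J": 0.0}, {}          # cost-to-go of the goal node J is 0
for node in ["I", "H", "F", "E", "D", "C", "B", "A"]:   # reverse order
    succ, c = min(edges[node].items(), key=lambda kv: kv[1] + cost[kv[0]])
    cost[node], policy[node] = c + cost[succ], succ

print(cost["F"])                       # min(6 + 3, 3 + 4) = 7
path, node = ["A"], "A"
while node != "J":
    node = policy[node]
    path.append(node)
print(path, cost["A"])                 # optimal route and its cost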


Discrete time control

The optimal control problem can be solved by dynamic programming. Introduce the optimal cost-to-go:

J(t, x_t) = min_{u_{t:T−1}} ( φ(x_T) + ∑_{s=t}^{T−1} R(s, x_s, u_s) )

which solves the optimal control problem from an intermediate time t until the fixed end time T, for all intermediate states x_t.

Then,

J(T, x) = φ(x)
J(0, x) = min_{u_{0:T−1}} C(x, u_{0:T−1})


Discrete time control

One can recursively compute J(t, x) from J(t+1, x) for all x in the following way:

J(t, x_t) = min_{u_{t:T−1}} ( φ(x_T) + ∑_{s=t}^{T−1} R(s, x_s, u_s) )
          = min_{u_t} ( R(t, x_t, u_t) + min_{u_{t+1:T−1}} [ φ(x_T) + ∑_{s=t+1}^{T−1} R(s, x_s, u_s) ] )
          = min_{u_t} ( R(t, x_t, u_t) + J(t+1, x_{t+1}) )
          = min_{u_t} ( R(t, x_t, u_t) + J(t+1, x_t + f(t, x_t, u_t)) )

This is called the Bellman Equation.

Computes u as a function of x, t for all intermediate t and all x.


Discrete time control

The algorithm to compute the optimal control u*_{0:T−1}, the optimal trajectory x*_{1:T} and the optimal cost is given by

1. Initialization: J(T, x) = φ(x)

2. Backwards: For t = T − 1, . . . , 0 and for all x compute

u*_t(x) = arg min_u { R(t, x, u) + J(t+1, x + f(t, x, u)) }
J(t, x) = R(t, x, u*_t) + J(t+1, x + f(t, x, u*_t))

3. Forwards: For t = 0, . . . , T − 1 compute

x*_{t+1} = x*_t + f(t, x*_t, u*_t(x*_t))

NB: the backward computation requires u*_t(x) for all x.
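The scheme above can be written down directly for a small tabular problem. The dynamics, costs and state grid below are a toy illustration (not from the slides):

# tabular dynamic programming: backward pass for J and u*, forward pass for x*
T = 10
states = range(-10, 11)                 # integer states
controls = (-1, 0, 1)
clip = lambda x: max(-10, min(10, x))   # keep the state on the grid
phi = lambda x: x ** 2                  # end cost
R = lambda t, x, u: x ** 2 + u ** 2     # running cost

# 1. Initialization: J(T, x) = phi(x)
J = {T: {x: phi(x) for x in states}}
u_star = {}

# 2. Backwards: J(t, x) = min_u [ R(t, x, u) + J(t+1, x + u) ]
for t in range(T - 1, -1, -1):
    J[t], u_star[t] = {}, {}
    for x in states:
        J[t][x], u_star[t][x] = min(
            (R(t, x, u) + J[t + 1][clip(x + u)], u) for u in controls
        )

# 3. Forwards: follow u*_t(x) from the initial state
x, trajectory = 5, [5]
for t in range(T):
    x = clip(x + u_star[t][x])
    trajectory.append(x)
print(J[0][5], trajectory)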


Continuous limit

Replace t+1 by t+dt with dt → 0.

x_{t+dt} = x_t + f(x_t, u_t, t) dt

C(x_0, u(0 → T)) = φ(x_T) + ∫_0^T dτ R(τ, x(τ), u(τ))

Assume J(x, t) is smooth.

J(t, x) = min_u ( R(t, x, u) dt + J(t+dt, x + f(x, u, t) dt) )
        ≈ min_u ( R(t, x, u) dt + J(t, x) + ∂_t J(t, x) dt + ∂_x J(t, x) f(x, u, t) dt )

−∂_t J(t, x) = min_u ( R(t, x, u) + f(x, u, t) ∂_x J(x, t) )

with boundary condition J(x, T ) = φ(x).


Continuous limit

−∂_t J(t, x) = min_u ( R(t, x, u) + f(x, u, t) ∂_x J(x, t) )

with boundary condition J(x, T ) = φ(x).

This is called the Hamilton-Jacobi-Bellman Equation.

Computes the anticipated potential J(t, x) from the future potential φ(x).


Example: Mass on a spring

The spring force F_z = −z pulls towards the rest position; the control force is F_u = u.

Newton's law: F = −z + u = m z̈, with m = 1.

Control problem: given initial position and velocity z(0) = ż(0) = 0 at time t = 0, find the control path −1 < u(0 → T) < 1 such that z(T) is maximal.


Example: Mass on a spring

Introduce x_1 = z, x_2 = ż; then

ẋ_1 = x_2
ẋ_2 = −x_1 + u

The end cost is φ(x) = −x_1; path cost R(x, u, t) = 0.

The HJB takes the form:

−∂_t J = min_u ( x_2 ∂J/∂x_1 − x_1 ∂J/∂x_2 + u ∂J/∂x_2 )
       = x_2 ∂J/∂x_1 − x_1 ∂J/∂x_2 − |∂J/∂x_2| ,    u = −sign(∂J/∂x_2)


Example: Mass on a spring

The solution is

J(t, x_1, x_2) = −cos(t − T) x_1 + sin(t − T) x_2 + α(t)
u(t, x_1, x_2) = −sign(sin(t − T))

As an example consider T = 2π. Then, the optimal control is

u = −1,   0 < t < π
u = +1,   π < t < 2π

[Figure: optimal trajectory x_1(t), x_2(t) for T = 2π.]
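A quick numerical check of this solution (a sketch, assuming T = 2π and x(0) = 0 as above): integrate the dynamics under the bang-bang control with simple Euler steps.

import numpy as np

# simulate x1' = x2, x2' = -x1 + u with u = -sign(sin(t - T))
T, dt = 2 * np.pi, 1e-4
x1, x2 = 0.0, 0.0
for t in np.arange(0.0, T, dt):
    u = -np.sign(np.sin(t - T))              # u = -1 for 0 < t < pi, +1 afterwards
    x1, x2 = x1 + dt * x2, x2 + dt * (-x1 + u)
print(x1)   # final position z(T); analytically the maximum reachable value is 4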


Pontryagin minimum principle

The HJB equation is a PDE with boundary condition at future time. The PDE is solved using discretization of space and time.

The solution is an optimal cost-to-go for all x and t. From this we compute the optimal trajectory and optimal control.

An alternative approach is a variational approach that directly finds the optimal trajectory and optimal control.


Pontryagin minimum principle

We can write the optimal control problem as a constrained optimization problem with independent variables u(0 → T) and x(0 → T):

min_{u(0→T), x(0→T)}  φ(x(T)) + ∫_0^T dt R(x(t), u(t), t)

subject to the constraint ẋ = f(x, u, t) and boundary condition x(0) = x_0.

Introduce the Lagrange multiplier function λ(t):

C = φ(x(T)) + ∫_0^T dt [ R(t, x(t), u(t)) − λ(t)( f(t, x(t), u(t)) − ẋ(t) ) ]
  = φ(x(T)) + ∫_0^T dt [ −H(t, x(t), u(t), λ(t)) + λ(t) ẋ(t) ]

−H(t, x, u, λ) = R(t, x, u) − λ f(t, x, u)


Derivation PMP

The solution is found by extremizing C. This gives a necessary but not sufficient condition for a solution.

If we vary the action with respect to the trajectory x, the control u and the Lagrange multiplier λ, we get:

δC = φ_x(x(T)) δx(T) + ∫_0^T dt [ −H_x δx(t) − H_u δu(t) + (−H_λ + ẋ(t)) δλ(t) + λ(t) δẋ(t) ]
   = ( φ_x(x(T)) + λ(T) ) δx(T) + ∫_0^T dt [ (−H_x − λ̇(t)) δx(t) − H_u δu(t) + (−H_λ + ẋ(t)) δλ(t) ]

For instance, H_x = ∂H(t, x(t), u(t), λ(t)) / ∂x(t).


We can solve H_u(t, x, u, λ) = 0 for u and denote the solution as

u*(t, x, λ)

Assumes H convex in u.

The remaining equations are

ẋ = H_λ(t, x, u*(t, x, λ), λ)
λ̇ = −H_x(t, x, u*(t, x, λ), λ)

with boundary conditions

x(0) = x_0,    λ(T) = −φ_x(x(T))

Mixed boundary value problem.


Again mass on a spring

Problem

ẋ_1 = x_2,    ẋ_2 = −x_1 + u
R(x, u, t) = 0,    φ(x) = −x_1

Hamiltonian

H(t, x, u, λ) = −R(t, x, u) + λ^T f(t, x, u) = λ_1 x_2 + λ_2 (−x_1 + u)
H*(t, x, λ) = λ_1 x_2 − λ_2 x_1 − |λ_2| ,    u* = −sign(λ_2)

The Hamilton equations

ẋ = ∂H*/∂λ  ⇒  ẋ_1 = x_2,  ẋ_2 = −x_1 − sign(λ_2)
λ̇ = −∂H*/∂x  ⇒  λ̇_1 = −λ_2,  λ̇_2 = λ_1

with x(t = 0) = x_0 and λ(t = T) = −φ_x(x(T)) = (1, 0).
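A sketch of solving this two-point boundary value problem numerically (step size and horizon T = 2π assumed as in the earlier example). For this particular problem the costate equations do not involve x, so λ can be integrated backward on its own and x forward afterwards:

import numpy as np

T, dt = 2 * np.pi, 1e-3
ts = np.arange(0.0, T, dt)

# backward integration of lam1' = -lam2, lam2' = lam1 from lam(T) = (1, 0)
lam = np.zeros((len(ts) + 1, 2))
lam[-1] = (1.0, 0.0)
for k in range(len(ts), 0, -1):
    l1, l2 = lam[k]
    lam[k - 1] = (l1 + dt * l2, l2 - dt * l1)    # one Euler step backward in time

# forward integration of the state with the bang-bang control u = -sign(lam2)
x1, x2 = 0.0, 0.0
for k in range(len(ts)):
    u = -np.sign(lam[k][1])
    x1, x2 = x1 + dt * x2, x2 + dt * (-x1 + u)
print(x1)    # ~4, the same trajectory found with the HJB solution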


Comments

The HJB method gives a sufficient (and often necessary) condition for optimality. The solution of the PDE is expensive.

The PMP method provides a necessary condition for optimal control. This means that it provides candidate solutions for optimality.

The PMP method is computationally less complicated than the HJB method because it does not require discretization of the state space.

Optimal control in continuous space and time contains many complications related to the existence, uniqueness and smoothness of the solution, particularly in the absence of noise. In the presence of noise many of these intricacies disappear.

HJB generalizes to the stochastic case, PMP does not (at least not easily).


Stochastic differential equations

Consider the random walk on the line:

x_{t+1} = x_t + ξ_t,    ξ_t = ±1

with x_0 = 0. We can compute

x_t = ∑_{i=1}^{t} ξ_i

Since x_t is a sum of random variables, x_t becomes Gaussian distributed with

⟨x_t⟩ = ∑_{i=1}^{t} ⟨ξ_i⟩ = 0

⟨x_t²⟩ = ∑_{i,j=1}^{t} ⟨ξ_i ξ_j⟩ = ∑_{i=1}^{t} ⟨ξ_i²⟩ + ∑_{i,j=1, j≠i}^{t} ⟨ξ_i ξ_j⟩ = t

Note that the fluctuations are ∝ √t.


Stochastic differential equations

In the continuous time limit we define

dx_t = x_{t+dt} − x_t = dξ

with dξ an infinitesimal mean-zero Gaussian variable with ⟨dξ²⟩ = ν dt.

Then

d⟨x⟩/dt = 0   ⇒   ⟨x⟩(t) = 0
d⟨x²⟩/dt = ν   ⇒   ⟨x²⟩(t) = νt

ρ(x, t | x_0, 0) = (1/√(2πνt)) exp( −(x − x_0)² / (2νt) )
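A small Monte Carlo check of these relations (the values of ν, dt and the number of paths below are illustrative):

import numpy as np

# simulate many realizations of dx = dxi with <dxi^2> = nu*dt
rng = np.random.default_rng(0)
nu, dt, T, n_paths = 0.5, 0.01, 1.0, 100_000
x = np.zeros(n_paths)
for _ in range(int(T / dt)):
    x += rng.normal(0.0, np.sqrt(nu * dt), size=n_paths)
print(x.mean(), x.var(), nu * T)    # mean ~ 0, empirical variance ~ nu*T = 0.5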


Stochastic optimal control

Consider a stochastic dynamical system

dx = f(t, x, u) dt + dξ

dξ is Gaussian noise with ⟨dξ_i dξ_j⟩ = ν_{ij}(t, x, u) dt.

The cost becomes an expectation:

C(t, x, u(t → T)) = ⟨ φ(x(T)) + ∫_t^T dτ R(τ, x(τ), u(τ)) ⟩

over all stochastic trajectories starting at x with control path u(t → T).

Note that u(t), as part of u(t → T), is used at time t. Next move to x + dx and repeat the optimization.


Stochastic optimal control

We obtain the Bellman recursion

J(t, x_t) = min_{u_t} ( R(t, x_t, u_t) dt + ⟨J(t+dt, x_{t+dt})⟩ )

⟨J(t+dt, x_{t+dt})⟩ = ∫ dx_{t+dt} N(x_{t+dt} | x_t, νdt) J(t+dt, x_{t+dt})
                    = J(t, x_t) + dt ∂_t J(t, x_t) + ⟨dx⟩ ∂_x J(t, x_t) + (1/2) ⟨dx²⟩ ∂²_x J(t, x_t)

⟨dx⟩ = f(x, u, t) dt,    ⟨dx²⟩ = ν(t, x, u) dt

Thus,

−∂_t J(t, x) = min_u ( R(t, x, u) + f(x, u, t) ∂_x J(x, t) + (1/2) ν(t, x, u) ∂²_x J(x, t) )

with boundary condition J(x, T ) = φ(x).


Linear Quadratic control

The dynamics is linear

dx = [A(t)x + B(t)u + b(t)] dt + ∑_{j=1}^{m} ( C_j(t)x + D_j(t)u + σ_j(t) ) dξ_j,    ⟨dξ_j dξ_{j'}⟩ = δ_{jj'} dt

The cost function is quadratic

φ(x) = (1/2) x^T G x
R(x, u, t) = (1/2) x^T Q(t) x + u^T S(t) x + (1/2) u^T R(t) u

In this case the optimal cost-to-go is quadratic in x:

J(t, x) = (1/2) x^T P(t) x + α^T(t) x + β(t)
u(t) = −Ψ(t) x(t) − ψ(t)


Substitution in the HJB equation yields ODEs for P, α, β:

−Ṗ = P A + A^T P + ∑_{j=1}^{m} C_j^T P C_j + Q − S̃^T R̃^{−1} S̃
−α̇ = [A − B R̃^{−1} S̃]^T α + ∑_{j=1}^{m} [C_j − D_j R̃^{−1} S̃]^T P σ_j + P b
β̇ = (1/2) ψ^T R̃ ψ − α^T b − (1/2) ∑_{j=1}^{m} σ_j^T P σ_j

R̃ = R + ∑_{j=1}^{m} D_j^T P D_j
S̃ = B^T P + S + ∑_{j=1}^{m} D_j^T P C_j
Ψ = R̃^{−1} S̃
ψ = R̃^{−1} ( B^T α + ∑_{j=1}^{m} D_j^T P σ_j )

with P(t_f) = G and α(t_f) = β(t_f) = 0.


Example

Find the optimal control for the dynamics

dx = (x + u) dt + dξ,    ⟨dξ²⟩ = ν dt

with end cost φ(x) = 0 and path cost R(x, u) = (1/2)(x² + u²).

The Riccati equations reduce to

−Ṗ = 2P + 1 − P²
−α̇ = 0
β̇ = −(1/2) ν P

with P(T) = α(T) = β(T) = 0 and

u(x, t) = −P(t) x

[Figure: P(t) on 0 ≤ t ≤ 10.]
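A sketch of integrating this scalar Riccati equation backward in time with Euler steps (horizon and step size illustrative):

import numpy as np

# -P' = 2P + 1 - P^2 with P(T) = 0; the feedback control is u(x,t) = -P(t) x
T, dt = 10.0, 1e-3
n = int(T / dt)
P = np.zeros(n + 1)
for k in range(n, 0, -1):
    P[k - 1] = P[k] + dt * (2.0 * P[k] + 1.0 - P[k] ** 2)   # one backward Euler step
print(P[0])   # for large T, P(0) approaches the fixed point 1 + sqrt(2) ~ 2.414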


Comments

Note that in the last example the optimal control is independent of ν, i.e. optimal stochastic control equals optimal deterministic control.

This is true in general for 'non-multiplicative' noise (C_j = D_j = 0).


Learning

What happens if (part of) the state is not observed?

For instance,
- As a result of measurement error we do not know x_t but p(x_t | y_{0:t})
- We do not know the parameters of the dynamics
- We do not know the cost/rewards (RL case)


Learning in RL or receding horizon

Imagine eternity, and you are ordered to cook a very good meal for tomorrow.

You decide to spend all of today to learn the recipes.


Learning in RL or receding horizon

Imagine eternity, and you are ordered to cook a very good meal for tomorrow.

You decide to spend all of today to learn the recipes.

When the next day arrives, you are ordered to cook a very good meal for tomorrow.

You decide to spend all of today to learn the recipes

. . .


Learning in RL or receding horizon

Imagine eternity, and you are ordered to cook a very good meal for tomorrow.

You decide to spend all of today to learn the recipes.

When the next day arrives, you are ordered to cook a very good meal for tomorrow.

You decide to spend all of today to learn the recipes.

. . .

[Figure: V as a function of t.]

- The learning phase takes forever.
- Mix of exploration and optimizing (RL: actor-critic, E3, ...)
- Learning is not part of the control problem


Finite horizon learning

Imagine instead life as we know it. It is finite and we have only one life.

The aim is to maximize accumulated reward. This requires planning your learning!
- At t = 0, action is useless.
- At t = T, learning is useless.

[Figure: learning and action effort as a function of time t.]

Problem of inference and control.


Inference and control

As an example, consider the problem

dx = α u dt + dξ

with α unobserved and x observed. Path cost R(x, u, t), end cost φ(x) and noise variance ⟨dξ²⟩ = ν dt are given.

The problem is that the future information that we receive about α depends on u. Each time step we observe dx and u and thus learn about α.

p_{t+dt}(α | dx, u) ∝ p(dx | α, u) p_t(α)

The solution is to augment the state space with parameters θ_t (sufficient statistics) that describe p_t(α) = p(α | θ_t), with θ_0 known. Then, with α = ±1 and p_t(α = 1) = σ(θ_t):

dθ = (u/ν) dx = (u/ν)(α u dt + dξ)

NB: in the forward pass dθ = F(dx), thus θ is also observed.
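A small simulation of this belief update (the values of ν and the constant exploratory control below are illustrative):

import numpy as np

# with alpha = +/-1 and p_t(alpha = 1) = sigma(theta_t),
# each observed dx shifts theta by (u/nu) dx
rng = np.random.default_rng(3)
nu, dt, u = 0.5, 0.01, 1.0        # constant control, for illustration only
alpha = -1.0                      # true (unobserved) parameter
theta = 0.0                       # flat prior: p_0(alpha = 1) = sigma(0) = 0.5

for _ in range(500):
    dx = alpha * u * dt + rng.normal(0.0, np.sqrt(nu * dt))
    theta += (u / nu) * dx        # sufficient-statistic update d(theta) = (u/nu) dx

print(1.0 / (1.0 + np.exp(-theta)))   # p(alpha = 1); approaches 0, since alpha = -1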


With z_t = (x_t, θ_t) we obtain a standard HJB:

−∂_t J(t, z) dt = min_u ( R(t, z, u) dt + ⟨dz⟩_z ∂_z J(z, t) + (1/2) ⟨dz²⟩_z ∂²_z J(z, t) )

with boundary condition J(z, T) = φ(x).

The expectation values are conditioned on (x_t, θ_t) and are averages over p(α | θ_t) and the Gaussian noise, cf.

⟨dx⟩_{x,θ} = ⟨α u dt + dξ⟩_{x,θ} = ᾱ u dt,    ᾱ = ∫ dα p(α | θ) α

[Figure: expansion of J(z, t+dt) between t and tf using ⟨dz⟩_z and ⟨dz²⟩_z.]


Certainty equivalence

An important special case of a partially observable control problem is the Kalman filter (y observed, x not observed).

dx = (x + u) dt + dξ
y = x + η

The cost is quadratic in x and u, for instance

C(x_t, t, u_{t:T}) = ⟨ ∑_{τ=t}^{T} (1/2)(x_τ² + u_τ²) ⟩

The optimal control is u(x, t).

When x_t is not observed, we can compute p(x_t | y_{0:t}) using Kalman filtering and the optimal control minimizes

C_KF(y_{0:t}, t, u_{t:T}) = ∫ dx_t p(x_t | y_{0:t}) C(x_t, t, u_{t:T})


Since p(x_t | y_{0:t}) = N(x_t | µ_t, σ_t²) is Gaussian and

C_KF(y_{0:t}, t, u_{t:T}) = ∫ dx_t C(x_t, t, u_{t:T}) N(x_t | µ_t, σ_t²) = ∑_{τ=t}^{T} (1/2) u_τ² + ∑_{τ=t}^{T} (1/2) ⟨x_τ²⟩_{µ_t, σ_t}
                          = · · · = C(µ_t, t, u_{t:T}) + (1/2)(T − t) σ_t²

The first term is identical to the observed case with x_t → µ_t. The second term does not depend on u and thus does not affect the optimal control.

The optimal control for the Kalman filter is identical to the observed case with x_t replaced by µ_t:

u_KF(y_{0:t}, t) = u(µ_t, t)
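A sketch of certainty equivalence in simulation: the LQ gain P(t) from the earlier example is computed as if the state were observed, a Kalman filter tracks µ_t, and the applied control is u = −P(t) µ_t. The time discretization and noise levels are illustrative, not from the slides:

import numpy as np

rng = np.random.default_rng(1)
T, dt = 10.0, 0.01
n = int(T / dt)
nu, r = 0.1, 0.05                      # process and observation noise variances

# LQ gain: -P' = 2P + 1 - P^2, P(T) = 0, integrated backward
P = np.zeros(n + 1)
for k in range(n, 0, -1):
    P[k - 1] = P[k] + dt * (2.0 * P[k] + 1.0 - P[k] ** 2)

x, mu, s2 = 1.0, 0.0, 1.0              # true state, filter mean, filter variance
for k in range(n):
    u = -P[k] * mu                     # certainty-equivalent control
    x += (x + u) * dt + rng.normal(0.0, np.sqrt(nu * dt))   # dx = (x+u)dt + dxi
    mu += (mu + u) * dt                # Kalman filter: predict ...
    s2 += (2.0 * s2 + nu) * dt
    y = x + rng.normal(0.0, np.sqrt(r))                     # ... observe y = x + eta
    K = s2 / (s2 + r)                  # ... and update
    mu, s2 = mu + K * (y - mu), (1.0 - K) * s2

print(x, mu)    # both regulated toward 0; the control only ever uses mu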


Summary

For infinite time problems, learning is a meta problem with a time scale unrelated to the horizon time.

A finite time partially observable (or adaptive) control problem is in general equivalent to an observable non-adaptive control problem in an extended state space.

The partially observable case is generally more complex than the observable case. For instance an LQ problem with unknown parameters is not LQ.

For a Kalman filter with unobserved states and known parameters, the partial observability does not affect the optimal control law (certainty equivalence).

Learning can be done either in the maximum likelihood sense or in a full Bayesian way.


Path integral control

Solving the PDE is hard. We consider a special case that can be ’solved’.

dx = ( f(x, t) + u ) dt + dξ
R(x, u, t) = V(x, t) + (1/2) u^T R u

and φ(x) arbitrary.

The stochastic HJB equation becomes:

−∂_t J = min_u ( (1/2) u^T R u + V + (f + u)^T ∂_x J + (1/2) Tr(ν ∂²_x J) )
       = −(1/2) (∂_x J)^T R^{−1} (∂_x J) + V + f^T ∂_x J + (1/2) Tr(ν ∂²_x J)

u = −R^{−1} ∂_x J(x, t)

The equation for J must be solved backward in time with boundary condition J(T, x) = φ(x).


Closed form solution

If we further assume that R^{−1} = λν for some scalar λ, then we can solve J in closed form. The solution contains the following steps:

Substitute J = −λ log Ψ in the HJB equations. Because of the relation R^{−1} = λν the terms quadratic in Ψ cancel and only linear terms remain.

∂_t Ψ = −H Ψ,    H = −V/λ + f ∂_x + (1/2) ν ∂²_x

This equation must be solved backward in time with boundary condition Ψ(x, T) = exp(−φ(x)/λ).

The linearity allows us to reverse the direction of computation, replacing it by a diffusion process, in the following way.


Closed form solution

Let ρ(y, τ | x, t) describe a diffusion process defined by the Fokker-Planck equation

∂_τ ρ = −(V/λ) ρ − ∂_y(f ρ) + (1/2) ν ∂²_y ρ = H† ρ    (1)

with ρ(y, t | x, t) = δ(y − x).

Define

A(x, t) = ∫ dy ρ(y, τ | x, t) Ψ(y, τ).

It is easy to see, by using the equations of motion for Ψ and ρ, that A(x, t) is independent of τ. Evaluating A(x, t) for τ = t yields A(x, t) = Ψ(x, t); evaluating A(x, t) for τ = T yields A(x, t) = ∫ dy ρ(y, T | x, t) Ψ(y, T). Equating the two,

Ψ(x, t) = ∫ dy ρ(y, T | x, t) exp(−φ(y)/λ)


Forward sampling of the diffusion process

The diffusion equation

∂_τ ρ = −(V/λ) ρ − ∂_y(f ρ) + (1/2) ν ∂²_y ρ    (2)

can be sampled as

dx = f(x, t) dt + dξ
x → x + dx,    with probability 1 − V(x, t) dt / λ
x → †,         with probability V(x, t) dt / λ


Forward sampling of the diffusion process

We can estimate

Ψ(x, t) = ∫ dy ρ(y, T | x, t) exp(−φ(y)/λ) ≈ (1/N) ∑_{i ∈ alive} exp( −φ(x_i(T))/λ )

by computing N trajectories x_i(t → T), i = 1, . . . , N.

'Alive' denotes the subset of trajectories that do not get killed along the way by the † operation.
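A sketch of this estimator for a 1-D example; the drift f, the potential V, the end cost φ and all numerical values below are illustrative. With R = 1 and λ = ν = 1 the relation R⁻¹ = λν holds:

import numpy as np

rng = np.random.default_rng(2)
lam = nu = 1.0
T, dt, N = 1.0, 0.01, 10_000
f = lambda x, t: -x                       # uncontrolled drift
V = lambda x, t: 0.5 * x ** 2             # path cost potential
phi = lambda x: (x - 1.0) ** 2            # end cost

def estimate_psi(x0, t0=0.0):
    x = np.full(N, x0)
    alive = np.ones(N, dtype=bool)
    for t in np.arange(t0, T, dt):
        # diffuse the surviving particles: dx = f dt + dxi, <dxi^2> = nu dt
        x[alive] += f(x[alive], t) * dt + rng.normal(0.0, np.sqrt(nu * dt), alive.sum())
        # kill each particle with probability V dt / lambda
        alive &= rng.random(N) >= V(x, t) * dt / lam
    # Psi ~ (1/N) * sum over surviving particles of exp(-phi / lambda)
    return np.exp(-phi(x[alive]) / lam).sum() / N

psi = estimate_psi(0.0)
print("Psi ~", psi, "  J = -lambda*log(Psi) ~", -lam * np.log(psi))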


The diffusion process

The diffusion process can be written as a path integral:

ρ(y, T | x, t) = ∫ [dx]_x^y exp( −(1/ν) S_path(x(t → T)) )

S_path(x(t → T)) = ∫_t^T dτ [ (1/2)( ẋ(τ) − f(x(τ), τ) )² + V(x(τ), τ) ]

[Figure: sample paths from x at time t to the end time tf.]


The path integral formulation

Ψ(x, t) = ∫ dy ρ(y, T | x, t) exp( −φ(y)/ν ) = ∫ [dx]_x exp( −(1/ν) S(x(t → T)) )

S(x(t → T)) = S_path(x(t → T)) + φ(x(T))

Ψ is a partition sum and J = −ν log Ψ therefore can be interpreted as a free energy. S is the energy of a path and ν the temperature.

The corresponding probability distribution is

p(x(t → T) | x, t) = (1/Ψ(x, t)) exp( −(1/ν) S(x(t → T)) )


An example: double slit

dx = u dt + dξ

C = ⟨ (1/2) x(T)² + ∫_0^T dτ [ (1/2) u(τ)² + V(x(τ), τ) ] ⟩

V(x, t = 1) implements a slit at an intermediate time t = 1.

Ψ(x, t) = ∫ dy ρ(y, T | x, t) Ψ(y, T)

can be solved in closed form.

[Figures: the double-slit potential at t = 1, and the optimal cost-to-go J(x, t) for t = 0, 0.99, 1.01 and 2.]


MC sampling on double slit

J(x, t) ≈ −ν log ( (1/N) ∑_{i=1}^{N} exp( −φ(x_i(T))/ν ) )

[Figures: (a) naive sample paths; (b) J(x, t = 0) by naive sampling (n = 100000), MC vs. exact; (c) J(x, t = 0) by Laplace approximation and importance sampling (n = 100).]


The delayed choice

If we go further back in time we encounter a new phenomenon: a delayed choice of when to decide.

When the slit size goes to zero, J is given by

J(x, T) = −ν log ∫ dy ρ(y | x) e^{−φ(y)/ν} = (1/T) ( (1/2) x² − νT log 2 cosh( x/(νT) ) )

where T is the time to reach the slits. The expression between brackets is a typical free energy with temperature νT.

[Figure: J(x, t) as a function of x for T = 2, 1, 0.5.]

Symmetry breaking at νT = 1 separates two qualitatively different behaviors.
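Evaluating the free energy above on a grid shows the symmetry breaking (a quick check, with ν = 1 assumed): for νT > 1 the cost-to-go has a single minimum at x = 0, while for νT < 1 two minima appear and the controller must choose a side:

import numpy as np

nu = 1.0
x = np.linspace(-2.0, 2.0, 401)
for T in (2.0, 1.0, 0.5):
    J = (0.5 * x ** 2 - nu * T * np.log(2.0 * np.cosh(x / (nu * T)))) / T
    # single minimum at 0 for nu*T >= 1, two symmetric minima for nu*T < 1
    print(f"T = {T}:  argmin J at x = {x[np.argmin(J)]:+.2f}")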


The delayed choice

[Figures: sample trajectories under optimal stochastic control and under deterministic control.]

The timing of the decision, that is when the automaton decides to go left or right, is the consequence of spontaneous symmetry breaking.


N joint arm in a plane

[Figures: configurations of the n-joint arm in the plane.]

dθ_i = u_i dt + dξ_i,    i = 1, . . . , n

C = ⟨ φ(x_n(θ)) + ∫ dτ (1/2) u(τ)² ⟩

u_i = ( ⟨θ_i⟩ − θ_i ) / (T − t),    i = 1, . . . , n

⟨· · ·⟩ ∝ ρ(θ, T | θ_0, t) exp( −φ(x_n(θ)) )

⟨θ⟩ from uncontrolled penalized diffusion. Variational approximation.
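A sketch of this rule for a planar arm with unit-length links; the noise level, horizon, target and end cost below are illustrative. The endpoints of the uncontrolled diffusion are weighted by exp(−φ(x_n(θ))) and each joint is steered toward the weighted mean (plain importance sampling in place of the variational approximation):

import numpy as np

rng = np.random.default_rng(4)
n, nu, T, t, dt, N = 3, 0.1, 1.0, 0.0, 0.01, 5000
target = np.array([2.0, 1.0])                    # desired end-effector position

def end_effector(theta):                         # x_n(theta) for unit links
    angles = np.cumsum(theta, axis=-1)
    return np.stack([np.cos(angles).sum(-1), np.sin(angles).sum(-1)], axis=-1)

def phi(theta):                                  # end cost: squared distance to target
    return ((end_effector(theta) - target) ** 2).sum(-1)

theta0 = np.zeros(n)
steps = int((T - t) / dt)
# endpoints of the uncontrolled diffusion d(theta) = dxi from t to T
theta = theta0 + rng.normal(0.0, np.sqrt(nu * dt), (N, steps, n)).sum(axis=1)
ph = phi(theta)
w = np.exp(-(ph - ph.min()))                     # weights proportional to exp(-phi)
theta_mean = (w[:, None] * theta).sum(0) / w.sum()
u = (theta_mean - theta0) / (T - t)              # control at time t for each joint
print(u)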


Other path integral control issues

- Coordination of multiple agents
- Relation between reinforcement learning and learning
- Realistic robotics applications; non-differentiable cost functions (obstacles)


Summary

Deterministic control can be solved by
- HJB, a PDE
- PMP, two coupled ODEs with mixed boundary conditions

Stochastic control can be solved by
- HJB in general
- Riccati equation for sufficient statistics in the LQ case
- Path integral is LQ control with arbitrary cost and dynamics
- RL is a special case

Learning (PO states or parameter values)
- Decoupled from control in the RL case
- Joint inference and control typically harder than control only
- For Kalman filters the PO is irrelevant (certainty equivalence)


Summary

The PI control problem provides a novel link to machine learning:
- statistical physics
- symmetry breaking, 'phase transitions'
- 'efficient' computation (MCMC, BP, MF, EP)


Further reading

Check out the ICML tutorial paper and references there and other general references at the tutorial web page

http://ml.cs.tu-berlin.de/~mtoussai/08-optimal-control/
