
Stochastic optimal control theory

Bert Kappen, SNN, Radboud University, Nijmegen, the Netherlands

July 5, 2008


Introduction

Optimal control theory: optimize the sum of a path cost and an end cost. The result is an optimal control sequence and an optimal trajectory.

Input: Cost function. Output: Optimal trajectory and controls.

Classical control theory: what control signal should I give to move a plant to a desired state?

Input: Desired state trajectory. Output: Optimal control trajectory.


Types of optimal control problems

Finite horizon (fixed horizon time)

[Figure: controlled trajectory of the state x through the environment, from time t to the horizon time tf.]

Dynamics and environment may depend explicitly on time. Optimal control depends explicitly on time.


Types of optimal control problems

Finite horizon (moving horizon)


Dynamics and environment are static. Optimal control is time independent.


Types of optimal control problems

Finite horizon (moving horizon)


Dynamics and environment are static. Optimal control is time independent. Similar to RL.


Types of optimal control problems

Other types of control problems:
- minimum time
- infinite horizon, average reward
- infinite horizon, absorbing states

In addition one should distinguish:
- discrete vs. continuous state
- discrete vs. continuous time
- observable vs. partial observable


Overview

Deterministic optimal control (Kappen, 30 min.)
- Introduction of delayed reward problem in discrete time;
- Dynamic programming solution and deterministic Bellman equations;
- Solution in continuous time and states;
- Example: Mass on a spring
- Pontryagin maximum principle; Notion of an optimal (particle) trajectory
- Again Mass on a spring


Overview

Stochastic optimal control, discrete case (Toussaint, 40 min.)
- Stochastic Bellman equation (discrete state and time) and Dynamic Programming
- Reinforcement learning (exact solution, value iteration, policy improvement); Actor critic networks;
- Markov decision problems and probabilistic inference;
- Example: robotic motion control and planning


Overview

Stochastic optimal control, continuous case (Kappen, 40 min.)
- Stochastic differential equations
- Hamilton-Jacobi-Bellman equation (continuous state and time)
- LQ control, Riccati equation;
- Example of LQ control
- Learning; Partial observability: Inference and control;
- Certainty equivalence
- Path integral control; the role of noise and symmetry breaking; efficient approximate computation (MC, MF, BP, ...)
- Examples: Double slit, delayed choice, n joint arm


Overview

Research issues (Toussaint, 30 min.)
- Learning;
- Efficient methods to compute value functions/cost-to-go
- Control under partial observability (POMDPs)


Discrete time control

Consider the control of a discrete time deterministic dynamical system:

x_{t+1} = x_t + f(t, x_t, u_t),    t = 0, 1, . . . , T − 1

x_t describes the state and u_t specifies the control or action at time t.

Given x_{t=0} = x_0 and u_{0:T−1} = u_0, u_1, . . . , u_{T−1}, we can compute x_{1:T}.

Define a cost for each sequence of controls:

C(x_0, u_{0:T−1}) = φ(x_T) + ∑_{t=0}^{T−1} R(t, x_t, u_t)

The problem of optimal control is to find the sequence u_{0:T−1} that minimizes C(x_0, u_{0:T−1}).


Dynamic programming

Find the minimal cost path from A to J.

C(J) = 0, C(H) = 3, C(I) = 4

C(F) = min(6 + C(H), 3 + C(I))
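A minimal backward dynamic-programming sketch of this route problem. Only the values above appear on the slide, so the remaining nodes and edge costs below are hypothetical, chosen to be consistent with C(J) = 0, C(H) = 3, C(I) = 4 and C(F) = min(6 + C(H), 3 + C(I)):

# backward dynamic programming on a hypothetical stage graph (Python)
edges = {                     # node -> {successor: edge cost}
    "A": {"B": 2, "C": 4},
    "B": {"D": 7, "E": 4},
    "C": {"E": 3, "F": 2},
    "D": {"H": 1},
    "E": {"H": 4, "I": 6},
    "F": {"H": 6, "I": 3},
    "H": {"J": 3},
    "I": {"J": 4},
}

cost, policy = {"J": 0.0}, {}          # cost-to-go of the goal node J is 0
for node in ["I", "H", "F", "E", "D", "C", "B", "A"]:   # reverse order
    succ, c = min(edges[node].items(), key=lambda kv: kv[1] + cost[kv[0]])
    cost[node], policy[node] = c + cost[succ], succ

print(cost["F"])                       # min(6 + 3, 3 + 4) = 7
path, node = ["A"], "A"
while node != "J":
    node = policy[node]
    path.append(node)
print(path, cost["A"])                 # optimal route and its cost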


Discrete time control

The optimal control problem can be solved by dynamic programming. Introduce the optimal cost-to-go:

J(t, x_t) = min_{u_{t:T−1}} ( φ(x_T) + ∑_{s=t}^{T−1} R(s, x_s, u_s) )

which solves the optimal control problem from an intermediate time t until the fixed end time T, for all intermediate states x_t.

Then,

J(T, x) = φ(x)
J(0, x) = min_{u_{0:T−1}} C(x, u_{0:T−1})


Discrete time control

One can recursively compute J(t, x) from J(t+1, x) for all x in the following way:

J(t, x_t) = min_{u_{t:T−1}} ( φ(x_T) + ∑_{s=t}^{T−1} R(s, x_s, u_s) )
          = min_{u_t} ( R(t, x_t, u_t) + min_{u_{t+1:T−1}} [ φ(x_T) + ∑_{s=t+1}^{T−1} R(s, x_s, u_s) ] )
          = min_{u_t} ( R(t, x_t, u_t) + J(t+1, x_{t+1}) )
          = min_{u_t} ( R(t, x_t, u_t) + J(t+1, x_t + f(t, x_t, u_t)) )

This is called the Bellman Equation.

Computes u as a function of x, t for all intermediate t and all x.


Discrete time control

The algorithm to compute the optimal control u*_{0:T−1}, the optimal trajectory x*_{1:T} and the optimal cost is given by

1. Initialization: J(T, x) = φ(x)

2. Backwards: For t = T − 1, . . . , 0 and for all x compute

u*_t(x) = arg min_u { R(t, x, u) + J(t+1, x + f(t, x, u)) }
J(t, x) = R(t, x, u*_t) + J(t+1, x + f(t, x, u*_t))

3. Forwards: For t = 0, . . . , T − 1 compute

x*_{t+1} = x*_t + f(t, x*_t, u*_t(x*_t))

NB: the backward computation requires u*_t(x) for all x.
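The scheme above can be written down directly for a small tabular problem. The dynamics, costs and state grid below are a toy illustration (not from the slides):

# tabular dynamic programming: backward pass for J and u*, forward pass for x*
T = 10
states = range(-10, 11)                 # integer states
controls = (-1, 0, 1)
clip = lambda x: max(-10, min(10, x))   # keep the state on the grid
phi = lambda x: x ** 2                  # end cost
R = lambda t, x, u: x ** 2 + u ** 2     # running cost

# 1. Initialization: J(T, x) = phi(x)
J = {T: {x: phi(x) for x in states}}
u_star = {}

# 2. Backwards: J(t, x) = min_u [ R(t, x, u) + J(t+1, x + u) ]
for t in range(T - 1, -1, -1):
    J[t], u_star[t] = {}, {}
    for x in states:
        J[t][x], u_star[t][x] = min(
            (R(t, x, u) + J[t + 1][clip(x + u)], u) for u in controls
        )

# 3. Forwards: follow u*_t(x) from the initial state
x, trajectory = 5, [5]
for t in range(T):
    x = clip(x + u_star[t][x])
    trajectory.append(x)
print(J[0][5], trajectory)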


Continuous limit

Replace t+1 by t+dt with dt → 0.

x_{t+dt} = x_t + f(x_t, u_t, t) dt

C(x_0, u(0 → T)) = φ(x_T) + ∫_0^T dτ R(τ, x(τ), u(τ))

Assume J(x, t) is smooth.

J(t, x) = min_u ( R(t, x, u) dt + J(t+dt, x + f(x, u, t) dt) )
        ≈ min_u ( R(t, x, u) dt + J(t, x) + ∂_t J(t, x) dt + ∂_x J(t, x) f(x, u, t) dt )

−∂_t J(t, x) = min_u ( R(t, x, u) + f(x, u, t) ∂_x J(x, t) )

with boundary condition J(x, T ) = φ(x).


Continuous limit

−∂_t J(t, x) = min_u ( R(t, x, u) + f(x, u, t) ∂_x J(x, t) )

with boundary condition J(x, T ) = φ(x).

This is called the Hamilton-Jacobi-Bellman Equation.

Computes the anticipated potential J(t, x) from the future potential φ(x).


Example: Mass on a spring

The spring force F_z = −z pulls towards the rest position; the control force is F_u = u.

Newton's law: F = −z + u = m z̈, with m = 1.

Control problem: given initial position and velocity z(0) = ż(0) = 0 at time t = 0, find the control path −1 < u(0 → T) < 1 such that z(T) is maximal.


Example: Mass on a spring

Introduce x_1 = z, x_2 = ż; then

ẋ_1 = x_2
ẋ_2 = −x_1 + u

The end cost is φ(x) = −x_1; path cost R(x, u, t) = 0.

The HJB takes the form:

−∂_t J = min_u ( x_2 ∂J/∂x_1 − x_1 ∂J/∂x_2 + u ∂J/∂x_2 )
       = x_2 ∂J/∂x_1 − x_1 ∂J/∂x_2 − |∂J/∂x_2| ,    u = −sign(∂J/∂x_2)


Example: Mass on a spring

The solution is

J(t, x_1, x_2) = −cos(t − T) x_1 + sin(t − T) x_2 + α(t)
u(t, x_1, x_2) = −sign(sin(t − T))

As an example consider T = 2π. Then, the optimal control is

u = −1,   0 < t < π
u = +1,   π < t < 2π

[Figure: optimal trajectory x_1(t), x_2(t) for T = 2π.]
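A quick numerical check of this solution (a sketch, assuming T = 2π and x(0) = 0 as above): integrate the dynamics under the bang-bang control with simple Euler steps.

import numpy as np

# simulate x1' = x2, x2' = -x1 + u with u = -sign(sin(t - T))
T, dt = 2 * np.pi, 1e-4
x1, x2 = 0.0, 0.0
for t in np.arange(0.0, T, dt):
    u = -np.sign(np.sin(t - T))              # u = -1 for 0 < t < pi, +1 afterwards
    x1, x2 = x1 + dt * x2, x2 + dt * (-x1 + u)
print(x1)   # final position z(T); analytically the maximum reachable value is 4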


Pontryagin minimum principle

The HJB equation is a PDE with boundary condition at future time. The PDE is solved using discretization of space and time.

The solution is an optimal cost-to-go for all x and t. From this we compute the optimal trajectory and optimal control.

An alternative approach is a variational approach that directly finds the optimal trajectory and optimal control.


Pontryagin minimum principle

We can write the optimal control problem as a constrained optimization problem with independent variables u(0 → T) and x(0 → T):

min_{u(0→T), x(0→T)}  φ(x(T)) + ∫_0^T dt R(x(t), u(t), t)

subject to the constraint ẋ = f(x, u, t) and boundary condition x(0) = x_0.

Introduce the Lagrange multiplier function λ(t):

C = φ(x(T)) + ∫_0^T dt [ R(t, x(t), u(t)) − λ(t)( f(t, x(t), u(t)) − ẋ(t) ) ]
  = φ(x(T)) + ∫_0^T dt [ −H(t, x(t), u(t), λ(t)) + λ(t) ẋ(t) ]

−H(t, x, u, λ) = R(t, x, u) − λ f(t, x, u)


Derivation PMP

The solution is found by extremizing C. This gives a necessary but not sufficient condition for a solution.

If we vary the action with respect to the trajectory x, the control u and the Lagrange multiplier λ, we get:

δC = φ_x(x(T)) δx(T) + ∫_0^T dt [ −H_x δx(t) − H_u δu(t) + (−H_λ + ẋ(t)) δλ(t) + λ(t) δẋ(t) ]
   = ( φ_x(x(T)) + λ(T) ) δx(T) + ∫_0^T dt [ (−H_x − λ̇(t)) δx(t) − H_u δu(t) + (−H_λ + ẋ(t)) δλ(t) ]

For instance, H_x = ∂H(t, x(t), u(t), λ(t)) / ∂x(t).


We can solve H_u(t, x, u, λ) = 0 for u and denote the solution as

u*(t, x, λ)

Assumes H convex in u.

The remaining equations are

ẋ = H_λ(t, x, u*(t, x, λ), λ)
λ̇ = −H_x(t, x, u*(t, x, λ), λ)

with boundary conditions

x(0) = x_0,    λ(T) = −φ_x(x(T))

Mixed boundary value problem.


Again mass on a spring

Problem

ẋ_1 = x_2,    ẋ_2 = −x_1 + u
R(x, u, t) = 0,    φ(x) = −x_1

Hamiltonian

H(t, x, u, λ) = −R(t, x, u) + λ^T f(t, x, u) = λ_1 x_2 + λ_2 (−x_1 + u)
H*(t, x, λ) = λ_1 x_2 − λ_2 x_1 − |λ_2| ,    u* = −sign(λ_2)

The Hamilton equations

ẋ = ∂H*/∂λ  ⇒  ẋ_1 = x_2,  ẋ_2 = −x_1 − sign(λ_2)
λ̇ = −∂H*/∂x  ⇒  λ̇_1 = −λ_2,  λ̇_2 = λ_1

with x(t = 0) = x_0 and λ(t = T) = −φ_x(x(T)) = (1, 0).
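A sketch of solving this two-point boundary value problem numerically (step size and horizon T = 2π assumed as in the earlier example). For this particular problem the costate equations do not involve x, so λ can be integrated backward on its own and x forward afterwards:

import numpy as np

T, dt = 2 * np.pi, 1e-3
ts = np.arange(0.0, T, dt)

# backward integration of lam1' = -lam2, lam2' = lam1 from lam(T) = (1, 0)
lam = np.zeros((len(ts) + 1, 2))
lam[-1] = (1.0, 0.0)
for k in range(len(ts), 0, -1):
    l1, l2 = lam[k]
    lam[k - 1] = (l1 + dt * l2, l2 - dt * l1)    # one Euler step backward in time

# forward integration of the state with the bang-bang control u = -sign(lam2)
x1, x2 = 0.0, 0.0
for k in range(len(ts)):
    u = -np.sign(lam[k][1])
    x1, x2 = x1 + dt * x2, x2 + dt * (-x1 + u)
print(x1)    # ~4, the same trajectory found with the HJB solution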


Comments

The HJB method gives a sufficient (and often necessary) condition for optimality. The solution of the PDE is expensive.

The PMP method provides a necessary condition for optimal control. This means that it provides candidate solutions for optimality.

The PMP method is computationally less complicated than the HJB method because it does not require discretization of the state space.

Optimal control in continuous space and time contains many complications related to the existence, uniqueness and smoothness of the solution, particularly in the absence of noise. In the presence of noise many of these intricacies disappear.

HJB generalizes to the stochastic case, PMP does not (at least not easily).


Stochastic differential equations

Consider the random walk on the line:

x_{t+1} = x_t + ξ_t,    ξ_t = ±1

with x_0 = 0. We can compute

x_t = ∑_{i=1}^{t} ξ_i

Since x_t is a sum of random variables, x_t becomes Gaussian distributed with

⟨x_t⟩ = ∑_{i=1}^{t} ⟨ξ_i⟩ = 0

⟨x_t²⟩ = ∑_{i,j=1}^{t} ⟨ξ_i ξ_j⟩ = ∑_{i=1}^{t} ⟨ξ_i²⟩ + ∑_{i,j=1, j≠i}^{t} ⟨ξ_i ξ_j⟩ = t

Note that the fluctuations are ∝ √t.


Stochastic differential equations

In the continuous time limit we define

dx_t = x_{t+dt} − x_t = dξ

with dξ an infinitesimal mean-zero Gaussian variable with ⟨dξ²⟩ = ν dt.

Then

d⟨x⟩/dt = 0   ⇒   ⟨x⟩(t) = 0
d⟨x²⟩/dt = ν   ⇒   ⟨x²⟩(t) = νt

ρ(x, t | x_0, 0) = (1/√(2πνt)) exp( −(x − x_0)² / (2νt) )
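A small Monte Carlo check of these relations (the values of ν, dt and the number of paths below are illustrative):

import numpy as np

# simulate many realizations of dx = dxi with <dxi^2> = nu*dt
rng = np.random.default_rng(0)
nu, dt, T, n_paths = 0.5, 0.01, 1.0, 100_000
x = np.zeros(n_paths)
for _ in range(int(T / dt)):
    x += rng.normal(0.0, np.sqrt(nu * dt), size=n_paths)
print(x.mean(), x.var(), nu * T)    # mean ~ 0, empirical variance ~ nu*T = 0.5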


Stochastic optimal control

Consider a stochastic dynamical system

dx = f(t, x, u) dt + dξ

dξ is Gaussian noise with ⟨dξ_i dξ_j⟩ = ν_{ij}(t, x, u) dt.

The cost becomes an expectation:

C(t, x, u(t → T)) = ⟨ φ(x(T)) + ∫_t^T dτ R(τ, x(τ), u(τ)) ⟩

over all stochastic trajectories starting at x with control path u(t → T).

Note that u(t), as part of u(t → T), is used at time t. Next move to x + dx and repeat the optimization.


Stochastic optimal control

We obtain the Bellman recursion

J(t, x_t) = min_{u_t} ( R(t, x_t, u_t) dt + ⟨J(t+dt, x_{t+dt})⟩ )

⟨J(t+dt, x_{t+dt})⟩ = ∫ dx_{t+dt} N(x_{t+dt} | x_t, νdt) J(t+dt, x_{t+dt})
                    = J(t, x_t) + dt ∂_t J(t, x_t) + ⟨dx⟩ ∂_x J(t, x_t) + (1/2) ⟨dx²⟩ ∂²_x J(t, x_t)

⟨dx⟩ = f(x, u, t) dt,    ⟨dx²⟩ = ν(t, x, u) dt

Thus,

−∂_t J(t, x) = min_u ( R(t, x, u) + f(x, u, t) ∂_x J(x, t) + (1/2) ν(t, x, u) ∂²_x J(x, t) )

with boundary condition J(x, T ) = φ(x).


Linear Quadratic control

The dynamics is linear

dx = [A(t)x + B(t)u + b(t)] dt + ∑_{j=1}^{m} ( C_j(t)x + D_j(t)u + σ_j(t) ) dξ_j,    ⟨dξ_j dξ_{j'}⟩ = δ_{jj'} dt

The cost function is quadratic

φ(x) = (1/2) x^T G x
R(x, u, t) = (1/2) x^T Q(t) x + u^T S(t) x + (1/2) u^T R(t) u

In this case the optimal cost-to-go is quadratic in x:

J(t, x) = (1/2) x^T P(t) x + α^T(t) x + β(t)
u(t) = −Ψ(t) x(t) − ψ(t)


Substitution in the HJB equation yields ODEs for P, α, β:

−Ṗ = P A + A^T P + ∑_{j=1}^{m} C_j^T P C_j + Q − S̃^T R̃^{−1} S̃
−α̇ = [A − B R̃^{−1} S̃]^T α + ∑_{j=1}^{m} [C_j − D_j R̃^{−1} S̃]^T P σ_j + P b
β̇ = (1/2) ψ^T R̃ ψ − α^T b − (1/2) ∑_{j=1}^{m} σ_j^T P σ_j

R̃ = R + ∑_{j=1}^{m} D_j^T P D_j
S̃ = B^T P + S + ∑_{j=1}^{m} D_j^T P C_j
Ψ = R̃^{−1} S̃
ψ = R̃^{−1} ( B^T α + ∑_{j=1}^{m} D_j^T P σ_j )

with P(t_f) = G and α(t_f) = β(t_f) = 0.


Example

Find the optimal control for the dynamics

dx = (x + u) dt + dξ,    ⟨dξ²⟩ = ν dt

with end cost φ(x) = 0 and path cost R(x, u) = (1/2)(x² + u²).

The Riccati equations reduce to

−Ṗ = 2P + 1 − P²
−α̇ = 0
β̇ = −(1/2) ν P

with P(T) = α(T) = β(T) = 0 and

u(x, t) = −P(t) x

[Figure: P(t) on 0 ≤ t ≤ 10.]
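A sketch of integrating this scalar Riccati equation backward in time with Euler steps (horizon and step size illustrative):

import numpy as np

# -P' = 2P + 1 - P^2 with P(T) = 0; the feedback control is u(x,t) = -P(t) x
T, dt = 10.0, 1e-3
n = int(T / dt)
P = np.zeros(n + 1)
for k in range(n, 0, -1):
    P[k - 1] = P[k] + dt * (2.0 * P[k] + 1.0 - P[k] ** 2)   # one backward Euler step
print(P[0])   # for large T, P(0) approaches the fixed point 1 + sqrt(2) ~ 2.414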


Comments

Note that in the last example the optimal control is independent of ν, i.e. optimal stochastic control equals optimal deterministic control.

This is true in general for 'non-multiplicative' noise (C_j = D_j = 0).


Learning

What happens if (part of) the state is not observed?

For instance,
- As a result of measurement error we do not know x_t but p(x_t | y_{0:t})
- We do not know the parameters of the dynamics
- We do not know the cost/rewards (RL case)


Learning in RL or receding horizon

Imagine eternity, and you are ordered to cook a very good meal for tomorrow.

You decide to spend all of today to learn the recipes.


Learning in RL or receding horizon

Imagine eternity, and you are ordered to cook a very good meal for tomorrow.

You decide to spend all of today to learn the recipes.

When the next day arrives, you are ordered to cook a very good meal for tomorrow.

You decide to spend all of today to learn the recipes

. . .


Learning in RL or receding horizon

Imagine eternity, and you are ordered to cook a very good meal for tomorrow.

You decide to spend all of today to learn the recipes.

When the next day arrives, you are ordered to cook a very good meal for tomorrow.

You decide to spend all of today to learn the recipes.

. . .

[Figure: V as a function of t.]

- The learning phase takes forever.
- Mix of exploration and optimizing (RL: actor-critic, E3, ...)
- Learning is not part of the control problem


Finite horizon learning

Imagine instead life as we know it. It is finite and we have only one life.

The aim is to maximize accumulated reward. This requires planning your learning!
- At t = 0, action is useless.
- At t = T, learning is useless.

[Figure: learning and action effort as a function of time t.]

Problem of inference and control.


Inference and control

As an example, consider the problem

dx = α u dt + dξ

with α unobserved and x observed. Path cost R(x, u, t), end cost φ(x) and noise variance ⟨dξ²⟩ = ν dt are given.

The problem is that the future information that we receive about α depends on u. Each time step we observe dx and u and thus learn about α.

p_{t+dt}(α | dx, u) ∝ p(dx | α, u) p_t(α)

The solution is to augment the state space with parameters θ_t (sufficient statistics) that describe p_t(α) = p(α | θ_t), with θ_0 known. Then, with α = ±1 and p_t(α = 1) = σ(θ_t):

dθ = (u/ν) dx = (u/ν)(α u dt + dξ)

NB: in the forward pass dθ = F(dx), thus θ is also observed.
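A small simulation of this belief update (the values of ν and the constant exploratory control below are illustrative):

import numpy as np

# with alpha = +/-1 and p_t(alpha = 1) = sigma(theta_t),
# each observed dx shifts theta by (u/nu) dx
rng = np.random.default_rng(3)
nu, dt, u = 0.5, 0.01, 1.0        # constant control, for illustration only
alpha = -1.0                      # true (unobserved) parameter
theta = 0.0                       # flat prior: p_0(alpha = 1) = sigma(0) = 0.5

for _ in range(500):
    dx = alpha * u * dt + rng.normal(0.0, np.sqrt(nu * dt))
    theta += (u / nu) * dx        # sufficient-statistic update d(theta) = (u/nu) dx

print(1.0 / (1.0 + np.exp(-theta)))   # p(alpha = 1); approaches 0, since alpha = -1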


With z_t = (x_t, θ_t) we obtain a standard HJB:

−∂_t J(t, z) dt = min_u ( R(t, z, u) dt + ⟨dz⟩_z ∂_z J(z, t) + (1/2) ⟨dz²⟩_z ∂²_z J(z, t) )

with boundary condition J(z, T) = φ(x).

The expectation values are conditioned on (x_t, θ_t) and are averages over p(α | θ_t) and the Gaussian noise, cf.

⟨dx⟩_{x,θ} = ⟨α u dt + dξ⟩_{x,θ} = ᾱ u dt,    ᾱ = ∫ dα p(α | θ) α

[Figure: expansion of J(z, t+dt) between t and tf using ⟨dz⟩_z and ⟨dz²⟩_z.]


Certainty equivalence

An important special case of a partially observable control problem is the Kalman filter (y observed, x not observed).

dx = (x + u) dt + dξ
y = x + η

The cost is quadratic in x and u, for instance

C(x_t, t, u_{t:T}) = ⟨ ∑_{τ=t}^{T} (1/2)(x_τ² + u_τ²) ⟩

The optimal control is u(x, t).

When x_t is not observed, we can compute p(x_t | y_{0:t}) using Kalman filtering and the optimal control minimizes

C_KF(y_{0:t}, t, u_{t:T}) = ∫ dx_t p(x_t | y_{0:t}) C(x_t, t, u_{t:T})


Since p(x_t | y_{0:t}) = N(x_t | µ_t, σ_t²) is Gaussian and

C_KF(y_{0:t}, t, u_{t:T}) = ∫ dx_t C(x_t, t, u_{t:T}) N(x_t | µ_t, σ_t²) = ∑_{τ=t}^{T} (1/2) u_τ² + ∑_{τ=t}^{T} (1/2) ⟨x_τ²⟩_{µ_t, σ_t}
                          = · · · = C(µ_t, t, u_{t:T}) + (1/2)(T − t) σ_t²

The first term is identical to the observed case with x_t → µ_t. The second term does not depend on u and thus does not affect the optimal control.

The optimal control for the Kalman filter is identical to the observed case with x_t replaced by µ_t:

u_KF(y_{0:t}, t) = u(µ_t, t)
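A sketch of certainty equivalence in simulation: the LQ gain P(t) from the earlier example is computed as if the state were observed, a Kalman filter tracks µ_t, and the applied control is u = −P(t) µ_t. The time discretization and noise levels are illustrative, not from the slides:

import numpy as np

rng = np.random.default_rng(1)
T, dt = 10.0, 0.01
n = int(T / dt)
nu, r = 0.1, 0.05                      # process and observation noise variances

# LQ gain: -P' = 2P + 1 - P^2, P(T) = 0, integrated backward
P = np.zeros(n + 1)
for k in range(n, 0, -1):
    P[k - 1] = P[k] + dt * (2.0 * P[k] + 1.0 - P[k] ** 2)

x, mu, s2 = 1.0, 0.0, 1.0              # true state, filter mean, filter variance
for k in range(n):
    u = -P[k] * mu                     # certainty-equivalent control
    x += (x + u) * dt + rng.normal(0.0, np.sqrt(nu * dt))   # dx = (x+u)dt + dxi
    mu += (mu + u) * dt                # Kalman filter: predict ...
    s2 += (2.0 * s2 + nu) * dt
    y = x + rng.normal(0.0, np.sqrt(r))                     # ... observe y = x + eta
    K = s2 / (s2 + r)                  # ... and update
    mu, s2 = mu + K * (y - mu), (1.0 - K) * s2

print(x, mu)    # both regulated toward 0; the control only ever uses mu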


Summary

For infinite time problems, learning is a meta problem with a time scale unrelated to the horizon time.

A finite time partially observable (or adaptive) control problem is in general equivalent to an observable non-adaptive control problem in an extended state space.

The partially observable case is generally more complex than the observable case. For instance an LQ problem with unknown parameters is not LQ.

For a Kalman filter with unobserved states and known parameters, the partial observability does not affect the optimal control law (certainty equivalence).

Learning can be done either in the maximum likelihood sense or in a full Bayesian way.


Path integral control

Solving the PDE is hard. We consider a special case that can be ’solved’.

dx = ( f(x, t) + u ) dt + dξ
R(x, u, t) = V(x, t) + (1/2) u^T R u

and φ(x) arbitrary.

The stochastic HJB equation becomes:

−∂_t J = min_u ( (1/2) u^T R u + V + (f + u)^T ∂_x J + (1/2) Tr(ν ∂²_x J) )
       = −(1/2) (∂_x J)^T R^{−1} (∂_x J) + V + f^T ∂_x J + (1/2) Tr(ν ∂²_x J)

u = −R^{−1} ∂_x J(x, t)

The equation for J must be solved backward in time with boundary condition J(T, x) = φ(x).


Closed form solution

If we further assume that R^{−1} = λν for some scalar λ, then we can solve J in closed form. The solution contains the following steps:

Substitute J = −λ log Ψ in the HJB equations. Because of the relation R^{−1} = λν the terms quadratic in Ψ cancel and only linear terms remain.

∂_t Ψ = −H Ψ,    H = −V/λ + f ∂_x + (1/2) ν ∂²_x

This equation must be solved backward in time with boundary condition Ψ(x, T) = exp(−φ(x)/λ).

The linearity allows us to reverse the direction of computation, replacing it by a diffusion process, in the following way.


Closed form solution

Let ρ(y, τ | x, t) describe a diffusion process defined by the Fokker-Planck equation

∂_τ ρ = −(V/λ) ρ − ∂_y(f ρ) + (1/2) ν ∂²_y ρ = H† ρ    (1)

with ρ(y, t | x, t) = δ(y − x).

Define

A(x, t) = ∫ dy ρ(y, τ | x, t) Ψ(y, τ).

It is easy to see, by using the equations of motion for Ψ and ρ, that A(x, t) is independent of τ. Evaluating A(x, t) for τ = t yields A(x, t) = Ψ(x, t); evaluating A(x, t) for τ = T yields A(x, t) = ∫ dy ρ(y, T | x, t) Ψ(y, T). Equating the two,

Ψ(x, t) = ∫ dy ρ(y, T | x, t) exp(−φ(y)/λ)


Forward sampling of the diffusion process

The diffusion equation

∂_τ ρ = −(V/λ) ρ − ∂_y(f ρ) + (1/2) ν ∂²_y ρ    (2)

can be sampled as

dx = f(x, t) dt + dξ
x → x + dx,    with probability 1 − V(x, t) dt / λ
x → †,         with probability V(x, t) dt / λ


Forward sampling of the diffusion process

We can estimate

Ψ(x, t) = ∫ dy ρ(y, T | x, t) exp(−φ(y)/λ) ≈ (1/N) ∑_{i ∈ alive} exp( −φ(x_i(T))/λ )

by computing N trajectories x_i(t → T), i = 1, . . . , N.

'Alive' denotes the subset of trajectories that do not get killed along the way by the † operation.
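A sketch of this estimator for a 1-D example; the drift f, the potential V, the end cost φ and all numerical values below are illustrative. With R = 1 and λ = ν = 1 the relation R⁻¹ = λν holds:

import numpy as np

rng = np.random.default_rng(2)
lam = nu = 1.0
T, dt, N = 1.0, 0.01, 10_000
f = lambda x, t: -x                       # uncontrolled drift
V = lambda x, t: 0.5 * x ** 2             # path cost potential
phi = lambda x: (x - 1.0) ** 2            # end cost

def estimate_psi(x0, t0=0.0):
    x = np.full(N, x0)
    alive = np.ones(N, dtype=bool)
    for t in np.arange(t0, T, dt):
        # diffuse the surviving particles: dx = f dt + dxi, <dxi^2> = nu dt
        x[alive] += f(x[alive], t) * dt + rng.normal(0.0, np.sqrt(nu * dt), alive.sum())
        # kill each particle with probability V dt / lambda
        alive &= rng.random(N) >= V(x, t) * dt / lam
    # Psi ~ (1/N) * sum over surviving particles of exp(-phi / lambda)
    return np.exp(-phi(x[alive]) / lam).sum() / N

psi = estimate_psi(0.0)
print("Psi ~", psi, "  J = -lambda*log(Psi) ~", -lam * np.log(psi))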


The diffusion process

The diffusion process can be written as a path integral:

ρ(y, T | x, t) = ∫ [dx]_x^y exp( −(1/ν) S_path(x(t → T)) )

S_path(x(t → T)) = ∫_t^T dτ [ (1/2)( ẋ(τ) − f(x(τ), τ) )² + V(x(τ), τ) ]

[Figure: sample paths from x at time t to the end time tf.]


The path integral formulation

Ψ(x, t) = ∫ dy ρ(y, T | x, t) exp( −φ(y)/ν ) = ∫ [dx]_x exp( −(1/ν) S(x(t → T)) )

S(x(t → T)) = S_path(x(t → T)) + φ(x(T))

Ψ is a partition sum and J = −ν log Ψ therefore can be interpreted as a free energy. S is the energy of a path and ν the temperature.

The corresponding probability distribution is

p(x(t → T) | x, t) = (1/Ψ(x, t)) exp( −(1/ν) S(x(t → T)) )


An example: double slit

dx = u dt + dξ

C = ⟨ (1/2) x(T)² + ∫_0^T dτ [ (1/2) u(τ)² + V(x(τ), τ) ] ⟩

V(x, t = 1) implements a slit at an intermediate time t = 1.

Ψ(x, t) = ∫ dy ρ(y, T | x, t) Ψ(y, T)

can be solved in closed form.

[Figures: the double-slit potential at t = 1, and the optimal cost-to-go J(x, t) for t = 0, 0.99, 1.01 and 2.]


MC sampling on double slit

J(x, t) ≈ −ν log ( (1/N) ∑_{i=1}^{N} exp( −φ(x_i(T))/ν ) )

[Figures: (a) naive sample paths; (b) J(x, t = 0) by naive sampling (n = 100000), MC vs. exact; (c) J(x, t = 0) by Laplace approximation and importance sampling (n = 100).]


The delayed choice

If we go further back in time we encounter a new phenomenon: a delayed choice of when to decide.

When the slit size goes to zero, J is given by

J(x, T) = −ν log ∫ dy ρ(y | x) e^{−φ(y)/ν} = (1/T) ( (1/2) x² − νT log 2 cosh( x/(νT) ) )

where T is the time to reach the slits. The expression between brackets is a typical free energy with temperature νT.

[Figure: J(x, t) as a function of x for T = 2, 1, 0.5.]

Symmetry breaking at νT = 1 separates two qualitatively different behaviors.
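Evaluating the free energy above on a grid shows the symmetry breaking (a quick check, with ν = 1 assumed): for νT > 1 the cost-to-go has a single minimum at x = 0, while for νT < 1 two minima appear and the controller must choose a side:

import numpy as np

nu = 1.0
x = np.linspace(-2.0, 2.0, 401)
for T in (2.0, 1.0, 0.5):
    J = (0.5 * x ** 2 - nu * T * np.log(2.0 * np.cosh(x / (nu * T)))) / T
    # single minimum at 0 for nu*T >= 1, two symmetric minima for nu*T < 1
    print(f"T = {T}:  argmin J at x = {x[np.argmin(J)]:+.2f}")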


The delayed choice

[Figures: sample trajectories under optimal stochastic control and under deterministic control.]

The timing of the decision, that is when the automaton decides to go left or right, is the consequence of spontaneous symmetry breaking.


N joint arm in a plane

[Figures: configurations of the n-joint arm in the plane.]

dθ_i = u_i dt + dξ_i,    i = 1, . . . , n

C = ⟨ φ(x_n(θ)) + ∫ dτ (1/2) u(τ)² ⟩

u_i = ( ⟨θ_i⟩ − θ_i ) / (T − t),    i = 1, . . . , n

⟨· · ·⟩ ∝ ρ(θ, T | θ_0, t) exp( −φ(x_n(θ)) )

⟨θ⟩ from uncontrolled penalized diffusion. Variational approximation.
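A sketch of this rule for a planar arm with unit-length links; the noise level, horizon, target and end cost below are illustrative. The endpoints of the uncontrolled diffusion are weighted by exp(−φ(x_n(θ))) and each joint is steered toward the weighted mean (plain importance sampling in place of the variational approximation):

import numpy as np

rng = np.random.default_rng(4)
n, nu, T, t, dt, N = 3, 0.1, 1.0, 0.0, 0.01, 5000
target = np.array([2.0, 1.0])                    # desired end-effector position

def end_effector(theta):                         # x_n(theta) for unit links
    angles = np.cumsum(theta, axis=-1)
    return np.stack([np.cos(angles).sum(-1), np.sin(angles).sum(-1)], axis=-1)

def phi(theta):                                  # end cost: squared distance to target
    return ((end_effector(theta) - target) ** 2).sum(-1)

theta0 = np.zeros(n)
steps = int((T - t) / dt)
# endpoints of the uncontrolled diffusion d(theta) = dxi from t to T
theta = theta0 + rng.normal(0.0, np.sqrt(nu * dt), (N, steps, n)).sum(axis=1)
ph = phi(theta)
w = np.exp(-(ph - ph.min()))                     # weights proportional to exp(-phi)
theta_mean = (w[:, None] * theta).sum(0) / w.sum()
u = (theta_mean - theta0) / (T - t)              # control at time t for each joint
print(u)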


Other path integral control issues

- Coordination of multiple agents
- Relation between reinforcement learning and learning
- Realistic robotics applications; non-differentiable cost functions (obstacles)


Summary

Deterministic control can be solved by
- HJB, a PDE
- PMP, two coupled ODEs with mixed boundary conditions

Stochastic control can be solved by
- HJB in general
- Riccati equation for sufficient statistics in the LQ case
- Path integral is LQ control with arbitrary cost and dynamics
- RL is a special case

Learning (PO states or parameter values)
- Decoupled from control in the RL case
- Joint inference and control typically harder than control only
- For Kalman filters the PO is irrelevant (certainty equivalence)


Summary

The PI control problem provides a novel link to machine learning:
- statistical physics
- symmetry breaking, 'phase transitions'
- 'efficient' computation (MCMC, BP, MF, EP)


Further reading

Check out the ICML tutorial paper and references there and other general references at the tutorial web page

http://ml.cs.tu-berlin.de/~mtoussai/08-optimal-control/
