Approximate Reinforcement Learning (Optimal Adaptive Control workshop: Part II)
Lucian Busoniu, SequeL, INRIA Lille, France
CDC-ECC 2011, 11 December, Orlando
Reinforcement learning perspectives
Learn to optimally control an unknown system
... or stated in AI terms ...
Learn to optimally behave in an unknown environment
This talk: methods developed in AI, with a focus on control problems
Reminder: Markov decision process
Observe states $x$, apply actions $u$, receive rewards $r$
System: stochastic dynamics $x_{k+1} \sim f(x_k, u_k, \cdot)$
Performance: reward function $r_{k+1} = r(x_k, u_k, x_{k+1})$
Controller: $u_k = h(x_k)$
Goal
Find $h$ that maximizes, from any $x_0$, the expected discounted return:
$R^h(x_0) = E\left\{ \sum_{k=0}^{\infty} \gamma^k r_{k+1} \,\middle|\, h \right\}$
Equivalent to minimizing the expected cost:
$E\{J(x_0) \mid h\} = E\left\{ \sum_{k=0}^{\infty} \gamma^k \cdot (-r_{k+1}) \,\middle|\, h \right\}$
Solution using Q-functions
Q-function of $h$: $Q^h(x,u) = E\{\, r(x,u,x') + \gamma R^h(x') \,\}$
Bellman equation:
$Q^h(x,u) = E\{\, r(x,u,x') + \gamma Q^h(x', h(x')) \,\}$
Optimal Q-function: $Q^* = \max_h Q^h$
Bellman optimality equation:
$Q^*(x,u) = E\{\, r(x,u,x') + \gamma \max_{u'} Q^*(x',u') \,\}$
Optimal policy: $h^*(x) = \arg\max_u Q^*(x,u)$
Value & policy iteration with Q-functions
Q-value iteration: turn the Bellman optimality equation into an iterative update
repeat
  $Q(x,u) \leftarrow E\{\, r(x,u,x') + \gamma \max_{u'} Q(x',u') \,\}$ for all $x, u$
until convergence to $Q^*$

Policy iteration: iteratively evaluate & improve policies
repeat
  policy evaluation: solve $Q^h(x,u) = E\{\, r(x,u,x') + \gamma Q^h(x', h(x')) \,\}$ (e.g., by using a VI-like update)
  policy improvement: $h(x) \leftarrow \arg\max_u Q^h(x,u)$ for all $x$
until convergence to $h^*$
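As a concrete illustration, here is a minimal Python sketch of tabular Q-value iteration on a made-up 2-state, 2-action MDP; the transition array `P`, reward array `R`, and discount are invented for illustration, not taken from the talk.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: P[u][x, x'] = transition probability,
# R[x, u] = expected reward E{r(x, u, x')}
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # action 0
              [[0.5, 0.5], [0.1, 0.9]]])   # action 1
R = np.array([[1.0, 0.0], [0.5, 2.0]])     # R[x, u]
gamma = 0.9

Q = np.zeros((2, 2))                        # Q[x, u]
for _ in range(1000):
    # Bellman optimality backup: Q(x,u) <- E{ r(x,u,x') + gamma * max_u' Q(x',u') }
    Q_new = R + gamma * np.einsum('uxy,y->xu', P, Q.max(axis=1))
    done = np.max(np.abs(Q_new - Q)) < 1e-8
    Q = Q_new
    if done:                                # converged (to machine precision) to Q*
        break

h_star = Q.argmax(axis=1)                   # greedy (optimal) policy h*(x)
print(Q, h_star)
```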
Example: Inverted pendulum
$x = [\text{angle } \alpha,\ \text{velocity } \dot{\alpha}]^\top$, $u$ = voltage; deterministic dynamics
Goal: stabilize pointing up:
$r(x,u) = -x^\top \begin{bmatrix} 5 & 0 \\ 0 & 0.1 \end{bmatrix} x - u^\top 1\, u$
discount factor $\gamma = 0.98$
Underactuated ⇒ must swing up before stabilizing
Example: Optimal solution
[Figure: left, slice of $Q^*(x,u)$ at $u = 0$; right, optimal policy]
Need for approximation
Classical RL: discrete $x$, $u$; $Q(x,u)$ and $h(x)$ represented exactly
In control problems, $x$, $u$ are typically continuous
⇒ Approximation over $x$, $u$ is necessary
Approximators
Parametric: fixed form and number of parameters
  Linear: $\hat{Q}_W(x,u) = \varphi^\top(x,u) W$, e.g., RBFs
  Nonlinear: e.g., neural networks
Nonparametric: form and number of parameters derived from the data, e.g., locally linear regression
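To make the linear parametric case concrete, below is a sketch of an RBF feature map over a 2-D state combined with discretized actions; the grid sizes, widths, and action values are assumptions made for illustration, not values from the talk.

```python
import numpy as np

# Illustrative RBF centers on a grid over a 2-D state (angle, velocity)
angles = np.linspace(-np.pi, np.pi, 7)
vels = np.linspace(-10, 10, 5)
centers = np.array([[a, v] for a in angles for v in vels])   # N = 35 centers
widths = np.array([np.pi / 3, 5.0])                          # per-dimension RBF widths (assumed)
actions = np.array([-3.0, 0.0, 3.0])                         # M = 3 discretized actions

def phi(x):
    """RBF basis functions over the state, phi: X -> R^N."""
    d = (centers - x) / widths
    return np.exp(-0.5 * np.sum(d ** 2, axis=1))

def q_hat(W, x, u_index):
    """Linear approximator Q_W(x, u_j) = phi(x)^T W[:, j]."""
    return phi(x) @ W[:, u_index]

W = np.zeros((centers.shape[0], len(actions)))               # parameters, N x M
print(q_hat(W, np.array([0.1, -1.0]), 1))                    # evaluate at a sample state-action
```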
Taxonomy of methods
By path to solution:
1 Approximate value iteration: find $Q^*$, use it to compute $h^*$
2 Approximate policy iteration: evaluate $h$ (find $Q^h$), improve $h$, repeat
3 Approximate policy search: look directly for $h^*$

By level of interaction:
1 Offline (batch): data collected in advance
2 Online: learn by interaction
1 Introduction & Recap
2 Offline approximate value iteration & policy iteration
   AVI: Fuzzy Q-iteration
   AVI: Fitted Q-iteration
   API: Least-squares policy iteration
3 Online temporal-difference RL
   Classical Q-learning and SARSA
   Approximate Q-learning and SARSA
4 Optimistic planning
5 Conclusions
Fuzzy approximation
Linear approximator:
  basis functions (BFs) over the state, $\varphi_1, \ldots, \varphi_N : X \to [0,1]$
  discretized actions $u_1, \ldots, u_M \in U$
  parameters $W \in \mathbb{R}^{N \times M}$
Approximate Q-function:
$\hat{Q}_W(x,u) = \varphi^\top(x) W_{*,j} = \sum_{i=1}^{N} \varphi_i(x) W_{i,j}$, with $j = \arg\min_{j'} \| u - u_{j'} \|$ (nearest neighbor)
Assumptions
Deterministic system
BFs normalized: $\sum_{i=1}^{N} \varphi_i(x) = 1$
At the center $x_i$ of each BF, $\varphi_i(x_i) = 1$ and $\varphi_{i'}(x_i) = 0$ for all $i' \neq i$
Simplest BFs satisfying this are triangular ⇒ multilinear interpolation
Fuzzy Q-iteration
Revisit exact Q-iteration:
repeat at each iteration $\ell$
  $Q_{\ell+1}(x,u) \leftarrow r(x,u,x') + \gamma \max_{u'} Q_\ell(x',u')$ for all $x, u$
    $=: [T(Q_\ell)](x,u)$, with $T : \mathcal{Q} \to \mathcal{Q}$ the Bellman mapping
until convergence to $Q^*$
output $Q^*$, $h^*(x) = \arg\max_u Q^*(x,u)$

Fuzzy Q-iteration (Busoniu et al., 2007)
repeat at each iteration $\ell$
  $W_{\ell+1,i,j} \leftarrow [T(\hat{Q}_{W_\ell})](x_i, u_j) = r(x_i, u_j, x') + \gamma \max_{u'} \hat{Q}_{W_\ell}(x',u')$ for all $i, j$
until convergence to $W^*$
output $\hat{Q}_{W^*}$, $\hat{h}_{W^*}(x) = \arg\max_{u_j} \hat{Q}_{W^*}(x, u_j)$
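A minimal sketch of fuzzy Q-iteration on an invented 1-D problem, with normalized triangular BFs over the state and discrete actions; the dynamics `f`, reward `r`, grids, and discount are illustrative assumptions, not the pendulum example from the talk.

```python
import numpy as np

# Toy deterministic 1-D problem (invented): drive the state towards 0
centers = np.linspace(-1.0, 1.0, 11)        # BF centers x_i
actions = np.array([-0.2, 0.0, 0.2])        # discretized actions u_j
gamma = 0.95

def f(x, u):                                 # toy dynamics
    return float(np.clip(x + u, -1.0, 1.0))

def r(x, u, x_next):                         # toy reward
    return -x_next ** 2

def phi(x):
    """Normalized triangular BFs: linear interpolation weights over the centers."""
    w = np.maximum(0.0, 1.0 - np.abs(x - centers) / (centers[1] - centers[0]))
    return w / w.sum()

def q_hat(W, x, j):                          # Q_W(x, u_j) = sum_i phi_i(x) W[i, j]
    return phi(x) @ W[:, j]

W = np.zeros((len(centers), len(actions)))
for _ in range(300):
    W_new = np.empty_like(W)
    for i, xi in enumerate(centers):
        for j, uj in enumerate(actions):
            x_next = f(xi, uj)
            # W_{l+1,i,j} <- [T(Q_{W_l})](x_i, u_j)
            W_new[i, j] = r(xi, uj, x_next) + gamma * max(
                q_hat(W, x_next, jp) for jp in range(len(actions)))
    done = np.max(np.abs(W_new - W)) < 1e-6
    W = W_new
    if done:
        break

policy = [actions[np.argmax([q_hat(W, x, j) for j in range(len(actions))])] for x in centers]
print(policy)
```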
Convergence
$T$ is a contraction with factor $\gamma$: $\| T(Q) - T(Q') \|_\infty \leq \gamma \| Q - Q' \|_\infty$
The approximation $\hat{Q}_W$ (seen as a mapping $\mathbb{R}^{N \times M} \to \mathcal{Q}$) is a non-expansion: $\| \hat{Q}_W - \hat{Q}_{W'} \|_\infty \leq \max_{i,j} | W_{i,j} - W'_{i,j} |$
Therefore:
Theorem
Fuzzy Q-iteration monotonically converges to $W^*$
Solution quality
Characterize the approximation power by the minimum distance to $Q^*$:
$\varepsilon = \min_{W \in \mathbb{R}^{N \times M}} \| Q^* - \hat{Q}_W \|_\infty$
Theorem (continued)
The returned Q-function is near-optimal: $\| Q^* - \hat{Q}_{W^*} \|_\infty \leq \frac{2\varepsilon}{1-\gamma}$
... and the corresponding policy is also near-optimal: $\| Q^* - Q^{\hat{h}_{W^*}} \|_\infty \leq \frac{4\gamma\varepsilon}{(1-\gamma)^2}$
Example: Fuzzy QI for the inverted pendulum
BFs over the state space: triangular, on a 41 × 21 equidistant grid
Action discretization: grid of 5 actions, centered on 0
1 Introduction & Recap
2 Offline approximate value iteration & policy iteration
   AVI: Fuzzy Q-iteration
   AVI: Fitted Q-iteration
   API: Least-squares policy iteration
3 Online temporal-difference RL
   Classical Q-learning and SARSA
   Approximate Q-learning and SARSA
4 Optimistic planning
5 Conclusions
Generalizing fuzzy Q-iteration
Arbitrary approximator, parametric or nonparametric (fuzzy QI: restricted BFs + discrete actions)
Arbitrary transition samples $(x_{i_s}, u_{i_s}, r_{i_s}, x'_{i_s})$, $i_s = 1, \ldots, n_s$ (fuzzy QI: center–discrete-action pairs $(x_i, u_j)$)
Possibly stochastic system
Fitted Q-iteration
Algorithm for a parametric approximator (nonparametric regression can also be used)

Fitted Q-iteration (Ernst et al., 2005)
repeat at each iteration $\ell$
  for each sample $i_s = 1, \ldots, n_s$ do
    compute the target Q-value: $q_{i_s} \leftarrow r_{i_s} + \gamma \max_{u'} \hat{Q}_{W_\ell}(x'_{i_s}, u')$
  end for
  perform regression on the data $(x_{i_s}, u_{i_s}) \to q_{i_s}$:
    $W_{\ell+1} \leftarrow \arg\min_{W} \sum_{i_s=1}^{n_s} \left| q_{i_s} - \hat{Q}_W(x_{i_s}, u_{i_s}) \right|^2$
until finished
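A sketch of fitted Q-iteration using an ensemble-of-trees regressor (in the spirit of Ernst et al., 2005, which used tree-based regression); the batch of samples, toy dynamics, and all constants are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor

# Assumed: a batch of transition samples (x, u, r, x') collected in advance (toy 1-D task)
n_s, gamma = 1000, 0.95
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(n_s, 1))                 # states x_is
U = rng.choice([-0.2, 0.0, 0.2], size=(n_s, 1))       # actions u_is
Xn = np.clip(X + U, -1, 1)                            # next states x'_is (toy dynamics)
Rw = -(Xn ** 2).ravel()                               # rewards r_is
actions = np.array([-0.2, 0.0, 0.2])

reg = None
for _ in range(30):                                   # fitted Q-iteration loop
    if reg is None:
        q_targets = Rw                                # first iteration: Q_0 = 0
    else:
        # q_is <- r_is + gamma * max_u' Q_l(x'_is, u')
        q_next = np.column_stack([reg.predict(np.hstack([Xn, np.full_like(Xn, a)]))
                                  for a in actions])
        q_targets = Rw + gamma * q_next.max(axis=1)
    # regression on (x_is, u_is) -> q_is
    reg = ExtraTreesRegressor(n_estimators=50, random_state=0)
    reg.fit(np.hstack([X, U]), q_targets)
```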
Fitted Q-iteration (cont’d)
If the system is stochastic, we actually aim for: $\hat{Q}_W(x,u) \approx E\{\, r(x,u,x') + \gamma \max_{u'} \hat{Q}_{W_\ell}(x',u') \,\}$
– regression on $(x_{i_s}, u_{i_s}) \to q_{i_s}$ naturally achieves this
Convergence to a near-optimal region (given MDP and approximator characteristics)
(Munos & Szepesvari, 2008)
[Figure: left, fitted QI; right, fuzzy QI]
Example: Fitted QI for the inverted pendulum
Approximator over the state space: local linear regression
Action discretization: 3 actions
Samples: on a 31 × 15 × 3 grid
1 Introduction & Recap
2 Offline approximate value iteration & policy iteration
   AVI: Fuzzy Q-iteration
   AVI: Fitted Q-iteration
   API: Least-squares policy iteration
3 Online temporal-difference RL
   Classical Q-learning and SARSA
   Approximate Q-learning and SARSA
4 Optimistic planning
5 Conclusions
Approximate policy iteration
Revisit exact policy iteration:
repeat
  policy evaluation: solve $Q^{h_\ell}(x,u) = E\{\, r(x,u,x') + \gamma Q^{h_\ell}(x', h_\ell(x')) \,\} =: [T^{h_\ell}(Q^{h_\ell})](x,u)$, with $T^h$ the policy evaluation mapping
  policy improvement: $h_{\ell+1}(x) \leftarrow \arg\max_u Q^{h_\ell}(x,u)$ for all $x$
until convergence to $h^*$

Approximate policy iteration
repeat
  approximate policy evaluation: solve $\hat{Q}^{h_\ell} \approx T^{h_\ell}(\hat{Q}^{h_\ell})$
  policy improvement: $h_{\ell+1}(x) \leftarrow \arg\max_u \hat{Q}^{h_\ell}(x,u)$
until finished
Guarantees
Theorem (Bertsekas & Tsitsiklis 1996, Lagoudakis & Parr 2003)
If every policy evaluation is $\varepsilon$-accurate: $\| \hat{Q}^{h_\ell} - Q^{h_\ell} \|_\infty \leq \varepsilon$,
then a near-optimal policy is eventually obtained:
$\limsup_{\ell \to \infty} \| Q^{h_\ell} - Q^* \|_\infty \leq \frac{2\gamma\varepsilon}{(1-\gamma)^2}$
Projected policy evaluation (LSTD)
Linear approximator $\hat{Q}_W(x,u) = \varphi^\top(x,u) W$
$\hat{Q}^h \approx T^h(\hat{Q}^h)$ instantiated to $\hat{Q}_W = P[T^h(\hat{Q}_W)]$, with $P$ a weighted least-squares projection
Has a meaningful solution under conditions on $P$
Called least-squares temporal difference (LSTD)
Projected policy evaluation (cont’d)
$\hat{Q}_W = P[T^h(\hat{Q}_W)]$
$\hat{Q}_W$ is linear in $W$; $T^h(Q)$ is linear in $Q$; $P(Q)$ is linear in $Q$ ⇒ rewrite as a linear equation in $W$:
$A W = b$
$A$, $b$ can be estimated from samples $(x_{i_s}, u_{i_s}, r_{i_s}, x'_{i_s})$:
$A \leftarrow A + \varphi(x_{i_s}, u_{i_s}) \varphi^\top(x_{i_s}, u_{i_s}) - \gamma\, \varphi(x_{i_s}, u_{i_s}) \varphi^\top(x'_{i_s}, h(x'_{i_s}))$
$b \leftarrow b + \varphi(x_{i_s}, u_{i_s})\, r_{i_s}$
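A sketch of how the estimates of $A$ and $b$ could be accumulated from samples and then solved for $W$; the feature map `phi`, the policy `h`, and the toy samples are assumptions made for illustration.

```python
import numpy as np

# Assumed ingredients: a feature map phi(x, u), a batch of samples, and a policy h to evaluate
rng = np.random.default_rng(1)
gamma = 0.95
n_feat = 8

def phi(x, u):                      # illustrative feature vector phi(x, u) in R^n_feat
    return np.array([1, x, u, x * u, x**2, u**2, x**2 * u, x * u**2], dtype=float)

def h(x):                           # policy being evaluated (assumed given)
    return -0.5 * x

A = np.zeros((n_feat, n_feat))
b = np.zeros(n_feat)
for _ in range(2000):               # loop over transition samples (x, u, r, x')
    x = rng.uniform(-1, 1)
    u = rng.uniform(-1, 1)
    x_next = float(np.clip(x + u, -1, 1))   # toy dynamics
    r = -x_next**2                           # toy reward
    p = phi(x, u)
    # A <- A + phi(x,u) phi(x,u)^T - gamma * phi(x,u) phi(x', h(x'))^T
    A += np.outer(p, p) - gamma * np.outer(p, phi(x_next, h(x_next)))
    b += p * r                               # b <- b + phi(x,u) r

W = np.linalg.lstsq(A, b, rcond=None)[0]     # projected policy evaluation: solve A W = b
print(W)
```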
Least-squares policy iteration
LSPI (Lagoudakis & Parr, 2003)
repeat
  use samples to build estimates $A_\ell$, $b_\ell$ for $h_\ell$
  projected policy evaluation: solve $A_\ell W_\ell = b_\ell$
  policy improvement: $h_{\ell+1}(x) \leftarrow \arg\max_u \hat{Q}_{W_\ell}(x,u)$
until finished
Example: LSPI for the inverted pendulum
BFs over the state space: RBFs on a 15 × 9 grid
Action discretization: 3 actions
Samples: 7500 transitions from random (x, discrete u) pairs
Comparison between API and AVI
Similar theoretical behavior in general: convergence to a near-optimal region
In practice, API typically converges in fewer iterations...
...but also has larger complexity per iteration
1 Introduction & Recap
2 Offline approximate value iteration & policy iteration
   AVI: Fuzzy Q-iteration
   AVI: Fitted Q-iteration
   API: Least-squares policy iteration
3 Online temporal-difference RL
   Classical Q-learning and SARSA
   Approximate Q-learning and SARSA
4 Optimistic planning
5 Conclusions
Q-learning
1 Take the (model-based) Q-iteration update: $Q_{\ell+1}(x,u) \leftarrow E\{\, r(x,u,x') + \gamma \max_{u'} Q_\ell(x',u') \,\}$
2 Use the transition sample $(x_k, u_k, r_{k+1}, x_{k+1})$ at each step $k$: $Q(x_k, u_k) \leftarrow r_{k+1} + \gamma \max_{u'} Q(x_{k+1}, u')$
  – note that $x_{k+1} = x'$ and $r_{k+1} = r(x,u,x')$ in the deterministic case; in the stochastic case they just provide a sample of the right-hand side
3 Make the update incremental with a learning rate $\alpha_k \in (0,1]$:
  $Q(x_k, u_k) \leftarrow Q(x_k, u_k) + \alpha_k \left[ r_{k+1} + \gamma \max_{u'} Q(x_{k+1}, u') - Q(x_k, u_k) \right]$
Temporal difference
$\left[ r_{k+1} + \gamma \max_{u'} Q(x_{k+1}, u') - Q(x_k, u_k) \right]$
is the Q-learning temporal difference: the “error” in the Bellman optimality equation for the current sample
Q-learning algorithm
Q-learning
for every trial do
  initialize $x_0$
  repeat for each step $k$
    take action $u_k$
    measure $x_{k+1}$, receive $r_{k+1}$
    $Q(x_k, u_k) \leftarrow Q(x_k, u_k) + \alpha_k \left[ r_{k+1} + \gamma \max_{u'} Q(x_{k+1}, u') - Q(x_k, u_k) \right]$
  until trial finished
end for
Exploration-exploitation tradeoff
Essential condition for convergence to $Q^*$: all $(x,u)$ pairs must be visited infinitely often
⇒ Exploration is necessary: sometimes, choose actions randomly
Exploitation of current knowledge is also necessary: sometimes, choose actions greedily

Simple solution: $\varepsilon$-greedy
$u_k = \begin{cases} \arg\max_u Q(x_k, u) & \text{with probability } 1 - \varepsilon_k \\ \text{a random action} & \text{with probability } \varepsilon_k \end{cases}$
with exploration probability $\varepsilon_k \in (0,1)$ decreasing over time
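A minimal sketch of tabular Q-learning with $\varepsilon$-greedy exploration on an invented 5-state chain task; the environment, exploration schedule, and learning rate are illustrative assumptions, not from the talk.

```python
import numpy as np

# Toy deterministic chain MDP (invented): 5 states, 2 actions (left/right), goal at state 4
n_states, n_actions, gamma = 5, 2, 0.9
rng = np.random.default_rng(2)

def step(x, u):
    x_next = min(x + 1, n_states - 1) if u == 1 else max(x - 1, 0)
    reward = 1.0 if x_next == n_states - 1 else 0.0
    return x_next, reward

Q = np.zeros((n_states, n_actions))
for trial in range(200):
    x = 0
    eps = 1.0 / (trial + 1)                    # decreasing exploration probability eps_k
    alpha = 0.5                                # learning rate alpha_k (kept constant here)
    for k in range(50):
        # eps-greedy action selection
        u = rng.integers(n_actions) if rng.random() < eps else int(Q[x].argmax())
        x_next, r = step(x, u)
        # temporal-difference update
        Q[x, u] += alpha * (r + gamma * Q[x_next].max() - Q[x, u])
        x = x_next
        if x == n_states - 1:
            break

print(Q.argmax(axis=1))                        # learned greedy policy
```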
SARSA
Just like Q-learning, but starting from the policy evaluation equation: $Q^h(x,u) = E\{\, r(x,u,x') + \gamma Q^h(x', h(x')) \,\}$
We get the temporal-difference update:
$Q(x_k, u_k) \leftarrow Q(x_k, u_k) + \alpha_k \left[ r_{k+1} + \gamma Q(x_{k+1}, u_{k+1}) - Q(x_k, u_k) \right]$
Two algorithms can be obtained:
1 Policy held fixed ⇒ online variant of policy evaluation
2 Policy greedy in the current Q-function ⇒ SARSA
($x_k, u_k, r_{k+1}, x_{k+1}, u_{k+1}$ = State, Action, Reward, State, Action)
SARSA Algorithm
SARSA with $\varepsilon$-greedy exploration
for every trial do
  initialize $x_0$
  $u_0 = \begin{cases} \arg\max_u Q(x_0, u) & \text{w.p. } 1 - \varepsilon_0 \\ \text{random} & \text{w.p. } \varepsilon_0 \end{cases}$
  repeat for each step $k$
    apply $u_k$, measure $x_{k+1}$, receive $r_{k+1}$
    $u_{k+1} = \begin{cases} \arg\max_u Q(x_{k+1}, u) & \text{w.p. } 1 - \varepsilon_{k+1} \\ \text{random} & \text{w.p. } \varepsilon_{k+1} \end{cases}$
    $Q(x_k, u_k) \leftarrow Q(x_k, u_k) + \alpha_k \left[ r_{k+1} + \gamma Q(x_{k+1}, u_{k+1}) - Q(x_k, u_k) \right]$
  until trial finished
end for
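To make the contrast with Q-learning explicit (anticipating the next slide), here is a small sketch of the two update rules side by side; the tabular `Q` array, `gamma`, and `alpha` are illustrative values in the spirit of the earlier tabular sketch, not quantities from the talk.

```python
import numpy as np

Q = np.zeros((5, 2))          # tabular Q-function, as in the Q-learning sketch above
gamma, alpha = 0.9, 0.5

def q_learning_update(x, u, r, x_next):
    # off-policy: bootstrap with max_u' Q(x_{k+1}, u')
    Q[x, u] += alpha * (r + gamma * Q[x_next].max() - Q[x, u])

def sarsa_update(x, u, r, x_next, u_next):
    # on-policy: bootstrap with Q(x_{k+1}, u_{k+1}) for the action actually taken
    Q[x, u] += alpha * (r + gamma * Q[x_next, u_next] - Q[x, u])

# example: one transition (x=0, u=1, r=0, x'=1), with next action u'=1 for SARSA
q_learning_update(0, 1, 0.0, 1)
sarsa_update(0, 1, 0.0, 1, 1)
```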
On-policy vs. off-policy
SARSA is on-policy: $\left[ r_{k+1} + \gamma Q(x_{k+1}, u_{k+1}) - Q(x_k, u_k) \right]$
– updates towards the value of the current policy being used ⇒ the policy must converge to the greedy policy
Q-learning is off-policy: $\left[ r_{k+1} + \gamma \max_{u'} Q(x_{k+1}, u') - Q(x_k, u_k) \right]$
– updates towards the optimal value, regardless of the policy ⇒ any policy can be used to learn
1 Introduction & Recap
2 Offline approximate value iteration & policy iteration
   AVI: Fuzzy Q-iteration
   AVI: Fitted Q-iteration
   API: Least-squares policy iteration
3 Online temporal-difference RL
   Classical Q-learning and SARSA
   Approximate Q-learning and SARSA
4 Optimistic planning
5 Conclusions
Approximate Q-learning
Parametric approximation $\hat{Q}_W(x,u)$
Gradient descent on the error $[Q^*(x_k, u_k) - \hat{Q}_W(x_k, u_k)]^2$:
$W \leftarrow W - \frac{1}{2} \alpha_k \frac{\partial}{\partial W} \left[ Q^*(x_k, u_k) - \hat{Q}_W(x_k, u_k) \right]^2 = W + \alpha_k \frac{\partial \hat{Q}_W(x_k, u_k)}{\partial W} \left[ Q^*(x_k, u_k) - \hat{Q}_W(x_k, u_k) \right]$
Estimate $Q^*(x_k, u_k)$ using the Bellman optimality equation:
$W \leftarrow W + \alpha_k \frac{\partial \hat{Q}_W(x_k, u_k)}{\partial W} \left[ r_{k+1} + \gamma \max_{u'} \hat{Q}_W(x_{k+1}, u') - \hat{Q}_W(x_k, u_k) \right]$
(approximate temporal difference)
Approximate Q-learning algorithm
Approximate Q-learning, $\varepsilon$-greedy (Sutton & Barto, 1998)
for every trial do
  initialize $x_0$
  repeat at each step $k$
    $u_k = \begin{cases} \arg\max_u \hat{Q}_W(x_k, u) & \text{w.p. } 1 - \varepsilon_k \\ \text{random} & \text{w.p. } \varepsilon_k \end{cases}$
    apply $u_k$, measure $x_{k+1}$, receive $r_{k+1}$
    $W \leftarrow W + \alpha_k \frac{\partial \hat{Q}_W(x_k, u_k)}{\partial W} \left[ r_{k+1} + \gamma \max_{u'} \hat{Q}_W(x_{k+1}, u') - \hat{Q}_W(x_k, u_k) \right]$
  until trial finished
end for
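A sketch of approximate Q-learning with a linear-in-parameters Q-function, where the gradient $\partial \hat{Q}_W / \partial W$ is simply the feature vector; the features, toy dynamics, reward, and constants are invented for illustration.

```python
import numpy as np

# Approximate Q-learning with a linear Q-function (illustrative toy setup)
rng = np.random.default_rng(3)
gamma, alpha, eps = 0.95, 0.05, 0.1
actions = np.array([-1.0, 0.0, 1.0])

def phi(x, u):                                 # feature vector for Q_W(x,u) = phi(x,u)^T W
    return np.array([1.0, x, u, x * u, x**2, u**2])

W = np.zeros(6)

def q_hat(x, u):
    return phi(x, u) @ W

x = 0.0
for k in range(5000):
    if rng.random() < eps:                     # eps-greedy exploration
        u = rng.choice(actions)
    else:
        u = actions[np.argmax([q_hat(x, a) for a in actions])]
    # toy stochastic dynamics and reward
    x_next = float(np.clip(0.9 * x + 0.3 * u + 0.05 * rng.standard_normal(), -2, 2))
    r = -x_next**2 - 0.01 * u**2
    # approximate temporal difference; for a linear approximator dQ_W/dW = phi(x_k, u_k)
    td = r + gamma * max(q_hat(x_next, a) for a in actions) - q_hat(x, u)
    W += alpha * phi(x, u) * td
    x = x_next

print(W)
```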
Approximate SARSA
Approximate SARSA, $\varepsilon$-greedy (Sutton & Barto, 1998)
for every trial do
  initialize $x_0$
  $u_0 = \begin{cases} \arg\max_u \hat{Q}_W(x_0, u) & \text{w.p. } 1 - \varepsilon_0 \\ \text{random} & \text{w.p. } \varepsilon_0 \end{cases}$
  repeat at each step $k$
    apply $u_k$, measure $x_{k+1}$, receive $r_{k+1}$
    $u_{k+1} = \begin{cases} \arg\max_u \hat{Q}_W(x_{k+1}, u) & \text{w.p. } 1 - \varepsilon_{k+1} \\ \text{random} & \text{w.p. } \varepsilon_{k+1} \end{cases}$
    $W \leftarrow W + \alpha_k \frac{\partial \hat{Q}_W(x_k, u_k)}{\partial W} \left[ r_{k+1} + \gamma \hat{Q}_W(x_{k+1}, u_{k+1}) - \hat{Q}_W(x_k, u_k) \right]$
  until trial finished
end for
Convergence
Assumptions: linear parametrization; technical conditions on the policy
Results:
1 Approximate SARSA converges
2 Approximate Q-learning converges when the policy is fixed
(Melo et al., 2008)
Application: Robotic goalkeeper
Vision-based control: learn how to catch a ball using the video camera image
6 states, 2 actions (motor torques)
(Adam et al., 2011)
Robot goalkeeper details
Challenge: fast learning from only a little data (≈25 samples per ball shot with a 25 FPS camera)
Modifications:
Reuse transition samples (“experience replay”)
Reduce dimensionality: $x = [\varphi_{\text{ball}}, \varphi_1]^\top$, $u = \tau_1$
(& control the second arm so the camera points forward)
Guide the initial learning phase with a heuristic controller (“easy missions”)
Reward $r = -x^\top \begin{bmatrix} 1 & 0 \\ 0 & 0.01 \end{bmatrix} x$, discount $\gamma = 0.8$
Demo: online RL for robot goalkeeper
1 Introduction & Recap
2 Offline approximate value iteration & policy iteration
   AVI: Fuzzy Q-iteration
   AVI: Fitted Q-iteration
   API: Least-squares policy iteration
3 Online temporal-difference RL
   Classical Q-learning and SARSA
   Approximate Q-learning and SARSA
4 Optimistic planning
5 Conclusions
Idea: Online, receding-horizon planning
At each step $k$:
1 Explore possible policies (e.g., action sequences) from $x_k$
2 Choose $u_k$ from the resulting information
Also a type of model-predictive control
Optimistic planning
Optimistic planning for MDPs (Busoniu & Munos, 2011)
initialize a tree with root $x_k$
repeat $n$ times
  find the optimistic partial policy
  expand the node with the largest uncertainty
output a near-optimal action
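A sketch of the idea for a deterministic system (a simpler relative of the OP-MDP algorithm cited above, not that algorithm itself); the toy dynamics, reward scaling to [0, 1], and expansion budget are assumptions made for illustration.

```python
import numpy as np

gamma = 0.95
actions = [-1.0, 1.0]

def f(x, u):                                       # toy deterministic dynamics
    return float(np.clip(x + 0.1 * u, -1.0, 1.0))

def r(x, u, x_next):                               # toy reward, scaled to [0, 1]
    return 1.0 - x_next ** 2

def plan(x0, n_expansions=100):
    """Grow a planning tree from x0; each leaf is (state, return so far, depth, first action)."""
    leaves = [(x0, 0.0, 0, None)]
    for _ in range(n_expansions):
        # pick the optimistic leaf: largest b-value = return so far + gamma^d / (1 - gamma)
        idx = max(range(len(leaves)),
                  key=lambda i: leaves[i][1] + gamma ** leaves[i][2] / (1 - gamma))
        x, ret, d, first_u = leaves.pop(idx)
        for u in actions:                          # expand it with every action
            xn = f(x, u)
            leaves.append((xn, ret + gamma ** d * r(x, u, xn), d + 1,
                           u if first_u is None else first_u))
    best = max(leaves, key=lambda leaf: leaf[1])   # best explored sequence (lower bound)
    return best[3]                                 # its first action, applied as u_k

x = 0.8
for k in range(20):                                # receding-horizon (MPC-style) control loop
    x = f(x, plan(x))
print(x)
```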
Online treatment of HIV (simulation)
Optimistic planning vs. full treatment
Infection eventually controlled without drugs
Outlook
Additional topics:
Better exploration–exploitation
Policy search & gradient methods
Algorithm variants & extensions

Open issues in “AI” RL:
Large-scale systems
Safe learning
Good applications
Tighter integration with “control” ADP & RL!
Conclusion
Approximate reinforcement learning = learn how to near-optimally control unknown, nonlinear, stochastic systems
Appendix
Books on ADP & RL (chronologically)
Bertsekas & Tsitsiklis, Neuro-Dynamic Programming, 1996.
Sutton & Barto, Reinforcement Learning: An Introduction, 1998.
Si, Barto, & Powell (eds.), Handbook of Learning and Approximate Dynamic Programming, 2004.
Bertsekas, Dynamic Programming and Optimal Control, 3rd ed., 2007.
Sigaud & Buffet (eds.), Markov Decision Processes in Artificial Intelligence, 2010.
Busoniu, Babuska, De Schutter, & Ernst, Reinforcement Learning and Dynamic Programming Using Function Approximators, 2010.
Szepesvari, Algorithms for Reinforcement Learning, 2010.
Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, 2nd ed., 2011.
Upcoming: Lewis & Liu (eds.), RL and ADP for Feedback Control; Wiering & Otterlo (eds.), RL: State of the Art.
Appendix
References for this talk (in citation order)
Busoniu, Ernst, Babuska, & De Schutter, Fuzzy Approximation for Convergent Model-Based Reinforcement Learning, FUZZ-IEEE 2007; extended in Automatica, 2010.
Ernst, Geurts, & Wehenkel, Tree-Based Batch Mode Reinforcement Learning, JMLR, 2005.
Munos & Szepesvari, Finite-Time Bounds for Fitted Value Iteration, JMLR, 2008.
Bertsekas & Tsitsiklis, Neuro-Dynamic Programming, 1996.
Lagoudakis & Parr, Least-Squares Policy Iteration, JMLR, 2003.
Sutton & Barto, Reinforcement Learning: An Introduction, 1998.
Melo, Meyn, & Ribeiro, An Analysis of Reinforcement Learning with Function Approximation, ICML 2008.
Adam, Busoniu, & Babuska, Experience Replay for Real-Time Reinforcement Learning Control, IEEE Trans. SMC-C, 2011.
Busoniu, Munos, De Schutter, & Babuska, Optimistic Planning for Sparsely Stochastic Systems, ADPRL 2011.