Reinforcement Learning Algorithms in Markov Decision Processes
AAAI-10 Tutorial
Introduction
Csaba Szepesvari and Richard S. Sutton
University of Alberta
E-mails: {szepesva,rsutton}@.ualberta.ca
Atlanta, July 11, 2010
Szepesvari & Sutton (UofA) RL Algorithms July 11, 2010 1 / 51
[Figure: probability density functions over the action a ∈ [0, 1], showing the behavior policy, the target policy with recognizer, and the target policy w/o recognizer; the region of interest is a ∈ [0.7, 0.9].]
Off-policy learning with options and recognizers
Doina Precup, Richard S. Sutton, Cosmin Paduraru, Anna J. Koop, Satinder Singh
McGill University, University of Alberta, University of Michigan
Ideas and Motivation · Background · Recognizers · Off-policy algorithm for options · Learning w/o the Behavior Policy
[Figure labels: Options; Distinguished region; Wall.]
Options
• A way of behaving for a period of time
Models of options
• A predictive model of the outcome of following the option
• What state will you be in?
• Will you still control the ball?
• What will be the value of some feature?
• Will your teammate receive the pass?
• What will be the expected total reward along the way?
• How long can you keep control of the ball?
Options for soccer players could be: Dribble, Keepaway, Pass
Options in a 2D world
The red and blue options are mostly executed. Surely we should be able to learn about them from this experience!
[Figure label: experienced trajectory.]
Off-policy learning
• Learning about one policy while behaving according to another
• Needed for RL w/exploration (as in Q-learning)
• Needed for learning abstract models of dynamical systems (representing world knowledge)
• Enables efficient exploration
• Enables learning about many ways of behaving at the same time (learning models of options)
• a policy
• a stopping condition
Non-sequential example: problem formulation w/o recognizers
• One state
• Continuous action a ∈ [0, 1]
• Outcomes z_i = a_i
• Given samples from a behavior policy b : [0, 1] → ℝ⁺
• Would like to estimate the mean outcome for a sub-region of the action space, here a ∈ [0.7, 0.9]
The target policy π : [0, 1] → ℝ⁺ is uniform within the region of interest (see the dashed line in the figure). The estimator is:

$$\hat m_\pi = \frac{1}{n} \sum_{i=1}^{n} \frac{\pi(a_i)}{b(a_i)} z_i .$$
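As a sanity check, the estimator above can be sketched in a few lines (an illustrative example, not code from the paper; names like `is_estimate` are mine). With b uniform on [0, 1] and π uniform on [0.7, 0.9], and outcomes z_i = a_i, the true mean outcome under π is 0.8:

```python
import random

def is_estimate(samples, pi, b):
    """Importance-sampling estimate of the mean outcome under pi,
    from (action, outcome) pairs sampled under the behavior policy b."""
    return sum(pi(a) / b(a) * z for a, z in samples) / len(samples)

b = lambda a: 1.0                                # uniform density on [0, 1]
pi = lambda a: 5.0 if 0.7 <= a <= 0.9 else 0.0   # uniform density on [0.7, 0.9]

random.seed(0)
samples = [(a, a) for a in (random.random() for _ in range(100_000))]
print(is_estimate(samples, pi, b))   # close to the true mean outcome, 0.8
```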
Theorem 1. Let A = {a₁, ..., a_k} ⊆ 𝒜 be a subset of all the possible actions. Consider a fixed behavior policy b and let Π_A be the class of policies that only choose actions from A, i.e., if π(a) > 0 then a ∈ A. Then the policy induced by b and the binary recognizer c_A is the policy with minimum-variance one-step importance sampling corrections among those in Π_A:

$$\pi \text{ as given by (1)} \;=\; \arg\min_{p \in \Pi_A} \; \mathbb{E}_b\!\left[\left(\frac{p(a_i)}{b(a_i)}\right)^{\!2}\right] \qquad (2)$$

Proof: Using Lagrange multipliers.
Theorem 2. Consider two binary recognizers c₁ and c₂ such that µ₁ > µ₂. Then the importance sampling corrections for c₁ have lower variance than the importance sampling corrections for c₂.
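Theorem 2 is easy to check by simulation (an illustrative sketch under the one-state setup above, with b uniform on [0, 1]; variable names are mine). The correction for a binary recognizer c is c(a)/µ, where µ is the recognition probability under b:

```python
import random, statistics

random.seed(0)
actions = [random.random() for _ in range(100_000)]   # b uniform on [0, 1]

def corrections(lo, hi):
    """Importance sampling corrections c(a)/mu for the binary
    recognizer of the interval [lo, hi]; mu = hi - lo under uniform b."""
    mu = hi - lo
    return [(1.0 / mu if lo <= a <= hi else 0.0) for a in actions]

v1 = statistics.pvariance(corrections(0.5, 0.9))   # mu1 = 0.4
v2 = statistics.pvariance(corrections(0.7, 0.9))   # mu2 = 0.2
print(v1 < v2)   # True: larger recognition probability, lower variance
```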
Off-policy learning
Let the importance sampling ratio at time step t be:

$$\rho_t = \frac{\pi(s_t, a_t)}{b(s_t, a_t)}$$

The truncated n-step return, R_t^{(n)}, satisfies:

$$R_t^{(n)} = \rho_t \left[ r_{t+1} + (1 - \beta_{t+1}) R_{t+1}^{(n-1)} \right].$$

The update to the parameter vector is proportional to:

$$\Delta\theta_t = \left[ R_t^\lambda - y_t \right] \nabla_\theta y_t \, \rho_0 (1 - \beta_1) \cdots \rho_{t-1} (1 - \beta_t).$$
Theorem 3. For every time step t ≥ 0 and any initial state s,

$$\mathbb{E}_b[\Delta\theta_t \mid s] = \mathbb{E}_\pi[\Delta\theta_t \mid s].$$

Proof: By induction on n we show that

$$\mathbb{E}_b\{R_t^{(n)} \mid s\} = \mathbb{E}_\pi\{R_t^{(n)} \mid s\},$$

which implies that $\mathbb{E}_b\{R_t^\lambda \mid s\} = \mathbb{E}_\pi\{R_t^\lambda \mid s\}$. The rest of the proof is algebraic manipulations (see paper).
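One step of the induction can be written out from the recursion for R_t^{(n)} (a sketch using the definitions above; the base case is immediate):

```latex
\mathbb{E}_b\{R_t^{(n)} \mid s\}
  = \sum_a b(s,a)\,\frac{\pi(s,a)}{b(s,a)}\,
    \mathbb{E}_b\!\left[ r_{t+1} + (1-\beta_{t+1})\,R_{t+1}^{(n-1)} \;\middle|\; s, a \right]
  = \sum_a \pi(s,a)\,
    \mathbb{E}_\pi\!\left[ r_{t+1} + (1-\beta_{t+1})\,R_{t+1}^{(n-1)} \;\middle|\; s, a \right]
  = \mathbb{E}_\pi\{R_t^{(n)} \mid s\}.
```

The second equality cancels b(s, a) against the denominator of ρ_t and applies the induction hypothesis to the expectation of R_{t+1}^{(n-1)} at the next state.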
Implementation of off-policy learning for options
To avoid Δθ_t → 0, we use a restart function g : S → [0, 1] (as in the PSD algorithm). The forward algorithm becomes:

$$\Delta\theta_t = (R_t^\lambda - y_t) \, \nabla_\theta y_t \sum_{i=0}^{t} g_i \, \rho_i \cdots \rho_{t-1} (1 - \beta_{i+1}) \cdots (1 - \beta_t),$$

where g_t is the extent of restarting in state s_t.
The incremental learning algorithm is the following:
• Initialize w₀ = g₀, e₀ = w₀ ∇_θ y₀
• At every time step t:

$$\delta_t = \rho_t \left( r_{t+1} + (1 - \beta_{t+1}) y_{t+1} \right) - y_t$$
$$\theta_{t+1} = \theta_t + \alpha \, \delta_t \, e_t$$
$$w_{t+1} = \rho_t \, w_t (1 - \beta_{t+1}) + g_{t+1}$$
$$e_{t+1} = \lambda \rho_t (1 - \beta_{t+1}) \, e_t + w_{t+1} \nabla_\theta y_{t+1}$$
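The update equations above can be sketched as a single step function (an illustrative transcription, not the authors' code; I write w for the restart-weight trace that the poster initializes to g₀, and all other names are mine):

```python
def off_policy_option_step(theta, e, w, phi_t, phi_next, r_next,
                           rho_t, beta_next, g_next, alpha, lam):
    """One step of the incremental off-policy option-model learner,
    with linear function approximation y = theta . phi.  rho_t is the
    importance sampling ratio, beta_next the termination probability,
    g_next the restart weight, alpha the step size, lam the
    trace-decay parameter."""
    y_t = sum(th * p for th, p in zip(theta, phi_t))
    y_next = sum(th * p for th, p in zip(theta, phi_next))
    delta = rho_t * (r_next + (1.0 - beta_next) * y_next) - y_t
    theta = [th + alpha * delta * ei for th, ei in zip(theta, e)]   # uses e_t
    w = rho_t * w * (1.0 - beta_next) + g_next
    e = [lam * rho_t * (1.0 - beta_next) * ei + w * p
         for ei, p in zip(e, phi_next)]
    return theta, e, w

# Initialization as in the text: w_0 = g_0, e_0 = w_0 * grad y_0.
g0, phi0 = 1.0, [1.0, 0.0]
theta, e, w = [0.0, 0.0], [g0 * p for p in phi0], g0
theta, e, w = off_policy_option_step(theta, e, w, phi0, [0.0, 1.0],
                                     r_next=1.0, rho_t=1.0, beta_next=0.0,
                                     g_next=0.0, alpha=0.1, lam=0.9)
print(theta)   # [0.1, 0.0]: the active feature moves toward the return
```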
Off-policy learning is tricky
• The Bermuda triangle:
  – Temporal-difference learning
  – Function approximation (e.g., linear)
  – Off-policy
• Leads to divergence of iterative algorithms:
  – Q-learning diverges with linear FA
  – Dynamic programming diverges with linear FA
Baird's Counterexample
[Figure: Baird's counterexample. States 1–5 have approximate values V_k(s) = θ(7) + 2θ(i), i = 1, ..., 5, and the sixth state has V_k(s) = 2θ(7) + θ(6); transitions reach the terminal state with the probabilities shown (99%, 1%, 100%). The accompanying plot shows the parameter values θ_k(i) — θ_k(1)–θ_k(5), θ_k(6), and θ_k(7) — diverging over 5000 iterations (log scale, broken at ±1).]
Precup, Sutton & Dasgupta (PSD) algorithm
• Uses importance sampling to convert the off-policy case to the on-policy case
• Convergence assured by a theorem of Tsitsiklis & Van Roy (1997)
• Survives the Bermuda triangle!
BUT:
• Variance can be high, even infinite (slow learning)
• Difficult to use with continuous or large action spaces
• Requires explicit representation of the behavior policy (probability distribution)
Option formalism
An option is defined as a triple o = ⟨I, π, β⟩:
• I ⊆ S is the set of states in which the option can be initiated
• π is the internal policy of the option
• β : S → [0, 1] is a stochastic termination condition
We want to compute the reward model of option o:

$$E_o\{R(s)\} = E\{r_1 + r_2 + \ldots + r_T \mid s_0 = s, \pi, \beta\}$$

We assume that linear function approximation is used to represent the model:

$$E_o\{R(s)\} \approx \theta^T \phi_s = y$$
References
Baird, L. C. (1995). Residual algorithms: Reinforcement learning with function approximation. In Proceedings of ICML.
Precup, D., Sutton, R. S. and Dasgupta, S. (2001). Off-policy temporal-difference learning with function approximation. In Proceedings of ICML.
Sutton, R. S., Precup, D. and Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, vol. 112, pp. 181–211.
Sutton, R. S. and Tanner, B. (2005). Temporal-difference networks. In Proceedings of NIPS-17.
Sutton, R. S., Rafols, E. and Koop, A. (2006). Temporal abstraction in temporal-difference networks. In Proceedings of NIPS-18.
Tadic, V. (2001). On the convergence of temporal-difference learning with linear function approximation. Machine Learning, vol. 42.
Tsitsiklis, J. N. and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42.
Theorem 4. If the following assumptions hold:
• The function approximator used to represent the model is a state aggregator
• The recognizer behaves consistently with the function approximator, i.e., c(s, a) = c(p, a), ∀s ∈ p
• The recognition probability for each partition, µ(p), is estimated using maximum likelihood:

$$\hat\mu(p) = \frac{N(p, c = 1)}{N(p)}$$

then there exists a policy π such that the off-policy learning algorithm converges to the same model as the on-policy algorithm using π.

Proof: In the limit, w.p. 1, $\hat\mu$ converges to $\sum_s d_b(s \mid p) \sum_a c(p, a) \, b(s, a)$, where d_b(s|p) is the probability of visiting state s from partition p under the stationary distribution of b. Let π be defined to be the same for all states in a partition p:

$$\pi(p, a) = \rho(p, a) \sum_s d_b(s \mid p) \, b(s, a)$$

π is well-defined, in the sense that $\sum_a \pi(s, a) = 1$. Using Theorem 3, off-policy updates using importance sampling corrections ρ will have the same expected value as on-policy updates using π.
Acknowledgements
The authors gratefully acknowledge the ideas and encouragement they have received in this work from Eddie Rafols, Mark Ring, Lihong Li and other members of the rlai.net group. We thank Csaba Szepesvari and the reviewers of the paper for constructive comments. This research was supported in part by iCore, NSERC, Alberta Ingenuity, and CFI.
Problem formulation with recognizers
The target policy π is induced by a recognizer function c : [0, 1] → ℝ⁺:

$$\pi(a) = \frac{c(a)\, b(a)}{\sum_x c(x)\, b(x)} = \frac{c(a)\, b(a)}{\mu} \qquad (1)$$
(see the blue line in the figure). The estimator is:

$$\hat m_\pi = \frac{1}{n} \sum_{i=1}^{n} z_i \frac{\pi(a_i)}{b(a_i)}
            = \frac{1}{n} \sum_{i=1}^{n} z_i \frac{c(a_i)\, b(a_i)}{\mu} \frac{1}{b(a_i)}
            = \frac{1}{n} \sum_{i=1}^{n} \frac{z_i \, c(a_i)}{\mu}$$
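A quick numeric check (illustrative; all names are mine) that the recognizer form of the estimator coincides with the importance sampling form under the induced target policy π = c·b/µ:

```python
import random

random.seed(0)
c = lambda a: 1.0 if 0.7 <= a <= 0.9 else 0.0   # binary recognizer
b = lambda a: 1.0                                # behavior: uniform on [0, 1]
mu = 0.2                                         # sum_x c(x) b(x) for this b

samples = [(a, a) for a in (random.random() for _ in range(50_000))]
# Importance sampling form with the induced pi(a) = c(a) b(a) / mu:
m_is = sum(z * (c(a) * b(a) / mu) / b(a) for a, z in samples) / len(samples)
# Recognizer form: (1/n) sum_i z_i c(a_i) / mu:
m_rec = sum(z * c(a) / mu for a, z in samples) / len(samples)
print(abs(m_is - m_rec) < 1e-9)   # True: the two forms coincide
```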
!" !"" #"" $"" %"" &"""
'&
!
!'& ()*+,+-./01.,+.2-345.13,.630780#""04.)*/301.,+.2-349
:+;<7=;0,3-762+>3,
:+;<0,3-762+>3,
?=)@3,07804.)*/30.-;+724
McGill
The importance sampling corrections are:

$$\rho(s, a) = \frac{\pi(s, a)}{b(s, a)} = \frac{c(s, a)}{\mu(s)}$$

where µ(s) depends on the behavior policy b. If b is unknown, instead of µ we will use a maximum likelihood estimate $\hat\mu : S \to [0, 1]$, and the importance sampling corrections will be defined as:

$$\hat\rho(s, a) = \frac{c(s, a)}{\hat\mu(s)}$$
On-policy learning
If π is used to generate behavior, then the reward model of an option can be learned using TD-learning.
The n-step truncated return is:

$$R_t^{(n)} = r_{t+1} + (1 - \beta_{t+1}) R_{t+1}^{(n-1)}.$$

The λ-return is defined as usual:

$$R_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}.$$

The parameters of the function approximator are updated on every step proportionally to:

$$\Delta\theta_t = \left[ R_t^\lambda - y_t \right] \nabla_\theta y_t \, (1 - \beta_1) \cdots (1 - \beta_t).$$
Contributions
• Recognizers reduce variance
• First off-policy learning algorithm for option models
• Off-policy learning without knowledge of the behavior distribution
• Observations:
  – Options are a natural way to reduce the variance of importance sampling algorithms (because of the termination condition)
  – Recognizers are a natural way to define options, especially for large or continuous action spaces
Outline
1 Introduction
2 Markov decision processes
  Motivating examples
  Controlled Markov processes
  Alternate definitions
  Policies, values
3 Theory of dynamic programming
  The fundamental theorem
  Algorithms of dynamic programming
4 Bibliography
Presenters
Richard S. Sutton is a professor and iCORE chair in the Department of Computing Science at the University of Alberta. He is a fellow of the AAAI and co-author of the textbook Reinforcement Learning: An Introduction from MIT Press. His research interests center on the learning problems facing a decision-maker interacting with its environment, which he sees as central to artificial intelligence.
Csaba Szepesvari, an Associate Professor at the Department of Computing Science of the University of Alberta, is the coauthor of a book on nonlinear approximate adaptive controllers and the author of a recent book on reinforcement learning. His main interest is the design and analysis of efficient learning algorithms in various active and passive learning scenarios.
Note: Place for shameless self-promotion! Buy the books!! :)
Reinforcement learning
[Figure: the reinforcement learning loop. The Controller sends an Action to the System; the System returns a State and a Reward to the Controller.]
• Sequential decision making under uncertainty
• Long-term objective
• Numerical performance measure
• Learning! (to overcome the "curse of modeling")
• Other terminologies in use:
  – Agent, environment
  – Plant, controller (feedback, closed-loop)
• The state is sometimes not observable, but that should not bother us now
Preview of coming attractions
[Figure: Prediction → Value iteration → Policy iteration → Policy search → Control]
• The structure of the talk follows this.
• Except that first we introduce the framework of MDPs
The structure of the tutorial
Markov decision processes
– Generalizes shortest path computations
– Stochasticity, state, action, reward, value functions, policies
– Bellman (optimality) equations, operators, fixed-points
– Value iteration, policy iteration
Value prediction
– Temporal difference learning unifies Monte-Carlo and bootstrapping
– Function approximation to deal with large spaces
– New gradient based methods
– Least-squares methods
Control
– Closed-loop interactive learning: exploration vs. exploitation
– Q-learning
– SARSA
– Policy gradient, natural actor-critic
Mention that we will see quite a few applications along the way.
How to get to Atlanta?
• Story: How do you decide what is the shortest (fastest, cheapest, etc.) way to get to AAAI? (if you do not have a secretary)
• Consider the alternatives!
• There are many, many paths!
• How to find the shortest one in an efficient manner?
• Let the audience think. They should answer this question, really.
• Then solve the shortest path problem by hand, by computing the optimal cost-to-go values, with full backups, following the ordering of nodes shown
• The goal is the node at the right that does not have any out edge
How to get to Atlanta?
• Mention that the extension to non-uniform costs is trivial
• Except when some of the costs could be negative
• Assume that some costs can be negative. Homework: when will the algorithm work? Give examples when it does work and when it does not!
• Solution: two conditions:
  – There should be no "free lunch" ("free lunch" ≡ ∃ a policy that does not reach the goal state, yet has less than infinite cost from each state)
  – There should be at least one policy that reaches the goal
Value iteration

function ValueIteration(x*)
  for x ∈ X do V[x] ← 0
  repeat
    V′ ← V
    for x ∈ X \ {x*} do
      V[x] ← 1 + min_{y∈N(x)} V′(y)
  until V = V′
  return V

function BestNextNode(x, V)
  return arg min_{y∈N(x)} V(y)
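The pseudocode can be sketched in runnable form (an illustrative transcription; the tiny graph and node names are my own, and it assumes every node can reach the goal):

```python
def value_iteration(neighbors, goal):
    """Shortest-path value iteration: V[x] is the cost-to-go (number
    of edges) from x to the goal.  `neighbors` maps nodes to successors."""
    V = {x: 0.0 for x in neighbors}
    while True:
        V_old = dict(V)                      # snapshot V' <- V
        for x in neighbors:
            if x != goal and neighbors[x]:
                V[x] = 1.0 + min(V_old[y] for y in neighbors[x])
        if V == V_old:                       # stop when nothing changed
            return V

def best_next_node(x, V, neighbors):
    """Greedy choice: move to the neighbor with the smallest cost-to-go."""
    return min(neighbors[x], key=lambda y: V[y])

# Hypothetical example: D is the goal node.
g = {"A": ["B", "C"], "B": ["D"], "C": ["B"], "D": []}
V = value_iteration(g, "D")
print(V["A"], best_next_node("A", V, g))   # 2.0 B
```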
• Go through the algorithm
• Discuss how to decide which way to go to stay on the shortest path
• Introduce greedy choice
• Introduce state, action
• Introduce policy, i.e., stationary deterministic policy
Rewarding excursions

function ValueIteration
  for x ∈ X do V[x] ← 0
  repeat
    V′ ← V
    for x ∈ X \ {x*} do
      V[x] ← max_{a∈A(x)} { r(x, a) + γ V′(f(x, a)) }
  until V = V′
  return V

function BestAction(x, V)
  return arg max_{a∈A(x)} { r(x, a) + γ V(f(x, a)) }
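A runnable sketch of this discounted variant (illustrative; the tiny chain MDP at the end is my own example, and the tolerance-based stopping rule replaces exact equality):

```python
def value_iteration(states, actions, r, f, gamma, goal, tol=1e-9):
    """Deterministic discounted value iteration:
    V[x] <- max_a { r(x, a) + gamma * V(f(x, a)) }."""
    V = {x: 0.0 for x in states}
    while True:
        V_old = dict(V)
        for x in states:
            if x != goal:
                V[x] = max(r(x, a) + gamma * V_old[f(x, a)]
                           for a in actions(x))
        if max(abs(V[x] - V_old[x]) for x in states) < tol:
            return V

def best_action(x, V, actions, r, f, gamma):
    return max(actions(x), key=lambda a: r(x, a) + gamma * V[f(x, a)])

# Hypothetical 3-state chain: move right, reward 1 on entering the goal.
states = [0, 1, 2]
actions = lambda x: ["go"]
f = lambda x, a: min(x + 1, 2)
r = lambda x, a: 1.0 if f(x, a) == 2 else 0.0
V = value_iteration(states, actions, r, f, gamma=0.9, goal=2)
print(V[0])   # 0.9: one zero-reward step, then the discounted rewarded step
```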
• Introduce rewards
• Introduce transition function f
• Introduce discounting:
  – One dollar in the future is worth less than today
  – In general, the future is important, but maybe not as important as today
  – Write $\sum_{t=0}^{\infty} \gamma^t R_t$
  – Economics: γ = 1/(1 + ρ), where 0 < ρ ≪ 1 is the interest rate
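The economics note can be checked with a couple of lines (illustrative numbers; ρ = 5% is my own choice). A constant reward of 1 per period has discounted value $\sum_t \gamma^t = 1/(1-\gamma)$:

```python
# gamma = 1 / (1 + rho), where rho is the per-period interest rate.
rho = 0.05
gamma = 1 / (1 + rho)
value = sum(gamma ** t for t in range(10_000))   # truncated geometric series
print(round(value, 4), round(1 / (1 - gamma), 4))   # 21.0 21.0
```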
Uncertainty
"Uncertainty is the only certainty there is, and knowing how to live with insecurity is the only security." (John Allen Paulos, 1945–)
The next state might be uncertain; the reward, ditto.
Advantage: richer model, robustness.
A transition from X after taking action A:

$$Y = f(X, A, D), \qquad R = g(X, A, D)$$

D – random variable, the "disturbance"; f – transition function; g – reward function
• John Allen Paulos (http://www.math.temple.edu/~paulos/) is an extensively "kudized" author, popular public speaker, and monthly columnist for ABCNews.com and formerly for the Guardian. Professor of math at Temple. Funny guy, has interesting thoughts :)
• In the previous graph we could have actions instead of edges, and then stochastic transitions out from the nodes.
Power management
• Blurb about how important power management is
The monumental number of PCs operating worldwide creates other requirements for PC power management. Because there are hundreds of millions of PCs in operation, the installed base of computers worldwide consumes tens of gigawatts for every hour of operation. Even small changes in average desktop computer power consumption can, on the global scale, save as much power as generated by a small power plant. (Source: Intel, http://www.intel.com/intelpress/samples/PPM_chapter.pdf)
Computer usage data (% of daily time)
Home: Gaming 4, Music entertainment 4, Transcode multitasking 3, Internet content creation 4, Broad based productivity 36, Media playback multitasking 4, Windows idle 44
Office: Transcode multitasking 2, Internet content creation 3, Broad based productivity 53, Video content creation 1, Image content creation 2, Windows idle 39
Source: http://www.amd.com/us/Documents/43029A_Brochure_PFD.pdf
• Background: AMD surveyed over 1200 users to determine consumer and commercial usage patterns (daily time spent on each application or at idle) in four countries.
• What is important:
  – Computers are often idle
  – They do all kinds of work at other times, which require different parts of the computer to be awake
  – Why is power management challenging? Everyone uses the computer differently. One size fits all?? NO!
Power management
Advanced Configuration and Power Interface (ACPI)
• First released in December 1996; latest release June 2010
• Platform-independent interfaces for hardware discovery, configuration, power management and monitoring

Note: This is a complex issue, which the next two slides illustrate. The details on those slides are not so important, but they illustrate well the complexity of the issue.
Power mgmt – Power states
• G0 (S0): Working
• G1, Sleeping: subdivides into the four states S1 through S4
  – S1: All processor caches are flushed, and the CPU(s) stop executing instructions. Power to the CPU(s) and RAM is maintained; devices that do not indicate they must remain on may be powered down
  – S2: CPU powered off
  – S3: Commonly referred to as Standby, Sleep, or Suspend to RAM. RAM remains powered
  – S4: Hibernation or Suspend to Disk. All content of main memory is saved to non-volatile memory such as a hard drive, and the system is powered down
• G2 (S5), Soft Off: G2 is almost the same as G3, Mechanical Off, but some components remain powered so the computer can "wake" from input from the keyboard, clock, modem, LAN, or USB device
• G3, Mechanical Off: The computer's power consumption approaches close to zero, to the point that the power cord can be removed and the system is safe for disassembly (typically, only the real-time clock is running off its own small battery)
Power mgmt – Device, processor, performance states

Device states
  – D0, Fully On: the operating state.
  – D1 and D2: intermediate power states whose definition varies by device.
  – D3, Off: the device is powered off and unresponsive to its bus.
Processor states
  – C0: the operating state.
  – C1 (often known as Halt): the processor is not executing instructions but can return to an executing state essentially instantaneously. All ACPI-conformant processors must support this power state. Some processors, such as the Pentium 4, also support an Enhanced C1 state (C1E, or Enhanced Halt State) for lower power consumption.
  – C2 (often known as Stop-Clock): the processor maintains all software-visible state but may take longer to wake up. This state is optional.
  – C3 (often known as Sleep): the processor does not need to keep its cache coherent, but maintains other state. Some processors have variations on C3 (Deep Sleep, Deeper Sleep, etc.) that differ in how long it takes to wake the processor. This state is optional.
Performance states: while a device or processor operates (D0 and C0, respectively), it can be in one of several power-performance states. These states are implementation-dependent, but P0 is always the highest-performance state, with P1 to Pn being successively lower-performance states, up to an implementation-specific limit of n no greater than 16. P-states are known as SpeedStep in Intel processors, PowerNow! or Cool'n'Quiet in AMD processors, and PowerSaver in VIA processors.
  – P0: maximum power and frequency.
  – P1: less than P0, voltage/frequency scaled.
  – Pn: less than P(n−1), voltage/frequency scaled.
Szepesvari & Sutton (UofA) RL Algorithms July 11, 2010 17 / 51
An oversimplified model
Note
The transitions can be represented as

    Y = f(x, a, D),
    R = g(x, a, D).
Szepesvari & Sutton (UofA) RL Algorithms July 11, 2010 18 / 51
• This is indeed oversimplified.
• We could have more states...
• ...but the message should be clear.
Value iteration
function VALUEITERATION
1: for x ∈ X do V[x] ← 0
2: repeat
3:   V′ ← V
4:   for x ∈ X \ {x∗} do
5:     V[x] ← max_{a∈A(x)} E[ g(x, a, D) + γ V(f(x, a, D)) ]
6:   end for
7: until V = V′
8: return V

function BESTACTION(x, V)
1: return argmax_{a∈A(x)} E[ g(x, a, D) + γ V(f(x, a, D)) ]
Szepesvari & Sutton (UofA) RL Algorithms July 11, 2010 19 / 51
• Straightforward generalization of the previous algorithm: in each step we compute the expected total reward based on the assumption that the total future rewards are given by V.
• Homework: think about the stochastic shortest path case and convince yourself that this works.
• Question: why does it work in general? POSTPONED!
• Show how the computation is done ⇐ Excel sheet!
• IMPORTANT: uncertainty makes it necessary to check back where you are! In deterministic problems, knowing the initial state is enough to know how to act until you get to your goal!
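The update on line 5 of the pseudocode is easy to run on a small example. Below is a minimal Python sketch on an invented three-state MDP in the (f, g, D) form used above; the dynamics f, the reward g, and the coin-flip disturbance are all assumptions made up for illustration, not anything from the tutorial.

```python
# Value iteration on a tiny invented MDP in the (f, g, D) form of the slide:
# 3 states, 2 actions, coin-flip disturbance. All of f, g, D are assumptions.
gamma = 0.9
states, actions = [0, 1, 2], [0, 1]
D = [(0, 0.5), (1, 0.5)]                      # (disturbance value, probability)

def f(x, a, d):                               # next state
    return min(x + 1 + d, 2) if a == 1 else max(x - d, 0)

def g(x, a, d):                               # reward: +1 for being in state 2; action 1 costs 0.1
    return float(x == 2) - 0.1 * a

def backup(V, x, a):                          # E[ g(x,a,D) + gamma * V(f(x,a,D)) ]
    return sum(p * (g(x, a, d) + gamma * V[f(x, a, d)]) for d, p in D)

V = {x: 0.0 for x in states}
for _ in range(200):                          # synchronous sweeps; 200 is plenty here
    V = {x: max(backup(V, x, a) for a in actions) for x in states}

best = {x: max(actions, key=lambda a: backup(V, x, a)) for x in states}
print(V, best)
```

BESTACTION corresponds to the final dictionary comprehension: once V has converged, the greedy action is read off with one more backup per state.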
How to gamble if you must?
The safest way to double your money is to fold it over once and put it in your pocket. ("Kin" Hubbard, 1868–1930)
State: Xt ≡ wealth of the gambler at step t, Xt ≥ 0.
Action: At ∈ [0, 1], the fraction of Xt put at stake.
St ∈ {−1, +1}, P(St+1 = 1) = p, p ∈ [0, 1], i.i.d. random variables.
Fortune at the next time step:

    Xt+1 = (1 + St+1 At) Xt.

Goal: maximize the probability that the wealth reaches w∗.
How to put this into our framework?
Szepesvari & Sutton (UofA) RL Algorithms July 11, 2010 20 / 51
• Frank McKinney Hubbard was an American cartoonist, humorist, and journalist, better known by his pen name "Kin" Hubbard.
• Imagine, e.g., that as a last resort you have to bet on horses because you need the money to pay back your debt to the mafia, or you die.
How to gamble if you must? – Solution
Xt ∈ X = [0, w∗], A = [0, 1].
Let f : X × A × {−1, +1} → X be

    f(x, a, s) = (1 + s a) x ∧ w∗,  if x < w∗;
                 w∗,                otherwise.

Let g : X × A × {−1, +1} → R be

    g(x, a, s) = 1,  if (1 + s a) x ≥ w∗ and x < w∗;
                 0,  otherwise.
What is the optimal policy?
Szepesvari & Sutton (UofA) RL Algorithms July 11, 2010 21 / 51
• Mention that this is an episodic problem.
• The tent-like symbol ∧ is a binary operator, similar to, say, +; it computes the minimum of its arguments.
• Here w∗ is an absorbing state.
• This is a prototypical example of how to deal with problems where the goal is to maximize the probability of an event.
• Ask the audience: how would you play this game?
• How to compute the optimal value function? The state space has become infinite...
• Optimal strategy: the "bold strategy" (no need to tell them): risk the smaller of the amount missing to reach w∗ and your current wealth.
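A discretized version of this problem can be solved with the value iteration just shown. The sketch below is an approximation under assumptions of my own: integer wealth levels 0..W and integer stakes in place of the continuous sets, win probability p = 0.4, no discounting (episodic), and reward 1 on reaching W. With p < 1/2 the computed values agree with the bold strategy; in particular V[W/2] = p.

```python
# Gambling MDP, discretized: integer wealth 0..W, integer stakes, win prob p.
# W and p are illustrative choices; gamma = 1 (episodic), reward 1 on reaching W.
W, p = 10, 0.4
V = [0.0] * (W + 1)
V[W] = 1.0                                    # w* = W is the rewarded absorbing state
for _ in range(1000):                         # in-place (Gauss-Seidel) value iteration
    for x in range(1, W):
        V[x] = max(p * V[x + a] + (1 - p) * V[x - a]
                   for a in range(1, min(x, W - x) + 1))
print([round(v, 3) for v in V])               # V[5] = 0.4: stake everything (bold play)
```

V[x] here is the probability of eventually reaching W from wealth x under optimal play, which is exactly the expected return for this reward choice.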
Inventory control
[Figure: the daily inventory cycle, with time marks at 19:00, 7:00, and 14:00]
X = {0, 1, ..., M}; Xt is the size of the inventory in the evening of day t.
A = {0, 1, ..., M}; At is the number of items ordered in the evening of day t.
Dynamics:

    Xt+1 = ((Xt + At) ∧ M − Dt+1)^+.

Reward:

    Rt+1 = −K I{At>0} − c ((Xt + At) ∧ M − Xt)^+ − h Xt + p ((Xt + At) ∧ M − Xt+1)^+.
Szepesvari & Sutton (UofA) RL Algorithms July 11, 2010 22 / 51
• Explanation of the quantities involved:
  – M: maximum inventory size
  – Dt+1: demand
  – K: fixed cost of ordering any items
  – c: per-item purchase cost (charged on the items actually ordered)
  – h: holding cost per item in stock
  – p: sale price per item
• Zt = (Xt + At) ∧ M is the post-decision state, or afterstate; afterstates also come up in games.
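The dynamics and reward above translate line by line into code. The sketch below simulates a few days of the inventory process; the parameter values and the random ordering rule are made-up assumptions for illustration only.

```python
import random

# One day of the inventory MDP, transcribed from the formulas above.
# The parameter values and the random ordering rule are made-up assumptions.
M, K, c, h, p_sale = 20, 2.0, 1.0, 0.1, 3.0

def step(x, a, demand):
    z = min(x + a, M)                          # post-decision state (X_t + A_t) ∧ M
    x_next = max(z - demand, 0)                # (z − D_{t+1})^+
    reward = (-K * (a > 0)                     # fixed ordering cost
              - c * max(z - x, 0)              # per-item purchase cost
              - h * x                          # holding cost
              + p_sale * max(z - x_next, 0))   # proceeds from items sold
    return x_next, reward

random.seed(0)
x = 5
for t in range(3):
    a = random.randint(0, M - x)               # dummy policy: order a random amount
    x, r = step(x, a, random.randint(0, 8))
    print(t, x, round(r, 2))
```

The intermediate variable z is exactly the afterstate Zt mentioned in the notes.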
Other examples

Engineering, operations research
  – Process control: chemical, electronic, mechanical systems ⇒ ROBOTS
  – Supply chain management
Information theory
  – Optimal coding
  – Channel allocation
  – Sensing, sensor networks
Finance
  – Portfolio management
  – Option pricing
Artificial intelligence
  – The whole problem of acting under uncertainty
  – Search
  – Games
  – Vision: gaze control
  – Information retrieval
...
Szepesvari & Sutton (UofA) RL Algorithms July 11, 2010 23 / 51
• The point is, these problems are ubiquitous.
• TODO: add some figures.
Controlled Markov processes
Xt+1 = f(Xt, At, Dt+1)   (state dynamics)
Rt+1 = g(Xt, At, Dt+1)   (reward)
t = 0, 1, ....

Xt ∈ X: state at time t; X: set of states.
At ∈ A: action at time t; A: set of actions (sometimes A(x), the admissible actions in state x).
Rt+1 ∈ R: reward.
Dt ∈ D: disturbance, an i.i.d. sequence; D: disturbance space.
Szepesvari & Sutton (UofA) RL Algorithms July 11, 2010 25 / 51
This just collects on a single slide what we have talked about before.
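The (f, g, D) description translates directly into a simulator. The sketch below is a generic roll-out routine; the function names and the random-walk example fed to it are illustrative assumptions, not part of the tutorial.

```python
import random

# A generic simulator for the (f, g, D) description above; the function
# names and the random-walk example are illustrative assumptions.
def simulate(f, g, sample_D, policy, x0, T, seed=None):
    rng = random.Random(seed)
    x, trajectory = x0, []
    for _ in range(T):
        a = policy(x)
        d = sample_D(rng)                      # i.i.d. disturbance D_{t+1}
        trajectory.append((x, a, g(x, a, d)))  # (X_t, A_t, R_{t+1})
        x = f(x, a, d)                         # X_{t+1}
    return trajectory

# Example: a controlled random walk on {0, ..., 10}; reward 1 for sitting at 10.
traj = simulate(f=lambda x, a, d: min(max(x + a * d, 0), 10),
                g=lambda x, a, d: float(x == 10),
                sample_D=lambda rng: rng.choice([-1, 1]),
                policy=lambda x: 1,
                x0=5, T=20, seed=0)
print(traj[:3])
```

Note that the disturbance is drawn independently of the state and action on every step, which is exactly what makes the resulting state sequence a Markov process.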
Return
Definition (Return)
The return, or total discounted return, is

    R = ∑_{t=0}^∞ γ^t Rt+1,

where 0 ≤ γ ≤ 1 is the so-called discount factor. The return depends on how we act!
Szepesvari & Sutton (UofA) RL Algorithms July 11, 2010 26 / 51
• The return is an important quantity.
• Return ≠ immediate reward (or just reward).
• The index starts from time step zero, but the "zero" can be shifted around.
• If the rewards are bounded in expectation and the discount factor γ is less than one, then the expected return is well defined.
• If the rewards can be unbounded (from below or above), care must be taken, e.g., with Gaussian noise.
• The discount factor could be one, but then one must be careful because the return might become unbounded, even when the rewards are bounded.
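For a finite reward sequence the (truncated) return is easy to compute; a backward, Horner-style accumulation avoids forming the powers γ^t explicitly. A small sketch, with a made-up reward sequence:

```python
# Discounted return of a finite reward sequence R_1, R_2, ... (a truncation
# of the infinite sum on the slide; with gamma < 1 the tail is negligible).
def discounted_return(rewards, gamma):
    g = 0.0
    for r in reversed(rewards):            # Horner-style backward accumulation
        g = r + gamma * g                  # g_t = R_{t+1} + gamma * g_{t+1}
    return g

print(discounted_return([1.0, 1.0, 1.0], 0.5))   # 1 + 0.5 + 0.25 = 1.75
```

The backward recursion is the same identity, G_t = R_{t+1} + γ G_{t+1}, that underlies the Bellman equations later in the tutorial.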
The goal of control
Goal
Maximize the expected total discounted reward, or expected return, irrespective of the initial state:

    E[ ∑_{t=0}^∞ γ^t Rt+1 | X0 = x ] → max!,   x ∈ X.
Szepesvari & Sutton (UofA) RL Algorithms July 11, 2010 27 / 51
• Note that there is no distribution over the states.
• We want to act optimally from each state
• For each state, a different “policy” might be optimal.
Alternate definition
Definition (Markov decision process)
A triplet (X, A, P0), where
  – X: set of states
  – A: set of actions
  – P0: state and reward kernel; P0(U | x, a) is the probability that (Xt+1, Rt+1) lands in U ⊂ X × R, given that Xt = x and At = a.
Szepesvari & Sutton (UofA) RL Algorithms July 11, 2010 29 / 51
• A somewhat unconventional definition, but a very compact and very general one.
Connection to previous definition
Assume that

    Xt+1 = f(Xt, At, Dt+1),
    Rt+1 = g(Xt, At, Dt+1),   t = 0, 1, ....

Then

    P0(U | x, a) = P( (f(x, a, D), g(x, a, D)) ∈ U ),

where D has the same distribution as D1, D2, ....
Szepesvari & Sutton (UofA) RL Algorithms July 11, 2010 30 / 51
• The two definitions are, in fact, equivalent.
• Sometimes one definition is more convenient, sometimes the other.
“Classical form”
Finite MDP (as often seen in AI publications):

    (X, A, P, r)

  – X, A are finite.
  – P(x, a, y) is the probability of landing in state y given that action a was chosen in state x.
  – r(x, a, y) is the expected reward received when making this transition.
Szepesvari & Sutton (UofA) RL Algorithms July 11, 2010 31 / 51
• This is a smaller class.
• But if X, A are allowed to be countably infinite, the analysis can get pretty complicated.
• What is lost is our ability to talk about continuity:
  – The naming of states and actions is arbitrary!
  – In continuous problems, we often have some sort of continuity, allowing for generalization to "nearby" states/actions!
  – This could be mimicked by introducing some kind of "distance function".
Policies, values
Note
From now on we assume that A is countable.

Definition (General policy)
A general policy maps each history to a distribution over A.
Deterministic policy: π = (π0, π1, ...), where π0 : X → A and πt : (X × A × R)^t × X → A, t = 1, 2, ....
Following the policy: At = πt(X0, A0, R1, ..., Xt−1, At−1, Rt, Xt).
Szepesvari & Sutton (UofA) RL Algorithms July 11, 2010 33 / 51
• General policies: when we do learning, we follow a general policy, because the policy depends on the history!
• Usually the history is compressed in some form.
Stationary policies
Definition (Stationary policy)
The map depends on the last state only.
Deterministic policy: π = (π0, π0, ...); following the policy: At = π0(Xt).
Stochastic policy: π = (π0, π0, ...), π0 : X → M1(A); following the policy: At ∼ π0(· | Xt).
Szepesvari & Sutton (UofA) RL Algorithms July 11, 2010 34 / 51
• We simply identify π with π0.
• Stationary policies play a distinguished role in the theory of MDPs.
The value of a policy
Definition (Value of a state under π)
The expected return given that the policy is started in state x:

    Vπ(x) = E[Rπ | X0 = x].

Vπ is the value function of π.

Definition (Action-value of a state-action pair under π)
The expected return given that the process is started from state x, the first action is a, after which the policy π is followed:

    Qπ(x, a) = E[Rπ | X0 = x, A0 = a].

Qπ is the action-value function of π.
Szepesvari & Sutton (UofA) RL Algorithms July 11, 2010 35 / 51
• These are well defined under our conditions.
• Even for general policies.
• The action-values were introduced by Watkins; they are very useful, as we shall see later.
Optimal values
Definition (Optimal values)
The optimal value of a state is the best possible expected return that can be obtained from that state:

    V∗(x) = sup_π Vπ(x).

Similarly, the optimal value of a state-action pair is Q∗(x, a) = sup_π Qπ(x, a).

Definition (Optimal policy)
A policy π is called optimal if Vπ(x) = V∗(x) holds for all states x ∈ X.
Szepesvari & Sutton (UofA) RL Algorithms July 11, 2010 36 / 51
• The questions are:
  – Does there exist an optimal policy?
  – A simple optimal policy?
  – A computable optimal policy?
  – How to compute it?
The fundamental theorem and the Bellman (optimality) operator
Theorem
Assume that |A| < +∞. Then the optimal value function satisfies

    V∗(x) = max_{a∈A} [ r(x, a) + γ ∑_{y∈X} P(x, a, y) V∗(y) ],   x ∈ X,

and if a policy π is such that in each state x it selects an action that maximizes the right-hand side, then π is an optimal policy.

A shorter way to write this is V∗ = T∗V∗, where

    (T∗V)(x) = max_{a∈A} [ r(x, a) + γ ∑_{y∈X} P(x, a, y) V(y) ],   x ∈ X.
Szepesvari & Sutton (UofA) RL Algorithms July 11, 2010 38 / 51
• Explain that operators are just functions that act on functions; but functions are really like vectors, so no one should be afraid of this.
• What is the history? Hard to tell. The theorem exists in various generalities. This simple form must have been known to Bellman (1920–1984), who "invented" dynamic programming in 1953. Dreyfus, Blackwell and others worked out the math for the more complicated cases, and there is still work left. Some older economics literature mentions the principle of optimality.
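In the finite case T∗ is a few lines of code, and the contraction property can be watched in action: iterating T∗ from any starting point converges to its unique fixed point V∗. The two-state transition kernel and reward table below are an invented example, not from the tutorial.

```python
import numpy as np

# The Bellman optimality operator T* for a finite MDP (X, A, P, r); the
# two-state transition kernel and rewards below are an invented example.
gamma = 0.9
P = np.array([[[0.9, 0.1], [0.2, 0.8]],        # P[x, a, y]
              [[0.7, 0.3], [0.1, 0.9]]])
r = np.array([[0.0, 1.0],                      # r[x, a]
              [2.0, 0.5]])

def T_star(V):
    # (T*V)(x) = max_a [ r(x, a) + gamma * sum_y P(x, a, y) V(y) ]
    return (r + gamma * P @ V).max(axis=1)

# Banach's fixed-point theorem in action: iterate T* until the residual vanishes.
V = np.zeros(2)
for _ in range(500):
    V = T_star(V)
print(V, np.max(np.abs(T_star(V) - V)))        # residual ~ 0 at the fixed point
```

Because T∗ is a γ-contraction in the supremum norm, the error shrinks by at least a factor γ per application, which is why a few hundred iterations suffice here.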
Action evaluation operator
Definition (Action evaluation operator)
Let a ∈ A and define

    (TaV)(x) = r(x, a) + γ ∑_{y∈X} P(x, a, y) V(y),   x ∈ X.

Comment:

    (T∗V)(x) = max_{a∈A} (TaV)(x).
Szepesvari & Sutton (UofA) RL Algorithms July 11, 2010 39 / 51
Policy evaluation operator
Definition (Policy evaluation operator)
Let π be a stochastic stationary policy. Define

    (TπV)(x) = ∑_{a∈A} π(a|x) [ r(x, a) + γ ∑_{y∈X} P(x, a, y) V(y) ]
             = ∑_{a∈A} π(a|x) (TaV)(x),   x ∈ X.

Corollary
Tπ is a contraction, and Vπ is its unique fixed point.
Szepesvari & Sutton (UofA) RL Algorithms July 11, 2010 40 / 51
Greedy policy
Definition (Greedy policy)
Policy π is greedy w.r.t. V if

    TπV = T∗V,

i.e.,

    ∑_{a∈A} π(a|x) [ r(x, a) + γ ∑_{y∈X} P(x, a, y) V(y) ] = max_{a∈A} { r(x, a) + γ ∑_{y∈X} P(x, a, y) V(y) }

holds for all states x.
Szepesvari & Sutton (UofA) RL Algorithms July 11, 2010 41 / 51
A restatement of the main theorem
Theorem
Assume that |A| < +∞. Then the optimal value function satisfies the fixed-point equation V∗ = T∗V∗, and any greedy policy w.r.t. V∗ is optimal.
Szepesvari & Sutton (UofA) RL Algorithms July 11, 2010 42 / 51
Action-value functions
Corollary
Let Q∗ be the optimal action-value function. Then

    Q∗ = T∗Q∗,

and if π is a policy such that

    ∑_{a∈A} π(a|x) Q∗(x, a) = max_{a∈A} Q∗(x, a),

then π is optimal. Here,

    (T∗Q)(x, a) = r(x, a) + γ ∑_{y∈X} P(x, a, y) max_{a′∈A} Q(y, a′),   x ∈ X, a ∈ A.
Szepesvari & Sutton (UofA) RL Algorithms July 11, 2010 43 / 51
• The advantage is that knowledge of Q∗ alone (without knowing the model) is sufficient to know how to act optimally.
• The proof of the corollary is very simple from the fundamental theorem.
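That point is worth seeing in code: given a table of Q∗ values, acting optimally is a pure argmax over actions, with no transition model in sight. The table below is made up purely for illustration.

```python
# Knowing Q* alone suffices to act optimally: pick argmax_a Q*(x, a).
# The table below is an invented example for a 3-state, 2-action MDP.
Q = {(0, 'stay'): 1.0, (0, 'go'): 2.5,
     (1, 'stay'): 0.7, (1, 'go'): 0.2,
     (2, 'stay'): 3.0, (2, 'go'): 3.0}
actions = ['stay', 'go']

def greedy(Q, x):
    # Ties are broken toward the first action in the list.
    return max(actions, key=lambda a: Q[(x, a)])

print([greedy(Q, x) for x in range(3)])   # ['go', 'stay', 'stay']
```

Contrast this with acting greedily w.r.t. V∗, which requires r and P to evaluate the one-step lookahead.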
Finding the action-value functions of policies
Theorem
Let π be a stationary policy and let Tπ be defined by

    (TπQ)(x, a) = r(x, a) + γ ∑_{y∈X} P(x, a, y) ∑_{a′∈A} π(a′|y) Q(y, a′),   x ∈ X, a ∈ A.

Then Qπ is the unique solution of TπQπ = Qπ.
Szepesvari & Sutton (UofA) RL Algorithms July 11, 2010 44 / 51
Value iteration – a second look
function VALUEITERATION
1: for x ∈ X do V[x] ← 0
2: repeat
3:   V′ ← V
4:   for x ∈ X \ {x∗} do
5:     V[x] ← T∗V [x]
6:   end for
7: until V = V′
8: return V

function BESTACTION(x, V)
1: return argmax_{a∈A(x)} TaV [x]
Szepesvari & Sutton (UofA) RL Algorithms July 11, 2010 46 / 51
• Asynchronous updates have been studied.
• Also, "labelled value iteration" tries to keep track of what needs to be updated.
• Another variation is real-time dynamic programming (RTDP), which is related to Korf's LRTA∗.
• With optimistic initialization this is known to converge.
• Optimistic initialization ≡ admissible heuristics!
Value iteration
Note
If Vt is the value function computed in the t-th iteration of value iteration, then

    Vt+1 = T∗Vt.

The key is that T∗ is a contraction in the supremum norm; Banach's fixed-point theorem then gives the key to the proof of the theorem mentioned before.

Note
One can also use Qt+1 = T∗Qt, or value functions with post-decision states. What is the advantage?
Szepesvari & Sutton (UofA) RL Algorithms July 11, 2010 47 / 51
Policy iteration
function POLICYITERATION(π)
1: repeat
2:   π′ ← π
3:   V ← GETVALUEFUNCTION(π′)
4:   π ← GETGREEDYPOLICY(V)
5: until π = π′
6: return π
Szepesvari & Sutton (UofA) RL Algorithms July 11, 2010 48 / 51
• The policy does not need to be stored explicitly.
• The algorithm could also use action-value functions.
• The number of iterations is finite in finite MDPs.
• In infinite MDPs, the precision increases geometrically, never "slower" than value iteration.
• However, a single step of the iteration is more expensive.
• Generalized Policy Iteration: interleave value-function updates and policy updates at a fine grain. There is an advantage to doing this.
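The loop above fits in a few lines when GETVALUEFUNCTION is done exactly by solving the linear system (I − γ Pπ) Vπ = rπ. The two-state MDP data below is an invented example for illustration.

```python
import numpy as np

# Policy iteration on an invented two-state MDP in classical (X, A, P, r) form.
# Policy evaluation is done exactly by solving (I − gamma * P_pi) V = r_pi.
gamma, n = 0.9, 2
P = np.array([[[0.9, 0.1], [0.2, 0.8]],        # P[x, a, y]
              [[0.7, 0.3], [0.1, 0.9]]])
r = np.array([[0.0, 1.0], [2.0, 0.5]])         # r[x, a]

pi = np.zeros(n, dtype=int)                    # arbitrary initial policy
while True:
    P_pi = P[np.arange(n), pi]                 # transition matrix under pi
    r_pi = r[np.arange(n), pi]
    V = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)   # evaluation
    pi_new = (r + gamma * P @ V).argmax(axis=1)           # greedy improvement
    if np.array_equal(pi_new, pi):
        break
    pi = pi_new
print(pi, V)
```

The exact linear solve is what makes each iteration more expensive than a value iteration sweep; swapping it for a few Tπ applications gives one instance of generalized policy iteration.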
What if we stop early?
Theorem (e.g., Corollary 2 of Singh and Yee, 1994)
Fix an action-value function Q and let π be a greedy policy w.r.t. Q. Then the value of policy π can be lower bounded as follows:

    Vπ(x) ≥ V∗(x) − (2 / (1 − γ)) ‖Q − Q∗‖∞,   x ∈ X.
Szepesvari & Sutton (UofA) RL Algorithms July 11, 2010 49 / 51
Books
Bertsekas and Shreve (1978)Puterman (1994)Bertsekas (2007a,b)
Szepesvari & Sutton (UofA) RL Algorithms July 11, 2010 50 / 51
References
Bertsekas, D. P. (2007a). Dynamic Programming and Optimal Control,volume 1. Athena Scientific, Belmont, MA, 3 edition.
Bertsekas, D. P. (2007b). Dynamic Programming and Optimal Control,volume 2. Athena Scientific, Belmont, MA, 3 edition.
Bertsekas, D. P. and Shreve, S. (1978). Stochastic Optimal Control(The Discrete Time Case). Academic Press, New York.
Puterman, M. (1994). Markov Decision Processes — DiscreteStochastic Dynamic Programming. John Wiley & Sons, Inc., NewYork, NY.
Singh, S. P. and Yee, R. C. (1994). An upper bound on the loss fromapproximate optimal-value functions. Machine Learning,16(3):227–233.
Szepesvari & Sutton (UofA) RL Algorithms July 11, 2010 51 / 51