Planning in Markov Decision Processes
Deep Reinforcement Learning and Control
Katerina Fragkiadaki
Carnegie Mellon School of Computer Science
Lecture 3, CMU 10703
Markov Decision Process (MDP)

A Markov Decision Process is a tuple (S, A, T, r, γ)
• S is a finite set of states
• A is a finite set of actions
• T is a state transition probability function, T(s'|s, a) = P[S_{t+1} = s' | S_t = s, A_t = a]
• r is a reward function, r(s, a) = E[R_{t+1} | S_t = s, A_t = a]
• γ is a discount factor, γ ∈ [0, 1]
Solving MDPs

• Prediction: Given an MDP (S, A, T, r, γ) and a policy π(a|s) = P[A_t = a | S_t = s], find the state and action value functions.
• Optimal control: Given an MDP (S, A, T, r, γ), find the optimal policy (aka the planning problem). Compare with the learning problem, where information about rewards/dynamics is missing.
• We still consider finite MDPs (finite S and A) with known dynamics!
Outline
• Policy iteration
• Value iteration
• Linear programming
• Asynchronous DP
Policy Evaluation

Policy evaluation: for a given policy π, compute the state value function

v_π(s) = E_π[R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ... | S_t = s]

where v_π is implicitly given by the Bellman equation

v_π(s) = Σ_{a∈A} π(a|s) ( r(s, a) + γ Σ_{s'∈S} T(s'|s, a) v_π(s') ),

a system of |S| simultaneous equations.
MDPs to MRPs

An MDP under a fixed policy π becomes a Markov Reward Process (MRP) with

r_π(s) = Σ_{a∈A} π(a|s) r(s, a)   and   T_π(s'|s) = Σ_{a∈A} π(a|s) T(s'|s, a)

since

v_π(s) = Σ_{a∈A} π(a|s) ( r(s, a) + γ Σ_{s'∈S} T(s'|s, a) v_π(s') )
       = Σ_{a∈A} π(a|s) r(s, a) + γ Σ_{a∈A} π(a|s) Σ_{s'∈S} T(s'|s, a) v_π(s')
       = r_π(s) + γ Σ_{s'∈S} T_π(s'|s) v_π(s')
Iterative Policy Evaluation

v_{k+1}(s) = Σ_{a∈A} π(a|s) ( r(s, a) + γ Σ_{s'∈S} T(s'|s, a) v_k(s') )

or, in matrix form over the induced MRP,

v_{k+1} = r_π + γ T_π v_k
Backup Diagram: MDP

[Figure: one-step backup tree for v_π(s): from state s, actions a with rewards r lead to successor states s' with values v_π(s').]

Backup Diagram: MRP

[Figure: the same backup with the actions averaged out under π.]

v_π(s) = r_π(s) + γ Σ_{s'∈S} T_π(s'|s) v_π(s')
Matrix Form

The Bellman expectation equation can be written concisely using the induced MRP as

v_π = r_π + γ T_π v_π

with direct solution

v_π = (I − γ T_π)^{-1} r_π

of complexity O(N³) in the number of states N = |S|.
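As a quick illustration (not from the slides), the direct solution is a single linear solve. A minimal numpy sketch, assuming T is an |S|×|A|×|S| array of transition probabilities, r an |S|×|A| reward array, and pi an |S|×|A| stochastic policy:

import numpy as np

def policy_evaluation_direct(T, r, pi, gamma):
    # Induced MRP: r_pi(s) = sum_a pi(a|s) r(s,a); T_pi(s'|s) = sum_a pi(a|s) T(s'|s,a)
    r_pi = (pi * r).sum(axis=1)              # shape (S,)
    T_pi = np.einsum('sa,sat->st', pi, T)    # shape (S, S)
    # v_pi = (I - gamma * T_pi)^{-1} r_pi; this solve is the O(N^3) step.
    return np.linalg.solve(np.eye(len(r_pi)) - gamma * T_pi, r_pi)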
Iterative Methods: Recall the Bellman Equation

v_π(s) = Σ_{a∈A} π(a|s) ( r(s, a) + γ Σ_{s'∈S} T(s'|s, a) v_π(s') )
Iterative Methods: Backup Operation

Given the expected value function v_[k] at iteration k, we back up the expected value function at iteration k+1:

v_[k+1](s) = Σ_{a∈A} π(a|s) ( r(s, a) + γ Σ_{s'∈S} T(s'|s, a) v_[k](s') )

Iterative Methods: Sweep

A sweep consists of applying the backup operation v → v' for all the states in S. Applying the backup operator iteratively:

v_[0] → v_[1] → v_[2] → ... → v_π
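A minimal sketch of the sweep in numpy (same array conventions as the direct-solve sketch above; not from the slides):

import numpy as np

def iterative_policy_evaluation(T, r, pi, gamma, theta=1e-8):
    v = np.zeros(T.shape[0])
    while True:
        # One synchronous sweep: every state is backed up from the old v.
        q = r + gamma * (T @ v)             # (S, A) one-step lookahead values
        v_new = (pi * q).sum(axis=1)        # average over the policy's actions
        if np.max(np.abs(v_new - v)) < theta:
            return v_new
        v = v_new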
A Small-Grid World

• An undiscounted episodic task (γ = 1)
• Nonterminal states: 1, 2, ..., 14
• Terminal state: one, shown in the shaded squares
• Actions that would take the agent off the grid leave the state unchanged
• Reward is -1 until the terminal state is reached
Iterative Policy Evaluation

Policy π: the equiprobable random action policy.

[Figure: v_[k] for the random policy at successive iterations k on the small grid world above.]
Contraction Mapping Theorem

An operator F on a normed vector space X is a γ-contraction, for 0 < γ < 1, provided

||F(x) − F(y)|| ≤ γ ||x − y||   for all x, y ∈ X

Theorem (Contraction mapping). For a γ-contraction F in a complete normed vector space X:
• Iterative application of F converges to a unique fixed point in X
• at a linear convergence rate γ

Remark. In general we only need a metric (vs. normed) space.
Value Function Space

• Consider the vector space V over value functions
• There are |S| dimensions
• Each point in this space fully specifies a value function v(s)
• Does the Bellman backup bring value functions closer in this space?
• If so, the backup must converge to a unique solution
Value Function ∞-Norm

• We will measure the distance between state-value functions u and v by the ∞-norm
• i.e. the largest difference between state values:

||u − v||_∞ = max_{s∈S} |u(s) − v(s)|
Bellman Expectation Backup is a Contraction

• Define the Bellman expectation backup operator F_π(v) = r_π + γ T_π v
• This operator is a γ-contraction, i.e. it makes value functions closer by at least γ:

||F_π(u) − F_π(v)||_∞ = ||(r_π + γ T_π u) − (r_π + γ T_π v)||_∞
                      = ||γ T_π (u − v)||_∞
                      ≤ γ ||u − v||_∞

(The last step holds because T_π is a stochastic matrix: each entry of T_π(u − v) is a convex combination of the entries of u − v.)
Convergence of Iter. Policy Evaluation and Policy Iteration

• The Bellman expectation operator F_π has a unique fixed point
• v_π is a fixed point of F_π (by the Bellman expectation equation)
• By the contraction mapping theorem
• Iterative policy evaluation converges on v_π
Policy Improvement

• Suppose we have computed v_π for a deterministic policy π
• For a given state s, would it be better to do an action a ≠ π(s)?
• It is better to switch to action a for state s if and only if q_π(s, a) > v_π(s)
• And we can compute q_π(s, a) from v_π by:

q_π(s, a) = E[R_{t+1} + γ v_π(S_{t+1}) | S_t = s, A_t = a]
          = r(s, a) + γ Σ_{s'∈S} T(s'|s, a) v_π(s')
Policy Improvement Cont.

• Do this for all states to get a new policy π' ≥ π that is greedy with respect to v_π:

π'(s) = argmax_a q_π(s, a)
      = argmax_a E[R_{t+1} + γ v_π(S_{t+1}) | S_t = s, A_t = a]
      = argmax_a ( r(s, a) + γ Σ_{s'∈S} T(s'|s, a) v_π(s') )

• What if the policy is unchanged by this?
• Then the policy must be optimal!
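In code, the improvement step is one lookahead plus an argmax. A sketch under the same array conventions as before (the helper names are mine, not from the slides):

import numpy as np

def q_from_v(T, r, v, gamma):
    # q_pi(s,a) = r(s,a) + gamma * sum_s' T(s'|s,a) v_pi(s')
    return r + gamma * (T @ v)              # shape (S, A)

def greedy_policy(T, r, v, gamma):
    # pi'(s) = argmax_a q_pi(s,a); returns one action index per state
    return np.argmax(q_from_v(T, r, v, gamma), axis=1)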
Policy Iteration

π_0 --E--> v_{π_0} --I--> π_1 --E--> v_{π_1} --I--> π_2 --E--> ... --I--> π* --E--> v*

where E denotes policy evaluation and I denotes policy improvement ("greedification").
Policy Iteration (using iterative policy evaluation) for v*, from Sutton & Barto, Chapter 4:

1. Initialization
   V(s) ∈ ℝ and π(s) ∈ A(s) arbitrarily for all s ∈ S

2. Policy Evaluation
   Repeat
     Δ ← 0
     For each s ∈ S:
       v ← V(s)
       V(s) ← r(s, π(s)) + γ Σ_{s'∈S} T(s'|s, π(s)) V(s')
       Δ ← max(Δ, |v − V(s)|)
   until Δ < θ (a small positive number)

3. Policy Improvement
   policy-stable ← true
   For each s ∈ S:
     a ← π(s)
     π(s) ← argmax_a ( r(s, a) + γ Σ_{s'∈S} T(s'|s, a) V(s') )
     If a ≠ π(s), then policy-stable ← false
   If policy-stable, then stop and return V and π; else go to 2

Caption (from the book): this algorithm has a subtle bug, in that it may never terminate if the policy continually switches between two or more policies that are equally good. The bug can be fixed by adding additional flags, but it makes the pseudocode uglier than it is worth. :-)
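Putting the two steps together, a compact numpy sketch of the boxed algorithm (using an exact linear-solve evaluation instead of the inner loop; array conventions as before, not from the slides):

import numpy as np

def policy_iteration(T, r, gamma):
    num_states = r.shape[0]
    pi = np.zeros(num_states, dtype=int)        # arbitrary initial deterministic policy
    idx = np.arange(num_states)
    while True:
        # Policy evaluation: solve v_pi exactly for the current policy.
        T_pi, r_pi = T[idx, pi], r[idx, pi]     # (S, S) and (S,)
        v = np.linalg.solve(np.eye(num_states) - gamma * T_pi, r_pi)
        # Policy improvement: greedify with respect to v_pi.
        pi_new = np.argmax(r + gamma * (T @ v), axis=1)
        if np.array_equal(pi_new, pi):          # policy stable => optimal
            return v, pi
        pi = pi_new

Note that np.argmax breaks ties by lowest index, which incidentally avoids the non-termination issue in the caption: equally good policies always greedify to the same choice.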
Iterative Policy Eval for the Small Gridworld

Policy π: the equiprobable random action policy.

• An undiscounted episodic task (γ = 1)
• Nonterminal states: 1, 2, ..., 14
• Terminal state: one, shown in the shaded squares
• Actions that take the agent off the grid leave the state unchanged
• Reward is -1 until the terminal state is reached

[Figure: v_[k] for k = 0, 1, 2, 3, 10, ∞, alongside the greedy policy with respect to each v_[k].]
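To make the array conventions concrete, a sketch that builds T and r for this gridworld (it treats the two shaded cells as absorbing copies of the terminal state, an assumption of this sketch that yields the same values):

import numpy as np

def small_gridworld():
    # 4x4 grid, cells 0 and 15 terminal; reward -1 per step elsewhere; gamma = 1.
    S, A = 16, 4                               # actions: up, down, left, right
    deltas = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    T = np.zeros((S, A, S))
    r = np.full((S, A), -1.0)
    for s in range(S):
        if s in (0, 15):                       # terminal: absorbing, zero reward
            T[s, :, s], r[s, :] = 1.0, 0.0
            continue
        row, col = divmod(s, 4)
        for a, (dr, dc) in enumerate(deltas):
            nr, nc = row + dr, col + dc
            if 0 <= nr < 4 and 0 <= nc < 4:    # off-grid moves leave the state unchanged
                T[s, a, nr * 4 + nc] = 1.0
            else:
                T[s, a, s] = 1.0
    return T, r

Running iterative_policy_evaluation(T, r, np.full((16, 4), 0.25), gamma=1.0) on this model should reproduce the v_[k] sequence in the figure.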
Generalized Policy Iteration

• Does policy evaluation need to converge to v_π?
• Or should we introduce a stopping condition
  • e.g. ε-convergence of the value function
• Or simply stop after k iterations of iterative policy evaluation?
  • For example, in the small grid world k = 3 was sufficient to achieve the optimal policy
• Why not update the policy every iteration? i.e. stop after k = 1
  • This is equivalent to value iteration (next section)
Generalized Policy Iteration

Generalized Policy Iteration (GPI): any interleaving of policy evaluation and policy improvement, independent of their granularity.

[Figure: a geometric metaphor for the convergence of GPI: starting from (V_0, π_0), evaluation steps (V → v_π) and improvement steps (π → greedy(V)) alternately pull the pair toward the two constraint lines V = v_π and π = greedy(V), converging to (v*, π*).]
Principle of Optimality

• Any optimal policy can be subdivided into two components:
  • An optimal first action A*
  • Followed by an optimal policy from the successor state S'
• Theorem (Principle of Optimality)
  • A policy π(a|s) achieves the optimal value from state s, v_π(s) = v*(s), if and only if
  • for any state s' reachable from s, π achieves the optimal value from state s', v_π(s') = v*(s')
Value Iteration

• If we know the solution to the subproblems v*(s')
• Then the solution v*(s) can be found by one-step lookahead:

v*(s) ← max_{a∈A} ( r(s, a) + γ Σ_{s'∈S} T(s'|s, a) v*(s') )

• The idea of value iteration is to apply these updates iteratively
• Intuition: start with final rewards and work backwards
• Still works with loopy, stochastic MDPs
Example: Shortest Path
The problem: a grid with goal state g in the top-left corner and a reward of -1 per step (undiscounted). Each sweep of value iteration propagates the correct shortest-path values one step further from the goal:

V1:   0  0  0  0      V2:   0 -1 -1 -1      V3:   0 -1 -2 -2      V4:   0 -1 -2 -3
      0  0  0  0           -1 -1 -1 -1           -1 -2 -2 -2           -1 -2 -3 -3
      0  0  0  0           -1 -1 -1 -1           -2 -2 -2 -2           -2 -3 -3 -3
      0  0  0  0           -1 -1 -1 -1           -2 -2 -2 -2           -3 -3 -3 -3

V5:   0 -1 -2 -3      V6:   0 -1 -2 -3      V7:   0 -1 -2 -3
     -1 -2 -3 -4           -1 -2 -3 -4           -1 -2 -3 -4
     -2 -3 -4 -4           -2 -3 -4 -5           -2 -3 -4 -5
     -3 -4 -4 -4           -3 -4 -5 -5           -3 -4 -5 -6
Value Iteration

• Problem: find the optimal policy π*
• Solution: iterative application of the Bellman optimality backup
• v_1 → v_2 → ... → v*
• Using synchronous backups
  • At each iteration k + 1
  • For all states s ∈ S
  • Update v_{k+1}(s) from v_k(s')
Value Iteration (2)

v_{k+1}(s) = max_{a∈A} ( r(s, a) + γ Σ_{s'∈S} T(s'|s, a) v_k(s') )

or, in matrix form,

v_{k+1} = max_{a∈A} ( r(a) + γ T(a) v_k )
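A minimal numpy sketch of synchronous value iteration (same conventions as before: T is |S|×|A|×|S|, r is |S|×|A|; not from the slides):

import numpy as np

def value_iteration(T, r, gamma, theta=1e-8):
    v = np.zeros(T.shape[0])
    while True:
        # Bellman optimality backup, applied synchronously to all states.
        v_new = np.max(r + gamma * (T @ v), axis=1)
        if np.max(np.abs(v_new - v)) < theta:
            break
        v = v_new
    pi = np.argmax(r + gamma * (T @ v_new), axis=1)  # greedy policy from v*
    return v_new, pi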
Bellman Optimality Backup is a Contraction

• Define the Bellman optimality backup operator F*,

F*(v) = max_{a∈A} ( r(a) + γ T(a) v )

• This operator is a γ-contraction, i.e. it makes value functions closer by at least γ (similar to the previous proof):

||F*(u) − F*(v)||_∞ ≤ γ ||u − v||_∞
Convergence of Value Iteration

• The Bellman optimality operator F* has a unique fixed point
• v* is a fixed point of F* (by the Bellman optimality equation)
• By the contraction mapping theorem
• Value iteration converges on v*
Synchronous Dynamic Programming Algorithms

• Algorithms are based on the state-value function v_π(s) or v*(s)
• Complexity O(m n²) per iteration, for m actions and n states
• Could also apply to the action-value function q_π(s, a) or q*(s, a)
• Complexity O(m² n²) per iteration

Problem    | Bellman Equation                                          | Algorithm
-----------|-----------------------------------------------------------|----------------------------
Prediction | Bellman Expectation Equation                              | Iterative Policy Evaluation
Control    | Bellman Expectation Equation + Greedy Policy Improvement  | Policy Iteration
Control    | Bellman Optimality Equation                               | Value Iteration
Efficiency of DP

• Finding an optimal policy is polynomial in the number of states…
• BUT, the number of states is often astronomical, e.g., often growing exponentially with the number of state variables (what Bellman called "the curse of dimensionality").
• In practice, classical DP can be applied to problems with a few million states.
Asynchronous DP
• All the DP methods described so far require exhaustive sweeps of the entire state set.
• Asynchronous DP does not use sweeps. Instead it works like this:
• Repeat until convergence criterion is met:
• Pick a state at random and apply the appropriate backup
• Still need lots of computation, but does not get locked into hopelessly long sweeps
• Guaranteed to converge if all states continue to be selected
• Can you select states to back up intelligently? YES: an agent’s experience can act as a guide.
Asynchronous Dynamic Programming
• Three simple ideas for asynchronous dynamic programming:
• In-place dynamic programming
• Prioritized sweeping
• Real-time dynamic programming
In-Place Dynamic Programming

• Synchronous value iteration stores two copies of the value function:

for all s in S:
  v_new(s) ← max_{a∈A} ( r(s, a) + γ Σ_{s'∈S} T(s'|s, a) v_old(s') )

• In-place value iteration only stores one copy of the value function:

for all s in S:
  v(s) ← max_{a∈A} ( r(s, a) + γ Σ_{s'∈S} T(s'|s, a) v(s') )
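The in-place variant is the same backup applied to a single array; in a sketch (conventions as before), only the loop body changes:

import numpy as np

def in_place_value_iteration(T, r, gamma, num_sweeps=100):
    v = np.zeros(T.shape[0])
    for _ in range(num_sweeps):
        for s in range(T.shape[0]):
            # Overwrite immediately: later states in this sweep already
            # see the updated value of state s.
            v[s] = np.max(r[s] + gamma * (T[s] @ v))
    return v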
Prioritized Sweeping

• Use the magnitude of the Bellman error to guide state selection, e.g.

| max_{a∈A} ( r(s, a) + γ Σ_{s'∈S} T(s'|s, a) v(s') ) − v(s) |

• Back up the state with the largest remaining Bellman error
• Update the Bellman error of affected states after each backup
• Requires knowledge of the reverse dynamics (predecessor states)
• Can be implemented efficiently by maintaining a priority queue
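A heap-based sketch of the idea (conventions as before; stale priorities are simply left in the heap and trigger harmless extra backups, a common simplification):

import heapq
import numpy as np

def prioritized_sweeping(T, r, gamma, num_backups=10000):
    S = T.shape[0]
    v = np.zeros(S)
    # Reverse dynamics: predecessors[s'] = states s with T(s'|s,a) > 0 for some a.
    predecessors = [np.nonzero(T[:, :, sp].max(axis=1) > 0)[0] for sp in range(S)]

    def bellman_error(s):
        return abs(np.max(r[s] + gamma * (T[s] @ v)) - v[s])

    heap = [(-bellman_error(s), s) for s in range(S)]   # max-heap via negation
    heapq.heapify(heap)
    for _ in range(num_backups):
        if not heap:
            break
        _, s = heapq.heappop(heap)                      # largest (possibly stale) error
        v[s] = np.max(r[s] + gamma * (T[s] @ v))        # back up that state
        for p in predecessors[s]:                       # only predecessors' errors changed
            heapq.heappush(heap, (-bellman_error(p), p))
    return v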
Real-time Dynamic Programming

• Idea: only update states that are relevant to the agent
• Use the agent’s experience to guide the selection of states
• After each time-step S_t, A_t, R_{t+1}
• Back up the state S_t:

v(S_t) ← max_{a∈A} ( r(S_t, a) + γ Σ_{s'∈S} T(s'|S_t, a) v(s') )
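A sketch of the update inside an agent loop (the policy and step arguments are hypothetical stand-ins for an action-selection rule and an environment sampler; not from the slides):

import numpy as np

def rtdp(T, r, gamma, policy, step, s0, num_steps=1000):
    v = np.zeros(T.shape[0])
    s = s0
    for _ in range(num_steps):
        v[s] = np.max(r[s] + gamma * (T[s] @ v))  # back up only the visited state
        a = policy(v, s)                          # e.g. greedy with respect to v
        s = step(s, a)                            # sample s' ~ T(.|s, a)
    return v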
Sample Backups

• In subsequent lectures we will consider sample backups
• Using sample rewards and sample transitions ⟨S, A, R, S'⟩
• Instead of the reward function r and transition dynamics T
• Advantages:
  • Model-free: no advance knowledge of the MDP required
  • Breaks the curse of dimensionality through sampling
  • Cost of a backup is constant, independent of n = |S|
Approximate Dynamic Programming

• Approximate the value function v(s, w)
• Using a function approximator v(·, w)
• Apply dynamic programming to v(·, w)
• e.g. Fitted Value Iteration repeats at each iteration k:
  • Sample states S̃ ⊆ S
  • For each state s ∈ S̃, estimate the target value using the Bellman optimality equation:

ṽ_k(s) = max_{a∈A} ( r(s, a) + γ Σ_{s'∈S} T(s'|s, a) v(s', w_k) )

  • Train the next value function v(·, w_{k+1}) using targets {⟨s, ṽ_k(s)⟩}
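A linear-function-approximation sketch of fitted value iteration (phi is an assumed |S|×d feature matrix and the batch size should comfortably exceed d; any regressor could replace the least-squares fit):

import numpy as np

def fitted_value_iteration(T, r, gamma, phi, num_iters=50, batch=64, seed=0):
    rng = np.random.default_rng(seed)
    w = np.zeros(phi.shape[1])
    for _ in range(num_iters):
        states = rng.choice(phi.shape[0], size=batch)   # sample states S~
        v_w = phi @ w                                   # v(s', w_k) for all states
        # Bellman optimality targets at the sampled states.
        targets = np.max(r[states] + gamma * (T[states] @ v_w), axis=1)
        # Train v(., w_{k+1}) on {<s, v~_k(s)>} by least squares.
        w, *_ = np.linalg.lstsq(phi[states], targets, rcond=None)
    return w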
Linear Programming (LP)

• Recall, at value iteration convergence we have, ∀s ∈ S:

V*(s) = max_{a∈A} ( r(s, a) + γ Σ_{s'∈S} T(s'|s, a) V*(s') )

• LP formulation to find V*:

min_V Σ_{s∈S} μ_0(s) V(s)
s.t. ∀s ∈ S, ∀a ∈ A:  V(s) ≥ r(s, a) + γ Σ_{s'∈S} T(s'|s, a) V(s')

where μ_0 is a probability distribution over S, with μ_0(s) > 0 for all s in S.

Theorem. V* is the solution to the above LP.
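The LP can be handed to an off-the-shelf solver; a sketch with scipy (assumed available), building one constraint row per (s, a) pair:

import numpy as np
from scipy.optimize import linprog

def solve_mdp_lp(T, r, gamma, mu0):
    S, A = r.shape
    # Rewrite V(s) >= r(s,a) + gamma * T(.|s,a)^T V  as  (gamma*T - e_s)^T V <= -r(s,a).
    A_ub = np.zeros((S * A, S))
    b_ub = np.zeros(S * A)
    for s in range(S):
        for a in range(A):
            row = gamma * T[s, a]                 # fresh array; safe to modify
            row[s] -= 1.0
            A_ub[s * A + a], b_ub[s * A + a] = row, -r[s, a]
    res = linprog(c=mu0, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * S)
    return res.x                                  # V*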
LP cont’d

• Let F be the Bellman operator, i.e., V_{i+1} = F(V_i). Then the LP can be written as:

min_V μ_0ᵀ V
s.t. V ≥ F(V)

• Monotonicity Property: If U ≥ V then F(U) ≥ F(V).
• Hence, if V ≥ F(V), then F(V) ≥ F(F(V)), and by repeated application,

V ≥ F(V) ≥ F²(V) ≥ F³(V) ≥ ... ≥ F^∞(V) = V*

• Any feasible solution to the LP must satisfy V ≥ F(V), and hence must satisfy V ≥ V*. Hence, assuming all entries in μ_0 are positive, V* is the optimal solution to the LP.
LP: The Dual

max_λ Σ_{s∈S} Σ_{a∈A} λ(s, a) r(s, a)                                              (1)
s.t. ∀s' ∈ S:  Σ_{a'∈A} λ(s', a') = μ_0(s') + γ Σ_{s∈S} Σ_{a∈A} λ(s, a) T(s'|s, a)  (2)
     λ(s, a) ≥ 0

• Interpretation: λ(s, a) = Σ_{t=0}^∞ γᵗ P(s_t = s, a_t = a), the expected discounted visitation frequency of the state-action pair (s, a)
• Equation (2) ensures that λ has the above meaning
• Equation (1) maximizes the expected discounted sum of rewards
• Optimal policy: π*(s) = argmax_a λ(s, a)