
Week 3: Model-free Prediction and Control

Bolei Zhou

The Chinese University of Hong Kong

September 19, 2021


Announcement

1 Project proposal due by the end of next week
  1 2-page proposal in LaTeX. Template: https://github.com/cuhkrlcourse/RLexample/tree/master/project_template
  2 Submit through Blackboard, with title: 2021fall-IERG5350-proposal
  3 Please list each team member's name and student ID in the proposal
2 Assignment 1 is due at 23:59 on October 3 (Sunday), 2021
  1 https://github.com/cuhkrlcourse/ierg5350-assignment-2021
  2 Please start early!


This Week’s Plan

1 Last week
  1 MDP, policy evaluation, policy iteration and value iteration for solving a known MDP
2 This week
  1 Model-free prediction: estimate the value function of an unknown MDP
  2 Model-free control: optimize the value function of an unknown MDP


Review on the control in MDP

1 When the MDP is known
  1 Both R and P are exposed to the agent
  2 Therefore we can run policy iteration and value iteration
2 Policy iteration: given a known MDP, compute the optimal policy and the optimal value function
  1 Policy evaluation: iterate on the Bellman expectation backup

      v_i(s) = ∑_{a∈A} π(a|s) ( R(s,a) + γ ∑_{s′∈S} P(s′|s,a) v_{i−1}(s′) )

  2 Policy improvement: act greedily with respect to the action-value function q

      q_{π_i}(s,a) = R(s,a) + γ ∑_{s′∈S} P(s′|s,a) v_{π_i}(s′)
      π_{i+1}(s) = arg max_a q_{π_i}(s,a)


Review on the control in MDP

1 Value iteration: given a known MDP, compute the optimal value function
2 Iterate on the Bellman optimality backup

   v_{i+1}(s) ← max_{a∈A} ( R(s,a) + γ ∑_{s′∈S} P(s′|s,a) v_i(s′) )

3 To retrieve the optimal policy after value iteration:

   π*(s) ← arg max_a ( R(s,a) + γ ∑_{s′∈S} P(s′|s,a) v_end(s′) )   (1)
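To make the two review slides above concrete, here is a minimal tabular value-iteration sketch in Python. The MDP representation (dictionaries R[s][a] for rewards and P[s][a] as lists of (prob, next_state) pairs), the discount, and the tolerance are hypothetical placeholders, not code from the course repository.

def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    # Assumed format: P[s][a] is a list of (prob, next_state) pairs, R[s][a] is a scalar reward.
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # Bellman optimality backup: v(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) v(s') ]
            best = max(R[s][a] + gamma * sum(p * v[s2] for p, s2 in P[s][a]) for a in actions)
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            break
    # Retrieve the optimal policy from the converged values, as in Eq. (1)
    pi = {s: max(actions, key=lambda a: R[s][a] + gamma * sum(p * v[s2] for p, s2 in P[s][a]))
          for s in states}
    return v, pi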


RL when knowing how the world works

1 Both policy iteration and value iteration assume direct access to the dynamics and rewards of the environment
2 In many real-world problems, the MDP model is either unknown, or known but too big or too complex to use
  1 Atari games, the game of Go, helicopter control, portfolio management, etc.


Model-free RL: Learning through interactions

1 Model-free RL can solve these problems through interaction with the environment
2 No more direct access to the known transition dynamics and reward function
3 Trajectories/episodes are collected through the agent's interaction with the environment
4 Each trajectory/episode contains {S_1, A_1, R_2, S_2, A_2, R_3, ..., S_T}


Model-free prediction: policy evaluation without access to the model

1 Estimating the expected return of a particular policy when we do not have access to the MDP model
  1 Monte-Carlo policy evaluation
  2 Temporal-Difference (TD) learning


Monte-Carlo Policy Evaluation

1 Return: G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... under policy π
2 v_π(s) = E_{τ∼π}[ G_t | s_t = s ], i.e., the expectation is over trajectories τ generated by following π
3 MC simulation: we can simply sample a lot of trajectories, compute the actual returns for all of them, then average
4 MC policy evaluation uses the empirical mean return instead of the expected return
5 MC does not require MDP dynamics/rewards, does not bootstrap, and does not assume the state is Markov
6 Only applies to episodic MDPs (each episode terminates)


Monte-Carlo Policy Evaluation

1 To evaluate state value v(s)
  1 Every time-step t that state s is visited in an episode,
  2 Increment counter N(s) ← N(s) + 1
  3 Increment total return S(s) ← S(s) + G_t
  4 Value is estimated by the mean return v(s) = S(s)/N(s)
2 By the law of large numbers, v(s) → v_π(s) as N(s) → ∞


Monte-Carlo Policy Evaluation

1 How to calculate G_t: compute the returns backward from the end of the episode, as in the sketch below
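A minimal backward pass over one episode's rewards; the reward list and discount factor here are made-up inputs for illustration.

def compute_returns(rewards, gamma=0.99):
    # rewards[t] holds R_{t+1}; returns[t] will hold G_t = R_{t+1} + gamma * G_{t+1}
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# Example: compute_returns([0, 0, 1], gamma=0.5) gives [0.25, 0.5, 1.0]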


Incremental Mean

Compute the mean of the samples x_1, x_2, ... incrementally:

   μ_t = (1/t) ∑_{j=1}^{t} x_j
       = (1/t) ( x_t + ∑_{j=1}^{t−1} x_j )
       = (1/t) ( x_t + (t − 1) μ_{t−1} )
       = μ_{t−1} + (1/t) ( x_t − μ_{t−1} )
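A quick numeric check of this identity (the sample values are arbitrary):

xs = [4.0, 10.0, 1.0, 7.0]
mu = 0.0
for t, x in enumerate(xs, start=1):
    mu = mu + (x - mu) / t          # incremental update: mu_t = mu_{t-1} + (x_t - mu_{t-1}) / t
print(mu, sum(xs) / len(xs))        # both print 5.5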


Incremental MC Updates

1 Collect one episode (S_1, A_1, R_2, ..., S_T)
2 For each state S_t with computed return G_t:

   N(S_t) ← N(S_t) + 1
   v(S_t) ← v(S_t) + (1/N(S_t)) ( G_t − v(S_t) )

3 Or use a running mean (old episodes are forgotten). Good for non-stationary problems.

   v(S_t) ← v(S_t) + α ( G_t − v(S_t) )
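A minimal every-visit MC prediction sketch that combines the two updates above. It assumes a gym-style environment (env.reset() and env.step() returning the old 4-tuple) and a policy callable; both are stand-ins, not the course's actual code.

from collections import defaultdict

def mc_prediction(env, policy, num_episodes=1000, gamma=0.99, alpha=None):
    # If alpha is None, use the exact running mean 1/N(s); otherwise a constant step size.
    V = defaultdict(float)
    N = defaultdict(int)
    for _ in range(num_episodes):
        # Roll out one full episode under the given policy
        episode = []                        # list of (state, reward) pairs
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s2, r, done, _ = env.step(a)
            episode.append((s, r))
            s = s2
        # Backward pass: accumulate returns and update the value estimates
        g = 0.0
        for s, r in reversed(episode):
            g = r + gamma * g
            N[s] += 1
            step = alpha if alpha is not None else 1.0 / N[s]
            V[s] += step * (g - V[s])
    return V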


Difference between DP and MC for policy evaluation

1 Dynamic programming (DP) computes v_i by bootstrapping: the rest of the expected return is replaced by the value estimate v_{i−1}
2 Iteration on the Bellman expectation backup:

   v_i(s) ← ∑_{a∈A} π(a|s) ( R(s,a) + γ ∑_{s′∈S} P(s′|s,a) v_{i−1}(s′) )


Difference between DP and MC for policy evaluation

1 MC updates the empirical mean return with one sampled episode

v(S_t) ← v(S_t) + α ( G_{i,t} − v(S_t) )


Advantages of MC over DP

1 MC works when the environment is unknown

2 Working with sample episodes has a huge advantage even when one has complete knowledge of the environment's dynamics, for example when the transition probabilities are too complex to compute with
3 The cost of estimating a single state's value is independent of the total number of states, so you can sample episodes starting from the states of interest and then average the returns


Temporal-Difference (TD) Learning

1 TD methods learn directly from episodes of experience

2 TD is model-free: no knowledge of MDP transitions/rewards

3 TD learns from incomplete episodes, by bootstrapping


Temporal-Difference (TD) Learning

1 Objective: learn v_π online from experience under policy π
2 Simplest TD algorithm: TD(0)
  1 Update v(S_t) toward the estimated return R_{t+1} + γ v(S_{t+1})

      v(S_t) ← v(S_t) + α ( R_{t+1} + γ v(S_{t+1}) − v(S_t) )

3 R_{t+1} + γ v(S_{t+1}) is called the TD target
4 δ_t = R_{t+1} + γ v(S_{t+1}) − v(S_t) is called the TD error
5 Comparison: incremental Monte-Carlo
  1 Update v(S_t) toward the actual return G_t of episode i

      v(S_t) ← v(S_t) + α ( G_{i,t} − v(S_t) )
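A minimal TD(0) prediction sketch under the same assumed gym-style interface as before; the step size and episode count are arbitrary defaults.

from collections import defaultdict

def td0_prediction(env, policy, num_episodes=1000, gamma=0.99, alpha=0.1):
    # Update V(S_t) toward the TD target R_{t+1} + gamma * V(S_{t+1}) after every step.
    V = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s2, r, done, _ = env.step(a)
            target = r + (0.0 if done else gamma * V[s2])    # TD target
            V[s] += alpha * (target - V[s])                  # step toward the target by the TD error
            s = s2
    return V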


Advantages of TD over MC


Comparison of TD and MC

1 TD can learn online after every step

2 MC must wait until end of episode before return is known

3 TD can learn from incomplete sequences

4 MC can only learn from complete sequences

5 TD works in continuing (non-terminating) environments

6 MC only works for episodic (terminating) environments

7 TD exploits Markov property, more efficient in Markov environments

8 MC does not exploit the Markov property, more effective in non-Markov environments


n-step TD

1 n-step TD methods generalize both one-step TD and MC.

2 We can shift from one to the other smoothly as needed to meet thedemands of a particular task.


n-step TD prediction

1 Consider the following n-step returns for n = 1, 2, ..., ∞

   n = 1 (TD):  G_t^(1) = R_{t+1} + γ v(S_{t+1})
   n = 2:       G_t^(2) = R_{t+1} + γ R_{t+2} + γ^2 v(S_{t+2})
   ...
   n = ∞ (MC):  G_t^(∞) = R_{t+1} + γ R_{t+2} + ... + γ^{T−t−1} R_T

2 Thus the n-step return is defined as

   G_t^(n) = R_{t+1} + γ R_{t+2} + ... + γ^{n−1} R_{t+n} + γ^n v(S_{t+n})

3 n-step TD: v(S_t) ← v(S_t) + α ( G_t^(n) − v(S_t) )
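A small helper that computes G_t^(n) from the stored rewards and value estimates of one episode, falling back to the plain MC return when t + n runs past the end of the episode. The array layout (rewards[k] holding R_{k+1}, values[k] holding v(S_k)) is an assumption for illustration.

def n_step_return(rewards, values, t, n, gamma=0.99):
    # G_t^(n) = R_{t+1} + gamma R_{t+2} + ... + gamma^{n-1} R_{t+n} + gamma^n v(S_{t+n})
    T = len(rewards)
    horizon = min(n, T - t)
    g = sum(gamma ** k * rewards[t + k] for k in range(horizon))
    if t + n < T:                        # bootstrap only if S_{t+n} is non-terminal
        g += gamma ** n * values[t + n]
    return g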

Bootstrapping and Sampling for DP, MC, and TD

1 Bootstrapping: the update involves an estimate
  1 MC does not bootstrap
  2 DP bootstraps
  3 TD bootstraps
2 Sampling: the update samples an expectation
  1 MC samples
  2 DP does not sample
  3 TD samples


Unified View: Dynamic Programming Backup

v(S_t) ← E_π[ R_{t+1} + γ v(S_{t+1}) ]


Unified View: Monte-Carlo Backup

v(S_t) ← v(S_t) + α ( G_t − v(S_t) )


Unified View: Temporal-Difference Backup

TD(0): v(S_t) ← v(S_t) + α ( R_{t+1} + γ v(S_{t+1}) − v(S_t) )


Unified View of Reinforcement Learning


Summary

1 Model-free prediction
  1 Evaluate the state value by only interacting with the environment
  2 Many algorithms can do it: Temporal-Difference learning and the Monte-Carlo method


Model-free Control for MDP

1 Model-free control:
  1 Optimize the value function of an unknown MDP
  2 Generate an optimal control policy
2 But is there model-based control? What is it? We will come back to it later

3 Generalized Policy Iteration (GPI) with MC or TD in the loop


Revisiting Policy Iteration

1 Iterate through the two steps:
  1 Evaluate the policy π (compute v given the current π)
  2 Improve the policy by acting greedily with respect to v_π

      π′ = greedy(v_π)   (2)


Policy Iteration for a Known MDP

1 Compute the state-action value of a policy π:

   q_{π_i}(s,a) = R(s,a) + γ ∑_{s′∈S} P(s′|s,a) v_{π_i}(s′)

2 Compute the new policy π_{i+1} for all s ∈ S following

   π_{i+1}(s) = arg max_a q_{π_i}(s,a)   (3)

3 Problem: what to do if neither R(s,a) nor P(s′|s,a) is known/available?


Generalized Policy Iteration with Action-Value Function

Monte Carlo version of policy iteration

1 Policy evaluation: Monte-Carlo policy evaluation, Q = q_π
2 Policy improvement: greedy policy improvement?

   π(s) = arg max_a q(s,a)


Monte Carlo with Exploring Starts

1 One assumption used to guarantee the convergence of policy iteration: episodes have exploring starts

2 Exploring starts can ensure all actions are selected infinitely often


Monte Carlo with ε-Greedy Exploration

1 Trade-off between exploration and exploitation (we will talk about this in a later lecture)
2 ε-greedy exploration: ensuring continual exploration
  1 All actions are tried with non-zero probability
  2 With probability 1 − ε choose the greedy action
  3 With probability ε choose an action at random

   π′(a|s) = ε/|A| + 1 − ε   if a = a* = arg max_{a′∈A} Q(s, a′)
   π′(a|s) = ε/|A|           otherwise
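A minimal ε-greedy action-selection sketch; Q is assumed to be a dictionary keyed by (state, action) pairs. Note that the random branch can also pick the greedy action, so the greedy action receives total probability ε/|A| + 1 − ε, matching the definition above.

import random

def epsilon_greedy(Q, state, actions, epsilon):
    # With probability epsilon take a uniformly random action, otherwise the greedy one.
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])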


Monte Carlo with ε-Greedy Exploration

1 ε-greedy policy improvement theorem: for any ε-greedy policy π, the ε-greedy policy π′ with respect to q_π is an improvement, i.e., v_{π′}(s) ≥ v_π(s)

   q_π(s, π′(s)) = ∑_{a∈A} π′(a|s) q_π(s,a)
                 = (ε/|A|) ∑_{a∈A} q_π(s,a) + (1 − ε) max_a q_π(s,a)
                 ≥ (ε/|A|) ∑_{a∈A} q_π(s,a) + (1 − ε) ∑_{a∈A} ( (π(a|s) − ε/|A|) / (1 − ε) ) q_π(s,a)
                 = ∑_{a∈A} π(a|s) q_π(s,a) = v_π(s)

   (the inequality holds because the max is at least any weighted average: the weights (π(a|s) − ε/|A|)/(1 − ε) are non-negative for an ε-greedy π and sum to 1)

Therefore, from the policy improvement theorem, v_{π′}(s) ≥ v_π(s)


Monte Carlo with ε-Greedy Exploration

Algorithm 1

 1: Initialize Q(S,A) = 0, N(S,A) = 0, ε = 1, k = 1
 2: π_k = ε-greedy(Q)
 3: loop
 4:   Sample the k-th episode (S_1, A_1, R_2, ..., S_T) ∼ π_k
 5:   for each state S_t and action A_t in the episode do
 6:     N(S_t, A_t) ← N(S_t, A_t) + 1
 7:     Q(S_t, A_t) ← Q(S_t, A_t) + (1/N(S_t, A_t)) ( G_t − Q(S_t, A_t) )
 8:   end for
 9:   k ← k + 1, ε ← 1/k
10:   π_k = ε-greedy(Q)
11: end loop
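A Python sketch of the loop above (every-visit version), reusing the epsilon_greedy helper from the earlier sketch and the same assumed gym-style environment; it is an illustration, not the course's reference implementation.

from collections import defaultdict

def mc_control(env, actions, num_episodes=5000, gamma=0.99):
    # Monte-Carlo control with epsilon-greedy exploration and the GLIE schedule epsilon = 1/k.
    Q = defaultdict(float)
    N = defaultdict(int)
    for k in range(1, num_episodes + 1):
        epsilon = 1.0 / k
        # Sample one episode with the current epsilon-greedy policy
        episode = []                              # (state, action, reward) triples
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(Q, s, actions, epsilon)
            s2, r, done, _ = env.step(a)
            episode.append((s, a, r))
            s = s2
        # Backward pass: update Q(s, a) toward the observed returns
        g = 0.0
        for s, a, r in reversed(episode):
            g = r + gamma * g
            N[(s, a)] += 1
            Q[(s, a)] += (g - Q[(s, a)]) / N[(s, a)]
    return Q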


MC vs. TD for Prediction and Control

1 Temporal-Difference (TD) learning has several advantages over Monte-Carlo (MC)
  1 Lower variance
  2 Online
  3 Works with incomplete sequences
2 So we can use TD instead of MC in our control loop
  1 Apply TD to Q(S,A)
  2 Use ε-greedy policy improvement
  3 Update every time-step rather than at the end of one episode


Recall: TD Prediction

1 An episode consists of an alternating sequence of states and state–action pairs
2 TD(0) method for estimating the value function V(S):

   A_t ← action given by π for S_t
   Take action A_t, observe R_{t+1} and S_{t+1}
   V(S_t) ← V(S_t) + α [ R_{t+1} + γ V(S_{t+1}) − V(S_t) ]

3 How about estimating the action-value function Q(S,A)?


Sarsa: On-Policy TD Control

1 An episode consists of an alternating sequence of states and state–action pairs
2 Act ε-greedily for one step, then bootstrap the action-value function:

   Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_{t+1} + γ Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t) ]

3 The update is done after every transition from a nonterminal state S_t
4 The TD target is R_{t+1} + γ Q(S_{t+1}, A_{t+1})


Sarsa algorithm
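A minimal Sarsa sketch in Python, assuming the gym-style interface and the epsilon_greedy helper from the earlier sketch; a fixed ε is used for simplicity.

from collections import defaultdict

def sarsa(env, actions, num_episodes=5000, gamma=0.99, alpha=0.1, epsilon=0.1):
    # On-policy TD control: the action taken and the action in the TD target
    # both come from the same epsilon-greedy policy.
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, epsilon)
        done = False
        while not done:
            s2, r, done, _ = env.step(a)
            a2 = epsilon_greedy(Q, s2, actions, epsilon)
            target = r if done else r + gamma * Q[(s2, a2)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q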


n-step Sarsa

1 Consider the following n-step Q-returns for n = 1, 2, ..., ∞

   n = 1 (Sarsa):  q_t^(1) = R_{t+1} + γ Q(S_{t+1}, A_{t+1})
   n = 2:          q_t^(2) = R_{t+1} + γ R_{t+2} + γ^2 Q(S_{t+2}, A_{t+2})
   ...
   n = ∞ (MC):     q_t^(∞) = R_{t+1} + γ R_{t+2} + ... + γ^{T−t−1} R_T

2 Thus the n-step Q-return is defined as

   q_t^(n) = R_{t+1} + γ R_{t+2} + ... + γ^{n−1} R_{t+n} + γ^n Q(S_{t+n}, A_{t+n})

3 n-step Sarsa updates Q(S_t, A_t) toward the n-step Q-return:

   Q(S_t, A_t) ← Q(S_t, A_t) + α ( q_t^(n) − Q(S_t, A_t) )

On-policy Learning vs. Off-policy Learning

1 On-policy learning: learn about policy π from the experience collected by following π
  1 Behave non-optimally in order to explore all actions, then reduce the exploration, e.g., ε-greedy
2 Another important approach is off-policy learning, which essentially uses two different policies:
  1 the one that is being learned about and becomes the optimal policy
  2 the other one that is more exploratory and is used to generate trajectories
3 Off-policy learning: learn about policy π from the experience sampled from another policy µ
  1 π: target policy
  2 µ: behavior policy


Off-policy Learning

1 Follow the behavior policy µ(a|s) to collect data:

   S_1, A_1, R_2, ..., S_T ∼ µ
   Update π using S_1, A_1, R_2, ..., S_T

2 It leads to many benefits:
  1 Learn about the optimal policy while following an exploratory policy
  2 Learn from observing humans or other agents
  3 Re-use experience generated from old policies π_1, π_2, ..., π_{t−1}


Off-Policy Control with Q Learning

1 Off-policy learning of action values Q(s, a)

2 No importance sampling is needed

3 The next action in the TD target is an alternative action A′ ∼ π(·|S_{t+1})
4 Update Q(S_t, A_t) toward the value of the alternative action:

   Q(S_t, A_t) ← Q(S_t, A_t) + α ( R_{t+1} + γ Q(S_{t+1}, A′) − Q(S_t, A_t) )


Off-Policy Control with Q-Learning

1 We allow both the behavior and target policies to improve
2 The target policy π is greedy on Q(s,a):

   π(S_{t+1}) = arg max_{a′} Q(S_{t+1}, a′)

3 The behavior policy µ could be totally random, but we let it improve by following ε-greedy on Q(s,a)
4 Thus the Q-learning target:

   R_{t+1} + γ Q(S_{t+1}, A′) = R_{t+1} + γ Q(S_{t+1}, arg max_{a′} Q(S_{t+1}, a′))
                              = R_{t+1} + γ max_{a′} Q(S_{t+1}, a′)

5 Thus the Q-learning update:

   Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t) ]

Q-learning algorithm
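A minimal Q-learning sketch, again assuming the gym-style interface and the epsilon_greedy helper from the earlier sketch.

from collections import defaultdict

def q_learning(env, actions, num_episodes=5000, gamma=0.99, alpha=0.1, epsilon=0.1):
    # Off-policy TD control: behave epsilon-greedily, but bootstrap the TD target
    # with max_a Q(S_{t+1}, a), i.e. with the greedy target policy.
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(Q, s, actions, epsilon)
            s2, r, done, _ = env.step(a)
            target = r if done else r + gamma * max(Q[(s2, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q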


Comparison of Sarsa and Q-Learning

1 Sarsa: on-policy TD control

   Choose action A_t from S_t using the policy derived from Q with ε-greedy
   Take action A_t, observe R_{t+1} and S_{t+1}
   Choose action A_{t+1} from S_{t+1} using the policy derived from Q with ε-greedy

   Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_{t+1} + γ Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t) ]

2 Q-learning: off-policy TD control

   Choose action A_t from S_t using the policy derived from Q with ε-greedy
   Take action A_t, observe R_{t+1} and S_{t+1}
   Then 'imagine' A_{t+1} as arg max_{a′} Q(S_{t+1}, a′) in the update target

   Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t) ]


Comparison of Sarsa and Q-Learning

1 Backup diagram for Sarsa and Q-learning

2 In Sarsa, A and A′ are sampled from the same policy, so it is on-policy
3 In Q-learning, A and A′ come from different policies, with A being more exploratory and A′ determined directly by the max operator


Example on Cliff Walk (Example 6.6 from the textbook)
https://github.com/cuhkrlcourse/RLexample/blob/master/modelfree/cliffwalk.py


Summary of DP and TD

Expected Update (DP) vs. Sample Update (TD):

   Iterative Policy Evaluation: V(s) ← E[ R + γ V(S′) | s ]             | TD Learning: V(S) ←α R + γ V(S′)
   Q-Policy Iteration:  Q(S,A) ← E[ R + γ Q(S′,A′) | s,a ]              | Sarsa:       Q(S,A) ←α R + γ Q(S′,A′)
   Q-Value Iteration:   Q(S,A) ← E[ R + γ max_{a′∈A} Q(S′,a′) | s,a ]   | Q-Learning:  Q(S,A) ←α R + γ max_{a′∈A} Q(S′,a′)

where x ←α y is defined as x ← x + α(y − x)


Unified View of Reinforcement Learning


Sarsa and Q-Learning Example

https://github.com/cuhkrlcourse/RLexample/tree/master/modelfree


Summary

1 Model-free prediction in MDP

2 Model-free control in MDP


End

1 Project proposal due by the end of next week

2 Homework 1 is out
3 Next week:
  1 Off-policy learning and importance sampling
  2 Reinforcement learning and optimal control
