Reinforcement Learning
Emma Brunskill, Stanford University
Spring 2017
Reinforcement Learning: Learning to make a good sequence of decisions
Policy: mapping from history of past actions, states, rewards to next action
Figure from http://www.mdpi.com/energies/energies-09-00755/article_deploy/html/images/energies-09-00755-g003.png
Reinforcement Learning Involves
• Optimization
• Delayed consequences
• Generalization
• Exploration
Learning Objectives
• Define the key features of RL vs AI & other ML
• Define MDP, POMDP, bandit, batch offline RL, online RL
• Given an application problem (e.g. from computer vision, robotics, etc) decide if it should be formulated as a RL problem, if yes how to formulate, what algorithm (from class) is best suited to addressing, and justify answer
• Implement several RL algorithms incl. a deep RL approach
• Describe multiple criteria for analyzing RL algorithms and evaluate algorithms on these metrics: e.g. regret, sample complexity, computational complexity, convergence, etc.
• Describe the exploration vs exploitation challenge and compare and contrast 2 or more approaches
• List at least two open challenges or hot topics in RL
What We’ve Covered So Far
• Markov decision process planning
• Reinforcement learning in finite state and action domains
• Generalization with model-free techniques
• Exploration
Reinforcement Learning
Figure from David Silver's slides
Reinforcement Learning: model → value → policy
(this ordering is sufficient but not necessary; e.g., having a model is not required to learn a value)
Figure from David Silver's slides
What We’ve Covered So Far
• Markov decision process planning
  • definition of MDP
  • Bellman operator, contraction
  • value iteration, policy iteration
• Reinforcement learning in finite state and action domains
• Generalization with model-free techniques
• Exploration
Model: Frequently model the problem as a Markov Decision Process, <S,A,R,T,γ>
[Diagram: the agent takes an action in the world; the world returns the next state s' and a reward]
Policy: mapping from state → action
Stochastic dynamics model T(s'|s,a); reward model R(s,a,s')
Discount factor γ
MDPs
• Define an MDP <S,A,R,T,γ>
• Markov property: what is it, and why is it important?
• What are the MDP model / values V / state-action values Q / policy?
• What is MDP planning? How does it differ from reinforcement learning?
  • Planning = know the reward & dynamics
  • Learning = don't know the reward & dynamics
Value Iteration (VI)
1. Initialize V_0(s_i) = 0 for all states s_i
2. Set k = 1
3. Loop until [finite horizon, convergence]:
   • For each state s, apply the Bellman backup:
     V_{k+1}(s) = max_a [ R(s,a) + γ Σ_{s'} T(s'|s,a) V_k(s') ]
   • k = k + 1
4. Extract the policy: π(s) = argmax_a [ R(s,a) + γ Σ_{s'} T(s'|s,a) V_k(s') ]
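As a concrete sketch, here is value iteration on a tiny 2-state, 2-action MDP. All numbers (transitions, rewards) are invented for illustration only:

```python
import numpy as np

# Hypothetical MDP: T[s, a, s'] = transition probability, R[s, a] = reward.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):  # loop until (approximate) convergence
    # Bellman backup: V(s) = max_a [ R(s,a) + gamma * sum_s' T(s'|s,a) V(s') ]
    V_new = np.max(R + gamma * T @ V, axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

# Extract the greedy policy from the converged values
policy = np.argmax(R + gamma * T @ V, axis=1)
```

Note `T @ V` sums over next states for every (s,a) pair at once, so each loop iteration is one full Bellman backup.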
Contraction Operator
• Let O be an operator
• If |OV − OV'| ≤ |V − V'| for all V, V'
• Then O is a contraction operator
• Let B be the Bellman backup operator
• Let B be the Bellman backup operator
Will Value Iteration Converge to a Single Fixed Point?
• Yes, if the discount factor γ < 1, or if we end up in a terminal state with probability 1
• The Bellman backup is a contraction if the discount factor γ < 1
• If we apply it to two different value functions, the distance between them shrinks after applying the Bellman backup to each
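A compact version of this argument in the max-norm (a standard derivation, added here for reference; it is not on the original slide):

```latex
\|BV - BV'\|_\infty
  = \max_s \Big| \max_a \Big[ R(s,a) + \gamma \sum_{s'} T(s'|s,a)\, V(s') \Big]
            - \max_a \Big[ R(s,a) + \gamma \sum_{s'} T(s'|s,a)\, V'(s') \Big] \Big|
  \le \gamma \max_{s,a} \sum_{s'} T(s'|s,a)\, \big| V(s') - V'(s') \big|
  \le \gamma\, \|V - V'\|_\infty
```

So B shrinks distances by a factor of γ, which is why value iteration converges to a unique fixed point when γ < 1.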
Value vs Policy Iteration
• Value iteration:
  • Compute the optimal value if horizon = k
    • Note: this can be used to compute the optimal policy if horizon = k
  • Increment k
• Policy iteration:
  • Compute the infinite-horizon value of a policy
  • Use it to select another (better) policy
  • Closely related to a very popular method in RL: policy gradient
Policy Iteration (PI)
1. i = 0; initialize π_0(s) randomly for all states s
2. While i == 0 or |π_i − π_{i−1}| > 0:
   • i = i + 1
   • Policy evaluation: compute V^{π_{i−1}}
   • Policy improvement: π_i(s) = argmax_a [ R(s,a) + γ Σ_{s'} T(s'|s,a) V^{π_{i−1}}(s') ]
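A minimal sketch of these two steps, on the same kind of made-up 2-state MDP as before (exact policy evaluation done by solving a linear system):

```python
import numpy as np

# Hypothetical MDP, invented for illustration only.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9
n_states = 2

pi = np.zeros(n_states, dtype=int)  # initial (arbitrary) policy
while True:
    # Policy evaluation: solve (I - gamma * T_pi) V = R_pi exactly
    T_pi = T[np.arange(n_states), pi]   # T_pi[s, s'] = T(s'|s, pi(s))
    R_pi = R[np.arange(n_states), pi]
    V = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R_pi)
    # Policy improvement: act greedily with respect to V
    pi_new = np.argmax(R + gamma * T @ V, axis=1)
    if np.array_equal(pi_new, pi):
        break                            # policy is stable: done
    pi = pi_new
```

Because each improvement step is monotone and there are finitely many policies, the loop terminates.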
What We’ve Covered So Far
• Markov decision process planning
• Reinforcement learning in finite state and action domains
  • model-based RL
  • model-free RL
  • Passive RL (estimate V while following one policy)
  • General/control RL: the policy may change, and we may want to estimate the value of another policy
• Generalization with model-free techniques
• Exploration
[Diagram: agent-environment loop; the agent estimates a transition model, e.g. P(s'|s1,a1)=0.5,..., and a reward model, e.g. R(s1,a1)=1,...]
Model-based RL: the agent estimates a reward & dynamics model, and then computes V/Q
[Diagram: agent-environment loop; no model is estimated, e.g. Q(s1,a1)=9, Q(s1,a2)=8,...]
Model-free RL: the agent directly estimates Q / V
Model-Based Passive Reinforcement Learning
• Follow policy π
• Estimate the MDP model parameters from data
  • If there is a finite set of states and actions: count & average
• Given the estimated MDP, compute the value of policy π
  • Planning problem! Policy evaluation with a model
• Does this give us dynamics model parameter estimates for all actions?
  • No, but it gives all the ones we need to estimate the value of this policy
• How good are the model parameter estimates?
  • Depends on the amount of data we have
• What about the resulting policy value estimate?
  • Depends on the quality of the model parameters
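The "count & average" step can be sketched in a few lines. The environment below (`true_step`) is entirely made up and is used only to generate data; the agent sees only the transitions:

```python
import random
from collections import defaultdict

random.seed(0)

counts = defaultdict(lambda: defaultdict(int))   # counts[(s,a)][s'] = N(s,a,s')
reward_sum = defaultdict(float)                  # total reward seen for (s,a)

def true_step(s, a):
    # Made-up two-state dynamics: next state is 1 w.p. 0.7, reward 1 there.
    s_next = 1 if random.random() < 0.7 else 0
    return s_next, (1.0 if s_next == 1 else 0.0)

s = 0
for _ in range(5000):
    a = 0                                # passive RL: one fixed policy
    s_next, r = true_step(s, a)
    counts[(s, a)][s_next] += 1
    reward_sum[(s, a)] += r
    s = s_next

# MLE estimates: T_hat(s'|s,a) = N(s,a,s')/N(s,a), R_hat(s,a) = mean reward
def t_hat(s, a, s_next):
    return counts[(s, a)][s_next] / sum(counts[(s, a)].values())

def r_hat(s, a):
    return reward_sum[(s, a)] / sum(counts[(s, a)].values())
```

With enough data the estimates concentrate near the true parameters (here ≈0.7), which is exactly the "depends on the amount of data" caveat above.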
Model-Based Reinforcement Learning
1. Take action a for the current state s given policy π
   a. Observe the next state and reward
2. Build an MDP given all previous data
3. Given that MDP, compute policy π using planning
4. Go to step 1
Model-Based Reinforcement Learning* (will discuss other exploration later)
1. Take action a for the current state s given policy π
   a. Observe the next state and reward
2. Build an MDP given all previous data
3. Given that MDP, compute policy π using planning
   • Certainty equivalence:
     • Estimate the MLE MDP M1 parameters from data
     • Compute Q̂* for M1 and let π(s) = argmax_a Q̂*(s,a)
   • ε-greedy:
     • Estimate the MLE MDP M1 parameters from data
     • Compute Q̂* for M1 and set π(s) = argmax_a Q̂*(s,a) with probability (1−ε); else select an action at random
4. Go to step 1
Model-free Passive RL
• Directly estimate Q or V of a policy π while acting using π
• The Q function for a particular policy is the expected discounted sum of future rewards obtained by following the policy starting from (s,a)
• For Markov decision processes, Q^π(s,a) = R(s,a) + γ Σ_{s'} T(s'|s,a) V^π(s')
Model-free Passive RL
• Consider episodic domains
  • Act in the world for H steps, then reset back to a state sampled from the starting distribution
• MC: directly average episodic rewards
• TD/Q-learning: use a "target" to bootstrap
MC vs Dynamic Programming vs TD
• Updating a state/action value (V or Q)
• Bootstrapping: the update involves an estimate of V
  • MC does not bootstrap
  • Dynamic programming (value iteration) bootstraps
  • TD bootstraps
• Sampling: the update samples an expectation
  • MC samples
  • Dynamic programming does not sample
  • TD samples (e.g., samples s' instead of taking the expectation directly)
Monte-Carlo Policy Evaluation
‣ Goal: learn v_π from episodes of experience under policy π
‣ Idea: average the returns observed after visits to s
‣ Every-Visit MC: average returns for every time s is visited in an episode
‣ First-Visit MC: average returns only for the first time s is visited in an episode
‣ Both converge asymptotically
First-Visit MC Policy Evaluation
‣ To evaluate state s
‣ The first time-step t that state s is visited in an episode:
  ‣ Increment counter: N(s) ← N(s) + 1
  ‣ Increment total return: S(s) ← S(s) + G_t
‣ Value is estimated by the mean return: V(s) = S(s) / N(s)
‣ By the law of large numbers, V(s) → v_π(s) as N(s) → ∞
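A minimal first-visit MC sketch on a made-up episodic chain (states 0, 1 with terminal state 2; reward 1 on termination; the policy and dynamics are invented for illustration):

```python
import random
from collections import defaultdict

random.seed(1)
gamma = 0.9

def run_episode():
    # Under the fixed policy: advance w.p. 0.5, else stay. Returns (s, r) pairs.
    s, traj = 0, []
    while s != 2:
        s_next = s + 1 if random.random() < 0.5 else s
        traj.append((s, 1.0 if s_next == 2 else 0.0))
        s = s_next
    return traj

N = defaultdict(int)      # first-visit counts N(s)
S = defaultdict(float)    # total first-visit returns S(s)

for _ in range(2000):
    traj = run_episode()
    # Compute the return G_t following each step, working backwards.
    G, returns = 0.0, []
    for s, r in reversed(traj):
        G = r + gamma * G
        returns.append((s, G))
    returns.reverse()
    seen = set()
    for s, G in returns:
        if s not in seen:             # first visit to s only
            seen.add(s)
            N[s] += 1
            S[s] += G

V = {s: S[s] / N[s] for s in N}   # mean return per state
```

Switching `if s not in seen` to an unconditional update turns this into every-visit MC.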
Incremental Monte Carlo Updates
‣ Update V(s) incrementally after each episode
‣ For each state S_t with return G_t:
  N(S_t) ← N(S_t) + 1
  V(S_t) ← V(S_t) + (1/N(S_t)) (G_t − V(S_t))
‣ In non-stationary problems, it can be useful to track a running mean, i.e. forget old episodes:
  V(S_t) ← V(S_t) + α (G_t − V(S_t))
MC Estimation of Action Values (Q)
‣ Monte Carlo (MC) is most useful when a model is not available
  - We want to learn q*(s,a)
‣ q_π(s,a): average return starting from state s and action a, then following π
‣ Converges asymptotically if every state-action pair is visited
‣ Exploring starts: every state-action pair has a non-zero probability of being the starting pair
TD Learning
• Update V^π(s) after each transition (s, a, s', r):
  V^π(s) ← V^π(s) + α [ r + γ V^π(s') − V^π(s) ]
• Approximates the expectation over next states with samples
• Bootstraps: uses the estimate of V instead of the true V when creating the target
• Decrease the learning rate α over time
Slide adapted from Klein and Abbeel
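The update above can be sketched directly. The chain environment here is made up (the same 0 → 1 → terminal chain used for the MC example), and a constant α is used for simplicity rather than a decreasing one:

```python
import random

random.seed(2)
gamma, alpha = 0.9, 0.05
V = [0.0, 0.0, 0.0]       # V[2] is terminal and stays 0

for _ in range(5000):
    s = 0
    while s != 2:
        # Fixed policy: advance w.p. 0.5, else stay.
        s_next = s + 1 if random.random() < 0.5 else s
        r = 1.0 if s_next == 2 else 0.0
        # TD(0) update: V(s) <- V(s) + alpha * [r + gamma*V(s') - V(s)]
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next
```

Unlike MC, every transition produces an update immediately; there is no need to wait for the episode to end.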
Model-Free Learning w/ Random Actions
• TD learning for policy evaluation:
  • As we act in the world, we go through (s, a, r, s', a', r', …)
  • Update the V^π estimates at each step
  • Over time, the updates mimic Bellman updates
• Now do the same for Q values
Slide adapted from Klein and Abbeel
Q-Learning
• Update Q(s,a) every time we experience (s, a, s', r)
• Create a new sample estimate: target = r + γ max_{a'} Q(s', a')
• Update the estimate of Q(s,a): Q(s,a) ← Q(s,a) + α [ target − Q(s,a) ]
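A tabular Q-learning sketch with ε-greedy actions. The 2-state MDP below is invented for illustration (action 1 usually reaches the terminal goal state with reward 1; action 0 just bounces between states):

```python
import random

random.seed(3)
gamma, alpha, eps = 0.9, 0.1, 0.3
Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}

def step(s, a):
    if a == 1 and random.random() < 0.8:
        return 2, 1.0                    # reach the terminal goal
    return (1 - s), 0.0                  # bounce between states 0 and 1

for _ in range(5000):
    s = 0
    while s != 2:
        # Epsilon-greedy action selection
        if random.random() < eps:
            a = random.choice((0, 1))
        else:
            a = max((0, 1), key=lambda a_: Q[(s, a_)])
        s_next, r = step(s, a)
        # Q-learning target bootstraps with the max over next actions
        max_next = 0.0 if s_next == 2 else max(Q[(s_next, 0)], Q[(s_next, 1)])
        Q[(s, a)] += alpha * (r + gamma * max_next - Q[(s, a)])
        s = s_next
```

The max in the target is what makes this off-policy: the values learned are for the greedy policy even though actions are sometimes random.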
Convergence with Acting Randomly
• Choose actions randomly
• Do certainty-equivalent MDP planning with a lookup-table representation
  • The computed V and π will converge to the optimal value and policy in the limit of infinite data
• Q-learning will converge in the limit to Q*
• Both under a reachability assumption:
  • Take all actions in all states infinitely often
Q-Learning Properties
• If acting randomly*, Q-learning converges to Q*
  • Optimal Q values
  • Finds the optimal policy
• Off-policy learning
  • Can act in one way
  • But learn the values of another policy π (the optimal one!)
*Again, under mild reachability assumptions
What We’ve Covered So Far
• Markov decision process planning
• Reinforcement learning in finite state and action domains
• Generalization with model-free techniques
  • linear value/Q function approximation
  • deep reinforcement learning
• Exploration
Scaling Up
• Want to be able to tackle problems with enormous or infinite state spaces
• Tabular representation is insufficient
[Figure: a chain of states S1 through S7; an okay field site gives reward +1, a fantastic field site gives reward +10]
Value Function Approximation (VFA)
‣ Value function approximation (VFA) replaces the table with a general parameterized form: v̂(s, w) ≈ v_π(s)
Stochastic Gradient Descent
‣ Goal: find the parameter vector w minimizing the mean-squared error between the true value function v_π(S) and its approximation v̂(S, w):
  J(w) = E_π[ (v_π(S) − v̂(S, w))² ]
‣ Gradient descent finds a local minimum:
  Δw = −(1/2) α ∇_w J(w) = α E_π[ (v_π(S) − v̂(S, w)) ∇_w v̂(S, w) ]
‣ Stochastic gradient descent (SGD) samples the gradient:
  Δw = α (v_π(S) − v̂(S, w)) ∇_w v̂(S, w)
‣ The expected update is equal to the full gradient update
Linear Value Function Approximation (VFA)
‣ Represent the value function by a linear combination of features: v̂(S, w) = x(S)ᵀ w
‣ The objective function is quadratic in the parameters w
‣ The update rule is particularly simple:
  Δw = α (v_π(S) − v̂(S, w)) x(S)
‣ Update = step-size × prediction error × feature value
‣ Later, we will look at neural networks as function approximators.
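A semi-gradient TD(0) sketch with a linear value function, on the same made-up 0 → 1 → terminal chain as before. One-hot features are used so the linear approximator can represent the tabular answer exactly; with richer features the same code generalizes across states:

```python
import numpy as np

rng = np.random.default_rng(4)
gamma, alpha = 0.9, 0.05

def features(s):
    x = np.zeros(2)
    if s != 2:                 # terminal state has v_hat = 0 (zero features)
        x[s] = 1.0
    return x

w = np.zeros(2)
for _ in range(5000):
    s = 0
    while s != 2:
        s_next = s + 1 if rng.random() < 0.5 else s
        r = 1.0 if s_next == 2 else 0.0
        # w <- w + alpha * [r + gamma*v(s') - v(s)] * grad v(s); grad = x(s)
        x, x_next = features(s), features(s_next)
        td_error = r + gamma * (w @ x_next) - (w @ x)
        w += alpha * td_error * x
        s = s_next
```

Note the gradient of a linear v̂ is just the feature vector, so the update is exactly "step-size × prediction error × feature value".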
Monte Carlo with VFA (excerpt from Sutton & Barto, Chapter 9: On-policy Prediction with Approximation)

Gradient Monte Carlo Algorithm for Approximating v̂ ≈ v_π
  Input: the policy π to be evaluated
  Input: a differentiable function v̂ : S × ℝⁿ → ℝ
  Initialize value-function weights θ as appropriate (e.g., θ = 0)
  Repeat forever:
    Generate an episode S_0, A_0, R_1, S_1, A_1, ..., R_T, S_T using π
    For t = 0, 1, ..., T − 1:
      θ ← θ + α [ G_t − v̂(S_t, θ) ] ∇v̂(S_t, θ)

If U_t is an unbiased estimate, that is, if E[U_t] = v_π(S_t) for each t, then θ_t is guaranteed to converge to a local optimum under the usual stochastic approximation conditions (2.7) for decreasing α.

For example, suppose the states in the examples are the states generated by interaction (or simulated interaction) with the environment using policy π. Because the true value of a state is the expected value of the return following it, the Monte Carlo target U_t ≐ G_t is by definition an unbiased estimate of v_π(S_t). With this choice, the general SGD method (9.7) converges to a locally optimal approximation to v_π(S_t). Thus, the gradient-descent version of Monte Carlo state-value prediction is guaranteed to find a locally optimal solution. Pseudocode for a complete algorithm is shown in the box.

One does not obtain the same guarantees if a bootstrapping estimate of v_π(S_t) is used as the target U_t in (9.7). Bootstrapping targets such as n-step returns G_t^(n) or the DP target Σ_{a,s',r} π(a|S_t) p(s',r|S_t,a) [ r + γ v̂(s', θ_t) ] all depend on the current value of the weight vector θ_t, which implies that they will be biased and that they will not produce a true gradient-descent method. One way to look at this is that the key step from (9.4) to (9.5) relies on the target being independent of θ_t. This step would not be valid if a bootstrapping estimate were used in place of v_π(S_t). Bootstrapping methods are not in fact instances of true gradient descent (Barnard, 1993). They take into account the effect of changing the weight vector θ_t on the estimate, but ignore its effect on the target. They include only a part of the gradient and, accordingly, we call them semi-gradient methods.

Although semi-gradient (bootstrapping) methods do not converge as robustly as gradient methods, they do converge reliably in important cases such as the linear case discussed in the next section. Moreover, they offer important advantages that often make them clearly preferred. One reason is that they are typically significantly faster to learn, as we have seen in Chapters 6 and 7. Another is that they enable learning to be continual and online, without waiting for the end of an episode. This lets them be used on continuing problems and provides computational advantages. A prototypical semi-gradient method is semi-gradient TD(0), which uses U_t ≐ R_{t+1} + γ v̂(S_{t+1}, θ) as its target. Complete pseudocode for this method is given in the box at the top of the next page.
TD Learning with VFA (excerpt from Sutton & Barto, Section 9.3: Stochastic-gradient and Semi-gradient Methods)

Semi-gradient TD(0) for estimating v̂ ≈ v_π
  Input: the policy π to be evaluated
  Input: a differentiable function v̂ : S⁺ × ℝⁿ → ℝ such that v̂(terminal, ·) = 0
  Initialize value-function weights θ arbitrarily (e.g., θ = 0)
  Repeat (for each episode):
    Initialize S
    Repeat (for each step of episode):
      Choose A ~ π(·|S)
      Take action A, observe R, S'
      θ ← θ + α [ R + γ v̂(S', θ) − v̂(S, θ) ] ∇v̂(S, θ)
      S ← S'
    until S' is terminal
Example 9.1: State Aggregation on the 1000-state Random Walk. State aggregation is a simple form of generalizing function approximation in which states are grouped together, with one estimated value (one component of the weight vector θ) for each group. The value of a state is estimated as its group's component, and when the state is updated, that component alone is updated. State aggregation is a special case of SGD (9.7) in which the gradient, ∇v̂(S_t, θ_t), is 1 for S_t's group's component and 0 for the other components.

Consider a 1000-state version of the random walk task (Examples 6.2 and 7.1). The states are numbered from 1 to 1000, left to right, and all episodes begin near the center, in state 500. State transitions are from the current state to one of the 100 neighboring states to its left, or to one of the 100 neighboring states to its right, all with equal probability. Of course, if the current state is near an edge, then there may be fewer than 100 neighbors on that side of it. In this case, all the probability that would have gone into those missing neighbors goes into the probability of terminating on that side (thus, state 1 has a 0.5 chance of terminating on the left, and state 950 has a 0.25 chance of terminating on the right). As usual, termination on the left produces a reward of −1, and termination on the right produces a reward of +1. All other transitions have a reward of zero. We use this task as a running example throughout this section.

Figure 9.1 shows the true value function v_π for this task. It is nearly a straight line, but tilted slightly toward the horizontal and curving further in this direction for the last 100 states at each end. Also shown is the final approximate value function learned by the gradient Monte Carlo algorithm with state aggregation after 100,000 episodes with a step size of α = 2×10⁻⁵. For the state aggregation, the 1000 states were partitioned into 10 groups of 100 states each (i.e., states 1-100 were one group, states 101-200 were another, and so on). The staircase effect shown in the figure is typical of state aggregation; within each group, the approximate value is constant, and it changes abruptly from one group to the next.
Linear Action-Value Function Approximation
‣ Represent the state and action by a feature vector x(S, A)
‣ Represent the action-value function by a linear combination of features: q̂(S, A, w) = x(S, A)ᵀ w
‣ Stochastic gradient descent update: Δw = α (q_π(S, A) − q̂(S, A, w)) x(S, A)
Incremental Control Algorithms (excerpt from Sutton & Barto, Chapter 10: On-policy Control with Approximation)

The general gradient-descent update for action-value prediction is
  θ_{t+1} ≐ θ_t + α [ U_t − q̂(S_t, A_t, θ_t) ] ∇q̂(S_t, A_t, θ_t).   (10.1)
For example, the update for the one-step Sarsa method is
  θ_{t+1} ≐ θ_t + α [ R_{t+1} + γ q̂(S_{t+1}, A_{t+1}, θ_t) − q̂(S_t, A_t, θ_t) ] ∇q̂(S_t, A_t, θ_t).   (10.2)
We call this method episodic semi-gradient one-step Sarsa. For a constant policy, this method converges in the same way that TD(0) does, with the same kind of error bound (9.14).

To form control methods, we need to couple such action-value prediction methods with techniques for policy improvement and action selection. Suitable techniques applicable to continuous actions, or to actions from large discrete sets, are a topic of ongoing research with as yet no clear resolution. On the other hand, if the action set is discrete and not too large, then we can use the techniques already developed in previous chapters. That is, for each possible action a available in the current state S_t, we can compute q̂(S_t, a, θ_t) and then find the greedy action A*_t = argmax_a q̂(S_t, a, θ_t). Policy improvement is then done (in the on-policy case treated in this chapter) by changing the estimation policy to a soft approximation of the greedy policy such as the ε-greedy policy. Actions are selected according to this same policy. Pseudocode for the complete algorithm is given in the box.

Example 10.1: Mountain Car Task. Consider the task of driving an underpowered car up a steep mountain road, as suggested by the diagram in the upper left of Figure 10.1. The difficulty is that gravity is stronger than the car's engine, and even at full throttle the car cannot accelerate up the steep slope. The only solution is to first move away from the goal and up the opposite slope on the left.
Episodic Semi-gradient Sarsa for Estimating q̂ ≈ q*
  Input: a differentiable function q̂ : S × A × ℝⁿ → ℝ
  Initialize value-function weights θ ∈ ℝⁿ arbitrarily (e.g., θ = 0)
  Repeat (for each episode):
    S, A ← initial state and action of episode (e.g., ε-greedy)
    Repeat (for each step of episode):
      Take action A, observe R, S'
      If S' is terminal:
        θ ← θ + α [ R − q̂(S, A, θ) ] ∇q̂(S, A, θ)
        Go to next episode
      Choose A' as a function of q̂(S', ·, θ) (e.g., ε-greedy)
      θ ← θ + α [ R + γ q̂(S', A', θ) − q̂(S, A, θ) ] ∇q̂(S, A, θ)
      S ← S'
      A ← A'
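The boxed pseudocode can be sketched directly. The 2-state chain environment below is invented for illustration (action 1 advances toward the terminal state, which pays reward 1; action 0 stays put), and one-hot state-action features are used so the linear q̂ reduces to the tabular case:

```python
import numpy as np

rng = np.random.default_rng(7)
gamma, alpha, eps = 0.9, 0.05, 0.1
n_states, n_actions = 2, 2
theta = np.zeros(n_states * n_actions)

def x(s, a):                         # one-hot feature vector for (s, a)
    v = np.zeros(n_states * n_actions)
    v[s * n_actions + a] = 1.0
    return v

def q_hat(s, a):
    return theta @ x(s, a)

def eps_greedy(s):
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return max(range(n_actions), key=lambda a: q_hat(s, a))

def step(s, a):                      # made-up dynamics: action 1 advances
    s_next = s + 1 if a == 1 else s
    return s_next, (1.0 if s_next == n_states else 0.0)

for _ in range(3000):
    s, a = 0, eps_greedy(0)
    while True:
        s_next, r = step(s, a)
        if s_next == n_states:       # terminal: the target is just R
            theta += alpha * (r - q_hat(s, a)) * x(s, a)
            break
        a_next = eps_greedy(s_next)
        theta += alpha * (r + gamma * q_hat(s_next, a_next) - q_hat(s, a)) * x(s, a)
        s, a = s_next, a_next
```

Because the target uses the action A' the ε-greedy policy actually selects (not the max), this is on-policy, unlike Q-learning.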
SGD with Experience Replay
‣ Given experience consisting of ⟨state, value⟩ pairs
‣ Repeat:
  - Sample a ⟨state, value⟩ pair from the experience
  - Apply a stochastic gradient descent update
‣ Converges to the least squares solution
‣ We will look at Deep Q-networks later.
Monte Carlo vs TD Learning: Convergence in the On-Policy Case
• Evaluating the value of a single policy: minimize Σ_s d(s) (V^π(s) − V̂(s, w))²
  • where d(s) is generally the on-policy stationary distribution of π
  • V̂(s, w) is the value function approximation
• Convergence given an infinite amount of data?
Tsitsiklis and Van Roy. An Analysis of Temporal-Difference Learning with Function Approximation. 1997
Monte Carlo Convergence: Linear VFA
• Evaluating the value of a single policy: minimize Σ_s d(s) (V^π(s) − V̂(s, w))²
• Linear VFA: Monte Carlo converges to the minimum MSE possible!
Tsitsiklis and Van Roy. An Analysis of Temporal-Difference Learning with Function Approximation. 1997
TD Learning Convergence: Linear VFA
• Evaluating the value of a single policy: minimize Σ_s d(s) (V^π(s) − V̂(s, w))²
• Linear VFA: TD converges to within a constant factor of the best possible MSE
Tsitsiklis and Van Roy. An Analysis of Temporal-Difference Learning with Function Approximation. 1997
TD Learning vs Monte Carlo: Finite Data, Lookup Table, Which is Better?
• MC converges to the minimum MSE
• TD with infinite experience replay converges to the estimate obtained as if we computed the MLE MDP model and then computed the value of the policy under it
  • This is often better if the domain is an MDP!
  • Leverages the Markov structure
Off-Policy Learning
Can use importance sampling with MC to estimate the value of a policy we didn't try (but that has support under the behavior policy)
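A minimal sketch of ordinary importance sampling on a made-up one-step problem (the behavior policy b, target policy π, and rewards are all invented; real MC off-policy evaluation multiplies the per-step ratios along a whole episode):

```python
import random

random.seed(5)

def reward(a):
    return 1.0 if a == 1 else 0.0    # deterministic, for simplicity

b = {0: 0.5, 1: 0.5}                 # behavior policy: uniform over actions
pi = {0: 0.0, 1: 1.0}                # target policy: always action 1

total, n = 0.0, 10000
for _ in range(n):
    a = random.choice((0, 1))        # act with the behavior policy b
    rho = pi[a] / b[a]               # importance weight pi(a)/b(a)
    total += rho * reward(a)         # reweight the observed return

v_pi_estimate = total / n
```

The estimate is unbiased for the target policy's value (here 1.0) even though every action was chosen by b, which is exactly the "support" requirement: b must give nonzero probability to any action π can take.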
Q-learning with function approximation can diverge (fail to converge even given infinite data)
• See examples in Chapter 11 of Sutton and Barto
Deep Learning & Model-Free Q-Learning
• Running stochastic gradient descent
• Now use a deep network to approximate Q
• The target involves the value of the next state, V(s'), but we don't know it
  • Could use a Monte Carlo estimate (sum of rewards to the end of the episode)
[Equation figure: the Q-learning target, computed from the Q-network]
Challenges
• Challenges of using function approximation:
  • Local updates (s, a, r, s') are highly correlated
  • The "target" (approximation to the true value of s') can change quickly and lead to instabilities
DQN: Q-Learning with DL
• Experience replay of a mix of prior (s_i, a_i, r_i, s_{i+1}) tuples to update Q(w)
• Fix the target Q (with weights w⁻) for a number of steps, then update it
• Optimize the MSE between the current Q and the Q target
  • Use stochastic gradient descent
[Equation figure: the Q-learning target, computed from the target Q-network]
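The two stabilization tricks can be sketched at toy scale. Below, a plain dict stands in for the Q-network and a frozen copy stands in for the target network; the 2-state loop environment is invented. This shows only the data flow (replay buffer, minibatch, periodic target sync), not the DQN of Mnih et al.:

```python
import random
from collections import deque

random.seed(6)
buffer = deque(maxlen=1000)          # replay buffer of (s, a, r, s') tuples
Q, Q_target = {}, {}                 # dicts stand in for network weights
SYNC_EVERY = 50
gamma, alpha = 0.9, 0.1

def q(table, s, a):
    return table.get((s, a), 0.0)

for t in range(2000):
    # Made-up experience: a 2-state loop where action 1 pays reward 1.
    s = t % 2
    a = random.choice((0, 1))
    r = 1.0 if a == 1 else 0.0
    s_next = 1 - s
    buffer.append((s, a, r, s_next))

    # Learn from a random minibatch, bootstrapping from the *frozen* target
    for s_i, a_i, r_i, sn_i in random.sample(buffer, min(32, len(buffer))):
        target = r_i + gamma * max(q(Q_target, sn_i, 0), q(Q_target, sn_i, 1))
        Q[(s_i, a_i)] = q(Q, s_i, a_i) + alpha * (target - q(Q, s_i, a_i))

    if t % SYNC_EVERY == 0:
        Q_target = dict(Q)           # periodically copy weights into the target
```

Sampling the minibatch from the buffer breaks the correlation between consecutive updates, and freezing the target between syncs keeps it from chasing its own updates.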
Deep RL
• Experience replay is hugely helpful
• Target stabilization is also helpful
• No guarantees on convergence (yet)
• Some other influential ideas:
  • Prioritized experience replay
  • Double Q (two separate networks, each acting as a "target" for the other)
  • Dueling: separate value and advantage
• Many of these ideas build on prior ideas for RL with lookup-table representations
What We’ve Covered So Far
• Markov decision process planning
• Reinforcement learning in finite state and action domains
• Generalization with model-free techniques
• Exploration
  • ε-greedy
  • PAC: Rmax
  • regret: UCRL
  • Know what the different criteria guarantee, which algorithms achieve them, and when you might want one criterion over another. You do not have to be able to derive full proofs.
Performance of RL Algorithms
• Convergence
• Asymptotically optimal
  • In the limit of infinite data, will converge to the optimal π
  • E.g., Q-learning with ε-greedy action selection
  • Says nothing about finite-data performance
• Probably approximately correct
• Minimize / sublinear regret
Optimism Under Uncertainty
R-max (Brafman and Tennenholtz)
Slight modification of the R-max pseudocode (Algorithm 1) in
Reinforcement Learning in Finite MDPs: PAC Analysis (Strehl, Li, Littman 2009)
Unknown state-action pairs are assigned the optimistic value Rmax / (1−γ)
Regret vs PAC vs ...?
• What choice of performance should we care about?
• For simplicity, consider the episodic setting
• Return is the sum of rewards in an episode
[Figure: return vs. episodes 1, 2, 3, …, k]
Regret Bounds
[Figure: return vs. episodes 1, 2, 3, …, k, with the optimal return shown as a reference line]
(ε,δ)-Probably Approximately Correct
[Figure: return vs. episodes 1, 2, 3, …, k, with the optimal return shown as a reference line]
(ε,δ)-Probably Approximately Correct
[Figure: return vs. episodes, highlighting the number of episodes with policies not ε-close to optimal]
Uniform PAC Bound
Uniform-PAC (Dann, Lattimore, Brunskill, arXiv, 2017)