Reinforcement Learning
Emma Brunskill, Stanford University
Spring 2017
Reinforcement Learning: Learning to make a good sequence of decisions
Policy: mapping from history of past actions, states, rewards to next action
Figure from http://www.mdpi.com/energies/energies-09-00755/article_deploy/html/images/energies-09-00755-g003.png
Reinforcement Learning Involves
• Optimization
• Delayed consequences
• Generalization
• Exploration
Learning Objectives
• Define the key features of RL vs AI & other ML
• Define MDP, POMDP, bandit, batch offline RL, online RL
• Given an application problem (e.g. from computer vision, robotics, etc) decide if it should be formulated as a RL problem, if yes how to formulate, what algorithm (from class) is best suited to addressing, and justify answer
• Implement several RL algorithms incl. a deep RL approach
• Describe multiple criteria for analyzing RL algorithms and evaluate algorithms on these metrics: e.g. regret, sample complexity, computational complexity, convergence, etc.
• Describe the exploration vs exploitation challenge and compare and contrast 2 or more approaches
• List at least two open challenges or hot topics in RL
What We’ve Covered So Far
• Markov decision process planning
• Reinforcement learning in finite state and action domains
• Generalization with model-free techniques
• Exploration
Reinforcement Learning
Figure from David Silver's slides
Reinforcement Learning: model → value → policy
(this ordering is sufficient but not necessary; e.g., having a model is not required to learn a value)
Figure from David Silver's slides
What We’ve Covered So Far
• Markov decision process planning
  • definition of MDP
  • Bellman operator, contraction
  • value iteration, policy iteration
• Reinforcement learning in finite state and action domains
• Generalization with model-free techniques
• Exploration
Model: Frequently model the problem as a Markov Decision Process, <S,A,R,T,γ>
[Diagram: the agent takes an action in the world; the world returns the next state s' and a reward]
Policy: mapping from state → action
Stochastic dynamics model T(s'|s,a); reward model R(s,a,s')
Discount factor γ
MDPs
• Define an MDP <S,A,R,T,γ>
• Markov property: what is it, and why is it important?
• What are the MDP model / values V / state-action values Q / policy?
• What is MDP planning? How does it differ from reinforcement learning?
  • Planning = know the reward & dynamics
  • Learning = don't know the reward & dynamics
Value Iteration (VI)
1. Initialize V_0(s_i) = 0 for all states s_i
2. Set k = 1
3. Loop until [finite horizon, convergence]:
   • For each state s, apply the Bellman backup:
     V_{k+1}(s) = max_a [ R(s,a) + γ Σ_{s'} T(s'|s,a) V_k(s') ]
   • k = k + 1
4. Extract the policy: π(s) = argmax_a [ R(s,a) + γ Σ_{s'} T(s'|s,a) V_k(s') ]
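As a concrete sketch, here is value iteration on a tiny 2-state, 2-action MDP. All numbers (transitions, rewards) are invented for illustration only:

```python
import numpy as np

# Hypothetical MDP: T[s, a, s'] = transition probability, R[s, a] = reward.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(1000):  # loop until (approximate) convergence
    # Bellman backup: V(s) = max_a [ R(s,a) + gamma * sum_s' T(s'|s,a) V(s') ]
    V_new = np.max(R + gamma * T @ V, axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

# Extract the greedy policy from the converged values
policy = np.argmax(R + gamma * T @ V, axis=1)
```

Note `T @ V` sums over next states for every (s,a) pair at once, so each loop iteration is one full Bellman backup.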
Contraction Operator
• Let O be an operator
• If |OV − OV'| ≤ |V − V'| for all V, V'
• Then O is a contraction operator
• Let B be the Bellman backup operator
• Let B be the Bellman backup operator
Will Value Iteration Converge to a Single Fixed Point?
• Yes, if the discount factor γ < 1, or if we end up in a terminal state with probability 1
• The Bellman backup is a contraction if the discount factor γ < 1
• If we apply it to two different value functions, the distance between them shrinks after applying the Bellman backup to each
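A compact version of this argument in the max-norm (a standard derivation, added here for reference; it is not on the original slide):

```latex
\|BV - BV'\|_\infty
  = \max_s \Big| \max_a \Big[ R(s,a) + \gamma \sum_{s'} T(s'|s,a)\, V(s') \Big]
            - \max_a \Big[ R(s,a) + \gamma \sum_{s'} T(s'|s,a)\, V'(s') \Big] \Big|
  \le \gamma \max_{s,a} \sum_{s'} T(s'|s,a)\, \big| V(s') - V'(s') \big|
  \le \gamma\, \|V - V'\|_\infty
```

So B shrinks distances by a factor of γ, which is why value iteration converges to a unique fixed point when γ < 1.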
Value vs Policy Iteration
• Value iteration:
  • Compute the optimal value if horizon = k
    • Note: this can be used to compute the optimal policy if horizon = k
  • Increment k
• Policy iteration:
  • Compute the infinite-horizon value of a policy
  • Use it to select another (better) policy
  • Closely related to a very popular method in RL: policy gradient
Policy Iteration (PI)
1. i = 0; initialize π_0(s) randomly for all states s
2. While i == 0 or |π_i − π_{i−1}| > 0:
   • i = i + 1
   • Policy evaluation: compute V^{π_{i−1}}
   • Policy improvement: π_i(s) = argmax_a [ R(s,a) + γ Σ_{s'} T(s'|s,a) V^{π_{i−1}}(s') ]
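A minimal sketch of these two steps, on the same kind of made-up 2-state MDP as before (exact policy evaluation done by solving a linear system):

```python
import numpy as np

# Hypothetical MDP, invented for illustration only.
T = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9
n_states = 2

pi = np.zeros(n_states, dtype=int)  # initial (arbitrary) policy
while True:
    # Policy evaluation: solve (I - gamma * T_pi) V = R_pi exactly
    T_pi = T[np.arange(n_states), pi]   # T_pi[s, s'] = T(s'|s, pi(s))
    R_pi = R[np.arange(n_states), pi]
    V = np.linalg.solve(np.eye(n_states) - gamma * T_pi, R_pi)
    # Policy improvement: act greedily with respect to V
    pi_new = np.argmax(R + gamma * T @ V, axis=1)
    if np.array_equal(pi_new, pi):
        break                            # policy is stable: done
    pi = pi_new
```

Because each improvement step is monotone and there are finitely many policies, the loop terminates.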
What We’ve Covered So Far
• Markov decision process planning
• Reinforcement learning in finite state and action domains
  • model-based RL
  • model-free RL
  • Passive RL (estimate V while following one policy)
  • General/control RL: the policy may change, and we may want to estimate the value of another policy
• Generalization with model-free techniques
• Exploration
[Diagram: agent-environment loop; the agent estimates a transition model, e.g. P(s'|s1,a1)=0.5,..., and a reward model, e.g. R(s1,a1)=1,...]
Model-based RL: the agent estimates a reward & dynamics model, and then computes V/Q
[Diagram: agent-environment loop; no model is estimated, e.g. Q(s1,a1)=9, Q(s1,a2)=8,...]
Model-free RL: the agent directly estimates Q / V
Model-Based Passive Reinforcement Learning
• Follow policy π
• Estimate the MDP model parameters from data
  • If there is a finite set of states and actions: count & average
• Given the estimated MDP, compute the value of policy π
  • Planning problem! Policy evaluation with a model
• Does this give us dynamics model parameter estimates for all actions?
  • No, but it gives all the ones we need to estimate the value of this policy
• How good are the model parameter estimates?
  • Depends on the amount of data we have
• What about the resulting policy value estimate?
  • Depends on the quality of the model parameters
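The "count & average" step can be sketched in a few lines. The environment below (`true_step`) is entirely made up and is used only to generate data; the agent sees only the transitions:

```python
import random
from collections import defaultdict

random.seed(0)

counts = defaultdict(lambda: defaultdict(int))   # counts[(s,a)][s'] = N(s,a,s')
reward_sum = defaultdict(float)                  # total reward seen for (s,a)

def true_step(s, a):
    # Made-up two-state dynamics: next state is 1 w.p. 0.7, reward 1 there.
    s_next = 1 if random.random() < 0.7 else 0
    return s_next, (1.0 if s_next == 1 else 0.0)

s = 0
for _ in range(5000):
    a = 0                                # passive RL: one fixed policy
    s_next, r = true_step(s, a)
    counts[(s, a)][s_next] += 1
    reward_sum[(s, a)] += r
    s = s_next

# MLE estimates: T_hat(s'|s,a) = N(s,a,s')/N(s,a), R_hat(s,a) = mean reward
def t_hat(s, a, s_next):
    return counts[(s, a)][s_next] / sum(counts[(s, a)].values())

def r_hat(s, a):
    return reward_sum[(s, a)] / sum(counts[(s, a)].values())
```

With enough data the estimates concentrate near the true parameters (here ≈0.7), which is exactly the "depends on the amount of data" caveat above.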
Model-Based Reinforcement Learning
1. Take action a for the current state s given policy π
   a. Observe the next state and reward
2. Build an MDP given all previous data
3. Given that MDP, compute policy π using planning
4. Go to step 1
Model-Based Reinforcement Learning* (will discuss other exploration later)
1. Take action a for the current state s given policy π
   a. Observe the next state and reward
2. Build an MDP given all previous data
3. Given that MDP, compute policy π using planning
   • Certainty equivalence:
     • Estimate the MLE MDP M1 parameters from data
     • Compute Q̂* for M1 and let π(s) = argmax_a Q̂*(s,a)
   • ε-greedy:
     • Estimate the MLE MDP M1 parameters from data
     • Compute Q̂* for M1 and set π(s) = argmax_a Q̂*(s,a) with probability (1−ε); else select an action at random
4. Go to step 1
Model-free Passive RL
• Directly estimate Q or V of a policy π while acting using π
• The Q function for a particular policy is the expected discounted sum of future rewards obtained by following the policy starting from (s,a)
• For Markov decision processes, Q^π(s,a) = R(s,a) + γ Σ_{s'} T(s'|s,a) V^π(s')
Model-free Passive RL
• Consider episodic domains
  • Act in the world for H steps, then reset back to a state sampled from the starting distribution
• MC: directly average episodic rewards
• TD/Q-learning: use a "target" to bootstrap
MC vs Dynamic Programming vs TD
• Updating a state/action value (V or Q)
• Bootstrapping: the update involves an estimate of V
  • MC does not bootstrap
  • Dynamic programming (value iteration) bootstraps
  • TD bootstraps
• Sampling: the update samples an expectation
  • MC samples
  • Dynamic programming does not sample
  • TD samples (e.g., samples s' instead of taking the expectation directly)
Monte-Carlo Policy Evaluation
‣ Goal: learn v_π from episodes of experience under policy π
‣ Idea: average the returns observed after visits to s
‣ Every-Visit MC: average returns for every time s is visited in an episode
‣ First-Visit MC: average returns only for the first time s is visited in an episode
‣ Both converge asymptotically
First-Visit MC Policy Evaluation
‣ To evaluate state s
‣ The first time-step t that state s is visited in an episode:
  ‣ Increment counter: N(s) ← N(s) + 1
  ‣ Increment total return: S(s) ← S(s) + G_t
‣ Value is estimated by the mean return: V(s) = S(s) / N(s)
‣ By the law of large numbers, V(s) → v_π(s) as N(s) → ∞
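A minimal first-visit MC sketch on a made-up episodic chain (states 0, 1 with terminal state 2; reward 1 on termination; the policy and dynamics are invented for illustration):

```python
import random
from collections import defaultdict

random.seed(1)
gamma = 0.9

def run_episode():
    # Under the fixed policy: advance w.p. 0.5, else stay. Returns (s, r) pairs.
    s, traj = 0, []
    while s != 2:
        s_next = s + 1 if random.random() < 0.5 else s
        traj.append((s, 1.0 if s_next == 2 else 0.0))
        s = s_next
    return traj

N = defaultdict(int)      # first-visit counts N(s)
S = defaultdict(float)    # total first-visit returns S(s)

for _ in range(2000):
    traj = run_episode()
    # Compute the return G_t following each step, working backwards.
    G, returns = 0.0, []
    for s, r in reversed(traj):
        G = r + gamma * G
        returns.append((s, G))
    returns.reverse()
    seen = set()
    for s, G in returns:
        if s not in seen:             # first visit to s only
            seen.add(s)
            N[s] += 1
            S[s] += G

V = {s: S[s] / N[s] for s in N}   # mean return per state
```

Switching `if s not in seen` to an unconditional update turns this into every-visit MC.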
Incremental Monte Carlo Updates
‣ Update V(s) incrementally after each episode
‣ For each state S_t with return G_t:
  N(S_t) ← N(S_t) + 1
  V(S_t) ← V(S_t) + (1/N(S_t)) (G_t − V(S_t))
‣ In non-stationary problems, it can be useful to track a running mean, i.e. forget old episodes:
  V(S_t) ← V(S_t) + α (G_t − V(S_t))
MC Estimation of Action Values (Q)
‣ Monte Carlo (MC) is most useful when a model is not available
  - We want to learn q*(s,a)
‣ q_π(s,a): average return starting from state s and action a, then following π
‣ Converges asymptotically if every state-action pair is visited
‣ Exploring starts: every state-action pair has a non-zero probability of being the starting pair
TD Learning
• Update V^π(s) after each transition (s, a, s', r):
  V^π(s) ← V^π(s) + α [ r + γ V^π(s') − V^π(s) ]
• Approximates the expectation over next states with samples
• Bootstraps: uses the estimate of V instead of the true V when creating the target
• Decrease the learning rate α over time
Slide adapted from Klein and Abbeel
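The update above can be sketched directly. The chain environment here is made up (the same 0 → 1 → terminal chain used for the MC example), and a constant α is used for simplicity rather than a decreasing one:

```python
import random

random.seed(2)
gamma, alpha = 0.9, 0.05
V = [0.0, 0.0, 0.0]       # V[2] is terminal and stays 0

for _ in range(5000):
    s = 0
    while s != 2:
        # Fixed policy: advance w.p. 0.5, else stay.
        s_next = s + 1 if random.random() < 0.5 else s
        r = 1.0 if s_next == 2 else 0.0
        # TD(0) update: V(s) <- V(s) + alpha * [r + gamma*V(s') - V(s)]
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next
```

Unlike MC, every transition produces an update immediately; there is no need to wait for the episode to end.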
Model-Free Learning w/ Random Actions
• TD learning for policy evaluation:
  • As we act in the world, we go through (s, a, r, s', a', r', …)
  • Update the V^π estimates at each step
  • Over time, the updates mimic Bellman updates
• Now do the same for Q values
Slide adapted from Klein and Abbeel
Q-Learning
• Update Q(s,a) every time we experience (s, a, s', r)
• Create a new sample estimate: target = r + γ max_{a'} Q(s', a')
• Update the estimate of Q(s,a): Q(s,a) ← Q(s,a) + α [ target − Q(s,a) ]
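A tabular Q-learning sketch with ε-greedy actions. The 2-state MDP below is invented for illustration (action 1 usually reaches the terminal goal state with reward 1; action 0 just bounces between states):

```python
import random

random.seed(3)
gamma, alpha, eps = 0.9, 0.1, 0.3
Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}

def step(s, a):
    if a == 1 and random.random() < 0.8:
        return 2, 1.0                    # reach the terminal goal
    return (1 - s), 0.0                  # bounce between states 0 and 1

for _ in range(5000):
    s = 0
    while s != 2:
        # Epsilon-greedy action selection
        if random.random() < eps:
            a = random.choice((0, 1))
        else:
            a = max((0, 1), key=lambda a_: Q[(s, a_)])
        s_next, r = step(s, a)
        # Q-learning target bootstraps with the max over next actions
        max_next = 0.0 if s_next == 2 else max(Q[(s_next, 0)], Q[(s_next, 1)])
        Q[(s, a)] += alpha * (r + gamma * max_next - Q[(s, a)])
        s = s_next
```

The max in the target is what makes this off-policy: the values learned are for the greedy policy even though actions are sometimes random.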
Convergence with Acting Randomly
• Choose actions randomly
• Do certainty-equivalent MDP planning with a lookup-table representation
  • The computed V and π will converge to the optimal value and policy in the limit of infinite data
• Q-learning will converge in the limit to Q*
• Both under a reachability assumption:
  • Take all actions in all states infinitely often
Q-Learning Properties
• If acting randomly*, Q-learning converges to Q*
  • Optimal Q values
  • Finds the optimal policy
• Off-policy learning
  • Can act in one way
  • But learn the values of another policy π (the optimal one!)
*Again, under mild reachability assumptions
What We’ve Covered So Far
• Markov decision process planning
• Reinforcement learning in finite state and action domains
• Generalization with model-free techniques
  • linear value/Q function approximation
  • deep reinforcement learning
• Exploration
Scaling Up
• Want to be able to tackle problems with enormous or infinite state spaces
• Tabular representation is insufficient
[Figure: a chain of states S1 through S7; an okay field site gives reward +1, a fantastic field site gives reward +10]
Value Function Approximation (VFA)
‣ Value function approximation (VFA) replaces the table with a general parameterized form: v̂(s, w) ≈ v_π(s)
Stochastic Gradient Descent
‣ Goal: find the parameter vector w minimizing the mean-squared error between the true value function v_π(S) and its approximation v̂(S, w):
  J(w) = E_π[ (v_π(S) − v̂(S, w))² ]
‣ Gradient descent finds a local minimum:
  Δw = −(1/2) α ∇_w J(w) = α E_π[ (v_π(S) − v̂(S, w)) ∇_w v̂(S, w) ]
‣ Stochastic gradient descent (SGD) samples the gradient:
  Δw = α (v_π(S) − v̂(S, w)) ∇_w v̂(S, w)
‣ The expected update is equal to the full gradient update
Linear Value Function Approximation (VFA)
‣ Represent the value function by a linear combination of features: v̂(S, w) = x(S)ᵀ w
‣ The objective function is quadratic in the parameters w
‣ The update rule is particularly simple:
  Δw = α (v_π(S) − v̂(S, w)) x(S)
‣ Update = step-size × prediction error × feature value
‣ Later, we will look at neural networks as function approximators.
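A semi-gradient TD(0) sketch with a linear value function, on the same made-up 0 → 1 → terminal chain as before. One-hot features are used so the linear approximator can represent the tabular answer exactly; with richer features the same code generalizes across states:

```python
import numpy as np

rng = np.random.default_rng(4)
gamma, alpha = 0.9, 0.05

def features(s):
    x = np.zeros(2)
    if s != 2:                 # terminal state has v_hat = 0 (zero features)
        x[s] = 1.0
    return x

w = np.zeros(2)
for _ in range(5000):
    s = 0
    while s != 2:
        s_next = s + 1 if rng.random() < 0.5 else s
        r = 1.0 if s_next == 2 else 0.0
        # w <- w + alpha * [r + gamma*v(s') - v(s)] * grad v(s); grad = x(s)
        x, x_next = features(s), features(s_next)
        td_error = r + gamma * (w @ x_next) - (w @ x)
        w += alpha * td_error * x
        s = s_next
```

Note the gradient of a linear v̂ is just the feature vector, so the update is exactly "step-size × prediction error × feature value".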
Monte Carlo with VFA (excerpt from Sutton & Barto, Chapter 9: On-policy Prediction with Approximation)

Gradient Monte Carlo Algorithm for Approximating v̂ ≈ v_π
  Input: the policy π to be evaluated
  Input: a differentiable function v̂ : S × ℝⁿ → ℝ
  Initialize value-function weights θ as appropriate (e.g., θ = 0)
  Repeat forever:
    Generate an episode S_0, A_0, R_1, S_1, A_1, ..., R_T, S_T using π
    For t = 0, 1, ..., T − 1:
      θ ← θ + α [ G_t − v̂(S_t, θ) ] ∇v̂(S_t, θ)

If U_t is an unbiased estimate, that is, if E[U_t] = v_π(S_t) for each t, then θ_t is guaranteed to converge to a local optimum under the usual stochastic approximation conditions (2.7) for decreasing α.

For example, suppose the states in the examples are the states generated by interaction (or simulated interaction) with the environment using policy π. Because the true value of a state is the expected value of the return following it, the Monte Carlo target U_t ≐ G_t is by definition an unbiased estimate of v_π(S_t). With this choice, the general SGD method (9.7) converges to a locally optimal approximation to v_π(S_t). Thus, the gradient-descent version of Monte Carlo state-value prediction is guaranteed to find a locally optimal solution. Pseudocode for a complete algorithm is shown in the box.

One does not obtain the same guarantees if a bootstrapping estimate of v_π(S_t) is used as the target U_t in (9.7). Bootstrapping targets such as n-step returns G_t^(n) or the DP target Σ_{a,s',r} π(a|S_t) p(s',r|S_t,a) [ r + γ v̂(s', θ_t) ] all depend on the current value of the weight vector θ_t, which implies that they will be biased and that they will not produce a true gradient-descent method. One way to look at this is that the key step from (9.4) to (9.5) relies on the target being independent of θ_t. This step would not be valid if a bootstrapping estimate were used in place of v_π(S_t). Bootstrapping methods are not in fact instances of true gradient descent (Barnard, 1993). They take into account the effect of changing the weight vector θ_t on the estimate, but ignore its effect on the target. They include only a part of the gradient and, accordingly, we call them semi-gradient methods.

Although semi-gradient (bootstrapping) methods do not converge as robustly as gradient methods, they do converge reliably in important cases such as the linear case discussed in the next section. Moreover, they offer important advantages that often make them clearly preferred. One reason is that they are typically significantly faster to learn, as we have seen in Chapters 6 and 7. Another is that they enable learning to be continual and online, without waiting for the end of an episode. This lets them be used on continuing problems and provides computational advantages. A prototypical semi-gradient method is semi-gradient TD(0), which uses U_t ≐ R_{t+1} + γ v̂(S_{t+1}, θ) as its target. Complete pseudocode for this method is given in the box at the top of the next page.
TD Learning with VFA (excerpt from Sutton & Barto, Section 9.3: Stochastic-gradient and Semi-gradient Methods)

Semi-gradient TD(0) for estimating v̂ ≈ v_π
  Input: the policy π to be evaluated
  Input: a differentiable function v̂ : S⁺ × ℝⁿ → ℝ such that v̂(terminal, ·) = 0
  Initialize value-function weights θ arbitrarily (e.g., θ = 0)
  Repeat (for each episode):
    Initialize S
    Repeat (for each step of episode):
      Choose A ~ π(·|S)
      Take action A, observe R, S'
      θ ← θ + α [ R + γ v̂(S', θ) − v̂(S, θ) ] ∇v̂(S, θ)
      S ← S'
    until S' is terminal
Example 9.1: State Aggregation on the 1000-state Random Walk. State aggregation is a simple form of generalizing function approximation in which states are grouped together, with one estimated value (one component of the weight vector θ) for each group. The value of a state is estimated as its group's component, and when the state is updated, that component alone is updated. State aggregation is a special case of SGD (9.7) in which the gradient, ∇v̂(S_t, θ_t), is 1 for S_t's group's component and 0 for the other components.

Consider a 1000-state version of the random walk task (Examples 6.2 and 7.1). The states are numbered from 1 to 1000, left to right, and all episodes begin near the center, in state 500. State transitions are from the current state to one of the 100 neighboring states to its left, or to one of the 100 neighboring states to its right, all with equal probability. Of course, if the current state is near an edge, then there may be fewer than 100 neighbors on that side of it. In this case, all the probability that would have gone into those missing neighbors goes into the probability of terminating on that side (thus, state 1 has a 0.5 chance of terminating on the left, and state 950 has a 0.25 chance of terminating on the right). As usual, termination on the left produces a reward of −1, and termination on the right produces a reward of +1. All other transitions have a reward of zero. We use this task as a running example throughout this section.

Figure 9.1 shows the true value function v_π for this task. It is nearly a straight line, but tilted slightly toward the horizontal and curving further in this direction for the last 100 states at each end. Also shown is the final approximate value function learned by the gradient Monte Carlo algorithm with state aggregation after 100,000 episodes with a step size of α = 2×10⁻⁵. For the state aggregation, the 1000 states were partitioned into 10 groups of 100 states each (i.e., states 1-100 were one group, states 101-200 were another, and so on). The staircase effect shown in the figure is typical of state aggregation; within each group, the approximate value is constant, and it changes abruptly from one group to the next.
Linear Action-Value Function Approximation
‣ Represent the state and action by a feature vector x(S, A)
‣ Represent the action-value function by a linear combination of features: q̂(S, A, w) = x(S, A)ᵀ w
‣ Stochastic gradient descent update: Δw = α (q_π(S, A) − q̂(S, A, w)) x(S, A)
Incremental Control Algorithms (excerpt from Sutton & Barto, Chapter 10: On-policy Control with Approximation)

The general gradient-descent update for action-value prediction is
  θ_{t+1} ≐ θ_t + α [ U_t − q̂(S_t, A_t, θ_t) ] ∇q̂(S_t, A_t, θ_t).   (10.1)
For example, the update for the one-step Sarsa method is
  θ_{t+1} ≐ θ_t + α [ R_{t+1} + γ q̂(S_{t+1}, A_{t+1}, θ_t) − q̂(S_t, A_t, θ_t) ] ∇q̂(S_t, A_t, θ_t).   (10.2)
We call this method episodic semi-gradient one-step Sarsa. For a constant policy, this method converges in the same way that TD(0) does, with the same kind of error bound (9.14).

To form control methods, we need to couple such action-value prediction methods with techniques for policy improvement and action selection. Suitable techniques applicable to continuous actions, or to actions from large discrete sets, are a topic of ongoing research with as yet no clear resolution. On the other hand, if the action set is discrete and not too large, then we can use the techniques already developed in previous chapters. That is, for each possible action a available in the current state S_t, we can compute q̂(S_t, a, θ_t) and then find the greedy action A*_t = argmax_a q̂(S_t, a, θ_t). Policy improvement is then done (in the on-policy case treated in this chapter) by changing the estimation policy to a soft approximation of the greedy policy such as the ε-greedy policy. Actions are selected according to this same policy. Pseudocode for the complete algorithm is given in the box.

Example 10.1: Mountain Car Task. Consider the task of driving an underpowered car up a steep mountain road, as suggested by the diagram in the upper left of Figure 10.1. The difficulty is that gravity is stronger than the car's engine, and even at full throttle the car cannot accelerate up the steep slope. The only solution is to first move away from the goal and up the opposite slope on the left.
Episodic Semi-gradient Sarsa for Estimating q̂ ≈ q*
  Input: a differentiable function q̂ : S × A × ℝⁿ → ℝ
  Initialize value-function weights θ ∈ ℝⁿ arbitrarily (e.g., θ = 0)
  Repeat (for each episode):
    S, A ← initial state and action of episode (e.g., ε-greedy)
    Repeat (for each step of episode):
      Take action A, observe R, S'
      If S' is terminal:
        θ ← θ + α [ R − q̂(S, A, θ) ] ∇q̂(S, A, θ)
        Go to next episode
      Choose A' as a function of q̂(S', ·, θ) (e.g., ε-greedy)
      θ ← θ + α [ R + γ q̂(S', A', θ) − q̂(S, A, θ) ] ∇q̂(S, A, θ)
      S ← S'
      A ← A'
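The boxed pseudocode can be sketched directly. The 2-state chain environment below is invented for illustration (action 1 advances toward the terminal state, which pays reward 1; action 0 stays put), and one-hot state-action features are used so the linear q̂ reduces to the tabular case:

```python
import numpy as np

rng = np.random.default_rng(7)
gamma, alpha, eps = 0.9, 0.05, 0.1
n_states, n_actions = 2, 2
theta = np.zeros(n_states * n_actions)

def x(s, a):                         # one-hot feature vector for (s, a)
    v = np.zeros(n_states * n_actions)
    v[s * n_actions + a] = 1.0
    return v

def q_hat(s, a):
    return theta @ x(s, a)

def eps_greedy(s):
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return max(range(n_actions), key=lambda a: q_hat(s, a))

def step(s, a):                      # made-up dynamics: action 1 advances
    s_next = s + 1 if a == 1 else s
    return s_next, (1.0 if s_next == n_states else 0.0)

for _ in range(3000):
    s, a = 0, eps_greedy(0)
    while True:
        s_next, r = step(s, a)
        if s_next == n_states:       # terminal: the target is just R
            theta += alpha * (r - q_hat(s, a)) * x(s, a)
            break
        a_next = eps_greedy(s_next)
        theta += alpha * (r + gamma * q_hat(s_next, a_next) - q_hat(s, a)) * x(s, a)
        s, a = s_next, a_next
```

Because the target uses the action A' the ε-greedy policy actually selects (not the max), this is on-policy, unlike Q-learning.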
SGD with Experience Replay
‣ Given experience consisting of ⟨state, value⟩ pairs
‣ Repeat:
  - Sample a ⟨state, value⟩ pair from the experience
  - Apply a stochastic gradient descent update
‣ Converges to the least squares solution
‣ We will look at Deep Q-networks later.
Monte Carlo vs TD Learning: Convergence in the On-Policy Case
• Evaluating the value of a single policy: minimize Σ_s d(s) (V^π(s) − V̂(s, w))²
  • where d(s) is generally the on-policy stationary distribution of π
  • V̂(s, w) is the value function approximation
• Convergence given an infinite amount of data?
Tsitsiklis and Van Roy. An Analysis of Temporal-Difference Learning with Function Approximation. 1997
Monte Carlo Convergence: Linear VFA
• Evaluating the value of a single policy: minimize Σ_s d(s) (V^π(s) − V̂(s, w))²
• Linear VFA: Monte Carlo converges to the minimum MSE possible!
Tsitsiklis and Van Roy. An Analysis of Temporal-Difference Learning with Function Approximation. 1997
TD Learning Convergence: Linear VFA
• Evaluating the value of a single policy: minimize Σ_s d(s) (V^π(s) − V̂(s, w))²
• Linear VFA: TD converges to within a constant factor of the best possible MSE
Tsitsiklis and Van Roy. An Analysis of Temporal-Difference Learning with Function Approximation. 1997
TD Learning vs Monte Carlo: Finite Data, Lookup Table, Which is Better?
• MC converges to the minimum MSE
• TD with infinite experience replay converges to the estimate obtained as if we computed the MLE MDP model and then computed the value of the policy under it
  • This is often better if the domain is an MDP!
  • Leverages the Markov structure
Off-Policy Learning
Can use importance sampling with MC to estimate the value of a policy we didn't try (but that has support under the behavior policy)
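A minimal sketch of ordinary importance sampling on a made-up one-step problem (the behavior policy b, target policy π, and rewards are all invented; real MC off-policy evaluation multiplies the per-step ratios along a whole episode):

```python
import random

random.seed(5)

def reward(a):
    return 1.0 if a == 1 else 0.0    # deterministic, for simplicity

b = {0: 0.5, 1: 0.5}                 # behavior policy: uniform over actions
pi = {0: 0.0, 1: 1.0}                # target policy: always action 1

total, n = 0.0, 10000
for _ in range(n):
    a = random.choice((0, 1))        # act with the behavior policy b
    rho = pi[a] / b[a]               # importance weight pi(a)/b(a)
    total += rho * reward(a)         # reweight the observed return

v_pi_estimate = total / n
```

The estimate is unbiased for the target policy's value (here 1.0) even though every action was chosen by b, which is exactly the "support" requirement: b must give nonzero probability to any action π can take.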
Q-learning with function approximation can diverge (fail to converge even given infinite data)
• See examples in Chapter 11 of Sutton and Barto
Deep Learning & Model-Free Q-Learning
• Running stochastic gradient descent
• Now use a deep network to approximate Q
• The target involves the value of the next state, V(s'), but we don't know it
  • Could use a Monte Carlo estimate (sum of rewards to the end of the episode)
[Equation figure: the Q-learning target, computed from the Q-network]
Challenges
• Challenges of using function approximation:
  • Local updates (s, a, r, s') are highly correlated
  • The "target" (approximation to the true value of s') can change quickly and lead to instabilities
DQN: Q-Learning with DL
• Experience replay of a mix of prior (s_i, a_i, r_i, s_{i+1}) tuples to update Q(w)
• Fix the target Q (with weights w⁻) for a number of steps, then update it
• Optimize the MSE between the current Q and the Q target
  • Use stochastic gradient descent
[Equation figure: the Q-learning target, computed from the target Q-network]
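The two stabilization tricks can be sketched at toy scale. Below, a plain dict stands in for the Q-network and a frozen copy stands in for the target network; the 2-state loop environment is invented. This shows only the data flow (replay buffer, minibatch, periodic target sync), not the DQN of Mnih et al.:

```python
import random
from collections import deque

random.seed(6)
buffer = deque(maxlen=1000)          # replay buffer of (s, a, r, s') tuples
Q, Q_target = {}, {}                 # dicts stand in for network weights
SYNC_EVERY = 50
gamma, alpha = 0.9, 0.1

def q(table, s, a):
    return table.get((s, a), 0.0)

for t in range(2000):
    # Made-up experience: a 2-state loop where action 1 pays reward 1.
    s = t % 2
    a = random.choice((0, 1))
    r = 1.0 if a == 1 else 0.0
    s_next = 1 - s
    buffer.append((s, a, r, s_next))

    # Learn from a random minibatch, bootstrapping from the *frozen* target
    for s_i, a_i, r_i, sn_i in random.sample(buffer, min(32, len(buffer))):
        target = r_i + gamma * max(q(Q_target, sn_i, 0), q(Q_target, sn_i, 1))
        Q[(s_i, a_i)] = q(Q, s_i, a_i) + alpha * (target - q(Q, s_i, a_i))

    if t % SYNC_EVERY == 0:
        Q_target = dict(Q)           # periodically copy weights into the target
```

Sampling the minibatch from the buffer breaks the correlation between consecutive updates, and freezing the target between syncs keeps it from chasing its own updates.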
Deep RL
• Experience replay is hugely helpful
• Target stabilization is also helpful
• No guarantees on convergence (yet)
• Some other influential ideas:
  • Prioritized experience replay
  • Double Q (two separate networks, each acting as a "target" for the other)
  • Dueling: separate value and advantage
• Many of these ideas build on prior ideas for RL with lookup-table representations
What We’ve Covered So Far
• Markov decision process planning
• Reinforcement learning in finite state and action domains
• Generalization with model-free techniques
• Exploration
  • ε-greedy
  • PAC: Rmax
  • regret: UCRL
  • Know what the different criteria guarantee, which algorithms achieve them, and when you might want one criterion over another. You do not have to be able to derive full proofs.
Performance of RL Algorithms
• Convergence
• Asymptotically optimal
  • In the limit of infinite data, will converge to the optimal π
  • E.g., Q-learning with ε-greedy action selection
  • Says nothing about finite-data performance
• Probably approximately correct
• Minimize / sublinear regret
Optimism Under Uncertainty
R-max (Brafman and Tennenholtz)
Slight modification of the R-max pseudocode (Algorithm 1) in
Reinforcement Learning in Finite MDPs: PAC Analysis (Strehl, Li, Littman 2009)
Unknown state-action pairs are assigned the optimistic value Rmax / (1−γ)
Regret vs PAC vs ...?
• What choice of performance should we care about?
• For simplicity, consider the episodic setting
• Return is the sum of rewards in an episode
[Figure: return vs. episodes 1, 2, 3, …, k]
Regret Bounds
[Figure: return vs. episodes 1, 2, 3, …, k, with the optimal return shown as a reference line]
(ε,δ)-Probably Approximately Correct
[Figure: return vs. episodes 1, 2, 3, …, k, with the optimal return shown as a reference line]
(ε,δ)-Probably Approximately Correct
[Figure: return vs. episodes, highlighting the number of episodes with policies not ε-close to optimal]
Uniform PAC Bound
Uniform-PAC (Dann, Lattimore, Brunskill, arXiv, 2017)