
RL Lecture 5: Monte Carlo Methods


Page 1: RL Lecture 5: Monte Carlo Methods

R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

RL Lecture 5: Monte Carlo Methods

❐  Monte Carlo methods learn from complete sample returns
  - Only defined for episodic tasks

❐  Monte Carlo methods learn directly from experience
  - On-line: No model necessary and still attains optimality
  - Simulated: No need for a full model

Page 2: RL Lecture 5: Monte Carlo Methods


Monte Carlo Policy Evaluation

❐  Goal: learn Vπ(s)
❐  Given: some number of episodes under π which contain s
❐  Idea: Average returns observed after visits to s

❐  Every-Visit MC: average returns for every time s is visited in an episode

❐  First-visit MC: average returns only for first time s is visited in an episode

❐  Both converge asymptotically


Page 3: RL Lecture 5: Monte Carlo Methods


First-visit Monte Carlo policy evaluation
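
The algorithm box from this slide did not survive extraction. Below is a minimal sketch of first-visit MC prediction (not taken from the slide), assuming each episode is given as a list of (state, reward) pairs, where the reward is the one received on the transition out of that state.

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """Estimate V(s) by averaging returns from the first visit to s in each episode."""
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    for episode in episodes:
        # Index of the first occurrence of each state in this episode.
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # Walk backwards so G accumulates the discounted return from each step.
        G = 0.0
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r
            if first_visit[s] == t:          # only the first visit contributes
                returns_sum[s] += G
                returns_count[s] += 1
    return {s: returns_sum[s] / returns_count[s] for s in returns_sum}
```

Every-visit MC would simply drop the first_visit check and record every occurrence of s.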

Page 4: RL Lecture 5: Monte Carlo Methods


Blackjack example

❐  Object: Have your card sum be greater than the dealer's without exceeding 21.
❐  States (200 of them):
  - current sum (12-21)
  - dealer's showing card (ace-10)
  - do I have a usable ace?
❐  Reward: +1 for winning, 0 for a draw, -1 for losing
❐  Actions: stick (stop receiving cards), hit (receive another card)
❐  Policy: Stick if my sum is 20 or 21, else hit
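
As a concrete, hedged illustration of evaluating this fixed policy, episodes could be generated with Gymnasium's Blackjack-v1 environment, whose observations are exactly (player sum, dealer showing card, usable ace). The use of that package, and the episode format matching the sketch above, are assumptions, not part of the slides.

```python
import gymnasium as gym   # assumption: the gymnasium package is installed

def stick_on_20_policy(obs):
    """The slide's fixed policy: stick (action 0) on 20 or 21, otherwise hit (action 1)."""
    player_sum, dealer_showing, usable_ace = obs
    return 0 if player_sum >= 20 else 1

def generate_episode(env):
    """Play one hand and return it as a list of (state, reward) pairs."""
    obs, _ = env.reset()
    episode, done = [], False
    while not done:
        next_obs, reward, terminated, truncated, _ = env.step(stick_on_20_policy(obs))
        episode.append((obs, reward))
        done = terminated or truncated
        obs = next_obs
    return episode

env = gym.make("Blackjack-v1")
episodes = [generate_episode(env) for _ in range(10_000)]
# V = first_visit_mc(episodes)   # feed into the earlier first-visit sketch
```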

Page 5: RL Lecture 5: Monte Carlo Methods


Blackjack value functions

[Figure: Monte Carlo estimates of the state-value function for the stick-on-20-or-21 policy, after 10,000 and after 500,000 episodes, for usable-ace and no-usable-ace states; axes: player sum 12-21, dealer showing A-10, values from −1 to +1.]

Page 6: RL Lecture 5: Monte Carlo Methods


Backup diagram for Monte Carlo

❐  Entire episode included
❐  Only one choice at each state (unlike DP)

❐  MC does not bootstrap

❐  Time required to estimate one state does not depend on the total number of states

[Backup diagram: one sampled trajectory from the root state down to the terminal state.]

Page 7: RL Lecture 5: Monte Carlo Methods


The Power of Monte Carlo

e.g., Elastic Membrane (Dirichlet Problem)

How do we compute the shape of the membrane or bubble?

Page 8: RL Lecture 5: Monte Carlo Methods


Two Approaches

❐  Relaxation
❐  Kakutani's algorithm, 1945
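
To make the second approach concrete, here is a small sketch (not from the slides) of the random-walk idea behind Kakutani's result: the membrane height at an interior point equals the expected boundary height where a random walk started there first exits. The grid size and boundary values below are hypothetical.

```python
import random

N = 20                                   # interior points are (1..N-1, 1..N-1)

def boundary_value(x, y):
    # Hypothetical boundary heights: one edge raised, the rest flat.
    return 1.0 if y == N else 0.0

def estimate_height(x0, y0, walks=5000):
    """Average the boundary values hit by random walks started at (x0, y0)."""
    total = 0.0
    for _ in range(walks):
        x, y = x0, y0
        while 0 < x < N and 0 < y < N:   # walk until the boundary is reached
            dx, dy = random.choice([(1, 0), (-1, 0), (0, 1), (0, -1)])
            x, y = x + dx, y + dy
        total += boundary_value(x, y)
    return total / walks

print(estimate_height(10, 10))           # membrane height near the middle
```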

Page 9: RL Lecture 5: Monte Carlo Methods


Monte Carlo Estimation of Action Values (Q)

❐  Monte Carlo is most useful when a model is not available
  - We want to learn Q*
❐  Qπ(s,a): average return starting from state s, taking action a, and thereafter following π

❐  Also converges asymptotically if every state-action pair is visited

❐  Exploring starts: Every state-action pair has a non-zero probability of being the starting pair
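
The first-visit averaging sketch from earlier carries over to state-action pairs; a minimal variant (again an illustration, not the slide's algorithm), assuming episodes are now lists of (state, action, reward) triples:

```python
from collections import defaultdict

def first_visit_mc_q(episodes, gamma=1.0):
    """Estimate Q(s, a) by averaging returns from the first visit to each (s, a)."""
    returns_sum, returns_count = defaultdict(float), defaultdict(int)
    for episode in episodes:
        first_visit = {}
        for t, (s, a, _) in enumerate(episode):
            first_visit.setdefault((s, a), t)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first_visit[(s, a)] == t:
                returns_sum[(s, a)] += G
                returns_count[(s, a)] += 1
    return {sa: returns_sum[sa] / returns_count[sa] for sa in returns_sum}
```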

Page 10: RL Lecture 5: Monte Carlo Methods


Monte Carlo Control

❐  MC policy iteration: Policy evaluation using MC methods followed by policy improvement

❐  Policy improvement step: greedify with respect to value (or action-value) function

[GPI diagram: π and Q alternate. Evaluation: Q → Qπ; improvement: π → greedy(Q).]
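
A minimal sketch of the improvement step for a tabular Q keyed by (state, action) pairs (the representation is an assumption carried over from the earlier sketches):

```python
def greedify(Q):
    """Policy improvement: for every state appearing in Q (keyed by (state, action)),
    pick the action with the highest estimated value."""
    policy = {}
    for (s, a) in Q:
        if s not in policy or Q[(s, a)] > Q[(s, policy[s])]:
            policy[s] = a
    return policy
```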

Page 11: RL Lecture 5: Monte Carlo Methods


Convergence of MC Control

❐  Policy improvement theorem tells us:

   Q^{π_k}(s, π_{k+1}(s)) = Q^{π_k}(s, argmax_a Q^{π_k}(s,a))
                          = max_a Q^{π_k}(s,a)
                          ≥ Q^{π_k}(s, π_k(s))
                          = V^{π_k}(s)

❐  This assumes exploring starts and an infinite number of episodes for MC policy evaluation
❐  To solve the latter:
  - update only to a given level of performance
  - alternate between evaluation and improvement per episode

Page 12: RL Lecture 5: Monte Carlo Methods


Monte Carlo Exploring Starts

Fixed point is optimal policy π*

Proof is open question
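
The MC ES algorithm box did not survive extraction. The sketch below captures its structure under one loudly hypothetical assumption: a simulator hook reset_to(s0, a0, policy) that starts an episode at the chosen state-action pair (the exploring start), follows the policy dictionary afterwards (choosing arbitrarily in states the policy has not set yet), and returns the episode as (state, action, reward) triples.

```python
import random
from collections import defaultdict

def mc_es(reset_to, state_actions, n_episodes, gamma=1.0):
    """Sketch of Monte Carlo control with exploring starts (MC ES)."""
    Q = defaultdict(float)
    counts = defaultdict(int)
    policy = {}
    for _ in range(n_episodes):
        s0, a0 = random.choice(state_actions)          # exploring start
        episode = reset_to(s0, a0, policy)             # hypothetical simulator hook
        first = {}
        for t, (s, a, _) in enumerate(episode):
            first.setdefault((s, a), t)
        G = 0.0
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = gamma * G + r
            if first[(s, a)] == t:
                counts[(s, a)] += 1                    # incremental first-visit average
                Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]
                # policy improvement at s: greedify with respect to Q
                acts = [b for (x, b) in Q if x == s]
                policy[s] = max(acts, key=lambda b: Q[(s, b)])
    return policy, Q
```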

Page 13: RL Lecture 5: Monte Carlo Methods


Blackjack example continued

❐  Exploring starts
❐  Initial policy as described before

[Figure: the optimal policy π* (STICK/HIT regions over player sum 11-21 and dealer showing card A-10, separately for usable-ace and no-usable-ace states) and the optimal value function V* (values from −1 to +1) found by Monte Carlo ES.]

Page 14: RL Lecture 5: Monte Carlo Methods


On-policy Monte Carlo Control

❐  On-policy: learn about the policy currently being executed
❐  How do we get rid of exploring starts?
  - Need soft policies: π(s,a) > 0 for all s and a
  - e.g. ε-soft policy:

      π(s,a) = 1 − ε + ε/|A(s)|   for the greedy action
      π(s,a) = ε/|A(s)|           for each non-max action

❐  Similar to GPI: move policy towards greedy policy (i.e. ε-soft)
❐  Converges to best ε-soft policy

Page 15: RL Lecture 5: Monte Carlo Methods


On-policy MC Control
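
The on-policy algorithm box is also missing from the extracted text. Its key difference from MC ES is the action selection; a minimal sketch of ε-greedy selection, which realizes the probabilities shown on the previous slide (the tabular Q and action list are assumed interfaces):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Pick the greedy action with probability 1 − ε + ε/|A(s)|,
    and each other action with probability ε/|A(s)|."""
    if random.random() < epsilon:
        return random.choice(actions)    # uniform among all actions, greedy included
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```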

Page 16: RL Lecture 5: Monte Carlo Methods


Off-policy Monte Carlo control

❐  Behavior policy generates behavior in the environment
❐  Estimation policy is the policy being learned about
❐  Average returns from the behavior policy, weighting each by its relative probability under the estimation policy

Page 17: RL Lecture 5: Monte Carlo Methods


Learning about π while following π′

Page 18: RL Lecture 5: Monte Carlo Methods


Off-policy MC control
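
A sketch of the weighting idea (not the algorithm box, which did not survive extraction): each return observed under the behavior policy is weighted by the product of probability ratios between the estimation policy π and the behavior policy π′. The two callables below, mapping a state-action pair to a probability under each policy, are assumed interfaces.

```python
def importance_weight(episode, pi, behavior):
    """Product over the episode of π(a|s) / π′(a|s).

    `episode` is a list of (state, action, reward) triples; `pi` and
    `behavior` are assumed to return the probability of taking `action`
    in `state` under the estimation and behavior policies respectively.
    """
    w = 1.0
    for (s, a, _) in episode:
        w *= pi(s, a) / behavior(s, a)
    return w
```

These weights are exactly the w_k fed into the weighted average on the next slide.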

Page 19: RL Lecture 5: Monte Carlo Methods


Incremental Implementation

❐  MC can be implemented incrementally
  - saves memory

❐  Compute the weighted average of each return

Non-incremental (weighted average of the first n returns):

   V_n = ( Σ_{k=1}^n w_k R_k ) / ( Σ_{k=1}^n w_k )

Incremental equivalent:

   V_{n+1} = V_n + ( w_{n+1} / W_{n+1} ) [ R_{n+1} − V_n ]
   W_{n+1} = W_n + w_{n+1},   with W_0 = 0
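
A direct, hedged transcription of the incremental rule, useful when weighted returns (for example, the importance-sampling weights above) arrive one at a time:

```python
class WeightedAverage:
    """Maintains V_n = (Σ w_k R_k) / (Σ w_k) incrementally."""
    def __init__(self):
        self.V = 0.0
        self.W = 0.0                    # W_0 = 0
    def update(self, R, w=1.0):
        if w == 0.0:
            return self.V               # zero-weight returns leave the estimate unchanged
        self.W += w                     # W_{n+1} = W_n + w_{n+1}
        self.V += (w / self.W) * (R - self.V)   # V_{n+1} = V_n + (w/W)[R − V_n]
        return self.V
```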

Page 20: RL Lecture 5: Monte Carlo Methods


Racetrack Exercise

[Figure: two racetrack grids, each with a starting line and a finish line.]

❐  States: grid squares, velocity horizontal and vertical

❐  Rewards: -1 on track, -5 off track

❐  Actions: +1, -1, 0 to velocity
❐  0 < Velocity < 5
❐  Stochastic: 50% of the time it moves 1 extra square up or right

Page 21: RL Lecture 5: Monte Carlo Methods


Summary

❐  MC has several advantages over DP:
  - Can learn directly from interaction with environment
  - No need for full models
  - No need to learn about ALL states
  - Less harm by Markovian violations (later in book)

❐  MC methods provide an alternate policy evaluation process

❐  One issue to watch for: maintaining sufficient exploration
  - exploring starts, soft policies

❐  No bootstrapping (as opposed to DP)