
Page 1

Reinforcement Learning
Michael Roberts

With material from: Reinforcement Learning: An Introduction, Sutton & Barto (1998)

Page 2

What is RL?

• Trial & error learning
  – without a model
  – with a model

• Structure

[Diagram: a chain of states s1, s2, s3, s4 whose transitions yield rewards r1, r2, r3]

Page 3

RL vs. Supervised Learning

• Evaluative vs. instructional feedback

• Role of exploration

• On-line performance

Page 4

K-armed Bandit Problem

[Diagram: an agent selects among k actions (arms); each arm's average reward (10, -5, 100, 0) is estimated from observed reward samples, e.g. 0, 0, 5, 10, 35 (mean 10) and 5, 10, -15, -15, -10 (mean -5)]

Page 5

K-armed Bandit Cont.

• Greedy exploration
• ε-greedy
• Softmax

Average reward (sample average of the $k_a$ rewards received for action $a$):

$Q_k(a) = \frac{r_1 + r_2 + \cdots + r_{k_a}}{k_a}$

Incremental formula:

$Q_{k+1} = Q_k + \alpha \left[ r_{k+1} - Q_k \right]$

where α = 1 / (k+1)

Probability of choosing action $a$ (softmax with temperature τ):

$P(a) = \frac{e^{Q(a)/\tau}}{\sum_{b=1}^{k} e^{Q(b)/\tau}}$
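A minimal runnable sketch of these selection rules and the incremental update; the arm count, payoffs, and noise below are illustrative assumptions, not from the slides:

import math
import random

def epsilon_greedy(Q, epsilon=0.1):
    # With probability epsilon pick a random arm (explore),
    # otherwise pick the arm with the highest estimate (exploit).
    if random.random() < epsilon:
        return random.randrange(len(Q))
    return max(range(len(Q)), key=lambda a: Q[a])

def softmax_choice(Q, tau=1.0):
    # Sample an arm with probability proportional to exp(Q[a] / tau).
    prefs = [math.exp(q / tau) for q in Q]
    total = sum(prefs)
    r, cum = random.random() * total, 0.0
    for a, p in enumerate(prefs):
        cum += p
        if r < cum:
            return a
    return len(Q) - 1

# Incremental sample-average update: Q <- Q + (1/(k+1)) [r - Q]
true_means = [10, -5, 100, 0]             # hypothetical arm payoffs
Q, counts = [0.0] * 4, [0] * 4
for _ in range(1000):
    a = epsilon_greedy(Q)
    r = random.gauss(true_means[a], 5.0)  # assumed noisy rewards
    counts[a] += 1
    Q[a] += (r - Q[a]) / counts[a]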

Page 6

More General Problems

• More than one state
• Delayed rewards

• Markov Decision Process (MDP) (see the sketch after this list)
  – Set of states
  – Set of actions
  – Reward function
  – State transition function

• Table or Function Approximation
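As a concrete illustration of those four components, a small tabular MDP can be written down directly; every name and number below is a hypothetical assumption:

# A minimal tabular MDP: set of states, set of actions,
# reward function, and state transition function.
states = ["s1", "s2"]
actions = ["a1", "a2"]

# reward[(s, a)] -> expected immediate reward
reward = {("s1", "a1"): 1.0, ("s1", "a2"): 0.0,
          ("s2", "a1"): -1.0, ("s2", "a2"): 5.0}

# transition[(s, a)] -> {next_state: probability}
transition = {("s1", "a1"): {"s1": 0.9, "s2": 0.1},
              ("s1", "a2"): {"s2": 1.0},
              ("s2", "a1"): {"s1": 1.0},
              ("s2", "a2"): {"s1": 0.5, "s2": 0.5}}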

Page 7

Page 8

Example: Recycling Robot

Page 9

Recycling Robot: Transition Graph
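The transition graph itself was an image; in Sutton & Barto's recycling-robot example the dynamics can be tabulated as below. The structure follows the book, but the numeric values of α, β, r_search, and r_wait are assumptions for illustration:

# Recycling robot (after Sutton & Barto, 1998): battery is 'high' or 'low'.
# alpha/beta are the probabilities that searching keeps the charge level;
# a depleted battery means a rescue with reward -3.
alpha, beta = 0.9, 0.6
r_search, r_wait = 2.0, 1.0        # searching pays more than waiting

# (state, action) -> list of (probability, next_state, reward)
dynamics = {
    ("high", "search"):   [(alpha, "high", r_search), (1 - alpha, "low", r_search)],
    ("high", "wait"):     [(1.0, "high", r_wait)],
    ("low",  "search"):   [(beta, "low", r_search), (1 - beta, "high", -3.0)],
    ("low",  "wait"):     [(1.0, "low", r_wait)],
    ("low",  "recharge"): [(1.0, "high", 0.0)],
}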

Page 10

Dynamic Programming

Page 11

Backup Diagram

[Backup diagram: from a state, actions taken with probability .25 each; transition probabilities .5/.5, .3/.7, .6/.4; leaf rewards 10, 5, 200, 200, -10, 1000]
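The target such a diagram computes is the one-step policy-evaluation backup; in Sutton & Barto's notation:

$V^{\pi}(s) = \sum_{a} \pi(s,a) \sum_{s'} \mathcal{P}^{a}_{ss'} \left[ \mathcal{R}^{a}_{ss'} + \gamma V^{\pi}(s') \right]$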

Page 12

Dynamic Programming: Optimal Policy

Page 13

Backup for Optimal Policy
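The backup for the optimal policy replaces the expectation over the policy's actions with a maximization, giving the Bellman optimality equation:

$V^{*}(s) = \max_{a} \sum_{s'} \mathcal{P}^{a}_{ss'} \left[ \mathcal{R}^{a}_{ss'} + \gamma V^{*}(s') \right]$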

Page 14

Performance Metrics

• Eventual convergence to optimality

• Speed of convergence to optimality

• Regret (total expected reward lost relative to always behaving optimally)

(Kaelbling, Littman, & Moore, 1996)

Page 15

Gridworld Example

Page 16

Initialize V arbitrarily, e.g. $V(s) = 0$, for all $s \in S^{+}$

Repeat
  $\Delta \leftarrow 0$
  For each $s \in S$:
    $v \leftarrow V(s)$
    $V(s) \leftarrow \max_{a} \sum_{s'} \mathcal{P}^{a}_{ss'} \left[ \mathcal{R}^{a}_{ss'} + \gamma V(s') \right]$
    $\Delta \leftarrow \max(\Delta, |v - V(s)|)$
until $\Delta < \theta$ (a small positive number)

Output a deterministic policy $\pi$ such that:
  $\pi(s) = \arg\max_{a} \sum_{s'} \mathcal{P}^{a}_{ss'} \left[ \mathcal{R}^{a}_{ss'} + \gamma V(s') \right]$
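A runnable sketch of this loop on a tiny gridworld; the layout, step reward, γ, and θ are assumptions for illustration:

# Value iteration on a 4x4 gridworld: -1 per step, one terminal corner.
gamma, theta, N = 1.0, 1e-6, 4
states = [(i, j) for i in range(N) for j in range(N)]
terminal = (0, 0)
moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(s, a):
    # Deterministic move; bumping into a wall leaves the state unchanged.
    i, j = s[0] + moves[a][0], s[1] + moves[a][1]
    s2 = (i, j) if 0 <= i < N and 0 <= j < N else s
    return s2, -1.0

V = {s: 0.0 for s in states}
while True:
    delta = 0.0
    for s in states:
        if s == terminal:
            continue
        v = V[s]
        # V(s) <- max_a [R + gamma V(s')]   (transitions here are deterministic)
        V[s] = max(rew + gamma * V[s2] for s2, rew in (step(s, a) for a in moves))
        delta = max(delta, abs(v - V[s]))
    if delta < theta:
        break

# Deterministic greedy policy extracted from the converged values.
policy = {s: max(moves, key=lambda a: step(s, a)[1] + gamma * V[step(s, a)[0]])
          for s in states if s != terminal}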

Page 17

Temporal Difference Learning

• RL without a model
• Issue of temporal credit assignment
• Bootstraps like DP

• TD(0):

$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$

Page 18

TD Learning

• Again, TD(0):

$V(s_t) \leftarrow V(s_t) + \alpha \, \delta_t$, where $\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$

• TD(λ) applies the same error to every state in proportion to its eligibility:

$V(s) \leftarrow V(s) + \alpha \, \delta_t \, e_t(s)$ for all $s$, with $e_t(s) = \gamma \lambda \, e_{t-1}(s) + \mathbf{1}[s = s_t]$

where e is called an eligibility trace: it decays by γλ each step and is incremented when its state is visited.
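A minimal sketch of the tabular TD(λ) update; the transition format, α, γ, and λ below are assumptions, and setting λ = 0 recovers TD(0):

from collections import defaultdict

def td_lambda(transitions, V, alpha=0.1, gamma=0.99, lam=0.8):
    # Tabular TD(lambda) over a stream of (s, r, s_next, done) transitions
    # collected while following some fixed policy.
    e = defaultdict(float)                 # eligibility trace per state
    for s, r, s2, done in transitions:
        target = r + (0.0 if done else gamma * V.get(s2, 0.0))
        delta = target - V.get(s, 0.0)     # one-step TD error
        e[s] += 1.0                        # bump the visited state's trace
        for state in list(e):              # credit every eligible state
            V[state] = V.get(state, 0.0) + alpha * delta * e[state]
            e[state] *= gamma * lam        # decay traces
        if done:
            e.clear()                      # traces reset between episodes
    return V

# Hypothetical usage: two transitions from a made-up two-state chain.
V = td_lambda([("A", 0.0, "B", False), ("B", 1.0, "B", True)], {})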

Page 19

Backup Diagram for TD(λ)

Page 20

TD-Gammon (Tesauro)

Page 21

Additional Work

• POMDPs

• Macros

• Multi-agent RL

• Multiple reward structures