Page 1: Learning to Maximize Reward: Reinforcement Learning

1

Learning to Maximize Reward: Reinforcement Learning


Brian C. Williams, 16.412J/6.834J, October 28th, 2002

Slides adapted from: Manuela Veloso, Reid Simmons, & Tom Mitchell, CMU

Page 2: Learning to Maximize Reward: Reinforcement Learning

2

Reading

• Today: Reinforcement Learning

• Read 2nd ed. AIMA Chapter 19, or 1st ed. AIMA Chapter 20

• Read "Reinforcement Learning: A Survey" by L. Kaelbling, M. Littman and A. Moore, Journal of Artificial Intelligence Research 4 (1996) 237-285.

• For Markov Decision Processes: read 1st/2nd ed. AIMA Chapter 17, Sections 1-4.

• Optional Reading: "Planning and Acting in Partially Observable Stochastic Domains", by L. Kaelbling, M. Littman and A. Cassandra, Artificial Intelligence 101 (1998) 99-134.

Page 3: Learning to Maximize Reward: Reinforcement Learning

3

Markov Decision Processes and Reinforcement Learning

• Motivation
• Learning policies through reinforcement
  • Q values
  • Q learning
  • Multi-step backups
• Nondeterministic MDPs
• Function Approximators
• Model-based Learning
• Summary

Page 4: Learning to Maximize Reward: Reinforcement Learning

4

Example: TD-Gammon [Tesauro, 1995]

Learns to play Backgammon

Situations:
• Board configurations (~10^20)

Actions:
• Moves

Rewards:
• +100 if win
• -100 if lose
• 0 for all other states

• Trained by playing 1.5 million games against self.

Currently, roughly equal to best human player.

Page 5: Learning to Maximize Reward: Reinforcement Learning

5

Reinforcement Learning Problem

Given: Repeatedly…
• Executed action
• Observed state
• Observed reward

Learn action policy π : S → A
• Maximizes lifetime reward
  r_0 + γ r_1 + γ² r_2 + … from any start state.

• Discount factor: 0 ≤ γ < 1

Note:
• Unsupervised learning
• Delayed reward

[Figure: agent-environment loop. At each step the agent observes the state and reward and emits an action, producing the trajectory s0, r0, a0, s1, r1, a1, s2, r2, a2, s3, …]

Goal: Learn to choose actions that maximize lifetime reward

r_0 + γ r_1 + γ² r_2 + …

Page 6: Learning to Maximize Reward: Reinforcement Learning

6

How About Learning the Policy Directly?

1. Represent the policy as a table π* : S → A.

2. Fill out the table entries for π* by collecting statistics on training pairs <s, a>.

3. Where does a* come from?

Page 7: Learning to Maximize Reward: Reinforcement Learning

7

How About Learning the Value Function?

1. Have the agent learn the value function V^π*, denoted V*.

2. Given the learned V*, the agent selects the optimal action by one-step lookahead:

   π*(s) = argmax_a [ r(s,a) + γ V*(δ(s,a)) ]

Problem:
• Works well if the agent knows the environment model:
  • δ : S × A → S
  • r : S × A → ℝ
• With no model, the agent can't choose an action from V*.
• With a model, it could compute V* via value iteration, so why learn it?

Page 8: Learning to Maximize Reward: Reinforcement Learning

8

How About Learning the Model as Well?

1. Have the agent learn δ and r from statistics on training instances <s_t, a_t, r_t+1, s_t+1>.

2. Compute V* by value iteration:
   V_t+1(s) ← max_a [ r(s,a) + γ V_t(δ(s,a)) ]

3. The agent selects the optimal action by one-step lookahead:
   π*(s) = argmax_a [ r(s,a) + γ V*(δ(s,a)) ]

Problem: A viable strategy for many problems, but …
• When do you stop learning the model and compute V*?
• It may take a long time to converge on the model.
• We would like to continuously interleave learning and acting, but repeatedly computing V* is costly.
• How can we avoid learning the model and V* explicitly?

Page 9: Learning to Maximize Reward: Reinforcement Learning

9

Eliminating the Model with Q Functions

π*(s) = argmax_a [ r(s,a) + γ V*(δ(s,a)) ]

Key idea:
• Define a function Q that encapsulates V*, δ, and r:

  Q(s,a) = r(s,a) + γ V*(δ(s,a))

• From the learned Q, we can choose an optimal action without knowing δ or r:

  π*(s) = argmax_a Q(s,a)

V* = cumulative reward of being in s.
Q = cumulative reward of being in s and taking action a.
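To make the idea concrete, here is a minimal sketch (not from the slides) of acting greedily from a learned tabular Q̂; the states, actions, and values are invented for illustration.

```python
# Minimal sketch: acting greedily from a learned tabular Q-hat.
# The states, actions, and values below are illustrative only.
Q = {
    ("s1", "right"): 72.0,
    ("s1", "down"):  81.0,
    ("s2", "right"): 100.0,
    ("s2", "left"):  81.0,
}

def greedy_action(Q, state, actions):
    """Return argmax_a Q(state, a): no model of delta or r is needed."""
    return max(actions, key=lambda a: Q.get((state, a), 0.0))

print(greedy_action(Q, "s1", ["right", "down"]))  # -> "down"
```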

Page 10: Learning to Maximize Reward: Reinforcement Learning

10

Markov Decision Processes and Reinforcement Learning

• Motivation
• Learning policies through reinforcement
  • Q values
  • Q learning
  • Multi-step backups
• Nondeterministic MDPs
• Function Approximators
• Model-based Learning
• Summary

Page 11: Learning to Maximize Reward: Reinforcement Learning

11

How Do We Learn Q?

Q(s_t,a_t) = r(s_t,a_t) + γ V*(δ(s_t,a_t))

Idea:
• Create an update rule similar to the Bellman equation.
• Perform updates on training examples <s_t, a_t, r_t+1, s_t+1>:

  Q̂(s_t,a_t) ← r_t+1 + γ V*(s_t+1)

How do we eliminate V*?
• Q and V* are closely related:

  V*(s) = max_a' Q(s,a')

• Substituting Q for V*:

  Q̂(s_t,a_t) ← r_t+1 + γ max_a' Q̂(s_t+1,a')

Called a backup

Page 12: Learning to Maximize Reward: Reinforcement Learning

12

Q-Learning for Deterministic Worlds

Let Q̂ denote the current approximation to Q.

Initially:
• For each s, a, initialize the table entry Q̂(s, a) ← 0
• Observe the initial state s_0

Do for all time t:
• Select an action a_t and execute it
• Receive immediate reward r_t+1
• Observe the new state s_t+1
• Update the table entry for Q̂(s_t, a_t) as follows:
  Q̂(s_t, a_t) ← r_t+1 + γ max_a' Q̂(s_t+1, a')
• s_t ← s_t+1
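A minimal Python sketch of this loop, assuming a hypothetical episodic environment object `env` with `reset()` and `step(s, a)` methods; that interface is an assumption of the sketch, not part of the slides.

```python
import random
from collections import defaultdict

def q_learning_deterministic(env, actions, gamma=0.9, episodes=100):
    """Tabular Q-learning for a deterministic world (the slide's loop, as a sketch).
    Assumed interface: env.reset() -> s0; env.step(s, a) -> (s_next, reward, done)."""
    Q = defaultdict(float)                      # Q-hat(s, a), initially 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = random.choice(actions)          # any exploration rule can go here
            s_next, r, done = env.step(s, a)
            # Backup: Q-hat(s_t, a_t) <- r_t+1 + gamma * max_a' Q-hat(s_t+1, a')
            Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            s = s_next
    return Q
```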

Page 13: Learning to Maximize Reward: Reinforcement Learning

13

Example – Q Learning Update

[Figure: the agent moves right from s1 to s2 and receives reward 0; γ = 0.9; the current table has Q̂(s1, a_right) = 72, and the three actions out of s2 have Q̂ values 63, 81, and 100.]

Page 14: Learning to Maximize Reward: Reinforcement Learning

14

Example – Q Learning Update

Q̂(s1, a_right) ← r(s1, a_right) + γ max_a' Q̂(s2, a')
               ← 0 + 0.9 × max{63, 81, 100}
               ← 90

Note: if rewards are non-negative:
• For all s, a, n:  Q̂_n(s, a) ≤ Q̂_n+1(s, a)
• For all s, a, n:  0 ≤ Q̂_n(s, a) ≤ Q(s, a)

[Figure: the same diagram as on the previous slide, with γ = 0.9 and reward 0 on the move from s1 to s2; after the backup, Q̂(s1, a_right) is updated from 72 to 90.]

Page 15: Learning to Maximize Reward: Reinforcement Learning

15

Q-Learning Iterations: Episodic
• Start at upper left, move clockwise; table initially 0; γ = 0.8

Q̂(s,a) ← r + γ max_a' Q̂(s',a')

[Figure: six-state grid world with states s1 s2 s3 across the top and s4, s5, s6 below, plus an absorbing goal G; the transitions into the goal are rewarded 10, all other transitions 0.]

Q̂(s1,E) | Q̂(s2,E) | Q̂(s3,S) | Q̂(s4,W)
0        |          |          |

Page 16: Learning to Maximize Reward: Reinforcement Learning

16

Q-Learning Iterations: Episodic
• Start at upper left, move clockwise; table initially 0; γ = 0.8

Q̂(s,a) ← r + γ max_a' Q̂(s',a')

[Figure: grid world as above.]

Q̂(s1,E) | Q̂(s2,E) | Q̂(s3,S) | Q̂(s4,W)
0        | 0        | 0        |

Page 17: Learning to Maximize Reward: Reinforcement Learning

17

Q-Learning Iterations
• Start at upper left, move clockwise; γ = 0.8

Q̂(s,a) ← r + γ max_a' Q̂(s',a')

Q̂(s1,E) | Q̂(s2,E) | Q̂(s3,S) | Q̂(s4,W)
0        | 0        | 0        | 10

where Q̂(s4,W) = r + γ max_a' {Q̂(s5,loop)} = 10 + 0.8 × 0 = 10

[Figure: grid world as above.]

Page 18: Learning to Maximize Reward: Reinforcement Learning

18

Q-Learning Iterations
• Start at upper left, move clockwise; γ = 0.8

Q̂(s,a) ← r + γ max_a' Q̂(s',a')

Q̂(s1,E) | Q̂(s2,E) | Q̂(s3,S) | Q̂(s4,W)
0        | 0        | 0        | 10
0        | 0        | 8        |

where
  Q̂(s4,W) = r + γ max_a' {Q̂(s5,loop)} = 10 + 0.8 × 0 = 10
  Q̂(s3,S) = r + γ max {Q̂(s4,W), Q̂(s4,N)} = 0 + 0.8 × max{10, 0} = 8

[Figure: grid world as above.]

Page 19: Learning to Maximize Reward: Reinforcement Learning

19

Q-Learning Iterations
• Start at upper left, move clockwise; γ = 0.8

Q̂(s,a) ← r + γ max_a' Q̂(s',a')

Q̂(s1,E) | Q̂(s2,E) | Q̂(s3,S) | Q̂(s4,W)
0        | 0        | 0        | 10
0        | 0        | 8        | 10
0        | 6.4      |          |

where
  Q̂(s4,W) = r + γ max_a' {Q̂(s5,loop)} = 10 + 0.8 × 0 = 10
  Q̂(s3,S) = r + γ max {Q̂(s4,W), Q̂(s4,N)} = 0 + 0.8 × max{10, 0} = 8
  Q̂(s2,E) = r + γ max {Q̂(s3,W), Q̂(s3,S)} = 0 + 0.8 × max{0, 8} = 6.4

[Figure: grid world as above.]

Page 20: Learning to Maximize Reward: Reinforcement Learning

20

Q-Learning Iterations
• Start at upper left, move clockwise; γ = 0.8

Q̂(s,a) ← r + γ max_a' Q̂(s',a')

Q̂(s1,E) | Q̂(s2,E) | Q̂(s3,S) | Q̂(s4,W)
0        | 0        | 0        | 10
0        | 0        | 8        | 10
0        | 6.4      | 8        | 10

where
  Q̂(s4,W) = r + γ max_a' {Q̂(s5,loop)} = 10 + 0.8 × 0 = 10
  Q̂(s3,S) = r + γ max {Q̂(s4,W), Q̂(s4,N)} = 0 + 0.8 × max{10, 0} = 8
  Q̂(s2,E) = r + γ max {Q̂(s3,W), Q̂(s3,S)} = 0 + 0.8 × max{0, 8} = 6.4

[Figure: grid world as above.]
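The table above can be checked with a few lines of Python. The sketch below hard-codes only what the episodes actually exercise: the clockwise path s1 → s2 → s3 → s4 → goal, with reward 10 on the final transition, 0 elsewhere, and γ = 0.8; the layout details are assumptions of the sketch.

```python
gamma = 0.8
path = ["s1", "s2", "s3", "s4", "goal"]        # the clockwise tour each episode takes
reward = {("s4", "goal"): 10}                  # only the transition into the goal pays
Q = {s: 0.0 for s in path}                     # Q-hat for the action taken at each state

for episode in range(3):
    for s, s_next in zip(path, path[1:]):
        r = reward.get((s, s_next), 0)
        # Other actions at s_next keep value 0, so max_a' Q-hat(s_next, a') == Q[s_next];
        # Q["goal"] stays 0 because the goal is absorbing.
        Q[s] = r + gamma * Q[s_next]
    print([Q[s] for s in ("s1", "s2", "s3", "s4")])
# episode 1 -> [0.0, 0.0, 0.0, 10.0]
# episode 2 -> [0.0, 0.0, 8.0, 10.0]
# episode 3 -> [0.0, 6.4, 8.0, 10.0]
```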


Page 22: Learning to Maximize Reward: Reinforcement Learning

22

Example Summary: Value Iteration and Q-Learning

[Figure: the grid world with absorbing goal G and γ = 0.9, shown four ways: the immediate rewards R(s,a) (100 on the transitions into G, 0 elsewhere), the resulting Q(s,a) values (90, 81, 72, …), the optimal values V*(s) (100, 90, 81; 0 at G), and one optimal policy.]

Page 23: Learning to Maximize Reward: Reinforcement Learning

23

Exploration vs Exploitation

How do you pick actions as you learn?

1. Greedy Action Selection:
• Always select the action that looks best:

  π(s) = argmax_a Q̂(s,a)

2. Probabilistic Action Selection:
• Likelihood of a_i grows with its current Q̂ value:

  P(a_i | s) = k^Q̂(s, a_i) / Σ_j k^Q̂(s, a_j)
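A small sketch of both selection rules; the probabilistic rule follows the formula above (larger k means more exploitation), and the Q̂ values are placeholders.

```python
import random

def greedy(Q, s, actions):
    """1. Greedy selection: always the action that currently looks best."""
    return max(actions, key=lambda a: Q[(s, a)])

def probabilistic(Q, s, actions, k=2.0):
    """2. Probabilistic selection: P(a_i|s) = k**Q(s,a_i) / sum_j k**Q(s,a_j).
    Larger k favors exploitation; k close to 1 gives near-uniform exploration."""
    weights = [k ** Q[(s, a)] for a in actions]
    return random.choices(actions, weights=weights)[0]

Q = {("s1", "N"): 0.0, ("s1", "E"): 6.4}          # illustrative values
print(greedy(Q, "s1", ["N", "E"]))                # -> "E"
print(probabilistic(Q, "s1", ["N", "E"], k=2.0))  # "E" with prob 2**6.4 / (2**0 + 2**6.4)
```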

Page 24: Learning to Maximize Reward: Reinforcement Learning

24

Markov Decision Processes and Reinforcement Learning

• Motivation
• Learning policies through reinforcement
  • Q values
  • Q learning
  • Multi-step backups
• Nondeterministic MDPs
• Function Approximators
• Model-based Learning
• Summary

Page 25: Learning to Maximize Reward: Reinforcement Learning

25

TD(λ): Temporal Difference Learning
(from lecture slides: Machine Learning, T. Mitchell, McGraw Hill, 1997)

Q learning: reduce discrepancy between successive Q estimates

One-step time difference:

  Q^(1)(s_t,a_t) = r_t + γ max_a Q̂(s_t+1, a)

Why not two steps?

  Q^(2)(s_t,a_t) = r_t + γ r_t+1 + γ² max_a Q̂(s_t+2, a)

Or n?

  Q^(n)(s_t,a_t) = r_t + γ r_t+1 + … + γ^(n-1) r_t+n-1 + γ^n max_a Q̂(s_t+n, a)

Blend all of these:

  Q^λ(s_t,a_t) = (1-λ) [ Q^(1)(s_t,a_t) + λ Q^(2)(s_t,a_t) + λ² Q^(3)(s_t,a_t) + … ]

Page 26: Learning to Maximize Reward: Reinforcement Learning

26

Eligibility Traces

Idea: Perform backups on the N previous data points, as well as on the most recent data point.
• Select data to back up based on frequency of visitation.
• Bias towards recently and frequently visited data by a geometric decay (γλ)^(i-j).

[Figure: visits to data point <s,a> plotted over time t, and the corresponding accumulating trace, which jumps on each visit and decays geometrically between visits.]
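One common way to implement this idea is an accumulating trace per <s,a> pair that is bumped on each visit and decays by γλ each step; the sketch below shows a Sarsa(λ)-style backup built that way (an illustration, not the exact algorithm on the slide).

```python
from collections import defaultdict

def td_lambda_step(Q, e, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9, lam=0.7):
    """One Sarsa(lambda)-style backup with accumulating traces (a sketch).
    Q and e are dicts keyed by (state, action); e decays by gamma*lambda per step."""
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]   # one-step TD error
    e[(s, a)] += 1.0                                      # accumulate trace on the visited pair
    for key in list(e):
        Q[key] += alpha * delta * e[key]                  # credit all recently visited pairs
        e[key] *= gamma * lam                             # geometric decay of the trace
    return Q, e

Q = defaultdict(float)
e = defaultdict(float)
```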

Page 27: Learning to Maximize Reward: Reinforcement Learning

27

Markov Decision Processes and Reinforcement Learning

• Motivation
• Learning policies through reinforcement
• Nondeterministic MDPs:
  • Value Iteration
  • Q Learning
• Function Approximators
• Model-based Learning
• Summary

Page 28: Learning to Maximize Reward: Reinforcement Learning

28

Nondeterministic MDPs

State transitions become probabilistic: δ(s,a,s')

Example:
[Figure: four states (S1 Unemployed, S2 Industry, S3 Grad School, S4 Academia), each with two actions, R (Research path) and D (Development path); the arcs are labeled with transition probabilities of 0.1, 0.9, and 1.0.]

Page 29: Learning to Maximize Reward: Reinforcement Learning

29

Nondeterministic Case
• How do we redefine cumulative reward to handle nondeterminism?
• Define V and Q based on expected values:

  V(s_t) = E[ r_t + γ r_t+1 + γ² r_t+2 + … ]

  V(s_t) = E[ Σ_i γ^i r_t+i ]

  Q(s_t,a_t) = E[ r(s_t,a_t) + γ V*(δ(s_t,a_t)) ]

Page 30: Learning to Maximize Reward: Reinforcement Learning

30

Value Iteration for Non-deterministic MDPs

V_0(s) := 0 for all s
t := 0
loop
  t := t + 1
  loop for all s in S
    loop for all a in A
      Q_t(s,a) := r(s,a) + γ Σ_{s' in S} δ(s,a,s') V_t-1(s')
    end loop
    V_t(s) := max_a [ Q_t(s,a) ]
  end loop
until | V_t(s) - V_t-1(s) | < ε for all s in S
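A direct Python transcription of this loop, assuming the model is given as dictionaries `r[(s, a)]` and `delta[(s, a)]` mapping each s' to its probability; these names are just for the sketch.

```python
def value_iteration(S, A, r, delta, gamma=0.9, eps=1e-6):
    """Value iteration for a nondeterministic MDP (a sketch of the slide's loop).
    r[(s, a)] is the expected immediate reward; delta[(s, a)] maps s' -> probability."""
    V = {s: 0.0 for s in S}
    while True:
        V_new = {}
        for s in S:
            # Q_t(s,a) = r(s,a) + gamma * sum_s' delta(s,a,s') * V(s')
            q_values = [r[(s, a)] + gamma * sum(p * V[s2] for s2, p in delta[(s, a)].items())
                        for a in A]
            V_new[s] = max(q_values)
        if all(abs(V_new[s] - V[s]) < eps for s in S):   # terminate when "close enough"
            return V_new
        V = V_new
```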

Page 31: Learning to Maximize Reward: Reinforcement Learning

31

Q Learning for Nondeterministic MDPs

Q*(s,a) = r(s,a) + γ Σ_{s' in S} δ(s,a,s') max_a' [ Q*(s',a') ]

• Alter the training rule for the nondeterministic case:

  Q̂_n(s_t, a_t) ← (1 - α_n) Q̂_n-1(s_t,a_t) + α_n [ r_t+1 + γ max_a' Q̂_n-1(s_t+1,a') ]

  where α_n = 1 / (1 + visits_n(s,a))

Can still prove convergence of Q̂ to Q [Watkins and Dayan, 1992]
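A sketch of this training rule in Python, with the visit-count based learning rate α_n = 1/(1 + visits_n(s,a)); how the <s, a, r, s'> tuples are obtained is left to the surrounding agent loop.

```python
from collections import defaultdict

Q = defaultdict(float)        # Q-hat(s, a)
visits = defaultdict(int)     # visits(s, a)

def q_update_nondeterministic(s, a, r, s_next, actions, gamma=0.9):
    """Q-learning backup for nondeterministic MDPs (the slide's rule, as a sketch)."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1.0 + visits[(s, a)])                        # decaying learning rate
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```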

Page 32: Learning to Maximize Reward: Reinforcement Learning

32

Markov Decision Processes and Reinforcement Learning

• Motivation
• Learning policies through reinforcement
• Nondeterministic MDPs
• Function Approximators
• Model-based Learning
• Summary

Page 33: Learning to Maximize Reward: Reinforcement Learning

33

Function Approximation

Function Approximators:
• Backprop Neural Network
• Radial Basis Function Network
• CMAC Network
• Nearest Neighbor, Memory-based
• Decision Tree

[Figure: state s and action a feed a function approximator that outputs Q(s,a); the approximator is trained by gradient-descent methods from targets or errors.]

Page 34: Learning to Maximize Reward: Reinforcement Learning

34

Function Approximation Example: Adjusting Network Weights

Function Approximator:
• Q(s,a) = f(s,a,w)

Update: gradient-descent Sarsa:
• w ← w + α [ r_t+1 + γ Q(s_t+1,a_t+1) - Q(s_t,a_t) ] ∇_w f(s_t,a_t,w)

  (the bracketed term is the target value minus the estimated value; ∇_w f is the standard backprop gradient with respect to the weight vector w)

[Figure: the same approximator block diagram as on the previous slide.]
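A minimal sketch of this update for the special case of a linear approximator f(s,a,w) = w · φ(s,a), where ∇_w f is simply the feature vector φ(s,a); the feature function `phi` is a stand-in for whatever network is actually used.

```python
import numpy as np

def gradient_sarsa_update(w, phi, s, a, r, s_next, a_next, alpha=0.01, gamma=0.9):
    """One gradient-descent Sarsa step for a linear approximator Q(s,a) = w . phi(s,a).
    For a linear f, grad_w f(s,a,w) is just the feature vector phi(s,a)."""
    q_sa = w @ phi(s, a)
    q_next = w @ phi(s_next, a_next)
    td_error = r + gamma * q_next - q_sa          # target value minus estimated value
    return w + alpha * td_error * phi(s, a)       # w <- w + alpha * error * grad_w f

# Usage sketch: phi maps (state, action) to a fixed-length numpy feature vector, e.g.
# w = np.zeros(num_features); w = gradient_sarsa_update(w, phi, s, a, r, s2, a2)
```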

Page 35: Learning to Maximize Reward: Reinforcement Learning

35

Example: TD-Gammon [Tesauro, 1995]

Learns to play Backgammon

Situations:
• Board configurations (~10^20)

Actions:
• Moves

Rewards:
• +100 if win
• -100 if lose
• 0 for all other states

• Trained by playing 1.5 million games against self.

Currently, roughly equal to best human player.

Page 36: Learning to Maximize Reward: Reinforcement Learning

36

Example: TD-Gammon [Tesauro, 1995]

[Figure: TD-Gammon network. Input: raw board position (# of pieces at each position); a layer of hidden units (0-160) with random initial weights; output V(s) = the predicted probability of winning. On a win the outcome is 1, on a loss 0, and the weights are trained from the TD error V(s_t+1) - V(s_t).]

Page 37: Learning to Maximize Reward: Reinforcement Learning

37

Markov Decision Processes and Reinforcement Learning

• Motivation
• Learning policies through reinforcement
• Nondeterministic MDPs
• Function Approximators
• Model-based Learning
• Summary

Page 38: Learning to Maximize Reward: Reinforcement Learning

38

Model-based Learning: Certainty-Equivalence Method

For every step:
1. Use new experience to update the model parameters:
   • Transitions
   • Rewards
2. Solve the model for V and π:
   • Value iteration.
   • Policy iteration.

3. Use the policy to choose the next action.

Page 39: Learning to Maximize Reward: Reinforcement Learning

39

Learning the Model

For each state-action pair <s,a> visited, accumulate:

1. Mean Transition:

   T(s, a, s') = number-times-seen(s, a → s') / number-times-tried(s, a)

2. Mean Reward:

   R(s, a) = average of the rewards received on the visits to <s, a>
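A sketch of the bookkeeping, maintained from experience tuples <s, a, r, s'>; the helper names are illustrative.

```python
from collections import defaultdict

n_tried = defaultdict(int)      # number-times-tried(s, a)
n_seen = defaultdict(int)       # number-times-seen(s, a -> s')
r_sum = defaultdict(float)      # running sum of rewards for (s, a)

def record(s, a, r, s_next):
    """Accumulate statistics for the certainty-equivalence model."""
    n_tried[(s, a)] += 1
    n_seen[(s, a, s_next)] += 1
    r_sum[(s, a)] += r

def T(s, a, s_next):
    """Estimated transition probability."""
    return n_seen[(s, a, s_next)] / n_tried[(s, a)] if n_tried[(s, a)] else 0.0

def R(s, a):
    """Estimated mean reward."""
    return r_sum[(s, a)] / n_tried[(s, a)] if n_tried[(s, a)] else 0.0
```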

Page 40: Learning to Maximize Reward: Reinforcement Learning

40

Comparison of Model-based and Model-free Methods

Temporal Differencing / Q Learning: only does computation for the states the system is actually in.
• Good real-time performance
• Inefficient use of data

Model-based methods: compute the best estimates for every state on every time step.
• Efficient use of data
• Terrible real-time performance

What is a middle ground?

Page 41: Learning to Maximize Reward: Reinforcement Learning

41

Dyna: A Middle Ground [Sutton, Intro to RL, 97]

At each step, incrementally:
1. Update the model based on new data
2. Update the policy based on new data
3. Update the policy based on the updated model

Performance, until optimal, on Grid World:
• Q-Learning: 531,000 steps, 531,000 backups
• Dyna: 61,908 steps, 3,055,000 backups

Page 42: Learning to Maximize Reward: Reinforcement Learning

42

Dyna Algorithm

Given state s:
1. Choose action a using the estimated policy.
2. Observe new state s' and reward r.
3. Update T and R of the model.
4. Update V at <s, a>:

   V(s) ← max_a [ r(s,a) + γ Σ_{s'} T(s,a,s') V(s') ]

5. Perform k additional updates:
   a) Pick k random states s_j in {s_1, s_2, …, s_k}
   b) Update each V(s_j):

      V(s_j) ← max_a [ r(s_j,a) + γ Σ_{s'} T(s_j,a,s') V(s') ]
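A sketch of one Dyna step, reusing the model helpers `T`, `R`, and `record` from the previous sketch and performing k extra simulated backups on randomly chosen states; `S` is assumed to be a list of all states.

```python
import random

def dyna_step(V, s, a, r, s_next, S, A, T, R, record, gamma=0.9, k=10):
    """One Dyna iteration (a sketch): real experience updates the model, then the
    value function gets one backup at s plus k extra backups on random states."""
    def backup(state):
        return max(R(state, act) + gamma * sum(T(state, act, s2) * V[s2] for s2 in S)
                   for act in A)

    record(s, a, r, s_next)             # steps 2-3: fold the new experience into the model
    V[s] = backup(s)                    # step 4: full backup at the state just visited
    for s_j in random.choices(S, k=k):  # step 5a: pick k random states (with replacement)
        V[s_j] = backup(s_j)            # step 5b: back each one up against the model
    return V
```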

Page 43: Learning to Maximize Reward: Reinforcement Learning

43

Markov Decision Processes and Reinforcement Learning

• Motivation
• Learning policies through reinforcement
• Nondeterministic MDPs
• Function Approximators
• Model-based Learning
• Summary

Page 44: Learning to Maximize Reward: Reinforcement Learning

44

Ongoing Research• Handling cases where state is only partially observable• Design of optimal exploration strategies• Extend to continuous action, state• Learn and use S x A S• Scaling up in the size of the state space

• Function approximators (neural net instead of table)• Generalization• Macros• Exploiting substructure

• Multiple learners – Multi-agent reinforcement learning

Page 45: Learning to Maximize Reward: Reinforcement Learning

45

Markov Decision Processes (MDPs)

Model:
• Finite set of states, S
• Finite set of actions, A
• Probabilistic state transitions, δ(s,a)
• Reward for each state and action, R(s,a)

Process:
• Observe state s_t in S
• Choose action a_t in A
• Receive immediate reward r_t
• State changes to s_t+1

Deterministic Example:

[Figure: the six-state grid world used earlier, with goal G and reward 10 on the transitions into G.]
• Legal transitions shown
• Reward on unlabeled transitions is 0.

[Figure: the resulting experience trajectory s0, r0, a0, s1, r1, a1, s2, r2, a2, s3, …]

Page 46: Learning to Maximize Reward: Reinforcement Learning

46

Crib Sheet: MDPs by Value Iteration

Insight: Can calculate optimal values iteratively using Dynamic Programming.

Algorithm:
• Iteratively calculate values using Bellman's equation:

  V*_t+1(s) ← max_a [ r(s,a) + γ V*_t(δ(s,a)) ]

• Terminate when values are "close enough":

  | V*_t+1(s) - V*_t(s) | < ε

• Agent selects the optimal action by one-step lookahead on V*:

  π*(s) = argmax_a [ r(s,a) + γ V*(δ(s,a)) ]

Page 47: Learning to Maximize Reward: Reinforcement Learning

47

Crib Sheet: Q-Learning for Deterministic Worlds

Let Q̂ denote the current approximation to Q.

Initially:
• For each s, a, initialize the table entry Q̂(s, a) ← 0
• Observe the current state s

Do forever:
• Select an action a and execute it
• Receive immediate reward r
• Observe the new state s'
• Update the table entry for Q̂(s, a) as follows:
  Q̂(s, a) ← r + γ max_a' Q̂(s', a')
• s ← s'