1
Learning to Maximize Reward: Reinforcement Learning
Brian C. Williams
16.412J/6.834J
October 28th, 2002
Slides adapted from: Manuela Veloso, Reid Simmons, & Tom Mitchell, CMU
2
Reading
• Today: Reinforcement Learning
  • Read 2nd ed. AIMA Chapter 19, or 1st ed. AIMA Chapter 20.
  • Read "Reinforcement Learning: A Survey" by L. Kaelbling, M. Littman and A. Moore, Journal of Artificial Intelligence Research 4 (1996) 237-285.
• For Markov Decision Processes
  • Read 1st/2nd ed. AIMA Chapter 17, Sections 1–4.
• Optional reading: "Planning and Acting in Partially Observable Stochastic Domains" by L. Kaelbling, M. Littman and A. Cassandra, Artificial Intelligence 101 (1998).
3
Markov Decision Processes and Reinforcement Learning
• Motivation
• Learning policies through reinforcement
  • Q values
  • Q learning
  • Multi-step backups
• Nondeterministic MDPs
• Function Approximators
• Model-based Learning
• Summary
4
Example: TD-Gammon [Tesauro, 1995]
Learns to play Backgammon
Situations:
• Board configurations (~10^20)
Actions:
• Moves
Rewards:
• +100 if win
• −100 if lose
• 0 for all other states
• Trained by playing 1.5 million games against self.
Currently, roughly equal to best human player.
5
Reinforcement Learning Problem
Given: Repeatedly…
• Executed action
• Observed state
• Observed reward
Learn action policy π : S → A that maximizes life reward
  r0 + γ r1 + γ² r2 + …
from any start state.
• Discount: 0 ≤ γ < 1
Note:
• Unsupervised learning
• Delayed reward
(Agent–environment loop figure: at each step the agent observes state st and reward rt from the environment and chooses action at, producing the trace s0, r0 →a0→ s1, r1 →a1→ s2, r2 →a2→ s3 …)
Goal: Learn to choose actions that maximize life reward
  r0 + γ r1 + γ² r2 + …
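For concreteness, the life reward above can be computed for any finite reward sequence as in this small sketch (the helper name and the example rewards are illustrative, not from the slides):

```python
def discounted_return(rewards, gamma=0.9):
    """Life reward r0 + gamma*r1 + gamma^2*r2 + ... over a finite horizon."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# e.g. three steps with rewards 0, 0, 100 and gamma = 0.9 give 0.9^2 * 100 = 81.0
print(discounted_return([0, 0, 100]))
```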
6
How About Learning the Policy Directly?
1. π* : S → A
2. Fill out table entries for π* by collecting statistics on training pairs <s, a>.
3. Where does a* come from?
7
How About Learning the Value Function?
1. Have the agent learn the value function for the optimal policy, denoted V*.
2. Given the learned V*, the agent selects the optimal action by one-step lookahead:
  π*(s) = argmaxa [r(s,a) + γ V*(δ(s,a))]
Problem:
• Works well if the agent knows the environment model:
  • δ : S × A → S
  • r : S × A → ℝ
• With no model, the agent can't choose an action from V*.
• With a model, V* could be computed via value iteration, so why learn it?
8
How About Learning the Model as Well?
1. Have the agent learn δ and r by statistics on training instances <st, at, rt+1, st+1>.
2. Compute V* by value iteration:
  Vt+1(s) ← maxa [r(s,a) + γ Vt(δ(s,a))]
3. The agent selects the optimal action by one-step lookahead:
  π*(s) = argmaxa [r(s,a) + γ V*(δ(s,a))]
Problem: A viable strategy for many problems, but …
• When do you stop learning the model and compute V*?
• May take a long time to converge on a model.
• Would like to continuously interleave learning and acting, but repeatedly computing V* is costly.
• How can we avoid learning the model and V* explicitly?
9
Eliminating the Model with Q Functions
  π*(s) = argmaxa [r(s,a) + γ V*(δ(s,a))]
Key idea:
• Define a function that encapsulates V*, δ, and r:
  Q(s,a) = r(s,a) + γ V*(δ(s,a))
• From the learned Q, an optimal action can be chosen without knowing δ or r:
  π*(s) = argmaxa Q(s,a)
V*(s) = cumulative reward of being in s.
Q(s,a) = cumulative reward of being in s and taking action a.
10
Markov Decision Processes and Reinforcement Learning
• Motivation• Learning policies through reinforcement
• Q values• Q learning• Multi-step backups
• Nondeterministic MDPs• Function Approximators• Model-based Learning• Summary
11
How Do We Learn Q?
  Q(st,at) = r(st,at) + γ V*(δ(st,at))
Idea:
• Create an update rule similar to the Bellman equation.
• Perform updates on training examples <st, at, rt+1, st+1>:
  Q(st,at) ← rt+1 + γ V*(st+1)
How do we eliminate V*?
• Q and V* are closely related:
  V*(s) = maxa' Q(s,a')
• Substituting Q for V*:
  Q(st,at) ← rt+1 + γ maxa' Q(st+1,a')
Called a backup
12
Q-Learning for Deterministic Worlds
Let Q̂ denote the current approximation to Q.
Initially:
• For each s, a initialize the table entry Q̂(s, a) ← 0
• Observe the initial state s0
Do for all time t:
• Select an action at and execute it
• Receive immediate reward rt+1
• Observe the new state st+1
• Update the table entry for Q̂(st, at) as follows:
  Q̂(st, at) ← rt+1 + γ maxa' Q̂(st+1, a')
• st ← st+1
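A minimal Python sketch of the tabular algorithm above. The environment interface step(s, a) -> (reward, next_state) and the actions(s) helper are assumed names for illustration, and exploration is left as a uniform random choice:

```python
import random
from collections import defaultdict

def q_learning_deterministic(step, actions, s0, gamma=0.9, episodes=1000, horizon=100):
    """Tabular Q-learning for a deterministic world.

    step(s, a) -> (reward, next_state) and actions(s) -> list of actions
    are assumed environment interfaces.
    """
    Q = defaultdict(float)                   # Q table, every entry initialized to 0
    for _ in range(episodes):
        s = s0                               # observe the initial state
        for _ in range(horizon):
            a = random.choice(actions(s))    # select an action a_t and execute it
            r, s_next = step(s, a)           # receive reward r_{t+1}, observe s_{t+1}
            # Backup: Q(s_t, a_t) <- r_{t+1} + gamma * max_a' Q(s_{t+1}, a')
            Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions(s_next))
            s = s_next                       # s_t <- s_{t+1}
    return Q
```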
13
Example – Q-Learning Update
(Grid-world figure, γ = 0.9: the agent moves right from s1 into s2 and receives 0 immediate reward; the Q values of the actions available from s2 are 63, 81, and 100; the arrow from s1 to s2 is currently labeled 72.)
14
Example – Q Learning Update
Q(s1,aright) ← r(s1,aright) + γ maxa' Q(s2,a')
            = 0 + 0.9 × max{63, 81, 100}
            = 90
Note: if rewards are non-negative:
• For all s, a, n: Qn(s, a) ≤ Qn+1(s, a)
• For all s, a, n: 0 ≤ Qn(s, a) ≤ Q(s, a)
(Same figure, γ = 0.9: after the backup the arrow from s1 to s2 is relabeled from 72 to 90; the Q values out of s2 remain 63, 81, and 100; 0 reward was received on the move.)
15
Q-Learning Iterations: Episodic
• Start at upper left – move clockwise; table initially 0; γ = 0.8
  Q(s, a) ← r + γ maxa' Q(s',a')
(Grid-world figure: six states in a 2×3 grid with s1, s2, s3 across the top; the absorbing goal G is in the bottom row; transitions entering G receive reward 10, all others 0.)
Q(s1,E)   Q(s2,E)   Q(s3,S)   Q(s4,W)
0         –         –         –
16
Q-Learning Iterations: Episodic
• Start at upper left – move clockwise; table initially 0; γ = 0.8
  Q(s, a) ← r + γ maxa' Q(s',a')
(Same grid-world figure.)
Q(s1,E)   Q(s2,E)   Q(s3,S)   Q(s4,W)
0         0         0         –
17
Q-Learning Iterations
• Start at upper left – move clockwise; γ = 0.8
  Q(s, a) ← r + γ maxa' Q(s',a')
(Same grid-world figure.)
Q(s1,E)   Q(s2,E)   Q(s3,S)   Q(s4,W)
0         0         0         r + γ maxa' {Q(s5,loop)} = 10 + 0.8 × 0 = 10
18
Q-Learning Iterations
• Start at upper left – move clockwise; γ = 0.8
  Q(s, a) ← r + γ maxa' Q(s',a')
(Same grid-world figure.)
Q(s1,E)   Q(s2,E)   Q(s3,S)   Q(s4,W)
0         0         0         10
0         0         r + γ maxa' {Q(s4,W), Q(s4,N)} = 0 + 0.8 × max{10, 0} = 8
19
Q-Learning Iterations
• Start at upper left – move clockwise; γ = 0.8
  Q(s, a) ← r + γ maxa' Q(s',a')
(Same grid-world figure.)
Q(s1,E)   Q(s2,E)   Q(s3,S)   Q(s4,W)
0         0         0         10
0         0         8         10
0         r + γ maxa' {Q(s3,W), Q(s3,S)} = 0 + 0.8 × max{0, 8} = 6.4
20
Q-Learning Iterations
• Start at upper left – move clockwise; γ = 0.8
  Q(s, a) ← r + γ maxa' Q(s',a')
(Same grid-world figure.)
Q(s1,E)   Q(s2,E)   Q(s3,S)   Q(s4,W)
0         0         0         10
0         0         8         10
0         6.4       8         10
21
Q-Learning Iterations
• Start at upper left – move clockwise; γ = 0.8
  Q(s, a) ← r + γ maxa' Q(s',a')
(Same grid-world figure.)
Q(s1,E)   Q(s2,E)   Q(s3,S)   Q(s4,W)
0         0         0         10
0         0         8         10
0         6.4       8         10
22
Example Summary: Value Iteration and Q-Learning
(Figure: the same grid world shown four ways – the immediate rewards R(s,a) (100 on transitions into the goal G, 0 elsewhere), the resulting Q(s,a) values (100, 90, 81, 72, …), the V*(s) values (100, 90, 81, …), and one optimal policy, with all arrows leading toward G.)
23
Exploration vs Exploitation
How do you pick actions as you learn?
1. Greedy action selection:
• Always select the action that looks best:
  π(s) = argmaxa Q̂(s,a)
2. Probabilistic action selection:
• Likelihood of a is proportional to the current Q̂ value:
  P(ai|s) = k^Q̂(s,ai) / Σj k^Q̂(s,aj)
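A small sketch of the probabilistic rule, with k > 1 acting as the exploitation constant (larger k concentrates probability on high-valued actions); the function and table names are illustrative:

```python
import random

def probabilistic_action(Q, s, actions, k=2.0):
    """Select action a_i with probability k**Q(s,a_i) / sum_j k**Q(s,a_j)."""
    weights = [k ** Q.get((s, a), 0.0) for a in actions]
    return random.choices(actions, weights=weights)[0]
```

Greedy selection, by contrast, is simply max(actions, key=lambda a: Q.get((s, a), 0.0)).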
24
Markov Decision Processes and Reinforcement Learning
• Motivation
• Learning policies through reinforcement
  • Q values
  • Q learning
  • Multi-step backups
• Nondeterministic MDPs
• Function Approximators
• Model-based Learning
• Summary
25
TD(λ): Temporal Difference Learning
(from lecture slides: Machine Learning, T. Mitchell, McGraw Hill, 1997)
Q learning: reduce discrepancy between successive Q estimates
One-step time difference:
  Q(1)(st,at) = rt + γ maxa Q(st+1,a)
Why not two steps?
  Q(2)(st,at) = rt + γ rt+1 + γ² maxa Q(st+2,a)
Or n?
  Q(n)(st,at) = rt + γ rt+1 + … + γ^(n−1) rt+n−1 + γ^n maxa Q(st+n,a)
Blend all of these:
  Qλ(st,at) = (1−λ) [Q(1)(st,at) + λ Q(2)(st,at) + λ² Q(3)(st,at) + …]
26
Eligibility Traces
Idea: Perform backups on the N previous data points, as well as the most recent data point.
• Select data to back up based on frequency of visitation.
• Bias towards frequent data by a geometric decay λ^(i−j).
(Figure: visits to the data point <s,a> plotted over time t, and the corresponding accumulating trace over time t.)
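One common way to make eligibility traces concrete is a tabular SARSA(λ)-style backup with an accumulating trace e(s,a). The sketch below is an illustration under that assumption (Q and e are defaultdict(float) tables keyed by (state, action)), not necessarily the exact variant on the slide:

```python
from collections import defaultdict

def sarsa_lambda_update(Q, e, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9, lam=0.8):
    """One backup with accumulating eligibility traces."""
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]   # one-step TD error
    e[(s, a)] += 1.0                                      # accumulate the trace for the visited pair
    for key in list(e.keys()):
        Q[key] += alpha * delta * e[key]                  # back up in proportion to eligibility
        e[key] *= gamma * lam                             # geometric decay of every trace

# usage: Q = defaultdict(float); e = defaultdict(float) before the first call
```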
27
Markov Decision Processes and Reinforcement Learning
• Motivation
• Learning policies through reinforcement
• Nondeterministic MDPs:
  • Value Iteration
  • Q Learning
• Function Approximators
• Model-based Learning
• Summary
28
Nondeterministic MDPs
State transitions become probabilistic: δ(s, a, s')
(Example figure: four states – s1 Unemployed, s2 Industry, s3 Grad School, s4 Academia – each with actions R (research path) and D (development path); each action's outgoing transitions carry probabilities such as 0.9/0.1 or 1.0.)
29
Nondeterministic Case
• How do we redefine cumulative reward to handle nondeterminism?
• Define V and Q based on expected values:
  V(st) = E[rt + γ rt+1 + γ² rt+2 + …]
  V(st) = E[ Σi γ^i rt+i ]
  Q(st,at) = E[ r(st,at) + γ V*(δ(st,at)) ]
30
Value Iteration for Non-deterministic MDPs
V1(s) := 0 for all s
t := 1
loop
  t := t + 1
  loop for all s in S
    loop for all a in A
      Qt(s, a) := r(s, a) + γ Σs'∈S δ(s, a, s') Vt−1(s')
    end loop
    Vt(s) := maxa [Qt(s, a)]
  end loop
until |Vt(s) − Vt−1(s)| < ε for all s in S
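A compact Python sketch of this value-iteration loop; R[(s, a)] and T[(s, a, s')] are assumed dictionary encodings of the reward and transition model:

```python
def value_iteration(S, A, R, T, gamma=0.9, epsilon=1e-4):
    """Value iteration for a nondeterministic MDP with tabular R and T."""
    V = {s: 0.0 for s in S}
    while True:
        # Q_t(s, a) := r(s, a) + gamma * sum_s' delta(s, a, s') * V_{t-1}(s')
        Q = {(s, a): R.get((s, a), 0.0)
                     + gamma * sum(T.get((s, a, s2), 0.0) * V[s2] for s2 in S)
             for s in S for a in A}
        V_new = {s: max(Q[(s, a)] for a in A) for s in S}    # V_t(s) := max_a Q_t(s, a)
        if max(abs(V_new[s] - V[s]) for s in S) < epsilon:   # terminate when values settle
            return V_new, Q
        V = V_new
```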
31
Q Learning for Nondeterministic MDPs
Q*(s, a) = r(s, a) + γ Σs'∈S δ(s, a, s') maxa' [Q*(s', a')]
• Alter the training rule for the nondeterministic case:
  Qn(st, at) ← (1 − αn) Qn−1(st, at) + αn [rt+1 + γ maxa' Qn−1(st+1, a')]
  where αn = 1 / (1 + visitsn(s, a))
• Can still prove convergence of Q [Watkins and Dayan, 92]
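A sketch of this training rule with the decaying learning rate alpha_n; the table and counter names are illustrative:

```python
from collections import defaultdict

Q = defaultdict(float)      # current estimate of Q
visits = defaultdict(int)   # visit count per (s, a) pair

def nondeterministic_q_update(s, a, r, s_next, actions, gamma=0.9):
    """Q_n(s,a) <- (1 - alpha_n) Q_{n-1}(s,a) + alpha_n [r + gamma * max_a' Q_{n-1}(s',a')]."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1.0 + visits[(s, a)])                    # alpha_n = 1 / (1 + visits_n(s, a))
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1.0 - alpha) * Q[(s, a)] + alpha * target
```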
32
Markov Decision Processes and Reinforcement Learning
• Motivation
• Learning policies through reinforcement
• Nondeterministic MDPs
• Function Approximators
• Model-based Learning
• Summary
33
Function Approximation
Function Approximators:
• Backprop Neural Network
• Radial Basis Function Network
• CMAC Network
• Nearest Neighbor, Memory-based
• Decision Tree
(Diagram: state s and action a feed into the function approximator, which outputs Q(s,a); the approximator is trained by gradient-descent methods from targets or errors.)
34
Function Approximation Example: Adjusting Network Weights
Function approximator:
• Q(s,a) = f(s,a,w)
Update: gradient-descent Sarsa:
• w ← w + α [rt+1 + γ Q(st+1,at+1) − Q(st,at)] ∇w f(st,at,w)
(Diagram: as above, with w the weight vector, Q(st,at) the estimated value, rt+1 + γ Q(st+1,at+1) the target value, and ∇w f the standard backprop gradient.)
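A sketch of this gradient-descent Sarsa update for the special case of a linear approximator Q(s,a) = w · φ(s,a); the feature map phi(s, a) is an assumed helper returning a numpy vector, and for a linear model the gradient ∇w f is simply φ(s,a):

```python
import numpy as np

def sarsa_weight_update(w, phi, s, a, r, s_next, a_next, alpha=0.01, gamma=0.9):
    """One gradient-descent Sarsa step on the weight vector w, for Q(s,a) = w . phi(s,a)."""
    estimated = w @ phi(s, a)                        # Q(s_t, a_t)
    target = r + gamma * (w @ phi(s_next, a_next))   # r_{t+1} + gamma * Q(s_{t+1}, a_{t+1})
    return w + alpha * (target - estimated) * phi(s, a)   # gradient of a linear model is phi(s, a)
```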
35
Example: TD-Gammon [Tesauro, 1995]
Learns to play Backgammon
Situations:
• Board configurations (~10^20)
Actions:
• Moves
Rewards:
• +100 if win
• −100 if lose
• 0 for all other states
• Trained by playing 1.5 million games against self.
Currently, roughly equal to best human player.
36
Example: TD-Gammon [Tesauro, 1995]
(Network diagram: the raw board position (number of pieces at each location) feeds into 0–160 hidden units with random initial weights; the output V(s) is the predicted probability of winning; the outcome is 1 on a win and 0 on a loss; the training signal is the TD error V(st+1) − V(st).)
37
Markov Decision Processes and Reinforcement Learning
• Motivation
• Learning policies through reinforcement
• Nondeterministic MDPs
• Function Approximators
• Model-based Learning
• Summary
38
Model-based Learning: the Certainty-Equivalence Method
For every step:
1. Use the new experience to update the model parameters:
  • Transitions
  • Rewards
2. Solve the model for V and π:
  • Value iteration
  • Policy iteration
3. Use the policy π to choose the next action.
39
Learning the Model
For each state-action pair <s, a> visited, accumulate:
1. Mean transition:
  T(s, a, s') = number-times-seen(s, a → s') / number-times-tried(s, a)
2. Mean reward:
  R(s, a) = (sum of rewards received on trying a in s) / number-times-tried(s, a)
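A direct transcription of these statistics in Python; the dictionary names are illustrative:

```python
from collections import defaultdict

transition_counts = defaultdict(int)   # (s, a, s') -> number-times-seen(s, a -> s')
tried_counts = defaultdict(int)        # (s, a)     -> number-times-tried(s, a)
reward_sums = defaultdict(float)       # (s, a)     -> accumulated reward

def record(s, a, r, s_next):
    """Accumulate statistics for one observed transition <s, a, r, s'>."""
    tried_counts[(s, a)] += 1
    transition_counts[(s, a, s_next)] += 1
    reward_sums[(s, a)] += r

def T(s, a, s_next):
    """Estimated mean transition probability."""
    n = tried_counts[(s, a)]
    return transition_counts[(s, a, s_next)] / n if n else 0.0

def R(s, a):
    """Estimated mean reward."""
    n = tried_counts[(s, a)]
    return reward_sums[(s, a)] / n if n else 0.0
```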
40
Comparison of Model-based and Model-free Methods
Temporal Differencing / Q-Learning: only does computation for the states the system is actually in.
• Good real-time performance
• Inefficient use of data
Model-based methods: compute the best estimates for every state on every time step.
• Efficient use of data
• Terrible real-time performance
What is a middle ground?
41
Dyna: A Middle Ground [Sutton, Intro to RL, 97]
At each step, incrementally:
1. Update the model based on new data
2. Update the policy based on new data
3. Update the policy based on the updated model
Performance, until optimal, on a grid world:
• Q-Learning: 531,000 steps, 531,000 backups
• Dyna: 61,908 steps, 3,055,000 backups
42
Dyna Algorithm
Given state s:
1. Choose action a using the estimated policy.
2. Observe the new state s' and reward r.
3. Update T and R of the model.
4. Update V at <s, a>:
   V(s) ← maxa [r(s,a) + γ Σs' T(s,a,s') V(s')]
5. Perform k additional updates:
   a) Pick k random states s1, s2, …, sk
   b) Update each V(sj):
      V(sj) ← maxa [r(sj,a) + γ Σs' T(sj,a,s') V(s')]
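A hedged sketch of one Dyna step under the same assumptions as the model-learning sketch above: step(s, a) -> (r, s') is the environment, record, T, and R are the learned-model routines, and states is a list so random states can be sampled. All names are illustrative.

```python
import random

def dyna_step(V, s, actions, states, step, record, T, R, k=10, gamma=0.9):
    """One Dyna step: act greedily, update the model, then do 1 + k value backups."""
    def backup(state):
        # V(state) <- max_a [ R(state, a) + gamma * sum_s' T(state, a, s') V(s') ]
        V[state] = max(R(state, a2) + gamma * sum(T(state, a2, s2) * V[s2] for s2 in states)
                       for a2 in actions)

    # 1. Choose an action using the estimated policy (greedy w.r.t. the current model and V)
    a = max(actions, key=lambda a2: R(s, a2) + gamma * sum(T(s, a2, s2) * V[s2] for s2 in states))
    r, s_next = step(s, a)           # 2. observe the new state and reward
    record(s, a, r, s_next)          # 3. update T and R of the model
    backup(s)                        # 4. update V at the visited state
    for sj in random.sample(states, min(k, len(states))):
        backup(sj)                   # 5. k additional backups at randomly picked states
    return s_next
```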
43
Markov Decision Processes and Reinforcement Learning
• Motivation
• Learning policies through reinforcement
• Nondeterministic MDPs
• Function Approximators
• Model-based Learning
• Summary
44
Ongoing Research
• Handling cases where the state is only partially observable
• Design of optimal exploration strategies
• Extending to continuous actions and states
• Learning and using δ : S × A → S
• Scaling up in the size of the state space
  • Function approximators (neural net instead of table)
  • Generalization
  • Macros
  • Exploiting substructure
• Multiple learners – Multi-agent reinforcement learning
45
Markov Decision Processes (MDPs)
Model:
• Finite set of states, S
• Finite set of actions, A
• Probabilistic state transitions, δ(s,a)
• Reward for each state and action, R(s,a)
Process:
• Observe state st in S
• Choose action at in A
• Receive immediate reward rt
• State changes to st+1
Deterministic Example:
(Grid-world figure: legal transitions shown; transitions into the goal G have reward 10; reward on unlabeled transitions is 0.)
(Interaction trace: s0, r0 →a0→ s1, r1 →a1→ s2, r2 →a2→ s3 …)
46
Crib Sheet: MDPs by Value Iteration
Insight: optimal values can be calculated iteratively using dynamic programming.
Algorithm:
• Iteratively calculate values using Bellman's equation:
  V*t+1(s) ← maxa [r(s,a) + γ V*t(δ(s,a))]
• Terminate when values are "close enough":
  |V*t+1(s) − V*t(s)| < ε
• The agent selects the optimal action by one-step lookahead on V*:
  π*(s) = argmaxa [r(s,a) + γ V*(δ(s,a))]
47
Crib Sheet: Q-Learning for Deterministic Worlds
Let Q̂ denote the current approximation to Q.
Initially:
• For each s, a initialize the table entry Q̂(s, a) ← 0
• Observe the current state s
Do forever:
• Select an action a and execute it
• Receive immediate reward r
• Observe the new state s'
• Update the table entry for Q̂(s, a) as follows:
  Q̂(s, a) ← r + γ maxa' Q̂(s', a')
• s ← s'