Learning to Maximize Reward: Reinforcement Learning


1

Learning to Maximize Reward: Reinforcement Learning


Brian C. Williams, 16.412J/6.834J, October 28th, 2002

Slides adapted from: Manuela Veloso, Reid Simmons, & Tom Mitchell, CMU

2

Reading
• Today: Reinforcement Learning
• Read 2nd ed. AIMA Chapter 19, or 1st ed. AIMA Chapter 20.
• Read "Reinforcement Learning: A Survey," by L. Kaelbling, M. Littman, and A. Moore, Journal of Artificial Intelligence Research 4 (1996) 237-285.
• For Markov Decision Processes: read 1st/2nd ed. AIMA Chapter 17, Sections 1-4.
• Optional reading: "Planning and Acting in Partially Observable Stochastic Domains," by L. Kaelbling, M. Littman, and A. Cassandra, Artificial Intelligence, Elsevier (1998).

3

Markov Decision Processes and Reinforcement Learning

• Motivation
• Learning policies through reinforcement
• Q values
• Q learning
• Multi-step backups
• Nondeterministic MDPs
• Function Approximators
• Model-based Learning
• Summary

4

Example: TD-Gammon [Tesauro, 1995]

Learns to play Backgammon

Situations:
• Board configurations (~10^20)

Actions:
• Moves

Rewards:
• +100 if win
• -100 if lose
• 0 for all other states

• Trained by playing 1.5 million games against itself.

Currently roughly equal to the best human player.

5

Reinforcement Learning Problem
Given: Repeatedly...
• Executed action
• Observed state
• Observed reward

Learn an action policy π : S → A that maximizes the life reward
r_0 + γ r_1 + γ² r_2 + ...
from any start state.
• Discount factor: 0 ≤ γ < 1

Note:
• Unsupervised learning
• Delayed reward

[Figure: agent-environment interaction loop. The agent observes the state and reward, chooses an action, and the environment responds: s_0, r_0 → a_0 → s_1, r_1 → a_1 → s_2, r_2 → a_2 → s_3, ...]

Goal: Learn to choose actions that maximize the life reward
r_0 + γ r_1 + γ² r_2 + ...
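As a quick numeric illustration (the rewards here are hypothetical, not from the slides): with γ = 0.9 and reward sequence r_0 = 0, r_1 = 0, r_2 = 100, the life reward is 0 + 0.9 · 0 + 0.81 · 100 = 81, so a reward delayed by two steps is worth 81 from the start state's point of view.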

6

How About Learning the Policy Directly?

1. π* : S → A
2. Fill out the table entries for π* by collecting statistics on training pairs <s, a*>.
3. Where does a* come from?

7

How About Learning the Value Function?

1. Have the agent learn the value function V^π*, denoted V*.

2. Given the learned V*, the agent selects the optimal action by one-step lookahead:

π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))]

Problem:
• Works well if the agent knows the environment model:
  • δ : S × A → S (state transition function)
  • r : S × A → ℝ (reward function)
• With no model, the agent can't choose an action from V*.
• With a model, V* could be computed via value iteration, so why learn it?

8

How About Learning the Model as Well?
1. Have the agent learn δ and r from statistics on training instances <s_t, a_t, r_{t+1}, s_{t+1}>.

2. Compute V* by value iteration:
V_{t+1}(s) ← max_a [r(s,a) + γ V_t(δ(s,a))]

3. The agent selects the optimal action by one-step lookahead:
π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))]

Problem: a viable strategy for many problems, but...
• When do you stop learning the model and compute V*?
• It may take a long time to converge on the model.
• We would like to continuously interleave learning and acting, but repeatedly computing V* is costly.
• How can we avoid learning the model and V* explicitly?

9

Eliminating the Model with Q Functions
π*(s) = argmax_a [r(s,a) + γ V*(δ(s,a))]

Key idea:
• Define a function that encapsulates V*, δ, and r:
Q(s,a) = r(s,a) + γ V*(δ(s,a))
• From the learned Q, an optimal action can be chosen without knowing δ or r:
π*(s) = argmax_a Q(s,a)

V* = cumulative reward of being in s.
Q = cumulative reward of being in s and taking action a.

10

Markov Decision Processes and Reinforcement Learning

• Motivation
• Learning policies through reinforcement
• Q values
• Q learning
• Multi-step backups
• Nondeterministic MDPs
• Function Approximators
• Model-based Learning
• Summary

11

How Do We Learn Q?
Q(s_t,a_t) = r(s_t,a_t) + γ V*(δ(s_t,a_t))

Idea:
• Create an update rule similar to the Bellman equation.
• Perform updates on training examples <s_t, a_t, r_{t+1}, s_{t+1}>:
Q(s_t,a_t) ← r_{t+1} + γ V*(s_{t+1})

How do we eliminate V*?
• Q and V* are closely related:
V*(s) = max_{a'} Q(s,a')
• Substituting Q for V*:
Q(s_t,a_t) ← r_{t+1} + γ max_{a'} Q(s_{t+1},a')

This update is called a backup.

12

Q-Learning for Deterministic Worlds
Let Q̂ denote the learner's current approximation to Q.

Initially:
• For each s, a, initialize the table entry Q̂(s,a) ← 0
• Observe the initial state s_0

Do for all time t:
• Select an action a_t and execute it
• Receive the immediate reward r_{t+1}
• Observe the new state s_{t+1}
• Update the table entry for Q̂(s_t,a_t):
Q̂(s_t,a_t) ← r_{t+1} + γ max_{a'} Q̂(s_{t+1},a')
• s_t ← s_{t+1}
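A minimal tabular sketch of this loop in Python; the environment interface (env.reset, env.step) and the episodic structure are assumptions for illustration, not part of the slides:

```python
import random
from collections import defaultdict

def q_learning_deterministic(env, actions, gamma=0.9, episodes=1000):
    """Tabular Q-learning for a deterministic world, using the backup rule above."""
    Q = defaultdict(float)                     # Q-hat table, entries start at 0
    for _ in range(episodes):
        s = env.reset()                        # observe initial state
        done = False
        while not done:
            a = random.choice(actions)         # any exploration scheme can go here
            s_next, r, done = env.step(a)      # execute action, get reward and new state
            # Backup: Q(s,a) <- r + gamma * max_a' Q(s',a')
            Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            s = s_next
    return Q
```

From the learned table, the greedy policy is simply π(s) = argmax_a Q̂(s,a).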

13

Example – Q-Learning Update

[Figure: grid fragment with states s1 and s2. The action a_right moves the agent from s1 to s2 with 0 immediate reward. The current Q̂ values on the actions leaving s2 are 63, 81, and 100; Q̂(s1, a_right) = 72 before the update; γ = 0.9.]

14

Example – Q-Learning Update

Q̂(s1, a_right) ← r(s1, a_right) + γ max_{a'} Q̂(s2, a')
 = 0 + 0.9 · max{63, 81, 100}
 = 90

Note: if rewards are non-negative, then
• For all s, a, n: Q̂_n(s,a) ≤ Q̂_{n+1}(s,a)
• For all s, a, n: 0 ≤ Q̂_n(s,a) ≤ Q(s,a)

[Figure: the same grid fragment after the backup, with Q̂(s1, a_right) now equal to 90.]

15

Q-Learning Iterations: Episodic
• Start at the upper left and move clockwise; the table starts at all zeros; γ = 0.8.
• Update rule: Q̂(s,a) ← r + γ max_{a'} Q̂(s',a')

[Figure: six-state grid world with s1, s2, s3 across the top and s4, s5, s6 along the bottom; s5 is the absorbing goal G, transitions into G earn reward 10, and all other transitions earn 0.]

Updated table entries, episode by episode (entries not listed remain 0):

Episode 1: Q̂(s1,E) = 0, Q̂(s2,E) = 0, Q̂(s3,S) = 0,
 Q̂(s4,W) ← r + γ max_{a'} {Q̂(s5,loop)} = 10 + 0.8 × 0 = 10

Episode 2: Q̂(s1,E) = 0, Q̂(s2,E) = 0,
 Q̂(s3,S) ← r + γ max {Q̂(s4,W), Q̂(s4,N)} = 0 + 0.8 × max{10, 0} = 8,
 Q̂(s4,W) = 10

Episode 3: Q̂(s1,E) = 0,
 Q̂(s2,E) ← r + γ max {Q̂(s3,W), Q̂(s3,S)} = 0 + 0.8 × max{0, 8} = 6.4,
 Q̂(s3,S) = 8, Q̂(s4,W) = 10

22

Example Summary: Value Iteration and Q-Learning

[Figure: the six-state grid world from the earlier example with reward 100 for entering the goal G and γ = 0.9, shown four ways: the immediate rewards r(s,a), the resulting optimal values V*(s) (100, 90, 81), the Q(s,a) values (100, 90, 81, 72), and one optimal policy.]

23

Exploration vs Exploitation

How do you pick actions as you learn?

1. Greedy action selection:
• Always select the action that currently looks best:
π(s) = argmax_a Q̂(s,a)

2. Probabilistic action selection:
• Make the likelihood of choosing a_i increase with its current Q̂ value:
P(a_i | s) = k^Q̂(s,a_i) / Σ_j k^Q̂(s,a_j)
where the constant k > 0 controls how strongly selection favors high-Q̂ actions.
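A small Python sketch of both strategies (the dictionary-shaped Q̂ table and the choice of k are assumptions for illustration):

```python
import random

def greedy_action(Q, s, actions):
    """Greedy selection: always take the action with the largest current Q-hat."""
    return max(actions, key=lambda a: Q[(s, a)])

def probabilistic_action(Q, s, actions, k=2.0):
    """Probabilistic selection: P(a_i|s) = k**Q(s,a_i) / sum_j k**Q(s,a_j).
    Larger k exploits more aggressively; k near 1 explores more."""
    weights = [k ** Q[(s, a)] for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]
```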

24

Markov Decision Processes and Reinforcement Learning

• Motivation
• Learning policies through reinforcement
• Q values
• Q learning
• Multi-step backups
• Nondeterministic MDPs
• Function Approximators
• Model-based Learning
• Summary

25

TD(λ): Temporal Difference Learning
(from lecture slides: Machine Learning, T. Mitchell, McGraw Hill, 1997)

Q-learning: reduce the discrepancy between successive Q estimates.

One-step time difference:
Q^(1)(s_t,a_t) = r_t + γ max_a Q̂(s_{t+1},a)

Why not two steps?
Q^(2)(s_t,a_t) = r_t + γ r_{t+1} + γ² max_a Q̂(s_{t+2},a)

Or n?
Q^(n)(s_t,a_t) = r_t + γ r_{t+1} + ... + γ^(n-1) r_{t+n-1} + γ^n max_a Q̂(s_{t+n},a)

Blend all of these:
Q^λ(s_t,a_t) = (1-λ) [ Q^(1)(s_t,a_t) + λ Q^(2)(s_t,a_t) + λ² Q^(3)(s_t,a_t) + ... ]
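A small sketch (a hypothetical helper, not from the slides) of how the blended return can be computed from a finite list of n-step estimates; with a finite list the geometric weights are truncated, so this only approximates the Q^λ defined above:

```python
def lambda_return(n_step_returns, lam):
    """Blend n-step estimates Q^(1), ..., Q^(N) with weights (1-lam) * lam**(n-1)."""
    total = 0.0
    weight = 1.0 - lam
    for q_n in n_step_returns:
        total += weight * q_n
        weight *= lam            # geometric decay of the weight on longer lookaheads
    return total

# Example: if every n-step estimate already equals 5.0, the blend is close to 5.0.
print(lambda_return([5.0, 5.0, 5.0, 5.0, 5.0], lam=0.5))  # ~4.84 (truncation loses tail weight)
```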

26

Eligibility Traces

Idea: perform backups on the N previous data points as well as on the most recent one.
• Select which data to back up based on frequency of visitation.
• Weight recently and frequently visited data more heavily via a geometric decay (γλ)^(i-j).

[Figure: visits to a data point <s,a> plotted over time t, and the corresponding accumulating trace, which jumps at each visit and decays geometrically in between.]
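A minimal sketch of accumulating traces in a Sarsa(λ)-style update; the Sarsa flavor, the learning rate alpha, and the table layout are assumptions, since the slide describes traces generically:

```python
from collections import defaultdict

Q = defaultdict(float)   # Q-hat table
e = defaultdict(float)   # eligibility trace for each (s, a) pair

def sarsa_lambda_step(s, a, r, s_next, a_next, alpha=0.1, gamma=0.9, lam=0.8):
    """One accumulating-trace backup: every previously visited (s, a) pair is
    updated in proportion to its trace, which decays by gamma*lam each step."""
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]   # one-step TD error
    e[(s, a)] += 1.0                                      # accumulate trace on visit
    for key in list(e):
        Q[key] += alpha * delta * e[key]                  # backup weighted by trace
        e[key] *= gamma * lam                             # geometric decay
```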

27

Markov Decision Processes and Reinforcement Learning

• Motivation
• Learning policies through reinforcement
• Nondeterministic MDPs:
  • Value Iteration
  • Q Learning
• Function Approximators
• Model-based Learning
• Summary

28

Nondeterministic MDPs
State transitions become probabilistic: δ(s,a,s') is the probability that taking action a in state s leads to state s'.

[Figure (example): four states, S1 Unemployed, S2 Industry, S3 Grad School, and S4 Academia, each with two actions, R (research path) and D (development path); the arcs are labeled with transition probabilities of 0.9, 0.1, and 1.0.]

29

Nondeterministic Case
• How do we redefine cumulative reward to handle nondeterminism?
• Define V and Q in terms of expected values:

V^π(s_t) = E[ r_t + γ r_{t+1} + γ² r_{t+2} + ... ]
V^π(s_t) = E[ Σ_i γ^i r_{t+i} ]

Q(s_t,a_t) = E[ r(s_t,a_t) + γ V*(δ(s_t,a_t)) ]

30

Value Iteration for Non-deterministic MDPs

V_1(s) := 0 for all s
t := 1
loop
  t := t + 1
  loop for all s in S
    loop for all a in A
      Q_t(s,a) := r(s,a) + γ Σ_{s' in S} δ(s,a,s') V_{t-1}(s')
    end loop
    V_t(s) := max_a [ Q_t(s,a) ]
  end loop
until |V_t(s) - V_{t-1}(s)| < ε for all s in S
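A compact Python rendering of the same loop (the dictionary-based encodings of δ and r are assumptions for illustration):

```python
def value_iteration(S, A, delta, r, gamma=0.9, eps=1e-6):
    """delta[(s, a)] maps each successor s' to its probability; r[(s, a)] is the reward."""
    V = {s: 0.0 for s in S}
    while True:
        Q = {(s, a): r[(s, a)] + gamma * sum(p * V[s2] for s2, p in delta[(s, a)].items())
             for s in S for a in A}
        V_new = {s: max(Q[(s, a)] for a in A) for s in S}
        if all(abs(V_new[s] - V[s]) < eps for s in S):
            return V_new, Q
        V = V_new
```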

31

Q Learning for Nondeterministic MDPs

Q*(s,a) = r(s,a) + γ Σ_{s' in S} δ(s,a,s') max_{a'} Q*(s',a')

• Alter the training rule to take a decaying weighted average:

Q̂_n(s_t,a_t) ← (1 - α_n) Q̂_{n-1}(s_t,a_t) + α_n [ r_{t+1} + γ max_{a'} Q̂_{n-1}(s_{t+1},a') ]

where α_n = 1 / (1 + visits_n(s,a))

Convergence of Q̂ to Q* can still be proved [Watkins and Dayan, 92].
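A sketch of this update as a single training step; the visit-count bookkeeping and dictionary tables are illustrative assumptions:

```python
from collections import defaultdict

Q = defaultdict(float)       # current Q-hat estimates
visits = defaultdict(int)    # visits_n(s, a)

def nondeterministic_q_update(s, a, r, s_next, actions, gamma=0.9):
    """Q_n(s,a) <- (1 - alpha_n) Q_{n-1}(s,a) + alpha_n [r + gamma max_a' Q_{n-1}(s',a')],
    with alpha_n = 1 / (1 + visits_n(s,a))."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1 + visits[(s, a)])
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```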

32

Markov Decision Processes and Reinforcement Learning

• Motivation
• Learning policies through reinforcement
• Nondeterministic MDPs
• Function Approximators
• Model-based Learning
• Summary

33

Function Approximation

Function Approximators:
• Backprop neural network
• Radial basis function network
• CMAC network
• Nearest neighbor / memory-based
• Decision tree

[Figure: the function approximator takes the state s and action a as input and outputs Q(s,a); gradient-descent methods adjust it toward the supplied targets or errors.]

34

Function Approximation Example: Adjusting Network Weights

Function approximator:
• Q(s,a) = f(s,a,w)

Update (gradient-descent Sarsa):
• w ← w + α [ r_{t+1} + γ Q(s_{t+1},a_{t+1}) - Q(s_t,a_t) ] ∇_w f(s_t,a_t,w)

[Figure: the function-approximator diagram again, annotated with the weight vector w, the estimated value Q(s_t,a_t), the target value r_{t+1} + γ Q(s_{t+1},a_{t+1}), and the standard backprop gradient.]
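A minimal sketch of this update for a linear approximator Q(s,a) = w · φ(s,a); the feature map phi is an assumed placeholder, and for a linear model the gradient ∇_w f is just the feature vector:

```python
import numpy as np

def gradient_sarsa_update(w, phi, s, a, r, s_next, a_next, alpha=0.01, gamma=0.9):
    """One gradient-descent Sarsa step for a linear approximator Q(s,a) = w . phi(s,a)."""
    q_sa = w @ phi(s, a)
    q_next = w @ phi(s_next, a_next)
    td_error = r + gamma * q_next - q_sa      # target value minus estimated value
    return w + alpha * td_error * phi(s, a)   # grad_w of w . phi(s,a) is phi(s,a)
```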

35

Example: TD-Gammon [Tesauro, 1995]

Learns to play Backgammon

Situations:
• Board configurations (~10^20)

Actions:
• Moves

Rewards:
• +100 if win
• -100 if lose
• 0 for all other states

• Trained by playing 1.5 million games against itself.

Currently roughly equal to the best human player.

36

Example: TD-Gammon [Tesauro, 1995]

[Figure: a neural network with random initial weights maps the raw board position (the number of pieces at each location) through 0-160 hidden units to V̂(s), the predicted probability of winning. The training signal is the outcome (1 on a win, 0 on a loss), and the TD error is V̂(s_{t+1}) - V̂(s_t).]

37

Markov Decision Processes and Reinforcement Learning

• Motivation
• Learning policies through reinforcement
• Nondeterministic MDPs
• Function Approximators
• Model-based Learning
• Summary

38

Model-based Learning: the Certainty-Equivalence Method
For every step:
1. Use the new experience to update the model parameters:
• Transitions
• Rewards
2. Solve the model for V and π, by:
• Value iteration, or
• Policy iteration.
3. Use the resulting policy to choose the next action.

39

Learning the Model

For each state-action pair <s,a> visited, accumulate:

1. Mean transition probabilities:
T(s,a,s') = number-times-seen(s, a → s') / number-times-tried(s, a)

2. Mean reward:
R(s,a) = average of the rewards received after taking a in s
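A small sketch of accumulating these statistics (the table shapes are assumptions for illustration):

```python
from collections import defaultdict

trans_counts = defaultdict(int)    # counts of (s, a, s') triples
tried_counts = defaultdict(int)    # counts of (s, a) pairs
reward_sums = defaultdict(float)   # running sum of rewards for (s, a)

def update_model(s, a, r, s_next):
    """Accumulate statistics for the certainty-equivalence model."""
    trans_counts[(s, a, s_next)] += 1
    tried_counts[(s, a)] += 1
    reward_sums[(s, a)] += r

def T(s, a, s_next):
    """Estimated transition probability: times-seen / times-tried."""
    n = tried_counts[(s, a)]
    return trans_counts[(s, a, s_next)] / n if n else 0.0

def R(s, a):
    """Estimated mean reward for taking a in s."""
    n = tried_counts[(s, a)]
    return reward_sums[(s, a)] / n if n else 0.0
```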

40

Comparison of Model-based and Model-free Methods

Temporal differencing / Q-learning: only does computation for the states the system actually visits.
• Good real-time performance
• Inefficient use of data

Model-based methods: compute the best estimates for every state on every time step.
• Efficient use of data
• Terrible real-time performance

What is a middle ground?

41

Dyna: A Middle Ground [Sutton, Intro to RL, 97]

At each step, incrementally:
1. Update the model based on the new data
2. Update the policy based on the new data
3. Update the policy based on the updated model

Performance (until optimal) on a grid world:
• Q-learning: 531,000 steps; 531,000 backups
• Dyna: 61,908 steps; 3,055,000 backups

42

Dyna Algorithm

Given state s:
1. Choose action a using the estimated policy.
2. Observe the new state s' and reward r.
3. Update T and R of the model.
4. Update V at <s, a>:
V(s) ← max_a [ r(s,a) + γ Σ_{s'} T(s,a,s') V(s') ]
5. Perform k additional updates:
 a) Pick k random states s_j in {s_1, s_2, ..., s_k}
 b) Update each V(s_j):
V(s_j) ← max_a [ r(s_j,a) + γ Σ_{s'} T(s_j,a,s') V(s') ]
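A condensed sketch of one Dyna step over a tabular model; the model callables follow the certainty-equivalence statistics sketched earlier, and everything else is an illustrative assumption:

```python
import random

def dyna_step(V, s, a, r, s_next, S, A, T, R, update_model, gamma=0.9, k=10):
    """One Dyna iteration: a real-experience backup at s, plus k simulated backups.

    S is a list of states, A a list of actions; T(s, a, s2) and R(s, a) are the
    current model estimates, and update_model refines them from new experience."""
    def backup(state):
        return max(R(state, act) + gamma * sum(T(state, act, s2) * V[s2] for s2 in S)
                   for act in A)

    update_model(s, a, r, s_next)                   # step 3: refine the model
    V[s] = backup(s)                                # step 4: backup at the state just visited
    for s_j in random.sample(S, min(k, len(S))):    # step 5: k additional backups at random states
        V[s_j] = backup(s_j)
    return V
```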

43

Markov Decision Processes and Reinforcement Learning

• Motivation
• Learning policies through reinforcement
• Nondeterministic MDPs
• Function Approximators
• Model-based Learning
• Summary

44

Ongoing Research
• Handling cases where the state is only partially observable
• Designing optimal exploration strategies
• Extending to continuous actions and states
• Learning and using a model δ̂ : S × A → S
• Scaling up in the size of the state space:
  • Function approximators (a neural net instead of a table)
  • Generalization
  • Macros
  • Exploiting substructure
• Multiple learners – multi-agent reinforcement learning

45

Markov Decision Processes (MDPs)
Model:
• Finite set of states, S
• Finite set of actions, A
• Probabilistic state transitions, δ(s,a)
• Reward for each state and action, R(s,a)

Process:
• Observe state s_t in S
• Choose action a_t in A
• Receive immediate reward r_t
• State changes to s_{t+1}

Deterministic example:

[Figure: the six-state grid world used earlier, with goal G and reward 10 on the transitions into G. Only legal transitions are shown; the reward on unlabeled transitions is 0. Below it, an example interaction sequence: s_0, r_0 → a_0 → s_1, r_1 → a_1 → s_2, r_2 → a_2 → s_3, ...]

46

Crib Sheet: MDPs by Value Iteration
Insight: the optimal values can be calculated iteratively using dynamic programming.

Algorithm:
• Iteratively calculate the values using Bellman's equation:
V*_{t+1}(s) ← max_a [ r(s,a) + γ V*_t(δ(s,a)) ]
• Terminate when the values are "close enough":
|V*_{t+1}(s) - V*_t(s)| < ε for all s
• The agent then selects the optimal action by one-step lookahead on V*:
π*(s) = argmax_a [ r(s,a) + γ V*(δ(s,a)) ]

47

Crib Sheet: Q-Learning for Deterministic Worlds
Let Q̂ denote the current approximation to Q.

Initially:
• For each s, a, initialize the table entry Q̂(s,a) ← 0
• Observe the current state s

Do forever:
• Select an action a and execute it
• Receive the immediate reward r
• Observe the new state s'
• Update the table entry for Q̂(s,a):
Q̂(s,a) ← r + γ max_{a'} Q̂(s',a')
• s ← s'
