1
Learning to Maximize Reward: Reinforcement Learning
Brian C. Williams
16.412J/6.834J
October 28th, 2002
Slides adapted from: Manuela Veloso, Reid Simmons, & Tom Mitchell, CMU
2
Reading
• Today: Reinforcement Learning
  • Read 2nd ed. AIMA Chapter 19, or 1st ed. AIMA Chapter 20.
  • Read "Reinforcement Learning: A Survey" by L. Kaelbling, M. Littman and A. Moore, Journal of Artificial Intelligence Research 4 (1996) 237-285.
• For Markov Decision Processes
  • Read 1st/2nd ed. AIMA Chapter 17, Sections 1–4.
• Optional reading: "Planning and Acting in Partially Observable Stochastic Domains" by L. Kaelbling, M. Littman and A. Cassandra, Artificial Intelligence 101 (1998).
3
Markov Decision Processes and Reinforcement Learning
• Motivation
• Learning policies through reinforcement
  • Q values
  • Q learning
  • Multi-step backups
• Nondeterministic MDPs
• Function Approximators
• Model-based Learning
• Summary
4
Example: TD-Gammon [Tesauro, 1995]
Learns to play Backgammon
Situations:
• Board configurations (~10^20)
Actions:
• Moves
Rewards:
• +100 if win
• −100 if lose
• 0 for all other states
• Trained by playing 1.5 million games against self.
Currently, roughly equal to best human player.
5
Reinforcement Learning Problem
Given: Repeatedly…
• Executed action
• Observed state
• Observed reward
Learn action policy π : S → A that maximizes life reward
  r0 + γ r1 + γ² r2 + …
from any start state.
• Discount: 0 ≤ γ < 1
Note:
• Unsupervised learning
• Delayed reward
(Agent–environment loop figure: at each step the agent observes state st and reward rt from the environment and chooses action at, producing the trace s0, r0 →a0→ s1, r1 →a1→ s2, r2 →a2→ s3 …)
Goal: Learn to choose actions that maximize life reward
  r0 + γ r1 + γ² r2 + …
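For concreteness, the life reward above can be computed for any finite reward sequence as in this small sketch (the helper name and the example rewards are illustrative, not from the slides):

```python
def discounted_return(rewards, gamma=0.9):
    """Life reward r0 + gamma*r1 + gamma^2*r2 + ... over a finite horizon."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# e.g. three steps with rewards 0, 0, 100 and gamma = 0.9 give 0.9^2 * 100 = 81.0
print(discounted_return([0, 0, 100]))
```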
6
How About Learning the Policy Directly?
1. π* : S → A
2. Fill out table entries for π* by collecting statistics on training pairs <s, a>.
3. Where does a* come from?
7
How About Learning the Value Function?
1. Have the agent learn the value function for the optimal policy, denoted V*.
2. Given the learned V*, the agent selects the optimal action by one-step lookahead:
  π*(s) = argmaxa [r(s,a) + γ V*(δ(s,a))]
Problem:
• Works well if the agent knows the environment model:
  • δ : S × A → S
  • r : S × A → ℝ
• With no model, the agent can't choose an action from V*.
• With a model, V* could be computed via value iteration, so why learn it?
8
How About Learning the Model as Well?
1. Have the agent learn δ and r by statistics on training instances <st, at, rt+1, st+1>.
2. Compute V* by value iteration:
  Vt+1(s) ← maxa [r(s,a) + γ Vt(δ(s,a))]
3. The agent selects the optimal action by one-step lookahead:
  π*(s) = argmaxa [r(s,a) + γ V*(δ(s,a))]
Problem: A viable strategy for many problems, but …
• When do you stop learning the model and compute V*?
• May take a long time to converge on a model.
• Would like to continuously interleave learning and acting, but repeatedly computing V* is costly.
• How can we avoid learning the model and V* explicitly?
9
Eliminating the Model with Q Functions
  π*(s) = argmaxa [r(s,a) + γ V*(δ(s,a))]
Key idea:
• Define a function that encapsulates V*, δ, and r:
  Q(s,a) = r(s,a) + γ V*(δ(s,a))
• From the learned Q, an optimal action can be chosen without knowing δ or r:
  π*(s) = argmaxa Q(s,a)
V*(s) = cumulative reward of being in s.
Q(s,a) = cumulative reward of being in s and taking action a.
10
Markov Decision Processes and Reinforcement Learning
• Motivation• Learning policies through reinforcement
• Q values• Q learning• Multi-step backups
• Nondeterministic MDPs• Function Approximators• Model-based Learning• Summary
11
How Do We Learn Q?
  Q(st,at) = r(st,at) + γ V*(δ(st,at))
Idea:
• Create an update rule similar to the Bellman equation.
• Perform updates on training examples <st, at, rt+1, st+1>:
  Q(st,at) ← rt+1 + γ V*(st+1)
How do we eliminate V*?
• Q and V* are closely related:
  V*(s) = maxa' Q(s,a')
• Substituting Q for V*:
  Q(st,at) ← rt+1 + γ maxa' Q(st+1,a')
Called a backup
12
Q-Learning for Deterministic Worlds
Let Q̂ denote the current approximation to Q.
Initially:
• For each s, a initialize the table entry Q̂(s, a) ← 0
• Observe the initial state s0
Do for all time t:
• Select an action at and execute it
• Receive immediate reward rt+1
• Observe the new state st+1
• Update the table entry for Q̂(st, at) as follows:
  Q̂(st, at) ← rt+1 + γ maxa' Q̂(st+1, a')
• st ← st+1
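A minimal Python sketch of the tabular algorithm above. The environment interface step(s, a) -> (reward, next_state) and the actions(s) helper are assumed names for illustration, and exploration is left as a uniform random choice:

```python
import random
from collections import defaultdict

def q_learning_deterministic(step, actions, s0, gamma=0.9, episodes=1000, horizon=100):
    """Tabular Q-learning for a deterministic world.

    step(s, a) -> (reward, next_state) and actions(s) -> list of actions
    are assumed environment interfaces.
    """
    Q = defaultdict(float)                   # Q table, every entry initialized to 0
    for _ in range(episodes):
        s = s0                               # observe the initial state
        for _ in range(horizon):
            a = random.choice(actions(s))    # select an action a_t and execute it
            r, s_next = step(s, a)           # receive reward r_{t+1}, observe s_{t+1}
            # Backup: Q(s_t, a_t) <- r_{t+1} + gamma * max_a' Q(s_{t+1}, a')
            Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions(s_next))
            s = s_next                       # s_t <- s_{t+1}
    return Q
```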
13
Example – Q-Learning Update
(Grid-world figure, γ = 0.9: the agent moves right from s1 into s2 and receives 0 immediate reward; the Q values of the actions available from s2 are 63, 81, and 100; the arrow from s1 to s2 is currently labeled 72.)
14
Example – Q Learning Update
Q(s1,aright) ← r(s1,aright) + γ maxa' Q(s2,a')
            = 0 + 0.9 × max{63, 81, 100}
            = 90
Note: if rewards are non-negative:
• For all s, a, n: Qn(s, a) ≤ Qn+1(s, a)
• For all s, a, n: 0 ≤ Qn(s, a) ≤ Q(s, a)
(Same figure, γ = 0.9: after the backup the arrow from s1 to s2 is relabeled from 72 to 90; the Q values out of s2 remain 63, 81, and 100; 0 reward was received on the move.)
15
Q-Learning Iterations: Episodic
• Start at upper left – move clockwise; table initially 0; γ = 0.8
  Q(s, a) ← r + γ maxa' Q(s',a')
(Grid-world figure: six states in a 2×3 grid with s1, s2, s3 across the top; the absorbing goal G is in the bottom row; transitions entering G receive reward 10, all others 0.)
Q(s1,E)   Q(s2,E)   Q(s3,S)   Q(s4,W)
0         –         –         –
16
Q-Learning Iterations: Episodic
• Start at upper left – move clockwise; table initially 0; γ = 0.8
  Q(s, a) ← r + γ maxa' Q(s',a')
(Same grid-world figure.)
Q(s1,E)   Q(s2,E)   Q(s3,S)   Q(s4,W)
0         0         0         –
17
Q-Learning Iterations
• Start at upper left – move clockwise; γ = 0.8
  Q(s, a) ← r + γ maxa' Q(s',a')
(Same grid-world figure.)
Q(s1,E)   Q(s2,E)   Q(s3,S)   Q(s4,W)
0         0         0         r + γ maxa' {Q(s5,loop)} = 10 + 0.8 × 0 = 10
18
Q-Learning Iterations
• Start at upper left – move clockwise; γ = 0.8
  Q(s, a) ← r + γ maxa' Q(s',a')
(Same grid-world figure.)
Q(s1,E)   Q(s2,E)   Q(s3,S)   Q(s4,W)
0         0         0         10
0         0         r + γ maxa' {Q(s4,W), Q(s4,N)} = 0 + 0.8 × max{10, 0} = 8
19
Q-Learning Iterations
• Start at upper left – move clockwise; γ = 0.8
  Q(s, a) ← r + γ maxa' Q(s',a')
(Same grid-world figure.)
Q(s1,E)   Q(s2,E)   Q(s3,S)   Q(s4,W)
0         0         0         10
0         0         8         10
0         r + γ maxa' {Q(s3,W), Q(s3,S)} = 0 + 0.8 × max{0, 8} = 6.4
20
Q-Learning Iterations
• Start at upper left – move clockwise; γ = 0.8
  Q(s, a) ← r + γ maxa' Q(s',a')
(Same grid-world figure.)
Q(s1,E)   Q(s2,E)   Q(s3,S)   Q(s4,W)
0         0         0         10
0         0         8         10
0         6.4       8         10
21
Q-Learning Iterations
• Start at upper left – move clockwise; γ = 0.8
  Q(s, a) ← r + γ maxa' Q(s',a')
(Same grid-world figure.)
Q(s1,E)   Q(s2,E)   Q(s3,S)   Q(s4,W)
0         0         0         10
0         0         8         10
0         6.4       8         10
22
Example Summary: Value Iteration and Q-Learning
(Figure: the same grid world shown four ways – the immediate rewards R(s,a) (100 on transitions into the goal G, 0 elsewhere), the resulting Q(s,a) values (100, 90, 81, 72, …), the V*(s) values (100, 90, 81, …), and one optimal policy, with all arrows leading toward G.)
23
Exploration vs Exploitation
How do you pick actions as you learn?
1. Greedy action selection:
• Always select the action that looks best:
  π(s) = argmaxa Q̂(s,a)
2. Probabilistic action selection:
• Likelihood of a is proportional to the current Q̂ value:
  P(ai|s) = k^Q̂(s,ai) / Σj k^Q̂(s,aj)
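A small sketch of the probabilistic rule, with k > 1 acting as the exploitation constant (larger k concentrates probability on high-valued actions); the function and table names are illustrative:

```python
import random

def probabilistic_action(Q, s, actions, k=2.0):
    """Select action a_i with probability k**Q(s,a_i) / sum_j k**Q(s,a_j)."""
    weights = [k ** Q.get((s, a), 0.0) for a in actions]
    return random.choices(actions, weights=weights)[0]
```

Greedy selection, by contrast, is simply max(actions, key=lambda a: Q.get((s, a), 0.0)).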
24
Markov Decision Processes and Reinforcement Learning
• Motivation
• Learning policies through reinforcement
  • Q values
  • Q learning
  • Multi-step backups
• Nondeterministic MDPs
• Function Approximators
• Model-based Learning
• Summary
25
TD(λ): Temporal Difference Learning
(from lecture slides: Machine Learning, T. Mitchell, McGraw Hill, 1997)
Q learning: reduce discrepancy between successive Q estimates
One-step time difference:
  Q(1)(st,at) = rt + γ maxa Q(st+1,a)
Why not two steps?
  Q(2)(st,at) = rt + γ rt+1 + γ² maxa Q(st+2,a)
Or n?
  Q(n)(st,at) = rt + γ rt+1 + … + γ^(n−1) rt+n−1 + γ^n maxa Q(st+n,a)
Blend all of these:
  Qλ(st,at) = (1−λ) [Q(1)(st,at) + λ Q(2)(st,at) + λ² Q(3)(st,at) + …]
26
Eligibility Traces
Idea: Perform backups on the N previous data points, as well as the most recent data point.
• Select data to back up based on frequency of visitation.
• Bias towards frequent data by a geometric decay λ^(i−j).
(Figure: visits to the data point <s,a> plotted over time t, and the corresponding accumulating trace over time t.)
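One common way to make eligibility traces concrete is a tabular SARSA(λ)-style backup with an accumulating trace e(s,a). The sketch below is an illustration under that assumption (Q and e are defaultdict(float) tables keyed by (state, action)), not necessarily the exact variant on the slide:

```python
from collections import defaultdict

def sarsa_lambda_update(Q, e, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9, lam=0.8):
    """One backup with accumulating eligibility traces."""
    delta = r + gamma * Q[(s_next, a_next)] - Q[(s, a)]   # one-step TD error
    e[(s, a)] += 1.0                                      # accumulate the trace for the visited pair
    for key in list(e.keys()):
        Q[key] += alpha * delta * e[key]                  # back up in proportion to eligibility
        e[key] *= gamma * lam                             # geometric decay of every trace

# usage: Q = defaultdict(float); e = defaultdict(float) before the first call
```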
27
Markov Decision Processes and Reinforcement Learning
• Motivation
• Learning policies through reinforcement
• Nondeterministic MDPs:
  • Value Iteration
  • Q Learning
• Function Approximators
• Model-based Learning
• Summary
28
Nondeterministic MDPs
State transitions become probabilistic: δ(s, a, s')
(Example figure: four states – s1 Unemployed, s2 Industry, s3 Grad School, s4 Academia – each with actions R (research path) and D (development path); each action's outgoing transitions carry probabilities such as 0.9/0.1 or 1.0.)
29
Nondeterministic Case
• How do we redefine cumulative reward to handle nondeterminism?
• Define V and Q based on expected values:
  V(st) = E[rt + γ rt+1 + γ² rt+2 + …]
  V(st) = E[ Σi γ^i rt+i ]
  Q(st,at) = E[ r(st,at) + γ V*(δ(st,at)) ]
30
Value Iteration for Non-deterministic MDPs
V1(s) := 0 for all s
t := 1
loop
  t := t + 1
  loop for all s in S
    loop for all a in A
      Qt(s, a) := r(s, a) + γ Σs'∈S δ(s, a, s') Vt−1(s')
    end loop
    Vt(s) := maxa [Qt(s, a)]
  end loop
until |Vt(s) − Vt−1(s)| < ε for all s in S
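A compact Python sketch of this value-iteration loop; R[(s, a)] and T[(s, a, s')] are assumed dictionary encodings of the reward and transition model:

```python
def value_iteration(S, A, R, T, gamma=0.9, epsilon=1e-4):
    """Value iteration for a nondeterministic MDP with tabular R and T."""
    V = {s: 0.0 for s in S}
    while True:
        # Q_t(s, a) := r(s, a) + gamma * sum_s' delta(s, a, s') * V_{t-1}(s')
        Q = {(s, a): R.get((s, a), 0.0)
                     + gamma * sum(T.get((s, a, s2), 0.0) * V[s2] for s2 in S)
             for s in S for a in A}
        V_new = {s: max(Q[(s, a)] for a in A) for s in S}    # V_t(s) := max_a Q_t(s, a)
        if max(abs(V_new[s] - V[s]) for s in S) < epsilon:   # terminate when values settle
            return V_new, Q
        V = V_new
```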
31
Q Learning for Nondeterministic MDPs
Q*(s, a) = r(s, a) + γ Σs'∈S δ(s, a, s') maxa' [Q*(s', a')]
• Alter the training rule for the nondeterministic case:
  Qn(st, at) ← (1 − αn) Qn−1(st, at) + αn [rt+1 + γ maxa' Qn−1(st+1, a')]
  where αn = 1 / (1 + visitsn(s, a))
• Can still prove convergence of Q [Watkins and Dayan, 92]
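A sketch of this training rule with the decaying learning rate alpha_n; the table and counter names are illustrative:

```python
from collections import defaultdict

Q = defaultdict(float)      # current estimate of Q
visits = defaultdict(int)   # visit count per (s, a) pair

def nondeterministic_q_update(s, a, r, s_next, actions, gamma=0.9):
    """Q_n(s,a) <- (1 - alpha_n) Q_{n-1}(s,a) + alpha_n [r + gamma * max_a' Q_{n-1}(s',a')]."""
    visits[(s, a)] += 1
    alpha = 1.0 / (1.0 + visits[(s, a)])                    # alpha_n = 1 / (1 + visits_n(s, a))
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] = (1.0 - alpha) * Q[(s, a)] + alpha * target
```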
32
Markov Decision Processes and Reinforcement Learning
• Motivation
• Learning policies through reinforcement
• Nondeterministic MDPs
• Function Approximators
• Model-based Learning
• Summary
33
Function Approximation
Function Approximators:
• Backprop Neural Network
• Radial Basis Function Network
• CMAC Network
• Nearest Neighbor, Memory-based
• Decision Tree
(Diagram: state s and action a feed into the function approximator, which outputs Q(s,a); the approximator is trained by gradient-descent methods from targets or errors.)
34
Function Approximation Example: Adjusting Network Weights
Function approximator:
• Q(s,a) = f(s,a,w)
Update: gradient-descent Sarsa:
• w ← w + α [rt+1 + γ Q(st+1,at+1) − Q(st,at)] ∇w f(st,at,w)
(Diagram: as above, with w the weight vector, Q(st,at) the estimated value, rt+1 + γ Q(st+1,at+1) the target value, and ∇w f the standard backprop gradient.)
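A sketch of this gradient-descent Sarsa update for the special case of a linear approximator Q(s,a) = w · φ(s,a); the feature map phi(s, a) is an assumed helper returning a numpy vector, and for a linear model the gradient ∇w f is simply φ(s,a):

```python
import numpy as np

def sarsa_weight_update(w, phi, s, a, r, s_next, a_next, alpha=0.01, gamma=0.9):
    """One gradient-descent Sarsa step on the weight vector w, for Q(s,a) = w . phi(s,a)."""
    estimated = w @ phi(s, a)                        # Q(s_t, a_t)
    target = r + gamma * (w @ phi(s_next, a_next))   # r_{t+1} + gamma * Q(s_{t+1}, a_{t+1})
    return w + alpha * (target - estimated) * phi(s, a)   # gradient of a linear model is phi(s, a)
```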
35
Example: TD-Gammon [Tesauro, 1995]
Learns to play Backgammon
Situations:
• Board configurations (~10^20)
Actions:
• Moves
Rewards:
• +100 if win
• −100 if lose
• 0 for all other states
• Trained by playing 1.5 million games against self.
Currently, roughly equal to best human player.
36
Example: TD-Gammon [Tesauro, 1995]
(Network diagram: the raw board position (number of pieces at each location) feeds into 0–160 hidden units with random initial weights; the output V(s) is the predicted probability of winning; the outcome is 1 on a win and 0 on a loss; the training signal is the TD error V(st+1) − V(st).)
37
Markov Decision Processes and Reinforcement Learning
• Motivation
• Learning policies through reinforcement
• Nondeterministic MDPs
• Function Approximators
• Model-based Learning
• Summary
38
Model-based Learning: the Certainty-Equivalence Method
For every step:
1. Use the new experience to update the model parameters:
  • Transitions
  • Rewards
2. Solve the model for V and π:
  • Value iteration
  • Policy iteration
3. Use the policy π to choose the next action.
39
Learning the Model
For each state-action pair <s, a> visited, accumulate:
1. Mean transition:
  T(s, a, s') = number-times-seen(s, a → s') / number-times-tried(s, a)
2. Mean reward:
  R(s, a) = (sum of rewards received on trying a in s) / number-times-tried(s, a)
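A direct transcription of these statistics in Python; the dictionary names are illustrative:

```python
from collections import defaultdict

transition_counts = defaultdict(int)   # (s, a, s') -> number-times-seen(s, a -> s')
tried_counts = defaultdict(int)        # (s, a)     -> number-times-tried(s, a)
reward_sums = defaultdict(float)       # (s, a)     -> accumulated reward

def record(s, a, r, s_next):
    """Accumulate statistics for one observed transition <s, a, r, s'>."""
    tried_counts[(s, a)] += 1
    transition_counts[(s, a, s_next)] += 1
    reward_sums[(s, a)] += r

def T(s, a, s_next):
    """Estimated mean transition probability."""
    n = tried_counts[(s, a)]
    return transition_counts[(s, a, s_next)] / n if n else 0.0

def R(s, a):
    """Estimated mean reward."""
    n = tried_counts[(s, a)]
    return reward_sums[(s, a)] / n if n else 0.0
```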
40
Comparison of Model-based and Model-free Methods
Temporal Differencing / Q-Learning: only does computation for the states the system is actually in.
• Good real-time performance
• Inefficient use of data
Model-based methods: compute the best estimates for every state on every time step.
• Efficient use of data
• Terrible real-time performance
What is a middle ground?
41
Dyna: A Middle Ground [Sutton, Intro to RL, 97]
At each step, incrementally:
1. Update the model based on new data
2. Update the policy based on new data
3. Update the policy based on the updated model
Performance, until optimal, on a grid world:
• Q-Learning: 531,000 steps, 531,000 backups
• Dyna: 61,908 steps, 3,055,000 backups
42
Dyna Algorithm
Given state s:
1. Choose action a using the estimated policy.
2. Observe the new state s' and reward r.
3. Update T and R of the model.
4. Update V at <s, a>:
   V(s) ← maxa [r(s,a) + γ Σs' T(s,a,s') V(s')]
5. Perform k additional updates:
   a) Pick k random states s1, s2, …, sk
   b) Update each V(sj):
      V(sj) ← maxa [r(sj,a) + γ Σs' T(sj,a,s') V(s')]
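A hedged sketch of one Dyna step under the same assumptions as the model-learning sketch above: step(s, a) -> (r, s') is the environment, record, T, and R are the learned-model routines, and states is a list so random states can be sampled. All names are illustrative.

```python
import random

def dyna_step(V, s, actions, states, step, record, T, R, k=10, gamma=0.9):
    """One Dyna step: act greedily, update the model, then do 1 + k value backups."""
    def backup(state):
        # V(state) <- max_a [ R(state, a) + gamma * sum_s' T(state, a, s') V(s') ]
        V[state] = max(R(state, a2) + gamma * sum(T(state, a2, s2) * V[s2] for s2 in states)
                       for a2 in actions)

    # 1. Choose an action using the estimated policy (greedy w.r.t. the current model and V)
    a = max(actions, key=lambda a2: R(s, a2) + gamma * sum(T(s, a2, s2) * V[s2] for s2 in states))
    r, s_next = step(s, a)           # 2. observe the new state and reward
    record(s, a, r, s_next)          # 3. update T and R of the model
    backup(s)                        # 4. update V at the visited state
    for sj in random.sample(states, min(k, len(states))):
        backup(sj)                   # 5. k additional backups at randomly picked states
    return s_next
```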
43
Markov Decision Processes and Reinforcement Learning
• Motivation
• Learning policies through reinforcement
• Nondeterministic MDPs
• Function Approximators
• Model-based Learning
• Summary
44
Ongoing Research
• Handling cases where the state is only partially observable
• Design of optimal exploration strategies
• Extending to continuous actions and states
• Learning and using δ : S × A → S
• Scaling up in the size of the state space
  • Function approximators (neural net instead of table)
  • Generalization
  • Macros
  • Exploiting substructure
• Multiple learners – Multi-agent reinforcement learning
45
Markov Decision Processes (MDPs)
Model:
• Finite set of states, S
• Finite set of actions, A
• Probabilistic state transitions, δ(s,a)
• Reward for each state and action, R(s,a)
Process:
• Observe state st in S
• Choose action at in A
• Receive immediate reward rt
• State changes to st+1
Deterministic Example:
(Grid-world figure: legal transitions shown; transitions into the goal G have reward 10; reward on unlabeled transitions is 0.)
(Interaction trace: s0, r0 →a0→ s1, r1 →a1→ s2, r2 →a2→ s3 …)
46
Crib Sheet: MDPs by Value Iteration
Insight: optimal values can be calculated iteratively using dynamic programming.
Algorithm:
• Iteratively calculate values using Bellman's equation:
  V*t+1(s) ← maxa [r(s,a) + γ V*t(δ(s,a))]
• Terminate when values are "close enough":
  |V*t+1(s) − V*t(s)| < ε
• The agent selects the optimal action by one-step lookahead on V*:
  π*(s) = argmaxa [r(s,a) + γ V*(δ(s,a))]
47
Crib Sheet: Q-Learning for Deterministic Worlds
Let Q̂ denote the current approximation to Q.
Initially:
• For each s, a initialize the table entry Q̂(s, a) ← 0
• Observe the current state s
Do forever:
• Select an action a and execute it
• Receive immediate reward r
• Observe the new state s'
• Update the table entry for Q̂(s, a) as follows:
  Q̂(s, a) ← r + γ maxa' Q̂(s', a')
• s ← s'