REINFORCEMENT LEARNING
AGENDA
• Online learning
• Reinforcement learning
• Model-free vs. model-based
• Passive vs. active learning
• Exploration-exploitation tradeoff
INCREMENTAL ("ONLINE") FUNCTION LEARNING
Data is streaming into the learner: x1,y1, …, xn,yn with yi = f(xi)
The learner observes xn+1 and must make a prediction for the next time step, yn+1
"Batch" approach:
• Store all data up to step n
• Use your learner of choice on all data up to time n, predict for time n+1
Can we do this using less memory?
EXAMPLE: MEAN ESTIMATION
yi = θ + error term (no x's). Current estimate: θn = (1/n) Σi=1…n yi
θn+1 = 1/(n+1) Σi=1…n+1 yi
     = 1/(n+1) (yn+1 + Σi=1…n yi)
     = 1/(n+1) (yn+1 + n θn)
     = θn + 1/(n+1) (yn+1 − θn)
For example, θ6 = 5/6 θ5 + 1/6 y6
Only need to store n and θn
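The incremental update above can be sketched in a few lines; this is a minimal illustration (the function name and sample data are invented):

```python
# Incremental mean: theta_{n+1} = theta_n + 1/(n+1) * (y_{n+1} - theta_n).
# Stores only the count n and the running estimate theta, never the data.
def incremental_mean(ys):
    theta, n = 0.0, 0
    for y in ys:
        theta += (y - theta) / (n + 1)   # shift estimate toward the new sample
        n += 1
    return theta

data = [2.0, 4.0, 6.0, 8.0]
print(incremental_mean(data))  # 5.0, identical to the batch mean
```

The result matches the batch mean exactly, since the derivation above is just an algebraic rearrangement of it.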
LEARNING RATES
In fact, θn+1 = θn + αn (yn+1 − θn) converges to the mean for any sequence αn such that:
• αn → 0 as n → ∞
• Σn αn = ∞
• Σn αn² ≤ C < ∞
αn = O(1/n) does the trick. If αn is close to 1, then the estimate shifts strongly toward recent data; close to 0, and the old estimate is preserved.
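A small sketch of that sensitivity (data and step schedules invented for illustration): a decaying rate αn = 1/(n+1) recovers the running mean, while a large constant rate chases the most recent sample.

```python
# Online estimate theta <- theta + a_n * (y - theta) under two step-size schedules.
def online_estimate(ys, step):
    theta = 0.0
    for n, y in enumerate(ys):
        theta += step(n) * (y - theta)
    return theta

ys = [10.0] * 50 + [0.0]                                 # one outlier at the end
decayed = online_estimate(ys, lambda n: 1.0 / (n + 1))   # satisfies the conditions
recent = online_estimate(ys, lambda n: 0.9)              # constant, near 1
print(decayed)  # ~9.80: close to the long-run mean 500/51
print(recent)   # ~1.0: dominated by the final sample
```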
REINFORCEMENT LEARNING
RL problem: given only observations of actions, states, and rewards, learn a (near-)optimal policy
No prior knowledge of the transition or reward models
We consider: a fully-observable, episodic environment with a finite state space and uncertainty in action outcomes (an MDP)
WHAT TO LEARN?
Learn:   Model of R and T | Utility function U | Action-utility function Q(s,a) | Policy π
Online:  Solve MDP | arg maxa Σs' P(s'|s,a) U(s') | arg maxa Q(s,a) | π(s)
Method:  Adaptive dynamic programming | Direct utility estimation, TD-learning | Q-learning, SARSA | Learning from demonstration
Model-based (more online deliberation, faster learning?) ←→ Model-free (less online deliberation, simpler execution)
FIRST STEPS: PASSIVE RL
Observe execution trials of an agent that acts according to some unobserved policy π
Problem: estimate the utility function Uπ
[Recall Uπ(s) = E[Σt γt R(St)], where St is the random variable denoting the state at time t]
DIRECT UTILITY ESTIMATION
1. Observe trials τ(i) = (s0(i), a1(i), s1(i), r1(i), …, ati(i), sti(i), rti(i)) for i = 1,…,n
2. For each state s ∈ S:
3.   Find all trials τ(i) that pass through s
4.   Compute the subsequent utility Uτ(i)(s) = Σt=k…ti γt−k rt(i)
5.   Set Uπ(s) to the average observed utility
[Figure: 4×3 grid world with terminal rewards +1 and −1. Utilities start at 0; the averaged returns converge to values such as 0.39, 0.61, 0.66, 0.71 in the outer states and 0.76, 0.81, 0.87, 0.92 approaching the +1 terminal.]
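The steps above amount to every-visit Monte Carlo averaging of discounted returns; here is a minimal sketch on made-up trials (state names, rewards, and γ = 0.9 are all invented for illustration):

```python
# Direct utility estimation: average, over trials, the discounted return
# observed from each visit to a state. A trial is a list of (state, reward) pairs.
from collections import defaultdict

def direct_utility(trials, gamma=0.9):
    returns = defaultdict(list)
    for trial in trials:
        for k, (s, _) in enumerate(trial):
            # discounted sum of rewards from step k to the end of the trial
            u = sum(gamma ** (t - k) * r for t, (_, r) in enumerate(trial) if t >= k)
            returns[s].append(u)
    return {s: sum(us) / len(us) for s, us in returns.items()}

trials = [[("A", 0.0), ("B", 1.0)], [("A", 0.0), ("B", 1.0)]]
U = direct_utility(trials)
print(U["B"])  # 1.0
print(U["A"])  # 0.0 + 0.9 * 1.0 = 0.9
```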
ONLINE IMPLEMENTATION
1. Store counts N[s] and estimated utilities Uπ(s)
2. After a trial τ, for each state s in the trial:
3.   Set N[s] ← N[s] + 1
4.   Adjust utility Uπ(s) ← Uπ(s) + α(N[s]) (Uτ(s) − Uπ(s))
• Simply supervised learning on trials
• Slow learning, because the Bellman equation is not used to pass knowledge between adjacent states
TEMPORAL DIFFERENCE LEARNING
1. Store counts N[s] and estimated utilities Uπ(s)
2. For each observed transition (s,r,a,s'):
3.   Set N[s] ← N[s] + 1
4.   Adjust utility Uπ(s) ← Uπ(s) + α(N[s]) (r + γ Uπ(s') − Uπ(s))
[Animation: 4×3 grid world with learning rate α = 0.5. Utilities start at 0; each observed transition nudges the visited state (e.g., −0.02 for a step cost, 0.48 after reaching the +1 terminal), and over successive trials the estimates propagate back from the goal (0.48 → 0.72 → 0.84, with neighboring states rising through 0.21, 0.44, 0.62).]
• For any s, the distribution of s' approaches P(s'|s,π(s))
• Uses relationships between adjacent states to adjust utilities toward equilibrium
• Faster learning than direct estimation
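The TD update above can be sketched directly; the transition stream below is invented (a tiny two-state chain into a +1 terminal), with α fixed at 0.5 and γ = 1 as in the grid-world example:

```python
# TD(0) policy evaluation from a stream of observed transitions (s, r, a, s').
from collections import defaultdict

def td_evaluate(transitions, alpha=0.5, gamma=1.0):
    U = defaultdict(float)                         # U_pi(s), initialized to 0
    for s, r, _a, s_next in transitions:
        # nudge U(s) toward the one-step sample r + gamma * U(s')
        U[s] += alpha * (r + gamma * U[s_next] - U[s])
    return U

# two passes through a tiny chain ending in a +1 terminal state
stream = [("s1", -0.04, "right", "s2"), ("s2", 1.0, "right", "goal"),
          ("s1", -0.04, "right", "s2"), ("s2", 1.0, "right", "goal")]
U = td_evaluate(stream)
print(U["s2"])  # rises toward 1.0 over repeated visits (0.5, then 0.75)
```

Note how knowledge flows backward: once U("s2") rises, the next update to U("s1") benefits through the r + γU(s') term.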
“OFFLINE” INTERPRETATION OF TD LEARNING
1. Observe trials τ(i) = (s0(i), a1(i), s1(i), r1(i), …, ati(i), sti(i), rti(i)) for i = 1,…,n
2. For each state s ∈ S:
3.   Find all trials τ(i) that pass through s
4.   Extract the local history (s, r(i), a(i), s'(i)) from each trial
5.   Set up the constraint Uπ(s) = r(i) + γ Uπ(s'(i))
6. Solve all constraints in least-squares fashion using stochastic gradient descent
[Recall the linear system in policy iteration: u = r + γ Tπ u]
[Figure: 4×3 grid world; observed trials with utilities initialized to 0 at left, unknown utilities (marked "?") to be solved for at right.]
ADAPTIVE DYNAMIC PROGRAMMING
1. Store counts N[s], N[s,a], N[s,a,s'], estimated rewards R(s), and transition model P(s'|s,a)
2. For each observed transition (s,r,a,s'):
3.   Set N[s] ← N[s]+1, N[s,a] ← N[s,a]+1, N[s,a,s'] ← N[s,a,s']+1
4.   Adjust reward R(s) ← R(s) + α(N[s]) (r − R(s))
5.   Set P(s'|s,a) = N[s,a,s'] / N[s,a]
6. Solve policy evaluation using P, R, π
• Faster learning than TD, because the Bellman equation is exploited across all states
• Modified policy evaluation algorithms make updates faster than solving the linear system (O(n³))
[Figure: learned model in the 4×3 grid world — estimated rewards R(s) = −0.04 for visited nonterminal states ("?" where unvisited), terminals +1 and −1, and estimated transition probabilities P(s'|s,a).]
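The bookkeeping in steps 1–5 can be sketched as follows (states, actions, and rewards invented; the policy-evaluation step 6 is omitted):

```python
# ADP model learning: count-based estimates of R(s) and P(s'|s,a).
from collections import defaultdict

class ADP:
    def __init__(self):
        self.N_s = defaultdict(int)        # N[s]
        self.N_sa = defaultdict(int)       # N[s,a]
        self.N_sas = defaultdict(int)      # N[s,a,s']
        self.R = defaultdict(float)        # estimated rewards

    def observe(self, s, r, a, s_next):
        self.N_s[s] += 1
        self.N_sa[(s, a)] += 1
        self.N_sas[(s, a, s_next)] += 1
        alpha = 1.0 / self.N_s[s]             # decaying learning rate
        self.R[s] += alpha * (r - self.R[s])  # running-mean reward estimate

    def P(self, s_next, s, a):
        return self.N_sas[(s, a, s_next)] / self.N_sa[(s, a)]

adp = ADP()
adp.observe("s1", -0.04, "up", "s2")
adp.observe("s1", -0.04, "up", "s1")    # action slipped: stayed in s1
print(adp.P("s2", "s1", "up"))          # 0.5
print(adp.R["s1"])                      # -0.04
```

With α(N[s]) = 1/N[s], step 4 is exactly the incremental mean from earlier, so R(s) converges to the true expected reward.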
ACTIVE RL
Rather than assume a policy is given, can we use the learned utilities to pick good actions?
At each state s, the agent must learn outcomes for all actions, not just the action π(s)
GREEDY RL
Maintain current estimates Uπ(s)
Idea: at state s, take the action a that maximizes Σs' P(s'|s,a) Uπ(s')
Very seldom works well! Why?
EXPLORATION VS. EXPLOITATION
The greedy strategy purely exploits its current knowledge
The quality of this knowledge improves only for those states that the agent observes often
A good learner must perform exploration in order to improve its knowledge about states that are not observed often
But pure exploration is useless (and costly) if it is never exploited
RESTAURANT PROBLEM
OPTIMISTIC EXPLORATION STRATEGY
Behave initially as if there were wonderful rewards scattered all over the place
Define a modified, optimistic Bellman update:
U+(s) ← R(s) + γ maxa f( Σs' P(s'|s,a) U+(s'), N[s,a] )
Truncated exploration function:
f(u,n) = R+ if n < Ne, u otherwise
[Here the agent will try each action in each state at least Ne times]
COMPLEXITY
Truncated: at least Ne·|S|·|A| steps are needed to explore every action in every state
Some costly explorations might not be necessary, or the reward from far-off explorations may be heavily discounted
Convergence to the optimal policy is guaranteed only if each action is tried in each state an infinite number of times!
This works with ADP… but how do we perform action selection in TD? We would also have to learn the transition model P(s'|s,a)
Q-VALUES
Learning U is not enough for action selection, because a transition model is needed
Solution: learn Q-values. Q(s,a) is the utility of choosing action a in state s
Shift the Bellman equation:
U(s) = maxa Q(s,a)
Q(s,a) = R(s) + γ Σs' P(s'|s,a) maxa' Q(s',a')
So far, everything is the same… but what about the learning rule?
Q-LEARNING UPDATE
Recall TD:
  Update: U(s) ← U(s) + α(N[s]) (r + γ U(s') − U(s))
  Select action: a ← arg maxa f( Σs' P(s'|s,a) U(s'), N[s,a] )
Q-learning:
  Update: Q(s,a) ← Q(s,a) + α(N[s,a]) (r + γ maxa' Q(s',a') − Q(s,a))
  Select action: a ← arg maxa f( Q(s,a), N[s,a] )
Key difference: the average over P(s'|s,a) is "baked in" to the Q function
Q-learning is therefore a model-free active learner
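A minimal sketch of the Q-learning update on an invented deterministic two-state chain; simple ε-greedy selection stands in for the exploration function f, and all dynamics are made up for illustration:

```python
# Q-learning on a tiny chain s1 -> s2 -> goal; "right" advances, "left" stays.
import random
from collections import defaultdict

def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)
    actions = ["left", "right"]
    for _ in range(episodes):
        s = "s1"
        while s != "goal":
            # epsilon-greedy: explore occasionally, otherwise exploit Q
            if rng.random() < 0.1:
                a = rng.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            # invented deterministic dynamics
            s_next = {"s1": "s2", "s2": "goal"}[s] if a == "right" else s
            r = 1.0 if s_next == "goal" else 0.0
            best_next = max(Q[(s_next, a2)] for a2 in actions)
            # model-free update: no P(s'|s,a) anywhere
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q

Q = q_learning()
print(Q[("s2", "right")])  # near 1.0: immediate goal reward
print(Q[("s1", "right")])  # near 0.9: one step, discounted by gamma
```

Note that the update touches only observed (s, a, r, s') tuples; the expectation over successor states is absorbed into Q through sampling, exactly as the slide claims.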
MORE ISSUES IN RL
• Model-free vs. model-based: model-based techniques are typically better at incorporating prior knowledge
• Generalization: value function approximation, policy search methods
LARGE-SCALE APPLICATIONS
• Game playing: TD-Gammon, a neural-network position evaluator trained via self-play
• Robot control
RECAP
• Online learning: low memory overhead
• TD-learning: learn U through incremental updates. Cheap, but somewhat slow learning.
• ADP: learn P and R, derive U through policy evaluation. Fast learning but computationally expensive.
• Q-learning: learn the state-action function Q(s,a), which allows model-free action selection
• Action selection requires trading off exploration vs. exploitation. Infinite exploration is needed to guarantee that the optimal policy is found!
INCREMENTAL LEAST SQUARES
Recall the least-squares estimate:
θ = (AᵀA)⁻¹ Aᵀ b
where A is the N×M matrix whose rows are the x(i)'s and b is the N×1 vector of the y(i)'s.
DELTA RULE FOR LINEAR LEAST SQUARES
Delta rule (Widrow-Hoff rule): stochastic gradient descent
θ(t+1) = θ(t) + α x (y − θ(t)ᵀx)
O(M) time and space per update
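A minimal sketch of the delta rule on invented noise-free data generated from known weights, so convergence can be checked; the step size and epoch count are arbitrary choices:

```python
# Delta rule for a 2-feature linear fit: theta <- theta + alpha * x * (y - theta^T x).
def delta_rule(data, alpha=0.1, epochs=200):
    theta = [0.0, 0.0]
    for _ in range(epochs):
        for x, y in data:
            pred = theta[0] * x[0] + theta[1] * x[1]    # theta^T x
            for j in range(2):
                theta[j] += alpha * x[j] * (y - pred)   # stochastic gradient step
    return theta

# data consistent with y = 2*x0 + 3*x1
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], 3.0), ([1.0, 1.0], 5.0)]
theta = delta_rule(data)
print(theta)  # approaches [2.0, 3.0]
```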
INCREMENTAL LEAST SQUARES
Let A(t), b(t) be the A matrix and b vector up to time t, so that
θ(t) = (A(t)ᵀA(t))⁻¹ A(t)ᵀ b(t)
A(t+1) appends the row x(t+1)ᵀ to A(t) (now (t+1)×M), and b(t+1) appends y(t+1) to b(t) (now (t+1)×1). Hence:
A(t+1)ᵀ b(t+1) = A(t)ᵀ b(t) + y(t+1) x(t+1)
A(t+1)ᵀ A(t+1) = A(t)ᵀ A(t) + x(t+1) x(t+1)ᵀ
Sherman-Morrison update:
(Y + xxᵀ)⁻¹ = Y⁻¹ − Y⁻¹ xxᵀ Y⁻¹ / (1 + xᵀ Y⁻¹ x)
INCREMENTAL LEAST SQUARES
Putting it all together. Store:
p(t) = A(t)ᵀ b(t)
Q(t) = (A(t)ᵀ A(t))⁻¹
Update:
p(t+1) = p(t) + y x
Q(t+1) = Q(t) − Q(t) xxᵀ Q(t) / (1 + xᵀ Q(t) x)
θ(t+1) = Q(t+1) p(t+1)
O(M²) time and space, instead of O(M³ + MN) time and O(MN) space for batch least squares
True least-squares estimate for any t (the delta rule works well only for large t)
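The full recursive scheme can be sketched as below. One assumption to flag: before enough samples arrive, AᵀA is singular, so Q is initialized to a large multiple of the identity — a standard regularization trick, not part of the derivation above. Data is invented.

```python
# Recursive least squares via the Sherman-Morrison update.
import numpy as np

def recursive_least_squares(xs, ys, dim):
    Q = np.eye(dim) * 1e6          # stands in for (A^T A)^{-1} of an empty A
    p = np.zeros(dim)              # running A^T b
    for x, y in zip(xs, ys):
        p = p + y * x
        Qx = Q @ x
        Q = Q - np.outer(Qx, Qx) / (1.0 + x @ Qx)   # Sherman-Morrison
    return Q @ p                   # theta = (A^T A)^{-1} A^T b

# data consistent with y = 2*x0 + 3*x1
xs = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([1.0, 1.0])]
ys = [2.0, 3.0, 5.0]
print(recursive_least_squares(xs, ys, 2))  # close to [2.0, 3.0]
```

Each update costs O(M²) for the outer product, matching the complexity claim, and the answer agrees with batch least squares up to the tiny initial regularization.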