Reinforcement Learning
CS 760@UW-Madison
Goals for the lecture
you should understand the following concepts
• the reinforcement learning task
• Markov decision process
• value functions
• Q functions
• value iteration
• Q learning
• exploration vs. exploitation tradeoff
• compact representations of Q functions
• reinforcement learning example
Reinforcement learning (RL)

Task of an agent embedded in an environment:

repeat forever
  1) sense world
  2) reason
  3) choose an action to perform
  4) get feedback (usually a reward in ℝ)
  5) learn
the environment may be the physical world or an artificial one
Reinforcement learning
[figure: agent-environment loop — the environment sends the agent a state and a reward, the agent sends back an action; the interaction unrolls as s0, a0, r0, s1, a1, r1, s2, a2, r2, ...]

• set of states S
• set of actions A
• at each time t, agent observes state st ∈ S, then chooses action at ∈ A
• then receives reward rt and moves to state st+1
RL as Markov decision process (MDP)
• Markov assumption:

    P(st+1 | st, at, st-1, at-1, ...) = P(st+1 | st, at)

• also assume reward is Markovian:

    P(rt+1 | st, at, st-1, at-1, ...) = P(rt+1 | st, at)

Goal: learn a policy π : S → A for choosing actions that maximizes

    E[ rt + γ rt+1 + γ² rt+2 + ... ],   where 0 ≤ γ < 1

for every possible starting state s0
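To make the discounted objective concrete, here is a two-line Python check of the expression above on a short reward sequence (the rewards are made up for illustration):

    gamma = 0.9
    rewards = [0, 0, 100]                                     # two 0-reward steps, then reward 100
    discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
    print(discounted_return)                                  # 0 + 0.9*0 + 0.81*100 = 81.0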
The grid world
[grid world figure: cells with an absorbing goal state G; the arrows between cells are actions, each labeled with its immediate reward — 0 everywhere except 100 for the actions entering G]
• As a running example:
• each arrow represents an action a and the associated number represents the deterministic reward r(s, a)
Value function for a policy
• given a policy π : S → A, define

    Vπ(s) = E[ Σt=0..∞ γ^t rt ]

  assuming the action sequence is chosen according to π starting at state s
• we want the optimal policy π* where

    π* = argmaxπ Vπ(s) for all s

• we'll denote the value function for this optimal policy as V*(s)
Value function for a policy π
• Suppose π is shown by red arrows, γ = 0.9
[grid world figure: the policy π shown by red arrows on the reward grid; the Vπ(s) values are shown in red — 100, 90, 81, 73, 66, and 0 at G]
• the Bellman equation (for deterministic transitions):

    Vπ(s) = r(s, π(s)) + γ Vπ(s')

  where s' is the state reached by taking action π(s) in state s
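As a quick check against the red Vπ numbers in the figure (assuming γ = 0.9 and the rewards shown): the state whose action under π enters G has Vπ = 100 + 0.9 · 0 = 100, and each state one step further back along π satisfies Vπ = 0 + 0.9 · Vπ(next state), giving 90, 81, then roughly 73 and 66.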
Value function for an optimal policy π*
• Suppose π* is shown by red arrows, γ = 0.9
[grid world figure: the optimal policy π* shown by red arrows on the reward grid; the V*(s) values are shown in red — 100, 100, 90, 90, 81, and 0 at G]
• the Bellman equation (for deterministic transitions):

    Vπ(s) = r(s, π(s)) + γ Vπ(s')
Using a value function
If we know V*(s), r(st, a), and P(st | st-1, at-1) we can compute π*(s)
    π*(st) = argmax_{a∈A} [ r(st, a) + γ Σ_{st+1∈S} P(st+1 | st, a) V*(st+1) ]
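A minimal Python sketch of this greedy one-step lookahead, assuming the model is given as nested dictionaries r[s][a] and P[s][a][s_next] (hypothetical names, not from the slides):

    def greedy_policy(V, r, P, gamma, actions):
        """One-step lookahead: pick the action maximizing immediate reward plus
        discounted expected value of the successor state."""
        pi = {}
        for s in V:
            def backup(a):
                return r[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in P[s][a])
            pi[s] = max(actions, key=backup)
        return pi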
Value iteration for learning V*(s)
initialize V(s) arbitrarily
loop until policy good enough {
    loop for s ∈ S {
        loop for a ∈ A {
            Q(s, a) ← r(s, a) + γ Σ_{s'∈S} P(s' | s, a) V(s')
        }
        V(s) ← max_a Q(s, a)
    }
}
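A compact Python sketch of this loop, assuming the same dictionary-based model as above and a fixed number of sweeps standing in for "until policy good enough":

    def value_iteration(states, actions, r, P, gamma, sweeps=100):
        V = {s: 0.0 for s in states}          # arbitrary initialization
        for _ in range(sweeps):               # "loop until policy good enough"
            for s in states:
                # one-step backup for every action, then keep the best
                V[s] = max(r[s][a] + gamma * sum(P[s][a][s2] * V[s2]
                                                 for s2 in P[s][a])
                           for a in actions)
        return V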
Value iteration for learning V*(s)
• V(s) converges to V*(s)
• works even if we randomly traverse environment instead of looping through each state and action methodically
• but we must visit each state infinitely often
• implication: we can do online learning as an agent roams around its environment
• assumes we have a model of the world: i.e. know P(st | st-1, at-1)
• What if we don’t?
Using a Q function
define a new function, closely related to V*:

    V*(s) ← E[ r(s, π*(s)) ] + γ E_{s' | s, π*(s)}[ V*(s') ]
    Q(s, a) ← E[ r(s, a) ] + γ E_{s' | s, a}[ V*(s') ]

if the agent knows Q(s, a), it can choose the optimal action without knowing P(s' | s, a):

    π*(s) ← argmax_a Q(s, a)        V*(s) ← max_a Q(s, a)

and it can learn Q(s, a) without knowing P(s' | s, a)
Q values
[grid world figures for the running example: r(s, a) (immediate reward) values — 0 everywhere except 100 for actions entering G; Q(s, a) values for each state-action pair (e.g. 72, 81, 90, 100); and V*(s) values (e.g. 81, 90, 100)]
Q learning for deterministic worlds
for each s, a initialize table entry Q̂(s, a) ← 0
observe current state s
do forever:
    select an action a and execute it
    receive immediate reward r
    observe the new state s'
    update table entry:  Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
    s ← s'
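A minimal tabular Q-learning sketch in Python for the deterministic case, assuming a hypothetical environment object with step(s, a) -> (s_next, reward) and a purely random behavior policy:

    import random
    from collections import defaultdict

    def q_learning(env, states, actions, gamma=0.9, steps=10000):
        Q = defaultdict(float)                      # Q̂(s, a), initialized to 0
        s = random.choice(states)
        for _ in range(steps):
            a = random.choice(actions)              # behavior policy: explore at random
            s_next, r = env.step(s, a)              # execute a, observe reward and next state
            # deterministic-world update: back up the best Q value of the next state
            Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            s = s_next
        return Q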
Updating Q
[figure: the agent takes action a_right from state s1 to state s2; the current Q̂ estimates for the actions leaving s2 are 63, 81, and 100, and Q̂(s1, a_right) is revised from 72 to 90]

    Q̂(s1, a_right) ← r + γ max_{a'} Q̂(s2, a')
                   ← 0 + 0.9 max{63, 81, 100}
                   ← 90
Q learning for nondeterministic worlds
for each s, a initialize table entry Q̂(s, a) ← 0
observe current state s
do forever:
    select an action a and execute it
    receive immediate reward r
    observe the new state s'
    update table entry:  Q̂_n(s, a) ← (1 − α_n) Q̂_{n−1}(s, a) + α_n [ r + γ max_{a'} Q̂_{n−1}(s', a') ]
    s ← s'

where α_n = 1 / (1 + visits_n(s, a)) is a learning rate that depends on the number of visits to the given (s, a) pair
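The same table update with the visit-count learning rate, as a small Python helper (a sketch; Q and visits are assumed to be dictionaries keyed by (s, a)):

    def q_update(Q, visits, s, a, r, s_next, actions, gamma=0.9):
        visits[(s, a)] = visits.get((s, a), 0) + 1
        alpha = 1.0 / (1 + visits[(s, a)])                        # α_n = 1 / (1 + visits_n(s, a))
        target = r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)
        # blend the old estimate with the new sample to average out stochastic rewards/transitions
        Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * target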
Convergence of Q learning
• Q learning will converge to the correct Q function
• in the deterministic case
• in the nondeterministic case (using the update rule just presented)
• in practice it is likely to take many, many iterations
Q’s vs. V’s
• Which action do we choose when we're in a given state?
• V's (model-based)
    • need to have a 'next state' function to generate all possible next states
    • choose the next state with the highest V value
• Q's (model-free)
    • need only know which actions are legal
    • generally choose the action with the highest Q value

[figure: with V's, values label the candidate next states; with Q's, values label the actions available in the current state]
Exploration vs. Exploitation
• in order to learn about better alternatives, we shouldn’t always follow the current policy (exploitation)
• sometimes, we should select random actions (exploration)
• one way to do this: select actions probabilistically according to

    P(ai | s) = c^Q̂(s, ai) / Σ_j c^Q̂(s, aj)

  where c > 0 is a constant that determines how strongly selection favors actions with higher Q values
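A small Python sketch of this probabilistic selection rule (the base c and the Q dictionary are illustrative inputs, not from the slides):

    import random

    def select_action(Q, s, actions, c=2.0):
        # weight each action by c ** Q̂(s, a); a larger c favors high-Q actions more strongly
        weights = [c ** Q.get((s, a), 0.0) for a in actions]
        total = sum(weights)
        probs = [w / total for w in weights]
        return random.choices(actions, weights=probs, k=1)[0]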
Q learning with a table
As described so far, Q learning entails filling in a huge table
A table is a very verbose way to represent a function
[table figure: columns are states s0, s1, s2, ..., sn; rows are actions a1, a2, a3, ..., ak; each cell holds one entry, e.g. Q(s2, a3), so one column lists Q(s, a1), Q(s, a2), ..., Q(s, ak) for a single state s]
Representing Q functions more compactly
We can use some other function representation (e.g. a neural net) to compactly encode a substitute for the big table
[figure: a neural network whose input layer is an encoding of the state s — each input unit encodes a property of the state (e.g., a sensor value)]
Why use a compact Q function?
1. Full Q table may not fit in memory for realistic problems
2. Can generalize across states, thereby speeding up convergence
   i.e. one instance 'fills' many cells in the Q table

Notes
1. When generalizing across states, cannot use α = 1
2. Convergence proofs only apply to Q tables
3. Some work on bounding errors caused by using compact representations (e.g. Singh & Yee, Machine Learning 1994)
Q tables vs. Q nets

Given: 100 Boolean-valued features, 10 possible actions

Size of Q table: 10 × 2^100 entries

Size of Q net (assume 100 hidden units):
    100 × 100 + 100 × 10 = 11,000 weights
    (weights between inputs and hidden units, plus weights between hidden units and outputs)
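A toy numpy sketch of such a Q net — 100 binary inputs, 100 hidden units, 10 outputs (one Q value per action); the architecture matches the counts above, but the specific code and names are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.1, size=(100, 100))    # input -> hidden weights (10,000)
    W2 = rng.normal(scale=0.1, size=(100, 10))     # hidden -> output weights (1,000)

    def q_net(state_bits):
        """state_bits: length-100 array of 0/1 features; returns 10 Q-value estimates."""
        h = np.tanh(state_bits @ W1)               # hidden layer
        return h @ W2                              # one output per action

    print(W1.size + W2.size)                       # 11000 weights vs. 10 * 2**100 table entries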
Q learning with function approximation
1. measure sensors, sense state s0
2. predict Q̂n(s0, a) for each action a
3. select action a to take (with randomization to ensure exploration)
4. apply action a in the real world
5. sense new state s1 and immediate reward r
6. calculate action a' that maximizes Q̂n(s1, a')
7. train with new instance

       x ← s0
       y ← (1 − α) Q̂n(s0, a) + α [ r + γ max_{a'} Q̂n(s1, a') ]

   i.e. calculate the Q-value you would have put into the Q-table, and use it as the training label
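One pass through this loop as a hedged Python sketch; q_model.predict, q_model.train_on_instance, and env.apply are hypothetical stand-ins for whichever function approximator and environment interface are actually used:

    import random

    def fa_q_learning_step(q_model, env, s0, actions, alpha=0.5, gamma=0.9, eps=0.1):
        q0 = {a: q_model.predict(s0, a) for a in actions}           # step 2
        a = random.choice(actions) if random.random() < eps \
            else max(actions, key=q0.get)                           # step 3: randomized selection
        s1, r = env.apply(a)                                         # steps 4-5
        best_next = max(q_model.predict(s1, a2) for a2 in actions)   # step 6
        y = (1 - alpha) * q0[a] + alpha * (r + gamma * best_next)    # step 7: training label
        q_model.train_on_instance(s0, a, y)
        return s1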
ML example: reinforcement learning to control an autonomous helicopter
video of Stanford University autonomous helicopter from http://heli.stanford.edu/
Stanford autonomous helicopter

sensing the helicopter's state
• orientation sensor
    • accelerometer
    • rate gyro
    • magnetometer
• GPS receiver (“2cm accuracy as long as its antenna is pointing towards the sky”)
• ground-based cameras
actions to control the helicopter
1. Expert pilot demonstrates the airshow several times
2. Learn a reward function based on desired trajectory
3. Learn a dynamics model
4. Find the optimal control policy for the learned reward and dynamics model
5. Autonomously fly the airshow
6. Learn an improved dynamics model. Go back to step 4
Experimental setup for helicopter
Learning dynamics model P(st+1 | st, a)
• state represented by the helicopter's
    position (x, y, z)
    velocity (ẋ, ẏ, ż)
    angular velocity (ωx, ωy, ωz)
• action represented by manipulations of the 4 controls (u1, u2, u3, u4)
• dynamics model predicts accelerations as a function of current state and actions
• accelerations are integrated to compute the predicted next state
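A minimal sketch of that integration step, with the learned dynamics model passed in as a function (the model here is a stand-in, not the actual helicopter model):

    import numpy as np

    def predict_next_state(predict_acc, pos, vel, controls, dt=0.01):
        """predict_acc is the learned dynamics model: (pos, vel, controls) -> accelerations."""
        acc = predict_acc(pos, vel, controls)     # model predicts accelerations
        vel_next = vel + dt * acc                 # integrate acceleration into velocity
        pos_next = pos + dt * vel_next            # integrate velocity into position
        return pos_next, vel_next

    # toy usage with a placeholder model that ignores the controls and applies gravity only
    g = np.array([0.0, 0.0, -9.8])
    pos, vel = np.zeros(3), np.zeros(3)
    pos, vel = predict_next_state(lambda p, v, u: g, pos, vel, controls=np.zeros(4))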
Learning dynamics model P(st+1 | st, a)
• A, B, C, D represent model parameters
• g represents the gravity vector
• the w's are random variables representing noise and unmodeled effects
• linear regression task!

[figure: the dynamics model equations expressing the predicted body-frame accelerations (e.g. z̈b) in terms of the current state, controls, gravity, and noise terms]
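Since each predicted acceleration is modeled as a linear function of state and control features, fitting the parameters is ordinary least squares; a toy numpy sketch with made-up feature and target arrays standing in for logged flight data:

    import numpy as np

    # X: one row of state/control features per time step; y: observed acceleration (e.g. z̈b)
    X = np.random.randn(500, 8)                   # placeholder logged features
    true_w = np.random.randn(8)
    y = X @ true_w + 0.01 * np.random.randn(500)  # placeholder logged accelerations

    w, *_ = np.linalg.lstsq(X, y, rcond=None)     # linear regression: fitted model parameters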
Learning a desired trajectory
• repeated expert demonstrations are often suboptimal in different ways
• given a set of M demonstrated trajectories

      y_j^k = [ s_j^k ; u_j^k ]   for j = 0, ..., N−1 and k = 0, ..., M−1

  where s_j^k is the state and u_j^k the action on the jth step of trajectory k
• try to infer the implicit desired trajectory

      z_t = [ s_t* ; u_t* ]   for t = 0, ..., H
Learning a desired trajectory
Figure from Coates et al., CACM 2009
colored lines: demonstrations of two loops
black line: inferred trajectory
Learning reward function
• EM is used to infer desired trajectory from set of demonstrated trajectories
• The reward function is based on deviations from the desired trajectory
Finding the optimal control policy
• finding the control policy is a reinforcement learning task
• RL learning methods described earlier don’t quite apply because state and action spaces are both continuous
• A special type of Markov decision process in which the optimal policy can be found efficiently
    • reward is represented as a quadratic function of the state and action vectors
    • next state is represented as a linear function of the current state and action vectors
• They use an iterative approach that finds an approximate solution, since the learned reward and dynamics only approximately fit this linear-quadratic form
    π* ← argmax_π E[ Σ_t r(st, at) | π ]
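For the linear-quadratic setting sketched above, the finite-horizon optimal policy can be computed exactly by a backward (Riccati) recursion. The following is a generic numpy sketch of that recursion, written as cost minimization (maximizing reward = minimizing cost); it is not the authors' actual helicopter controller:

    import numpy as np

    def lqr_policy(A, B, Q, R, H):
        """Dynamics x' = A x + B u, per-step cost x'Qx + u'Ru, horizon H.
        Returns feedback gains K_t so the optimal action is u_t = -K_t x_t."""
        P = Q.copy()                        # cost-to-go matrix at the final step
        gains = []
        for _ in range(H):
            K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
            P = Q + A.T @ P @ (A - B @ K)   # propagate cost-to-go one step backward
            gains.append(K)
        return list(reversed(gains))        # gains[t] is K_t for time step t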
THANK YOU

Some of the slides in these lectures have been adapted/borrowed from materials developed by Yingyu Liang, Mark Craven, David Page, Jude Shavlik, Tom Mitchell, Nina Balcan, Elad Hazan, Tom Dietterich, and Pedro Domingos.