Reinforcement Learning
CS 760@UW-Madison
Goals for the lecture
you should understand the following concepts
• the reinforcement learning task
• Markov decision process
• value functions
• Q functions
• value iteration
• Q learning
• exploration vs. exploitation tradeoff
• compact representations of Q functions
• reinforcement learning example
Reinforcement learning (RL)

Task of an agent embedded in an environment:

repeat forever
  1) sense world
  2) reason
  3) choose an action to perform
  4) get feedback (usually a reward in ℝ)
  5) learn
the environment may be the physical world or an artificial one
Reinforcement learning
[figure: agent-environment loop — the environment sends the agent a state and a reward, the agent sends back an action; the interaction unrolls as s0, a0, r0, s1, a1, r1, s2, a2, r2, ...]

• set of states S
• set of actions A
• at each time t, agent observes state st ∈ S, then chooses action at ∈ A
• then receives reward rt and moves to state st+1
RL as Markov decision process (MDP)
• Markov assumption:

    P(st+1 | st, at, st-1, at-1, ...) = P(st+1 | st, at)

• also assume reward is Markovian:

    P(rt+1 | st, at, st-1, at-1, ...) = P(rt+1 | st, at)

Goal: learn a policy π : S → A for choosing actions that maximizes

    E[ rt + γ rt+1 + γ² rt+2 + ... ],   where 0 ≤ γ < 1

for every possible starting state s0
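To make the discounted objective concrete, here is a two-line Python check of the expression above on a short reward sequence (the rewards are made up for illustration):

    gamma = 0.9
    rewards = [0, 0, 100]                                     # two 0-reward steps, then reward 100
    discounted_return = sum(gamma**t * r for t, r in enumerate(rewards))
    print(discounted_return)                                  # 0 + 0.9*0 + 0.81*100 = 81.0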
The grid world
[grid world figure: cells with an absorbing goal state G; the arrows between cells are actions, each labeled with its immediate reward — 0 everywhere except 100 for the actions entering G]
• As a running example:
• each arrow represents an action a and the associated number represents the deterministic reward r(s, a)
Value function for a policy
• given a policy π : S → A, define

    Vπ(s) = E[ Σt=0..∞ γ^t rt ]

  assuming the action sequence is chosen according to π starting at state s
• we want the optimal policy π* where

    π* = argmaxπ Vπ(s) for all s

• we'll denote the value function for this optimal policy as V*(s)
Value function for a policy π
• Suppose π is shown by red arrows, γ = 0.9
[grid world figure: the policy π shown by red arrows on the reward grid; the Vπ(s) values are shown in red — 100, 90, 81, 73, 66, and 0 at G]
• the Bellman equation (for deterministic transitions):

    Vπ(s) = r(s, π(s)) + γ Vπ(s')

  where s' is the state reached by taking action π(s) in state s
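As a quick check against the red Vπ numbers in the figure (assuming γ = 0.9 and the rewards shown): the state whose action under π enters G has Vπ = 100 + 0.9 · 0 = 100, and each state one step further back along π satisfies Vπ = 0 + 0.9 · Vπ(next state), giving 90, 81, then roughly 73 and 66.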
Value function for an optimal policy π*
• Suppose π* is shown by red arrows, γ = 0.9
[grid world figure: the optimal policy π* shown by red arrows on the reward grid; the V*(s) values are shown in red — 100, 100, 90, 90, 81, and 0 at G]
• the Bellman equation (for deterministic transitions):

    Vπ(s) = r(s, π(s)) + γ Vπ(s')
Using a value function
If we know V*(s), r(st, a), and P(st | st-1, at-1) we can compute π*(s)
    π*(st) = argmax_{a∈A} [ r(st, a) + γ Σ_{st+1∈S} P(st+1 | st, a) V*(st+1) ]
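A minimal Python sketch of this greedy one-step lookahead, assuming the model is given as nested dictionaries r[s][a] and P[s][a][s_next] (hypothetical names, not from the slides):

    def greedy_policy(V, r, P, gamma, actions):
        """One-step lookahead: pick the action maximizing immediate reward plus
        discounted expected value of the successor state."""
        pi = {}
        for s in V:
            def backup(a):
                return r[s][a] + gamma * sum(P[s][a][s2] * V[s2] for s2 in P[s][a])
            pi[s] = max(actions, key=backup)
        return pi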
Value iteration for learning V*(s)
initialize V(s) arbitrarily
loop until policy good enough {
    loop for s ∈ S {
        loop for a ∈ A {
            Q(s, a) ← r(s, a) + γ Σ_{s'∈S} P(s' | s, a) V(s')
        }
        V(s) ← max_a Q(s, a)
    }
}
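A compact Python sketch of this loop, assuming the same dictionary-based model as above and a fixed number of sweeps standing in for "until policy good enough":

    def value_iteration(states, actions, r, P, gamma, sweeps=100):
        V = {s: 0.0 for s in states}          # arbitrary initialization
        for _ in range(sweeps):               # "loop until policy good enough"
            for s in states:
                # one-step backup for every action, then keep the best
                V[s] = max(r[s][a] + gamma * sum(P[s][a][s2] * V[s2]
                                                 for s2 in P[s][a])
                           for a in actions)
        return V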
Value iteration for learning V*(s)
• V(s) converges to V*(s)
• works even if we randomly traverse environment instead of looping through each state and action methodically
• but we must visit each state infinitely often
• implication: we can do online learning as an agent roams around its environment
• assumes we have a model of the world: i.e. know P(st | st-1, at-1)
• What if we don’t?
Using a Q function
define a new function, closely related to V*:

    V*(s) ← E[ r(s, π*(s)) ] + γ E_{s' | s, π*(s)}[ V*(s') ]
    Q(s, a) ← E[ r(s, a) ] + γ E_{s' | s, a}[ V*(s') ]

if the agent knows Q(s, a), it can choose the optimal action without knowing P(s' | s, a):

    π*(s) ← argmax_a Q(s, a)        V*(s) ← max_a Q(s, a)

and it can learn Q(s, a) without knowing P(s' | s, a)
Q values
[grid world figures for the running example: r(s, a) (immediate reward) values — 0 everywhere except 100 for actions entering G; Q(s, a) values for each state-action pair (e.g. 72, 81, 90, 100); and V*(s) values (e.g. 81, 90, 100)]
Q learning for deterministic worlds
for each s, a initialize table entry Q̂(s, a) ← 0
observe current state s
do forever:
    select an action a and execute it
    receive immediate reward r
    observe the new state s'
    update table entry:  Q̂(s, a) ← r + γ max_{a'} Q̂(s', a')
    s ← s'
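A minimal tabular Q-learning sketch in Python for the deterministic case, assuming a hypothetical environment object with step(s, a) -> (s_next, reward) and a purely random behavior policy:

    import random
    from collections import defaultdict

    def q_learning(env, states, actions, gamma=0.9, steps=10000):
        Q = defaultdict(float)                      # Q̂(s, a), initialized to 0
        s = random.choice(states)
        for _ in range(steps):
            a = random.choice(actions)              # behavior policy: explore at random
            s_next, r = env.step(s, a)              # execute a, observe reward and next state
            # deterministic-world update: back up the best Q value of the next state
            Q[(s, a)] = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
            s = s_next
        return Q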
Updating Q
[figure: the agent takes action a_right from state s1 to state s2; the current Q̂ estimates for the actions leaving s2 are 63, 81, and 100, and Q̂(s1, a_right) is revised from 72 to 90]

    Q̂(s1, a_right) ← r + γ max_{a'} Q̂(s2, a')
                   ← 0 + 0.9 max{63, 81, 100}
                   ← 90
Q learning for nondeterministic worlds
for each s, a initialize table entry Q̂(s, a) ← 0
observe current state s
do forever:
    select an action a and execute it
    receive immediate reward r
    observe the new state s'
    update table entry:  Q̂_n(s, a) ← (1 − α_n) Q̂_{n−1}(s, a) + α_n [ r + γ max_{a'} Q̂_{n−1}(s', a') ]
    s ← s'

where α_n = 1 / (1 + visits_n(s, a)) is a learning rate that depends on the number of visits to the given (s, a) pair
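The same table update with the visit-count learning rate, as a small Python helper (a sketch; Q and visits are assumed to be dictionaries keyed by (s, a)):

    def q_update(Q, visits, s, a, r, s_next, actions, gamma=0.9):
        visits[(s, a)] = visits.get((s, a), 0) + 1
        alpha = 1.0 / (1 + visits[(s, a)])                        # α_n = 1 / (1 + visits_n(s, a))
        target = r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions)
        # blend the old estimate with the new sample to average out stochastic rewards/transitions
        Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * target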
Convergence of Q learning
• Q learning will converge to the correct Q function
• in the deterministic case
• in the nondeterministic case (using the update rule just presented)
• in practice it is likely to take many, many iterations
Q’s vs. V’s
• Which action do we choose when we're in a given state?
• V's (model-based)
    • need to have a 'next state' function to generate all possible next states
    • choose the next state with the highest V value
• Q's (model-free)
    • need only know which actions are legal
    • generally choose the action with the highest Q value

[figure: with V's, values label the candidate next states; with Q's, values label the actions available in the current state]
Exploration vs. Exploitation
• in order to learn about better alternatives, we shouldn’t always follow the current policy (exploitation)
• sometimes, we should select random actions (exploration)
• one way to do this: select actions probabilistically according to

    P(ai | s) = c^Q̂(s, ai) / Σ_j c^Q̂(s, aj)

  where c > 0 is a constant that determines how strongly selection favors actions with higher Q values
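A small Python sketch of this probabilistic selection rule (the base c and the Q dictionary are illustrative inputs, not from the slides):

    import random

    def select_action(Q, s, actions, c=2.0):
        # weight each action by c ** Q̂(s, a); a larger c favors high-Q actions more strongly
        weights = [c ** Q.get((s, a), 0.0) for a in actions]
        total = sum(weights)
        probs = [w / total for w in weights]
        return random.choices(actions, weights=probs, k=1)[0]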
Q learning with a table
As described so far, Q learning entails filling in a huge table
A table is a very verbose way to represent a function
[table figure: columns are states s0, s1, s2, ..., sn; rows are actions a1, a2, a3, ..., ak; each cell holds one entry, e.g. Q(s2, a3), so one column lists Q(s, a1), Q(s, a2), ..., Q(s, ak) for a single state s]
Representing Q functions more compactly
We can use some other function representation (e.g. a neural net) to compactly encode a substitute for the big table
[figure: a neural network whose input layer is an encoding of the state s — each input unit encodes a property of the state (e.g., a sensor value)]
Why use a compact Q function?
1. Full Q table may not fit in memory for realistic problems
2. Can generalize across states, thereby speeding up convergence
   i.e. one instance 'fills' many cells in the Q table

Notes
1. When generalizing across states, cannot use α = 1
2. Convergence proofs only apply to Q tables
3. Some work on bounding errors caused by using compact representations (e.g. Singh & Yee, Machine Learning 1994)
Q tables vs. Q nets

Given: 100 Boolean-valued features, 10 possible actions

Size of Q table: 10 × 2^100 entries

Size of Q net (assume 100 hidden units):
    100 × 100 + 100 × 10 = 11,000 weights
    (weights between inputs and hidden units, plus weights between hidden units and outputs)
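A toy numpy sketch of such a Q net — 100 binary inputs, 100 hidden units, 10 outputs (one Q value per action); the architecture matches the counts above, but the specific code and names are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.normal(scale=0.1, size=(100, 100))    # input -> hidden weights (10,000)
    W2 = rng.normal(scale=0.1, size=(100, 10))     # hidden -> output weights (1,000)

    def q_net(state_bits):
        """state_bits: length-100 array of 0/1 features; returns 10 Q-value estimates."""
        h = np.tanh(state_bits @ W1)               # hidden layer
        return h @ W2                              # one output per action

    print(W1.size + W2.size)                       # 11000 weights vs. 10 * 2**100 table entries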
Q learning with function approximation
1. measure sensors, sense state s0
2. predict Q̂n(s0, a) for each action a
3. select action a to take (with randomization to ensure exploration)
4. apply action a in the real world
5. sense new state s1 and immediate reward r
6. calculate action a' that maximizes Q̂n(s1, a')
7. train with new instance

       x ← s0
       y ← (1 − α) Q̂n(s0, a) + α [ r + γ max_{a'} Q̂n(s1, a') ]

   i.e. calculate the Q-value you would have put into the Q-table, and use it as the training label
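One pass through this loop as a hedged Python sketch; q_model.predict, q_model.train_on_instance, and env.apply are hypothetical stand-ins for whichever function approximator and environment interface are actually used:

    import random

    def fa_q_learning_step(q_model, env, s0, actions, alpha=0.5, gamma=0.9, eps=0.1):
        q0 = {a: q_model.predict(s0, a) for a in actions}           # step 2
        a = random.choice(actions) if random.random() < eps \
            else max(actions, key=q0.get)                           # step 3: randomized selection
        s1, r = env.apply(a)                                         # steps 4-5
        best_next = max(q_model.predict(s1, a2) for a2 in actions)   # step 6
        y = (1 - alpha) * q0[a] + alpha * (r + gamma * best_next)    # step 7: training label
        q_model.train_on_instance(s0, a, y)
        return s1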
ML example: reinforcement learning to control an autonomous helicopter
video of Stanford University autonomous helicopter from http://heli.stanford.edu/
Stanford autonomous helicopter

sensing the helicopter's state
• orientation sensor
    • accelerometer
    • rate gyro
    • magnetometer
• GPS receiver (“2cm accuracy as long as its antenna is pointing towards the sky”)
• ground-based cameras
actions to control the helicopter
1. Expert pilot demonstrates the airshow several times
2. Learn a reward function based on desired trajectory
3. Learn a dynamics model
4. Find the optimal control policy for the learned reward and dynamics model
5. Autonomously fly the airshow
6. Learn an improved dynamics model. Go back to step 4
Experimental setup for helicopter
Learning dynamics model P(st+1 | st, a)
• state represented by the helicopter's
    position (x, y, z)
    velocity (ẋ, ẏ, ż)
    angular velocity (ωx, ωy, ωz)
• action represented by manipulations of the 4 controls (u1, u2, u3, u4)
• dynamics model predicts accelerations as a function of current state and actions
• accelerations are integrated to compute the predicted next state
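A minimal sketch of that integration step, with the learned dynamics model passed in as a function (the model here is a stand-in, not the actual helicopter model):

    import numpy as np

    def predict_next_state(predict_acc, pos, vel, controls, dt=0.01):
        """predict_acc is the learned dynamics model: (pos, vel, controls) -> accelerations."""
        acc = predict_acc(pos, vel, controls)     # model predicts accelerations
        vel_next = vel + dt * acc                 # integrate acceleration into velocity
        pos_next = pos + dt * vel_next            # integrate velocity into position
        return pos_next, vel_next

    # toy usage with a placeholder model that ignores the controls and applies gravity only
    g = np.array([0.0, 0.0, -9.8])
    pos, vel = np.zeros(3), np.zeros(3)
    pos, vel = predict_next_state(lambda p, v, u: g, pos, vel, controls=np.zeros(4))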
Learning dynamics model P(st+1 | st, a)
• A, B, C, D represent model parameters
• g represents the gravity vector
• the w's are random variables representing noise and unmodeled effects
• linear regression task!

[figure: the dynamics model equations expressing the predicted body-frame accelerations (e.g. z̈b) in terms of the current state, controls, gravity, and noise terms]
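Since each predicted acceleration is modeled as a linear function of state and control features, fitting the parameters is ordinary least squares; a toy numpy sketch with made-up feature and target arrays standing in for logged flight data:

    import numpy as np

    # X: one row of state/control features per time step; y: observed acceleration (e.g. z̈b)
    X = np.random.randn(500, 8)                   # placeholder logged features
    true_w = np.random.randn(8)
    y = X @ true_w + 0.01 * np.random.randn(500)  # placeholder logged accelerations

    w, *_ = np.linalg.lstsq(X, y, rcond=None)     # linear regression: fitted model parameters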
Learning a desired trajectory
• repeated expert demonstrations are often suboptimal in different ways
• given a set of M demonstrated trajectories

      y_j^k = [ s_j^k ; u_j^k ]   for j = 0, ..., N−1 and k = 0, ..., M−1

  where s_j^k is the state and u_j^k the action on the jth step of trajectory k
• try to infer the implicit desired trajectory

      z_t = [ s_t* ; u_t* ]   for t = 0, ..., H
Learning a desired trajectory
Figure from Coates et al., CACM 2009
colored lines: demonstrations of two loops
black line: inferred trajectory
Learning reward function
• EM is used to infer desired trajectory from set of demonstrated trajectories
• The reward function is based on deviations from the desired trajectory
Finding the optimal control policy
• finding the control policy is a reinforcement learning task
• RL learning methods described earlier don’t quite apply because state and action spaces are both continuous
• A special type of Markov decision process in which the optimal policy can be found efficiently
    • reward is represented as a quadratic function of the state and action vectors
    • next state is represented as a linear function of the current state and action vectors
• They use an iterative approach that finds an approximate solution, since the learned reward and dynamics only approximately fit this linear-quadratic form
    π* ← argmax_π E[ Σ_t r(st, at) | π ]
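For the linear-quadratic setting sketched above, the finite-horizon optimal policy can be computed exactly by a backward (Riccati) recursion. The following is a generic numpy sketch of that recursion, written as cost minimization (maximizing reward = minimizing cost); it is not the authors' actual helicopter controller:

    import numpy as np

    def lqr_policy(A, B, Q, R, H):
        """Dynamics x' = A x + B u, per-step cost x'Qx + u'Ru, horizon H.
        Returns feedback gains K_t so the optimal action is u_t = -K_t x_t."""
        P = Q.copy()                        # cost-to-go matrix at the final step
        gains = []
        for _ in range(H):
            K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
            P = Q + A.T @ P @ (A - B @ K)   # propagate cost-to-go one step backward
            gains.append(K)
        return list(reversed(gains))        # gains[t] is K_t for time step t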
THANK YOU

Some of the slides in these lectures have been adapted/borrowed from materials developed by Yingyu Liang, Mark Craven, David Page, Jude Shavlik, Tom Mitchell, Nina Balcan, Elad Hazan, Tom Dietterich, and Pedro Domingos.