
Page 1: Concurrent Hierarchical   Reinforcement Learning

1

Concurrent Hierarchical Reinforcement Learning

Bhaskara Marthi, Stuart Russell, David Latham (UC Berkeley)

Carlos Guestrin (CMU)

Page 2: Concurrent Hierarchical   Reinforcement Learning

2

Example problem: Stratagus

Challenges: large state space, large action space, complex policies

Page 3: Concurrent Hierarchical   Reinforcement Learning

3

Writing agents for Stratagus

Game company: N person-years to write the “AI” for a game

Standard RL: learn tabula rasa (from scratch)

Hierarchical RL: concurrent partial program + learned optimal completion

Page 4: Concurrent Hierarchical   Reinforcement Learning

4

Outline

Running example
Background
Our contributions:
  Concurrent ALisp language for partial programs
  Basic learning method
  … with threadwise decomposition
  … and temporal decomposition
Results

Page 5: Concurrent Hierarchical   Reinforcement Learning

5

Running example

Peasants can move, pick up, and drop off

Penalty for collision
Cost-of-living each step
Reward for dropping off resources

Goal: gather 10 gold + 10 wood

Page 6: Concurrent Hierarchical   Reinforcement Learning

6

Outline

Running example
Background
Our contributions:
  Concurrent ALisp language for partial programs
  Basic learning method
  … with threadwise decomposition
  … and temporal decomposition
Results

Page 7: Concurrent Hierarchical   Reinforcement Learning

7

Reinforcement Learning

[Diagram: the standard RL loop. The learning algorithm produces a policy; the agent sends actions a to the environment and receives states s and rewards r.]

Page 8: Concurrent Hierarchical   Reinforcement Learning

8

Q-functions

Represent policies implicitly using a Q-function:
Qπ(s,a) = “Expected total reward if I do action a in environment state s and follow policy π thereafter”

Fragment of example Q-function:

s                                                  a                Q(s,a)
...                                                ...              ...
Peas1Loc=(2,3), Peas2Loc=(4,2), Gold=8, Wood=3     (East, Pickup)     7.5
Peas1Loc=(4,4), Peas2Loc=(6,3), Gold=4, Wood=7     (West, North)    -12
Peas1Loc=(5,1), Peas2Loc=(1,2), Gold=8, Wood=3     (South, West)      1.9
...                                                ...              ...
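Not on the original slide: a minimal plain Common Lisp sketch of how such a tabular Q-function could be stored and queried. The state and action encodings below are illustrative assumptions, not the paper's representation.

(defparameter *q-table* (make-hash-table :test #'equal))

(defun q-value (s a)
  "Look up Q(s,a), defaulting to 0 for unseen pairs."
  (gethash (list s a) *q-table* 0.0))

(defun set-q-value (s a value)
  (setf (gethash (list s a) *q-table*) value))

(defun greedy-action (s actions)
  "Return the action in ACTIONS with the largest Q(s,a)."
  (reduce (lambda (a b) (if (> (q-value s a) (q-value s b)) a b)) actions))

;; One entry corresponding to the first row of the table above
;; (the plist encoding of the state is an assumption):
(set-q-value '(:peas1-loc (2 3) :peas2-loc (4 2) :gold 8 :wood 3)
             '(east pickup)
             7.5)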

Page 9: Concurrent Hierarchical   Reinforcement Learning

9

Hierarchical RL and ALisp

[Diagram: the same loop, but the learning algorithm is now given a partial program and learns a completion of it; actions a still go to the environment, which returns states s and rewards r.]

Page 10: Concurrent Hierarchical   Reinforcement Learning

10

Single-threaded ALisp program:

(defun top ()
  (loop do (choose 'top-choice
             (gather-wood)
             (gather-gold))))

(defun gather-wood ()
  (with-choice 'forest-choice (dest *forest-list*)
    (nav dest) (action 'get-wood)
    (nav *base-loc*) (action 'dropoff)))

(defun gather-gold ()
  (with-choice 'mine-choice (dest *goldmine-list*)
    (nav dest) (action 'get-gold)
    (nav *base-loc*) (action 'dropoff)))

(defun nav (dest)
  (until (= (pos (get-state)) dest)
    (with-choice 'nav-choice (move '(N S E W NOOP))
      (action move))))

Program state includes: program counter, call stack, global variables

Page 11: Concurrent Hierarchical   Reinforcement Learning

11

Q-functions

Represent completions using a Q-function
Joint state ω = environment state + program state
MDP + partial program = SMDP over {ω}
Qπ(ω,u) = “Expected total reward if I make choice u in ω and follow completion π thereafter”

Example Q-function:

ω                                        u             Q(ω,u)
...                                      ...           ...
At nav-choice, Pos=(2,3), Dest=(6,5)     North           15
At resource-choice, Gold=7, Wood=3       Gather-wood    -42
...                                      ...           ...
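A minimal sketch, again in plain Common Lisp rather than ALisp, of what the joint state ω might carry, following the program-state list on Page 10 (program counter, call stack, global variables); the field names and the equalp-keyed table are assumptions.

(defstruct joint-state
  env-state          ; environment state s
  program-counter    ; the choice point at which the partial program is paused
  call-stack         ; active subroutine calls, e.g. (NAV GATHER-WOOD TOP)
  globals)           ; values of global program variables

(defparameter *omega-q* (make-hash-table :test #'equalp))

(defun omega-q (omega u)
  "Q(omega,u) for a joint state OMEGA and a choice U at its paused choice point."
  (gethash (list omega u) *omega-q* 0.0))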

Page 12: Concurrent Hierarchical   Reinforcement Learning

12

Temporal Q-decomposition

Temporal decomposition: Q = Qr + Qc + Qe, where
  Qr(ω,u) = reward while doing u
  Qc(ω,u) = reward in current subroutine after doing u
  Qe(ω,u) = reward after current subroutine

[Diagram: a trajectory through the task hierarchy (Top calling GatherWood then GatherGold, which call Nav(Forest1) and Nav(Mine2)), with the total reward Q split into the Qr, Qc, and Qe segments.]

Page 13: Concurrent Hierarchical   Reinforcement Learning

13

Q-learning for ALisp

Temporal decomposition => state abstraction
  E.g., while navigating, Qr doesn’t depend on gold reserves

State abstraction => fewer parameters, faster learning, better generalization

Algorithm for learning temporally decomposed Q-function (Andre+Russell 02)

Yields optimal completion of partial program
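A hedged sketch of how the temporal decomposition enables state abstraction: each component is stored under its own abstracted view of ω, so the table for Qr at nav-choice can simply omit gold and wood reserves. Here ω is assumed to be a property list, and the abstraction functions are purely illustrative.

(defparameter *qr* (make-hash-table :test #'equal))
(defparameter *qc* (make-hash-table :test #'equal))
(defparameter *qe* (make-hash-table :test #'equal))

(defun qr-features (omega)
  "Illustrative abstraction: while navigating, Qr depends only on position and
destination, not on gold or wood reserves."
  (list (getf omega :pos) (getf omega :dest)))

(defun qc-features (omega) omega)   ; placeholder abstractions for the other parts
(defun qe-features (omega) omega)

(defun decomposed-q (omega u)
  "Q(omega,u) = Qr + Qc + Qe, each component looked up under its own abstraction."
  (+ (gethash (list (qr-features omega) u) *qr* 0.0)
     (gethash (list (qc-features omega) u) *qc* 0.0)
     (gethash (list (qe-features omega) u) *qe* 0.0)))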

Page 14: Concurrent Hierarchical   Reinforcement Learning

14

Outline

Running example
Background
Our contributions:
  Concurrent ALisp language for partial programs
  Basic learning method
  … with threadwise decomposition
  … and temporal decomposition
Results

Page 15: Concurrent Hierarchical   Reinforcement Learning

15

Multieffector domains

Domains where the action is a vector
Each component of the action vector corresponds to an effector
Stratagus: each unit/building is an effector
#actions is exponential in #effectors (e.g., 15 effectors with just 5 actions each gives 5^15 ≈ 3 × 10^10 joint actions)

Page 16: Concurrent Hierarchical   Reinforcement Learning

16

HRL in multieffector domains

Single ALisp program:
  Partial program must essentially implement multiple control stacks
  Independencies between tasks are not used
  Temporal decomposition is lost
Separate ALisp program for each effector:
  Hard to achieve coordination among effectors
Our approach: a single multithreaded partial program to control all effectors

Page 17: Concurrent Hierarchical   Reinforcement Learning

17

Threads and effectors

Threads = tasks
Each effector is assigned to a thread
Threads can be created/destroyed
Effectors can be moved between threads
The set of effectors can change over time

[Diagram: effectors (peasants) grouped into task threads such as Get-gold, Get-wood, and Defend-Base.]
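A minimal sketch (illustrative only, not the actual runtime) of the bookkeeping this implies: a table mapping each effector to its current thread, so moving an effector between threads is just re-pointing its entry.

(defparameter *effector-thread* (make-hash-table :test #'eq))

(defun assign-effector (effector thread)
  (setf (gethash effector *effector-thread*) thread))

(defun effectors-of (thread)
  "Effectors currently assigned to THREAD."
  (loop for e being the hash-keys of *effector-thread* using (hash-value th)
        when (eq th thread) collect e))

;; Reassigning peasant-3 from gathering wood to defending the base:
(assign-effector 'peasant-3 'get-wood-thread)
(assign-effector 'peasant-3 'defend-base-thread)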

Page 18: Concurrent Hierarchical   Reinforcement Learning

18

Concurrent Alisp syntax

Extension of ALisp. New/modified operations:

(action (e1 a1) ... (en an)) : each effector ei does ai
(spawn thread fn effectors) : start a new thread that calls function fn, and reassign effectors to it
(reassign effectors dest-thread) : reassign effectors to thread dest-thread
(my-effectors) : return the list of effectors assigned to the calling thread

The program should also specify how new effectors are assigned to threads on each step

Page 23: Concurrent Hierarchical   Reinforcement Learning

23

An example Concurrent ALisp program (shown on the original slide next to the single-threaded ALisp program from Page 10 for comparison):

(defun top ()
  (loop do (until (my-effectors) (choose 'dummy))
           (setf peas (first (my-effectors)))
           (choose 'top-choice
             (spawn gather-wood peas)
             (spawn gather-gold peas))))

(defun gather-wood ()
  (with-choice 'forest-choice (dest *forest-list*)
    (nav dest) (action 'get-wood)
    (nav *base-loc*) (action 'dropoff)))

(defun gather-gold ()
  (with-choice 'mine-choice (dest *goldmine-list*)
    (nav dest) (action 'get-gold)
    (nav *base-loc*) (action 'dropoff)))

(defun nav (dest)
  (until (= (my-pos) dest)
    (with-choice 'nav-choice (move '(N S E W NOOP))
      (action move))))

Page 24: Concurrent Hierarchical   Reinforcement Learning

24

Concurrent ALisp semantics

[Diagram: two peasant threads shown at environment timesteps 26 and 27, each marked as Running, Paused / Making joint choice, or Waiting for joint action. One thread controls a peasant executing gather-wood and is inside nav with dest bound to Wood1; the other controls a peasant executing gather-gold and is inside nav with dest bound to Gold2.]

(defun gather-wood ()
  (setf dest (choose *forests*))
  (nav dest) (action 'get-wood)
  (nav *home-base-loc*) (action 'put-wood))

(defun gather-gold ()
  (setf dest (choose *goldmines*))
  (nav dest) (action 'get-gold)
  (nav *home-base-loc*) (action 'put-gold))

(defun nav (dest)
  (loop until (at-pos dest)
        do (setf d (choose '(N S E W R)))
           (action d)))

Page 25: Concurrent Hierarchical   Reinforcement Learning

25

Concurrent ALisp semantics

Threads execute independently until they hit a choice or action
Wait until all threads are at a choice or action
If all effectors have been assigned an action, do that joint action in the environment; otherwise, make a joint choice
Can be converted to a semi-Markov decision process, assuming the partial program is deadlock-free and has no race conditions
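A highly simplified sketch of this joint-step rule in plain Common Lisp. Real Concurrent ALisp threads are running programs, not queues; here each "thread" is just a list whose head is its pending step, which is enough to show when a joint action versus a joint choice occurs.

(defun joint-step (threads)
  "THREADS is a list of queues; each queue's head is either (:action effector act)
or (:choice label options).  If every thread is paused at an action, the joint
action can be executed; otherwise a joint choice must be made first."
  (let ((heads (mapcar #'first threads)))
    (if (every (lambda (h) (eq (first h) :action)) heads)
        (list :joint-action (mapcar #'rest heads))
        (list :joint-choice (remove-if-not (lambda (h) (eq (first h) :choice)) heads)))))

;; Example: one peasant thread is at an action, the other at a nav choice,
;; so a joint choice must be resolved before anything happens in the environment.
(joint-step '(((:action peas1 get-wood))
              ((:choice nav-choice (N S E W NOOP)))))
;; => (:JOINT-CHOICE ((:CHOICE NAV-CHOICE (N S E W NOOP))))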

Page 26: Concurrent Hierarchical   Reinforcement Learning

26

Q-functions

To complete the partial program, at each choice state ω we need to specify choices for all choosing threads
So Q(ω,u) is as before, except that u is a joint choice

Example Q-function entry:

ω : Peas1 at NavChoice, Peas2 at DropoffGold, Peas3 at ForestChoice, Pos1=(2,3), Pos3=(7,4), Gold=12, Wood=14
u : (Peas1:East, Peas3:Forest2)
Q(ω,u) : 15.7

Page 27: Concurrent Hierarchical   Reinforcement Learning

27

Outline

Running example
Background
Our contributions:
  Concurrent ALisp language for partial programs
  Basic learning method
  … with threadwise decomposition
  … and temporal decomposition
Results

Page 28: Concurrent Hierarchical   Reinforcement Learning

28

Making joint choices at runtime

Large state space, large choice space
At state ω, we want to choose argmax_u Q(ω,u), but the number of joint choices is exponential in the number of choosing threads
Use relational linear value function approximation with coordination graphs (Guestrin et al., 2003) to represent and maximize Q efficiently
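A minimal sketch of the underlying idea: write Q as a sum of local terms and maximize the sum jointly. For clarity this sketch enumerates the joint choices by brute force; the paper instead uses variable elimination on the coordination graph (Guestrin et al., 2003), which avoids enumerating the exponentially many joint choices. The function names below are assumptions, not the paper's API.

(defun joint-choices (choice-sets)
  "All combinations of per-thread choices, e.g. ((N S) (Forest1 Forest2)) -> 4 tuples."
  (if (null choice-sets)
      (list nil)
      (loop for c in (first choice-sets)
            append (mapcar (lambda (rest) (cons c rest))
                           (joint-choices (rest choice-sets))))))

(defun best-joint-choice (local-qs choice-sets)
  "LOCAL-QS is a list of functions, one per thread, each mapping a joint choice to
a number; return the joint choice that maximizes their sum."
  (let ((best nil) (best-val nil))
    (dolist (u (joint-choices choice-sets) best)
      (let ((val (reduce #'+ (mapcar (lambda (q) (funcall q u)) local-qs))))
        (when (or (null best-val) (> val best-val))
          (setf best u best-val val))))))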

Page 29: Concurrent Hierarchical   Reinforcement Learning

29

Basic learning procedure

Apply Q-learning on the equivalent SMDP
Boltzmann exploration; the maximization on each step is done efficiently using the coordination graph
Theorem: in the tabular case, this converges to the optimal completion of the partial program
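A hedged sketch of one such backup between consecutive choice states, together with Boltzmann exploration over choices. The learning rate, temperature, and the undiscounted total-reward treatment are assumptions made for illustration; terminal states and the coordination-graph maximization are omitted.

(defparameter *smdp-q* (make-hash-table :test #'equal))
(defparameter *alpha* 0.1)          ; learning rate (assumed value)
(defparameter *temperature* 1.0)    ; Boltzmann temperature (assumed value)

(defun smdp-q (omega u) (gethash (list omega u) *smdp-q* 0.0))

(defun smdp-backup (omega u total-reward omega-next next-choices)
  "TOTAL-REWARD is the reward accumulated between the two choice states."
  (let ((target (+ total-reward
                   (if next-choices
                       (reduce #'max (mapcar (lambda (v) (smdp-q omega-next v)) next-choices))
                       0.0))))
    (setf (gethash (list omega u) *smdp-q*)
          (+ (smdp-q omega u) (* *alpha* (- target (smdp-q omega u)))))))

(defun boltzmann-choice (omega choices)
  "Sample a choice with probability proportional to exp(Q / temperature)."
  (let* ((weights (mapcar (lambda (u) (exp (/ (smdp-q omega u) *temperature*))) choices))
         (r (random (reduce #'+ weights))))
    (loop for u in choices
          for w in weights
          do (if (< r w) (return u) (decf r w)))))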

Page 30: Concurrent Hierarchical   Reinforcement Learning

30

Outline

Running example
Background
Our contributions:
  Concurrent ALisp language for partial programs
  Basic learning method
  … with threadwise decomposition
  … and temporal decomposition
Results

Page 31: Concurrent Hierarchical   Reinforcement Learning

31

Problems with basic learning method

Temporal decomposition of Q-function lost
No credit assignment among threads:
  Suppose peasant 1 drops off some gold at base, while peasant 2 wanders aimlessly
  Peasant 2 thinks he’s done very well!!
Significantly slows learning as number of peasants increases

Page 32: Concurrent Hierarchical   Reinforcement Learning

32

Threadwise decomposition

Idea: decompose reward among threads (Russell+Zimdars, 2003)

E.g., rewards for thread j only when peasant j drops off resources or collides with other peasants

Qj(ω,u) = “Expected total reward received by thread j if we make joint choice u and make globally optimal choices thereafter”

Threadwise Q-decomposition: Q = Q1 + … + Qn
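A minimal sketch of the threadwise decomposition in plain Common Lisp: one table per thread, joint choices evaluated by summing the per-thread components, and each component updated from its own reward stream (in the spirit of Russell & Zimdars, 2003). It assumes NEXT-CHOICES is non-empty and omits terminal handling.

(defun make-thread-q () (make-hash-table :test #'equal))

(defun thread-q (table omega u)
  (gethash (list omega u) table 0.0))

(defun threadwise-q (tables omega u)
  "Q(omega,u) = Q1(omega,u) + ... + Qn(omega,u)."
  (reduce #'+ (mapcar (lambda (table) (thread-q table omega u)) tables)))

(defun threadwise-backup (tables omega u thread-rewards omega-next next-choices alpha)
  "Update each thread's Qj toward rj + Qj(omega-next, best-u), where BEST-U
maximizes the summed Q at omega-next, so every thread evaluates the same joint
choice but is credited only with its own reward rj."
  (let ((best-u (reduce (lambda (a b)
                          (if (> (threadwise-q tables omega-next a)
                                 (threadwise-q tables omega-next b))
                              a b))
                        next-choices)))
    (loop for table in tables
          for rj in thread-rewards
          do (let ((old (thread-q table omega u)))
               (setf (gethash (list omega u) table)
                     (+ old (* alpha (- (+ rj (thread-q table omega-next best-u)) old))))))))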

Page 33: Concurrent Hierarchical   Reinforcement Learning

33

Learning threadwise decomposition

[Diagram: the action and state signals pass between the environment and the threads; a reward decomposition step splits the global reward r into per-thread rewards r1, r2, r3, which are routed to the Peasant 1, Peasant 2, and Peasant 3 threads (alongside a Top thread).]

Page 34: Concurrent Hierarchical   Reinforcement Learning

34

Learning threadwise decomposition

[Diagram: at a choice state ω, each peasant thread supplies its component Q1(ω,·), Q2(ω,·), Q3(ω,·); the components are summed to give Q(ω,·), the joint choice u = argmax of the sum is made, and each thread then performs its own Q-update.]

Page 35: Concurrent Hierarchical   Reinforcement Learning

35

Outline

Running example
Background
Our contributions:
  Concurrent ALisp language for partial programs
  Basic learning method
  … with threadwise decomposition
  … and temporal decomposition
Results

Page 36: Concurrent Hierarchical   Reinforcement Learning

36

Threadwise and temporal decomposition

Qj = Qj,r + Qj,c + Qj,e, where
  Qj,r(ω,u) = expected reward gained by thread j while doing u
  Qj,c(ω,u) = expected reward gained by thread j after u but before leaving the current subroutine
  Qj,e(ω,u) = expected reward gained by thread j after the current subroutine

[Diagram: each peasant thread (and the Top thread) maintains its own Qj,r, Qj,c, and Qj,e components.]
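Combining the two decompositions, the total Q-value is the sum over threads of that thread's three temporal components. A tiny sketch, where THREAD-COMPONENTS is assumed to be a list of (Qj,r Qj,c Qj,e) triples, each a function of (ω, u):

(defun total-q (thread-components omega u)
  "Sum of Qj,r + Qj,c + Qj,e over all threads j."
  (loop for (qjr qjc qje) in thread-components
        sum (+ (funcall qjr omega u)
               (funcall qjc omega u)
               (funcall qje omega u))))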

Page 37: Concurrent Hierarchical   Reinforcement Learning

37

Outline

Running example
Background
Our contributions:
  Concurrent ALisp language for partial programs
  Basic learning method
  … with threadwise decomposition
  … and temporal decomposition
Results

Page 38: Concurrent Hierarchical   Reinforcement Learning

38

Resource gathering with 15 peasants

[Learning curve: reward of learnt policy vs. number of learning steps (x 1000), showing the Flat learner.]

Page 39: Concurrent Hierarchical   Reinforcement Learning

Resource gathering with 15 peasants

[Learning curve: reward of learnt policy vs. number of learning steps (x 1000), showing the Flat and Undecomposed learners.]

Page 40: Concurrent Hierarchical   Reinforcement Learning

Resource gathering with 15 peasants

[Learning curve: reward of learnt policy vs. number of learning steps (x 1000), showing the Flat, Undecomposed, and Threadwise learners.]

Page 41: Concurrent Hierarchical   Reinforcement Learning

Resource gathering with 15 peasants

[Learning curve: reward of learnt policy vs. number of learning steps (x 1000), showing the Flat, Undecomposed, Threadwise, and Threadwise + Temporal learners.]

Page 42: Concurrent Hierarchical   Reinforcement Learning

42

Conclusion

Contributions:
  Concurrent ALisp language for partial programs
  Algorithms for choice selection and learning
  Threadwise and temporal decomposition of the Q-function

Encode prior knowledge and use structure in the Q-function for fast learning in very complex multieffector domains

Questions?