
Page 1: Concurrent Hierarchical   Reinforcement Learning

1

Concurrent Hierarchical Reinforcement Learning

Bhaskara Marthi, Stuart Russell, David Latham (UC Berkeley)

Carlos Guestrin (CMU)

Page 2: Concurrent Hierarchical   Reinforcement Learning

2

Example problem: Stratagus

Challenges: large state space, large action space, complex policies

Page 3: Concurrent Hierarchical   Reinforcement Learning

3

Writing agents for Stratagus

Game company: N person-years to write the “AI” for a game

Standard RL: learn tabula rasa (from scratch)

Hierarchical RL: concurrent partial program + learned optimal completion

Page 4: Concurrent Hierarchical   Reinforcement Learning

4

Outline

Running example
Background
Our contributions:
  Concurrent ALisp language for partial programs
  Basic learning method
  … with threadwise decomposition
  … and temporal decomposition
Results

Page 5: Concurrent Hierarchical   Reinforcement Learning

5

Running example

Peasants can move, pick up, and drop off

Penalty for collision
Cost-of-living each step
Reward for dropping off resources

Goal: gather 10 gold + 10 wood

Page 6: Concurrent Hierarchical   Reinforcement Learning

6

Outline

Running example
Background
Our contributions:
  Concurrent ALisp language for partial programs
  Basic learning method
  … with threadwise decomposition
  … and temporal decomposition
Results

Page 7: Concurrent Hierarchical   Reinforcement Learning

7

Reinforcement Learning

[Diagram: the standard RL loop. The learning algorithm produces a policy; the agent sends actions a to the environment and receives states s and rewards r.]

Page 8: Concurrent Hierarchical   Reinforcement Learning

8

Q-functions

Represent policies implicitly using a Q-function:
Qπ(s,a) = “Expected total reward if I do action a in environment state s and follow policy π thereafter”

Fragment of example Q-function:

s                                                  a                Q(s,a)
...                                                ...              ...
Peas1Loc=(2,3), Peas2Loc=(4,2), Gold=8, Wood=3     (East, Pickup)     7.5
Peas1Loc=(4,4), Peas2Loc=(6,3), Gold=4, Wood=7     (West, North)    -12
Peas1Loc=(5,1), Peas2Loc=(1,2), Gold=8, Wood=3     (South, West)      1.9
...                                                ...              ...
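Not on the original slide: a minimal plain Common Lisp sketch of how such a tabular Q-function could be stored and queried. The state and action encodings below are illustrative assumptions, not the paper's representation.

(defparameter *q-table* (make-hash-table :test #'equal))

(defun q-value (s a)
  "Look up Q(s,a), defaulting to 0 for unseen pairs."
  (gethash (list s a) *q-table* 0.0))

(defun set-q-value (s a value)
  (setf (gethash (list s a) *q-table*) value))

(defun greedy-action (s actions)
  "Return the action in ACTIONS with the largest Q(s,a)."
  (reduce (lambda (a b) (if (> (q-value s a) (q-value s b)) a b)) actions))

;; One entry corresponding to the first row of the table above
;; (the plist encoding of the state is an assumption):
(set-q-value '(:peas1-loc (2 3) :peas2-loc (4 2) :gold 8 :wood 3)
             '(east pickup)
             7.5)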

Page 9: Concurrent Hierarchical   Reinforcement Learning

9

Hierarchical RL and ALisp

[Diagram: the same loop, but the learning algorithm is now given a partial program and learns a completion of it; actions a still go to the environment, which returns states s and rewards r.]

Page 10: Concurrent Hierarchical   Reinforcement Learning

10

Single-threaded ALisp program:

(defun top ()
  (loop do (choose 'top-choice
             (gather-wood)
             (gather-gold))))

(defun gather-wood ()
  (with-choice 'forest-choice (dest *forest-list*)
    (nav dest) (action 'get-wood)
    (nav *base-loc*) (action 'dropoff)))

(defun gather-gold ()
  (with-choice 'mine-choice (dest *goldmine-list*)
    (nav dest) (action 'get-gold)
    (nav *base-loc*) (action 'dropoff)))

(defun nav (dest)
  (until (= (pos (get-state)) dest)
    (with-choice 'nav-choice (move '(N S E W NOOP))
      (action move))))

Program state includes: program counter, call stack, global variables

Page 11: Concurrent Hierarchical   Reinforcement Learning

11

Q-functions

Represent completions using a Q-function
Joint state ω = environment state + program state
MDP + partial program = SMDP over {ω}
Qπ(ω,u) = “Expected total reward if I make choice u in ω and follow completion π thereafter”

Example Q-function:

ω                                        u             Q(ω,u)
...                                      ...           ...
At nav-choice, Pos=(2,3), Dest=(6,5)     North           15
At resource-choice, Gold=7, Wood=3       Gather-wood    -42
...                                      ...           ...
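A minimal sketch, again in plain Common Lisp rather than ALisp, of what the joint state ω might carry, following the program-state list on Page 10 (program counter, call stack, global variables); the field names and the equalp-keyed table are assumptions.

(defstruct joint-state
  env-state          ; environment state s
  program-counter    ; the choice point at which the partial program is paused
  call-stack         ; active subroutine calls, e.g. (NAV GATHER-WOOD TOP)
  globals)           ; values of global program variables

(defparameter *omega-q* (make-hash-table :test #'equalp))

(defun omega-q (omega u)
  "Q(omega,u) for a joint state OMEGA and a choice U at its paused choice point."
  (gethash (list omega u) *omega-q* 0.0))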

Page 12: Concurrent Hierarchical   Reinforcement Learning

12

Temporal Q-decomposition

Temporal decomposition: Q = Qr + Qc + Qe, where
  Qr(ω,u) = reward while doing u
  Qc(ω,u) = reward in current subroutine after doing u
  Qe(ω,u) = reward after current subroutine

[Diagram: a trajectory through the task hierarchy (Top calling GatherWood then GatherGold, which call Nav(Forest1) and Nav(Mine2)), with the total reward Q split into the Qr, Qc, and Qe segments.]

Page 13: Concurrent Hierarchical   Reinforcement Learning

13

Q-learning for ALisp

Temporal decomposition => state abstraction
  E.g., while navigating, Qr doesn’t depend on gold reserves

State abstraction => fewer parameters, faster learning, better generalization

Algorithm for learning temporally decomposed Q-function (Andre+Russell 02)

Yields optimal completion of partial program
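A hedged sketch of how the temporal decomposition enables state abstraction: each component is stored under its own abstracted view of ω, so the table for Qr at nav-choice can simply omit gold and wood reserves. Here ω is assumed to be a property list, and the abstraction functions are purely illustrative.

(defparameter *qr* (make-hash-table :test #'equal))
(defparameter *qc* (make-hash-table :test #'equal))
(defparameter *qe* (make-hash-table :test #'equal))

(defun qr-features (omega)
  "Illustrative abstraction: while navigating, Qr depends only on position and
destination, not on gold or wood reserves."
  (list (getf omega :pos) (getf omega :dest)))

(defun qc-features (omega) omega)   ; placeholder abstractions for the other parts
(defun qe-features (omega) omega)

(defun decomposed-q (omega u)
  "Q(omega,u) = Qr + Qc + Qe, each component looked up under its own abstraction."
  (+ (gethash (list (qr-features omega) u) *qr* 0.0)
     (gethash (list (qc-features omega) u) *qc* 0.0)
     (gethash (list (qe-features omega) u) *qe* 0.0)))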

Page 14: Concurrent Hierarchical   Reinforcement Learning

14

Outline

Running example
Background
Our contributions:
  Concurrent ALisp language for partial programs
  Basic learning method
  … with threadwise decomposition
  … and temporal decomposition
Results

Page 15: Concurrent Hierarchical   Reinforcement Learning

15

Multieffector domains

Domains where the action is a vector
Each component of the action vector corresponds to an effector
Stratagus: each unit/building is an effector
#actions is exponential in #effectors (e.g., 15 effectors with just 5 actions each gives 5^15 ≈ 3 × 10^10 joint actions)

Page 16: Concurrent Hierarchical   Reinforcement Learning

16

HRL in multieffector domains

Single ALisp program:
  Partial program must essentially implement multiple control stacks
  Independencies between tasks are not used
  Temporal decomposition is lost
Separate ALisp program for each effector:
  Hard to achieve coordination among effectors
Our approach: a single multithreaded partial program to control all effectors

Page 17: Concurrent Hierarchical   Reinforcement Learning

17

Threads and effectors

Threads = tasks
Each effector is assigned to a thread
Threads can be created/destroyed
Effectors can be moved between threads
The set of effectors can change over time

[Diagram: effectors (peasants) grouped into task threads such as Get-gold, Get-wood, and Defend-Base.]
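A minimal sketch (illustrative only, not the actual runtime) of the bookkeeping this implies: a table mapping each effector to its current thread, so moving an effector between threads is just re-pointing its entry.

(defparameter *effector-thread* (make-hash-table :test #'eq))

(defun assign-effector (effector thread)
  (setf (gethash effector *effector-thread*) thread))

(defun effectors-of (thread)
  "Effectors currently assigned to THREAD."
  (loop for e being the hash-keys of *effector-thread* using (hash-value th)
        when (eq th thread) collect e))

;; Reassigning peasant-3 from gathering wood to defending the base:
(assign-effector 'peasant-3 'get-wood-thread)
(assign-effector 'peasant-3 'defend-base-thread)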

Page 18: Concurrent Hierarchical   Reinforcement Learning

18

Concurrent Alisp syntax

Extension of ALisp. New/modified operations:

(action (e1 a1) ... (en an)) : each effector ei does ai
(spawn thread fn effectors) : start a new thread that calls function fn, and reassign effectors to it
(reassign effectors dest-thread) : reassign effectors to thread dest-thread
(my-effectors) : return the list of effectors assigned to the calling thread

The program should also specify how new effectors are assigned to threads on each step

Page 23: Concurrent Hierarchical   Reinforcement Learning

23

An example Concurrent ALisp program (shown on the original slide next to the single-threaded ALisp program from Page 10 for comparison):

(defun top ()
  (loop do (until (my-effectors) (choose 'dummy))
           (setf peas (first (my-effectors)))
           (choose 'top-choice
             (spawn gather-wood peas)
             (spawn gather-gold peas))))

(defun gather-wood ()
  (with-choice 'forest-choice (dest *forest-list*)
    (nav dest) (action 'get-wood)
    (nav *base-loc*) (action 'dropoff)))

(defun gather-gold ()
  (with-choice 'mine-choice (dest *goldmine-list*)
    (nav dest) (action 'get-gold)
    (nav *base-loc*) (action 'dropoff)))

(defun nav (dest)
  (until (= (my-pos) dest)
    (with-choice 'nav-choice (move '(N S E W NOOP))
      (action move))))

Page 24: Concurrent Hierarchical   Reinforcement Learning

24

Concurrent ALisp semantics

[Diagram: two peasant threads shown at environment timesteps 26 and 27, each marked as Running, Paused / Making joint choice, or Waiting for joint action. One thread controls a peasant executing gather-wood and is inside nav with dest bound to Wood1; the other controls a peasant executing gather-gold and is inside nav with dest bound to Gold2.]

(defun gather-wood ()
  (setf dest (choose *forests*))
  (nav dest) (action 'get-wood)
  (nav *home-base-loc*) (action 'put-wood))

(defun gather-gold ()
  (setf dest (choose *goldmines*))
  (nav dest) (action 'get-gold)
  (nav *home-base-loc*) (action 'put-gold))

(defun nav (dest)
  (loop until (at-pos dest)
        do (setf d (choose '(N S E W R)))
           (action d)))

Page 25: Concurrent Hierarchical   Reinforcement Learning

25

Concurrent ALisp semantics

Threads execute independently until they hit a choice or action
Wait until all threads are at a choice or action
If all effectors have been assigned an action, do that joint action in the environment; otherwise, make a joint choice
Can be converted to a semi-Markov decision process, assuming the partial program is deadlock-free and has no race conditions
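A highly simplified sketch of this joint-step rule in plain Common Lisp. Real Concurrent ALisp threads are running programs, not queues; here each "thread" is just a list whose head is its pending step, which is enough to show when a joint action versus a joint choice occurs.

(defun joint-step (threads)
  "THREADS is a list of queues; each queue's head is either (:action effector act)
or (:choice label options).  If every thread is paused at an action, the joint
action can be executed; otherwise a joint choice must be made first."
  (let ((heads (mapcar #'first threads)))
    (if (every (lambda (h) (eq (first h) :action)) heads)
        (list :joint-action (mapcar #'rest heads))
        (list :joint-choice (remove-if-not (lambda (h) (eq (first h) :choice)) heads)))))

;; Example: one peasant thread is at an action, the other at a nav choice,
;; so a joint choice must be resolved before anything happens in the environment.
(joint-step '(((:action peas1 get-wood))
              ((:choice nav-choice (N S E W NOOP)))))
;; => (:JOINT-CHOICE ((:CHOICE NAV-CHOICE (N S E W NOOP))))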

Page 26: Concurrent Hierarchical   Reinforcement Learning

26

Q-functions

To complete the partial program, at each choice state ω we need to specify choices for all choosing threads
So Q(ω,u) is as before, except that u is a joint choice

Example Q-function entry:

ω : Peas1 at NavChoice, Peas2 at DropoffGold, Peas3 at ForestChoice, Pos1=(2,3), Pos3=(7,4), Gold=12, Wood=14
u : (Peas1:East, Peas3:Forest2)
Q(ω,u) : 15.7

Page 27: Concurrent Hierarchical   Reinforcement Learning

27

Outline

Running example
Background
Our contributions:
  Concurrent ALisp language for partial programs
  Basic learning method
  … with threadwise decomposition
  … and temporal decomposition
Results

Page 28: Concurrent Hierarchical   Reinforcement Learning

28

Making joint choices at runtime

Large state space, large choice space
At state ω, we want to choose argmax_u Q(ω,u), but the number of joint choices is exponential in the number of choosing threads
Use relational linear value function approximation with coordination graphs (Guestrin et al., 2003) to represent and maximize Q efficiently
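A minimal sketch of the underlying idea: write Q as a sum of local terms and maximize the sum jointly. For clarity this sketch enumerates the joint choices by brute force; the paper instead uses variable elimination on the coordination graph (Guestrin et al., 2003), which avoids enumerating the exponentially many joint choices. The function names below are assumptions, not the paper's API.

(defun joint-choices (choice-sets)
  "All combinations of per-thread choices, e.g. ((N S) (Forest1 Forest2)) -> 4 tuples."
  (if (null choice-sets)
      (list nil)
      (loop for c in (first choice-sets)
            append (mapcar (lambda (rest) (cons c rest))
                           (joint-choices (rest choice-sets))))))

(defun best-joint-choice (local-qs choice-sets)
  "LOCAL-QS is a list of functions, one per thread, each mapping a joint choice to
a number; return the joint choice that maximizes their sum."
  (let ((best nil) (best-val nil))
    (dolist (u (joint-choices choice-sets) best)
      (let ((val (reduce #'+ (mapcar (lambda (q) (funcall q u)) local-qs))))
        (when (or (null best-val) (> val best-val))
          (setf best u best-val val))))))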

Page 29: Concurrent Hierarchical   Reinforcement Learning

29

Basic learning procedure

Apply Q-learning on the equivalent SMDP
Boltzmann exploration; the maximization on each step is done efficiently using the coordination graph
Theorem: in the tabular case, this converges to the optimal completion of the partial program
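A hedged sketch of one such backup between consecutive choice states, together with Boltzmann exploration over choices. The learning rate, temperature, and the undiscounted total-reward treatment are assumptions made for illustration; terminal states and the coordination-graph maximization are omitted.

(defparameter *smdp-q* (make-hash-table :test #'equal))
(defparameter *alpha* 0.1)          ; learning rate (assumed value)
(defparameter *temperature* 1.0)    ; Boltzmann temperature (assumed value)

(defun smdp-q (omega u) (gethash (list omega u) *smdp-q* 0.0))

(defun smdp-backup (omega u total-reward omega-next next-choices)
  "TOTAL-REWARD is the reward accumulated between the two choice states."
  (let ((target (+ total-reward
                   (if next-choices
                       (reduce #'max (mapcar (lambda (v) (smdp-q omega-next v)) next-choices))
                       0.0))))
    (setf (gethash (list omega u) *smdp-q*)
          (+ (smdp-q omega u) (* *alpha* (- target (smdp-q omega u)))))))

(defun boltzmann-choice (omega choices)
  "Sample a choice with probability proportional to exp(Q / temperature)."
  (let* ((weights (mapcar (lambda (u) (exp (/ (smdp-q omega u) *temperature*))) choices))
         (r (random (reduce #'+ weights))))
    (loop for u in choices
          for w in weights
          do (if (< r w) (return u) (decf r w)))))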

Page 30: Concurrent Hierarchical   Reinforcement Learning

30

Outline

Running example
Background
Our contributions:
  Concurrent ALisp language for partial programs
  Basic learning method
  … with threadwise decomposition
  … and temporal decomposition
Results

Page 31: Concurrent Hierarchical   Reinforcement Learning

31

Problems with basic learning method

Temporal decomposition of Q-function lost
No credit assignment among threads:
  Suppose peasant 1 drops off some gold at base, while peasant 2 wanders aimlessly
  Peasant 2 thinks he’s done very well!!
Significantly slows learning as number of peasants increases

Page 32: Concurrent Hierarchical   Reinforcement Learning

32

Threadwise decomposition

Idea: decompose reward among threads (Russell+Zimdars, 2003)

E.g., rewards for thread j only when peasant j drops off resources or collides with other peasants

Qj(ω,u) = “Expected total reward received by thread j if we make joint choice u and make globally optimal choices thereafter”

Threadwise Q-decomposition: Q = Q1 + … + Qn
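A minimal sketch of the threadwise decomposition in plain Common Lisp: one table per thread, joint choices evaluated by summing the per-thread components, and each component updated from its own reward stream (in the spirit of Russell & Zimdars, 2003). It assumes NEXT-CHOICES is non-empty and omits terminal handling.

(defun make-thread-q () (make-hash-table :test #'equal))

(defun thread-q (table omega u)
  (gethash (list omega u) table 0.0))

(defun threadwise-q (tables omega u)
  "Q(omega,u) = Q1(omega,u) + ... + Qn(omega,u)."
  (reduce #'+ (mapcar (lambda (table) (thread-q table omega u)) tables)))

(defun threadwise-backup (tables omega u thread-rewards omega-next next-choices alpha)
  "Update each thread's Qj toward rj + Qj(omega-next, best-u), where BEST-U
maximizes the summed Q at omega-next, so every thread evaluates the same joint
choice but is credited only with its own reward rj."
  (let ((best-u (reduce (lambda (a b)
                          (if (> (threadwise-q tables omega-next a)
                                 (threadwise-q tables omega-next b))
                              a b))
                        next-choices)))
    (loop for table in tables
          for rj in thread-rewards
          do (let ((old (thread-q table omega u)))
               (setf (gethash (list omega u) table)
                     (+ old (* alpha (- (+ rj (thread-q table omega-next best-u)) old))))))))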

Page 33: Concurrent Hierarchical   Reinforcement Learning

33

Learning threadwise decomposition

[Diagram: the action and state signals pass between the environment and the threads; a reward decomposition step splits the global reward r into per-thread rewards r1, r2, r3, which are routed to the Peasant 1, Peasant 2, and Peasant 3 threads (alongside a Top thread).]

Page 34: Concurrent Hierarchical   Reinforcement Learning

34

Learning threadwise decomposition

[Diagram: at a choice state ω, each peasant thread supplies its component Q1(ω,·), Q2(ω,·), Q3(ω,·); the components are summed to give Q(ω,·), the joint choice u = argmax of the sum is made, and each thread then performs its own Q-update.]

Page 35: Concurrent Hierarchical   Reinforcement Learning

35

Outline

Running example
Background
Our contributions:
  Concurrent ALisp language for partial programs
  Basic learning method
  … with threadwise decomposition
  … and temporal decomposition
Results

Page 36: Concurrent Hierarchical   Reinforcement Learning

36

Threadwise and temporal decomposition

Qj = Qj,r + Qj,c + Qj,e, where
  Qj,r(ω,u) = expected reward gained by thread j while doing u
  Qj,c(ω,u) = expected reward gained by thread j after u but before leaving the current subroutine
  Qj,e(ω,u) = expected reward gained by thread j after the current subroutine

[Diagram: each peasant thread (and the Top thread) maintains its own Qj,r, Qj,c, and Qj,e components.]
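Combining the two decompositions, the total Q-value is the sum over threads of that thread's three temporal components. A tiny sketch, where THREAD-COMPONENTS is assumed to be a list of (Qj,r Qj,c Qj,e) triples, each a function of (ω, u):

(defun total-q (thread-components omega u)
  "Sum of Qj,r + Qj,c + Qj,e over all threads j."
  (loop for (qjr qjc qje) in thread-components
        sum (+ (funcall qjr omega u)
               (funcall qjc omega u)
               (funcall qje omega u))))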

Page 37: Concurrent Hierarchical   Reinforcement Learning

37

Outline

Running example
Background
Our contributions:
  Concurrent ALisp language for partial programs
  Basic learning method
  … with threadwise decomposition
  … and temporal decomposition
Results

Page 38: Concurrent Hierarchical   Reinforcement Learning

38

Resource gathering with 15 peasants

[Learning curve: reward of learnt policy vs. number of learning steps (x 1000), showing the Flat learner.]

Page 39: Concurrent Hierarchical   Reinforcement Learning

Resource gathering with 15 peasants

[Learning curve: reward of learnt policy vs. number of learning steps (x 1000), showing the Flat and Undecomposed learners.]

Page 40: Concurrent Hierarchical   Reinforcement Learning

Resource gathering with 15 peasants

[Learning curve: reward of learnt policy vs. number of learning steps (x 1000), showing the Flat, Undecomposed, and Threadwise learners.]

Page 41: Concurrent Hierarchical   Reinforcement Learning

Resource gathering with 15 peasants

[Learning curve: reward of learnt policy vs. number of learning steps (x 1000), showing the Flat, Undecomposed, Threadwise, and Threadwise + Temporal learners.]

Page 42: Concurrent Hierarchical   Reinforcement Learning

42

Conclusion

Contributions:
  Concurrent ALisp language for partial programs
  Algorithms for choice selection and learning
  Threadwise and temporal decomposition of the Q-function

Encode prior knowledge and use structure in the Q-function for fast learning in very complex multieffector domains

Questions?