FeUdal Networks for Hierarchical Reinforcement Learning by Artem Bachysnkyi Computational Neuroscience Seminar University of Tartu 3 May 2017


Page 1: FeUdal Networks for Hierarchical Reinforcement Learning

FeUdal Networks for Hierarchical Reinforcement Learning

by Artem Bachysnkyi, Computational Neuroscience Seminar

University of Tartu, 3 May 2017

Page 2: FeUdal Networks for Hierarchical Reinforcement Learning

Reinforcement learning

The basic reinforcement learning model consists of:

• a set of environment and agent states S
• a set of actions A of the agent
• policies of transitioning from states to actions
• rules that determine the scalar immediate reward of a transition
• rules that describe what the agent observes
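
One standard way to formalize these ingredients (textbook MDP-style notation, not taken from the slides): at step $t$ the agent observes $o_t = O(s_t)$, samples an action $a_t \sim \pi(\cdot \mid o_t)$, the environment transitions as $s_{t+1} \sim P(\cdot \mid s_t, a_t)$, and a scalar reward $r_t = R(s_t, a_t, s_{t+1})$ is returned.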

Page 3: FeUdal Networks for Hierarchical Reinforcement Learning

ATARI games

Page 4: FeUdal Networks for Hierarchical Reinforcement Learning

Standard approach

• use an action-repeat heuristic, where each action translates into several consecutive actions in the environment (see the sketch below)
• not applicable in non-Markovian environments that require memory
• can't learn on the weak reward signal
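
A minimal sketch of such an action-repeat wrapper, assuming a generic env with reset() and step(action) -> (observation, reward, done) (a hypothetical interface, not the authors' code):

class ActionRepeat:
    """Repeat each agent action for several consecutive environment steps."""
    def __init__(self, env, repeat=4):
        self.env = env
        self.repeat = repeat

    def reset(self):
        return self.env.reset()

    def step(self, action):
        obs, total_reward, done = None, 0.0, False
        for _ in range(self.repeat):
            obs, reward, done = self.env.step(action)
            total_reward += reward   # accumulate reward over the repeated steps
            if done:
                break
        return obs, total_reward, done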

Page 5: FeUdal Networks for Hierarchical Reinforcement Learning

Feudal reinforcement learning intuition

• levels of hierarchy within an agent communicate via explicit goals
• goals can be generated in a top-down fashion
• goal setting can be decoupled from goal achievement

Page 6: FeUdal Networks for Hierarchical Reinforcement Learning

Manager-Worker model

Manager:
• sets goals at a lower temporal resolution (see the sketch after this list)

Worker:
• operates at a higher temporal resolution
• produces primitive actions
• follows the goals via an intrinsic reward
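
A schematic sketch of the two temporal resolutions, under the simplifying assumption that the Manager only emits a new goal every c steps (hypothetical manager/worker/env objects, not the authors' implementation):

def hierarchical_episode(env, manager, worker, c=10):
    obs, done, goal, t = env.reset(), False, None, 0
    while not done:
        if t % c == 0:                         # Manager: lower temporal resolution
            goal = manager.set_goal(obs)
        action = worker.act(obs, goal)         # Worker: a primitive action every step
        obs, reward, done = env.step(action)
        worker.observe(obs, reward, goal)      # Worker is also rewarded intrinsically for following the goal
        t += 1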

Page 7: FeUdal Networks for Hierarchical Reinforcement Learning

Main proposals

• a consistent, end-to-end differentiable model
• approximate transition policy gradient update for training the Manager
• use of goals that are directional rather than absolute
• dilated LSTM for the Manager RNN design

Page 8: FeUdal Networks for Hierarchical Reinforcement Learning

FuN model description

Page 9: FeUdal Networks for Hierarchical Reinforcement Learning

FuN model description

$h^M, h^W$ – internal states of the Manager and Worker RNNs
$U_t$ – Worker's output
$\phi$ – maps $g_t$ into $w_t$
$\pi$ – vector of probabilities over primitive actions

$s_t$ – latent state representation
$g_t$ – goal vector
$x_t$ – observation from the environment
$z_t$ – shared intermediate representation
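
A schematic forward pass through the model in terms of these symbols (the sub-modules f_percept, f_Mspace, f_Mrnn, f_Wrnn and phi are placeholders here, not a working implementation of the paper):

import numpy as np

def fun_forward(x_t, f_percept, f_Mspace, f_Mrnn, f_Wrnn, phi):
    z_t = f_percept(x_t)          # shared intermediate representation
    s_t = f_Mspace(z_t)           # latent state representation
    g_t, h_M = f_Mrnn(s_t)        # Manager emits a goal vector and updates its state h^M
    w_t = phi(g_t)                # goal embedded into a low-dimensional vector w_t
    U_t, h_W = f_Wrnn(z_t)        # Worker output: one embedding row per primitive action
    logits = U_t @ w_t            # combine Worker output with the goal embedding
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()        # pi: probabilities over primitive actions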

Page 10: FeUdal Networks for Hierarchical Reinforcement Learning

Learning

Learning steps:
1. receive an observation from the environment
2. select an action from a finite set
3. the environment responds with a new observation and a scalar reward
4. the process continues until the terminal state is reached
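
These steps as a generic interaction loop (hypothetical env and agent objects, for illustration only):

def run_episode(env, agent):
    obs, done, total_reward = env.reset(), False, 0.0
    while not done:                            # 4. until the terminal state is reached
        action = agent.act(obs)                # 2. select an action from a finite set
        obs, reward, done = env.step(action)   # 1. & 3. new observation and scalar reward
        total_reward += reward
    return total_reward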

Page 11: FeUdal Networks for Hierarchical Reinforcement Learning

Learning

Bad idea:
train the feudal network end-to-end using a policy gradient algorithm operating on the actions taken by the Worker

Good idea:
independently train the Manager to predict advantageous directions in state space and to intrinsically reward the Worker to follow these directions

Page 12: FeUdal Networks for Hierarchical Reinforcement Learning

The agent's goal

Maximize the discounted return

$R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$

where $\gamma \in [0, 1)$ is a discount factor and $r_t$ is the reward at step $t$.

The agent's behaviour is defined by its action-selection policy π. FuN produces a distribution over possible actions.
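
A tiny worked example of the discounted return (illustration only):

def discounted_return(rewards, gamma=0.99):
    R = 0.0
    for r in reversed(rewards):   # fold the rewards back-to-front
        R = r + gamma * R
    return R

discounted_return([0.0, 0.0, 1.0])   # a reward of 1 arriving two steps later is worth 0.99**2 = 0.9801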

Page 13: FeUdal Networks for Hierarchical Reinforcement Learning

Manager's update rule

$\nabla g_t = A^M_t \, \nabla_\theta \, d_{\cos}(s_{t+c} - s_t, \, g_t(\theta))$

where

$V^M_t(x_t, \theta)$ – value function estimate from the internal critic

$d_{\cos}(\alpha, \beta) = \alpha^{\top}\beta / (|\alpha|\,|\beta|)$ – cosine similarity

$A^M_t = R_t - V^M_t(x_t, \theta)$ – advantage function
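
A rough sketch of the quantity this rule maximizes (shapes assumed; in practice the gradient with respect to the goal parameters is taken by automatic differentiation):

import numpy as np

def cosine_similarity(a, b, eps=1e-8):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def manager_objective(s_t, s_tc, g_t, R_t, V_t):
    advantage = R_t - V_t                                   # A^M_t from the internal critic
    return advantage * cosine_similarity(s_tc - s_t, g_t)   # rewards goals pointing in directions the state actually moved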

Page 14: FeUdal Networks for Hierarchical Reinforcement Learning

Worker's intrinsic reward

$r^I_t = \frac{1}{c} \sum_{i=1}^{c} d_{\cos}(s_t - s_{t-i}, \, g_{t-i})$

where

$c$ – horizon
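
The same quantity as a sketch, reusing cosine_similarity from the previous sketch (indexing assumed; illustration only):

def intrinsic_reward(states, goals, t, c):
    # states[t] is the latent state s_t; goals[t] is the goal emitted at step t
    terms = [cosine_similarity(states[t] - states[t - i], goals[t - i])
             for i in range(1, c + 1)]
    return sum(terms) / c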

Page 15: FeUdal Networks for Hierarchical Reinforcement Learning

The Worker's policy

Advantage actor-critic update

$\nabla \pi_t = A^D_t \, \nabla_\theta \log \pi(a_t \mid x_t; \theta)$

Advantage function

$A^D_t = R_t + \alpha R^I_t - V^D_t(x_t; \theta)$, where $\alpha$ weights the intrinsic reward and $V^D_t$ is the Worker's internal critic.
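
A sketch of the scalar whose gradient gives this update (alpha is a hyperparameter; log_pi_a is the log-probability of the action actually taken; maximized via automatic differentiation in practice):

def worker_objective(log_pi_a, R_t, R_int_t, V_t, alpha):
    advantage = R_t + alpha * R_int_t - V_t   # extrinsic return plus weighted intrinsic return
    return advantage * log_pi_a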

Page 16: FeUdal Networks for Hierarchical Reinforcement Learning

Architecture details

$f^{percept}$ – Convolutional Neural Network:
1. 16 8x8 filters, stride 4
2. 32 4x4 filters, stride 2
3. a fully connected layer with 256 hidden units
*each layer is followed by a rectified non-linearity

$f^{Mspace}$ – another fully connected layer
$f^{Wrnn}$ – standard LSTM
$f^{Mrnn}$ – dilated LSTM
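
A sketch of this perception network in PyTorch, assuming standard 84x84 single-channel Atari frames as input (the input size and framework are assumptions, not stated on the slide):

import torch
import torch.nn as nn

f_percept = nn.Sequential(
    nn.Conv2d(1, 16, kernel_size=8, stride=4),   # 16 8x8 filters, stride 4
    nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=4, stride=2),  # 32 4x4 filters, stride 2
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(32 * 9 * 9, 256),                  # fully connected layer, 256 hidden units
    nn.ReLU(),
)

z_t = f_percept(torch.zeros(1, 1, 84, 84))       # z_t has shape (1, 256)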

Page 17: FeUdal Networks for Hierarchical Reinforcement Learning

FuN model description

Page 18: FeUdal Networks for Hierarchical Reinforcement Learning

Dilated LSTM

State of the network with $r$ separate groups of sub-states: $h = \{\hat{h}^i\}_{i=1}^{r}$

At time $t$ we can indicate which group of cores is updated: the group with index $t \,\%\, r$.

At each time step only the corresponding part of the state is updated and the output is pooled across the previous c outputs. This allows the r groups of cores inside the dLSTM to preserve the memories for long periods.

*In the experiments r = 10.
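
A minimal NumPy sketch of this idea (cell is assumed to be any single-step recurrent cell with signature cell(x, h, c) -> (out, h, c); not the authors' implementation):

import numpy as np

class DilatedLSTM:
    def __init__(self, cell, hidden_size, r=10):
        self.cell, self.r, self.t = cell, r, 0
        self.h = [np.zeros(hidden_size) for _ in range(r)]     # r groups of sub-states
        self.c = [np.zeros(hidden_size) for _ in range(r)]
        self.outs = [np.zeros(hidden_size) for _ in range(r)]  # most recent output of each group

    def step(self, x):
        i = self.t % self.r                                    # only group i is updated at time t
        out, self.h[i], self.c[i] = self.cell(x, self.h[i], self.c[i])
        self.outs[i] = out
        self.t += 1
        return np.sum(self.outs, axis=0)                       # pool across the groups' latest outputs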

Page 19: FeUdal Networks for Hierarchical Reinforcement Learning

Experiments: ATARI

Page 20: FeUdal Networks for Hierarchical Reinforcement Learning

Experiments: Montezuma's revenge

https://www.youtube.com/watch?v=_zbg9rs5QZY

Page 21: FeUdal Networks for Hierarchical Reinforcement Learning

Experiments: Montezuma's revenge

Page 22: FeUdal Networks for Hierarchical Reinforcement Learning

Experiments: Non-match and T-maze

Page 23: FeUdal Networks for Hierarchical Reinforcement Learning

Experiments: Water maze

Page 24: FeUdal Networks for Hierarchical Reinforcement Learning

Experiments: transition policy gradient

Page 25: FeUdal Networks for Hierarchical Reinforcement Learning

Experiments: Temporal resolution

Page 26: FeUdal Networks for Hierarchical Reinforcement Learning

Experiments: Dilated LSTM agent baseline