An Introduction to Reinforcement Learning Anand Subramoney anand [at] igi.tugraz.at Institute for Theoretical Computer Science, TU Graz http://www.igi.tugraz.at/ Machine Learning Graz Meetup 12 th October 2017


Outline

• Introduction
• Value estimation
• Q-learning
• Policy gradient
• DQN
• A3C

What is Reinforcement Learning?

• An agent learns while interacting with the environment
• The agent receives a “reward” for each action it takes
• The goal of the agent is to maximize the reward it receives
• The agent is not told what the “right” action is, i.e. learning is not supervised

Notation

• The state of the environment is $s_t$ at time $t$
  • Examples of state: the (x, y) coordinates, image pixels, etc.
• At each time step $t$, the agent takes action $a_t$ (knowing $s_t$)
  • Examples of action: move right/left/up/down, acceleration of a car, etc.
• Then the agent gets a reward $r_{t+1}$
  • Could be 0/1 or points in the game
• The agent plays for one “episode”
  • Called “episodic” RL
  • E.g. one game until it wins/loses, etc.
  • Non-episodic RL is also possible

Notation

• Model: $\mathcal{P}^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$
  • What is the next state given the current state and the action taken?
  • The environment can be stochastic, in which case this is a probability distribution
• Reward: $\mathcal{R}^a_{ss'} = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\}$
  • Expected value of the reward when going from one state to another by taking a certain action
  • In the most general case, the reward is not deterministic

Policy

• The agent has a certain mapping between state and action
• This is called the policy of the agent
• Denoted by $\pi(s, a)$
• In the stochastic case, it is the probability distribution over actions at a given state: $\pi(s, a) = P(a_t = a \mid s_t = s)$

The goal of reinforcement learning

• Is to find a policy that maximizes the total expected reward
  • also called the “return”
• In an episode:

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$

• $\gamma$ is called the “discounting factor”
• Small $\gamma$ produces short-sighted, large $\gamma$ far-sighted policies.
• $R_t$ is always finite if $\gamma < 1$ and the local rewards $r$ come from a bounded set of numbers.
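
The effect of the discounting factor can be checked numerically. The following is a minimal sketch, assuming a finite episode; the function name `discounted_return` and the three-step reward list are illustrative and not part of the original slides.

```python
def discounted_return(rewards, gamma):
    """Compute R_t = sum_k gamma**k * r_{t+k+1} for a finite list of rewards."""
    total = 0.0
    # Work backwards through the episode: R_t = r_{t+1} + gamma * R_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical episode: no reward for two steps, then +1 at the goal.
rewards = [0.0, 0.0, 1.0]
print("short-sighted (gamma=0.1):", discounted_return(rewards, 0.1))
print("far-sighted   (gamma=0.9):", discounted_return(rewards, 0.9))
```

With a small $\gamma$ the delayed reward contributes almost nothing to the return, which is why such an agent behaves short-sightedly.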

Example environment

The agent receives a reward of -0.001 at every step. When it reaches the goal or a pit, it obtains a reward of +1.0 or -1.0, respectively, and the episode is terminated.

The goal of reinforcement learning

• How can the agent quantify the desirability of intermediate states (where no reward, or no relevant reward, is given)?
• The difficulty is that the desirability of intermediate states depends on:
  • the concrete selection of actions AFTER being in such an intermediate state,
  • AND the desirability of subsequent intermediate states.
• The value function allows us to do this

The value function

• Defined as: $V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\}$
• The value of a state $s$ is the expected return starting from that state $s$ and following policy $\pi$
• Satisfies the Bellman equations

Bellman equation for $V^\pi$:

$$V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma V^\pi(s') \right]$$

• This is a system of $|S|$ simultaneous linear equations

Note that it is a recursive formulation of the value function

Example value function

Calculating the value function

• If the model $\mathcal{P}^a_{ss'}$ and reward $\mathcal{R}^a_{ss'}$ are known, calculate $V^\pi(s)$ using iterative policy evaluation.

http://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_dp.html
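
Iterative policy evaluation repeatedly applies the Bellman equation as an update until the values stop changing. The sketch below assumes a hypothetical three-state chain (states 0, 1 and a terminal state 2) and a uniformly random policy; the data layout `P[s][a] = [(probability, next_state, reward), ...]` is an illustrative choice, not something taken from the slides.

```python
def policy_evaluation(P, pi, gamma, tol=1e-10):
    """Sweep V(s) <- sum_a pi(s,a) sum_s' P(s'|s,a) [R + gamma V(s')] until convergence."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Terminal states have no actions, so their value stays 0.
            v_new = sum(
                pi[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in outcomes)
                for a, outcomes in P[s].items()
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V


# Hypothetical chain: reaching state 2 gives reward +1 and ends the episode.
P = {
    0: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 1, 0.0)]},
    1: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 2, 1.0)]},
    2: {},
}
pi = {s: {"left": 0.5, "right": 0.5} for s in (0, 1)}  # uniformly random policy

V = policy_evaluation(P, pi, gamma=0.9)
print({s: round(v, 4) for s, v in V.items()})
```

State 1 ends up with a higher value than state 0 because it is one step closer to the rewarding transition.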

Why value function?

• There exists a natural partial order on all possible policies:

$$\pi' \geq \pi \text{ if and only if } V^{\pi'}(s) \geq V^\pi(s) \text{ for all } s \in S$$

• Definition: A policy $\pi'$ is called optimal if $\pi' \geq \pi$ for all policies $\pi$
• Existence of at least one optimal policy is guaranteed, and the value functions of optimal policies satisfy the Bellman optimality equations.

The action-value function

• Defined as: $Q^\pi(s, a) = E_\pi\{R_t \mid s_t = s, a_t = a\} = E_\pi\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a\}$
• This is called the “Q function”
• The value of taking action $a$ in state $s$ and following policy $\pi$ thereafter
• Also satisfies the Bellman equations

$$Q^\pi(s, a) = E_\pi\{r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s, a_t = a\} = \sum_{s'} \mathcal{P}^a_{ss'} \left[ \mathcal{R}^a_{ss'} + \gamma V^\pi(s') \right]$$

Finding an optimal policy

• Define a new policy $\pi'$ that is greedy with respect to $V^\pi$
  • For all states $s$: $\pi'(s) = \arg\max_a Q^\pi(s, a)$
  • This policy satisfies $Q^\pi(s, \pi'(s)) \geq V^\pi(s)$
• It can be shown that:
  • $\pi' \geq \pi$ for $\gamma < 1$
  • Repeating this improvement step eventually converges to an optimal policy
• This works only if $V^\pi(s)$ can be calculated
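
Alternating evaluation and greedy improvement is usually called policy iteration. The following is a minimal sketch under the assumption of a known, deterministic model; the three-state chain, the `P[s][a]` layout and all function names are hypothetical and chosen only for illustration.

```python
def evaluate(P, policy, gamma, tol=1e-10):
    """Compute V^pi for a deterministic policy by iterating the Bellman equation."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            if not P[s]:  # terminal state: value stays 0
                continue
            v_new = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][policy[s]])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V


def q_value(P, V, s, a, gamma):
    """Q^pi(s, a) = sum_s' P(s'|s,a) [R + gamma V^pi(s')]."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def policy_iteration(P, gamma):
    # Arbitrary starting policy: the first listed action in every non-terminal state.
    policy = {s: next(iter(P[s])) for s in P if P[s]}
    while True:
        V = evaluate(P, policy, gamma)
        # Greedy improvement: pi'(s) = argmax_a Q^pi(s, a).
        improved = {s: max(P[s], key=lambda a: q_value(P, V, s, a, gamma)) for s in policy}
        if improved == policy:
            return policy, V
        policy = improved


# Hypothetical chain: reaching state 2 gives reward +1 and ends the episode.
P = {
    0: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 1, 0.0)]},
    1: {"left": [(1.0, 0, 0.0)], "right": [(1.0, 2, 1.0)]},
    2: {},
}
policy, V = policy_iteration(P, gamma=0.9)
print(policy, V)
```

The loop stops as soon as the greedy policy no longer changes, at which point no single-state change of action can increase the value.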

Other ways to calculate V/Q

• Monte Carlo policy evaluation
  • Sample one episode and update the value function for each state
  • $V(s_t) \leftarrow V(s_t) + \alpha (R_t - V(s_t))$
  • Asymptotically converges to the true value function
• Temporal Difference (TD) learning
  • For each step of each episode:
    • Take action $a$, observe reward $r_{t+1}$ and next state $s_{t+1}$
    • $V(s_t) \leftarrow V(s_t) + \alpha (r_{t+1} + \gamma V(s_{t+1}) - V(s_t))$
  • The term $r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$ is the temporal difference
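
The two update rules differ only in their target: Monte Carlo waits for the full return, while TD bootstraps from the current estimate of the next state. A minimal sketch, assuming a tabular value function stored in a dict; the function names and the two-state episode are illustrative only.

```python
def mc_update(V, episode, alpha, gamma):
    """Monte Carlo update. `episode` is a list of (state, reward received after leaving it)."""
    G = 0.0
    # Walk backwards so that G is the return R_t from each visited state.
    for state, reward in reversed(episode):
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])


def td0_update(V, s, r, s_next, alpha, gamma, terminal=False):
    """TD(0) update after a single step; no bootstrapping from a terminal state."""
    target = r if terminal else r + gamma * V[s_next]
    V[s] += alpha * (target - V[s])


# Hypothetical episode: A -> B -> end, with reward +1 on the last transition.
V_mc = {"A": 0.0, "B": 0.0}
mc_update(V_mc, [("A", 0.0), ("B", 1.0)], alpha=0.5, gamma=0.9)
print("Monte Carlo:", V_mc)

V_td = {"A": 0.0, "B": 0.0}
td0_update(V_td, "A", 0.0, "B", alpha=0.5, gamma=0.9)
td0_update(V_td, "B", 1.0, None, alpha=0.5, gamma=0.9, terminal=True)
print("TD(0):      ", V_td)
```

After this single episode, Monte Carlo has already propagated the final reward back to state A, whereas TD(0) has only updated state B; A catches up in later episodes.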

Learning Q-function (SARSA)

• Q can be used to define a policy
  • Take action $a = \arg\max_{a'} Q(s, a')$ at every state with probability $1 - \epsilon$
  • With probability $\epsilon$ take a random action (exploration)
• Use temporal difference learning to learn the Q-function
  • For each step of each episode:
    • Take action $a$, observe reward $r_{t+1}$ and next state $s_{t+1}$
    • $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha (r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t))$
  • The action $a_{t+1}$ used in the update is chosen by this same policy
  • Called SARSA
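
A tabular version of SARSA with an $\epsilon$-greedy policy might look like the sketch below. The three-state chain environment, the `step` function and the hyperparameter values are assumptions made for illustration; they do not come from the slides.

```python
import random

ACTIONS = ["left", "right"]
GOAL = 2  # states are 0, 1, 2; state 2 is terminal


def step(state, action):
    """Hypothetical chain environment: reward +1 on reaching GOAL, 0 otherwise."""
    next_state = min(state + 1, GOAL) if action == "right" else max(state - 1, 0)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL


def epsilon_greedy(Q, state, epsilon, rng):
    """Random action with probability epsilon, otherwise greedy (ties broken randomly)."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(Q[(state, a)] for a in ACTIONS)
    return rng.choice([a for a in ACTIONS if Q[(state, a)] == best])


def sarsa(episodes=3000, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(Q, s, epsilon, rng)
        done = False
        while not done:
            s_next, r, done = step(s, a)
            # On-policy: a_next is the action the agent will actually take next.
            a_next = epsilon_greedy(Q, s_next, epsilon, rng)
            target = r if done else r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q


Q = sarsa()
for s in range(GOAL):
    print(s, {a: round(Q[(s, a)], 3) for a in ACTIONS})
```

Because the update uses the action actually selected by the exploring policy, the learned values reflect the $\epsilon$-greedy behaviour rather than the purely greedy one.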

Q-learning

• Use temporal difference learning to learn the Q-function
  • For each step of each episode:
    • Take action $a$, observe reward $r_{t+1}$ and next state $s_{t+1}$
    • $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha (r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t))$
• To converge to the optimal policy, Q-learning requires that rewards are sampled for each pair $(s, a)$ infinitely often.
• http://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_td.html
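
The only change relative to SARSA is the max over next actions in the target. The sketch below is a tabular illustration on a hypothetical deterministic three-state chain; the environment, names and hyperparameters are assumptions, and exploration with $\epsilon = 0.2$ stands in for the requirement that every pair $(s, a)$ keeps being visited.

```python
import random

ACTIONS = ["left", "right"]
GOAL = 2  # states are 0, 1, 2; state 2 is terminal


def step(state, action):
    """Hypothetical chain environment: reward +1 on reaching GOAL, 0 otherwise."""
    next_state = min(state + 1, GOAL) if action == "right" else max(state - 1, 0)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL


def q_learning(episodes=5000, alpha=0.1, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(GOAL + 1) for a in ACTIONS}
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            # Behaviour policy: epsilon-greedy with random tie-breaking.
            if rng.random() < epsilon:
                a = rng.choice(ACTIONS)
            else:
                best = max(Q[(s, b)] for b in ACTIONS)
                a = rng.choice([b for b in ACTIONS if Q[(s, b)] == best])
            s_next, r, done = step(s, a)
            # Off-policy target: best next action, whatever the agent does next.
            best_next = 0.0 if done else max(Q[(s_next, b)] for b in ACTIONS)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q


Q = q_learning()
for s in range(GOAL):
    print(s, {a: round(Q[(s, a)], 3) for a in ACTIONS})
```

In this deterministic chain the optimal values can be worked out by hand (for example, moving right from state 1 is worth 1 and moving right from state 0 is worth $\gamma \cdot 1 = 0.9$), which gives a simple check on the learned table.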

Function approximation

• The Q-function can be approximated with a neural network (or any other function approximator)
• The targets for the network would be $r_{t+1} + \gamma \max_a Q(s_{t+1}, a)$
• Train the neural network with backpropagation
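
To keep the example dependency-free, the sketch below swaps the neural network for a linear approximator $Q(s, a) = w \cdot \phi(s, a)$; this is a simplification, not the method on the slide. The feature vectors and function names are hypothetical. The update is one gradient step on the squared error between the prediction and the target, with the target treated as a constant.

```python
def q_hat(w, features):
    """Linear approximation: Q(s, a) = w . phi(s, a)."""
    return sum(wi * xi for wi, xi in zip(w, features))


def td_target(r, next_features_per_action, w, gamma, done):
    """Target r + gamma * max_a Q(s', a); just r if the episode ended."""
    if done:
        return r
    return r + gamma * max(q_hat(w, f) for f in next_features_per_action)


def sgd_step(w, features, target, lr):
    """One gradient step on 0.5 * (target - Q(s, a))**2 with the target held fixed."""
    error = target - q_hat(w, features)
    return [wi + lr * error * xi for wi, xi in zip(w, features)]


# Hypothetical transition with two-dimensional features phi(s, a).
w = [0.0, 0.0]
phi_sa = [1.0, 0.0]                    # features of the (state, action) just taken
phi_next = [[0.0, 1.0], [1.0, 0.0]]    # features of each action in the next state
target = td_target(1.0, phi_next, w, gamma=0.9, done=False)
w = sgd_step(w, phi_sa, target, lr=0.5)
print("target:", target, "updated weights:", w)
```

With a neural network the structure is the same: the target is computed from the current estimates, and backpropagation supplies the gradient of the prediction with respect to the weights.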

The goal of reinforcement learning (repeated)

• Is to find a policy that maximizes the total expected reward
  • also called the “return”

$$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$$

• $\gamma$ is called the “discounting factor”
• Small $\gamma$ produces short-sighted, large $\gamma$ far-sighted policies.
• $R_t$ is always finite if $\gamma < 1$ and the local rewards $r$ come from a bounded set of numbers.

Policy Gradient

• Why not learn the policy directly?
• Define the objective function as the total expected reward:

$$J(\theta) = E\left\{ \sum_{k=0}^{H} a_k r_k \right\} = E\{ r(\tau) \}$$

• $a_k$ is some discounting factor
• $r_k$ is the reward at step $k$
• $\tau$ is a trajectory and $r(\tau) = \sum_{k=0}^{H} a_k r_k$, where $H$ is the last step of the trajectory
• Learn the policy parameters $\theta$ using gradient ascent with step size $\eta$:

$$\theta_{t+1} = \theta_t + \eta \nabla_\theta J(\theta)$$

• Problems?
  • Cannot calculate the gradient of $J$ directly

Policy Gradient

• It is possible to empirically estimate the gradient (Williams 1992)

$$\nabla_\theta J(\theta) = E\{ \nabla_\theta \log p_\theta(\tau) (r(\tau) - b) \} = E\left\{ \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t \mid s_t) (R_t - b) \right\}$$

• Uses the log-likelihood trick (or REINFORCE trick)
• The baseline $b$ is used to reduce the variance of the gradient estimator
• The baseline does not introduce bias
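
The estimator can be tried on the simplest possible case: a two-armed bandit, where every episode is a single step so that $R_t$ is just the immediate reward. The sketch below assumes a softmax policy over two parameters and uses a running average of past rewards as the baseline $b$; the bandit, the learning rates and the function names are all illustrative assumptions.

```python
import math
import random


def softmax(theta):
    """Convert preferences theta into action probabilities."""
    m = max(theta)
    exps = [math.exp(x - m) for x in theta]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_rewards, steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0] * len(arm_rewards)
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(len(theta)), weights=probs)[0]
        reward = arm_rewards[action]
        # For a softmax policy: d/d theta_i log pi(action) = 1[i == action] - pi(i).
        for i in range(len(theta)):
            grad_log_pi = (1.0 if i == action else 0.0) - probs[i]
            theta[i] += lr * grad_log_pi * (reward - baseline)
        # Running average of past rewards, updated only after the gradient step
        # so that it does not depend on the current action.
        baseline += 0.1 * (reward - baseline)
    return softmax(theta)


# Hypothetical bandit: arm 0 always pays 0, arm 1 always pays 1.
print(reinforce_bandit([0.0, 1.0]))
```

Each step nudges the log-probability of the chosen action up or down in proportion to how much better or worse than the baseline its reward was.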

DQN and A3C

DQN

• Mnih, V. et al. Human-level control through deep reinforcement learning. Nature 518, 529–533 (2015).
• Uses a deep neural network to learn the Q-values

DQN: Two key ideas

• Experience replay:
  • Store earlier steps and apply Q-learning updates in random batches from this memory
• Update the target network (the copy of the Q-network used to compute the targets) only once every C steps
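
The control flow behind these two ideas can be sketched without any neural network. In the code below the "networks" are stand-in dicts and the gradient step is a placeholder counter; `ReplayBuffer`, `train_loop` and `sync_every` (playing the role of C) are hypothetical names, and none of this reproduces the actual DQN implementation.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size memory of past transitions, sampled uniformly at random."""

    def __init__(self, capacity, seed=0):
        self.memory = deque(maxlen=capacity)  # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def push(self, transition):
        self.memory.append(transition)

    def sample(self, batch_size):
        return self.rng.sample(list(self.memory), batch_size)

    def __len__(self):
        return len(self.memory)


def train_loop(num_steps, sync_every, batch_size=4):
    buffer = ReplayBuffer(capacity=100)
    online = {"version": 0}  # stands in for the weights being trained
    target = dict(online)    # frozen copy used to compute the Q-learning targets
    syncs = 0
    for step in range(1, num_steps + 1):
        # Placeholder transition (state, action, reward, next_state).
        buffer.push((step, "action", 0.0, step + 1))
        if len(buffer) >= batch_size:
            batch = buffer.sample(batch_size)  # random batch breaks temporal correlation
            online["version"] += 1             # placeholder for a gradient step on `batch`
        if step % sync_every == 0:
            target = dict(online)              # copy the weights only every C steps
            syncs += 1
    return online, target, syncs


online, target, syncs = train_loop(num_steps=22, sync_every=5)
print(online, target, syncs)
```

Between synchronisations the targets are computed from a fixed set of weights, so they do not shift with every gradient step.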


A3C

• Mnih, V. et al. Asynchronous Methods for Deep Reinforcement Learning. arXiv:1602.01783 [cs] (2016).
• A3C: Asynchronous Advantage Actor Critic
• Uses policy gradient with a baseline that is the value function

$$\nabla_\theta J(\theta) = E\left\{ \sum_{t=0}^{H} \nabla_\theta \log \pi_\theta(a_t \mid s_t) (R_t - V(s_t)) \right\}$$

• Advantage: $R_t - V(s_t)$
• Actor: the policy $\pi_\theta$
• Critic: the value function $V$
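
The advantage term can be computed from one episode's rewards and the critic's value estimates. The sketch below uses the full discounted return $R_t$ exactly as in the formula above; the published A3C algorithm works with truncated, bootstrapped returns, which this simplified illustration does not attempt to reproduce. The function name and the numbers are hypothetical.

```python
def advantages(rewards, values, gamma):
    """Return [R_t - V(s_t)] for every step t of one finished episode.

    rewards[t] is the reward received after step t, values[t] is the critic's V(s_t).
    """
    result = []
    G = 0.0
    # Accumulate the return backwards: R_t = r_{t+1} + gamma * R_{t+1}.
    for reward, value in zip(reversed(rewards), reversed(values)):
        G = reward + gamma * G
        result.append(G - value)
    result.reverse()
    return result


# Hypothetical episode: reward only at the end, critic predicts 0.5 everywhere.
print(advantages([0.0, 0.0, 1.0], [0.5, 0.5, 0.5], gamma=1.0))
```

A positive advantage means the episode went better than the critic expected from that state, so the actor's update makes the chosen action more likely.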


Resources

• Book: Reinforcement Learning: An Introduction, Richard Sutton and Andrew Barto
  • Available online on Richard Sutton's website: http://www.incompleteideas.net/sutton/book/the-book-1st.html
• Course: Autonomously Learning Systems, IGI, TU Graz
  • 2016 website: http://www.igi.tugraz.at/lehre/Autonomously_learning_systems/WS16/
  • Next course in 2018
  • Lecture slides available there
• DQN: https://deepmind.com/research/dqn/
• OpenAI Gym: https://gym.openai.com/envs
• Deep Reinforcement Learning: Pong from Pixels (Andrej Karpathy): https://karpathy.github.io/2016/05/31/rl/
• Book: Deep Learning, Ian Goodfellow, Yoshua Bengio and Aaron Courville
  • Available online: http://www.deeplearningbook.org
• RLPy: https://rlpy.readthedocs.io/en/latest/ (Python 2.7 only)