Introduction to Reinforcement Learning
Deep Reinforcement Learning
Reference
• Textbook: Reinforcement Learning: An Introduction
  • http://incompleteideas.net/sutton/book/the-book.html
• Lectures of David Silver
  • http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html (10 lectures, around 1.5 hours each)
  • http://videolectures.net/rldm2015_silver_reinforcement_learning/ (Deep Reinforcement Learning)
• Lectures of John Schulman
  • https://youtu.be/aUrX-rP_ss4
Scenario of Reinforcement Learning
[Diagram] The Agent receives an Observation (the State of the Environment) and takes an Action, which changes the environment; here the action earns a negative Reward ("Don't do that").
Scenario of Reinforcement Learning
[Diagram] Again the Agent observes a State, takes an Action that changes the environment, and receives a Reward; this time a positive one ("Thank you.").
Agent learns to take actions maximizing expected reward.
(Image: https://yoast.com/how-to-clean-site-structure/)
Machine Learning ≈ Looking for a Function
[Diagram] The Observation from the Environment is the function input; the Reward is used to pick the best function; the Action is the function output:
Action = π(Observation)
The function π is the Actor/Policy.
Learning to play Go
[Diagram] The agent observes the board (Observation) and plays the Next Move (Action); the Environment returns a Reward.
Learning to play Go
[Diagram] The reward:
• If win, reward = 1
• If loss, reward = -1
• reward = 0 in most cases
Agent learns to take actions maximizing expected reward.
Learning to play Go
• Supervised: learning from a teacher
  Next move: "5-5". Next move: "3-3".
• Reinforcement Learning: learning from experience
  First move …… many moves …… Win!
  (Two agents play with each other.)
Alpha Go is supervised learning + reinforcement learning.
Learning a chat-bot
• Machine obtains feedback from user
• Chat-bot learns to maximize the expected reward
  "How are you?" → "Bye bye ☺"  reward: -10
  "Hello" → "Hi ☺"  reward: 3
(Images: https://image.freepik.com/free-vector/variety-of-human-avatars_23-2147506285.jpg, http://www.freepik.com/free-vector/variety-of-human-avatars_766615.htm)
Learning a chat-bot
• Let two agents talk to each other (sometimes they generate good dialogue, sometimes bad)
Bad dialogue:
  "How old are you?" / "See you." / "See you." / "See you."
Good dialogue:
  "How old are you?" / "I am 16." / "I thought you were 12." / "What makes you think so?"
Learning a chat-bot
• By this approach, we can generate a lot of dialogues.
• Use some pre-defined rules to evaluate the goodness of a dialogue
  Dialogue 1  Dialogue 2  Dialogue 3  Dialogue 4
  Dialogue 5  Dialogue 6  Dialogue 7  Dialogue 8
Machine learns from the evaluation.
Deep Reinforcement Learning for Dialogue Generation: https://arxiv.org/pdf/1606.01541v3.pdf
Learning a chat-bot
• Supervised: learning from a teacher
  "Hello" → say "Hi"
  "Bye bye" → say "Good bye"
• Reinforcement: the agent carries out whole dialogues ("Hello ☺" …… …… ……) and learns only from the overall feedback at the end (e.g., "Bad")
More applications
• Flying Helicopter
  • https://www.youtube.com/watch?v=0JL04JJjocc
• Driving
  • https://www.youtube.com/watch?v=0xo1Ldx3L5Q
• Robot
  • https://www.youtube.com/watch?v=370cT-OAzzM
• Google Cuts Its Giant Electricity Bill With DeepMind-Powered AI
  • http://www.bloomberg.com/news/articles/2016-07-19/google-cuts-its-giant-electricity-bill-with-deepmind-powered-ai
• Text generation
  • https://www.youtube.com/watch?v=pbQ4qe8EwLo
Example: Playing Video Game
• Widely studied platforms:
  • Gym: https://gym.openai.com/
  • Universe: https://openai.com/blog/universe/
• Machine learns to play video games as human players do
  • Machine learns to take proper actions by itself
  • What the machine observes is pixels
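As a concrete illustration, here is a minimal sketch of this observation–action–reward loop with the classic Gym API. This is our own code, not from the lecture: the environment id and the pre-0.26 `reset`/`step` signatures are assumptions and may differ in your installed Gym version.

```python
import gym

# A random agent: replace the action choice with a learned policy later.
env = gym.make("SpaceInvaders-v0")   # illustrative env id; any Atari game works
observation = env.reset()            # pixels of the first screen (s_1)

total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()                   # left/right/fire, ...
    observation, reward, done, info = env.step(action)   # s_{t+1}, r_t
    total_reward += reward                               # score change = reward

print("Total reward of this episode:", total_reward)
```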
Example: Playing Video Game
• Space invader
  [Screenshot: the Score (reward), your shield, and the "fire" action]
  • Kill the aliens to obtain the score (reward)
  • Termination: all the aliens are killed, or your spaceship is destroyed.
Example: Playing Video Game
• Space invader
  • Play yourself: http://www.2600online.com/spaceinvaders.html
  • How about a machine: https://gym.openai.com/evaluations/eval_Eduozx4HRyqgTCVk9ltw
Start with observation $s_1$, then observation $s_2$, observation $s_3$, ……
Example: Playing Video Game
Start with observation $s_1$:
  Action $a_1$: "right"; obtain reward $r_1 = 0$ → observation $s_2$
  Action $a_2$: "fire" (kill an alien); obtain reward $r_2 = 5$ → observation $s_3$
Usually there is some randomness in the environment.
Example: Playing Video Game
After many turns:
  Action $a_T$; obtain reward $r_T$. Game over (spaceship destroyed).
This is an episode.
Learn to maximize the expected cumulative reward per episode.
Properties of Reinforcement Learning
• Reward delay
  • In space invader, only "fire" obtains reward
    • Although moving before "fire" is important
  • In Go playing, it may be better to sacrifice immediate reward to gain more long-term reward
• Agent's actions affect the subsequent data it receives
  • E.g. exploration
Outline
• Policy-based approach: learning an Actor
• Value-based approach: learning a Critic
• Actor + Critic
• Model-free vs. model-based approaches
Alpha Go: policy-based + value-based + model-based
Policy-based Approach: Learning an Actor
Three Steps for Deep Learning
• Step 1: define a set of functions — Neural Network as Actor
• Step 2: goodness of function
• Step 3: pick the best function
Deep Learning is so simple ……
Neural network as Actor
• Input of the neural network: the observation of the machine, represented as a vector or a matrix
• Output of the neural network: each action corresponds to a neuron in the output layer
[Diagram] pixels → NN as actor → probability of taking each action: left 0.1, right 0.2, fire 0.7
What is the benefit of using a network instead of a lookup table? Generalization.
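A minimal sketch of such an actor network in PyTorch (our own illustrative code, not from the slides; the 80×80-pixel input size and the three actions are assumptions):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps an observation (pixels) to a probability for each action."""
    def __init__(self, n_pixels: int = 80 * 80, n_actions: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_pixels, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, observation: torch.Tensor) -> torch.Tensor:
        # Softmax turns the output neurons into action probabilities,
        # e.g. left 0.1, right 0.2, fire 0.7.
        return torch.softmax(self.net(observation), dim=-1)

actor = Actor()
probs = actor(torch.rand(80 * 80))   # a fake observation
print(probs)                         # three probabilities summing to 1
```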
Three Steps for Deep Learning
• Step 1: define a set of functions — Neural Network as Actor
• Step 2: goodness of function
• Step 3: pick the best function
Deep Learning is so simple ……
Goodness of Actor
• Review: Supervised learning
[Diagram] Given a set of parameters $\theta$, the network takes a training example $x = (x_1, x_2, \dots, x_{256})$ and produces softmax outputs $y_1, y_2, \dots, y_{10}$, which should be as close as possible to the one-hot target (e.g., "1" → $1, 0, 0, \dots$); the mismatch is the loss $l$.
Total loss over all training examples: $L = \sum_{n=1}^{N} l_n$
Goodness of Actor
• Given an actor $\pi_\theta(s)$ with network parameter $\theta$
• Use the actor $\pi_\theta(s)$ to play the video game:
  • Start with observation $s_1$
  • Machine decides to take $a_1$
  • Machine obtains reward $r_1$
  • Machine sees observation $s_2$
  • Machine decides to take $a_2$
  • Machine obtains reward $r_2$
  • Machine sees observation $s_3$
  • ……
  • Machine decides to take $a_T$
  • Machine obtains reward $r_T$ (END)
Total reward: $R_\theta = \sum_{t=1}^{T} r_t$
Even with the same actor, $R_\theta$ is different each time, because of randomness in the actor and in the game.
We define $\bar{R}_\theta$ as the expected value of $R_\theta$;
$\bar{R}_\theta$ evaluates the goodness of an actor $\pi_\theta(s)$.
Goodness of Actor
• A trajectory $\tau = \{s_1, a_1, r_1, s_2, a_2, r_2, \dots, s_T, a_T, r_T\}$
$P(\tau|\theta) = p(s_1)\, p(a_1|s_1,\theta)\, p(r_1, s_2|s_1, a_1)\, p(a_2|s_2,\theta)\, p(r_2, s_3|s_2, a_2) \cdots = p(s_1) \prod_{t=1}^{T} p(a_t|s_t,\theta)\, p(r_t, s_{t+1}|s_t, a_t)$
$p(a_t|s_t,\theta)$ is controlled by your actor $\pi_\theta$; the other terms are not related to your actor.
Example: at $s_t$ the actor $\pi_\theta$ outputs left 0.1, right 0.2, fire 0.7, so $p(a_t = \text{"fire"}|s_t, \theta) = 0.7$.
Goodness of Actor
• An episode is considered as a trajectory $\tau$
  • $\tau = \{s_1, a_1, r_1, s_2, a_2, r_2, \dots, s_T, a_T, r_T\}$
  • $R(\tau) = \sum_{t=1}^{T} r_t$
• If you use an actor to play the game, each $\tau$ has a probability to be sampled
  • The probability depends on the actor parameter $\theta$: $P(\tau|\theta)$
$\bar{R}_\theta = \sum_{\tau} R(\tau) P(\tau|\theta) \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n)$
The sum is over all possible trajectories; in practice, use $\pi_\theta$ to play the game $N$ times, obtaining $\tau^1, \tau^2, \dots, \tau^N$. This amounts to sampling $\tau$ from $P(\tau|\theta)$ $N$ times.
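In code, this sampling approximation of $\bar{R}_\theta$ is just an average of episode returns. A sketch (ours; the `play_one_episode` helper is hypothetical — it should run the current actor $\pi_\theta$ in the environment and return the reward sequence):

```python
def estimate_expected_reward(play_one_episode, n_episodes: int = 100) -> float:
    """Approximate R̄_θ = Σ_τ R(τ)P(τ|θ) by sampling N trajectories."""
    total = 0.0
    for _ in range(n_episodes):
        rewards = play_one_episode()   # sample τ^n from P(τ|θ)
        total += sum(rewards)          # R(τ^n) = Σ_t r_t
    return total / n_episodes          # (1/N) Σ_n R(τ^n)
```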
Three Steps for Deep Learning
• Step 1: define a set of functions — Neural Network as Actor
• Step 2: goodness of function
• Step 3: pick the best function
Deep Learning is so simple ……
Gradient Ascent
• Problem statement: $\theta^\ast = \arg\max_\theta \bar{R}_\theta$
• Gradient ascent:
  • Start with $\theta^0$
  • $\theta^1 \leftarrow \theta^0 + \eta \nabla \bar{R}_{\theta^0}$
  • $\theta^2 \leftarrow \theta^1 + \eta \nabla \bar{R}_{\theta^1}$
  • ……
Here $\theta = \{w_1, w_2, \dots, b_1, \dots\}$ and $\nabla \bar{R}_\theta = \left[ \partial \bar{R}_\theta / \partial w_1, \; \partial \bar{R}_\theta / \partial w_2, \; \dots, \; \partial \bar{R}_\theta / \partial b_1, \; \dots \right]^\top$
Policy Gradient
$\bar{R}_\theta = \sum_{\tau} R(\tau) P(\tau|\theta)$. What is $\nabla \bar{R}_\theta$?
$\nabla \bar{R}_\theta = \sum_{\tau} R(\tau) \nabla P(\tau|\theta) = \sum_{\tau} R(\tau) P(\tau|\theta) \frac{\nabla P(\tau|\theta)}{P(\tau|\theta)} = \sum_{\tau} R(\tau) P(\tau|\theta) \nabla \log P(\tau|\theta) \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n) \nabla \log P(\tau^n|\theta)$
(using $\frac{d \log f(x)}{dx} = \frac{1}{f(x)} \frac{d f(x)}{dx}$; for the last step, use $\pi_\theta$ to play the game $N$ times and obtain $\tau^1, \tau^2, \dots, \tau^N$)
$R(\tau)$ does not have to be differentiable; it can even be a black box.
Policy Gradient
• $\tau = \{s_1, a_1, r_1, s_2, a_2, r_2, \dots, s_T, a_T, r_T\}$
$P(\tau|\theta) = p(s_1) \prod_{t=1}^{T} p(a_t|s_t, \theta)\, p(r_t, s_{t+1}|s_t, a_t)$
What is $\nabla \log P(\tau|\theta)$?
$\log P(\tau|\theta) = \log p(s_1) + \sum_{t=1}^{T} \left[ \log p(a_t|s_t, \theta) + \log p(r_t, s_{t+1}|s_t, a_t) \right]$
Ignoring the terms not related to $\theta$:
$\nabla \log P(\tau|\theta) = \sum_{t=1}^{T} \nabla \log p(a_t|s_t, \theta)$
Policy Gradient
$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} R(\tau^n) \nabla \log P(\tau^n|\theta) = \frac{1}{N} \sum_{n=1}^{N} R(\tau^n) \sum_{t=1}^{T_n} \nabla \log p(a_t^n|s_t^n, \theta) = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n) \nabla \log p(a_t^n|s_t^n, \theta)$
$\theta^{new} \leftarrow \theta^{old} + \eta \nabla \bar{R}_{\theta^{old}}$
If in $\tau^n$ the machine takes $a_t^n$ when seeing $s_t^n$:
  • $R(\tau^n)$ is positive → tune $\theta$ to increase $p(a_t^n|s_t^n)$
  • $R(\tau^n)$ is negative → tune $\theta$ to decrease $p(a_t^n|s_t^n)$
It is very important to consider the cumulative reward $R(\tau^n)$ of the whole trajectory $\tau^n$ instead of the immediate reward $r_t^n$. What if we replaced $R(\tau^n)$ with $r_t^n$? Then only actions that immediately yield reward (e.g., "fire") would ever be reinforced, even though the moves before "fire" matter too.
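A sketch of this update in PyTorch (our illustration, not code from the lecture; `actor` is the `Actor` network sketched earlier). Since autograd minimizes a loss, we maximize $\frac{1}{N}\sum_n \sum_t R(\tau^n)\log p(a_t^n|s_t^n,\theta)$ by minimizing its negation:

```python
import torch
from torch.distributions import Categorical

optimizer = torch.optim.SGD(actor.parameters(), lr=1e-3)  # learning rate η

def policy_gradient_step(episodes):
    """episodes: list of (observations, actions, R) tuples, one per trajectory τ^n:
    observations [T_n, n_pixels] float, actions [T_n] long, R = R(τ^n) scalar."""
    loss = 0.0
    for observations, actions, R in episodes:
        probs = actor(observations)                   # p(a|s,θ) for each step
        log_p = Categorical(probs).log_prob(actions)  # log p(a_t^n | s_t^n, θ)
        loss = loss - R * log_p.sum()                 # −R(τ^n) Σ_t log p(...)
    loss = loss / len(episodes)                       # the 1/N factor
    optimizer.zero_grad()
    loss.backward()        # gradient of the sampled surrogate for −R̄_θ
    optimizer.step()       # descending on −R̄_θ, i.e. θ ← θ + η∇R̄_θ
```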
Policy Gradient
In practice, iterate between data collection and a model update.
Data collection: given actor parameter $\theta$, play the game and record
  $\tau^1$: $(s_1^1, a_1^1)$ with weight $R(\tau^1)$, $(s_2^1, a_2^1)$ with weight $R(\tau^1)$, ……
  $\tau^2$: $(s_1^2, a_1^2)$ with weight $R(\tau^2)$, $(s_2^2, a_2^2)$ with weight $R(\tau^2)$, ……
Model update:
$\theta \leftarrow \theta + \eta \nabla \bar{R}_\theta$, where $\nabla \bar{R}_\theta = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n) \nabla \log p(a_t^n|s_t^n, \theta)$
Policy Gradient — Considered as a Classification Problem
[Diagram] The network takes $s$ and outputs probabilities for left/right/fire; the action actually taken (e.g., "left") serves as the target class $\hat{y} = (1, 0, 0)$.
Minimize the cross-entropy $-\sum_{i=1}^{3} \hat{y}_i \log y_i$ ⇔ Maximize $\log y_i = \log p(\text{"left"}|s)$
$\theta \leftarrow \theta + \eta \nabla \log p(\text{"left"}|s)$
Policy Gradient
$\theta \leftarrow \theta + \eta \nabla \bar{R}_\theta$, where $\nabla \bar{R}_\theta = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n) \nabla \log p(a_t^n|s_t^n, \theta)$
Given actor parameter $\theta$, collect
  $\tau^1$: $(s_1^1, a_1^1)$ $R(\tau^1)$, $(s_2^1, a_2^1)$ $R(\tau^1)$, ……
  $\tau^2$: $(s_1^2, a_1^2)$ $R(\tau^2)$, $(s_2^2, a_2^2)$ $R(\tau^2)$, ……
and treat each step as a classification example:
  $s_1^1$ → NN → $a_1^1 = \text{left}$, target $(1, 0, 0)$
  $s_1^2$ → NN → $a_1^2 = \text{fire}$, target $(0, 0, 1)$
Policy Gradient
$\theta \leftarrow \theta + \eta \nabla \bar{R}_\theta$, where $\nabla \bar{R}_\theta = \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n) \nabla \log p(a_t^n|s_t^n, \theta)$
Given actor parameter $\theta$, collect
  $\tau^1$: $(s_1^1, a_1^1)$ $R(\tau^1)$, $(s_2^1, a_2^1)$ $R(\tau^1)$, ……
  $\tau^2$: $(s_1^2, a_1^2)$ $R(\tau^2)$, $(s_2^2, a_2^2)$ $R(\tau^2)$, ……
Each training example is weighted by $R(\tau^n)$: e.g., with $R(\tau^1) = 2$ the example ($s_1^1$ → NN → $a_1^1 = \text{left}$) is counted twice, while with $R(\tau^2) = 1$ the example ($s_1^2$ → NN → $a_1^2 = \text{fire}$) is counted once. A sketch of this weighted loss follows below.
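The weighted-classification view translates directly into code. A sketch (ours, with hypothetical tensors; note it uses raw logits rather than the softmax outputs of the earlier `Actor`): the standard cross-entropy of each (state, action) pair is simply multiplied by the return of its trajectory.

```python
import torch
import torch.nn.functional as F

def weighted_classification_loss(logits, actions, returns):
    """logits:  [B, n_actions] raw network outputs for a batch of states s_t^n
    actions: [B] the actions a_t^n actually taken (class labels)
    returns: [B] R(τ^n) of the trajectory each step belongs to."""
    # cross_entropy = −log p(a_t^n | s_t^n, θ); weighting it by R(τ^n) gives
    # the surrogate for −(1/N) Σ_n Σ_t R(τ^n) log p(a_t^n|s_t^n, θ),
    # up to the averaging constant.
    ce = F.cross_entropy(logits, actions, reduction="none")  # [B]
    return (returns * ce).mean()
```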
Add a Baseline
$\theta^{new} \leftarrow \theta^{old} + \eta \nabla \bar{R}_{\theta^{old}}$, with $\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n) \nabla \log p(a_t^n|s_t^n, \theta)$
It is possible that $R(\tau^n)$ is always positive.
• Ideal case: the probabilities of all actions (a, b, c) are adjusted in proportion to their rewards.
• Sampling: because the outputs form a probability distribution and only some actions are sampled, the probability of the actions that were not sampled will decrease, even if they are good.
Fix: subtract a baseline $b$ so the weight can be negative:
$\nabla \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} \left( R(\tau^n) - b \right) \nabla \log p(a_t^n|s_t^n, \theta)$
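The slides leave $b$ unspecified; one simple and common choice (an assumption on our part) is the average return over the sampled trajectories:

```python
import torch

def returns_with_baseline(returns: torch.Tensor) -> torch.Tensor:
    """Replace R(τ^n) with R(τ^n) − b, using the batch mean as baseline b.

    After subtraction, roughly half the weights are negative, so actions
    from below-average trajectories are actively suppressed instead of
    merely being increased more slowly.
    """
    b = returns.mean()   # baseline b (illustrative choice)
    return returns - b
```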
Value-based Approach: Learning a Critic
Critic
• A critic does not determine the action.
• Given an actor π, it evaluates how good the actor is.
An actor can be found from a critic, e.g. Q-learning.
(Image: http://combiboilersleeds.com/picaso/critics/critics-4.html)
Critic
• State value function $V^\pi(s)$
  • When using actor π, the expected cumulative reward obtained after seeing observation (state) s
[Diagram] $s$ → $V^\pi$ → scalar $V^\pi(s)$
[Two game screenshots: in one state $V^\pi(s)$ is large; in the other, $V^\pi(s)$ is smaller.]
Critic
The value depends on the actor, not only on the state (evaluating the same Go move, 大馬步飛):
$V^\pi$(大馬步飛) = bad when π is the old (weaker) 阿光
$V^\pi$(大馬步飛) = good when π is the 阿光 who has become stronger
How to estimate $V^\pi(s)$
• Monte-Carlo based approach
  • The critic watches π playing the game.
  • After seeing $s_a$, the cumulated reward until the end of the episode is $G_a$: train $V^\pi(s_a) \leftrightarrow G_a$
  • After seeing $s_b$, the cumulated reward until the end of the episode is $G_b$: train $V^\pi(s_b) \leftrightarrow G_b$
How to estimate $V^\pi(s)$
• Temporal-difference approach
  • Given $\cdots s_t, a_t, r_t, s_{t+1} \cdots$, we know $V^\pi(s_t) = r_t + V^\pi(s_{t+1})$,
    so train the two estimates such that $V^\pi(s_t) - V^\pi(s_{t+1}) \leftrightarrow r_t$
Some applications have very long episodes, so delaying all learning until an episode's end is too slow; TD can learn from every step.
MC vs. TD
• MC: $V^\pi(s_a) \leftrightarrow G_a$ — larger variance, but unbiased
• TD: $V^\pi(s_t) \leftrightarrow r_t + V^\pi(s_{t+1})$ — smaller variance, but $V^\pi(s_{t+1})$ may be biased
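A sketch contrasting the two regression targets (illustrative Python, not lecture code; `V` is a hypothetical value-function approximator with a `predict` method):

```python
def mc_targets(rewards):
    """Monte-Carlo: after each s_t, regress V(s_t) toward G_t, the
    cumulated reward from t until the end of the episode."""
    targets, G = [], 0.0
    for r in reversed(rewards):        # accumulate backwards: G_t = r_t + G_{t+1}
        G = r + G
        targets.append(G)
    return list(reversed(targets))     # one target per visited state

def td_targets(states, rewards, V):
    """Temporal-difference: regress V(s_t) toward r_t + V(s_{t+1}),
    so learning can happen before the episode ends."""
    return [r + (V.predict(states[t + 1]) if t + 1 < len(states) else 0.0)
            for t, r in enumerate(rewards)]
```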
MC vs. TD
• The critic observes the following 8 episodes (the actions are ignored here) [Sutton, v2, Example 6.4]:
  • $s_a, r = 0$, $s_b, r = 0$, END
  • $s_b, r = 1$, END
  • $s_b, r = 1$, END
  • $s_b, r = 1$, END
  • $s_b, r = 1$, END
  • $s_b, r = 1$, END
  • $s_b, r = 1$, END
  • $s_b, r = 0$, END
$V^\pi(s_b) = 3/4$ (six of the eight visits to $s_b$ are followed by reward 1)
$V^\pi(s_a) = {}?$  0? 3/4?
• Monte-Carlo: the only episode through $s_a$ has return 0, so $V^\pi(s_a) = 0$.
• Temporal-difference: $V^\pi(s_a) = r + V^\pi(s_b) = 0 + 3/4 = 3/4$.
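The same arithmetic in a few lines of Python (our illustration of the example above):

```python
# The 8 episodes from the example: each is a list of (state, reward) steps.
episodes = [[("a", 0), ("b", 0)]] + [[("b", 1)]] * 6 + [[("b", 0)]]

def mc_value(state):
    """Monte-Carlo: average the episode return observed after each visit."""
    returns = []
    for ep in episodes:
        for i, (s, _) in enumerate(ep):
            if s == state:
                returns.append(sum(r for _, r in ep[i:]))  # reward-to-go
    return sum(returns) / len(returns)

V_b = mc_value("b")      # 6/8 = 0.75
V_a_mc = mc_value("a")   # 0.0: the only episode through s_a returned 0
V_a_td = 0 + V_b         # TD: V(s_a) = r + V(s_b) = 3/4
print(V_b, V_a_mc, V_a_td)
```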
Another Critic
• State-action value function $Q^\pi(s, a)$
  • When using actor π, the expected cumulative reward obtained after seeing observation s and taking action a
[Diagram] Two network forms:
  • Input $(s, a)$ → $Q^\pi$ → scalar $Q^\pi(s, a)$
  • Input $s$ → one output neuron per action: $Q^\pi(s, a = \text{left})$, $Q^\pi(s, a = \text{right})$, $Q^\pi(s, a = \text{fire})$ (for discrete actions only)
Q-Learning
[Loop] π interacts with the environment → learn $Q^\pi(s, a)$ (by TD or MC) → find a new actor π′ "better" than π → set π = π′ and repeat.
Q-Learning
• Given $Q^\pi(s, a)$, find a new actor π′ "better" than π
• "Better": $V^{\pi'}(s) \geq V^{\pi}(s)$ for all states s
$\pi'(s) = \arg\max_a Q^\pi(s, a)$
• π′ does not have extra parameters; it depends only on Q (see the sketch below)
• Not suitable for a continuous action a
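A sketch of extracting the greedy actor from a learned Q function (our illustration; `q_net` is a hypothetical network mapping a state to one Q value per discrete action, in the second form above):

```python
import torch

def pi_prime(state: torch.Tensor) -> int:
    """π′(s) = argmax_a Q^π(s, a): no extra parameters beyond Q itself."""
    with torch.no_grad():
        q_values = q_net(state)    # one Q^π(s, a) per discrete action
    return int(q_values.argmax())  # the greedy action
```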
Deep Reinforcement Learning: Actor-Critic
Actor-Critic
[Loop] π interacts with the environment → learn $Q^\pi(s, a)$ and $V^\pi(s)$ (by TD or MC) → update the actor from π to π′ based on $Q^\pi(s, a)$ and $V^\pi(s)$ → set π = π′ and repeat.
Actor-Critic
• Tips
  • The parameters of the actor π(s) and of the critic $V^\pi(s)$ can be shared
[Diagram] A shared network processes s; one head outputs the action distribution (left/right/fire), another head outputs $V^\pi(s)$.
Source of image: https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2#.68x6na7o9
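A minimal PyTorch sketch of this parameter sharing (ours, not from the lecture; the layer sizes are arbitrary):

```python
import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    """One shared trunk, two heads: action probabilities and V^π(s)."""
    def __init__(self, n_inputs: int = 80 * 80, n_actions: int = 3):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(n_inputs, 128), nn.ReLU())
        self.actor_head = nn.Linear(128, n_actions)   # → left/right/fire
        self.critic_head = nn.Linear(128, 1)          # → V^π(s)

    def forward(self, s: torch.Tensor):
        h = self.shared(s)  # parameters shared by actor and critic
        return torch.softmax(self.actor_head(h), dim=-1), self.critic_head(h)
```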
Asynchronous (A3C)
[Diagram] Each worker repeatedly:
1. copies the global parameters $\theta^1$;
2. samples some data;
3. computes gradients $\nabla\theta$;
4. updates the global model: $\theta^2 = \theta^1 + \eta \nabla\theta$.
(Other workers also update the global model in the meantime.)
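A toy sketch of the four steps with a shared parameter vector (ours; the gradient is a random placeholder, not a real actor-critic update, and real A3C often applies updates lock-free):

```python
import threading
import numpy as np

global_theta = np.zeros(10)   # the global model
lock = threading.Lock()
eta = 0.1                     # learning rate η

def worker(n_updates: int = 5):
    rng = np.random.default_rng()
    for _ in range(n_updates):
        with lock:
            theta = global_theta.copy()   # 1. copy global parameters
        data = rng.normal(size=10)        # 2. sample some data (placeholder)
        grad = data - theta               # 3. compute gradients (placeholder)
        with lock:                        # 4. update the global model
            global_theta += eta * grad    #    (other workers do the same)

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(global_theta)
```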
Demo of A3C
• Racing Car (DeepMind)
  • https://www.youtube.com/watch?v=0xo1Ldx3L5Q
Demo of A3C
• Visual Doom AI Competition @ CIG 2016
  • https://www.youtube.com/watch?v=94EPSjQH38Y
Concluding Remarks
• Policy-based approach: learning an Actor
• Value-based approach: learning a Critic
• Actor + Critic
• Model-free vs. model-based approaches