Transcript
Page 1. Source: speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2017/Lecture/RL (v4).pdf

Introduction of Reinforcement Learning

Page 2

Deep Reinforcement Learning

Page 3

Reference

• Textbook: Reinforcement Learning: An Introduction
  • http://incompleteideas.net/sutton/book/the-book.html

• Lectures of David Silver
  • http://www0.cs.ucl.ac.uk/staff/D.Silver/web/Teaching.html (10 lectures, around 1:30 each)
  • http://videolectures.net/rldm2015_silver_reinforcement_learning/ (Deep Reinforcement Learning)

• Lectures of John Schulman
  • https://youtu.be/aUrX-rP_ss4

Page 4

Scenario of Reinforcement Learning

Diagram: the agent receives an observation (state) from the environment, takes an action that changes the environment, and the environment returns a reward (here: "Don't do that").

Page 5

Scenario of Reinforcement Learning

Diagram: the agent receives an observation (state) from the environment, takes an action that changes the environment, and the environment returns a reward (here: "Thank you").

Agent learns to take actions maximizing expected reward.

Image source: https://yoast.com/how-to-clean-site-structure/

Page 6

Machine Learning ≈ Looking for a Function

Diagram: the observation is the function input, the action is the function output, and the reward from the environment is used to pick the best function.

Action = π(Observation)

This function π is called the Actor (or Policy).

Page 7

Learning to play Go

Diagram: the observation is the current board position, the action is the next move, and the environment returns a reward.

Page 8

Learning to play Go

Reward: if win, reward = 1; if loss, reward = −1; reward = 0 in most cases.

Agent learns to take actions maximizing expected reward.

Page 9

Learning to play Go

• Supervised (learning from a teacher): Next move: "5-5". Next move: "3-3".

• Reinforcement Learning (learning from experience): First move ...... many moves ...... Win! (Two agents play with each other.)

AlphaGo is supervised learning + reinforcement learning.

Page 10

Learning a chat-bot

• Machine obtains feedback from the user

• Chat-bot learns to maximize the expected reward

Example: "How are you?" → "Bye bye ☺" gets reward −10; "Hello" → "Hi ☺" gets reward 3.

Image sources: https://image.freepik.com/free-vector/variety-of-human-avatars_23-2147506285.jpg, http://www.freepik.com/free-vector/variety-of-human-avatars_766615.htm

Page 11

Learning a chat-bot

• Let two agents talk to each other (sometimes they generate good dialogue, sometimes bad)

Dialogue A: "How old are you?" / "See you." / "See you." / "See you."

Dialogue B: "How old are you?" / "I am 16." / "I thought you were 12." / "What makes you think so?"

Page 12

Learning a chat-bot

• By this approach, we can generate a lot of dialogues.

• Use some pre-defined rules to evaluate the goodness of a dialogue.

Dialogue 1, Dialogue 2, ..., Dialogue 8 → the machine learns from the evaluation.

Deep Reinforcement Learning for Dialogue Generation: https://arxiv.org/pdf/1606.01541v3.pdf

Page 13

Learning a chat-bot

• Supervised (learning from a teacher): "Hello" → say "Hi"; "Bye bye" → say "Good bye"

• Reinforcement (learning from experience): a dialogue starts with "Hello ☺", the agent and the other party exchange many turns, and only at the end is the whole dialogue judged (e.g., "Bad").

Page 14

More applications

• Flying Helicopter: https://www.youtube.com/watch?v=0JL04JJjocc
• Driving: https://www.youtube.com/watch?v=0xo1Ldx3L5Q
• Robot: https://www.youtube.com/watch?v=370cT-OAzzM
• Google Cuts Its Giant Electricity Bill With DeepMind-Powered AI: http://www.bloomberg.com/news/articles/2016-07-19/google-cuts-its-giant-electricity-bill-with-deepmind-powered-ai
• Text generation: https://www.youtube.com/watch?v=pbQ4qe8EwLo

Page 15

Example: Playing Video Game

• Widely studied:
  • Gym: https://gym.openai.com/
  • Universe: https://openai.com/blog/universe/

Machine learns to play video games as human players do:
➢ Machine learns to take proper actions by itself
➢ What the machine observes is pixels

Page 16

Example: Playing Video Game

• Space Invaders

Screen elements: your spaceship can "fire", there are shields, and the score is the reward (kill the aliens to score).

Termination: all the aliens are killed, or your spaceship is destroyed.

Page 17

Example: Playing Video Game

• Space Invaders
  • Play yourself: http://www.2600online.com/spaceinvaders.html
  • How about a machine: https://gym.openai.com/evaluations/eval_Eduozx4HRyqgTCVk9ltw

Page 18

Example: Playing Video Game

Start with observation s_1 → take action a_1: "right", obtain reward r_1 = 0 → observation s_2 → take action a_2: "fire" (kill an alien), obtain reward r_2 = 5 → observation s_3 → ......

Usually there is some randomness in the environment.

Page 19

Example: Playing Video Game

Start with observation s_1, then after many turns: take action a_T, obtain reward r_T, Game Over (spaceship destroyed).

The whole sequence is an episode.

Learn to maximize the expected cumulative reward per episode.
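This observe-act-reward loop maps directly onto the OpenAI Gym interface mentioned on page 15. A minimal sketch, assuming the classic Gym API where step() returns (observation, reward, done, info); the environment id and the random actor are assumptions for illustration:

```python
import gym  # https://gym.openai.com/

env = gym.make("SpaceInvaders-v0")        # assumed environment id
observation = env.reset()                  # start with observation s_1

cumulative_reward, done = 0.0, False
while not done:                            # one episode
    action = env.action_space.sample()                    # a random actor, just to show the loop
    observation, reward, done, info = env.step(action)    # take a_t, obtain r_t and s_{t+1}
    cumulative_reward += reward                            # cumulative reward: sum of r_t

print("cumulative reward of this episode:", cumulative_reward)
```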

Page 20

Properties of Reinforcement Learning

• Reward delay
  • In Space Invaders, only "fire" obtains a reward, although the moving before "fire" is important
  • In Go, it may be better to sacrifice immediate reward to gain more long-term reward

• The agent's actions affect the subsequent data it receives
  • E.g., exploration

Page 21

Outline

Policy-based (Learning an Actor), Value-based (Learning a Critic), and Actor + Critic

Model-free Approach / Model-based Approach

AlphaGo: policy-based + value-based + model-based

Page 22

Policy-based Approach: Learning an Actor

Page 23

Three Steps for Deep Learning

Step 1: define a set of functions → Neural Network as Actor
Step 2: goodness of function
Step 3: pick the best function

Deep Learning is so simple ......

Page 24

Neural network as Actor

• Input of the neural network: the observation of the machine, represented as a vector or a matrix (e.g., pixels)
• Output of the neural network: each action corresponds to a neuron in the output layer, giving the probability of taking that action (e.g., three output neurons for left, right, and fire, with probabilities such as 0.7, 0.2, 0.1)

What is the benefit of using a network instead of a lookup table? Generalization.
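As a concrete illustration of such an actor network (not from the slides), here is a minimal PyTorch sketch with one output neuron per action; the layer sizes and the flattened-pixel input are assumptions:

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """Actor: observation (flattened pixels) -> probability of taking each action."""
    def __init__(self, obs_dim, num_actions=3, hidden=128):   # 3 actions: left, right, fire
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, obs):
        return torch.softmax(self.net(obs), dim=-1)   # e.g. probabilities 0.7, 0.2, 0.1

# usage sketch
policy = PolicyNet(obs_dim=84 * 84)
probs = policy(torch.rand(84 * 84))       # fake "pixels"
action = torch.multinomial(probs, 1)      # sample an action from the distribution
```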

Page 25

Three Steps for Deep Learning

Step 1: define a set of functions → Neural Network as Actor
Step 2: goodness of function
Step 3: pick the best function

Deep Learning is so simple ......

Page 26

Goodness of Actor

• Review: Supervised learning

Diagram (training example): the input x_1, x_2, ..., x_256 goes through the network (with a given set of parameters θ) to softmax outputs y_1, y_2, ..., y_10; the output should be as close as possible to the target (the one-hot vector 1, 0, 0, ... for the class "1"), which defines a loss l for this example.

Total Loss: L = Σ_{n=1}^N l^n

Page 27

Goodness of Actor

• Given an actor π_θ(s) with network parameter θ
• Use the actor π_θ(s) to play the video game:
  • Start with observation s_1
  • Machine decides to take a_1
  • Machine obtains reward r_1
  • Machine sees observation s_2
  • Machine decides to take a_2
  • Machine obtains reward r_2
  • Machine sees observation s_3
  • ......
  • Machine decides to take a_T
  • Machine obtains reward r_T
  • END

Total reward: R_θ = Σ_{t=1}^T r_t

Even with the same actor, R_θ is different each time, because of randomness in the actor and the game.

We define R̄_θ as the expected value of R_θ; R̄_θ evaluates the goodness of an actor π_θ(s).

Page 28

Goodness of Actor

• τ = {s_1, a_1, r_1, s_2, a_2, r_2, ⋯, s_T, a_T, r_T}

P(τ|θ) = p(s_1) p(a_1|s_1, θ) p(r_1, s_2|s_1, a_1) p(a_2|s_2, θ) p(r_2, s_3|s_2, a_2) ⋯
       = p(s_1) ∏_{t=1}^T p(a_t|s_t, θ) p(r_t, s_{t+1}|s_t, a_t)

The factors p(a_t|s_t, θ) are controlled by your actor π_θ; the factors p(r_t, s_{t+1}|s_t, a_t) are not related to your actor.

Example: given s_t, the actor π_θ outputs left 0.1, right 0.2, fire 0.7, so p(a_t = "fire"|s_t, θ) = 0.7.

We define R̄_θ as the expected value of R_θ.

Page 29

Goodness of Actor

• An episode is considered as a trajectory τ
  • τ = {s_1, a_1, r_1, s_2, a_2, r_2, ⋯, s_T, a_T, r_T}
  • R(τ) = Σ_{t=1}^T r_t
• If you use an actor to play the game, each τ has a probability of being sampled
  • The probability depends on the actor parameter θ: P(τ|θ)

R̄_θ = Σ_τ R(τ) P(τ|θ)    (sum over all possible trajectories)

Use π_θ to play the game N times, obtaining τ^1, τ^2, ⋯, τ^N; this is sampling τ from P(τ|θ) N times, so

R̄_θ ≈ (1/N) Σ_{n=1}^N R(τ^n)

Page 30

Three Steps for Deep Learning

Step 1: define a set of functions → Neural Network as Actor
Step 2: goodness of function
Step 3: pick the best function

Deep Learning is so simple ......

Page 31

Gradient Ascent

• Problem statement: θ* = arg max_θ R̄_θ

• Gradient ascent:
  • Start with θ^0
  • θ^1 ← θ^0 + η ∇R̄_{θ^0}
  • θ^2 ← θ^1 + η ∇R̄_{θ^1}
  • ......

Here θ = {w_1, w_2, ⋯, b_1, ⋯} and ∇R̄_θ is the vector of partial derivatives [∂R̄_θ/∂w_1, ∂R̄_θ/∂w_2, ⋯, ∂R̄_θ/∂b_1, ⋯].

Page 32

Policy Gradient

R̄_θ = Σ_τ R(τ) P(τ|θ)      ∇R̄_θ = ?

∇R̄_θ = Σ_τ R(τ) ∇P(τ|θ)
      = Σ_τ R(τ) P(τ|θ) (∇P(τ|θ) / P(τ|θ))
      = Σ_τ R(τ) P(τ|θ) ∇log P(τ|θ)

(using d log f(x)/dx = (1/f(x)) (df(x)/dx))

Use π_θ to play the game N times, obtaining τ^1, τ^2, ⋯, τ^N:

∇R̄_θ ≈ (1/N) Σ_{n=1}^N R(τ^n) ∇log P(τ^n|θ)

R(τ) does not have to be differentiable; it can even be a black box.

Page 33

Policy Gradient

• τ = {s_1, a_1, r_1, s_2, a_2, r_2, ⋯, s_T, a_T, r_T}

P(τ|θ) = p(s_1) ∏_{t=1}^T p(a_t|s_t, θ) p(r_t, s_{t+1}|s_t, a_t)

∇log P(τ|θ) = ?

log P(τ|θ) = log p(s_1) + Σ_{t=1}^T [ log p(a_t|s_t, θ) + log p(r_t, s_{t+1}|s_t, a_t) ]

∇log P(τ|θ) = Σ_{t=1}^T ∇log p(a_t|s_t, θ)      (ignoring the terms not related to θ)

Page 34

Policy Gradient

∇R̄_θ ≈ (1/N) Σ_{n=1}^N R(τ^n) ∇log P(τ^n|θ), and ∇log P(τ|θ) = Σ_{t=1}^T ∇log p(a_t|s_t, θ), so

∇R̄_θ ≈ (1/N) Σ_{n=1}^N R(τ^n) Σ_{t=1}^{T_n} ∇log p(a_t^n|s_t^n, θ)
      = (1/N) Σ_{n=1}^N Σ_{t=1}^{T_n} R(τ^n) ∇log p(a_t^n|s_t^n, θ)

θ^new ← θ^old + η ∇R̄_{θ^old}

If in τ^n the machine takes a_t^n when seeing s_t^n:
  • if R(τ^n) is positive, tune θ to increase p(a_t^n|s_t^n)
  • if R(τ^n) is negative, tune θ to decrease p(a_t^n|s_t^n)

It is very important to weight by the cumulative reward R(τ^n) of the whole trajectory τ^n instead of the immediate reward r_t^n. What if we replaced R(τ^n) with r_t^n? Then only actions that yield immediate reward would ever be reinforced; in Space Invaders only "fire" obtains reward, even though the moving before "fire" is important.

Page 35

Policy Gradient

Data collection: given actor parameter θ, use the actor to play the game and record
τ^1: (s_1^1, a_1^1) R(τ^1), (s_2^1, a_2^1) R(τ^1), ......
τ^2: (s_1^2, a_1^2) R(τ^2), (s_2^2, a_2^2) R(τ^2), ......

Update the model:
θ ← θ + η ∇R̄_θ,   ∇R̄_θ = (1/N) Σ_{n=1}^N Σ_{t=1}^{T_n} R(τ^n) ∇log p(a_t^n|s_t^n, θ)

Then collect data again with the updated actor, and repeat (data collection ↔ model update).
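The collect-then-update loop above can be sketched directly in code. The following is a minimal REINFORCE-style sketch, not the lecture's reference implementation: it assumes a Gym-like env whose step() returns (observation, reward, done, info), the PolicyNet from the earlier sketch, and a standard PyTorch optimizer over the policy parameters (its learning rate plays the role of η).

```python
import torch
from torch.distributions import Categorical

def run_episode(env, policy):
    """Play one episode with the current actor; return the log-probabilities
    log p(a_t|s_t, theta) of the chosen actions and the total reward R(tau)."""
    log_probs, total_reward = [], 0.0
    obs, done = env.reset(), False
    while not done:
        probs = policy(torch.as_tensor(obs, dtype=torch.float32))  # p(.|s_t, theta)
        dist = Categorical(probs)
        action = dist.sample()
        log_probs.append(dist.log_prob(action))
        obs, reward, done, _ = env.step(action.item())
        total_reward += reward
    return torch.stack(log_probs), total_reward

def policy_gradient_update(env, policy, optimizer, N=10):
    """One data-collection + model-update step, implemented as minimizing the
    negative objective: -(1/N) sum_n R(tau^n) sum_t log p(a_t^n|s_t^n, theta)."""
    loss = 0.0
    for _ in range(N):                      # data collection: N trajectories
        log_probs, R = run_episode(env, policy)
        loss = loss - R * log_probs.sum()   # -R(tau^n) * sum_t log p(a_t^n|s_t^n, theta)
    loss = loss / N
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                        # model update
```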

Page 36

Policy Gradient: considered as a classification problem

Diagram: the network takes s as input and outputs probabilities y_i for left, right, fire. The action actually taken (here "left") plays the role of the target one-hot label ŷ = (1, 0, 0).

Minimize the cross-entropy: −Σ_{i=1}^3 ŷ_i log y_i,
which here equals −log y_left, i.e., maximize log P("left"|s).

θ ← θ + η ∇log P("left"|s)
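A quick numeric check of this equivalence, with illustrative numbers (not from the slides):

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[1.2, 0.3, -0.8]])   # network outputs for (left, right, fire)
taken_action = torch.tensor([0])            # the taken action "left" as the class label

ce = F.cross_entropy(logits, taken_action)            # cross-entropy with target (1, 0, 0)
log_p_left = torch.log_softmax(logits, dim=1)[0, 0]   # log P("left"|s)

print(ce.item(), (-log_p_left).item())                # the two numbers are identical
```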

Page 37

Policy Gradient

θ ← θ + η ∇R̄_θ,   ∇R̄_θ = (1/N) Σ_{n=1}^N Σ_{t=1}^{T_n} R(τ^n) ∇log p(a_t^n|s_t^n, θ)

Given actor parameter θ, collect
τ^1: (s_1^1, a_1^1) R(τ^1), (s_2^1, a_2^1) R(τ^1), ......
τ^2: (s_1^2, a_1^2) R(τ^2), (s_2^2, a_2^2) R(τ^2), ......

Each state-action pair becomes one classification example:
s_1^1 → NN → target a_1^1 = left, i.e., (1, 0, 0)
s_1^2 → NN → target a_1^2 = fire, i.e., (0, 0, 1)

Page 38

Policy Gradient

θ ← θ + η ∇R̄_θ,   ∇R̄_θ = (1/N) Σ_{n=1}^N Σ_{t=1}^{T_n} R(τ^n) ∇log p(a_t^n|s_t^n, θ)

Given actor parameter θ, collect
τ^1: (s_1^1, a_1^1) R(τ^1), (s_2^1, a_2^1) R(τ^1), ......
τ^2: (s_1^2, a_1^2) R(τ^2), (s_2^2, a_2^2) R(τ^2), ......

The same classification examples as before, but now each training example is weighted by R(τ^n): on the slide, the pair (s_1^1, a_1^1 = left) is effectively duplicated according to R(τ^1) (shown twice), while (s_1^2, a_1^2 = fire) appears with weight 1.
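In code, this weighting simply scales the classification loss of each (s_t^n, a_t^n) pair by R(τ^n). A minimal sketch; policy_net and the tensor shapes are assumptions for illustration:

```python
import torch

def weighted_policy_loss(policy_net, states, actions, returns):
    """states: (B, obs_dim); actions: (B,) taken actions; returns: (B,) holding
    R(tau^n), repeated for every step of trajectory n."""
    log_probs = torch.log_softmax(policy_net(states), dim=1)
    chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)  # log p(a_t^n|s_t^n, theta)
    return -(returns * chosen).mean()    # minimizing this ascends R_bar_theta
```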

Page 39

Add a Baseline

θ^new ← θ^old + η ∇R̄_{θ^old},   ∇R̄_θ ≈ (1/N) Σ_{n=1}^N Σ_{t=1}^{T_n} R(τ^n) ∇log p(a_t^n|s_t^n, θ)

It is possible that R(τ^n) is always positive. In the ideal case, where the update considers every action of a state, this is fine: all probabilities are pushed up, but the ones with larger reward are pushed up more, and after normalization the rest go down. In practice we are sampling, so some actions may never be sampled for a state; since the probabilities of the sampled actions all increase, the probability of the actions not sampled will decrease, even if they are good.

Remedy: subtract a baseline b:

∇R̄_θ ≈ (1/N) Σ_{n=1}^N Σ_{t=1}^{T_n} (R(τ^n) − b) ∇log p(a_t^n|s_t^n, θ)

Page 40

Value-based Approach: Learning a Critic

Page 41

Critic

• A critic does not determine the action.
• Given an actor π, it evaluates how good the actor is.

An actor can be found from a critic, e.g., Q-learning.

Image source: http://combiboilersleeds.com/picaso/critics/critics-4.html

Page 42

Critic

• State value function V^π(s)
  • When using actor π, the cumulated reward expected to be obtained after seeing observation (state) s

Diagram: s → V^π → V^π(s), a scalar. (Two example game states are shown: for one V^π(s) is large, for the other V^π(s) is smaller.)

Page 43

Critic

V(the earlier, weaker Hikaru; the large knight's move) = bad
V(the stronger Hikaru; the same large knight's move) = good

(An example from the Go manga Hikaru no Go: the value of the same position depends on the actor π that plays on from it.)

Page 44

How to estimate V^π(s)

• Monte-Carlo based approach
  • The critic watches π playing the game.
  • After seeing s_a, the cumulated reward until the end of the episode is G_a: train so that V^π(s_a) → G_a
  • After seeing s_b, the cumulated reward until the end of the episode is G_b: train so that V^π(s_b) → G_b

Page 45

How to estimate V^π(s)

• Temporal-difference approach
  • From one transition ⋯ s_t, a_t, r_t, s_{t+1} ⋯ we know V^π(s_t) = V^π(s_{t+1}) + r_t
  • So train the critic so that the difference V^π(s_t) − V^π(s_{t+1}) matches r_t

Some applications have very long episodes, so delaying all learning until an episode's end is too slow.
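A minimal TD(0)-style sketch of that update for a batch of transitions, using the same assumed value_net; the target r_t + V^π(s_{t+1}) is detached so only V^π(s_t) is regressed (no discount factor, matching the slides):

```python
import torch
import torch.nn.functional as F

def td_loss(value_net, s_t, r_t, s_next, done):
    """s_t, s_next: (B, obs_dim); r_t: (B,); done: (B,) equal to 1.0 at episode ends."""
    v_t = value_net(s_t).squeeze(-1)               # V^pi(s_t)
    with torch.no_grad():
        v_next = value_net(s_next).squeeze(-1)     # V^pi(s_{t+1})
        target = r_t + (1.0 - done) * v_next       # r_t + V^pi(s_{t+1})
    return F.mse_loss(v_t, target)
```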

Page 46

MC vs. TD

Monte-Carlo: train V^π(s_a) toward G_a. Larger variance, unbiased.

Temporal-difference: train V^π(s_t) toward r + V^π(s_{t+1}). Smaller variance, but may be biased (since V^π(s_{t+1}) itself is an estimate).

Page 47

MC vs. TD

• The critic has observed the following 8 episodes (the actions are ignored here) [Sutton, v2, Example 6.4]:
  • s_a, r = 0, s_b, r = 0, END
  • s_b, r = 1, END
  • s_b, r = 1, END
  • s_b, r = 1, END
  • s_b, r = 1, END
  • s_b, r = 1, END
  • s_b, r = 1, END
  • s_b, r = 0, END

V^π(s_b) = 3/4 (six of the eight visits to s_b are followed by total reward 1)

V^π(s_a) = ?   0? or 3/4?

Monte-Carlo: V^π(s_a) = 0 (the only episode containing s_a ends with total reward 0)

Temporal-difference: V^π(s_a) = V^π(s_b) + r = 3/4 + 0 = 3/4

Page 48

Another Critic

• State-action value function Q^π(s, a)
  • When using actor π, the cumulated reward expected to be obtained after seeing observation s and taking action a

Diagram: (s, a) → Q^π → Q^π(s, a), a scalar.

For discrete actions only, the network can instead take s as input and output Q^π(s, a = left), Q^π(s, a = right), and Q^π(s, a = fire) all at once.

Page 49

Q-Learning

Loop:
1. π interacts with the environment
2. Learn Q^π(s, a) (by TD or MC)
3. Find a new actor π′ "better" than π
4. Set π = π′ and repeat

Page 50

Q-Learning

• Given Q^π(s, a), find a new actor π′ "better" than π
• "Better": V^{π′}(s) ≥ V^π(s), for all states s

π′(s) = arg max_a Q^π(s, a)

➢ π′ has no extra parameters; it depends only on Q
➢ Not suitable for continuous actions a
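For discrete actions this greedy actor is one line of code. A minimal sketch, assuming a q_net that maps a state to one Q-value per action (as in the previous page's diagram):

```python
import torch

def greedy_action(q_net, state):
    """pi'(s) = argmax_a Q^pi(s, a) for discrete actions."""
    with torch.no_grad():
        q_values = q_net(state)          # e.g. Q(s, left), Q(s, right), Q(s, fire)
    return int(torch.argmax(q_values))
```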

Page 51

Deep Reinforcement Learning: Actor-Critic

Page 52

Actor-Critic

Loop:
1. π interacts with the environment
2. Learn Q^π(s, a) and V^π(s) (by TD or MC)
3. Update the actor from π to π′ based on Q^π(s, a) and V^π(s)
4. Set π = π′ and repeat

Page 53

Actor-Critic

• Tips
  • The parameters of the actor π(s) and of the critic V^π(s) can be shared

Diagram: the state s feeds a shared network; one branch outputs the action distribution (left, right, fire) and another branch outputs V^π(s).

Page 54

Asynchronous (A3C)

Source of image: https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-8-asynchronous-actor-critic-agents-a3c-c88f72a5e9f2#.68x6na7o9

Each worker repeatedly:
1. Copies the global parameters (e.g., θ^1)
2. Samples some data
3. Computes gradients Δθ
4. Updates the global model: θ^1 + ηΔθ

(Other workers also update the global model in the meantime, so the global parameters may already have moved on, e.g., to θ^2.)
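A deliberately simplified, single-threaded sketch of those four worker steps (real A3C runs many workers in parallel); compute_gradients stands in for "sample data and compute Δθ" and is an assumption, not part of the slides:

```python
import copy

def worker_step(global_params, compute_gradients, eta=0.01):
    """global_params: dict of parameter arrays; returns the updated global dict."""
    local_params = copy.deepcopy(global_params)     # 1. copy the global parameters theta
    delta = compute_gradients(local_params)         # 2.-3. sample data, compute delta_theta
    for name, g in delta.items():                   # 4. update the global model
        global_params[name] = global_params[name] + eta * g   # theta + eta * delta_theta
    return global_params
```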

Page 55

Demo of A3C

• Racing Car (DeepMind): https://www.youtube.com/watch?v=0xo1Ldx3L5Q

Page 56

Demo of A3C

• Visual Doom AI Competition @ CIG 2016: https://www.youtube.com/watch?v=94EPSjQH38Y

Page 57

Concluding Remarks

Policy-based (Learning an Actor), Value-based (Learning a Critic), and Actor + Critic

Model-free Approach / Model-based Approach

