
Page 1:

RL + DL

Rob Platt, Northeastern University

Some slides from David Silver, Deep RL Tutorial

Pages 2-3:

Deep RL is an exciting recent development.

Page 4:

Remember the RL scenario...

[Diagram: the agent and the world in a loop. The agent sends an action a to the world; the world returns a state s and a reward r.]

The agent takes actions; the agent perceives states and rewards.

Pages 5-6:

Tabular RL

[Diagram: the world emits a state; the agent looks that state up in a tabular Q-function and takes the argmax action, which goes back to the world.]

Q-function: a table with one entry per state-action pair.

Update rule: [equation shown on the slide]
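The update rule on these slides is the standard tabular Q-learning update. As a concrete illustration, here is a minimal Python sketch of a dictionary-based Q-table and that update; the hyperparameter values and the defaultdict representation are my assumptions, not from the slides.

```python
import random
from collections import defaultdict

alpha, gamma, epsilon = 0.1, 0.99, 0.1   # assumed learning rate, discount, exploration rate
Q = defaultdict(float)                   # tabular Q-function: Q[(s, a)] -> value, default 0

def choose_action(s, actions):
    # epsilon-greedy policy derived from the Q table (argmax over actions)
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def q_update(s, a, r, s_next, actions):
    # tabular Q-learning update:
    #   Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```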

Pages 7-8:

Deep RL

[Diagram: the same loop, but the state is fed into a neural-network Q-function, and the agent takes the argmax action over the network's outputs.]

But, why would we want to do this?

Page 9:

Where does "state" come from?

[Diagram: the agent-world loop again; the agent takes actions a and perceives states s and rewards r.]

Earlier, we dodged this question: “it’s part of the MDP problem statement”

But, that’s a cop out. How do we get state?

Typically we can't use "raw" sensor data as the state with a tabular Q-function: it's too big (e.g. Pacman has something like 2^(num pellets) + … states).

Pages 10-12:

Linear function approximation?

In the RL lecture, we suggested using linear function approximation, with the Q-function encoded by the weights.

But this STILL requires a human designer to specify the features!

Is it possible to do RL WITHOUT hand-coding either states or features?
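To make the linear case concrete before moving on, here is a minimal sketch of linear Q-function approximation; the feature function is a made-up stand-in for whatever hand-designed features a human would supply, and the hyperparameters are assumptions.

```python
import numpy as np

def features(s, a):
    # stand-in for hand-coded features of a state-action pair
    return np.array([1.0, s[0], s[1], float(a)])

def q_value(w, s, a):
    # Q(s, a) = w . f(s, a): the Q-function is encoded by the weights w
    return w @ features(s, a)

def q_update(w, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # semi-gradient Q-learning step on the weights; returns the updated weights
    target = r + gamma * max(q_value(w, s_next, a2) for a2 in actions)
    return w + alpha * (target - q_value(w, s, a)) * features(s, a)

w = np.zeros(4)  # one weight per feature
```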

Page 13:

Deep Q Learning

Page 14:

Deep Q Learning

Instead of a state, we have an image. In practice, it could be a history of the k most recent images stacked as a single k-channel image.

Hopefully this new image representation is Markov; in some domains, it might not be!
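A minimal sketch of the "stack the k most recent images" idea, assuming grayscale frames stored as 2-D numpy arrays; k = 4, the 84x84 frame size, and the padding at the start of an episode are assumptions, not values from the slides.

```python
from collections import deque
import numpy as np

k = 4                       # assumed history length
frames = deque(maxlen=k)    # the oldest frame is dropped automatically

def observe(frame):
    # push the newest frame and return a k-channel "state"
    frames.append(frame)
    while len(frames) < k:            # at the start of an episode, pad with copies
        frames.append(frame)
    return np.stack(frames, axis=0)   # shape: (k, height, width)

state = observe(np.zeros((84, 84)))   # e.g. one 84x84 grayscale frame
```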

Pages 15-17:

Network structure

[Diagram: a stack of images feeds into the Q-function network: Conv 1 -> Conv 2 -> FC 1 -> Output.]

The number of output nodes equals the number of actions.
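For concreteness, a PyTorch sketch of a network with this Conv 1 -> Conv 2 -> FC 1 -> Output shape. The specific layer sizes, kernel sizes, and the 4x84x84 input are assumptions; the slides only give the overall structure.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a stack of images to one Q-value per action."""
    def __init__(self, in_channels=4, num_actions=4):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, 16, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=4, stride=2)
        self.fc1 = nn.Linear(32 * 9 * 9, 256)       # 9x9 spatial size for 84x84 inputs
        self.out = nn.Linear(256, num_actions)      # one output node per action

    def forward(self, x):                            # x: (batch, in_channels, 84, 84)
        x = torch.relu(self.conv1(x))
        x = torch.relu(self.conv2(x))
        x = torch.relu(self.fc1(x.flatten(start_dim=1)))
        return self.out(x)                           # (batch, num_actions)

q = QNetwork()
print(q(torch.zeros(1, 4, 84, 84)).shape)            # torch.Size([1, 4])
```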

Pages 18-22:

Updating the Q-function

Here's the standard Q-learning update equation: [equation shown on the slide]

Rewriting: [equation shown on the slide]; let's call this the "target".

This equation adjusts Q(s,a) in the direction of the target.

We're going to accomplish this same thing in a different way, using neural networks...
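The equations themselves did not survive extraction. The standard Q-learning update the slide names, and its rewritten form with the "target" made explicit, are presumably:

```latex
% standard Q-learning update
Q(s,a) \leftarrow Q(s,a) + \alpha \Bigl[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \Bigr]

% rewritten as a step toward a target
Q(s,a) \leftarrow (1-\alpha)\, Q(s,a)
    + \alpha \underbrace{\Bigl[ r + \gamma \max_{a'} Q(s',a') \Bigr]}_{\text{target}}
```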

Page 23:

Remember how we train a neural network...

Given a dataset, define a loss function, and adjust w, b so as to minimize it.

Do gradient descent on the dataset:

1. repeat
2.     compute the gradient of the loss with respect to w and b
3.     take a step in the negative gradient direction
4. until converged
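The formulas on this slide were also lost in extraction; a plausible reconstruction, assuming the usual squared-error loss over a dataset of input/label pairs:

```latex
\text{Dataset: } \{(x_i, y_i)\}_{i=1}^{N}, \qquad
L(w,b) = \sum_{i=1}^{N} \bigl( y_i - f(x_i; w, b) \bigr)^2

\text{Gradient descent: repeat }\;
w \leftarrow w - \alpha \frac{\partial L}{\partial w}, \quad
b \leftarrow b - \alpha \frac{\partial L}{\partial b}
\;\text{ until converged.}
```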

Pages 24-29:

Updating the Q-network

Use this loss function: [equation shown on the slide, with the target term marked]

Notice that Q is now parameterized by the weights, w (I'm including the bias in the weights).

Notice that the target is a function of the weights. This is a problem because targets are non-stationary.

So... when we do gradient descent, we're going to hold the targets constant.

Now we can do gradient descent!
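In code, "hold the targets constant" usually amounts to cutting the target out of the gradient computation. A PyTorch sketch, assuming a Q-network like the one sketched earlier; the squared-error form and the exact tensor shapes are my assumptions.

```python
import torch

def dqn_loss(q_net, s, a, r, s_next, done, gamma=0.99):
    """Squared error between Q(s, a; w) and a target that is held constant."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)       # Q(s, a; w), shape (batch,)
    with torch.no_grad():                                      # no gradient flows through the target
        target = r + gamma * (1 - done) * q_net(s_next).max(dim=1).values
    return ((target - q_sa) ** 2).mean()
```

Gradients then flow only through Q(s, a; w), not through the max over next-state Q-values.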

Pages 30-31:

Deep Q Learning, no replay buffers

Initialize Q(s,a;w) with random weights
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
        Choose a from s using a policy derived from Q (e.g. epsilon-greedy)
        Take action a, observe r, s'
        Take one gradient-descent step on w toward the target (update shown on the slide)
        s <- s'
    Until s is terminal

Where: [target definition shown on the slide]

This is all that changed relative to standard Q-learning.

Page 32:

Example: 4x4 frozen lake env

Get to the goal (G). Don't fall in a hole (H).

Demo!
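A sketch of what the pages 30-31 algorithm might look like on this environment, using the classic OpenAI Gym interface. The environment id, network size, and hyperparameters are assumptions, and newer Gym/Gymnasium releases return slightly different tuples from reset() and step().

```python
import random
import torch
import torch.nn as nn
import gym

env = gym.make("FrozenLake-v1")            # 4x4 map by default; the id varies across gym versions
n_s, n_a = env.observation_space.n, env.action_space.n

q_net = nn.Sequential(nn.Linear(n_s, 64), nn.ReLU(), nn.Linear(64, n_a))
opt = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma, epsilon = 0.99, 0.1

def one_hot(s):
    # encode the discrete state as a 1 x n_s input vector
    x = torch.zeros(1, n_s)
    x[0, s] = 1.0
    return x

for episode in range(2000):
    s = env.reset()                        # classic API: reset() returns the initial state
    done = False
    while not done:
        # choose a from s using an epsilon-greedy policy derived from Q
        a = (env.action_space.sample() if random.random() < epsilon
             else int(q_net(one_hot(s)).argmax()))
        s2, r, done, info = env.step(a)    # classic API: 4-tuple
        # one gradient-descent step toward the target r + gamma * max_a' Q(s', a'; w)
        with torch.no_grad():
            target = r + gamma * (0.0 if done else q_net(one_hot(s2)).max().item())
        loss = (q_net(one_hot(s))[0, a] - target) ** 2
        opt.zero_grad()
        loss.backward()
        opt.step()
        s = s2
```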

Pages 33-35:

Experience replay

Deep learning typically assumes independent, identically distributed (IID) training data.

But is this true in the deep RL scenario (recall the update loop from pages 30-31)? Consecutive transitions within an episode are highly correlated, so it is not.

Our solution: buffer experiences and then "replay" them during training.
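A minimal replay buffer sketch; the capacity and the (s, a, r, s', done) transition layout are assumptions.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, r, s', done) transitions and samples random minibatches."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)    # oldest experiences fall off the end

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # uniform random sampling breaks up the correlation between consecutive steps
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```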

Pages 36-37:

Deep Q Learning WITH replay buffers

Initialize Q(s,a;w) with random weights and an empty replay buffer D
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
        Choose a from s using a policy derived from Q (e.g. epsilon-greedy)
        Take action a, observe r, s'
        Add this experience (s, a, r, s') to the replay buffer D
        If mod(step, trainfreq) == 0:
            sample batch B from D
            take one step of gradient descent with respect to B
        s <- s'
    Until s is terminal

We train every trainfreq steps; buffers like this are pretty common in DL...
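Tying the sketches together, the "sample batch B from D, one step of gradient descent" lines might look like this; trainfreq and the batch size are assumptions, and dqn_loss / ReplayBuffer are the hypothetical helpers sketched above (states assumed to be stored as tensors).

```python
import torch

def train_step(q_net, opt, buffer, batch_size=32):
    # one gradient-descent step on a minibatch sampled from the replay buffer
    if len(buffer) < batch_size:
        return
    batch = buffer.sample(batch_size)
    s, a, r, s2, done = zip(*batch)
    loss = dqn_loss(q_net, torch.stack(s), torch.tensor(a),
                    torch.tensor(r, dtype=torch.float32), torch.stack(s2),
                    torch.tensor(done, dtype=torch.float32))
    opt.zero_grad()
    loss.backward()
    opt.step()

# inside the step loop, after adding (s, a, r, s', done) to the buffer:
#   if step % trainfreq == 0:
#       train_step(q_net, opt, buffer)
```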

Page 38:

Example: 4x4 frozen lake env, now with replay buffers

Get to the goal (G). Don't fall in a hole (H).

Demo!

Pages 39-41:

Atari Learning Environment

Pages 42-45:

Additional Deep RL "Tricks"

Combined algorithm: roughly 3x the mean Atari score of DQN with replay buffers.