
Learning in sequential environments

Raia Hadsell
Staff Research Scientist, DeepMind

raiahadsell.com

Scaling deep reinforcement learning towards the real world:

Part 1: learning sequential tasks without forgetting
Part 2: learning to navigate in complex worlds

Reinforcement Learning

[Diagram: the agent-environment loop. The agent sends ACTIONS to the ENVIRONMENT and receives OBSERVATIONS and a REWARD signal.]

Value Iteration

○ Maximizing Qπ(s,a) over possible policies gives the optimal action-value function and the Bellman equation:

    Q*(s,a) = E[ r + γ max_{a'} Q*(s',a') | s, a ]

○ Basic idea:
  ■ Approximate: Q_i → Q* as i → ∞
  ■ Apply the Bellman equation as an iterative update:

    Q_{i+1}(s,a) = E[ r + γ max_{a'} Q_i(s',a') | s, a ]
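As a concrete illustration of the iterative Bellman update, here is a minimal sketch of tabular value iteration on a toy MDP; the transition probabilities and rewards are invented for the example, not taken from the talk.

```python
import numpy as np

# Toy MDP for illustration only: 3 states, 2 actions.
# P[a, s, s'] is a transition probability, R[s, a] an expected reward.
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],   # action 0
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.1, 0.9]],   # action 1
])
R = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 2.0]])
gamma = 0.9

# Repeatedly apply the Bellman optimality backup to Q(s, a).
Q = np.zeros((3, 2))
for _ in range(500):
    V = Q.max(axis=1)                              # max_a' Q(s', a')
    Q = R + gamma * np.einsum('asn,n->sa', P, V)   # Bellman update

print(Q)            # converged action values
print(Q.argmax(1))  # greedy policy
```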

End-to-End Reinforcement Learning

○ Use a neural network for Q(s, a; θ)
○ Train end-to-end from raw pixels
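A hedged sketch of what "a neural network for Q(s, a; θ)" trained from raw pixels can look like, in the spirit of deep Q-learning; the layer sizes assume 84x84 inputs with 4 stacked frames, and the replay buffer and target-network bookkeeping are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNet(nn.Module):
    """Convolutional Q-network over raw pixels. Layer sizes are illustrative."""
    def __init__(self, n_actions, in_channels=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions))

    def forward(self, x):
        return self.net(x)          # Q(s, a; theta) for every action

def td_loss(q_net, target_net, s, a, r, s_next, done, gamma=0.99):
    """One-step TD error: r + gamma * max_a' Q_target(s', a') - Q(s, a)."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_net(s_next).max(1).values
    return F.smooth_l1_loss(q_sa, target)
```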

but.. a network for every task?


one network for all?


Catastrophic forgetting

● Well-known phenomenon
● Especially severe in Deep RL

Catastrophic forgetting

https://www.youtube.com/watch?v=Fh_zNpdc0Xs


An illustration

[Figure: schematic of parameter space showing the low-error regions for Task A and Task B around the Task A solution θ*, with the trajectories taken by plain SGD, an L2 penalty, and EWC when training on Task B.]

Elastic Weight Consolidation

Elastic Weight Consolidation (EWC): constrain important parameters to stay close to their old values.

Continual learning in the brain: synaptic consolidation reduces the plasticity of synapses that are vital to previous tasks.

[Figure: the parameter-space schematic again, with the SGD, L2, and EWC trajectories from θ* into the Task B region.]

Elastic Weight Consolidation

Implement the constraint as a quadratic penalty applied while training on Task B, but not uniformly: it should be stronger for the parameters that are important to Task A.

The posterior distribution over parameters after Task A contains exactly this information, but it is intractable.

[Figure: the same parameter-space schematic (SGD, L2, EWC).]

Elastic Weight Consolidation

Estimate the posterior with a Gaussian:
● Mean: the Task A parameter vector θ*_A
● Diagonal precision given by an approximation of the Fisher information F

[Figure: the same parameter-space schematic (SGD, L2, EWC).]
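Putting these pieces together, the penalty in the PNAS paper cited below takes the form (λ/2) · Σ_i F_i (θ_i − θ*_{A,i})². Here is a minimal sketch of how it could be computed with PyTorch; the λ value and the empirical-Fisher estimate are illustrative simplifications, not the paper's exact settings.

```python
import torch

def diagonal_fisher(model, data_loader, loss_fn):
    """Diagonal Fisher estimate for Task A: average squared gradients of the
    loss over Task A data (an empirical-Fisher simplification)."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    n_batches = 0
    for x, y in data_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            fisher[n] += p.grad.detach() ** 2
        n_batches += 1
    return {n: f / n_batches for n, f in fisher.items()}

def ewc_penalty(model, theta_star, fisher, lam=400.0):
    """EWC penalty: sum_i (lam / 2) * F_i * (theta_i - theta*_A,i)^2.
    theta_star and fisher are saved after training on Task A;
    lam is a placeholder regularisation strength."""
    loss = 0.0
    for n, p in model.named_parameters():
        loss = loss + (lam / 2.0) * (fisher[n] * (p - theta_star[n]) ** 2).sum()
    return loss

# While training on Task B:
#   total_loss = task_b_loss + ewc_penalty(model, theta_star_A, fisher_A)
```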


Experiment: Permuted MNIST

Random, fixed permutations of the MNIST dataset.
Train a multilayer, fully-connected network with ReLUs until convergence.
We compare SGD, L2 regularisation, and EWC.

[Images: example digits under permutations A, B, and C.]
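For concreteness, a minimal sketch of how such permuted variants can be generated; loading the actual MNIST images is left out, and the function names are placeholders.

```python
import numpy as np

def make_permutations(n_tasks, n_pixels=28 * 28, seed=0):
    """One random, fixed pixel permutation per task (Perm A, B, C, ...)."""
    rng = np.random.default_rng(seed)
    return [rng.permutation(n_pixels) for _ in range(n_tasks)]

def permute_images(images, perm):
    """Apply a fixed permutation to flattened MNIST images of shape (N, 784)."""
    return images[:, perm]

perms = make_permutations(3)   # Perm A, Perm B, Perm C
# task_b_images = permute_images(mnist_images, perms[1])
```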


Let’s try something harder...

Sequential reinforcement learning tasks (10 Atari games)

Random ordering with extended game play on each task, with multiple returns to each game

Unknown task boundaries

Regular testing of all 10 games

Single network with fixed capacity


Experiment: Atari 10

Forget-Me-Not [1] allows labeling of data segments, which is used for:

● EWC regularisation
● Task-specific replay buffers for DDQN [2]
● Task-specific biases and gains at each network layer

The Fisher is estimated at each task boundary and the EWC penalty is updated (see the sketch below).

[1] The Forget-Me-Not Process, Milan et al., NIPS 2016
[2] Deep Reinforcement Learning with Double Q-learning, van Hasselt et al., AAAI 2016
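A sketch of what "estimate the Fisher at each task boundary and update the EWC penalty" can look like in code, assuming a PyTorch model and reusing the diagonal-Fisher estimate from the earlier sketch; the class, method names, and λ value are placeholders.

```python
class EWCAnchors:
    """Keeps one (theta*, Fisher) anchor per completed task. 'model' is
    assumed to be a torch.nn.Module; lam is a placeholder strength."""
    def __init__(self, lam=400.0):
        self.anchors = []   # list of (theta_star, fisher) dicts
        self.lam = lam

    def on_task_boundary(self, model, fisher):
        """Called when a boundary is detected: snapshot parameters and Fisher."""
        theta_star = {n: p.detach().clone() for n, p in model.named_parameters()}
        self.anchors.append((theta_star, fisher))

    def penalty(self, model):
        """Sum of quadratic EWC penalties over all stored anchors."""
        total = 0.0
        for theta_star, fisher in self.anchors:
            for n, p in model.named_parameters():
                total = total + (self.lam / 2.0) * (
                    fisher[n] * (p - theta_star[n]) ** 2).sum()
        return total
```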

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, Raia Hadsell
Overcoming catastrophic forgetting in neural networks
PNAS 2017 · arxiv.org/abs/1612.00796

Learning to navigate in complex mazes


Navigation mazes

Game episode:
1. Random start
2. Find the goal (+10)
3. Teleport randomly
4. Re-find the goal (+10)
5. Repeat (limited time)

Rewards: +10 for the goal, +1 for apples along the way
Episode length: 3600 or 10800 steps, depending on the maze

Variants:
● Static maze, static goal
● Static maze, random goal
● Random maze

Observations: RGB, velocity
Actions: 8

Why is learning navigation via reinforcement learning hard?

Given: sparse rewards
Wanted: spatial knowledge

The vast and meaningless silence of an agent exploring...

[Learning curves (x-axis in units of 1e7 steps): a long flat stretch of reward before the moment the agent can say "I have been here before! I know where to go!"]

Why is learning navigation via reinforcement learning hard?

Given: sparse rewards
Wanted: spatial knowledge

1. Accelerate reinforcement learning through auxiliary losses
   ➔ Stable gradients help learning, even if unrelated to reward
2. Drive spatial knowledge through the choice of auxiliary tasks (see the loss sketch below):
   ● Depth prediction
   ● Loop closure prediction
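A minimal sketch of such auxiliary losses, assuming depth is quantised into discrete classes and loop closure is a binary label; the beta weights are placeholders, not the paper's values.

```python
import torch.nn.functional as F

def auxiliary_losses(depth_logits, depth_targets, loop_logit, loop_label,
                     beta_d=1.0, beta_l=1.0):
    """Auxiliary losses added on top of the RL objective. Depth is treated as
    classification over quantised depth bins, loop closure as a binary
    prediction."""
    depth_loss = F.cross_entropy(depth_logits, depth_targets)
    loop_loss = F.binary_cross_entropy_with_logits(loop_logit, loop_label)
    return beta_d * depth_loss + beta_l * loop_loss

# total_loss = a3c_loss + auxiliary_losses(...)
```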


Nav agent ingredients:

1. Convolutional encoder and RGB inputs
2. Stacked LSTM
3. Additional inputs (reward, action, and velocity)
4. RL: Asynchronous advantage actor critic (A3C), Mnih et al. (2016)
5. Aux task 1: Depth predictors (D1 from the convolutional features, D2 from the LSTM features)
6. Aux task 2: Loop closure predictor (L)
7. For analysis: Position decoder

[Architecture diagram: encoder over x_t feeding a stacked LSTM that also receives r_{t-1}, v_t, and a_{t-1}, with heads for the policy/value, depth (D1, D2), loop closure (L), and position.]
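A minimal sketch of this agent in PyTorch. The layer sizes are illustrative (84x84 RGB input, 6-dimensional velocity), the wiring of the extra inputs follows the diagram labels r_{t-1} and {v_t, a_{t-1}}, and each depth head predicts a single coarse depth distribution rather than the paper's low-resolution depth map.

```python
import torch
import torch.nn as nn

class NavAgent(nn.Module):
    """Sketch of the Nav A3C+D1D2L agent: conv encoder, stacked LSTM with
    extra inputs, policy/value heads, and auxiliary depth / loop heads."""
    def __init__(self, n_actions=8, depth_bins=8, vel_dim=6):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 16, 8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(),
            nn.Flatten(), nn.Linear(32 * 9 * 9, 256), nn.ReLU())
        # First LSTM layer sees the previous reward; the second additionally
        # sees the velocity and the previous action (one-hot).
        self.lstm1 = nn.LSTMCell(256 + 1, 64)
        self.lstm2 = nn.LSTMCell(256 + 64 + n_actions + vel_dim, 256)
        self.policy = nn.Linear(256, n_actions)
        self.value = nn.Linear(256, 1)
        self.depth1 = nn.Linear(256, depth_bins)   # D1: from conv features
        self.depth2 = nn.Linear(256, depth_bins)   # D2: from LSTM features
        self.loop = nn.Linear(256, 1)              # L: loop closure (binary)

    def forward(self, x, r_prev, v, a_prev_onehot, state1, state2):
        f = self.enc(x)
        h1, c1 = self.lstm1(torch.cat([f, r_prev], dim=1), state1)
        h2, c2 = self.lstm2(torch.cat([f, h1, a_prev_onehot, v], dim=1), state2)
        return (self.policy(h2), self.value(h2),
                self.depth1(f), self.depth2(h2), self.loop(h2),
                (h1, c1), (h2, c2))
```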


more details.. the losses:

● policy gradient (A3C)
● depth prediction from visual features (D1)
● depth prediction from LSTM features (D2)
● loop prediction from LSTM features (L)

[The slide showed the corresponding equations; a reconstruction follows.]
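The equations themselves did not survive extraction. As a reconstruction, the policy-gradient term is the standard A3C update (Mnih et al., 2016) and the auxiliary terms are cross-entropy losses combined as sketched earlier; treat the exact forms and weightings as approximate.

```latex
% A3C policy gradient with entropy regularisation (standard form):
\nabla_{\theta}\,\log \pi(a_t \mid s_t;\theta)\,\bigl(R_t - V(s_t;\theta_v)\bigr)
  \;+\; \beta \,\nabla_{\theta} H\bigl(\pi(\cdot \mid s_t;\theta)\bigr)

% Total objective with the auxiliary depth (D1, D2) and loop-closure (L) terms:
L_{\mathrm{total}} \;=\; L_{\mathrm{A3C}}
  \;+\; \beta_{D_1} L_{D_1} \;+\; \beta_{D_2} L_{D_2} \;+\; \beta_{L} L_{L}
```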

Experiments

Agent architectures compared:
a. FF A3C
b. LSTM A3C
c. Nav A3C
d. Nav A3C+D1D2L

[Architecture diagrams for the four agents, from plain feed-forward A3C up to the Nav A3C agent with the D1, D2, and L auxiliary heads.]

Results on large maze with static goal

https://www.youtube.com/watch?v=zHhbypmKaj0

Should depth be an input? Or a target?

[Diagrams: one agent receives RGBD as input; the other receives RGB only and predicts depth (D1, D2) as auxiliary targets.]

Answer: the dense, non-noisy gradients from depth as a target are more helpful

Results with random goal locations

Is the agent remembering the goal location?
● Mean time to first goal find of an episode: 14.0 sec
● Mean time to subsequent goal finds: 7.2 sec
● Not as impressive for large mazes: 15.4 sec vs 15.0 sec

[Plots: learning curves for the small and large random-goal mazes.]

Latency to goal (as the agent returns)

● Trajectories of the Nav A3C+D+L agent in the I-maze and random goal maze over the course of one episode

● Value function and goal finding (red lines) are shown


Position decoding

● Trajectories of the Nav A3C+D+L agent in the random goal maze

● Position likelihoods are overlaid (predicted from LSTM hiddens)

● Initial uncertainty gives way to accurate position estimation.

[Architecture diagram: the position decoder reads the LSTM hidden state, alongside the D1, D2, and L heads.]

Results in random mazes (small and large)

https://www.youtube.com/watch?v=EKXQAjoNdGM
https://www.youtube.com/watch?v=lNoaTyMZsWI

Thank you!
raiahadsell.com

Piotr Mirowski, Razvan Pascanu, Fabio, Ross, Andy, Hubert, Laurent, Koray, Dharsh, Misha, Andrea

Learning to navigate in complex environments
ICLR 2017 · arxiv.org/abs/1611.03673
