
Learning in sequential environments

Raia Hadsell
Staff Research Scientist, DeepMind

raiahadsell.com

Scaling deep reinforcement learning towards the real world:

Part 1: learning sequential tasks without forgetting
Part 2: learning to navigate in complex worlds

Reinforcement Learning

[Diagram: the agent-environment loop. The agent sends ACTIONS to the ENVIRONMENT and receives OBSERVATIONS and a REWARD signal.]

Value Iteration

○ Maximizing Qπ(s,a) over possible policies gives the optimal action-value function and the Bellman equation:

    Q*(s,a) = E[ r + γ max_{a'} Q*(s',a') | s, a ]

○ Basic idea:
  ■ Approximate: Q_i → Q* as i → ∞
  ■ Apply the Bellman equation as an iterative update:

    Q_{i+1}(s,a) = E[ r + γ max_{a'} Q_i(s',a') | s, a ]
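As a concrete illustration of the iterative Bellman update, here is a minimal sketch of tabular value iteration on a toy MDP; the transition probabilities and rewards are invented for the example, not taken from the talk.

```python
import numpy as np

# Toy MDP for illustration only: 3 states, 2 actions.
# P[a, s, s'] is a transition probability, R[s, a] an expected reward.
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],   # action 0
    [[0.5, 0.5, 0.0], [0.0, 0.5, 0.5], [0.0, 0.1, 0.9]],   # action 1
])
R = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 2.0]])
gamma = 0.9

# Repeatedly apply the Bellman optimality backup to Q(s, a).
Q = np.zeros((3, 2))
for _ in range(500):
    V = Q.max(axis=1)                              # max_a' Q(s', a')
    Q = R + gamma * np.einsum('asn,n->sa', P, V)   # Bellman update

print(Q)            # converged action values
print(Q.argmax(1))  # greedy policy
```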

End-to-End Reinforcement Learning

○ Use a neural network for Q(s, a; θ)
○ Train end-to-end from raw pixels
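A hedged sketch of what "a neural network for Q(s, a; θ)" trained from raw pixels can look like, in the spirit of deep Q-learning; the layer sizes assume 84x84 inputs with 4 stacked frames, and the replay buffer and target-network bookkeeping are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QNet(nn.Module):
    """Convolutional Q-network over raw pixels. Layer sizes are illustrative."""
    def __init__(self, n_actions, in_channels=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, n_actions))

    def forward(self, x):
        return self.net(x)          # Q(s, a; theta) for every action

def td_loss(q_net, target_net, s, a, r, s_next, done, gamma=0.99):
    """One-step TD error: r + gamma * max_a' Q_target(s', a') - Q(s, a)."""
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_net(s_next).max(1).values
    return F.smooth_l1_loss(q_sa, target)
```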

but.. a network for every task?


one network for all?


Catastrophic forgetting

● Well-known phenomenon
● Especially severe in Deep RL

Catastrophic forgetting

https://www.youtube.com/watch?v=Fh_zNpdc0Xs


An illustration

[Figure: schematic of parameter space showing the low-error regions for Task A and Task B around the Task A solution θ*, with the trajectories taken by plain SGD, an L2 penalty, and EWC when training on Task B.]

Elastic Weight Consolidation

Elastic Weight Consolidation (EWC): constrain important parameters to stay close to their old values.

Continual learning in the brain: synaptic consolidation reduces the plasticity of synapses that are vital to previous tasks.

[Figure: the parameter-space schematic again, with the SGD, L2, and EWC trajectories from θ* into the Task B region.]

Elastic Weight Consolidation

Implement the constraint as a quadratic penalty applied while training on Task B, but not uniformly: it should be stronger for the parameters that are important to Task A.

The posterior distribution over parameters after Task A contains exactly this information, but it is intractable.

[Figure: the same parameter-space schematic (SGD, L2, EWC).]

Elastic Weight Consolidation

Estimate the posterior with a Gaussian:
● Mean: the Task A parameter vector θ*_A
● Diagonal precision given by an approximation of the Fisher information F

[Figure: the same parameter-space schematic (SGD, L2, EWC).]
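Putting these pieces together, the penalty in the PNAS paper cited below takes the form (λ/2) · Σ_i F_i (θ_i − θ*_{A,i})². Here is a minimal sketch of how it could be computed with PyTorch; the λ value and the empirical-Fisher estimate are illustrative simplifications, not the paper's exact settings.

```python
import torch

def diagonal_fisher(model, data_loader, loss_fn):
    """Diagonal Fisher estimate for Task A: average squared gradients of the
    loss over Task A data (an empirical-Fisher simplification)."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    n_batches = 0
    for x, y in data_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            fisher[n] += p.grad.detach() ** 2
        n_batches += 1
    return {n: f / n_batches for n, f in fisher.items()}

def ewc_penalty(model, theta_star, fisher, lam=400.0):
    """EWC penalty: sum_i (lam / 2) * F_i * (theta_i - theta*_A,i)^2.
    theta_star and fisher are saved after training on Task A;
    lam is a placeholder regularisation strength."""
    loss = 0.0
    for n, p in model.named_parameters():
        loss = loss + (lam / 2.0) * (fisher[n] * (p - theta_star[n]) ** 2).sum()
    return loss

# While training on Task B:
#   total_loss = task_b_loss + ewc_penalty(model, theta_star_A, fisher_A)
```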


Experiment: Permuted MNIST

Random, fixed permutations of the MNIST dataset.
Train a multilayer, fully-connected network with ReLUs until convergence.
We compare SGD, L2 regularisation, and EWC.

[Images: example digits under permutations A, B, and C.]
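For concreteness, a minimal sketch of how such permuted variants can be generated; loading the actual MNIST images is left out, and the function names are placeholders.

```python
import numpy as np

def make_permutations(n_tasks, n_pixels=28 * 28, seed=0):
    """One random, fixed pixel permutation per task (Perm A, B, C, ...)."""
    rng = np.random.default_rng(seed)
    return [rng.permutation(n_pixels) for _ in range(n_tasks)]

def permute_images(images, perm):
    """Apply a fixed permutation to flattened MNIST images of shape (N, 784)."""
    return images[:, perm]

perms = make_permutations(3)   # Perm A, Perm B, Perm C
# task_b_images = permute_images(mnist_images, perms[1])
```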


Let’s try something harder...

Sequential reinforcement learning tasks (10 Atari games)

Random ordering with extended game play on each task, with multiple returns to each game

Unknown task boundaries

Regular testing of all 10 games

Single network with fixed capacity


Experiment: Atari 10

Forget-Me-Not [1] allows labeling of data segments, which is used for:

● EWC regularisation
● Task-specific replay buffers for DDQN [2]
● Task-specific biases and gains at each network layer

The Fisher is estimated at each task boundary and the EWC penalty is updated (see the sketch below).

[1] The Forget-Me-Not Process, Milan et al., NIPS 2016
[2] Deep Reinforcement Learning with Double Q-learning, van Hasselt et al., AAAI 2016
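A sketch of what "estimate the Fisher at each task boundary and update the EWC penalty" can look like in code, assuming a PyTorch model and reusing the diagonal-Fisher estimate from the earlier sketch; the class, method names, and λ value are placeholders.

```python
class EWCAnchors:
    """Keeps one (theta*, Fisher) anchor per completed task. 'model' is
    assumed to be a torch.nn.Module; lam is a placeholder strength."""
    def __init__(self, lam=400.0):
        self.anchors = []   # list of (theta_star, fisher) dicts
        self.lam = lam

    def on_task_boundary(self, model, fisher):
        """Called when a boundary is detected: snapshot parameters and Fisher."""
        theta_star = {n: p.detach().clone() for n, p in model.named_parameters()}
        self.anchors.append((theta_star, fisher))

    def penalty(self, model):
        """Sum of quadratic EWC penalties over all stored anchors."""
        total = 0.0
        for theta_star, fisher in self.anchors:
            for n, p in model.named_parameters():
                total = total + (self.lam / 2.0) * (
                    fisher[n] * (p - theta_star[n]) ** 2).sum()
        return total
```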

James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, Raia Hadsell
Overcoming catastrophic forgetting in neural networks
PNAS 2017 · arxiv.org/abs/1612.00796

Learning to navigate in complex mazes


Navigation mazes

Game episode:
1. Random start
2. Find the goal (+10)
3. Teleport randomly
4. Re-find the goal (+10)
5. Repeat (limited time)

Rewards: +10 for the goal, +1 for apples along the way
Episode length: 3600 or 10800 steps, depending on the maze

Variants:
● Static maze, static goal
● Static maze, random goal
● Random maze

Observations: RGB, velocity
Actions: 8

Why is learning navigation via reinforcement learning hard?

Given: sparse rewards
Wanted: spatial knowledge

The vast and meaningless silence of an agent exploring...

[Learning curves (x-axis in units of 1e7 steps): a long flat stretch of reward before the moment the agent can say "I have been here before! I know where to go!"]

Why is learning navigation via reinforcement learning hard?

Given: sparse rewards
Wanted: spatial knowledge

1. Accelerate reinforcement learning through auxiliary losses
   ➔ Stable gradients help learning, even if unrelated to reward
2. Drive spatial knowledge through the choice of auxiliary tasks (see the loss sketch below):
   ● Depth prediction
   ● Loop closure prediction
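A minimal sketch of such auxiliary losses, assuming depth is quantised into discrete classes and loop closure is a binary label; the beta weights are placeholders, not the paper's values.

```python
import torch.nn.functional as F

def auxiliary_losses(depth_logits, depth_targets, loop_logit, loop_label,
                     beta_d=1.0, beta_l=1.0):
    """Auxiliary losses added on top of the RL objective. Depth is treated as
    classification over quantised depth bins, loop closure as a binary
    prediction."""
    depth_loss = F.cross_entropy(depth_logits, depth_targets)
    loop_loss = F.binary_cross_entropy_with_logits(loop_logit, loop_label)
    return beta_d * depth_loss + beta_l * loop_loss

# total_loss = a3c_loss + auxiliary_losses(...)
```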


Nav agent ingredients:

1. Convolutional encoder and RGB inputs
2. Stacked LSTM
3. Additional inputs (reward, action, and velocity)
4. RL: Asynchronous advantage actor critic (A3C), Mnih et al. (2016)
5. Aux task 1: Depth predictors (D1 from the convolutional features, D2 from the LSTM features)
6. Aux task 2: Loop closure predictor (L)
7. For analysis: Position decoder

[Architecture diagram: encoder over x_t feeding a stacked LSTM that also receives r_{t-1}, v_t, and a_{t-1}, with heads for the policy/value, depth (D1, D2), loop closure (L), and position.]
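A minimal sketch of this agent in PyTorch. The layer sizes are illustrative (84x84 RGB input, 6-dimensional velocity), the wiring of the extra inputs follows the diagram labels r_{t-1} and {v_t, a_{t-1}}, and each depth head predicts a single coarse depth distribution rather than the paper's low-resolution depth map.

```python
import torch
import torch.nn as nn

class NavAgent(nn.Module):
    """Sketch of the Nav A3C+D1D2L agent: conv encoder, stacked LSTM with
    extra inputs, policy/value heads, and auxiliary depth / loop heads."""
    def __init__(self, n_actions=8, depth_bins=8, vel_dim=6):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(3, 16, 8, stride=4), nn.ReLU(),
            nn.Conv2d(16, 32, 4, stride=2), nn.ReLU(),
            nn.Flatten(), nn.Linear(32 * 9 * 9, 256), nn.ReLU())
        # First LSTM layer sees the previous reward; the second additionally
        # sees the velocity and the previous action (one-hot).
        self.lstm1 = nn.LSTMCell(256 + 1, 64)
        self.lstm2 = nn.LSTMCell(256 + 64 + n_actions + vel_dim, 256)
        self.policy = nn.Linear(256, n_actions)
        self.value = nn.Linear(256, 1)
        self.depth1 = nn.Linear(256, depth_bins)   # D1: from conv features
        self.depth2 = nn.Linear(256, depth_bins)   # D2: from LSTM features
        self.loop = nn.Linear(256, 1)              # L: loop closure (binary)

    def forward(self, x, r_prev, v, a_prev_onehot, state1, state2):
        f = self.enc(x)
        h1, c1 = self.lstm1(torch.cat([f, r_prev], dim=1), state1)
        h2, c2 = self.lstm2(torch.cat([f, h1, a_prev_onehot, v], dim=1), state2)
        return (self.policy(h2), self.value(h2),
                self.depth1(f), self.depth2(h2), self.loop(h2),
                (h1, c1), (h2, c2))
```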


more details.. the losses:

● policy gradient (A3C)
● depth prediction from visual features (D1)
● depth prediction from LSTM features (D2)
● loop prediction from LSTM features (L)

[The slide showed the corresponding equations; a reconstruction follows.]
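The equations themselves did not survive extraction. As a reconstruction, the policy-gradient term is the standard A3C update (Mnih et al., 2016) and the auxiliary terms are cross-entropy losses combined as sketched earlier; treat the exact forms and weightings as approximate.

```latex
% A3C policy gradient with entropy regularisation (standard form):
\nabla_{\theta}\,\log \pi(a_t \mid s_t;\theta)\,\bigl(R_t - V(s_t;\theta_v)\bigr)
  \;+\; \beta \,\nabla_{\theta} H\bigl(\pi(\cdot \mid s_t;\theta)\bigr)

% Total objective with the auxiliary depth (D1, D2) and loop-closure (L) terms:
L_{\mathrm{total}} \;=\; L_{\mathrm{A3C}}
  \;+\; \beta_{D_1} L_{D_1} \;+\; \beta_{D_2} L_{D_2} \;+\; \beta_{L} L_{L}
```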

Experiments

Agent architectures compared:
a. FF A3C
b. LSTM A3C
c. Nav A3C
d. Nav A3C+D1D2L

[Architecture diagrams for the four agents, from plain feed-forward A3C up to the Nav A3C agent with the D1, D2, and L auxiliary heads.]

Results on large maze with static goal

https://www.youtube.com/watch?v=zHhbypmKaj0

Should depth be an input? Or a target?

[Diagrams: one agent receives RGBD as input; the other receives RGB only and predicts depth (D1, D2) as auxiliary targets.]

Answer: the dense, non-noisy gradients from depth as a target are more helpful

Results with random goal locations

Is the agent remembering the goal location?
● Mean time to first goal find of an episode: 14.0 sec
● Mean time to subsequent goal finds: 7.2 sec
● Not as impressive for large mazes: 15.4 sec vs 15.0 sec

[Plots: learning curves for the small and large random-goal mazes.]

Latency to goal (as the agent returns)

● Trajectories of the Nav A3C+D+L agent in the I-maze and random goal maze over the course of one episode

● Value function and goal finding (red lines) are shown


Position decoding

● Trajectories of the Nav A3C+D+L agent in the random goal maze

● Position likelihoods are overlaid (predicted from LSTM hiddens)

● Initial uncertainty gives way to accurate position estimation.

[Architecture diagram: the position decoder reads the LSTM hidden state, alongside the D1, D2, and L heads.]

Results in random mazes (small and large)

https://www.youtube.com/watch?v=EKXQAjoNdGM
https://www.youtube.com/watch?v=lNoaTyMZsWI

Thank you!
raiahadsell.com

Piotr Mirowski, Razvan Pascanu, Fabio, Ross, Andy, Hubert, Laurent, Koray, Dharsh, Misha, Andrea

Learning to navigate in complex environments
ICLR 2017 · arxiv.org/abs/1611.03673
