
Model-Based Active Exploration
Pranav Shyam, Wojciech Jaskowski, Faustino Gomez
arxiv.org/abs/1810.12162

Presentation by Danijar Hafner

Reinforcement Learning

[Diagram: a learning agent, defined by an objective and an algorithm, receives sensor input from an unknown environment and sends back motor output.]

Intrinsic Motivation

[Diagram: the same agent-environment loop as in reinforcement learning, with labels sensor input, motor output, learning agent, objective, algorithm, and unknown environment; here the objective is intrinsic to the agent rather than supplied by the environment.]

Information Gain

Without rewards, the agent can only learn about the environment.

A model W represents our knowledge, e.g. an input density or a forward prediction model.

Need to represent uncertainty about W to tell how much we have learned: data collection turns the prior p(W) into the posterior p(W | X).

To gain the most information, we aim to maximize the mutual information between future sensory inputs X and model parameters W. Both W and X are random variables:

max_a I(X; W | A=a) = ?

Expected Infogain (e.g. MAX, PETS-ET, LD)

Objective: I(X; W | A=a)

Need to search for actions that will lead to high information gain without additional environment interaction.

Learn a forward model of the environment to search for actions by planning or learning in imagination.

Computing the expected information gain requires computing entropies of a model with uncertainty estimates.

Retrospective Infogain (e.g. VIME, ICM, RND)

Objective: KL[p(W | X, A=a) || p(W | A=a)]

Collect episodes, train the world model, record the improvement, and reward the controller by this improvement.

The infogain depends on the agent's knowledge, which keeps changing, making it a non-stationary objective.

The learned controller will lag behind and go to states that were previously novel but are not anymore.
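The two objectives are two sides of one standard identity (stated here for clarity): the expected infogain is the retrospective KL divergence averaged over future inputs, before they are observed.

```latex
I(X; W \mid A{=}a)
  \;=\; \mathbb{E}_{x \sim p(X \mid A=a)}
  \Big[\, \mathrm{KL}\big[\, p(W \mid X{=}x, A{=}a) \;\big\|\; p(W \mid A{=}a) \,\big] \,\Big]
```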

Retrospective Novelty

Episode 1: everything unknown → random behavior → high novelty → reinforce behavior.

Episode 2: repeat behavior → reach similar states → not surprising anymore :( → unlearn behavior.

Episode 3: repeat behavior → still not novel → unlearn behavior.

Episode 4: back to random behavior.

The agent builds a map of where it was already and avoids those states.

Expected Novelty

Episode 1: everything unknown → consider options → execute plan → observe new data.

Episode 2: consider options → execute plan → observe new data.

Ensemble of Dynamics Models

Learn dynamics both to represent knowledge and to plan for expected infogain.

Capture uncertainty as an ensemble of non-linear Gaussian predictors.

I(X; W | A=a) = H(X | A=a) − H(X | W, A=a)
(epistemic uncertainty = total uncertainty − aleatoric uncertainty)

Information gain targets uncertain trajectories with low expected noise:

Wide predictions mean high expected noise; overlapping modes mean less total uncertainty.

Narrow predictions mean low expected noise; distant modes mean large total uncertainty.
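As a concrete illustration (my own minimal sketch, not the paper's code), each ensemble member can be a small network that outputs a diagonal Gaussian over the next state, and the ensemble is K such networks trained from different random initializations:

```python
import torch
import torch.nn as nn

class GaussianDynamics(nn.Module):
    """One ensemble member: a diagonal Gaussian prediction of the next state."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, state_dim)
        self.log_std = nn.Linear(hidden, state_dim)

    def forward(self, state, action):
        h = self.trunk(torch.cat([state, action], dim=-1))
        std = self.log_std(h).clamp(-5.0, 2.0).exp()  # keep variances in a sane range
        return torch.distributions.Normal(self.mean(h), std)

# K members with identical architecture but different initializations;
# disagreement between their predictions is the epistemic signal for exploration.
ensemble = [GaussianDynamics(state_dim=8, action_dim=2) for _ in range(5)]
```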

Expected Infogain Approximation

I(X; W | A=a) = H(X | A=a) − H(X | W, A=a)
(total uncertainty − aleatoric uncertainty)

Ensemble members: p(X | W=w_k, A=a)

Aggregate prediction: p(X | A=a) ≈ 1/K Σ_k p(X | W=w_k, A=a)

Aleatoric uncertainty: H(X | W, A=a) ≈ 1/K Σ_k H(p(X | W=w_k, A=a))

Total uncertainty: H(X | A=a) ≈ H(1/K Σ_k p(X | W=w_k, A=a))

Gaussian entropy has a closed form, so we can compute the aleatoric uncertainty. GMM entropy does not: sample it, or switch to the Rényi entropy, which has a closed form.
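A minimal numerical sketch of this estimator for a one-dimensional prediction (illustrative only; it uses the sampling route for the GMM entropy rather than the closed-form Rényi variant): the aleatoric term uses the Gaussian closed form, and the mixture entropy is estimated with Monte Carlo samples.

```python
import numpy as np
from scipy.stats import norm

def expected_infogain(means, stds, n_samples=10_000, rng=None):
    """Estimate I(X; W | A=a) from K one-dimensional Gaussian ensemble predictions."""
    rng = np.random.default_rng() if rng is None else rng
    K = len(means)

    # Aleatoric uncertainty: average closed-form Gaussian entropy,
    # H(N(mu, sigma^2)) = 0.5 * log(2 * pi * e * sigma^2).
    aleatoric = np.mean(0.5 * np.log(2 * np.pi * np.e * stds**2))

    # Total uncertainty: entropy of the uniform mixture, by Monte Carlo.
    ks = rng.integers(K, size=n_samples)  # pick a random member per sample
    xs = rng.normal(means[ks], stds[ks])  # draw from that member
    mix_pdf = norm.pdf(xs[:, None], means[None, :], stds[None, :]).mean(axis=1)
    total = -np.mean(np.log(mix_pdf))

    return total - aleatoric  # epistemic uncertainty (>= 0 up to MC noise)

# Agreeing members -> near-zero infogain; disagreeing members -> large infogain.
print(expected_infogain(np.array([0.0, 0.1]), np.array([1.0, 1.0])))
print(expected_infogain(np.array([-3.0, 3.0]), np.array([1.0, 1.0])))
```

With two well-separated members the estimate approaches log 2 ≈ 0.69 nats, the maximum possible for K = 2.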

Compared Algorithms

Learning from imagined trajectories (expected):
MAX: JSD infogain
TVAX: state variance

Learning from experience replay (retrospective):
JDRX: JSD infogain
PERX: prediction error

Exploration Chain Domain

[Figure: a chain of states with a small distractor reward of +0.001 at one end and a reward of +1 at the far end.]
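A hedged reconstruction of this environment as code (chain length, start state, and action semantics are my assumptions; only the two reward values come from the slide):

```python
class Chain:
    """Toy exploration chain: +0.001 at the left end, +1 at the right end."""
    def __init__(self, n=40, start=1):
        self.n = n
        self.start = start
        self.s = start

    def reset(self):
        self.s = self.start
        return self.s

    def step(self, action):  # action: 0 = move left, 1 = move right
        self.s = max(0, self.s - 1) if action == 0 else min(self.n - 1, self.s + 1)
        if self.s == 0:
            return self.s, 0.001  # small distractor reward, found immediately
        if self.s == self.n - 1:
            return self.s, 1.0    # large reward, requires deep exploration
        return self.s, 0.0
```

A greedy agent latches onto the +0.001 distractor; an infogain-driven agent keeps pushing toward the unexplored end of the chain.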

State Coverage of Ant Maze

[Figure: state-coverage plots for the Ant Maze environment.]

Zero-Shot Adaptation

Learn an evaluation policy inside the learned model, given a known reward function.

[Figure: task performance, split into tasks where no exploration is needed and tasks where exploration is needed, compared against a model-free baseline trained on 10x the data.]
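One simple way to obtain such an evaluation policy is random shooting inside the model (a sketch under my own assumptions; `model.predict` is a hypothetical one-step mean prediction, not an API from the paper):

```python
import numpy as np

def plan_in_model(model, reward_fn, state, horizon=20, n_candidates=500, rng=None):
    """Score imagined action sequences in the learned model with the known
    reward function; return the first action of the best sequence."""
    rng = np.random.default_rng() if rng is None else rng
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon))
    returns = np.zeros(n_candidates)
    for i, actions in enumerate(candidates):
        s = state
        for a in actions:
            s = model.predict(s, a)        # hypothetical: mean next-state prediction
            returns[i] += reward_fn(s, a)  # known reward, never observed in the env
    return candidates[np.argmax(returns), 0]
```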

Conclusions

Information gain is a principled, task-agnostic objective.

As a non-stationary objective, it should be optimized in expectation.

This requires a dynamics model for planning to explore.

An ensemble of Gaussian dynamics models is a practical way to represent uncertainty.