
  • Curiosity and Maximum Entropy Exploration

    Sham M. Kakade
    Elad Hazan, Karan Singh, Abby Van Soest

    University of Washington

    Google AI Princeton


  • Exploration and Curiosity

    Can an agent, in a known or unknown environment, learn what it is capable of doing?

    Understanding the agent’s capabilities is helpful for downstream tasks; reward functions may be poorly specified or sparse.


  • Setting: Markov decision processes

    S states, starting from s0 ∼ d0; A actions; dynamics model P(s′|s, a); reward function r(s); discount factor γ. [Sutton & Barto ’18]

    A (non-stationary) policy π maps histories (s0, a0, s1, a1, . . .) to actions at. Standard objective: find π which maximizes

    V(π) = (1 − γ) E[ r(s0) + γ r(s1) + γ² r(s2) + · · · ],

    where the distribution of st, at is induced by π.
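
    As a concrete (hypothetical) illustration of this setting, the sketch below builds a tiny random tabular MDP in NumPy and evaluates V(π) for a fixed stationary policy by solving for the discounted state distribution. All sizes and names are made up for illustration.

```python
import numpy as np

# A tiny hypothetical MDP (3 states, 2 actions), just to make the definitions concrete.
rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
d0 = np.array([1.0, 0.0, 0.0])                    # start distribution: s0 ~ d0
P = rng.dirichlet(np.ones(S), size=(S, A))        # P[s, a] = distribution over next states
r = np.array([0.0, 0.0, 1.0])                     # reward r(s)
pi = np.full((S, A), 1.0 / A)                     # a stationary policy pi(a|s)

# State-to-state transition matrix under pi: P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a).
P_pi = np.einsum("sa,sat->st", pi, P)

# Discounted state distribution d_pi(s) = (1 - gamma) * sum_t gamma^t Pr(s_t = s | pi),
# obtained by solving d_pi = (1 - gamma) d0 + gamma P_pi^T d_pi.
d_pi = np.linalg.solve(np.eye(S) - gamma * P_pi.T, (1 - gamma) * d0)

# V(pi) = (1 - gamma) E[r(s0) + gamma r(s1) + gamma^2 r(s2) + ...] = sum_s d_pi(s) r(s).
V = d_pi @ r
print("d_pi =", d_pi, " V(pi) =", V)
```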


  • Prior work: The Explore/Exploit Tradeoff

    Thrun ’92

    Random search does not find the reward quickly.

    (theory) Balancing the explore/exploit tradeoff: [Kearns & Singh ’02] E³ is a near-optimal algorithm. Sample complexity: [K. ’03; Azar ’17]. Model-free: [Strehl et al. ’06; Dann & Brunskill ’15; Szita & Szepesvári ’10; Lattimore et al. ’14; Jin et al. ’18].


  • Prior work: Intrinsic Motivation

    There is a host of work on learning based on “intrinsic reward”.

    Prior work: intrinsic motivation (Chentanez et al. ’05; Singh et al. ’09, ’10; Zheng et al. ’18); bonuses (Bellemare et al. ’16; Ostrovski et al. ’17; Tang et al. ’17); prediction error (Lopes et al. ’12; Pathak et al. ’17; Savinov et al. ’18; Fu et al. ’17; Shakir et al. ’15; Rein et al. ’16; Weber ’18).


  • Example 1: Dynamics-based Prediction Error

    [Burda et al. 2018a]

    rt = ‖f(st, at) − φ(st+1)‖²

    Agent reaches the end of many games with no reward signal.

    Intrinsic reward function based on dynamics prediction error: f(s, a) is a forward dynamics model trained to predict the next state; φ can be random features, an inverse-dynamics embedding, or a VAE.
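
    The following sketch (my illustration, not the authors’ code) shows the shape of such a reward: a linear forward model f(s, a) is regressed onto fixed random features φ(s′), and the squared prediction error serves as rt. The feature map, model class, and all sizes are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, n_actions, feat_dim = 4, 3, 8                 # hypothetical sizes

Phi = rng.normal(size=(feat_dim, obs_dim))             # phi(s') = Phi @ s': fixed random features
W = np.zeros((feat_dim, obs_dim + n_actions))          # f(s, a) = W @ [s; one_hot(a)]: learned model

def intrinsic_reward(s, a, s_next, lr=0.1):
    """r_t = ||f(s_t, a_t) - phi(s_{t+1})||^2, followed by one SGD step on the model."""
    global W
    x = np.concatenate([s, np.eye(n_actions)[a]])
    err = W @ x - Phi @ s_next
    W = W - lr * np.outer(err, x)                      # reduce the prediction error on this transition
    return float(err @ err)

# Toy rollout with fake dynamics, just to exercise the reward computation.
s = rng.normal(size=obs_dim)
for t in range(5):
    a = int(rng.integers(n_actions))
    s_next = 0.5 * s + rng.normal(scale=0.1, size=obs_dim)
    print(t, round(intrinsic_reward(s, a, s_next), 4))
    s = s_next
```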


  • Example 2: Visitation-based Rewards

    [Burda et al. 2018b]

    rt = ‖f(st) − φ(st)‖²

    Agent completes over 10 rooms with no reward signal.

    Intrinsic reward designed to be large for infrequently visited states: φ(·) is a random function and f(·) is a learned function. Related: [Bellemare et al. ’16].
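
    A similarly minimal sketch of this visitation-style reward on a tabular state space (again my own illustration, with made-up sizes): φ is a fixed random function, f is learned to match it on visited states, so ‖f(s) − φ(s)‖² stays large exactly on rarely visited states.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, feat_dim = 10, 6                      # hypothetical sizes

phi = rng.normal(size=(n_states, feat_dim))     # fixed random target phi(s)
f = np.zeros((n_states, feat_dim))              # learned predictor f(s)

def visitation_reward(s, lr=0.5):
    """r_t = ||f(s_t) - phi(s_t)||^2; each visit nudges f(s_t) toward phi(s_t)."""
    err = f[s] - phi[s]
    f[s] = f[s] - lr * err
    return float(err @ err)

# Frequently visited states earn small rewards; a fresh state earns a large one.
for s in [0, 0, 0, 0, 1]:
    print(s, round(visitation_reward(s), 3))
```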


  • Algorithmic Challenges

    Many algorithms solve sub-problems using fictitious rewards. How do we disentangle uncertainty in the dynamics from novelty? How do we determine ’novelty bonuses’ as fictitious rewards?

    Understanding which ’oracle models / sub-problems’ we should solve. ’Oracle’-based algorithms: CPI, DPPS, DAgger, SEARN, ...

    An optimization framework may help with these issues.


  • Outline

    This work: Can we efficiently learn policies in an MDP which optimize “task agnostic” reward functions?

    Formalize a class of “task agnostic” reward functions. Theorem 1: there is a computationally efficient algorithm to optimize these reward functions. Theorem 2 (unknown model case): there is also a sample-efficient algorithm.


  • Formalization: A ’task agnostic’ objective

    π induces a distribution over states:

    dπ(s) = (1 − γ) ( Pr(s0 = s|π) + γ Pr(s1 = s|π) + γ² Pr(s2 = s|π) + · · · )

    Consider (concave) reward functions based on the density dπ:

    maxπ R(dπ)

    Examples:

    Entropy(dπ) = − ∑s dπ(s) log dπ(s)

    CrossEntropy(Uniform‖dπ) = − ∑s (1/S) log dπ(s)
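
    To make these objectives concrete, here is a small tabular sketch (my own, on a hypothetical random MDP) that computes dπ for a stationary policy and evaluates the two example reward functions exactly as written above.

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma = 4, 2, 0.9
d0 = np.ones(S) / S
P = rng.dirichlet(np.ones(S), size=(S, A))       # P[s, a] = distribution over next states

def d_pi(pi):
    """d_pi(s) = (1 - gamma) * sum_t gamma^t Pr(s_t = s | pi), for a stationary policy pi."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    return np.linalg.solve(np.eye(S) - gamma * P_pi.T, (1 - gamma) * d0)

def entropy(d):
    return -np.sum(d * np.log(d))                # Entropy(d_pi)

def cross_entropy_uniform(d):
    return -np.sum(np.log(d)) / S                # CrossEntropy(Uniform || d_pi)

d = d_pi(np.full((S, A), 1.0 / A))               # density of the uniformly random policy
print("d_pi:", np.round(d, 3))
print("Entropy:", entropy(d), " CrossEntropy(Uniform||d_pi):", cross_entropy_uniform(d))
```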


  • Example: The CrossEntropy Reward Function

    π∗CE = argmaxπ CrossEntropy(Uniform‖dπ)

    The CrossEntropy objective encourages the ’most uniform’ random walk over the MDP.

    Lemma. Given some state s, let Pr∗(s) be the highest probability of being in s over all policies:

    Pr∗(s) = maxπ dπ(s).

    For all states s, we have that:

    dπ∗CE(s) ≥ Pr∗(s) / (4S)


  • Optimization Challenges

    maxπ R(dπ)

    Can we optimize this efficiently?

    With an approximate planning oracle, where we feed in ’intrinsic’ reward functions? With a known model? With only simulation-based access?


  • Optimization Landscape

    Lemma. R(dπ) is not concave in the policy π, even if R(d) is concave in d.


  • Mixtures of policies

    The optimal policy may be stochastic. Define a mixture policy:

    policies: C = (π1, . . . , πk); mixing weights: (α1, . . . , αk); πmix = (α, C): at t = 0, sample a policy πi (with probability αi) and use it onwards.

    The induced state distribution is:

    dπmix(s) = ∑i αi dπi(s).
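
    A short numerical sketch (with made-up numbers) of the two sides of this identity: the mixture’s state distribution is the α-weighted combination of the base distributions, and executing πmix means drawing one base policy at t = 0 and following it for the whole trajectory.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical state distributions of k = 2 base policies
# (e.g. computed as in the d_pi sketch earlier).
d_base = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6]])     # d_base[i] = d_{pi_i}
alpha = np.array([0.25, 0.75])           # mixing weights

# Induced distribution of the mixture: d_pimix(s) = sum_i alpha_i * d_{pi_i}(s).
print("d_pimix =", alpha @ d_base)

# Executing pi_mix: sample one base policy at t = 0, then use it onwards.
i = rng.choice(len(alpha), p=alpha)
print("this episode follows base policy", i)
```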


  • Optimization with Approximate Oracles

    For t = 0, . . . , T − 1 do:

    1. Set d̂πmix = DensityEst(πmix, εest).

    2. Set the intrinsic reward function as

       rt(s) := ∂R(d1, d2, . . . , dS) / ∂ds evaluated at ds = d̂πmix(s)  ( = 1 / d̂πmix(s) for CrossEntropy ).

    3. Set πt+1 = ApproxPlan(rt, εplan).

    4. Update πmix ← (αt+1, Ct+1) using “learning rate” η:

       Ct+1 = (π0, π1, . . . , πt+1),   αt+1 = ((1 − η)αt, η).
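
    Below is one way this loop could look in code (a sketch under simplifying assumptions, not the authors’ implementation): the MDP is a small random tabular one, DensityEst is replaced by an exact computation of dπmix, ApproxPlan by value iteration on the intrinsic reward, and the reward 1/d̂πmix(s) matches the CrossEntropy case shown in step 2.

```python
import numpy as np

rng = np.random.default_rng(4)
S, A, gamma, eps = 6, 2, 0.9, 1e-6
d0 = np.ones(S) / S
P = rng.dirichlet(np.ones(S), size=(S, A))          # P[s, a] = distribution over next states

def d_pi(pi):
    """Exact discounted state distribution of a stationary policy (a stand-in for DensityEst)."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    return np.linalg.solve(np.eye(S) - gamma * P_pi.T, (1 - gamma) * d0)

def approx_plan(reward, iters=200):
    """Value iteration on the intrinsic reward (a stand-in for ApproxPlan)."""
    V = np.zeros(S)
    for _ in range(iters):
        V = (reward[:, None] + gamma * P @ V).max(axis=1)
    greedy = (reward[:, None] + gamma * P @ V).argmax(axis=1)
    return np.eye(A)[greedy]                         # deterministic policy as an (S, A) matrix

T, eta = 50, 0.1
policies = [np.full((S, A), 1.0 / A)]                # C_0: start from the uniformly random policy
alphas = np.array([1.0])

for t in range(T):
    d_hat = sum(a * d_pi(p) for a, p in zip(alphas, policies))   # step 1: density of pi_mix
    reward = 1.0 / np.maximum(d_hat, eps)            # step 2: dR/dd_s at d_hat (CrossEntropy case)
    policies.append(approx_plan(reward))             # step 3: plan against the intrinsic reward
    alphas = np.append((1 - eta) * alphas, eta)      # step 4: (alpha_{t+1}, C_{t+1})

d_final = sum(a * d_pi(p) for a, p in zip(alphas, policies))
print("final d_pimix:", np.round(d_final, 3), " entropy:", -np.sum(d_final * np.log(d_final)))
```

    The planner here returns a stationary deterministic policy, which suffices for maximizing a fixed intrinsic reward in a tabular discounted MDP; the slides allow general non-stationary policies.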


  • Theorem 1: Computation

    Theorem. Assume:

    R(·) is concave and β-smooth, i.e. ‖∇²R‖ ≤ β; DensityEst has accuracy εest ≤ ε/β in ℓ∞; ApproxPlan has accuracy εplan ≤ ε in ℓ∞.

    If T ≥ 10β/ε, then

    R(dπmix) ≥ maxπ R(dπ) − ε.


  • Proof idea: the conditional gradient method

    Problem: for a convex set K and concave R(·), solve maxx∈K R(x). Linear optimization oracle: suppose we can compute maxy∈K ⟨y, ∇R(x)⟩. Guarantee: after T rounds, xT is a T⁻¹-approximate maximizer and a convex combination of T vertices of K.

    Choose K to be the convex set of state distributions.

    dπmix ← (1 − η) · dπmix + η · argmaxy∈K ⟨y, ∇R(dπmix)⟩
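
    For intuition, here is the generic conditional gradient (Frank-Wolfe) recipe on the simplest possible K, the probability simplex, maximizing the entropy (my example): the linear oracle returns a vertex, i.e. a point mass on the coordinate with the largest gradient, just as the update above mixes in the state distribution of a single new policy.

```python
import numpy as np

def frank_wolfe_entropy(n=5, T=200):
    """Conditional gradient ascent of R(x) = -sum_s x_s log x_s over the probability simplex."""
    x = np.eye(n)[0]                                   # start at a vertex of K
    for t in range(1, T + 1):
        grad = -np.log(np.maximum(x, 1e-12)) - 1.0     # gradient of the entropy at x
        vertex = np.eye(n)[np.argmax(grad)]            # linear oracle: argmax_{y in K} <y, grad>
        eta = 2.0 / (t + 1)                            # standard Frank-Wolfe step size
        x = (1 - eta) * x + eta * vertex               # same form as the d_pimix update above
    return x

print(np.round(frank_wolfe_entropy(), 3))              # close to uniform, the entropy maximizer
```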


  • Theorem 2: unknown MDP case

    Theorem. In an episodic setting, there is an algorithm that uses O(S²A · poly(β, ε⁻¹)) samples and returns a policy πmix such that:

    R(dπmix) ≥ maxπ R(dπ) − ε.


  • Experiments: CrossEntropy


  • Thanks!

    Provably efficient methods to optimize ’task agnostic’ reward functions. Lots of open questions!

    A means to think about exploration from an optimization perspective? Interplay with local-search-based methods?

    Collaborators:

    Elad Hazan Karan Singh Abby Van Soest

