Curiosity and Maximum Entropy Exploration
Sham M. Kakade (University of Washington)
Elad Hazan, Karan Singh, Abby Van Soest (Google AI Princeton)
Exploration and Curiosity
Can an agent, in a known or unknown environment, learn what it is capable of doing?

- understanding the agent's capabilities is helpful for downstream tasks
- reward functions may be poorly specified or sparse
Setting: Markov decision processes
- S states; start with s_0 ∼ d_0
- A actions
- dynamics model P(s'|s, a)
- reward function r(s)
- discount factor γ

[Sutton & Barto '18]

(non-stationary) policy π: (s_0, a_0, s_1, a_1, …) → a_t

Standard objective: find π which maximizes

V(π) = (1 − γ) E[ r(s_0) + γ r(s_1) + γ^2 r(s_2) + ⋯ ]

where the distribution of s_t, a_t is induced by π.
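As a concrete illustration (a toy sketch of my own, not from the talk), the following Python snippet estimates V(π) by Monte Carlo rollouts on a randomly generated tabular MDP; the sizes, the uniform start distribution, and the uniformly random policy are all assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))  # P[s, a] is a distribution over next states
r = rng.random(S)                           # reward r(s)
d0 = np.ones(S) / S                         # start-state distribution, s_0 ~ d0
pi = np.ones((S, A)) / A                    # a uniformly random stationary policy

def value_estimate(num_episodes=2000, horizon=200):
    """Monte Carlo estimate of V(pi) = (1 - gamma) * E[sum_t gamma^t r(s_t)]."""
    total = 0.0
    for _ in range(num_episodes):
        s = rng.choice(S, p=d0)
        ret, disc = 0.0, 1.0
        for _ in range(horizon):
            ret += disc * r[s]
            a = rng.choice(A, p=pi[s])
            s = rng.choice(S, p=P[s, a])
            disc *= gamma
        total += (1 - gamma) * ret
    return total / num_episodes

print(value_estimate())  # approaches the exact discounted value as episodes grow
```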
Prior work: The Explore/Exploit Tradeoff
Thrun ’92
Random search does not find the reward quickly.
(theory) Balancing the explore/exploit tradeoff:
- [Kearns & Singh '02]: E3 is a near-optimal algorithm.
- Sample complexity: [Kakade '03; Azar et al. '17]
- Model-free: [Strehl et al. '06; Dann & Brunskill '15; Szita & Szepesvári '10; Lattimore et al. '14; Jin et al. '18]
Prior work: Intrinsic Motivation
There is a host of work on learning based on “intrinsic reward”.
Prior work:
- Intrinsic motivation: [Chentanez et al. '05; Singh et al. '09, '10; Zheng et al. '18]
- Bonuses: [Bellemare et al. '16; Ostrovski et al. '17; Tang et al. '17]
- Prediction error: [Lopes et al. '12; Pathak et al. '17; Savinov et al. '18; Fu et al. '17; Shakir et al. '15; Rein et al. '16; Weber '18]
Example 1: Dynamics-based Prediction Error
[Burda et al. 2018a]

r_t = ‖f(s_t, a_t) − φ(s_{t+1})‖^2

The agent reaches the end of many games with no reward signal.

Intrinsic reward function based on dynamics prediction error:
- f(s, a) is a forward dynamics model trained to predict the next state.
- φ can be random features, an embedding learned via inverse dynamics, or a VAE.
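A minimal sketch of how such a reward could be computed, assuming a fixed random embedding φ and a linear forward model f trained online by SGD; both are illustrative simplifications of mine, not the architecture of Burda et al.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, act_dim, feat_dim = 8, 2, 16

Phi = rng.normal(size=(feat_dim, obs_dim))      # fixed random embedding phi
W = np.zeros((feat_dim, obs_dim + act_dim))     # learned linear forward model f

def phi(s):
    return Phi @ s

def f(s, a):
    return W @ np.concatenate([s, a])

def intrinsic_reward_and_update(s, a, s_next, lr=1e-2):
    """r_t = ||f(s_t, a_t) - phi(s_{t+1})||^2, then one SGD step on f."""
    global W
    x = np.concatenate([s, a])
    err = f(s, a) - phi(s_next)
    r_t = float(err @ err)
    W -= lr * np.outer(err, x)                  # gradient step (factor of 2 folded into lr)
    return r_t

# toy usage: the reward shrinks on repeated, predictable transitions
s, a, s2 = rng.normal(size=obs_dim), rng.normal(size=act_dim), rng.normal(size=obs_dim)
for _ in range(5):
    print(intrinsic_reward_and_update(s, a, s2))
```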
Example 2: Visitation-based Rewards
[Burda et al. 2018b]

r_t = ‖f(s_t) − φ(s_t)‖^2

The agent completes over 10 rooms with no reward signal.

Intrinsic reward designed to be large for infrequently visited states:
- φ(·) is a fixed random function; f(·) is a learned function, trained to match φ on visited states.
- Related: [Bellemare et al. '16]
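The same pattern with the visitation-based flavor, as a sketch: here φ depends only on the state, and f is regressed onto φ at visited states, so ‖f(s) − φ(s)‖^2 stays large where the agent has rarely been. The linear maps are a toy assumption (a linear f can eventually fit a linear φ everywhere; practical versions use neural networks).

```python
import numpy as np

rng = np.random.default_rng(1)
obs_dim, out_dim = 8, 16

Phi = rng.normal(size=(out_dim, obs_dim))   # fixed random target function phi
F = np.zeros((out_dim, obs_dim))            # learned predictor f

def intrinsic_reward_and_update(s, lr=1e-2):
    """r_t = ||f(s_t) - phi(s_t)||^2; f is regressed toward phi on visited states."""
    global F
    err = F @ s - Phi @ s
    r_t = float(err @ err)                  # large where f has not yet fit phi
    F -= lr * np.outer(err, s)              # regression step toward phi(s)
    return r_t
```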
Algorithmic Challenges
Many algorithms solve sub-problems using fictitious rewards:
- disentangling uncertainty in the dynamics from novelty?
- determining 'novelty bonuses' as fictitious rewards?

Understanding which 'oracle models / sub-problems' we should solve:
- 'oracle'-based algorithms: CPI, PSDP, DAgger, SEARN, ...
An optimization framework may help with these issues.
Outline
This work: Can we efficiently learn policies in an MDP which optimize "task agnostic" reward functions?

- Formalize a class of "task agnostic" reward functions.
- Theorem 1: there is a computationally efficient algorithm to optimize these reward functions.
- Theorem 2 (unknown model case): there is also a sample-efficient algorithm.
Formalization: A ’task agnostic’ objective
π induces a distribution over states:
d_π(s) = (1 − γ) ( Pr(s_0 = s | π) + γ Pr(s_1 = s | π) + γ^2 Pr(s_2 = s | π) + ⋯ )

Consider (concave) reward functions based on the density d_π:

max_π R(d_π)

Examples:
- Entropy(d_π) = −Σ_s d_π(s) log d_π(s)
- −CrossEntropy(Uniform‖d_π) = Σ_s (1/S) log d_π(s)
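On a tabular MDP with a known model, both d_π and these reward functionals can be computed in closed form, using d_π = (1 − γ)(I − γ P_πᵀ)^{-1} d_0. A small sketch (the function names and the random MDP are mine):

```python
import numpy as np

def state_distribution(P, pi, d0, gamma):
    """d_pi = (1 - gamma) * sum_t gamma^t Pr(s_t = .), in closed form."""
    S = d0.shape[0]
    P_pi = np.einsum('sap,sa->sp', P, pi)   # state-to-state transitions under pi
    return (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, d0)

def entropy(d):
    return float(-np.sum(d * np.log(d + 1e-12)))   # Entropy(d_pi)

def neg_cross_entropy(d):
    return float(np.mean(np.log(d + 1e-12)))       # -CrossEntropy(Uniform || d_pi)

# example: a random MDP and the uniformly random policy
rng = np.random.default_rng(0)
S, A, gamma = 5, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))
d = state_distribution(P, np.ones((S, A)) / A, np.ones(S) / S, gamma)
print(entropy(d), neg_cross_entropy(d))
```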
Example: The CrossEntropy Reward Function
π*_CE = argmin_π CrossEntropy(Uniform‖d_π)  (equivalently, argmax_π Σ_s (1/S) log d_π(s))

Minimizing the CrossEntropy encourages the 'most uniform' random walk over the MDP.

Lemma. Given a state s, let Pr*(s) be the highest probability of being at s over all policies:

Pr*(s) = max_π d_π(s).

Then, for all states s:

d_{π*_CE}(s) ≥ Pr*(s) / (4S).
Optimization Challenges
max_π R(d_π)

Can we optimize this efficiently?
- with an approximate planning oracle, where we feed in 'intrinsic' reward functions?
- with a known model? with only simulation-based access?
Optimization Landscape
Lemma. R(d_π) is not concave in the policy π, even if R(d) is concave in d.
Mixtures of policies
The optimal policy may be stochastic. Define a mixture policy:
- policies: C = (π_1, …, π_k)
- mixing weights: (α_1, …, α_k)
- π_mix = (α, C): at t = 0, sample a policy π_i ∼ α and use it onwards.

The induced state distribution is:

d_πmix(s) = Σ_i α_i d_{π_i}(s).
Optimization with Approximate Oracles
For t = 0, …, T − 1:

1. Set d̂_πmix = DensityEst(π_mix, ε_est).
2. Set the intrinsic reward function as

   r_t(s) := ∂R(d_1, d_2, …, d_S)/∂d_s, evaluated at d_s = d̂_πmix(s)
           ( = 1 / d̂_πmix(s) for the CrossEntropy objective).

3. Set π_{t+1} = ApproxPlan(r_t, ε_plan).
4. Update π_mix ← (α_{t+1}, C_{t+1}) using "learning rate" η:

   C_{t+1} = (π_0, π_1, …, π_{t+1}),  α_{t+1} = ((1 − η) α_t, η).
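Putting the loop together on a tabular MDP with a known model, as a sketch under strong assumptions: DensityEst is exact in closed form, ApproxPlan is value iteration on the intrinsic reward, and η and T are illustrative choices of mine.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 10, 2, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))       # known dynamics P[s, a, s']
d0 = np.ones(S) / S

def state_distribution(pi):
    """Exact DensityEst for a known tabular model."""
    P_pi = np.einsum('sap,sa->sp', P, pi)
    return (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P_pi.T, d0)

def approx_plan(r, iters=200):
    """ApproxPlan: greedy policy for state reward r(s), via value iteration."""
    V = np.zeros(S)
    for _ in range(iters):
        Q = r[:, None] + gamma * (P @ V)         # Q[s, a]
        V = Q.max(axis=1)
    pi = np.zeros((S, A))
    pi[np.arange(S), Q.argmax(axis=1)] = 1.0
    return pi

T, eta = 100, 0.1
C = [approx_plan(np.zeros(S))]                   # mixture components
alpha = np.array([1.0])                          # mixing weights
for t in range(T):
    d_mix = sum(a * state_distribution(p) for a, p in zip(alpha, C))
    r_t = 1.0 / (d_mix + 1e-12)                  # dR/dd_s for CrossEntropy (1/S dropped, as on the slide)
    C.append(approx_plan(r_t))
    alpha = np.append((1 - eta) * alpha, eta)    # alpha_{t+1} = ((1 - eta) alpha_t, eta)

d_mix = sum(a * state_distribution(p) for a, p in zip(alpha, C))
print("min_s d_mix(s):", d_mix.min())            # every state keeps nontrivial mass
```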
Theorem 1: Computation
Theorem. Assume:
- R(·) is concave and β-smooth, i.e. ‖∇²R‖ ≤ β;
- DensityEst has accuracy ε_est ≤ ε/β in ℓ∞;
- ApproxPlan has accuracy ε_plan ≤ ε in ℓ∞.

If T ≥ 10β/ε, then

R(d_πmix) ≥ max_π R(d_π) − ε.
Proof idea: the conditional gradient method
Problem: for a convex set K and concave R(·), solve max_{x ∈ K} R(x).
Optimization oracle: suppose we can compute max_{y ∈ K} ⟨y, ∇R(x)⟩.
Guarantee: after T rounds, x_T is
- a T^{-1}-approximate maximizer, and
- a convex combination of T vertices of K.

Choose K to be the convex set of state distributions. The update is:

d_πmix ← (1 − η) · d_πmix + η · argmax_{y ∈ K} ⟨y, ∇R(d_πmix)⟩
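The guarantee can be seen on a toy instance (my example, not from the talk): conditional gradient maximizing the entropy over the probability simplex, where the linear oracle returns a vertex e_i, so x_T is a convex combination of at most T vertices.

```python
import numpy as np

n, T = 5, 200
x = np.ones(n) / n
x[0], x[1:] = 0.6, 0.4 / (n - 1)            # perturbed start inside the simplex K
for t in range(1, T + 1):
    grad = -np.log(x) - 1.0                 # gradient of R(x) = -sum_i x_i log x_i
    vertex = np.eye(n)[np.argmax(grad)]     # linear oracle: argmax_{y in K} <y, grad>
    eta = 2.0 / (t + 2)                     # standard conditional-gradient step size
    x = (1 - eta) * x + eta * vertex
print(x)                                    # converges to the uniform maximizer
```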
Theorem 2: unknown MDP case
Theorem. In an episodic setting, there is an algorithm that uses

O( S^2 A · poly(β, ε^{-1}) )

samples and returns a policy π_mix such that:

R(d_πmix) ≥ max_π R(d_π) − ε.
Experiments: CrossEntropy
Thanks!
Provably efficient methods to optimize 'task agnostic' reward functions. Lots of open questions!

- A means to think about exploration from an optimization perspective?
- Interplay with local-search-based methods?
Collaborators:
Elad Hazan Karan Singh Abby Van Soest