
  • Curiosity and Maximum Entropy Exploration

    Sham M. Kakade
    Elad Hazan, Karan Singh, Abby Van Soest

    University of Washington

    Google AI Princeton


  • Exploration and Curiosity

    Can an agent, in a known or unknown environment, learn what it is capable of doing?

    Understanding the agent’s capabilities is helpful for downstream tasks; reward functions may be poorly specified or sparse.


  • Setting: Markov decision processes

    S states, starting from s0 ∼ d0; A actions; dynamics model P(s′|s, a); reward function r(s); discount factor γ. [Sutton & Barto ’18]

    A (non-stationary) policy π maps histories (s0, a0, s1, a1, . . .) to actions at. Standard objective: find π which maximizes

    V(π) = (1 − γ) E[ r(s0) + γ r(s1) + γ² r(s2) + · · · ],

    where the distribution of st, at is induced by π.
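
    As a concrete (hypothetical) illustration of this setting, the sketch below builds a tiny random tabular MDP in NumPy and evaluates V(π) for a fixed stationary policy by solving for the discounted state distribution. All sizes and names are made up for illustration.

```python
import numpy as np

# A tiny hypothetical MDP (3 states, 2 actions), just to make the definitions concrete.
rng = np.random.default_rng(0)
S, A, gamma = 3, 2, 0.9
d0 = np.array([1.0, 0.0, 0.0])                    # start distribution: s0 ~ d0
P = rng.dirichlet(np.ones(S), size=(S, A))        # P[s, a] = distribution over next states
r = np.array([0.0, 0.0, 1.0])                     # reward r(s)
pi = np.full((S, A), 1.0 / A)                     # a stationary policy pi(a|s)

# State-to-state transition matrix under pi: P_pi[s, s'] = sum_a pi(a|s) P(s'|s, a).
P_pi = np.einsum("sa,sat->st", pi, P)

# Discounted state distribution d_pi(s) = (1 - gamma) * sum_t gamma^t Pr(s_t = s | pi),
# obtained by solving d_pi = (1 - gamma) d0 + gamma P_pi^T d_pi.
d_pi = np.linalg.solve(np.eye(S) - gamma * P_pi.T, (1 - gamma) * d0)

# V(pi) = (1 - gamma) E[r(s0) + gamma r(s1) + gamma^2 r(s2) + ...] = sum_s d_pi(s) r(s).
V = d_pi @ r
print("d_pi =", d_pi, " V(pi) =", V)
```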


  • Prior work: The Explore/Exploit Tradeoff

    Thrun ’92

    Random search does not find the reward quickly.

    (theory) Balancing the explore/exploit tradeoff: [Kearns & Singh ’02] E³ is a near-optimal algorithm. Sample complexity: [K. ’03; Azar ’17]. Model-free: [Strehl et al. ’06; Dann & Brunskill ’15; Szita & Szepesvári ’10; Lattimore et al. ’14; Jin et al. ’18].


  • Prior work: Intrinsic Motivation

    There is a host of work on learning based on “intrinsic reward”.

    Prior work: intrinsic motivation (Chentanez et al. ’05; Singh et al. ’09, ’10; Zheng et al. ’18); bonuses (Bellemare et al. ’16; Ostrovski et al. ’17; Tang et al. ’17); prediction error (Lopes et al. ’12; Pathak et al. ’17; Savinov et al. ’18; Fu et al. ’17; Shakir et al. ’15; Rein et al. ’16; Weber ’18).


  • Example 1: Dynamics-based Prediction Error

    [Burda et al. 2018a]

    rt = ‖f(st, at) − φ(st+1)‖²

    Agent reaches the end of many games with no reward signal.

    Intrinsic reward function based on dynamics prediction error: f(s, a) is a forward dynamics model trained to predict the next state; φ can be random features, an inverse-dynamics embedding, or a VAE.
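
    The following sketch (my illustration, not the authors’ code) shows the shape of such a reward: a linear forward model f(s, a) is regressed onto fixed random features φ(s′), and the squared prediction error serves as rt. The feature map, model class, and all sizes are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, n_actions, feat_dim = 4, 3, 8                 # hypothetical sizes

Phi = rng.normal(size=(feat_dim, obs_dim))             # phi(s') = Phi @ s': fixed random features
W = np.zeros((feat_dim, obs_dim + n_actions))          # f(s, a) = W @ [s; one_hot(a)]: learned model

def intrinsic_reward(s, a, s_next, lr=0.1):
    """r_t = ||f(s_t, a_t) - phi(s_{t+1})||^2, followed by one SGD step on the model."""
    global W
    x = np.concatenate([s, np.eye(n_actions)[a]])
    err = W @ x - Phi @ s_next
    W = W - lr * np.outer(err, x)                      # reduce the prediction error on this transition
    return float(err @ err)

# Toy rollout with fake dynamics, just to exercise the reward computation.
s = rng.normal(size=obs_dim)
for t in range(5):
    a = int(rng.integers(n_actions))
    s_next = 0.5 * s + rng.normal(scale=0.1, size=obs_dim)
    print(t, round(intrinsic_reward(s, a, s_next), 4))
    s = s_next
```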


  • Example 2: Visitation-based Rewards

    [Burda et al. 2018b]

    rt = ‖f(st) − φ(st)‖²

    Agent completes over 10 rooms with no reward signal.

    Intrinsic reward designed to be large for infrequently visited states: φ(·) is a random function and f(·) is a learned function. Related: [Bellemare et al. ’16].
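
    A similarly minimal sketch of this visitation-style reward on a tabular state space (again my own illustration, with made-up sizes): φ is a fixed random function, f is learned to match it on visited states, so ‖f(s) − φ(s)‖² stays large exactly on rarely visited states.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, feat_dim = 10, 6                      # hypothetical sizes

phi = rng.normal(size=(n_states, feat_dim))     # fixed random target phi(s)
f = np.zeros((n_states, feat_dim))              # learned predictor f(s)

def visitation_reward(s, lr=0.5):
    """r_t = ||f(s_t) - phi(s_t)||^2; each visit nudges f(s_t) toward phi(s_t)."""
    err = f[s] - phi[s]
    f[s] = f[s] - lr * err
    return float(err @ err)

# Frequently visited states earn small rewards; a fresh state earns a large one.
for s in [0, 0, 0, 0, 1]:
    print(s, round(visitation_reward(s), 3))
```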


  • Algorithmic Challenges

    Many algorithms solve sub-problems using fictitious rewards. How do we disentangle uncertainty in the dynamics from novelty? How do we determine ’novelty bonuses’ as fictitious rewards?

    Understanding which ’oracle models / sub-problems’ we should solve. ’Oracle’-based algorithms: CPI, DPPS, DAgger, SEARN, ...

    An optimization framework may help with these issues.


  • Outline

    This work: Can we efficiently learn policies in an MDP which optimize “task agnostic” reward functions?

    Formalize a class of “task agnostic” reward functions. Theorem 1: there is a computationally efficient algorithm to optimize these reward functions. Theorem 2 (unknown model case): there is also a sample-efficient algorithm.


  • Formalization: A ’task agnostic’ objective

    π induces a distribution over states:

    dπ(s) = (1 − γ) ( Pr(s0 = s|π) + γ Pr(s1 = s|π) + γ² Pr(s2 = s|π) + · · · )

    Consider (concave) reward functions based on the density dπ:

    maxπ R(dπ)

    Examples:

    Entropy(dπ) = − ∑s dπ(s) log dπ(s)

    CrossEntropy(Uniform‖dπ) = − ∑s (1/S) log dπ(s)
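
    To make these objectives concrete, here is a small tabular sketch (my own, on a hypothetical random MDP) that computes dπ for a stationary policy and evaluates the two example reward functions exactly as written above.

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma = 4, 2, 0.9
d0 = np.ones(S) / S
P = rng.dirichlet(np.ones(S), size=(S, A))       # P[s, a] = distribution over next states

def d_pi(pi):
    """d_pi(s) = (1 - gamma) * sum_t gamma^t Pr(s_t = s | pi), for a stationary policy pi."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    return np.linalg.solve(np.eye(S) - gamma * P_pi.T, (1 - gamma) * d0)

def entropy(d):
    return -np.sum(d * np.log(d))                # Entropy(d_pi)

def cross_entropy_uniform(d):
    return -np.sum(np.log(d)) / S                # CrossEntropy(Uniform || d_pi)

d = d_pi(np.full((S, A), 1.0 / A))               # density of the uniformly random policy
print("d_pi:", np.round(d, 3))
print("Entropy:", entropy(d), " CrossEntropy(Uniform||d_pi):", cross_entropy_uniform(d))
```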


  • Example: The CrossEntropy Reward Function

    π∗CE = argmaxπ CrossEntropy(Uniform‖dπ)

    The CrossEntropy objective encourages the ’most uniform’ random walk over the MDP.

    Lemma. Given some state s, let Pr∗(s) be the highest probability of being in s over all policies:

    Pr∗(s) = maxπ dπ(s).

    For all states s, we have that:

    dπ∗CE(s) ≥ Pr∗(s) / (4S)


  • Optimization Challenges

    maxπ R(dπ)

    Can we optimize this efficiently?

    With an approximate planning oracle, where we feed in ’intrinsic’ reward functions? With a known model? With only simulation-based access?


  • Optimization Landscape

    Lemma. R(dπ) is not concave in the policy π, even if R(d) is concave in d.


  • Mixtures of policies

    The optimal policy may be stochastic. Define a mixture policy:

    policies: C = (π1, . . . , πk); mixing weights: (α1, . . . , αk); πmix = (α, C): at t = 0, sample a policy πi (with probability αi) and use it onwards.

    The induced state distribution is:

    dπmix(s) = ∑i αi dπi(s).
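
    A short numerical sketch (with made-up numbers) of the two sides of this identity: the mixture’s state distribution is the α-weighted combination of the base distributions, and executing πmix means drawing one base policy at t = 0 and following it for the whole trajectory.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical state distributions of k = 2 base policies
# (e.g. computed as in the d_pi sketch earlier).
d_base = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.3, 0.6]])     # d_base[i] = d_{pi_i}
alpha = np.array([0.25, 0.75])           # mixing weights

# Induced distribution of the mixture: d_pimix(s) = sum_i alpha_i * d_{pi_i}(s).
print("d_pimix =", alpha @ d_base)

# Executing pi_mix: sample one base policy at t = 0, then use it onwards.
i = rng.choice(len(alpha), p=alpha)
print("this episode follows base policy", i)
```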


  • Optimization with Approximate Oracles

    For t = 0, . . . , T − 1 do:

    1. Set d̂πmix = DensityEst(πmix, εest).

    2. Set the intrinsic reward function as

       rt(s) := ∂R(d1, d2, . . . , dS) / ∂ds evaluated at ds = d̂πmix(s)  ( = 1 / d̂πmix(s) for CrossEntropy ).

    3. Set πt+1 = ApproxPlan(rt, εplan).

    4. Update πmix ← (αt+1, Ct+1) using “learning rate” η:

       Ct+1 = (π0, π1, . . . , πt+1),   αt+1 = ((1 − η)αt, η).
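
    Below is one way this loop could look in code (a sketch under simplifying assumptions, not the authors’ implementation): the MDP is a small random tabular one, DensityEst is replaced by an exact computation of dπmix, ApproxPlan by value iteration on the intrinsic reward, and the reward 1/d̂πmix(s) matches the CrossEntropy case shown in step 2.

```python
import numpy as np

rng = np.random.default_rng(4)
S, A, gamma, eps = 6, 2, 0.9, 1e-6
d0 = np.ones(S) / S
P = rng.dirichlet(np.ones(S), size=(S, A))          # P[s, a] = distribution over next states

def d_pi(pi):
    """Exact discounted state distribution of a stationary policy (a stand-in for DensityEst)."""
    P_pi = np.einsum("sa,sat->st", pi, P)
    return np.linalg.solve(np.eye(S) - gamma * P_pi.T, (1 - gamma) * d0)

def approx_plan(reward, iters=200):
    """Value iteration on the intrinsic reward (a stand-in for ApproxPlan)."""
    V = np.zeros(S)
    for _ in range(iters):
        V = (reward[:, None] + gamma * P @ V).max(axis=1)
    greedy = (reward[:, None] + gamma * P @ V).argmax(axis=1)
    return np.eye(A)[greedy]                         # deterministic policy as an (S, A) matrix

T, eta = 50, 0.1
policies = [np.full((S, A), 1.0 / A)]                # C_0: start from the uniformly random policy
alphas = np.array([1.0])

for t in range(T):
    d_hat = sum(a * d_pi(p) for a, p in zip(alphas, policies))   # step 1: density of pi_mix
    reward = 1.0 / np.maximum(d_hat, eps)            # step 2: dR/dd_s at d_hat (CrossEntropy case)
    policies.append(approx_plan(reward))             # step 3: plan against the intrinsic reward
    alphas = np.append((1 - eta) * alphas, eta)      # step 4: (alpha_{t+1}, C_{t+1})

d_final = sum(a * d_pi(p) for a, p in zip(alphas, policies))
print("final d_pimix:", np.round(d_final, 3), " entropy:", -np.sum(d_final * np.log(d_final)))
```

    The planner here returns a stationary deterministic policy, which suffices for maximizing a fixed intrinsic reward in a tabular discounted MDP; the slides allow general non-stationary policies.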


  • Theorem 1: Computation

    Theorem. Assume:

    R(·) is concave and β-smooth, i.e. ‖∇²R‖ ≤ β; DensityEst has accuracy εest ≤ ε/β in ℓ∞; ApproxPlan has accuracy εplan ≤ ε in ℓ∞.

    If T ≥ 10β/ε, then

    R(dπmix) ≥ maxπ R(dπ) − ε.


  • Proof idea: the conditional gradient method

    Problem: for a convex set K and concave R(·), solve maxx∈K R(x). Linear optimization oracle: suppose we can compute maxy∈K ⟨y, ∇R(x)⟩. Guarantee: after T rounds, xT is a T⁻¹-approximate maximizer and a convex combination of T vertices of K.

    Choose K to be the convex set of state distributions.

    dπmix ← (1 − η) · dπmix + η · argmaxy∈K ⟨y, ∇R(dπmix)⟩
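
    For intuition, here is the generic conditional gradient (Frank-Wolfe) recipe on the simplest possible K, the probability simplex, maximizing the entropy (my example): the linear oracle returns a vertex, i.e. a point mass on the coordinate with the largest gradient, just as the update above mixes in the state distribution of a single new policy.

```python
import numpy as np

def frank_wolfe_entropy(n=5, T=200):
    """Conditional gradient ascent of R(x) = -sum_s x_s log x_s over the probability simplex."""
    x = np.eye(n)[0]                                   # start at a vertex of K
    for t in range(1, T + 1):
        grad = -np.log(np.maximum(x, 1e-12)) - 1.0     # gradient of the entropy at x
        vertex = np.eye(n)[np.argmax(grad)]            # linear oracle: argmax_{y in K} <y, grad>
        eta = 2.0 / (t + 1)                            # standard Frank-Wolfe step size
        x = (1 - eta) * x + eta * vertex               # same form as the d_pimix update above
    return x

print(np.round(frank_wolfe_entropy(), 3))              # close to uniform, the entropy maximizer
```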


  • Theorem 2: unknown MDP case

    Theorem. In an episodic setting, there is an algorithm that uses O(S²A · poly(β, ε⁻¹)) samples and returns a policy πmix such that:

    R(dπmix) ≥ maxπ R(dπ) − ε.


  • Experiments: CrossEntropy


  • Thanks!

    Provably efficient methods to optimize ’task agnostic’ reward functions. Lots of open questions!

    A means to think about exploration from an optimization perspective? Interplay with local-search-based methods?

    Collaborators:

    Elad Hazan Karan Singh Abby Van Soest

