
University of Waterloo

The Option-Critic Architecture

Authors: Pierre-Luc Bacon, Jean Harb, Doina Precup
Speaker: Zebin KANG

June 26, 2018


1

Content

Background
- Research Problem
- Markov Decision Process (MDP)
- Policy Gradient Methods
- The Options Framework

Learning Options
- Option-value Function
- Intra-Option Policy Gradient Theorem (Theorem 1)
- Termination Gradient Theorem (Theorem 2)
- Architecture and Algorithm

Experiments
- Four-rooms Domain
- Pinball Domain
- Arcade Learning Environment

Conclusion


2

Background: Research Problem

Figure 1: Finding subgoals in the four-rooms domain and learning policies to achieve these subgoals.


3

Background: Markov Decision Process (MDP)

- $\mathcal{S}$: a set of states
- $\mathcal{A}$: a set of actions
- $P$: a transition function, mapping $\mathcal{S} \times \mathcal{A}$ to $(\mathcal{S} \to [0,1])$
- $r$: a reward function, mapping $\mathcal{S} \times \mathcal{A}$ to $\mathbb{R}$
- $\pi$: a policy, the probability distribution over actions conditioned on states, i.e. $\pi : \mathcal{S} \times \mathcal{A} \to [0,1]$
- $V_\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = s\right]$: the value function of a policy $\pi$
- $Q_\pi(s,a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = s, a_0 = a\right]$: the action-value function of a policy $\pi$
- $\rho(\theta, s_0) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0\right]$: the discounted return with respect to a specific start state $s_0$
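To make $V_\pi$ and $Q_\pi$ concrete, the sketch below runs iterative policy evaluation on a small tabular MDP; the arrays `P`, `R`, `pi` and the discount value are made-up placeholders for illustration, not quantities from the slides.

```python
import numpy as np

# Minimal sketch of V_pi and Q_pi on a made-up tabular MDP (placeholder data).
n_states, n_actions, gamma = 4, 2, 0.99
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s'] = P(s'|s,a)
R = rng.normal(size=(n_states, n_actions))                        # R[s, a] = r(s, a)
pi = np.full((n_states, n_actions), 1.0 / n_actions)              # uniform policy pi(a|s)

# Iterative policy evaluation: repeatedly apply the Bellman expectation backup.
V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * P @ V           # Q_pi(s, a) given the current estimate of V_pi
    V = (pi * Q).sum(axis=1)        # V_pi(s) = sum_a pi(a|s) Q_pi(s, a)
```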


4

Background: Policy Gradient Methods

Policy Gradient Theorem [2]
Uses stochastic gradient descent to optimize a performance objective over a given family of parametrized stochastic policies $\pi_\theta$:

$$\frac{\partial \rho(\theta, s_0)}{\partial \theta} = \sum_s \mu_{\pi_\theta}(s \mid s_0) \sum_a \frac{\partial \pi_\theta(a \mid s)}{\partial \theta} Q_{\pi_\theta}(s, a)$$

where $\mu_{\pi_\theta}(s \mid s_0) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid s_0, \pi)$ is a discounted weighting of states along the trajectories starting from $s_0$, and $Q_{\pi_\theta}(s, a) = \mathbb{E}\left[\sum_{k=1}^{\infty} \gamma^{k-1} r_{t+k} \mid s_t = s, a_t = a, \pi\right]$ is the action-value function under the policy.
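In practice this gradient is estimated from sampled trajectories with the likelihood-ratio trick (REINFORCE). The sketch below assumes a tabular softmax policy and a hypothetical `env` object with `reset()`/`step(a)`; it illustrates the update direction only and is not code from the paper.

```python
import numpy as np

def softmax_policy(theta, s):
    # pi_theta(.|s) for a tabular softmax parametrization theta[s, a].
    prefs = theta[s] - theta[s].max()
    p = np.exp(prefs)
    return p / p.sum()

def reinforce_episode(env, theta, gamma=0.99, alpha=0.01):
    # Run one episode, then move theta along grad log pi(a|s) times the
    # return-to-go, a sample-based estimate of the policy gradient above.
    s, done, traj = env.reset(), False, []
    while not done:
        p = softmax_policy(theta, s)
        a = np.random.choice(len(p), p=p)
        s_next, r, done = env.step(a)
        traj.append((s, a, r))
        s = s_next
    G = 0.0
    for s, a, r in reversed(traj):
        G = r + gamma * G
        grad_log = -softmax_policy(theta, s)
        grad_log[a] += 1.0              # d log pi(a|s) / d theta[s, .] for a softmax
        theta[s] += alpha * G * grad_log
    return theta
```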


5

Background: The Options Framework

A Markovian option: $\omega = (\mathcal{I}_\omega, \pi_\omega, \beta_\omega)$

- $\Omega$: the set of all options, with $\omega \in \Omega$
- $\mathcal{I}_\omega \subseteq \mathcal{S}$: an initiation set
- $\pi_\omega$: an intra-option policy, mapping $\mathcal{S} \times \mathcal{A}$ to $[0,1]$
- $\beta_\omega$: a termination function, mapping $\mathcal{S}$ to $[0,1]$
- $\pi_{\omega,\theta}$: the intra-option policy of $\omega$ parametrized by $\theta$
- $\beta_{\omega,\vartheta}$: the termination function of $\omega$ parametrized by $\vartheta$
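The tuple above maps naturally onto a small data structure; the container below is purely illustrative and assumes hashable integer states and actions.

```python
from dataclasses import dataclass
from typing import Callable, Set

# Illustrative container for a Markovian option (I_omega, pi_omega, beta_omega).
@dataclass
class Option:
    initiation_set: Set[int]                    # I_omega, a subset of S
    intra_option_policy: Callable[[int], int]   # pi_omega: samples an action given s
    termination: Callable[[int], float]         # beta_omega(s) in [0, 1]

    def can_initiate(self, s: int) -> bool:
        return s in self.initiation_set
```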


6

Learning Options: Option-value Function

The option-value function can be defined as:

$$Q_\Omega(s, \omega) = \sum_a \pi_{\omega,\theta}(a \mid s) \, Q_U(s, \omega, a)$$

where $Q_U$ is the option-action-value function:

$$Q_U(s, \omega, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, U(\omega, s')$$

The function $U$ is the option-value function upon arrival:

$$U(\omega, s') = \left(1 - \beta_{\omega,\vartheta}(s')\right) Q_\Omega(s', \omega) + \beta_{\omega,\vartheta}(s') \, V_\Omega(s')$$
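These three definitions translate directly into tabular computations. The sketch below is an illustration only: all arrays (`pi_w`, `beta_w`, `q_omega`, `mu` for the policy over options, `P`, `R`) are hypothetical estimates, not part of the slides.

```python
import numpy as np

# Tabular illustration of Q_Omega, Q_U and U. Assumed shapes:
# pi_w[w, s, a], beta_w[w, s], q_omega[s, w], mu[s, w], P[s, a, s'], R[s, a].
def v_omega(q_omega, mu):
    # V_Omega(s) = sum_w mu(w|s) Q_Omega(s, w)
    return (mu * q_omega).sum(axis=1)

def u(q_omega, beta_w, mu, w):
    # Option-value upon arrival for option w, evaluated at every next state s'.
    return (1 - beta_w[w]) * q_omega[:, w] + beta_w[w] * v_omega(q_omega, mu)

def q_u(R, P, gamma, q_omega, beta_w, mu, w):
    # Q_U(s, w, a) = r(s, a) + gamma * sum_s' P(s'|s, a) U(w, s')
    return R + gamma * P @ u(q_omega, beta_w, mu, w)

def q_omega_of(pi_w, R, P, gamma, q_omega, beta_w, mu, w):
    # Q_Omega(s, w) = sum_a pi_{w,theta}(a|s) Q_U(s, w, a)
    return (pi_w[w] * q_u(R, P, gamma, q_omega, beta_w, mu, w)).sum(axis=1)
```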


7

Learning Options: Intra-Option Policy Gradient Theorem (Theorem 1)

Intra-Option Policy Gradient Theorem (Theorem 1)
Given a set of Markov options with stochastic intra-option policies differentiable in their parameters $\theta$, the gradient of the option-value function with respect to $\theta$ and initial condition $(s_0, \omega_0)$ is:

$$\frac{\partial Q_\Omega(s_0, \omega_0)}{\partial \theta} = \sum_{s,\omega} \mu_\Omega(s, \omega \mid s_0, \omega_0) \sum_a \frac{\partial \pi_{\omega,\theta}(a \mid s)}{\partial \theta} Q_U(s, \omega, a)$$

where $\mu_\Omega(s, \omega \mid s_0, \omega_0)$ is a discounted weighting of state-option pairs along trajectories starting from $(s_0, \omega_0)$:

$$\mu_\Omega(s, \omega \mid s_0, \omega_0) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s, \omega_t = \omega \mid s_0, \omega_0)$$
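Sampled one transition at a time, the theorem says to push up the log-probability of the taken action in proportion to the critic's estimate of $Q_U(s, \omega, a)$. The step below assumes a tabular softmax intra-option policy; the parameter array `theta`, the critic estimate `q_u` and the step size are illustrative placeholders.

```python
import numpy as np

# One sampled intra-option policy-gradient step for theta[w, s, a].
def intra_option_pg_step(theta, w, s, a, q_u, alpha=0.01):
    prefs = theta[w, s] - theta[w, s].max()
    pi = np.exp(prefs) / np.exp(prefs).sum()
    grad_log = -pi
    grad_log[a] += 1.0                  # d log pi_{w,theta}(a|s) / d theta[w, s, .] for a softmax
    theta[w, s] += alpha * q_u * grad_log
    return theta
```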


8

Learning Options: Termination Gradient Theorem (Theorem 2)

Termination Gradient Theorem (Theorem 2)
Given a set of Markov options with stochastic termination functions differentiable in their parameters $\vartheta$, the gradient of the option-value function upon arrival with respect to $\vartheta$ and the initial condition $(s_1, \omega_0)$ is:

$$\frac{\partial U(\omega_0, s_1)}{\partial \vartheta} = -\sum_{s',\omega} \mu_\Omega(s', \omega \mid s_1, \omega_0) \frac{\partial \beta_{\omega,\vartheta}(s')}{\partial \vartheta} A_\Omega(s', \omega)$$

where $\mu_\Omega(s', \omega \mid s_1, \omega_0)$ is a discounted weighting of state-option pairs along trajectories from $(s_1, \omega_0)$:

$$\mu_\Omega(s', \omega \mid s_1, \omega_0) = \sum_{t=0}^{\infty} \gamma^t P(s_{t+1} = s', \omega_t = \omega \mid s_1, \omega_0)$$

and $A_\Omega(s', \omega) = Q_\Omega(s', \omega) - V_\Omega(s')$ is the advantage function [5].
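The sign of the advantage drives the update: when the current option is no better than the policy over options ($A_\Omega \le 0$), its termination probability is pushed up. The step below assumes a sigmoid termination function with one parameter per (option, state) entry; `vartheta`, the advantage estimate `adv` and the step size are illustrative placeholders.

```python
import numpy as np

# One sampled termination-gradient step for beta_{w,vartheta}(s') = sigmoid(vartheta[w, s']).
def termination_step(vartheta, w, s_next, adv, alpha=0.01):
    beta = 1.0 / (1.0 + np.exp(-vartheta[w, s_next]))
    grad_beta = beta * (1.0 - beta)                   # d beta / d vartheta[w, s'] for a sigmoid
    vartheta[w, s_next] -= alpha * grad_beta * adv    # ascent on U, per the minus sign in Theorem 2
    return vartheta
```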


9

Learning Options: Architecture and Algorithm

Figure 2: Diagram of the option-critic architecture.
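The architecture pairs an actor (the intra-option policies and termination functions) with a critic that evaluates option values. Below is a rough sketch of a training loop with that structure; `env`, `critic`, `policy_over_options`, `sample_action` and `terminates` are assumed interfaces, and the actor updates reuse the `intra_option_pg_step` and `termination_step` sketches above. It illustrates the control flow only and is not the authors' implementation.

```python
# Sketch of an option-critic training loop over one episode (assumed interfaces).
def option_critic_episode(env, theta, vartheta, critic, policy_over_options):
    s = env.reset()
    w = policy_over_options(s)                    # pick an option, e.g. epsilon-greedy on Q_Omega
    done = False
    while not done:
        a = sample_action(theta, w, s)            # a ~ pi_{w,theta}(.|s)
        s_next, r, done = env.step(a)

        critic.update(s, w, a, r, s_next, done)   # evaluate Q_U / Q_Omega (critic step)

        q_u = critic.q_u(s, w, a)                 # Theorem 1: intra-option policy update
        theta = intra_option_pg_step(theta, w, s, a, q_u)

        adv = critic.q_omega(s_next, w) - critic.v_omega(s_next)
        vartheta = termination_step(vartheta, w, s_next, adv)   # Theorem 2

        if terminates(vartheta, w, s_next):       # sample from beta_{w,vartheta}(s')
            w = policy_over_options(s_next)
        s = s_next
```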


10

Experiments: Four-rooms Domain

Figure 3: After 1000 episodes, the goal location in the four-rooms domain is moved randomly. Option-critic (“OC”) recovers faster than the primitive actor-critic (“AC-PG”) and SARSA(0). Each line is averaged over 350 runs.


11

Experiments: Four-rooms Domain

Figure 4: Termination probabilities for the option-critic agent learning with 4 options. The darkest color represents the walls in the environment while lighter colors encode higher termination probabilities.


12

Experiments: Pinball Domain

Figure 5: Pinball: Sample trajectory of the solution found after 250 episodes of training using 4 options. All options (color-coded) are used by the policy over options in successful trajectories. The initial state is in the top left corner and the goal is in the bottom right one (red circle).


13

Experiments: Pinball Domain

Figure 6: Learning curves in the Pinball domain.


14

Experiments: Arcade Learning Environment

Figure 7: Extended deep neural network architecture [8]. A concatenation of the last 4 images is fed through the convolutional layers, producing a dense representation shared across intra-option policies, termination functions and the policy over options.
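The shared-trunk idea in Figure 7 can be sketched as one convolutional body with three heads; the layer sizes and the PyTorch framing below are illustrative placeholders, not the exact architecture from the paper.

```python
import torch
import torch.nn as nn

# Illustrative shared-representation network: one convolutional trunk, separate
# heads for Q_Omega (policy over options), terminations and intra-option policies.
class OptionCriticNet(nn.Module):
    def __init__(self, n_options=8, n_actions=18):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(512), nn.ReLU(),
        )
        self.q_omega = nn.Linear(512, n_options)                    # Q_Omega(s, .)
        self.terminations = nn.Linear(512, n_options)               # beta_w(s) after a sigmoid
        self.intra_option = nn.Linear(512, n_options * n_actions)   # logits of pi_w(.|s)
        self.n_options, self.n_actions = n_options, n_actions

    def forward(self, frames):
        # frames: a batch of stacked observations, e.g. shape (B, 4, 84, 84).
        h = self.trunk(frames)
        pi_logits = self.intra_option(h).view(-1, self.n_options, self.n_actions)
        return self.q_omega(h), torch.sigmoid(self.terminations(h)), pi_logits
```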


15

Experiments: Arcade Learning Environment

Figure 8: Seaquest: Using a baseline in the gradient estimators improves the distribution over actions in the intra-option policies, making them less deterministic. Each column represents one of the options learned in Seaquest. The vertical axis spans the 18 primitive actions of ALE. The empirical action frequencies are coded by intensity.


16

Experiments: Arcade Learning Environment

Figure 9: Learning curves in the Arcade Learning Environment. The same set of parameters was used across all four games: 8 options, 0.01 termination regularization, 0.01 entropy regularization, and a baseline for the intra-option policy gradients.


17

Experiments: Arcade Learning Environment

Figure 10: Up/down specialization in the solution found by option-critic when learning with 2 options in Seaquest. The top bar shows a trajectory in the game, with “white” representing a segment during which option 1 was active and “black” for option 2.


18

Conclusion

- Proves the "Intra-Option Policy Gradient Theorem" and the "Termination Gradient Theorem"
- Proposes the option-critic architecture and algorithm
- Verifies the option-critic architecture with experiments in various domains


19

References

[1] Bacon, P. L., Harb, J., & Precup, D. (2017). The Option-Critic Architecture. In AAAI (pp. 1726-1734).

[2] Sutton, R. S., McAllester, D. A., Singh, S. P., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (pp. 1057-1063).

[3] Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2), 181-211.

[4] Sutton, R. S. (1984). Temporal credit assignment in reinforcement learning.

[5] Baird III, L. C. (1993). Advantage updating (No. WL-TR-93-1146). Wright Laboratory, Wright-Patterson AFB, OH.

[6] Mann, T., Mankowitz, D., & Mannor, S. (2014). Time-regularized interrupting options (TRIO). In International Conference on Machine Learning (pp. 1350-1358).

[7] Konda, V. R., & Tsitsiklis, J. N. (2000). Actor-critic algorithms. In Advances in Neural Information Processing Systems (pp. 1008-1014).

[8] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.