
University of Waterloo

The Option-Critic Architecture

Authors: Pierre-Luc Bacon, Jean Harb, Doina Precup
Speaker: Zebin KANG

June 26, 2018


1

Content

Background
- Research Problem
- Markov Decision Process (MDP)
- Policy Gradient Methods
- The Options Framework

Learning Options
- Option-value Function
- Intra-Option Policy Gradient Theorem (Theorem 1)
- Termination Gradient Theorem (Theorem 2)
- Architecture and Algorithm

Experiments
- Four-rooms Domain
- Pinball Domain
- Arcade Learning Environment

Conclusion


2

Background: Research Problem

Figure 1: Finding subgoals in the four-rooms domain and learning policies to achieve these subgoals.


3

Background: Markov Decision Process (MDP)

- $\mathcal{S}$: a set of states
- $\mathcal{A}$: a set of actions
- $P$: a transition function, mapping $\mathcal{S} \times \mathcal{A}$ to $(\mathcal{S} \to [0,1])$
- $r$: a reward function, mapping $\mathcal{S} \times \mathcal{A}$ to $\mathbb{R}$
- $\pi$: a policy, the probability distribution over actions conditioned on states, i.e. $\pi : \mathcal{S} \times \mathcal{A} \to [0,1]$
- $V_\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = s\right]$: the value function of a policy $\pi$
- $Q_\pi(s,a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = s, a_0 = a\right]$: the action-value function of a policy $\pi$
- $\rho(\theta, s_0) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0\right]$: the discounted return with respect to a specific start state $s_0$
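To make $V_\pi$ and $Q_\pi$ concrete, the sketch below runs iterative policy evaluation on a small tabular MDP; the arrays `P`, `R`, `pi` and the discount value are made-up placeholders for illustration, not quantities from the slides.

```python
import numpy as np

# Minimal sketch of V_pi and Q_pi on a made-up tabular MDP (placeholder data).
n_states, n_actions, gamma = 4, 2, 0.99
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s'] = P(s'|s,a)
R = rng.normal(size=(n_states, n_actions))                        # R[s, a] = r(s, a)
pi = np.full((n_states, n_actions), 1.0 / n_actions)              # uniform policy pi(a|s)

# Iterative policy evaluation: repeatedly apply the Bellman expectation backup.
V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * P @ V           # Q_pi(s, a) given the current estimate of V_pi
    V = (pi * Q).sum(axis=1)        # V_pi(s) = sum_a pi(a|s) Q_pi(s, a)
```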


4

Background: Policy Gradient Methods

Policy Gradient Theorem [2]
Uses stochastic gradient descent to optimize a performance objective over a given family of parametrized stochastic policies $\pi_\theta$:

$$\frac{\partial \rho(\theta, s_0)}{\partial \theta} = \sum_s \mu_{\pi_\theta}(s \mid s_0) \sum_a \frac{\partial \pi_\theta(a \mid s)}{\partial \theta} Q_{\pi_\theta}(s, a)$$

where $\mu_{\pi_\theta}(s \mid s_0) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid s_0, \pi)$ is a discounted weighting of states along the trajectories starting from $s_0$, and $Q_{\pi_\theta}(s, a) = \mathbb{E}\left[\sum_{k=1}^{\infty} \gamma^{k-1} r_{t+k} \mid s_t = s, a_t = a, \pi\right]$ is the action-value function under the policy.
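In practice this gradient is estimated from sampled trajectories with the likelihood-ratio trick (REINFORCE). The sketch below assumes a tabular softmax policy and a hypothetical `env` object with `reset()`/`step(a)`; it illustrates the update direction only and is not code from the paper.

```python
import numpy as np

def softmax_policy(theta, s):
    # pi_theta(.|s) for a tabular softmax parametrization theta[s, a].
    prefs = theta[s] - theta[s].max()
    p = np.exp(prefs)
    return p / p.sum()

def reinforce_episode(env, theta, gamma=0.99, alpha=0.01):
    # Run one episode, then move theta along grad log pi(a|s) times the
    # return-to-go, a sample-based estimate of the policy gradient above.
    s, done, traj = env.reset(), False, []
    while not done:
        p = softmax_policy(theta, s)
        a = np.random.choice(len(p), p=p)
        s_next, r, done = env.step(a)
        traj.append((s, a, r))
        s = s_next
    G = 0.0
    for s, a, r in reversed(traj):
        G = r + gamma * G
        grad_log = -softmax_policy(theta, s)
        grad_log[a] += 1.0              # d log pi(a|s) / d theta[s, .] for a softmax
        theta[s] += alpha * G * grad_log
    return theta
```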


5

Background: The Options Framework

A Markovian option: $\omega = (\mathcal{I}_\omega, \pi_\omega, \beta_\omega)$

- $\Omega$: the set of all options, with $\omega \in \Omega$
- $\mathcal{I}_\omega \subseteq \mathcal{S}$: an initiation set
- $\pi_\omega$: an intra-option policy, mapping $\mathcal{S} \times \mathcal{A}$ to $[0,1]$
- $\beta_\omega$: a termination function, mapping $\mathcal{S}$ to $[0,1]$
- $\pi_{\omega,\theta}$: the intra-option policy of $\omega$ parametrized by $\theta$
- $\beta_{\omega,\vartheta}$: the termination function of $\omega$ parametrized by $\vartheta$
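The tuple above maps naturally onto a small data structure; the container below is purely illustrative and assumes hashable integer states and actions.

```python
from dataclasses import dataclass
from typing import Callable, Set

# Illustrative container for a Markovian option (I_omega, pi_omega, beta_omega).
@dataclass
class Option:
    initiation_set: Set[int]                    # I_omega, a subset of S
    intra_option_policy: Callable[[int], int]   # pi_omega: samples an action given s
    termination: Callable[[int], float]         # beta_omega(s) in [0, 1]

    def can_initiate(self, s: int) -> bool:
        return s in self.initiation_set
```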


6

Learning Options: Option-value Function

The option-value function can be defined as:

$$Q_\Omega(s, \omega) = \sum_a \pi_{\omega,\theta}(a \mid s) \, Q_U(s, \omega, a)$$

where $Q_U$ is the option-action-value function:

$$Q_U(s, \omega, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \, U(\omega, s')$$

The function $U$ is the option-value function upon arrival:

$$U(\omega, s') = \left(1 - \beta_{\omega,\vartheta}(s')\right) Q_\Omega(s', \omega) + \beta_{\omega,\vartheta}(s') \, V_\Omega(s')$$
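These three definitions translate directly into tabular computations. The sketch below is an illustration only: all arrays (`pi_w`, `beta_w`, `q_omega`, `mu` for the policy over options, `P`, `R`) are hypothetical estimates, not part of the slides.

```python
import numpy as np

# Tabular illustration of Q_Omega, Q_U and U. Assumed shapes:
# pi_w[w, s, a], beta_w[w, s], q_omega[s, w], mu[s, w], P[s, a, s'], R[s, a].
def v_omega(q_omega, mu):
    # V_Omega(s) = sum_w mu(w|s) Q_Omega(s, w)
    return (mu * q_omega).sum(axis=1)

def u(q_omega, beta_w, mu, w):
    # Option-value upon arrival for option w, evaluated at every next state s'.
    return (1 - beta_w[w]) * q_omega[:, w] + beta_w[w] * v_omega(q_omega, mu)

def q_u(R, P, gamma, q_omega, beta_w, mu, w):
    # Q_U(s, w, a) = r(s, a) + gamma * sum_s' P(s'|s, a) U(w, s')
    return R + gamma * P @ u(q_omega, beta_w, mu, w)

def q_omega_of(pi_w, R, P, gamma, q_omega, beta_w, mu, w):
    # Q_Omega(s, w) = sum_a pi_{w,theta}(a|s) Q_U(s, w, a)
    return (pi_w[w] * q_u(R, P, gamma, q_omega, beta_w, mu, w)).sum(axis=1)
```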


7

Learning Options: Intra-Option Policy Gradient Theorem (Theorem 1)

Intra-Option Policy Gradient Theorem (Theorem 1)
Given a set of Markov options with stochastic intra-option policies differentiable in their parameters $\theta$, the gradient of the option-value function with respect to $\theta$ and initial condition $(s_0, \omega_0)$ is:

$$\frac{\partial Q_\Omega(s_0, \omega_0)}{\partial \theta} = \sum_{s,\omega} \mu_\Omega(s, \omega \mid s_0, \omega_0) \sum_a \frac{\partial \pi_{\omega,\theta}(a \mid s)}{\partial \theta} Q_U(s, \omega, a)$$

where $\mu_\Omega(s, \omega \mid s_0, \omega_0)$ is a discounted weighting of state-option pairs along trajectories starting from $(s_0, \omega_0)$:

$$\mu_\Omega(s, \omega \mid s_0, \omega_0) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s, \omega_t = \omega \mid s_0, \omega_0)$$
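Sampled one transition at a time, the theorem says to push up the log-probability of the taken action in proportion to the critic's estimate of $Q_U(s, \omega, a)$. The step below assumes a tabular softmax intra-option policy; the parameter array `theta`, the critic estimate `q_u` and the step size are illustrative placeholders.

```python
import numpy as np

# One sampled intra-option policy-gradient step for theta[w, s, a].
def intra_option_pg_step(theta, w, s, a, q_u, alpha=0.01):
    prefs = theta[w, s] - theta[w, s].max()
    pi = np.exp(prefs) / np.exp(prefs).sum()
    grad_log = -pi
    grad_log[a] += 1.0                  # d log pi_{w,theta}(a|s) / d theta[w, s, .] for a softmax
    theta[w, s] += alpha * q_u * grad_log
    return theta
```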


8

Learning Options: Termination Gradient Theorem (Theorem 2)

Termination Gradient Theorem (Theorem 2)
Given a set of Markov options with stochastic termination functions differentiable in their parameters $\vartheta$, the gradient of the option-value function upon arrival with respect to $\vartheta$ and the initial condition $(s_1, \omega_0)$ is:

$$\frac{\partial U(\omega_0, s_1)}{\partial \vartheta} = -\sum_{s',\omega} \mu_\Omega(s', \omega \mid s_1, \omega_0) \frac{\partial \beta_{\omega,\vartheta}(s')}{\partial \vartheta} A_\Omega(s', \omega)$$

where $\mu_\Omega(s', \omega \mid s_1, \omega_0)$ is a discounted weighting of state-option pairs along trajectories from $(s_1, \omega_0)$:

$$\mu_\Omega(s', \omega \mid s_1, \omega_0) = \sum_{t=0}^{\infty} \gamma^t P(s_{t+1} = s', \omega_t = \omega \mid s_1, \omega_0)$$

and $A_\Omega(s', \omega) = Q_\Omega(s', \omega) - V_\Omega(s')$ is the advantage function [5].
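The sign of the advantage drives the update: when the current option is no better than the policy over options ($A_\Omega \le 0$), its termination probability is pushed up. The step below assumes a sigmoid termination function with one parameter per (option, state) entry; `vartheta`, the advantage estimate `adv` and the step size are illustrative placeholders.

```python
import numpy as np

# One sampled termination-gradient step for beta_{w,vartheta}(s') = sigmoid(vartheta[w, s']).
def termination_step(vartheta, w, s_next, adv, alpha=0.01):
    beta = 1.0 / (1.0 + np.exp(-vartheta[w, s_next]))
    grad_beta = beta * (1.0 - beta)                   # d beta / d vartheta[w, s'] for a sigmoid
    vartheta[w, s_next] -= alpha * grad_beta * adv    # ascent on U, per the minus sign in Theorem 2
    return vartheta
```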


9

Learning Options: Architecture and Algorithm

Figure 2: Diagram of the option-critic architecture.
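The architecture pairs an actor (the intra-option policies and termination functions) with a critic that evaluates option values. Below is a rough sketch of a training loop with that structure; `env`, `critic`, `policy_over_options`, `sample_action` and `terminates` are assumed interfaces, and the actor updates reuse the `intra_option_pg_step` and `termination_step` sketches above. It illustrates the control flow only and is not the authors' implementation.

```python
# Sketch of an option-critic training loop over one episode (assumed interfaces).
def option_critic_episode(env, theta, vartheta, critic, policy_over_options):
    s = env.reset()
    w = policy_over_options(s)                    # pick an option, e.g. epsilon-greedy on Q_Omega
    done = False
    while not done:
        a = sample_action(theta, w, s)            # a ~ pi_{w,theta}(.|s)
        s_next, r, done = env.step(a)

        critic.update(s, w, a, r, s_next, done)   # evaluate Q_U / Q_Omega (critic step)

        q_u = critic.q_u(s, w, a)                 # Theorem 1: intra-option policy update
        theta = intra_option_pg_step(theta, w, s, a, q_u)

        adv = critic.q_omega(s_next, w) - critic.v_omega(s_next)
        vartheta = termination_step(vartheta, w, s_next, adv)   # Theorem 2

        if terminates(vartheta, w, s_next):       # sample from beta_{w,vartheta}(s')
            w = policy_over_options(s_next)
        s = s_next
```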


10

Experiments: Four-rooms Domain

Figure 3: After 1000 episodes, the goal location in the four-rooms domain is moved randomly. Option-critic (“OC”) recovers faster than the primitive actor-critic (“AC-PG”) and SARSA(0). Each line is averaged over 350 runs.


11

Experiments: Four-rooms Domain

Figure 4: Termination probabilities for the option-critic agent learning with 4 options. The darkest color represents the walls in the environment while lighter colors encode higher termination probabilities.


12

Experiments: Pinball Domain

Figure 5: Pinball: Sample trajectory of the solution found after 250 episodes of training using 4 options. All options (color-coded) are used by the policy over options in successful trajectories. The initial state is in the top left corner and the goal is in the bottom right one (red circle).


13

Experiments: Pinball Domain

Figure 6: Learning curves in the Pinball domain.


14

Experiments: Arcade Learning Environment

Figure 7: Extended deep neural network architecture [8]. A concatenation of the last 4 images is fed through the convolutional layers, producing a dense representation shared across intra-option policies, termination functions and the policy over options.
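The shared-trunk idea in Figure 7 can be sketched as one convolutional body with three heads; the layer sizes and the PyTorch framing below are illustrative placeholders, not the exact architecture from the paper.

```python
import torch
import torch.nn as nn

# Illustrative shared-representation network: one convolutional trunk, separate
# heads for Q_Omega (policy over options), terminations and intra-option policies.
class OptionCriticNet(nn.Module):
    def __init__(self, n_options=8, n_actions=18):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.LazyLinear(512), nn.ReLU(),
        )
        self.q_omega = nn.Linear(512, n_options)                    # Q_Omega(s, .)
        self.terminations = nn.Linear(512, n_options)               # beta_w(s) after a sigmoid
        self.intra_option = nn.Linear(512, n_options * n_actions)   # logits of pi_w(.|s)
        self.n_options, self.n_actions = n_options, n_actions

    def forward(self, frames):
        # frames: a batch of stacked observations, e.g. shape (B, 4, 84, 84).
        h = self.trunk(frames)
        pi_logits = self.intra_option(h).view(-1, self.n_options, self.n_actions)
        return self.q_omega(h), torch.sigmoid(self.terminations(h)), pi_logits
```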


15

Experiments: Arcade Learning Environment

Figure 8: Seaquest: Using a baseline in the gradient estimators improves the distribution over actions in the intra-option policies, making them less deterministic. Each column represents one of the options learned in Seaquest. The vertical axis spans the 18 primitive actions of ALE. The empirical action frequencies are coded by intensity.


16

Experiments: Arcade Learning Environment

Figure 9: Learning curves in the Arcade Learning Environment. The same set of parameters was used across all four games: 8 options, 0.01 termination regularization, 0.01 entropy regularization, and a baseline for the intra-option policy gradients.


17

Experiments: Arcade Learning Environment

Figure 10: Up/down specialization in the solution found by option-critic when learning with 2 options in Seaquest. The top bar shows a trajectory in the game, with “white” representing a segment during which option 1 was active and “black” for option 2.


18

Conclusion

- Proves the "Intra-Option Policy Gradient Theorem" and the "Termination Gradient Theorem"
- Proposes the option-critic architecture and algorithm
- Verifies the option-critic architecture with experiments in various domains


19

References

[1] Bacon, P. L., Harb, J., & Precup, D. (2017). The Option-Critic Architecture. In AAAI (pp. 1726-1734).

[2] Sutton, R. S., McAllester, D. A., Singh, S. P., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (pp. 1057-1063).

[3] Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2), 181-211.

[4] Sutton, R. S. (1984). Temporal credit assignment in reinforcement learning.

[5] Baird III, L. C. (1993). Advantage updating (No. WL-TR-93-1146). Wright Laboratory, Wright-Patterson AFB, OH.

[6] Mann, T., Mankowitz, D., & Mannor, S. (2014). Time-regularized interrupting options (TRIO). In International Conference on Machine Learning (pp. 1350-1358).

[7] Konda, V. R., & Tsitsiklis, J. N. (2000). Actor-critic algorithms. In Advances in Neural Information Processing Systems (pp. 1008-1014).

[8] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.