University of Waterloo
The Option-Critic Architecture
Authors: Pierre-Luc Bacon, Jean Harb, Doina Precup
Speaker: Zebin Kang
June 26, 2018
Content

- Background: Research Problem; Markov Decision Process (MDP); Policy Gradient Methods; The Options Framework
- Learning Options: Option-value Function; Intra-Option Policy Gradient Theorem (Theorem 1); Termination Gradient Theorem (Theorem 2); Architecture and Algorithm
- Experiments: Four-rooms Domain; Pinball Domain; Arcade Learning Environment
- Conclusion
Background: Research Problem

Figure 1: Finding subgoals in the four-rooms domain and learning policies to achieve these subgoals.
Background: Markov Decision Process (MDP)

- $S$: a set of states
- $A$: a set of actions
- $P$: a transition function, mapping $S \times A$ to $S \to [0,1]$
- $r$: a reward function, mapping $S \times A$ to $\mathbb{R}$
- $\pi$: a policy, the probability distribution over actions conditioned on states, i.e. $\pi : S \times A \to [0,1]$
- $V_\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = s\right]$: the value function of a policy $\pi$
- $Q_\pi(s,a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0 = s, a_0 = a\right]$: the action-value function of a policy $\pi$
- $\rho(\theta, s_0) = \mathbb{E}_{\pi_\theta}\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_0\right]$: the discounted return with respect to a specific start state $s_0$
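To make the value-function definition concrete, here is a minimal Monte Carlo sketch of estimating $V_\pi(s)$ on a toy two-state MDP; the transition probabilities, rewards, and policy below are invented for illustration and are not from the paper.

```python
import random

# Toy 2-state MDP, used only to make the definitions above concrete.
# States, actions, P(s'|s,a) and r(s,a) are invented for this example.
GAMMA = 0.9
P = {  # P[s][a] -> list of (next_state, probability) pairs
    0: {0: [(0, 0.7), (1, 0.3)], 1: [(1, 1.0)]},
    1: {0: [(0, 1.0)], 1: [(0, 0.5), (1, 0.5)]},
}
R = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}  # r(s, a)

def pi(s):
    """A fixed stochastic policy pi(a|s): uniform over both actions."""
    return random.choice([0, 1])

def sampled_return(s0, horizon=200):
    """One sample of the discounted return sum_t gamma^t r_{t+1} from s0."""
    s, g, discount = s0, 0.0, 1.0
    for _ in range(horizon):
        a = pi(s)
        g += discount * R[s][a]
        discount *= GAMMA
        states, probs = zip(*P[s][a])
        s = random.choices(states, weights=probs)[0]
    return g

# Monte Carlo estimate of V_pi(0) = E[sum_t gamma^t r_{t+1} | s_0 = 0]
print(sum(sampled_return(0) for _ in range(2000)) / 2000)
```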
Background: Policy Gradient Methods

Policy Gradient Theorem [2]: uses stochastic gradient descent to optimize a performance objective over a given family of parametrized stochastic policies $\pi_\theta$:

$$\frac{\partial \rho(\theta, s_0)}{\partial \theta} = \sum_s \mu_{\pi_\theta}(s \mid s_0) \sum_a \frac{\partial \pi_\theta(a \mid s)}{\partial \theta}\, Q_{\pi_\theta}(s, a)$$

where $\mu_{\pi_\theta}(s \mid s_0) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid s_0, \pi)$ is a discounted weighting of states along the trajectories starting from $s_0$, and $Q_{\pi_\theta}(s, a) = \mathbb{E}\left[\sum_{k=1}^{\infty} \gamma^{k-1} r_{t+k} \mid s_t = s, a_t = a, \pi\right]$ is the action-value function of the policy.
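As a concrete instance of this estimator, the sketch below performs one stochastic-gradient step for a tabular softmax policy; the score function $\partial \log \pi_\theta / \partial \theta$ replaces $\partial \pi_\theta / \partial \theta$ via the likelihood-ratio trick, and the critic estimate `q_hat` is a placeholder assumption, not the paper's critic.

```python
import numpy as np

# One stochastic-gradient step of the theorem above for a tabular softmax
# policy pi_theta(a|s) proportional to exp(theta[s, a]). Sampling (s_t, a_t)
# on-policy already weights states by mu_pi(s|s0), so no explicit sum is needed.
n_states, n_actions, alpha = 4, 2, 0.1
theta = np.zeros((n_states, n_actions))

def pi_theta(s):
    """Softmax action probabilities at state s."""
    prefs = np.exp(theta[s] - theta[s].max())
    return prefs / prefs.sum()

def grad_log_pi(s, a):
    """Score function d(log pi_theta(a|s))/d(theta) for the softmax case."""
    g = np.zeros_like(theta)
    g[s] = -pi_theta(s)
    g[s, a] += 1.0
    return g

def pg_step(s, a, q_hat, discount):
    """q_hat stands in for a learned critic's estimate of Q_{pi_theta}(s, a)."""
    global theta
    theta = theta + alpha * discount * grad_log_pi(s, a) * q_hat

pg_step(s=0, a=1, q_hat=2.5, discount=1.0)  # example call with dummy values
```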
Background: The Options Framework

A Markovian option: $\omega = (I_\omega, \pi_\omega, \beta_\omega)$

- $\Omega$: the set of all options, with $\omega \in \Omega$
- $I_\omega \subseteq S$: an initiation set
- $\pi_\omega$: an intra-option policy, mapping $S \times A$ to $[0,1]$
- $\beta_\omega$: a termination function, mapping $S$ to $[0,1]$
- $\pi_{\omega,\theta}$: an intra-option policy of $\omega$ parametrized by $\theta$
- $\beta_{\omega,\vartheta}$: a termination function of $\omega$ parametrized by $\vartheta$
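The option tuple translates directly into a small data structure. The concrete types below (integer states and actions, the policy represented as an action sampler) are illustrative assumptions, not the paper's implementation:

```python
from dataclasses import dataclass
from typing import Callable, Set

State, Action = int, int  # illustrative type aliases

@dataclass
class Option:
    """A Markovian option omega = (I_omega, pi_omega, beta_omega)."""
    initiation_set: Set[State]                      # I_omega, a subset of S
    intra_option_policy: Callable[[State], Action]  # pi_omega, here a sampler a ~ pi_omega(.|s)
    termination: Callable[[State], float]           # beta_omega: S -> [0, 1]

    def can_initiate(self, s: State) -> bool:
        return s in self.initiation_set

# Example: an option available in every state of a 10-state chain that always
# takes action 1 and terminates with probability 0.1 at each state.
go_right = Option(set(range(10)), lambda s: 1, lambda s: 0.1)
```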
Learning Options: Option-value Function

The option-value function can be defined as:

$$Q_\Omega(s, \omega) = \sum_a \pi_{\omega,\theta}(a \mid s)\, Q_U(s, \omega, a)$$

where $Q_U$ is the option-action-value function:

$$Q_U(s, \omega, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, U(\omega, s')$$

The function $U$ is the option-value function upon arrival:

$$U(\omega, s') = (1 - \beta_{\omega,\vartheta}(s'))\, Q_\Omega(s', \omega) + \beta_{\omega,\vartheta}(s')\, V_\Omega(s')$$
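These three mutually dependent quantities are straightforward to evaluate in the tabular case. A minimal sketch, assuming array shapes `q_omega[s, w]`, `v_omega[s]`, `beta[w, s]`, `pi[w, s, a]`, `P[s, a, s']`, and `r[s, a]` (conventions chosen here for illustration):

```python
import numpy as np

def u_value(q_omega, v_omega, beta, w, s_next):
    """U(w, s') = (1 - beta_w(s')) Q_Omega(s', w) + beta_w(s') V_Omega(s')."""
    b = beta[w, s_next]
    return (1 - b) * q_omega[s_next, w] + b * v_omega[s_next]

def q_u_value(r, P, gamma, q_omega, v_omega, beta, s, w, a):
    """Q_U(s, w, a) = r(s, a) + gamma * sum_{s'} P(s'|s, a) U(w, s')."""
    u = np.array([u_value(q_omega, v_omega, beta, w, sp)
                  for sp in range(P.shape[2])])
    return r[s, a] + gamma * P[s, a] @ u

def q_omega_value(pi, r, P, gamma, q_omega, v_omega, beta, s, w):
    """Q_Omega(s, w) = sum_a pi_{w,theta}(a|s) Q_U(s, w, a)."""
    return sum(pi[w, s, a] * q_u_value(r, P, gamma, q_omega, v_omega, beta, s, w, a)
               for a in range(pi.shape[2]))
```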
Learning Options: Intra-Option Policy Gradient Theorem (Theorem 1)

Intra-Option Policy Gradient Theorem (Theorem 1). Given a set of Markov options with stochastic intra-option policies differentiable in their parameters $\theta$, the gradient of the option-value function with respect to $\theta$ and initial condition $(s_0, \omega_0)$ is:

$$\frac{\partial Q_\Omega(s_0, \omega_0)}{\partial \theta} = \sum_{s,\omega} \mu_\Omega(s, \omega \mid s_0, \omega_0) \sum_a \frac{\partial \pi_{\omega,\theta}(a \mid s)}{\partial \theta}\, Q_U(s, \omega, a)$$

where $\mu_\Omega(s, \omega \mid s_0, \omega_0)$ is a discounted weighting of state-option pairs along trajectories starting from $(s_0, \omega_0)$:

$$\mu_\Omega(s, \omega \mid s_0, \omega_0) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s, \omega_t = \omega \mid s_0, \omega_0)$$
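In sampled form the theorem yields a simple update: trajectories drawn on-policy already weight state-option pairs by $\mu_\Omega$, so each visited $(s_t, \omega_t)$ contributes one gradient step. A minimal sketch for a tabular softmax intra-option policy (the parametrization is an assumption; the theorem only requires differentiability):

```python
import numpy as np

def intra_option_pg_step(theta, s, w, a, q_u_hat, lr=0.1):
    """One sampled step of Theorem 1 for a softmax policy theta[w, s, a].

    q_u_hat is a critic estimate of Q_U(s, w, a); using the score function
    d(log pi)/d(theta) * Q_U is the likelihood-ratio form of the gradient.
    """
    prefs = np.exp(theta[w, s] - theta[w, s].max())
    pi = prefs / prefs.sum()
    grad_log = -pi
    grad_log[a] += 1.0
    theta[w, s] += lr * grad_log * q_u_hat
    return theta
```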
Learning Options: Termination Gradient Theorem (Theorem 2)

Termination Gradient Theorem (Theorem 2). Given a set of Markov options with stochastic termination functions differentiable in their parameters $\vartheta$, the gradient of the option-value function upon arrival with respect to $\vartheta$ and the initial condition $(s_1, \omega_0)$ is:

$$\frac{\partial U(\omega_0, s_1)}{\partial \vartheta} = -\sum_{s',\omega} \mu_\Omega(s', \omega \mid s_1, \omega_0)\, \frac{\partial \beta_{\omega,\vartheta}(s')}{\partial \vartheta}\, A_\Omega(s', \omega)$$

where $\mu_\Omega(s', \omega \mid s_1, \omega_0)$ is a discounted weighting of state-option pairs along trajectories from $(s_1, \omega_0)$:

$$\mu_\Omega(s', \omega \mid s_1, \omega_0) = \sum_{t=0}^{\infty} \gamma^t P(s_{t+1} = s', \omega_t = \omega \mid s_1, \omega_0)$$

and $A_\Omega(s', \omega) = Q_\Omega(s', \omega) - V_\Omega(s')$ is the advantage function [5].
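The sampled update mirrors Theorem 1. A sketch assuming a sigmoid termination function $\beta_{\omega,\vartheta}(s') = \sigma(\vartheta[\omega, s'])$ (again an assumption; any differentiable parametrization works):

```python
import numpy as np

def termination_step(vartheta, s_next, w, advantage, lr=0.1):
    """One sampled step of Theorem 2 for beta = sigmoid(vartheta[w, s']).

    advantage is a critic estimate of A_Omega(s', w) = Q_Omega(s', w) - V_Omega(s').
    The leading minus sign means: when the current option still has positive
    advantage, its termination probability is pushed down (keep executing it),
    and pushed up when the advantage is negative.
    """
    beta = 1.0 / (1.0 + np.exp(-vartheta[w, s_next]))
    vartheta[w, s_next] -= lr * beta * (1.0 - beta) * advantage  # sigmoid grad
    return vartheta
```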
Learning Options: Architecture and Algorithm

Figure 2: Diagram of the option-critic architecture.
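Reading Figure 2 as pseudocode gives the control loop below. This is a minimal tabular sketch in which the environment interface (`env.reset()`, `env.step(a) -> (s_next, r, done)`), the epsilon-greedy policy over options, and the approximation $V_\Omega(s) \approx \max_\omega Q_\Omega(s, \omega)$ are assumptions for illustration, not the authors' exact implementation:

```python
import random
import numpy as np

def option_critic(env, n_options, n_states, n_actions,
                  gamma=0.99, eps=0.05, lr=0.1, episodes=100):
    theta = np.zeros((n_options, n_states, n_actions))  # intra-option policies
    vartheta = np.zeros((n_options, n_states))          # termination functions
    q_omega = np.zeros((n_states, n_options))           # critic: Q_Omega(s, w)
    q_u = np.zeros((n_states, n_options, n_actions))    # critic: Q_U(s, w, a)

    def choose_option(s):
        if random.random() < eps:
            return random.randrange(n_options)
        return int(np.argmax(q_omega[s]))

    for _ in range(episodes):
        s, done = env.reset(), False
        w = choose_option(s)
        while not done:
            # Sample a primitive action from the current intra-option policy.
            prefs = np.exp(theta[w, s] - theta[w, s].max())
            pi = prefs / prefs.sum()
            a = int(np.random.choice(n_actions, p=pi))
            s_next, r, done = env.step(a)

            # 1. Critic: one-step intra-option targets for Q_U and Q_Omega.
            beta = 1.0 / (1.0 + np.exp(-vartheta[w, s_next]))
            u = (1 - beta) * q_omega[s_next, w] + beta * q_omega[s_next].max()
            target = r if done else r + gamma * u
            q_u[s, w, a] += lr * (target - q_u[s, w, a])
            q_omega[s, w] += lr * (target - q_omega[s, w])

            # 2. Intra-option policy gradient step (Theorem 1).
            grad_log = -pi
            grad_log[a] += 1.0
            theta[w, s] += lr * grad_log * q_u[s, w, a]

            # 3. Termination gradient step (Theorem 2), with
            #    A_Omega(s', w) approximated via the max over options.
            advantage = q_omega[s_next, w] - q_omega[s_next].max()
            vartheta[w, s_next] -= lr * beta * (1 - beta) * advantage

            # 4. Possibly terminate the current option and pick a new one.
            if random.random() < beta:
                w = choose_option(s_next)
            s = s_next
    return theta, vartheta, q_omega
```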
Experiments: Four-rooms Domain

Figure 3: After 1000 episodes, the goal location in the four-rooms domain is moved randomly. Option-critic ("OC") recovers faster than the primitive actor-critic ("AC-PG") and SARSA(0). Each line is averaged over 350 runs.
Experiments: Four-rooms Domain

Figure 4: Termination probabilities for the option-critic agent learning with 4 options. The darkest color represents the walls in the environment while lighter colors encode higher termination probabilities.
Experiments: Pinball Domain

Figure 5: Pinball: sample trajectory of the solution found after 250 episodes of training using 4 options. All options (color-coded) are used by the policy over options in successful trajectories. The initial state is in the top left corner and the goal is in the bottom right one (red circle).
Experiments: Pinball Domain
Figure 6: Learning curves in the Pinball domain.
Experiments: Arcade Learning Environment

Figure 7: Extended deep neural network architecture [8]. A concatenation of the last 4 images is fed through the convolutional layers, producing a dense representation shared across the intra-option policies, termination functions, and policy over options.
Experiments: Arcade Learning Environment

Figure 8: Seaquest: using a baseline in the gradient estimators improves the distribution over actions in the intra-option policies, making them less deterministic. Each column represents one of the options learned in Seaquest. The vertical axis spans the 18 primitive actions of ALE. The empirical action frequencies are coded by intensity.
Experiments: Arcade Learning Environment

Figure 9: Learning curves in the Arcade Learning Environment. The same set of parameters was used across all four games: 8 options, 0.01 termination regularization, 0.01 entropy regularization, and a baseline for the intra-option policy gradients.
Experiments: Arcade Learning Environment

Figure 10: Up/down specialization in the solution found by option-critic when learning with 2 options in Seaquest. The top bar shows a trajectory in the game, with "white" representing a segment during which option 1 was active and "black" for option 2.
Conclusion
- Proves the Intra-Option Policy Gradient Theorem and the Termination Gradient Theorem
- Proposes the option-critic architecture and algorithm
- Validates the option-critic architecture with experiments in various domains
References
[1] Bacon, P. L., Harb, J., & Precup, D. (2017). The Option-Critic Architecture. In AAAI (pp. 1726-1734).

[2] Sutton, R. S., McAllester, D. A., Singh, S. P., & Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in neural information processing systems (pp. 1057-1063).

[3] Sutton, R. S., Precup, D., & Singh, S. (1999). Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2), 181-211.

[4] Sutton, R. S. (1984). Temporal credit assignment in reinforcement learning. Doctoral dissertation, University of Massachusetts Amherst.

[5] Baird III, L. C. (1993). Advantage updating (Technical Report WL-TR-93-1146). Wright Laboratory, Wright-Patterson AFB, OH.

[6] Mann, T., Mankowitz, D., & Mannor, S. (2014). Time-regularized interrupting options (TRIO). In International Conference on Machine Learning (pp. 1350-1358).

[7] Konda, V. R., & Tsitsiklis, J. N. (2000). Actor-critic algorithms. In Advances in neural information processing systems (pp. 1008-1014).

[8] Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.