EWRL - Barcelona, December 2016
Exploration-Exploitation in MDPs with Options
Ronan Fruit, Alessandro Lazaric
SequeL – INRIA Lille
Context and Motivations

Online learning with options: Overview

Assumption: options are assumed to be given as input.

1. Empirical observations: introducing options in an MDP can speed up learning (Tessler et al. [5]) but can also be harmful (Jong et al. [3]).
2. Question: what theoretical properties can explain these observations?
3. Existing algorithms based on options:
   - Model-free learning: SMDP-Q-learning (Sutton et al. [4])
   - Model-based learning: SMDP-RMAX, SMDP-E3 (Brunskill and Li [1])
4. In this presentation: model-based learning in the regret setting.
Outline

- Notations and Theoretical Background
- Algorithm
- Analysis: SMDPs, MDPs with Options
- Results
- Conclusion and Future Work
Notations and Theoretical Background

Markov Decision Processes: Notations

An MDP is a tuple M = {S, A, p, r} where:
- S is a set of states,
- A = (A_s)_{s∈S} is a set of actions,
- p(s'|s, a) is the probability of the transition (s, a) → s',
- r(s, a) is the reward of action a in state s.
Temporal Abstraction: The Option Framework

Definition 1. A (Markov) option o is a 3-tuple {I_o, β_o, π_o} (sketched in code below) where:
- I_o ⊂ S is the set of initial states,
- β_o : S → [0, 1] is a Markov termination condition,
- π_o ∈ Π^SR_M is a stationary Markov policy.
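As a concrete reading of Definition 1, here is a minimal Python sketch of the option data structure for a finite, tabular MDP; the class and field names are illustrative, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Set

State = int
Action = int

@dataclass
class Option:
    """Markov option as in Definition 1 (illustrative field names)."""
    init_states: Set[State]            # I_o: states where the option can be initiated
    beta: Callable[[State], float]     # beta_o(s): probability of terminating in state s
    policy: Callable[[State], Action]  # pi_o(s): stationary Markov policy (deterministic here)

    def can_start(self, s: State) -> bool:
        return s in self.init_states
```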
Temporal Abstraction: Semi-Markov Decision Processes

Definition 2. An SMDP M' = {S, A, p, r, τ} is an MDP with a random holding time τ(s, a) associated with every state-action pair.

Proposition 1 (Sutton et al. [4]). A set of options O defined on an MDP M induces an SMDP M': M + O ⟹ M'.

Theorem 3. If the set of options is proper, all holding times and rewards are sub-exponential. Moreover, they are sub-Gaussian if and only if they are bounded.

MDP + options ⟹ sub-exponential or bounded SMDP
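To make Proposition 1 concrete, a minimal sketch of how executing a single option in the underlying MDP M yields one sample of the induced SMDP transition (next state, cumulative reward, holding time). It reuses the illustrative Option class above; `mdp_step(s, a)`, returning `(next_state, reward)`, is a hypothetical one-step simulator of M.

```python
import random

def execute_option(mdp_step, s, option, rng=random):
    """Roll out `option` from state s in the underlying MDP until it terminates.

    Returns (s_next, total_reward, tau): one sample of the induced SMDP
    transition, reward and holding time for the pair (s, option).
    """
    total_reward, tau = 0.0, 0
    while True:
        a = option.policy(s)                 # follow pi_o
        s, r = mdp_step(s, a)                # one primitive step in M
        total_reward += r
        tau += 1
        if rng.random() < option.beta(s):    # terminate with probability beta_o(s)
            return s, total_reward, tau
```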
SMDP Learning Protocol

At decision step n, the agent is in state s_{n-1}, chooses among the available actions, observes a holding time τ_n and a reward r_n, and reaches state s_n. The cumulative counters are updated as T_n ← T_{n-1} + τ_n and R_n ← R_{n-1} + r_n, so that

T_n = Σ_{i=1}^{n} τ_i   and   R_n = Σ_{i=1}^{n} r_i
Learning Setting

1. T_n: number of time steps; n: number of decision steps ⟹ T_n/n: "scale factor"

2. An optimal policy π* for the agent is a policy maximizing the long-term average reward:

   ρ*(M) = max_π { lim_{n→+∞} E_π[ R_n / T_n ] }

3. The agent aims at minimizing the total regret (computed as in the helper below):

   ∆(M, A, n) = ρ*(M) T_n − R_n
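The regret definition translates directly into code; a minimal helper, assuming the optimal gain ρ*(M) is known and that the observed holding times and rewards are stored in lists.

```python
def total_regret(rho_star, taus, rewards):
    """Total regret after n decision steps: rho*(M) * T_n - R_n,
    with T_n = sum of holding times and R_n = sum of rewards."""
    T_n = sum(taus)
    R_n = sum(rewards)
    return rho_star * T_n - R_n
```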
Algorithm

UCRL-based algorithm:

1. Start a new episode: k ← k + 1.

2. Use the observations acquired during previous episodes to build an estimated SMDP with confidence intervals ⟹ bounded-parameter SMDP:

   |r̃(s, a) − r̂_k(s, a)| ≤ β^r_k(s, a)   and   τ_max ≥ r̃(s, a) ≥ 0
   |τ̃(s, a) − τ̂_k(s, a)| ≤ β^τ_k(s, a)   and   τ_max ≥ τ̃(s, a) ≥ τ_min   and   τ̃(s, a) ≥ r̃(s, a)
   ‖p̃(· | s, a) − p̂_k(· | s, a)‖_1 ≤ β^p_k(s, a)   and   Σ_{s'∈S} p̃(s' | s, a) = 1

   where X̂_k denotes the empirical average of X at episode k, β_k the confidence interval at episode k, and τ_max = max_{s,a} τ̄(s, a) and τ_min = min_{s,a} τ̄(s, a) the maximal and minimal expected durations of the options.

3. Using Value Iteration, compute an (approximately) optimal policy π̃_k for the optimistic SMDP M̃_k ⟹ optimism in the face of uncertainty:

   M̃_k = argmax_{M'} { ρ*(M') }   and   π̃_k = argmax_π { ρ^π(M̃_k) }

4. Follow the computed policy until the number of new observations has doubled for at least one pair (s, a).

5. End episode k.

⟹ Repeat steps 1 to 5 until the number of decision steps reaches n.

A short code sketch of steps 2 and 4 is given below.
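A minimal sketch of steps 2 and 4, assuming tabular statistics keyed by state-action pairs; the Hoeffding-style confidence widths and their constants are illustrative and may differ from the exact quantities used in the analysis.

```python
import math

def confidence_bounds(counts, sum_r, sum_tau, trans_counts, n_states,
                      episode, delta, r_max, tau_max):
    """Step 2 (sketch): empirical averages and confidence-interval widths
    defining the bounded-parameter SMDP at the current episode."""
    estimates = {}
    for sa, n in counts.items():
        n = max(1, n)
        total = max(1, sum(trans_counts[sa].values()))
        log_term = math.log(max(2.0, n_states * episode / delta))
        estimates[sa] = {
            "r_hat": sum_r[sa] / n,                                   # empirical reward
            "tau_hat": sum_tau[sa] / n,                               # empirical holding time
            "p_hat": {s2: c / total for s2, c in trans_counts[sa].items()},
            "beta_r": r_max * math.sqrt(7 * log_term / (2 * n)),      # width for r
            "beta_tau": tau_max * math.sqrt(7 * log_term / (2 * n)),  # width for tau
            "beta_p": math.sqrt(14 * n_states * log_term / n),        # L1 width for p
        }
    return estimates

def episode_should_end(new_counts, counts_at_episode_start):
    """Step 4 (sketch): stop the episode when the number of new samples
    has doubled for at least one state-action pair."""
    return any(new_counts[sa] >= max(1, counts_at_episode_start.get(sa, 0))
               for sa in new_counts)
```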
Analysis: SMDPs

Regret analysis for SMDPs

Theorem 4. With high probability, the regret of SMDP-UCRL in M' is bounded as

   ∆(M', A, n) = O( (D'√S' + C(M', n, δ)) √(S' A' n log(n/δ)) )

with:
- δ: confidence
- S': number of states
- A' = max_s |A'_s|: maximal number of actions per state
- n: number of decision steps
- D': diameter of the SMDP,
  D(M') = max_{s,s'∈S} min_d E_d[ T(s') | s_0 = s ], where the minimum is over stationary policies and T(s') = inf{ Σ_{i=1}^{n} τ_i : n ∈ ℕ, s_n = s' }
- C(M', n, δ) = O(T_max) for bounded r and τ, where T_max is the maximal holding time
- C(M', n, δ) = τ_max + C_τ √(log(n/δ)) for sub-exponential r and τ, where C_τ is the sub-exponential constant

Question: can we use this bound to explain when options can speed up learning?
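For intuition only, the bound of Theorem 4 can be evaluated up to the constant factors hidden in the O(·); the function below is a rough illustrative helper, not a quantity taken from the paper.

```python
import math

def smdp_regret_bound(D_prime, S_prime, A_prime, n, delta, C):
    """Theorem 4 up to hidden constants:
    (D' * sqrt(S') + C) * sqrt(S' * A' * n * log(n / delta))."""
    return (D_prime * math.sqrt(S_prime) + C) * math.sqrt(
        S_prime * A_prime * n * math.log(n / delta))
```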
Analysis: MDPs with Options

Learning in MDPs with Options

Π_M: set of policies in the MDP M. Π^O_M ⊂ Π_M: set of policies over options in M. Restricting to policies over options can only lower the achievable average reward: ρ*(M) ≥ ρ*_O(M).
Learning in MDPs with Options

Remark 1

- Regret of SMDP-UCRL in the original MDP M:

  ∆(M, A, T_n) = ∆(M', A, n) + T_n (ρ*(M) − ρ*_O(M))

- Question: how does the regret with options compare to the regret without options?

- Ratio of the regrets of SMDP-UCRL and MDP-UCRL:

  R = ∆(M, SMDP-UCRL, T_n) / ∆(M, MDP-UCRL, T_n); is R ≤ 1?

  Bound for MDPs (Jaksch et al. [2]): ∆(M, A, T_n) = O( D S √(A T_n log(T_n/δ)) )

- Asymptotically, the ratio behaves as

  R̃ ∼ ( D'/D + T_max/(D√S) ) √(O n / (A T_n)) ∼ (D'/D) √(O/A) √(n/T_n)

  (a small helper evaluating this ratio is sketched below)
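A minimal helper for the asymptotic ratio above, useful for the grid-world comparisons in the next section; parameter names are illustrative.

```python
import math

def regret_ratio(D_prime, D, num_options, num_actions, n, T_n):
    """Asymptotic ratio R~ ~ (D'/D) * sqrt(O/A) * sqrt(n/T_n).
    A value below 1 suggests that learning with the given options
    incurs less regret than learning with primitive actions only."""
    return (D_prime / D) * math.sqrt(num_options / num_actions) * math.sqrt(n / T_n)
```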
Results

Illustrative Experiments: Options with bounded holding time and reward

- Bounded options
- A = O = 4
- S' = S = d²
- D = 2(d − 1)
- τ_max? D'?

R̃ ∼ (D'/D) √(O/A) √(n/T_n)
Illustrative Experiments: Options with bounded holding time and reward

- Stochastic termination condition:

[Figure: from the starting state, a chain of states s0, s1, s2, s3, with termination probability 1/T_max marked on each transition]

τ_max = (T_max + 1)/2   and   D' ≤ D + T_max(T_max + 1)

R̃ ∼ (D'/D) √(O/A) √(n/T_n)
Asymptotic bound on the ratio

[Figure: asymptotic ratio of regrets R̃ as a function of the maximal duration of options T_max (1 to 8), for 20x20, 25x25, and 30x30 grids; y-axis from 0.75 to 1.10]
Simulations

[Figure: empirical ratio of regrets R as a function of the maximal duration of options T_max (1 to 8), for 20x20, 25x25, and 30x30 grids; y-axis from 0.75 to 1.10]
Conclusion and Future Work

Conclusion:
1. UCRL-based algorithm for SMDPs
2. Regret bound (SMDPs, MDPs with options)
3. Theoretical/empirical comparison

Current limitations of SMDP-UCRL for options:
1. Additional knowledge about options is needed (e.g., sub-exponential constants)
2. Similarities between options, and correlations between rewards and holding times, are not exploited

Extensions:
1. Automatic discovery of options? [estimating the diameter]
Thank you for your attention
References
[1] Emma Brunskill and Lihong Li. PAC-inspired option discovery in lifelong reinforcement learning. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), pages 316–324. JMLR Workshop and Conference Proceedings, 2014.

[2] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11:1563–1600, 2010.

[3] Nicholas K. Jong, Todd Hester, and Peter Stone. The utility of temporal abstraction in reinforcement learning. In Proceedings of the Seventh International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS 2008), May 2008.

[4] Richard Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112:181–211, 1999.

[5] Chen Tessler, Shahar Givony, Tom Zahavy, Daniel J. Mankowitz, and Shie Mannor. A deep hierarchical approach to lifelong learning in Minecraft. CoRR, abs/1604.07255, 2016.