Exploration Conscious Reinforcement Learning Revisited
Lior Shani*, Yonathan Efroni*, Shie Mannor
Technion – Israel Institute of Technology
"I LOVE ε-greedy!"
"Damn you, Exploration!"

Why?
• To learn a good policy, an RL agent must explore!
• However, exploration can cause hazardous behavior during training.
Exploration Conscious Reinforcement Learning
• Objective: find the optimal policy knowing that exploration might occur.
• For example, ε-greedy exploration (α = ε):

    π*_α ∈ argmax_{π∈Π} E^{(1−α)π + απ₀} [ Σ_{t=0}^∞ γ^t r(s_t, a_t) ]

• Solving the Exploration-Conscious problem = solving an MDP (sketched below).
• We describe a bias-error sensitivity tradeoff in α.
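The MDP equivalence can be made concrete. Below is a hedged sketch of the surrogate MDP construction, reconstructed from the mixture objective above rather than quoted from the slides: the fixed exploration policy π₀ is folded into the transition kernel and reward, after which any standard MDP solver applies.

```latex
% Sketch: surrogate MDP M_alpha = (S, A, P_alpha, r_alpha, gamma),
% assuming the agent explores w.p. alpha by drawing from pi_0.
P_\alpha(s' \mid s, a) = (1-\alpha)\, P(s' \mid s, a)
                       + \alpha \sum_{a'} \pi_0(a' \mid s)\, P(s' \mid s, a')
r_\alpha(s, a) = (1-\alpha)\, r(s, a)
               + \alpha \sum_{a'} \pi_0(a' \mid s)\, r(s, a')
```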
Fixed Exploration Schemes (e.g., ε-greedy)

Choose greedy action a^{greedy}(s) → Draw exploratory action a^{act} → Act a^{act} → Receive r, s′

• a^{greedy}(s) ∈ argmax_a Q_α(s, a)
• For α-greedy: a^{act} = a^{greedy}(s) w.p. 1 − α; otherwise a^{act} ∼ π₀(s) (e.g., uniform)
• Normally used information: (s, a^{act}, r, s′)
• Using information about the exploration process: (s, a^{greedy}(s), a^{act}, r, s′), as in the sketch below
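As a concrete illustration, here is a minimal Python sketch (my own, not from the paper) of α-greedy action selection that returns both a^{greedy} and a^{act}, so a learner can form the exploration-conscious tuple (s, a^{greedy}(s), a^{act}, r, s′). The names `Q`, `alpha`, and `rng` are illustrative assumptions.

```python
import numpy as np

def alpha_greedy_step(Q, s, alpha, rng):
    """Pick the greedy action, then explore w.p. alpha via a uniform pi_0.

    Returns both actions so the update rules below can stay
    exploration conscious.
    """
    n_actions = Q.shape[1]
    a_greedy = int(Q[s].argmax())             # a^greedy(s) in argmax_a Q_alpha(s, a)
    if rng.random() < alpha:                  # explore w.p. alpha
        a_act = int(rng.integers(n_actions))  # a^act ~ pi_0(s), here uniform
    else:
        a_act = a_greedy                      # act greedily w.p. 1 - alpha
    return a_greedy, a_act

# Usage sketch: Q = np.zeros((n_states, n_actions)); rng = np.random.default_rng(0)
```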
Two Approaches – Expected approach
1. Update Q_α(s_t, a^{act}_t).
2. Expect that the agent might explore in the next state:

    Q_α(s_t, a^{act}_t) += η [ r_t + γ E_{a ∼ (1−α)π^{greedy} + απ₀} Q_α(s_{t+1}, a) − Q_α(s_t, a^{act}_t) ]

• Calculating expectations can be hard.
• Requires sampling in the continuous case! (A tabular sketch follows below.)
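A minimal tabular sketch of the Expected update, assuming a uniform exploration policy π₀ over discrete actions; `lr` (the step size η) and `gamma` are illustrative names, not the paper's notation.

```python
def expected_update(Q, s, a_act, r, s_next, alpha, lr, gamma):
    """Expected approach: move Q_alpha(s_t, a_t^act) toward
    r_t + gamma * E_{a ~ (1-alpha)*pi_greedy + alpha*pi_0} Q_alpha(s_{t+1}, a)."""
    greedy_value = Q[s_next].max()    # value of the next greedy action
    explore_value = Q[s_next].mean()  # expectation under uniform pi_0
    expected_next = (1 - alpha) * greedy_value + alpha * explore_value
    Q[s, a_act] += lr * (r + gamma * expected_next - Q[s, a_act])
```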
Two Approaches – Surrogate approach
• Exploration is incorporated into the environment!
1. Update Q_α(s_t, a^{greedy}_t).
2. The reward and next state r_t, s_{t+1} are given by the acted action a^{act}_t:

    Q_α(s_t, a^{greedy}_t) += η [ r_t + γ Q_α(s_{t+1}, a^{greedy}_{t+1}) − Q_α(s_t, a^{greedy}_t) ]

• NO NEED TO SAMPLE! (A matching sketch follows below.)
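A matching sketch of the Surrogate update, under the same illustrative names: it bootstraps from the next greedy action, while r and s_next come from whichever action was actually executed, so no expectation over π₀ is computed.

```python
def surrogate_update(Q, s, a_greedy, r, s_next, lr, gamma):
    """Surrogate approach: update Q_alpha(s_t, a_t^greedy); the sampled
    (r, s_next) already contain the exploration folded into the environment."""
    a_greedy_next = int(Q[s_next].argmax())   # a_{t+1}^greedy
    target = r + gamma * Q[s_next, a_greedy_next]
    Q[s, a_greedy] += lr * (target - Q[s, a_greedy])
```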
Deep RL Experimental Results
[Plots: learning curves during Training and at Evaluation]
Summary
• We define Exploration Conscious RL and analyze its properties.
• Exploration Conscious RL can improve performance in both the training and evaluation regimes.
• Conclusion: Exploration-Conscious RL, and specifically the Surrogate approach, can easily help improve a variety of RL algorithms.

SEE YOU AT POSTER #90