Upload
others
View
5
Download
0
Embed Size (px)
Citation preview
Exploration Conscious Reinforcement LearningRevisited
Lior Shani* Yonathan Efroni* Shie Mannor
Technion Institute of Technology
Shani, Efroni & Mannor /19
I LOVE𝝐-GREEDY
Why?
12-Jun-19Exploration Conscious Reinforcement Learning revisited 2
• To learn a good policy, an RL agent must explore!
• However, it can cause hazardous behavior during training.
Shani, Efroni & Mannor /19
I LOVE𝝐-GREEDY
Damn youExploration!
Why?
12-Jun-19Exploration Conscious Reinforcement Learning revisited 3
• To learn a good policy, an RL agent must explore!
• However, it can cause hazardous behavior during training.
Shani, Efroni & Mannor /19
• Objective: Find the optimal policy knowing that exploration might occur
• For example : 𝝐-greedy exploration (𝜶 = 𝝐)
𝝅𝜶∗ ∈ argmax𝝅∈𝜫𝔼
𝟏−𝜶 𝝅+𝜶𝝅𝟎
𝒕=𝟎
∞
𝜸𝒕𝒓 𝒔𝒕, 𝒂𝒕
Exploration Conscious Reinforcement Learning
12-Jun-19Exploration Conscious Reinforcement Learning revisited 4
Shani, Efroni & Mannor /19
• Objective: Find the optimal policy knowing that exploration might occur
• For example : 𝝐-greedy exploration (𝜶 = 𝝐)
𝝅𝜶∗ ∈ argmax𝝅∈𝜫𝔼
𝟏−𝜶 𝝅+𝜶𝝅𝟎
𝒕=𝟎
∞
𝜸𝒕𝒓 𝒔𝒕, 𝒂𝒕
• Solving the Exploration-Conscious problem = Solving an MDP
• We describe a bias-error sensitivity tradeoff in 𝜶
Exploration Conscious Reinforcement Learning
12-Jun-19Exploration Conscious Reinforcement Learning revisited 5
Shani, Efroni & Mannor /19
• Objective: Find the optimal policy knowing that exploration might occur
Exploration Conscious Reinforcement Learning
12-Jun-19Exploration Conscious Reinforcement Learning revisited 6
I’m ExplorationConscious
Shani, Efroni & Mannor /19
Fixed Exploration Schemes (e.g. 𝜖-greedy)
12-Jun-19Exploration Conscious Reinforcement Learning revisited 7
Choose
Greedy
Action
𝒂𝒈𝒓𝒆𝒆𝒅𝒚
• 𝒂𝒈𝒓𝒆𝒆𝒅𝒚 ∈ argmax𝒂𝑸𝝅𝜶 𝒔, 𝒂
Shani, Efroni & Mannor /19
Choose
Greedy
Action
𝒂𝒈𝒓𝒆𝒆𝒅𝒚
Draw
Exploratory
Action
𝒂𝒂𝒄𝒕
Fixed Exploration Schemes (e.g. 𝜖-greedy)
12-Jun-19Exploration Conscious Reinforcement Learning revisited 8
• 𝒂𝒈𝒓𝒆𝒆𝒅𝒚 ∈ argmax𝒂𝑸𝝅𝜶 𝒔, 𝒂
• For 𝜶-greedy: 𝒂𝒂𝒄𝒕 ∈𝒂𝒈𝒓𝒆𝒆𝒅𝒚 w.p. 𝟏 − 𝜶𝝅𝟎 else
Shani, Efroni & Mannor /19
Choose
Greedy
Action
𝒂𝒈𝒓𝒆𝒆𝒅𝒚
Draw
Exploratory
Action
𝒂𝒂𝒄𝒕
Act
𝒂𝒂𝒄𝒕
Fixed Exploration Schemes (e.g. 𝜖-greedy)
12-Jun-19Exploration Conscious Reinforcement Learning revisited 9
• 𝒂𝒈𝒓𝒆𝒆𝒅𝒚 ∈ argmax𝒂𝑸𝝅𝜶 𝒔, 𝒂
• For 𝜶-greedy: 𝒂𝒂𝒄𝒕 ∈𝒂𝒈𝒓𝒆𝒆𝒅𝒚 w.p. 𝟏 − 𝜶𝝅𝟎 else
Shani, Efroni & Mannor /19
Choose
Greedy
Action
𝒂𝒈𝒓𝒆𝒆𝒅𝒚
Draw
Exploratory
Action
𝒂𝒂𝒄𝒕
Act
𝒂𝒂𝒄𝒕Recieve
𝒓, 𝒔′
Fixed Exploration Schemes (e.g. 𝜖-greedy)
12-Jun-19Exploration Conscious Reinforcement Learning revisited 10
• 𝒂𝒈𝒓𝒆𝒆𝒅𝒚 ∈ argmax𝒂𝑸𝝅𝜶 𝒔, 𝒂
• For 𝜶-greedy: 𝒂𝒂𝒄𝒕 ∈𝒂𝒈𝒓𝒆𝒆𝒅𝒚 w.p. 𝟏 − 𝜶𝝅𝟎 else
Shani, Efroni & Mannor /19
Choose
Greedy
Action
𝒂𝒈𝒓𝒆𝒆𝒅𝒚
Draw
Exploratory
Action
𝒂𝒂𝒄𝒕
Act
𝒂𝒂𝒄𝒕Recieve
𝒓, 𝒔′
Fixed Exploration Schemes (e.g. 𝜖-greedy)
12-Jun-19Exploration Conscious Reinforcement Learning revisited 11
• Normally used information: 𝒔, 𝒂𝒂𝒄𝒕, 𝒓, 𝒔′
Shani, Efroni & Mannor /19
Choose
Greedy
Action
𝒂𝒈𝒓𝒆𝒆𝒅𝒚
Draw
Exploratory
Action
𝒂𝒂𝒄𝒕
Act
𝒂𝒂𝒄𝒕Recieve
𝒓, 𝒔′
Fixed Exploration Schemes (e.g. 𝜖-greedy)
12-Jun-19Exploration Conscious Reinforcement Learning revisited 12
• Normally used information: 𝒔, 𝒂𝒂𝒄𝒕, 𝒓, 𝒔′
• Using information about the exploration process: 𝒔, 𝒂𝒈𝒓𝒆𝒆𝒅𝒚 , 𝒂𝒂𝒄𝒕, 𝒓, 𝒔′
Shani, Efroni & Mannor /19
Two Approaches – Expected approach
12-Jun-19Exploration Conscious Reinforcement Learning revisited 13
1. Update 𝑸𝝅𝜶 𝒔𝒕, 𝒂𝒕𝒂𝒄𝒕
2. Expect that the agent might explore in the next state
𝑸𝝅𝜶 𝒔𝒕, 𝒂𝒕𝒂𝒄𝒕 += 𝜼 𝒓𝒕 + 𝜸𝔼 𝟏−𝜶 𝝅+𝜶𝝅𝟎𝑸𝝅𝜶 𝒔𝒕+𝟏, 𝒂 − 𝑸𝝅𝜶 𝒔𝒕, 𝒂𝒕
𝒂𝒄𝒕
Shani, Efroni & Mannor /19
Two Approaches – Expected approach
12-Jun-19Exploration Conscious Reinforcement Learning revisited 14
1. Update 𝑸𝝅𝜶 𝒔𝒕, 𝒂𝒕𝒂𝒄𝒕
2. Expect that the agent might explore in the next state
𝑸𝝅𝜶 𝒔𝒕, 𝒂𝒕𝒂𝒄𝒕 += 𝜼 𝒓𝒕 + 𝜸𝔼 𝟏−𝜶 𝝅+𝜶𝝅𝟎𝑸𝝅𝜶 𝒔𝒕+𝟏, 𝒂 − 𝑸𝝅𝜶 𝒔𝒕, 𝒂𝒕
𝒂𝒄𝒕
• Calculating expectations can be hard.
• Requires sampling in the continuous case!
Shani, Efroni & Mannor /19
• Exploration is incorporated into the environment!
1. Update 𝑸𝝅𝜶 𝒔𝒕, 𝒂𝒕𝒈𝒓𝒆𝒆𝒅𝒚
2. The rewards and next state 𝒓𝒕 ,𝒔𝒕+𝟏 are given by the acted action 𝒂𝒕𝒂𝒄𝒕
𝑸𝝅𝜶 𝒔𝒕, 𝒂𝒕𝒈𝒓𝒆𝒆𝒅𝒚
+= 𝜼 𝒓𝒕 + 𝜸𝑸𝝅𝜶 𝒔𝒕+𝟏, 𝒂𝒕+𝟏𝒈𝒓𝒆𝒆𝒅𝒚
− 𝑸𝝅𝜶 𝒔𝒕, 𝒂𝒕𝒈𝒓𝒆𝒆𝒅𝒚
Two Approaches – Surrogate approach
12-Jun-19Exploration Conscious Reinforcement Learning revisited 15
Shani, Efroni & Mannor /19
• Exploration is incorporated into the environment!
1. Update 𝑸𝝅𝜶 𝒔𝒕, 𝒂𝒕𝒈𝒓𝒆𝒆𝒅𝒚
2. The rewards and next state 𝒓𝒕 ,𝒔𝒕+𝟏 are given by the acted action 𝒂𝒕𝒂𝒄𝒕
𝑸𝝅𝜶 𝒔𝒕, 𝒂𝒕𝒈𝒓𝒆𝒆𝒅𝒚
+= 𝜼 𝒓𝒕 + 𝜸𝑸𝝅𝜶 𝒔𝒕+𝟏, 𝒂𝒕+𝟏𝒈𝒓𝒆𝒆𝒅𝒚
− 𝑸𝝅𝜶 𝒔𝒕, 𝒂𝒕𝒈𝒓𝒆𝒆𝒅𝒚
• NO NEED TO SAMPLE!
Two Approaches – Surrogate approach
12-Jun-19Exploration Conscious Reinforcement Learning revisited 16
Shani, Efroni & Mannor /19
Deep RLExperimental Results
12-Jun-19Exploration Conscious Reinforcement Learning revisited 17
Training Evaluation
Shani, Efroni & Mannor /19
• We define Exploration Conscious RL and analyze its properties.
• Exploration Conscious RL can improve performance over both the training
and evaluation regimes.
• Conclusion: Exploration-Conscious RL and specifically, the Surrogate
approach, can easily help to improve variety of RL algorithms.
Summary
12-Jun-19Exploration Conscious Reinforcement Learning revisited 18
Shani, Efroni & Mannor /19
• We define Exploration Conscious RL and analyze its properties.
• Exploration Conscious RL can improve performance over both the training
and evaluation regimes.
• Conclusion: Exploration-Conscious RL and specifically, the Surrogate
approach, can easily help to improve variety of RL algorithms.
SEE YOU AT POSTER
#90
Summary
12-Jun-19Exploration Conscious Reinforcement Learning revisited 19