
Exploration Conscious Reinforcement Learning Revisited

Lior Shani* Yonathan Efroni* Shie Mannor

Technion – Israel Institute of Technology

Why?

[Cartoon: "I LOVE ε-greedy" ... "Damn you, Exploration!"]

• To learn a good policy, an RL agent must explore!

• However, exploration can cause hazardous behavior during training.

Exploration Conscious Reinforcement Learning

• Objective: Find the optimal policy knowing that exploration might occur.

• For example, ε-greedy exploration (α = ε):

  π_α* ∈ argmax_{π ∈ Π} E_{(1−α)π + α π_0} [ Σ_{t=0}^∞ γ^t r(s_t, a_t) ]

  (π_0 is the fixed exploration policy; for ε-greedy it is uniform over actions.)

• Solving the Exploration-Conscious problem = solving an MDP.

• We describe a bias-error sensitivity tradeoff in α.

[Cartoon: "I'm Exploration Conscious"]
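To make the objective concrete, here is a minimal Python sketch (not from the slides) of a Monte-Carlo rollout under the α-mixture behavior policy, in which the fixed exploration policy π_0 may act instead of π at every step. The env interface returning (next_state, reward, done), the function names, and the rollout scheme are illustrative assumptions.

```python
def exploration_conscious_return(env, policy, pi0, alpha, gamma, horizon, rng):
    """Monte-Carlo rollout under the alpha-mixture (1 - alpha)*policy + alpha*pi0.

    policy(s) and pi0(s) each return an action; env.step is assumed to return
    (next_state, reward, done); rng is e.g. numpy's default_rng().
    """
    state = env.reset()
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        # At every step, the fixed exploration policy pi0 acts with probability alpha.
        action = pi0(state) if rng.random() < alpha else policy(state)
        state, reward, done = env.step(action)
        total += discount * reward
        discount *= gamma
        if done:
            break
    return total
```

Averaging this quantity over many rollouts estimates the exploration-conscious value of `policy` for a given α.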

Fixed Exploration Schemes (e.g. ε-greedy)

Choose Greedy Action (a_greedy) → Draw Exploratory Action (a_act) → Act (a_act) → Receive (r, s')

• a_greedy ∈ argmax_a Q^{π_α}(s, a)

• For α-greedy: a_act = a_greedy w.p. 1 − α, otherwise a_act is drawn from π_0

• Normally used information: (s, a_act, r, s')

• Using information about the exploration process: (s, a_greedy, a_act, r, s')
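As a minimal sketch of this scheme, one α-greedy interaction step can record both the greedy and the acted action, so the exploration-conscious tuple (s, a_greedy, a_act, r, s') is available to the learner. The tabular Q array, the numpy Generator `rng`, and the gym-like `env.step` returning (next_state, reward, done) are assumptions for illustration.

```python
import numpy as np

def alpha_greedy_step(env, Q, state, alpha, rng):
    """One interaction step of the fixed alpha-greedy scheme.

    Returns the exploration-conscious transition (s, a_greedy, a_act, r, s'),
    i.e. it records which action was greedy as well as which was acted.
    Q has shape (num_states, num_actions); rng is np.random.default_rng().
    """
    a_greedy = int(np.argmax(Q[state]))        # a_greedy in argmax_a Q^{pi_alpha}(s, a)
    if rng.random() < alpha:
        a_act = int(rng.integers(Q.shape[1]))  # explore: draw from pi_0 (uniform)
    else:
        a_act = a_greedy                       # exploit: keep the greedy action
    next_state, reward, done = env.step(a_act)
    return state, a_greedy, a_act, reward, next_state, done
```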

Two Approaches – Expected approach

1. Update Q^{π_α}(s_t, a_t^act).

2. Expect that the agent might explore in the next state:

  Q^{π_α}(s_t, a_t^act) += η [ r_t + γ E_{(1−α)π + α π_0} Q^{π_α}(s_{t+1}, a) − Q^{π_α}(s_t, a_t^act) ]

• Calculating expectations can be hard.

• Requires sampling in the continuous case!
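A minimal tabular sketch of the Expected-approach update, assuming π is greedy with respect to the current Q and π_0 is uniform; the function name and in-place update are illustrative. In this discrete case the expectation is a cheap weighted average, which is exactly what becomes hard (and requires sampling) with continuous actions.

```python
import numpy as np

def expected_update(Q, s, a_act, r, s_next, alpha, gamma, eta):
    """Expected-approach update: bootstrap on the alpha-mixture at s'."""
    greedy_next = int(np.argmax(Q[s_next]))
    # E_{(1-alpha) pi + alpha pi_0} Q(s', a), with pi greedy w.r.t. Q and pi_0 uniform.
    expected_q = (1 - alpha) * Q[s_next, greedy_next] + alpha * Q[s_next].mean()
    Q[s, a_act] += eta * (r + gamma * expected_q - Q[s, a_act])
```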

Two Approaches – Surrogate approach

• Exploration is incorporated into the environment!

1. Update Q^{π_α}(s_t, a_t^greedy).

2. The reward and next state (r_t, s_{t+1}) are given by the acted action a_t^act:

  Q^{π_α}(s_t, a_t^greedy) += η [ r_t + γ Q^{π_α}(s_{t+1}, a_{t+1}^greedy) − Q^{π_α}(s_t, a_t^greedy) ]

• NO NEED TO SAMPLE!
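A minimal tabular sketch of the Surrogate-approach update, under the same illustrative assumptions as above: only the greedy action's entry is updated, while the reward and next state come from whatever action was actually executed, so no expectation over π_0 is needed.

```python
import numpy as np

def surrogate_update(Q, s, a_greedy, r, s_next, gamma, eta):
    """Surrogate-approach update: exploration is folded into the environment.

    The Q-table entry of the *greedy* action is updated, while r and s_next
    were produced by the acted action a_act; no expectation over pi_0 is needed.
    """
    greedy_next = int(np.argmax(Q[s_next]))    # a_{t+1}^greedy
    Q[s, a_greedy] += eta * (r + gamma * Q[s_next, greedy_next] - Q[s, a_greedy])
```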

Deep RL Experimental Results

[Plots: training and evaluation performance]

Summary

• We define Exploration Conscious RL and analyze its properties.

• Exploration Conscious RL can improve performance in both the training and evaluation regimes.

• Conclusion: Exploration-Conscious RL, and specifically the Surrogate approach, can easily help improve a variety of RL algorithms.

SEE YOU AT POSTER #90