
Exploration Conscious Reinforcement Learning Revisited

Lior Shani* Yonathan Efroni* Shie Mannor

Technion – Israel Institute of Technology

Why?

[Cartoon: "I LOVE ε-greedy" ... "Damn you, Exploration!"]

• To learn a good policy, an RL agent must explore!

• However, exploration can cause hazardous behavior during training.

Exploration Conscious Reinforcement Learning

• Objective: Find the optimal policy knowing that exploration might occur.

• For example, ε-greedy exploration (α = ε):

  π_α* ∈ argmax_{π ∈ Π} E_{(1−α)π + α π_0} [ Σ_{t=0}^∞ γ^t r(s_t, a_t) ]

  (π_0 is the fixed exploration policy; for ε-greedy it is uniform over actions.)

• Solving the Exploration-Conscious problem = solving an MDP.

• We describe a bias-error sensitivity tradeoff in α.

[Cartoon: "I'm Exploration Conscious"]
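To make the objective concrete, here is a minimal Python sketch (not from the slides) of a Monte-Carlo rollout under the α-mixture behavior policy, in which the fixed exploration policy π_0 may act instead of π at every step. The env interface returning (next_state, reward, done), the function names, and the rollout scheme are illustrative assumptions.

```python
def exploration_conscious_return(env, policy, pi0, alpha, gamma, horizon, rng):
    """Monte-Carlo rollout under the alpha-mixture (1 - alpha)*policy + alpha*pi0.

    policy(s) and pi0(s) each return an action; env.step is assumed to return
    (next_state, reward, done); rng is e.g. numpy's default_rng().
    """
    state = env.reset()
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        # At every step, the fixed exploration policy pi0 acts with probability alpha.
        action = pi0(state) if rng.random() < alpha else policy(state)
        state, reward, done = env.step(action)
        total += discount * reward
        discount *= gamma
        if done:
            break
    return total
```

Averaging this quantity over many rollouts estimates the exploration-conscious value of `policy` for a given α.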

Fixed Exploration Schemes (e.g. ε-greedy)

Choose Greedy Action (a_greedy) → Draw Exploratory Action (a_act) → Act (a_act) → Receive (r, s')

• a_greedy ∈ argmax_a Q^{π_α}(s, a)

• For α-greedy: a_act = a_greedy w.p. 1 − α, otherwise a_act is drawn from π_0

• Normally used information: (s, a_act, r, s')

• Using information about the exploration process: (s, a_greedy, a_act, r, s')
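As a minimal sketch of this scheme, one α-greedy interaction step can record both the greedy and the acted action, so the exploration-conscious tuple (s, a_greedy, a_act, r, s') is available to the learner. The tabular Q array, the numpy Generator `rng`, and the gym-like `env.step` returning (next_state, reward, done) are assumptions for illustration.

```python
import numpy as np

def alpha_greedy_step(env, Q, state, alpha, rng):
    """One interaction step of the fixed alpha-greedy scheme.

    Returns the exploration-conscious transition (s, a_greedy, a_act, r, s'),
    i.e. it records which action was greedy as well as which was acted.
    Q has shape (num_states, num_actions); rng is np.random.default_rng().
    """
    a_greedy = int(np.argmax(Q[state]))        # a_greedy in argmax_a Q^{pi_alpha}(s, a)
    if rng.random() < alpha:
        a_act = int(rng.integers(Q.shape[1]))  # explore: draw from pi_0 (uniform)
    else:
        a_act = a_greedy                       # exploit: keep the greedy action
    next_state, reward, done = env.step(a_act)
    return state, a_greedy, a_act, reward, next_state, done
```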

Two Approaches – Expected approach

1. Update Q^{π_α}(s_t, a_t^act).

2. Expect that the agent might explore in the next state:

  Q^{π_α}(s_t, a_t^act) += η [ r_t + γ E_{(1−α)π + α π_0} Q^{π_α}(s_{t+1}, a) − Q^{π_α}(s_t, a_t^act) ]

• Calculating expectations can be hard.

• Requires sampling in the continuous case!
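A minimal tabular sketch of the Expected-approach update, assuming π is greedy with respect to the current Q and π_0 is uniform; the function name and in-place update are illustrative. In this discrete case the expectation is a cheap weighted average, which is exactly what becomes hard (and requires sampling) with continuous actions.

```python
import numpy as np

def expected_update(Q, s, a_act, r, s_next, alpha, gamma, eta):
    """Expected-approach update: bootstrap on the alpha-mixture at s'."""
    greedy_next = int(np.argmax(Q[s_next]))
    # E_{(1-alpha) pi + alpha pi_0} Q(s', a), with pi greedy w.r.t. Q and pi_0 uniform.
    expected_q = (1 - alpha) * Q[s_next, greedy_next] + alpha * Q[s_next].mean()
    Q[s, a_act] += eta * (r + gamma * expected_q - Q[s, a_act])
```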

Two Approaches – Surrogate approach

• Exploration is incorporated into the environment!

1. Update Q^{π_α}(s_t, a_t^greedy).

2. The reward and next state (r_t, s_{t+1}) are given by the acted action a_t^act:

  Q^{π_α}(s_t, a_t^greedy) += η [ r_t + γ Q^{π_α}(s_{t+1}, a_{t+1}^greedy) − Q^{π_α}(s_t, a_t^greedy) ]

• NO NEED TO SAMPLE!
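A minimal tabular sketch of the Surrogate-approach update, under the same illustrative assumptions as above: only the greedy action's entry is updated, while the reward and next state come from whatever action was actually executed, so no expectation over π_0 is needed.

```python
import numpy as np

def surrogate_update(Q, s, a_greedy, r, s_next, gamma, eta):
    """Surrogate-approach update: exploration is folded into the environment.

    The Q-table entry of the *greedy* action is updated, while r and s_next
    were produced by the acted action a_act; no expectation over pi_0 is needed.
    """
    greedy_next = int(np.argmax(Q[s_next]))    # a_{t+1}^greedy
    Q[s, a_greedy] += eta * (r + gamma * Q[s_next, greedy_next] - Q[s, a_greedy])
```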

Deep RL Experimental Results

[Plots: training and evaluation performance]

Summary

• We define Exploration Conscious RL and analyze its properties.

• Exploration Conscious RL can improve performance in both the training and evaluation regimes.

• Conclusion: Exploration-Conscious RL, and specifically the Surrogate approach, can easily help improve a variety of RL algorithms.

SEE YOU AT POSTER #90