19
Exploration Conscious Reinforcement Learning Revisited Lior Shani* Yonathan Efroni* Shie Mannor Technion Institute of Technology

Exploration Conscious Reinforcement Learning12... · Shani, Efroni & Mannor /19 I LOVE 𝝐-GREEDY Damn you Exploration! Why? Exploration Conscious Reinforcement Learning revisited

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Exploration Conscious Reinforcement Learning12... · Shani, Efroni & Mannor /19 I LOVE 𝝐-GREEDY Damn you Exploration! Why? Exploration Conscious Reinforcement Learning revisited

Exploration Conscious Reinforcement LearningRevisited

Lior Shani* Yonathan Efroni* Shie Mannor

Technion Institute of Technology

Page 2: Exploration Conscious Reinforcement Learning12... · Shani, Efroni & Mannor /19 I LOVE 𝝐-GREEDY Damn you Exploration! Why? Exploration Conscious Reinforcement Learning revisited

Shani, Efroni & Mannor /19

I LOVE𝝐-GREEDY

Why?

12-Jun-19Exploration Conscious Reinforcement Learning revisited 2

• To learn a good policy, an RL agent must explore!

• However, it can cause hazardous behavior during training.

Page 3: Exploration Conscious Reinforcement Learning12... · Shani, Efroni & Mannor /19 I LOVE 𝝐-GREEDY Damn you Exploration! Why? Exploration Conscious Reinforcement Learning revisited

Shani, Efroni & Mannor /19

I LOVE𝝐-GREEDY

Damn youExploration!

Why?

12-Jun-19Exploration Conscious Reinforcement Learning revisited 3

• To learn a good policy, an RL agent must explore!

• However, it can cause hazardous behavior during training.

Page 4: Exploration Conscious Reinforcement Learning12... · Shani, Efroni & Mannor /19 I LOVE 𝝐-GREEDY Damn you Exploration! Why? Exploration Conscious Reinforcement Learning revisited

Shani, Efroni & Mannor /19

• Objective: Find the optimal policy knowing that exploration might occur

• For example : 𝝐-greedy exploration (𝜶 = 𝝐)

𝝅𝜶∗ ∈ argmax𝝅∈𝜫𝔼

𝟏−𝜶 𝝅+𝜶𝝅𝟎

𝒕=𝟎

𝜸𝒕𝒓 𝒔𝒕, 𝒂𝒕

Exploration Conscious Reinforcement Learning

12-Jun-19Exploration Conscious Reinforcement Learning revisited 4

Page 5: Exploration Conscious Reinforcement Learning12... · Shani, Efroni & Mannor /19 I LOVE 𝝐-GREEDY Damn you Exploration! Why? Exploration Conscious Reinforcement Learning revisited

Shani, Efroni & Mannor /19

• Objective: Find the optimal policy knowing that exploration might occur

• For example : 𝝐-greedy exploration (𝜶 = 𝝐)

𝝅𝜶∗ ∈ argmax𝝅∈𝜫𝔼

𝟏−𝜶 𝝅+𝜶𝝅𝟎

𝒕=𝟎

𝜸𝒕𝒓 𝒔𝒕, 𝒂𝒕

• Solving the Exploration-Conscious problem = Solving an MDP

• We describe a bias-error sensitivity tradeoff in 𝜶

Exploration Conscious Reinforcement Learning

12-Jun-19Exploration Conscious Reinforcement Learning revisited 5

Page 6: Exploration Conscious Reinforcement Learning12... · Shani, Efroni & Mannor /19 I LOVE 𝝐-GREEDY Damn you Exploration! Why? Exploration Conscious Reinforcement Learning revisited

Shani, Efroni & Mannor /19

• Objective: Find the optimal policy knowing that exploration might occur

Exploration Conscious Reinforcement Learning

12-Jun-19Exploration Conscious Reinforcement Learning revisited 6

I’m ExplorationConscious

Page 7: Exploration Conscious Reinforcement Learning12... · Shani, Efroni & Mannor /19 I LOVE 𝝐-GREEDY Damn you Exploration! Why? Exploration Conscious Reinforcement Learning revisited

Shani, Efroni & Mannor /19

Fixed Exploration Schemes (e.g. 𝜖-greedy)

12-Jun-19Exploration Conscious Reinforcement Learning revisited 7

Choose

Greedy

Action

𝒂𝒈𝒓𝒆𝒆𝒅𝒚

• 𝒂𝒈𝒓𝒆𝒆𝒅𝒚 ∈ argmax𝒂𝑸𝝅𝜶 𝒔, 𝒂

Page 8: Exploration Conscious Reinforcement Learning12... · Shani, Efroni & Mannor /19 I LOVE 𝝐-GREEDY Damn you Exploration! Why? Exploration Conscious Reinforcement Learning revisited

Shani, Efroni & Mannor /19

Choose

Greedy

Action

𝒂𝒈𝒓𝒆𝒆𝒅𝒚

Draw

Exploratory

Action

𝒂𝒂𝒄𝒕

Fixed Exploration Schemes (e.g. 𝜖-greedy)

12-Jun-19Exploration Conscious Reinforcement Learning revisited 8

• 𝒂𝒈𝒓𝒆𝒆𝒅𝒚 ∈ argmax𝒂𝑸𝝅𝜶 𝒔, 𝒂

• For 𝜶-greedy: 𝒂𝒂𝒄𝒕 ∈𝒂𝒈𝒓𝒆𝒆𝒅𝒚 w.p. 𝟏 − 𝜶𝝅𝟎 else

Page 9: Exploration Conscious Reinforcement Learning12... · Shani, Efroni & Mannor /19 I LOVE 𝝐-GREEDY Damn you Exploration! Why? Exploration Conscious Reinforcement Learning revisited

Shani, Efroni & Mannor /19

Choose

Greedy

Action

𝒂𝒈𝒓𝒆𝒆𝒅𝒚

Draw

Exploratory

Action

𝒂𝒂𝒄𝒕

Act

𝒂𝒂𝒄𝒕

Fixed Exploration Schemes (e.g. 𝜖-greedy)

12-Jun-19Exploration Conscious Reinforcement Learning revisited 9

• 𝒂𝒈𝒓𝒆𝒆𝒅𝒚 ∈ argmax𝒂𝑸𝝅𝜶 𝒔, 𝒂

• For 𝜶-greedy: 𝒂𝒂𝒄𝒕 ∈𝒂𝒈𝒓𝒆𝒆𝒅𝒚 w.p. 𝟏 − 𝜶𝝅𝟎 else

Page 10: Exploration Conscious Reinforcement Learning12... · Shani, Efroni & Mannor /19 I LOVE 𝝐-GREEDY Damn you Exploration! Why? Exploration Conscious Reinforcement Learning revisited

Shani, Efroni & Mannor /19

Choose

Greedy

Action

𝒂𝒈𝒓𝒆𝒆𝒅𝒚

Draw

Exploratory

Action

𝒂𝒂𝒄𝒕

Act

𝒂𝒂𝒄𝒕Recieve

𝒓, 𝒔′

Fixed Exploration Schemes (e.g. 𝜖-greedy)

12-Jun-19Exploration Conscious Reinforcement Learning revisited 10

• 𝒂𝒈𝒓𝒆𝒆𝒅𝒚 ∈ argmax𝒂𝑸𝝅𝜶 𝒔, 𝒂

• For 𝜶-greedy: 𝒂𝒂𝒄𝒕 ∈𝒂𝒈𝒓𝒆𝒆𝒅𝒚 w.p. 𝟏 − 𝜶𝝅𝟎 else

Page 11: Exploration Conscious Reinforcement Learning12... · Shani, Efroni & Mannor /19 I LOVE 𝝐-GREEDY Damn you Exploration! Why? Exploration Conscious Reinforcement Learning revisited

Shani, Efroni & Mannor /19

Choose

Greedy

Action

𝒂𝒈𝒓𝒆𝒆𝒅𝒚

Draw

Exploratory

Action

𝒂𝒂𝒄𝒕

Act

𝒂𝒂𝒄𝒕Recieve

𝒓, 𝒔′

Fixed Exploration Schemes (e.g. 𝜖-greedy)

12-Jun-19Exploration Conscious Reinforcement Learning revisited 11

• Normally used information: 𝒔, 𝒂𝒂𝒄𝒕, 𝒓, 𝒔′

Page 12: Exploration Conscious Reinforcement Learning12... · Shani, Efroni & Mannor /19 I LOVE 𝝐-GREEDY Damn you Exploration! Why? Exploration Conscious Reinforcement Learning revisited

Shani, Efroni & Mannor /19

Choose

Greedy

Action

𝒂𝒈𝒓𝒆𝒆𝒅𝒚

Draw

Exploratory

Action

𝒂𝒂𝒄𝒕

Act

𝒂𝒂𝒄𝒕Recieve

𝒓, 𝒔′

Fixed Exploration Schemes (e.g. 𝜖-greedy)

12-Jun-19Exploration Conscious Reinforcement Learning revisited 12

• Normally used information: 𝒔, 𝒂𝒂𝒄𝒕, 𝒓, 𝒔′

• Using information about the exploration process: 𝒔, 𝒂𝒈𝒓𝒆𝒆𝒅𝒚 , 𝒂𝒂𝒄𝒕, 𝒓, 𝒔′

Page 13: Exploration Conscious Reinforcement Learning12... · Shani, Efroni & Mannor /19 I LOVE 𝝐-GREEDY Damn you Exploration! Why? Exploration Conscious Reinforcement Learning revisited

Shani, Efroni & Mannor /19

Two Approaches – Expected approach

12-Jun-19Exploration Conscious Reinforcement Learning revisited 13

1. Update 𝑸𝝅𝜶 𝒔𝒕, 𝒂𝒕𝒂𝒄𝒕

2. Expect that the agent might explore in the next state

𝑸𝝅𝜶 𝒔𝒕, 𝒂𝒕𝒂𝒄𝒕 += 𝜼 𝒓𝒕 + 𝜸𝔼 𝟏−𝜶 𝝅+𝜶𝝅𝟎𝑸𝝅𝜶 𝒔𝒕+𝟏, 𝒂 − 𝑸𝝅𝜶 𝒔𝒕, 𝒂𝒕

𝒂𝒄𝒕

Page 14: Exploration Conscious Reinforcement Learning12... · Shani, Efroni & Mannor /19 I LOVE 𝝐-GREEDY Damn you Exploration! Why? Exploration Conscious Reinforcement Learning revisited

Shani, Efroni & Mannor /19

Two Approaches – Expected approach

12-Jun-19Exploration Conscious Reinforcement Learning revisited 14

1. Update 𝑸𝝅𝜶 𝒔𝒕, 𝒂𝒕𝒂𝒄𝒕

2. Expect that the agent might explore in the next state

𝑸𝝅𝜶 𝒔𝒕, 𝒂𝒕𝒂𝒄𝒕 += 𝜼 𝒓𝒕 + 𝜸𝔼 𝟏−𝜶 𝝅+𝜶𝝅𝟎𝑸𝝅𝜶 𝒔𝒕+𝟏, 𝒂 − 𝑸𝝅𝜶 𝒔𝒕, 𝒂𝒕

𝒂𝒄𝒕

• Calculating expectations can be hard.

• Requires sampling in the continuous case!

Page 15: Exploration Conscious Reinforcement Learning12... · Shani, Efroni & Mannor /19 I LOVE 𝝐-GREEDY Damn you Exploration! Why? Exploration Conscious Reinforcement Learning revisited

Shani, Efroni & Mannor /19

• Exploration is incorporated into the environment!

1. Update 𝑸𝝅𝜶 𝒔𝒕, 𝒂𝒕𝒈𝒓𝒆𝒆𝒅𝒚

2. The rewards and next state 𝒓𝒕 ,𝒔𝒕+𝟏 are given by the acted action 𝒂𝒕𝒂𝒄𝒕

𝑸𝝅𝜶 𝒔𝒕, 𝒂𝒕𝒈𝒓𝒆𝒆𝒅𝒚

+= 𝜼 𝒓𝒕 + 𝜸𝑸𝝅𝜶 𝒔𝒕+𝟏, 𝒂𝒕+𝟏𝒈𝒓𝒆𝒆𝒅𝒚

− 𝑸𝝅𝜶 𝒔𝒕, 𝒂𝒕𝒈𝒓𝒆𝒆𝒅𝒚

Two Approaches – Surrogate approach

12-Jun-19Exploration Conscious Reinforcement Learning revisited 15

Page 16: Exploration Conscious Reinforcement Learning12... · Shani, Efroni & Mannor /19 I LOVE 𝝐-GREEDY Damn you Exploration! Why? Exploration Conscious Reinforcement Learning revisited

Shani, Efroni & Mannor /19

• Exploration is incorporated into the environment!

1. Update 𝑸𝝅𝜶 𝒔𝒕, 𝒂𝒕𝒈𝒓𝒆𝒆𝒅𝒚

2. The rewards and next state 𝒓𝒕 ,𝒔𝒕+𝟏 are given by the acted action 𝒂𝒕𝒂𝒄𝒕

𝑸𝝅𝜶 𝒔𝒕, 𝒂𝒕𝒈𝒓𝒆𝒆𝒅𝒚

+= 𝜼 𝒓𝒕 + 𝜸𝑸𝝅𝜶 𝒔𝒕+𝟏, 𝒂𝒕+𝟏𝒈𝒓𝒆𝒆𝒅𝒚

− 𝑸𝝅𝜶 𝒔𝒕, 𝒂𝒕𝒈𝒓𝒆𝒆𝒅𝒚

• NO NEED TO SAMPLE!

Two Approaches – Surrogate approach

12-Jun-19Exploration Conscious Reinforcement Learning revisited 16

Page 17: Exploration Conscious Reinforcement Learning12... · Shani, Efroni & Mannor /19 I LOVE 𝝐-GREEDY Damn you Exploration! Why? Exploration Conscious Reinforcement Learning revisited

Shani, Efroni & Mannor /19

Deep RLExperimental Results

12-Jun-19Exploration Conscious Reinforcement Learning revisited 17

Training Evaluation

Page 18: Exploration Conscious Reinforcement Learning12... · Shani, Efroni & Mannor /19 I LOVE 𝝐-GREEDY Damn you Exploration! Why? Exploration Conscious Reinforcement Learning revisited

Shani, Efroni & Mannor /19

• We define Exploration Conscious RL and analyze its properties.

• Exploration Conscious RL can improve performance over both the training

and evaluation regimes.

• Conclusion: Exploration-Conscious RL and specifically, the Surrogate

approach, can easily help to improve variety of RL algorithms.

Summary

12-Jun-19Exploration Conscious Reinforcement Learning revisited 18

Page 19: Exploration Conscious Reinforcement Learning12... · Shani, Efroni & Mannor /19 I LOVE 𝝐-GREEDY Damn you Exploration! Why? Exploration Conscious Reinforcement Learning revisited

Shani, Efroni & Mannor /19

• We define Exploration Conscious RL and analyze its properties.

• Exploration Conscious RL can improve performance over both the training

and evaluation regimes.

• Conclusion: Exploration-Conscious RL and specifically, the Surrogate

approach, can easily help to improve variety of RL algorithms.

SEE YOU AT POSTER

#90

Summary

12-Jun-19Exploration Conscious Reinforcement Learning revisited 19