
Page 1: Title

Asynchronous Methods for Deep Reinforcement Learning
Lavrenti Frobeen
Hardware Acceleration for Data Processing – Fall 2017
14.11.2017


Page 6: Ingredients of Reinforcement Learning

▪ Set of states $S$
▪ Set of actions $A$

The agent observes $s_t \in S$ and takes an action $a_t \in A$ according to its policy, receiving a new state $s_{t+1} \in S$ and a reward $r_{t+1} \in \mathbb{R}$.

▪ Total reward: $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}$, with discount factor $\gamma \in (0, 1)$
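As a concrete illustration of this loop, here is a minimal Python sketch that rolls out one episode and computes the discounted return $R_0$. It assumes a Gym-style environment (`env.reset()`, `env.step()`) and a `policy` callable; both are placeholders, not part of the original slides.

```python
def run_episode(env, policy, gamma=0.99):
    """Roll out one episode and return the discounted return from t = 0."""
    rewards = []
    state = env.reset()
    done = False
    while not done:
        action = policy(state)                       # a_t chosen by the policy
        state, reward, done, _ = env.step(action)    # observe s_{t+1}, r_{t+1}
        rewards.append(reward)
    # R_0 = sum_k gamma^k * r_k
    return sum(gamma ** k * r for k, r in enumerate(rewards))
```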

Page 7: What makes RL difficult?

▪ Delayed rewards
▪ Vast state spaces
▪ Dynamic environment / actions influence the training data

Page 8: Solving RL: Q-Learning (1)

Idea: tabulate the "quality" of state-action pairs:

$Q^{\pi}(s, a) = \mathbb{E}[R_t \mid s_t = s, a_t = a, \pi]$

Choose the action that maximizes $Q$ ($\varepsilon$-greedily).

For an optimal policy it holds that

$Q^{*}(s, a) = \begin{cases} r & \text{if } s' \text{ is terminal} \\ r + \gamma \max_{a'} Q^{*}(s', a') & \text{else} \end{cases}$

Example Q-table:

|         | Action 1 | … | Action m |
|---------|----------|---|----------|
| State 1 | 1        | … | 0.5      |
| State 2 | 0        | … | 0.6      |
| …       | …        | … | …        |
| State n | 0.7      | … | 2        |

Page 9: Solving RL: Q-Learning (2) [Watkins 89]

Iterate two steps:

1. Choose the action that maximizes $Q$ ($\varepsilon$-greedily).
2. Update the table according to

$Q^{*}(s, a) = \begin{cases} r & \text{if } s' \text{ is terminal} \\ r + \gamma \max_{a'} Q^{*}(s', a') & \text{else} \end{cases}$

Repeat until convergence.

|         | Action 1 | … | Action m |
|---------|----------|---|----------|
| State 1 | 1        | … | 0.5      |
| State 2 | 0        | … | 0.6      |
| …       | …        | … | …        |
| State n | 0.7      | … | 2        |
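The two iterated steps translate almost directly into code. The sketch below is tabular Q-learning with the slide's update rule (the table entry is set directly to the target; practical implementations usually blend old and new values with a learning rate). The Gym-style environment and the `actions` list are assumptions for illustration.

```python
import random
from collections import defaultdict

def q_learning(env, actions, episodes=1000, gamma=0.5, epsilon=0.1):
    Q = defaultdict(float)                        # Q[(state, action)], defaults to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            if random.random() < epsilon:         # explore
                a = random.choice(actions)
            else:                                 # exploit: argmax_a Q(s, a)
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done, _ = env.step(a)
            if done:                              # s' terminal
                Q[(s, a)] = r
            else:                                 # r + gamma * max_a' Q(s', a')
                Q[(s, a)] = r + gamma * max(Q[(s_next, x)] for x in actions)
            s = s_next
    return Q
```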

Page 10: Problem: Q-Tables Do Not Scale

|                  | Action 1 | … | Action 361 |
|------------------|----------|---|------------|
| State 1          | 1        | … | 0.5        |
| State 2          | 0        | … | 0.6        |
| …                | …        | … | …          |
| State $10^{170}$ | 0.7      | … | 2          |

(Diagram: input state mapped to Q-values)

Page 11: Reinforcement Learning with Neural Networks

(Diagram: input state → learned representation → Q-values)

$Q(s, a, \theta) \approx Q^{*}(s, a)$

Page 12: Reinforcement Learning with Deep Neural Networks [Mnih et al. 2015]

(Diagram: input state → deep network representation → Q-values)

Taken from "Human-level control through deep reinforcement learning" (Mnih et al. 2015)

Page 13: Deep Q-Learning

The network is parameterized by $\theta$.

$\theta$ can be trained by minimizing the loss

$L(\theta) = \mathbb{E}\left[\left(\mathit{target}_Q - \mathit{estimated}_Q\right)^2\right] = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a', \theta) - Q(s, a, \theta)\right)^2\right]$

(Diagram: the network produces Q-values for state $s$ and for the successor state $s'$ reached after action $a$; the gradient $\partial L / \partial \theta$ is computed from $\max_{a'} Q(s', a', \theta)$, $Q(s, a, \theta)$, and the reward $r$ for action $a$.)
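A sketch of this loss for a batch of transitions, written with PyTorch (the talk does not prescribe a library). `q_net` and the batch tensors are illustrative assumptions; as on this slide, the same network produces both the estimate and the bootstrap target.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, states, actions, rewards, next_states, dones, gamma=0.99):
    # estimated Q: Q(s, a, theta) for the actions that were actually taken
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    # target Q: r + gamma * max_a' Q(s', a', theta); treated as a constant
    with torch.no_grad():
        max_next = q_net(next_states).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * max_next   # dones: 0/1 floats
    return F.mse_loss(q_sa, target)
```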

Page 14: Issues with Deep Q-Learning

▪ Delayed rewards
▪ Vast state spaces
▪ Dynamic environment / actions influence the training data
▪ Feedback loops
▪ Updates affect the Q-function globally

(Diagram: the same network provides both the target $\max_{a'} Q(s', a', \theta)$ and the estimate $Q(s, a, \theta)$, so the update $\partial L / \partial \theta$ feeds back into its own target.)

Page 15: Breaking Feedback Loops

(Diagram: a target network ($\theta^{-}$) computes $\max_{a'} Q(s', a', \theta^{-})$ from $s'$, while the trained network ($\theta$) computes $Q(s, a, \theta)$; the gradient $\partial L / \partial \theta$ updates only the trained network.)
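In code, the target network is just a frozen copy of the trained one that is refreshed only occasionally. A sketch assuming PyTorch-style modules with `state_dict()`/`load_state_dict()`; `q_net` stands for the trained network $\theta$, and the sync interval is an illustrative choice.

```python
import copy

def make_target_net(q_net):
    """Create theta^- as a frozen copy of the trained network theta."""
    target_net = copy.deepcopy(q_net)
    for p in target_net.parameters():
        p.requires_grad_(False)          # gradients never flow into theta^-
    return target_net

def maybe_sync_target(step, q_net, target_net, sync_every=10_000):
    """Every sync_every steps, copy theta into theta^-, breaking the feedback loop."""
    if step % sync_every == 0:
        target_net.load_state_dict(q_net.state_dict())
```

In the loss on page 13, the bootstrap term would then read `target_net(next_states)` instead of `q_net(next_states)`.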

Page 16: Breaking Feedback Loops

(Diagram: the trained parameters $\theta$ are periodically copied into the target network $\theta^{-}$.)

Page 17: Issues with Deep Q-Learning

▪ Delayed rewards
▪ Vast state spaces
▪ Dynamic environment / actions influence the training data
▪ Feedback loops
▪ Updates affect the Q-function globally

Page 18: Experience Replay

Problem
Learning from new experiences might override previous progress!

Idea
Store state transitions during exploration; sample from past experience during learning to maintain diversity.

(Diagram: stored transitions such as $(a = f, r = 0)$, $(a = f, r = 1)$, $(a = b, r = 0)$ are sampled for the gradient update $\partial L / \partial \theta$.)
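A minimal replay-buffer sketch for this idea: transitions are pushed during exploration and sampled uniformly at random during learning, so each update sees a diverse mix of old and new experience. Capacity and batch size are illustrative choices, not values from the talk.

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)      # oldest transitions fall out

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)
```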

Page 19: Issues with Deep Q-Learning

▪ Delayed rewards
▪ Vast state spaces
▪ Dynamic environment / actions influence the training data
▪ Feedback loops
▪ Updates affect the Q-function globally

Page 20: N-Step Methods

Problem
Rewards propagate through the DQN slowly!

Idea
Take multiple actions in sequence before updating the network.
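One way to picture this is how the targets are built: after acting for up to $n$ steps, every visited state-action pair gets a target formed from the observed rewards plus a bootstrapped value at the end of the segment, so a reward influences up to $n$ entries per update. A sketch under those assumptions; `bootstrap_value` stands for $\max_{a'} Q(s_n, a', \theta^{-})$ (0 if the segment ended in a terminal state).

```python
def n_step_targets(rewards, bootstrap_value, gamma=0.99):
    """rewards: [r_1, ..., r_n] collected over one n-step segment.
    Returns one target per visited step, computed backwards from the segment end."""
    targets = []
    R = bootstrap_value
    for r in reversed(rewards):
        R = r + gamma * R          # R_t = r_t + gamma * R_{t+1}
        targets.append(R)
    targets.reverse()
    return targets
```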

Page 21: Parallel Deep Q-Learning

(Diagram: a parameter server holds $\theta$ and exchanges it with Agent 1 … Agent N.)

Page 22: Massively Parallel Methods of Deep RL [Nair et al. 2015]

▪ Distributed version of DQN
▪ (Then) state-of-the-art results on the Atari domain
▪ Training takes days on 100+ machines

Taken from "Massively Parallel Methods for Deep Reinforcement Learning" (Nair et al. 2015)

Page 23: Going HOGWILD!

(Diagram: a single machine; Agent/Thread 1 … Agent/Thread N share the parameters $\theta$ in RAM and update them without locks.)
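Purely as an illustration of the lock-free pattern, here is a HOGWILD!-style sketch: several threads read and update one shared parameter vector with no synchronization, tolerating occasional overwrites. `compute_gradient` is a placeholder; real asynchronous RL workers derive gradients from their own environment interactions, and the original work runs multi-threaded native code rather than Python (CPython's GIL serializes these Python-level updates).

```python
import threading
import numpy as np

def worker(theta, compute_gradient, steps=10_000, lr=0.01):
    for _ in range(steps):
        grad = compute_gradient(theta)   # read shared theta without a lock
        theta -= lr * grad               # in-place update, also without a lock

def run_hogwild(compute_gradient, n_threads=8, dim=1000):
    theta = np.zeros(dim)                # parameters shared by all threads (in RAM)
    threads = [threading.Thread(target=worker, args=(theta, compute_gradient))
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return theta
```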

Page 24: Asynchronous one-step Q-learning [Mnih et al. 2016]

(Diagram: Agent/Thread 1 … Agent/Thread N share $\theta$ and $\theta^{-}$; accumulated gradients are applied to $\theta$ every 5 frames, and $\theta^{-}$ is synchronized with $\theta$ every 40000 frames.)
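A pseudocode-style sketch of one such worker thread, following the schedule on this slide. `shared` is an assumed container holding the shared `theta` and `theta_minus`; `epsilon_greedy`, `compute_gradient` and `apply_gradients` are placeholders. The paper drives the target-network refresh with a counter shared across all threads; a per-thread counter is used here for simplicity.

```python
def async_q_worker(env, shared, I_update=5, I_target=40_000, gamma=0.99):
    t, accumulated_grads = 0, None
    s = env.reset()
    while True:
        a = epsilon_greedy(shared.theta, s)
        s_next, r, done, _ = env.step(a)
        # one-step target r + gamma * max_a' Q(s', a', theta^-) uses the shared theta^-
        g = compute_gradient(shared.theta, shared.theta_minus,
                             s, a, r, s_next, done, gamma)
        accumulated_grads = g if accumulated_grads is None else accumulated_grads + g
        t += 1
        if t % I_update == 0:                 # every 5 frames: apply and reset
            apply_gradients(shared.theta, accumulated_grads)
            accumulated_grads = None
        if t % I_target == 0:                 # every 40000 frames: theta^- <- theta
            shared.theta_minus = shared.theta.copy()
        s = env.reset() if done else s_next
```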

Page 25: What about Experience Replay?

Experience replay was scrapped!

Intuition
▪ Parallel learners explore different paths anyway!
▪ Parallelism has a similar effect to experience replay.
▪ Diversity can be enforced by different exploration policies.

(Diagram: experience replay draws the update $\partial L / \partial \theta$ from stored transitions such as $(a = f, r = 0)$, $(a = f, r = 1)$; actor parallelism instead draws updates from Agent/Thread 1 … Agent/Thread N.)

Page 26: Policy Gradient Methods

The policy is parametrized directly as

$\pi(a_t \mid s_t; \theta)$

and trained by gradient ascent on

$\nabla_{\theta} \log \pi(a_t \mid s_t; \theta) \, R_t,$

which is an unbiased estimate of $\nabla_{\theta} \mathbb{E}[R_t]$.

(Diagram: the state is fed into $\pi(a_t \mid s_t; \theta)$, which outputs a distribution over the actions $a_1, a_2, a_3, \ldots, a_n$.)
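A PyTorch sketch of this estimate: since standard optimizers minimize, the loss is written as $-\log \pi(a_t \mid s_t; \theta)\, R_t$, so minimizing it performs the gradient ascent described above. `policy_net` and the rollout tensors (`states`, `actions`, `returns`) are illustrative assumptions.

```python
import torch

def policy_gradient_loss(policy_net, states, actions, returns):
    logits = policy_net(states)                              # unnormalized action scores
    log_probs = torch.log_softmax(logits, dim=1)             # log pi(a | s; theta)
    log_pi_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    # minimizing -log pi(a_t|s_t) * R_t == ascending grad_theta log pi * R_t
    return -(log_pi_a * returns).mean()
```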

Page 27: Asynchronous Advantage Actor-Critic (A3C)

Advantage actor-critic:

$\nabla_{\theta} \log \pi(a_t \mid s_t; \theta) \, \bigl(R_t - V(s_t, \theta_v)\bigr)$

Note:
▪ $R_t$ depends on the action taken, $a_t$.
▪ $V(s_t, \theta_v)$ depends on the state only.
▪ $R_t - V(s_t, \theta_v)$ is called the advantage of $a_t$.

(Diagram: Agent/Thread 1 … Agent/Thread N share $\theta$ and $\theta_v$ and output $\pi(a_t \mid s_t; \theta)$ and $V(s_t, \theta_v)$.)
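The objective one A3C worker computes from its rollout can be sketched as follows, again in PyTorch: the policy head is pushed along $\log \pi(a_t \mid s_t; \theta)\,(R_t - V(s_t, \theta_v))$ while the value head regresses $V$ towards $R_t$. The two-headed `ac_net` and the rollout tensors are assumptions; the actual A3C additionally adds an entropy bonus and applies the resulting gradients asynchronously to the shared parameters.

```python
import torch
import torch.nn.functional as F

def a3c_loss(ac_net, states, actions, returns, value_coef=0.5):
    logits, values = ac_net(states)                 # pi(.|s; theta) and V(s; theta_v)
    values = values.squeeze(1)
    log_probs = F.log_softmax(logits, dim=1)
    log_pi_a = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
    advantage = returns - values                    # R_t - V(s_t, theta_v)
    policy_loss = -(log_pi_a * advantage.detach()).mean()  # advantage as a constant
    value_loss = F.mse_loss(values, returns)                # critic: V -> R_t
    return policy_loss + value_coef * value_loss
```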

Page 28: Asynchronous Advantage Actor-Critic (A3C) [Mnih et al. 2016]

(Diagram: Agent/Thread 1 … Agent/Thread N share $\theta$ and $\theta_v$ and output $\pi(a_t \mid s_t; \theta)$ and $V(s_t, \theta_v)$.)

Page 29: Evaluation on the Atari Domain

(Figure: normalized scores on 57 Atari games [Mnih et al. 2016])

Page 30: Evaluation: Asynchronous Methods vs. DQN

(Figure: averaged scores of the different algorithms for 5 games over training time [Mnih et al. 2016])

Page 31: (Super-)Linear Speedup

(Figure: average speedup of the different algorithms for different thread counts [Mnih et al. 2016])

(Figure: average speedup of A3C for different thread counts [Mnih et al. 2016])

Page 32: Evaluation with TORCS

(Figure: averaged scores of the different algorithms on the TORCS racing simulator [Mnih et al. 2016])

Page 33: Stability & Robustness

(Figure: scores of A3C with different learning rates for 5 games [Mnih et al. 2016])

Page 34: Stability & Robustness

(Figure: scores of A3C with different learning rates for 5 games [Mnih et al. 2016])

(Table: raw scores of the different algorithms for a selection of Atari games [Mnih et al. 2016])

Page 35: Takeaway Message

▪ Deep reinforcement learning is hard
▪ It requires techniques like experience replay
▪ Deep RL is easily parallelizable
▪ Parallelism can replace experience replay
▪ Dropping experience replay allows on-policy methods like actor-critic
▪ A3C surpasses state-of-the-art performance

Page 36: Questions?

Page 37: One-step Q-learning vs. N-step Q-learning

Page 38: Q-Learning: A Toy Example

Current Q-table:

|        | Backwards | Forwards |
|--------|-----------|----------|
| Start  | 0         | 0        |
| Street | 0         | 0        |
| Goal   | –         | –        |

Hyperparameters: $\varepsilon = 0.999$, $\gamma = 0.5$

Next action: go forwards

Page 39: Q-Learning: A Toy Example

Previous action: go forwards. No reward.

Current Q-table:

|        | Backwards | Forwards |
|--------|-----------|----------|
| Start  | 0         | 0        |
| Street | 0         | 0        |
| Goal   | –         | –        |

Next action: go forwards

Page 40: Q-Learning: A Toy Example

Previous action: go forwards. Reward: +1.

Update the table according to

$Q^{*}(s, a) = \begin{cases} r & \text{if } s' \text{ is terminal} \\ r + \gamma \max_{a'} Q^{*}(s', a') & \text{else} \end{cases}$

Current Q-table:

|        | Backwards | Forwards |
|--------|-----------|----------|
| Start  | 0         | 0        |
| Street | 0         | 1        |
| Goal   | –         | –        |

Page 41: Q-Learning: A Toy Example

The game has restarted!

Current Q-table:

|        | Backwards | Forwards |
|--------|-----------|----------|
| Start  | 0         | 0        |
| Street | 0         | 1        |
| Goal   | –         | –        |

Next action: go forwards

Page 42: Q-Learning: A Toy Example

Previous action: go forwards. No reward, but a rewarding follow-up action.

Update the table according to

$Q^{*}(s, a) = \begin{cases} r & \text{if } s' \text{ is terminal} \\ r + \gamma \max_{a'} Q^{*}(s', a') & \text{else} \end{cases}$

Current Q-table:

|        | Backwards | Forwards |
|--------|-----------|----------|
| Start  | 0         | 0.5      |
| Street | 0         | 1        |
| Goal   | –         | –        |

Next action: go backwards

Page 43: Q-Learning: A Toy Example

Previous action: go backwards. No reward, but a rewarding follow-up action.

Update the table according to

$Q^{*}(s, a) = \begin{cases} r & \text{if } s' \text{ is terminal} \\ r + \gamma \max_{a'} Q^{*}(s', a') & \text{else} \end{cases}$

Current Q-table:

|        | Backwards | Forwards |
|--------|-----------|----------|
| Start  | 0         | 0.5      |
| Street | 0.25      | 1        |
| Goal   | –         | –        |

Next action: go backwards

Page 44: Q-Learning: A Toy Example

Current Q-table:

|        | Backwards | Forwards |
|--------|-----------|----------|
| Start  | 0.25      | 0.5      |
| Street | 0.25      | 1        |
| Goal   | –         | –        |

Done
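The table values on pages 40-44 can be reproduced with a few lines of Python. The sketch below applies the slide's update rule to the four transitions that change the table, in the order the slides show them; it assumes the layout implied by the example (Start → Street → Goal, reward +1 only for reaching the Goal, and going backwards from Start leaving the agent in Start).

```python
gamma = 0.5
Q = {("Start", "b"): 0.0, ("Start", "f"): 0.0,
     ("Street", "b"): 0.0, ("Street", "f"): 0.0}

def update(s, a, r, s_next, terminal):
    # Q*(s, a) = r if s' terminal, else r + gamma * max_a' Q*(s', a')
    Q[(s, a)] = r if terminal else r + gamma * max(Q[(s_next, "b")], Q[(s_next, "f")])

update("Street", "f", 1, "Goal", terminal=True)    # Q(Street, f) = 1     (page 40)
update("Start", "f", 0, "Street", terminal=False)  # Q(Start, f)  = 0.5   (page 42)
update("Street", "b", 0, "Start", terminal=False)  # Q(Street, b) = 0.25  (page 43)
update("Start", "b", 0, "Start", terminal=False)   # Q(Start, b)  = 0.25  (page 44)
print(Q)
```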