
Policy estimate from training episodes

We have:

- an unknown grid world of unknown size and structure
- a robot/agent that moves in unknown directions with unknown parameters
→ We do not know anything
- we only have a few episodes that the robot tried

What to do?

A: Run away :-)
B: Examine the episodes and learn
C: Guess
D: Try something


Example I

Episode 1        Episode 2        Episode 3        Episode 4
(B,→,C,−3)       (B,←,A,−1)       (C,→,D,−3)       (C,←,B,−1)
(C,→,D,−3)       (A,→,exit,6)     (D,→,exit,6)     (B,→,C,−3)
(D,←,exit,6)                                       (C,←,B,−1)
                                                   (B,←,A,−1)
                                                   (A,←,exit,6)

Each entry in the table is a tuple (s, a, s′, r); the discount factor γ = 1 is known.
Task: for the non-terminal states, determine the optimal policy. Use model-based learning.

What do we have to learn (model-based learning)?

A: policy π
B: state set S, policy π
C: state set S, action set A, transition model p(s′|s, a)
D: state set S, action set A, rewards r, transition model p(s′|s, a)


Example I (episodes as in the table above; each entry is a tuple (s, a, s′, r))

What is the state set S?

A: S = {B,C}
B: S = {A,B,C,D, exit}
C: S = {A,B,C,D}
D: S = {A,D}


Example I (episodes as above)

State set S = {A,B,C,D}

What are the terminal states?

A: {A,B,C,D}
B: {A,D}
C: {B,C}
D: {A,C,D}


Example I (episodes as above)

State set S = {A,B,C,D}, terminal states: {A,D}

What are the non-terminal states?

A: {A,B,C,D}
B: {A,D}
C: {B,C}
D: {A,B,C}


Example I (episodes as above)

State set S = {A,B,C,D}, terminal states: {A,D}, non-terminal states: {B,C}

What is the action set?

A: {→,←}
B: {→,←,↑,↓}
C: {→,←,↑}
D: {→,←,↓}


Example I (episodes as above)

State set S = {A,B,C,D}, terminal states: {A,D}, non-terminal states: {B,C}
Action set A = {→,←}

What is the transition model?

A: deterministic
B: non-deterministic

Let's examine :-)


Example I (episodes as above)

State set S = {A,B,C,D}, terminal states: {A,D}, non-terminal states: {B,C}
Action set A = {→,←}

What is the transition model? How to compute it?

A: for each state and action
B: for each state, action and new state
C: for each state
D: for each action and new state


Example I (episodes as above)

State set S = {A,B,C,D}, terminal states: {A,D}, non-terminal states: {B,C}
Action set A = {→,←}

What is the transition model? How to compute it?

1. for each state, action and new state
2. A: as relative frequencies in one episode
   B: as the sum of occurrences in one episode
   C: as relative frequencies in all episodes
   D: as the sum of occurrences in all episodes



Example I (episodes as above)

State set S = {A,B,C,D}, terminal states: {A,D}, non-terminal states: {B,C}
Action set A = {→,←}

What is the transition model? How to compute it?

1. for each state, action and new state
2. as relative frequencies in all episodes

Evaluate p(C|B,→):

A: 1 = #(B,→,C,·) / #(B,→,·,·) = 2/2
B: 2/3
C: 1/2
D: 1/3
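To make the counting concrete, here is a minimal Python sketch (not from the slides; the episode tuples are transcribed from the table above) that estimates the transition model and rewards of Example I as relative frequencies over all episodes.

```python
from collections import Counter

# Example I episodes, transcribed from the table; each step is (s, a, s', r).
episodes = [
    [("B", "→", "C", -3), ("C", "→", "D", -3), ("D", "←", "exit", 6)],
    [("B", "←", "A", -1), ("A", "→", "exit", 6)],
    [("C", "→", "D", -3), ("D", "→", "exit", 6)],
    [("C", "←", "B", -1), ("B", "→", "C", -3), ("C", "←", "B", -1),
     ("B", "←", "A", -1), ("A", "←", "exit", 6)],
]

triples = Counter()   # counts of (s, a, s')
pairs = Counter()     # counts of (s, a)
reward = {}           # observed reward for each (s, a, s')
for episode in episodes:
    for s, a, s2, r in episode:
        triples[(s, a, s2)] += 1
        pairs[(s, a)] += 1
        reward[(s, a, s2)] = r

# p(s' | s, a) = #(s, a, s') / #(s, a), estimated over all episodes.
for (s, a, s2), n in sorted(triples.items()):
    print(f"p({s2}|{s},{a}) = {n}/{pairs[(s, a)]}   r({s},{a},{s2}) = {reward[(s, a, s2)]}")
# Among others: p(C|B,→) = 2/2, p(A|B,←) = 2/2, p(D|C,→) = 2/2, p(B|C,←) = 2/2.
```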


Example I (episodes as above)

State set S = {A,B,C,D}, terminal states: {A,D}, non-terminal states: {B,C}
Action set A = {→,←}

What is the transition model?

p(C|B,→) = 2/2 = 1
p(A|B,←) = 2/2 = 1
p(D|C,→) = 2/2 = 1
p(B|C,←) = 2/2 = 1

A: non-deterministic
B: deterministic


Example I (episodes as above)

State set S = {A,B,C,D}, terminal states: {A,D}, non-terminal states: {B,C}
Action set A = {→,←}
Deterministic transition model: p(C|B,→) = p(A|B,←) = p(D|C,→) = p(B|C,←) = 2/2 = 1

What is the world structure?

A: A C B D
B: A B C D
C: B A C D


Example I (episodes as above)

State set S = {A,B,C,D}, terminal states: {A,D}, non-terminal states: {B,C}
Action set A = {→,←}
Deterministic transition model: p(C|B,→) = p(A|B,←) = p(D|C,→) = p(B|C,←) = 2/2 = 1
World structure: A B C D

What is a correct value for the reward function?

A: r(B) = −1
B: r(B,←,A) = −4
C: r(B) = −3
D: r(B,←) = −1


Example I (episodes as above)

State set S = {A,B,C,D}, terminal states: {A,D}, non-terminal states: {B,C}
Action set A = {→,←}
Deterministic transition model: p(C|B,→) = p(A|B,←) = p(D|C,→) = p(B|C,←) = 2/2 = 1
World structure: A B C D

- r(B,←) = −1

What is also correct for the reward function?

A: r(B) = −1
B: r(B,→) = −3
C: r(B) = −3
D: r(B,→,C) = −1


Example I (episodes as above)

State set S = {A,B,C,D}, terminal states: {A,D}, non-terminal states: {B,C}
Action set A = {→,←}
Deterministic transition model: p(C|B,→) = p(A|B,←) = p(D|C,→) = p(B|C,←) = 2/2 = 1
World structure: A B C D

- r(B,←) = −1, r(B,→) = −3

What is also correct for the reward function?

A: r(C) = −1
B: r(C,←,B) = −3
C: None
D: r(C,←) = −1


Example I (episodes as above)

State set S = {A,B,C,D}, terminal states: {A,D}, non-terminal states: {B,C}
Action set A = {→,←}
Deterministic transition model: p(C|B,→) = p(A|B,←) = p(D|C,→) = p(B|C,←) = 2/2 = 1
World structure: A B C D

- r(B,←) = −1, r(B,→) = −3, r(C,←) = −1

What is also correct for the reward function?

A: r(C) = −1
B: r(C,→) = −3
C: r(C) = −3
D: r(C,→,D) = −4


Example I (episodes as above)

State set S = {A,B,C,D}, terminal states: {A,D}, non-terminal states: {B,C}
Action set A = {→,←}
Deterministic transition model: p(C|B,→) = p(A|B,←) = p(D|C,→) = p(B|C,←) = 2/2 = 1
World structure: A B C D

- r(B,←) = −1, r(B,→) = −3, r(C,←) = −1, r(C,→) = −3

Discussion point: do we need more reward values?

A: Yes, for all states and actions.
B: No.
C: Yes, for the terminal states.


Example I (episodes as above)

State set S = {A,B,C,D}, terminal states: {A,D}, non-terminal states: {B,C}
Action set A = {→,←}
Deterministic transition model: p(C|B,→) = p(A|B,←) = p(D|C,→) = p(B|C,←) = 2/2 = 1
World structure: A B C D
Reward function: r({B,C},←) = −1, r({B,C},→) = −3

Add also the terminal-state rewards: r({A,D},{←,→}) = 6


Example I (episodes as above)

State set S = {A,B,C,D}, terminal states: {A,D}, non-terminal states: {B,C}
Action set A = {→,←}
Deterministic transition model: p(C|B,→) = p(A|B,←) = p(D|C,→) = p(B|C,←) = 2/2 = 1
World structure: A B C D
Reward function: r({B,C},←) = −1, r({B,C},→) = −3, r({A,D},{←,→}) = 6

Do we have all we need?

A: Yes
B: No

Let's compute the policy.


Example I (episodes as above)

State set S = {A,B,C,D}, terminal states: {A,D}, non-terminal states: {B,C}
Action set A = {→,←}
Deterministic transition model: p(C|B,→) = p(A|B,←) = p(D|C,→) = p(B|C,←) = 2/2 = 1
World structure: A B C D
Reward function: r({B,C},←) = −1, r({B,C},→) = −3, r({A,D},{←,→}) = 6

Observation: Immediate rewards significantly decrease state value.

A: Best is to go directly to a terminal state
B: We can go to the terminal state arbitrarily



Example I (episodes as above)

State set S = {A,B,C,D}, terminal states: {A,D}, non-terminal states: {B,C}
Action set A = {→,←}
Deterministic transition model: p(C|B,→) = p(A|B,←) = p(D|C,→) = p(B|C,←) = 2/2 = 1
World structure: A B C D
Reward function: r({B,C},←) = −1, r({B,C},→) = −3, r({A,D},{←,→}) = 6

Observation: Immediate rewards significantly decrease state value. → Best is to go directly to a terminal state.

Compute:

A: q(B,←) = B ← A = 6 − 1 = 5
B: q(B,←) = 3
C: q(B,←) = −1
D: q(B,←) = −3


Example I (episodes as above)

State set S = {A,B,C,D}, terminal states: {A,D}, non-terminal states: {B,C}
Action set A = {→,←}
Deterministic transition model: p(C|B,→) = p(A|B,←) = p(D|C,→) = p(B|C,←) = 2/2 = 1
World structure: A B C D
Reward function: r({B,C},←) = −1, r({B,C},→) = −3, r({A,D},{←,→}) = 6

Observation: Immediate rewards significantly decrease state value. → Best is to go directly to a terminal state.

Compute:

- q(B,←) = 5

(What can we assume about π(C)?)

A: q(B,→) = 5
B: q(B,→) = 3
C: q(B,→) = B → C → D = 6 − 3 − 3 = 0


Example I (episodes as above)

State set S = {A,B,C,D}, terminal states: {A,D}, non-terminal states: {B,C}
Action set A = {→,←}
Deterministic transition model: p(C|B,→) = p(A|B,←) = p(D|C,→) = p(B|C,←) = 2/2 = 1
World structure: A B C D
Reward function: r({B,C},←) = −1, r({B,C},→) = −3, r({A,D},{←,→}) = 6

Observation: Immediate rewards significantly decrease state value. → Best is to go directly to a terminal state.

Compute:

- q(B,←) = 5
- q(B,→) = 0

→ π(B) = ←



Example I (episodes as above)

State set S = {A,B,C,D}, terminal states: {A,D}, non-terminal states: {B,C}
Action set A = {→,←}
Deterministic transition model: p(C|B,→) = p(A|B,←) = p(D|C,→) = p(B|C,←) = 2/2 = 1
World structure: A B C D
Reward function: r({B,C},←) = −1, r({B,C},→) = −3, r({A,D},{←,→}) = 6

Observation: Immediate rewards significantly decrease state value. → Best is to go directly to a terminal state.
π(B) = ←. Compute now π(C):

A: q(C,→) = 5
B: q(C,→) = C → D = 6 − 3 = 3
C: q(C,→) = 0
D: q(C,→) = −3


Example I (episodes as above)

State set S = {A,B,C,D}, terminal states: {A,D}, non-terminal states: {B,C}
Action set A = {→,←}
Deterministic transition model: p(C|B,→) = p(A|B,←) = p(D|C,→) = p(B|C,←) = 2/2 = 1
World structure: A B C D
Reward function: r({B,C},←) = −1, r({B,C},→) = −3, r({A,D},{←,→}) = 6

Observation: Immediate rewards significantly decrease state value. → Best is to go directly to a terminal state.
π(B) = ←. Compute now π(C):

- q(C,→) = 3

A: q(C,←) = C ← B ← A = 6 − 1 − 1 = 4
B: q(C,←) = 3
C: q(C,←) = 0


Example I (episodes as above)

State set S = {A,B,C,D}, terminal states: {A,D}, non-terminal states: {B,C}
Action set A = {→,←}
Deterministic transition model: p(C|B,→) = p(A|B,←) = p(D|C,→) = p(B|C,←) = 2/2 = 1
World structure: A B C D
Reward function: r({B,C},←) = −1, r({B,C},→) = −3, r({A,D},{←,→}) = 6

Observation: Immediate rewards significantly decrease state value. → Best is to go directly to a terminal state.
π(B) = ←. Compute now π(C):

- q(C,→) = 3
- q(C,←) = 4

→ π(C) = ←


Example I (episodes as above)

State set S = {A,B,C,D}, terminal states: {A,D}, non-terminal states: {B,C}
Action set A = {→,←}
Deterministic transition model: p(C|B,→) = p(A|B,←) = p(D|C,→) = p(B|C,←) = 2/2 = 1
World structure: A B C D
Reward function: r({B,C},←) = −1, r({B,C},→) = −3, r({A,D},{←,→}) = 6

Observation: Immediate rewards significantly decrease state value. → Best is to go directly to a terminal state.

Solution:

- π(B) = ←
- π(C) = ←
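Given the learned deterministic model, the policy can also be recovered mechanically. Below is a minimal value-iteration sketch (not from the slides; γ = 1, and the exit reward 6 is collected on the terminal transition). Note that it reports q(B,→) = 1 rather than the 0 computed above, because value iteration assumes the optimal continuation π(C) = ← after reaching C, while the slides followed the path B → C → D; the resulting policy is the same.

```python
# Learned Example I model: (s, a) -> (s', r); A and D exit with reward 6; γ = 1.
model = {
    ("B", "←"): ("A", -1), ("B", "→"): ("C", -3),
    ("C", "←"): ("B", -1), ("C", "→"): ("D", -3),
    ("A", "←"): ("exit", 6), ("A", "→"): ("exit", 6),
    ("D", "←"): ("exit", 6), ("D", "→"): ("exit", 6),
}
states, actions = ["A", "B", "C", "D"], ["←", "→"]

V = {s: 0.0 for s in states}
V["exit"] = 0.0
for _ in range(20):                       # value iteration (converges quickly here)
    new_V = dict(V)
    for s in states:
        new_V[s] = max(r + V[s2] for s2, r in (model[(s, a)] for a in actions))
    V = new_V

q = {(s, a): model[(s, a)][1] + V[model[(s, a)][0]]
     for s in ("B", "C") for a in actions}
policy = {s: max(actions, key=lambda a, s=s: q[(s, a)]) for s in ("B", "C")}
print(q)       # q(B,←)=5, q(B,→)=1, q(C,←)=4, q(C,→)=3
print(policy)  # {'B': '←', 'C': '←'}
```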


Example II

Episode 1: (B,→,C,−3), (C,→,D,−3), (D,←,exit,6)
Episode 2: (B,←,A,−1), (A,→,exit,6)
Episode 3: (C,→,D,−3), (D,→,exit,6)
Episode 4: (C,←,B,−1), (B,→,C,−3), (C,←,B,−1), (B,←,A,−1), (A,←,exit,6)
Episode 5: (B,←,C,−3), (C,←,B,−1), (B,←,A,−1), (A,←,exit,6)
Episode 6: (B,→,A,−1), (A,→,exit,6)
Episode 7: (C,→,B,−1), (B,→,C,−3), (C,←,D,−3), (D,←,exit,6)
Episode 8: (C,→,D,−3), (D,→,exit,6)

Each step is a tuple (s, a, s′, r).

Calculating the policy:

- state set S
- action set A
- rewards r
- transition model p(s′|s, a)
- policy π


Example II (episodes as above)

What is the transition model?

A: deterministic
B: non-deterministic


Example II (episodes as above)

What is a correct transition probability?

A: p(C|B,→) = 0.75 (see the episodes: (B,→) occurs 4 times, three of which lead to C and one to A; hence also p(A|B,→) = 0.25)
B: p(A|B,→) = 0.75
C: p(A|B,←) = 0.25
D: p(D|B,←) = 0.75

Transition model: similarly for the other probabilities. The agent follows the given direction with probability 0.75; otherwise it goes in the other direction.
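The same relative-frequency count as in the Example I sketch, applied to the eight Example II episodes (transcribed from the table above), reproduces the 0.75/0.25 transition model. A minimal sketch, not from the slides.

```python
from collections import Counter

# Example II episodes; each step is (s, a, s', r).
episodes = [
    [("B", "→", "C", -3), ("C", "→", "D", -3), ("D", "←", "exit", 6)],
    [("B", "←", "A", -1), ("A", "→", "exit", 6)],
    [("C", "→", "D", -3), ("D", "→", "exit", 6)],
    [("C", "←", "B", -1), ("B", "→", "C", -3), ("C", "←", "B", -1),
     ("B", "←", "A", -1), ("A", "←", "exit", 6)],
    [("B", "←", "C", -3), ("C", "←", "B", -1), ("B", "←", "A", -1), ("A", "←", "exit", 6)],
    [("B", "→", "A", -1), ("A", "→", "exit", 6)],
    [("C", "→", "B", -1), ("B", "→", "C", -3), ("C", "←", "D", -3), ("D", "←", "exit", 6)],
    [("C", "→", "D", -3), ("D", "→", "exit", 6)],
]

triples, pairs = Counter(), Counter()
for episode in episodes:
    for s, a, s2, _ in episode:
        triples[(s, a, s2)] += 1
        pairs[(s, a)] += 1

for s, a in [("B", "→"), ("B", "←"), ("C", "→"), ("C", "←")]:
    for s2 in "ABCD":
        if triples[(s, a, s2)]:
            print(f"p({s2}|{s},{a}) = {triples[(s, a, s2)]}/{pairs[(s, a)]}")
# Intended direction in 3 of 4 cases, the opposite direction in 1 of 4:
# p(C|B,→) = 3/4, p(A|B,→) = 1/4, p(A|B,←) = 3/4, ..., p(B|C,←) = 3/4, p(D|C,←) = 1/4
```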


Example II (episodes as above)

What is the reward function?

A: r(B,→,C) = −3
B: r(B,→,A) = −3
C: r(B,←,A) = −3
D: r(B,←,C) = −3


Example II (episodes as above)

Result:

- States: S = {A,B,C,D}, terminal = {A,D}, non-terminal = {B,C}
- Action set: {←,→}
- Rewards: r(B,{←,→},C) = −3, r(B,{←,→},A) = −1, r(C,{←,→},B) = −1, r(C,{←,→},D) = −3
- World structure: A B C D
- Transition model: the agent follows the given direction with probability 0.75; otherwise it goes in the other direction.
- Policy: π(B) = ?, π(C) = ?


Example II (episodes as above)

Policy evaluation:

←,→  q(B,←) = ?, q(C,→) = ?
→,→  q(B,→) = ?, q(C,→) = ?
→,←  q(B,→) = ?, q(C,←) = ?
←,←  q(B,←) = ?, q(C,←) = ?


Example II (episodes as above)

A single policy computation:

←,→  q(B,←) = ?, q(C,→) = ?

A: q(B,←) = 0.5·(−1) + 0.5·(−3),  q(C,→) = 0.5·(−1) + 0.5·(−3)
B: q(B,←) = 0.25·(6 − 1) + 0.75·(−3 + V(C)),  q(C,→) = 0.25·(−1) + 0.75·(−3 + V(B))
C: q(B,←) = 0.75·(6 − 1) + 0.25·(−3 + V(C)),  q(C,→) = 0.75·(−3 + 6) + 0.25·(−1 + V(B))
D: q(B,←) = 0.75·(6 − 1) + 0.25·(−3),  q(C,→) = 0.5·(−1) + 0.25·(−3)


Example II

Episode 1 Episode 2 Episode 3 Episode 4 Episode 5 Episode 6 Episode 7 Episode 8(B,→, C ,−3) (B,←, A,−1) (C ,→,D,−3) (C ,←, B,−1) (B,←, C ,−3) (B,→, A,−1) (C ,→, B,−1) (C ,→,D,−3)(C ,→,D,−3) (A,→, exit, 6) (D,→, exit, 6) (B,→, C ,−3) (C ,←, B,−1) (A,→, exit, 6) (B,→, C ,−3) (D,→, exit, 6)(D,←, exit, 6) (C ,←, B,−1) (B,←, A,−1) (C ,←,D,−3)

(B,←, A,−1) (A,←, exit, 6) (D,←, exit, 6)(A,←, exit, 6)

A single policy computation. As the policy is fixed V (B) = q(B,←),V (C ) = q(C ,→):

I q(B,←) = .75 · (6− 1) + .25 · (−3 + q(C ,→))

I q(C ,→) = .75 · (−3 + 6) + .25 · (−1 + q(B,←))

Therefore:

I q(B,←) = .75 · 5 + .25 · (−3 + .75 · 3 + .25 · (−1 + q(B,←)) = ... ≈ 3.73

I q(C ,→) = .75 · 3 + .25 · (−1 + 3.73) ≈ 2.93

And we calculate for the remaining policies.

59 / 59
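The two equations above are linear in q(B,←) and q(C,→), so the fixed point can also be found by simply iterating them until they stop changing. A minimal sketch, not from the slides.

```python
# Fixed policy π(B)=←, π(C)=→ with the Example II model (0.75/0.25 slips, γ = 1).
qB, qC = 0.0, 0.0
for _ in range(50):
    qB = 0.75 * (6 - 1) + 0.25 * (-3 + qC)   # B: left to A and exit (−1 + 6), or slip to C (−3)
    qC = 0.75 * (-3 + 6) + 0.25 * (-1 + qB)  # C: right to D and exit (−3 + 6), or slip to B (−1)
print(round(qB, 2), round(qC, 2))            # 3.73 2.93
```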


Example II (episodes as above)

←,→  q(B,←) ≈ 3.73,  q(C,→) ≈ 2.93
→,→  q(B,→) ≈ 0.62,  q(C,→) ≈ 2.15
→,←  q(B,→) ≈ −2.29,  q(C,←) ≈ −1.71
←,←  q(B,←) ≈ 3.70,  q(C,←) ≈ 2.77

And we can determine the best policy: π(B) = ←, π(C) = →
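Finally, a minimal sketch (not from the slides) that runs the same fixed-policy evaluation for all four deterministic policies of Example II and picks the best one; up to rounding it reproduces the table above (it prints 3.69 where the slide shows ≈ 3.70).

```python
from itertools import product

# Example II model: intended move with probability 0.75, the opposite move with 0.25.
step = {("B", "←"): ("A", -1), ("B", "→"): ("C", -3),
        ("C", "←"): ("B", -1), ("C", "→"): ("D", -3)}
flip = {"←": "→", "→": "←"}

def evaluate(policy, sweeps=100):
    """Iterative policy evaluation with γ = 1; terminal values V(A) = V(D) = 6 (exit reward)."""
    V = {"A": 6.0, "D": 6.0, "B": 0.0, "C": 0.0}
    for _ in range(sweeps):
        for s in ("B", "C"):
            s1, r1 = step[(s, policy[s])]        # intended direction
            s2, r2 = step[(s, flip[policy[s]])]  # slip to the other direction
            V[s] = 0.75 * (r1 + V[s1]) + 0.25 * (r2 + V[s2])
    return V

values = {}
for pB, pC in product("←→", repeat=2):
    V = evaluate({"B": pB, "C": pC})
    values[(pB, pC)] = V["B"] + V["C"]
    print(f"π(B)={pB}, π(C)={pC}:  q(B,{pB}) ≈ {V['B']:.2f},  q(C,{pC}) ≈ {V['C']:.2f}")

# The best policy maximizes both state values; here a simple sum suffices to pick it out.
print("best policy:", max(values, key=values.get))   # ('←', '→'), i.e. π(B)=←, π(C)=→
```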