34
LAO* Paper Presentation Jonathan Eidelman CSC2542 University of Toronto Fall 2016

LAO* Paper Presentation - Teaching Labscsc2542h/fall/material/csc2542f16_lao... · The slides on AO* are those of Gholamreza Ghassem-Sani. ... Mark the best path out of CURRENT by

Embed Size (px)

Citation preview

Page 1: LAO* Paper Presentation - Teaching Labscsc2542h/fall/material/csc2542f16_lao... · The slides on AO* are those of Gholamreza Ghassem-Sani. ... Mark the best path out of CURRENT by

LAO*PaperPresentationJonathanEidelman

CSC2542UniversityofToronto

Fall2016

Page 2: LAO* Paper Presentation - Teaching Labscsc2542h/fall/material/csc2542f16_lao... · The slides on AO* are those of Gholamreza Ghassem-Sani. ... Mark the best path out of CURRENT by

Acknowledgements

TheslidesonLAO*arethoseofPascalPoupart.

TheslidesonAO*arethoseofGholamreza Ghassem-Sani.

Thankyoutobothresearchersforsharingtheirslidesontheweb.

Page 3: LAO* Paper Presentation - Teaching Labscsc2542h/fall/material/csc2542f16_lao... · The slides on AO* are those of Gholamreza Ghassem-Sani. ... Mark the best path out of CURRENT by

Module 9 LAO*

CS 886 Sequential Decision Making and Reinforcement Learning University of Waterloo

Page 4: LAO* Paper Presentation - Teaching Labscsc2542h/fall/material/csc2542f16_lao... · The slides on AO* are those of Gholamreza Ghassem-Sani. ... Mark the best path out of CURRENT by

CS886 (c) 2013 Pascal Poupart

2

Large State Space

• Value Iteration, Policy Iteration and Linear Programming – Complexity at least quadratic in |𝑆|

• Problem: |𝑆| may be very large

– Queuing problems: infinite state space – Factored problems: exponentially many states

Page 5: LAO* Paper Presentation - Teaching Labscsc2542h/fall/material/csc2542f16_lao... · The slides on AO* are those of Gholamreza Ghassem-Sani. ... Mark the best path out of CURRENT by

CS886 (c) 2013 Pascal Poupart

3

Mitigate Size of State Space

• Two ideas:

• Exploit initial state – Not all states are reachable

• Exploit heuristic ℎ

– approximation of optimal value function – usually an upper bound ℎ 𝑠 ≥ 𝑉∗ 𝑠 ∀𝑠

Page 6: LAO* Paper Presentation - Teaching Labscsc2542h/fall/material/csc2542f16_lao... · The slides on AO* are those of Gholamreza Ghassem-Sani. ... Mark the best path out of CURRENT by

CS886 (c) 2013 Pascal Poupart

4

State Space

State space |𝑆|

𝑠0 Reachable states

States reachable by 𝜋∗

Page 7: LAO* Paper Presentation - Teaching Labscsc2542h/fall/material/csc2542f16_lao... · The slides on AO* are those of Gholamreza Ghassem-Sani. ... Mark the best path out of CURRENT by

CS886 (c) 2013 Pascal Poupart

6

LAO* Algorithm • Related to

– A*: path heuristic search – AO*: tree heuristic search – LAO*: cyclic graph heuristic search

• LAO* alternates between

– State space expansion – Policy optimization

• value iteration, policy iteration, linear programming

Page 8: LAO* Paper Presentation - Teaching Labscsc2542h/fall/material/csc2542f16_lao... · The slides on AO* are those of Gholamreza Ghassem-Sani. ... Mark the best path out of CURRENT by

Slides by: Gholamreza Ghassem-Sani

AO* REVIEW

Page 9: LAO* Paper Presentation - Teaching Labscsc2542h/fall/material/csc2542f16_lao... · The slides on AO* are those of Gholamreza Ghassem-Sani. ... Mark the best path out of CURRENT by

AND/OR graphs▪ Some problems are best represented as

achieving subgoals, some of which achieved simultaneously and independently (AND)

▪ Up to now, only dealt with OR options

Possess TV set

Steal TV Earn Money Buy TV

Page 10: LAO* Paper Presentation - Teaching Labscsc2542h/fall/material/csc2542f16_lao... · The slides on AO* are those of Gholamreza Ghassem-Sani. ... Mark the best path out of CURRENT by

Searching AND/OR graphs▪ A solution in an AND-OR tree is a sub tree

whose leafs are included in the goal set

▪ Cost function: sum of costs in AND node f(n) = f(n1) + f(n2) + …. + f(nk)

▪ How can we extend A* to search AND/OR trees? The AO* algorithm.

Page 11: LAO* Paper Presentation - Teaching Labscsc2542h/fall/material/csc2542f16_lao... · The slides on AO* are those of Gholamreza Ghassem-Sani. ... Mark the best path out of CURRENT by

AND/OR search ▪ We must examine several nodes

simultaneously when choosing the next move

A

B C D38

E F G H I J17 9 27

(5) (10) (3) (4) (15) (10)

A

B C D(3)(4)

(5)(9)

Page 12: LAO* Paper Presentation - Teaching Labscsc2542h/fall/material/csc2542f16_lao... · The slides on AO* are those of Gholamreza Ghassem-Sani. ... Mark the best path out of CURRENT by

AND/OR Best-First-Search▪ Traverse the graph (from the initial node)

following the best current path. ▪ Pick one of the unexpanded nodes on that

path and expand it. Add its successors to the graph and compute f for each of them

▪ Change the expanded node’s f value to reflect its successors. Propagate the change up the graph.

▪ Reconsider the current best solution and repeat until a solution is found

Page 13: LAO* Paper Presentation - Teaching Labscsc2542h/fall/material/csc2542f16_lao... · The slides on AO* are those of Gholamreza Ghassem-Sani. ... Mark the best path out of CURRENT by

AND/OR Best-First-Search example

A

B C D(3)(4)

(5)(9)

A(5)

2.1.

A

B CD

E F(4) (4)(10)

(3)(9)

(4)(10)

3.

Page 14: LAO* Paper Presentation - Teaching Labscsc2542h/fall/material/csc2542f16_lao... · The slides on AO* are those of Gholamreza Ghassem-Sani. ... Mark the best path out of CURRENT by

AND/OR Best-First-Search example

B C D

G H E F(5) (7) (4) (4)(10)

(6)(12)

(4) (10)

4. A

Page 15: LAO* Paper Presentation - Teaching Labscsc2542h/fall/material/csc2542f16_lao... · The slides on AO* are those of Gholamreza Ghassem-Sani. ... Mark the best path out of CURRENT by

A Longer path may be better

B C D

G H E F

A

JI

Unsolvable B C D

G H E F

A

JI

Unsolvable

Page 16: LAO* Paper Presentation - Teaching Labscsc2542h/fall/material/csc2542f16_lao... · The slides on AO* are those of Gholamreza Ghassem-Sani. ... Mark the best path out of CURRENT by

Interacting Sub goals

C

D

E

A

(2)(5)

Page 17: LAO* Paper Presentation - Teaching Labscsc2542h/fall/material/csc2542f16_lao... · The slides on AO* are those of Gholamreza Ghassem-Sani. ... Mark the best path out of CURRENT by

AO* algorithm

1. Let G be a graph with only starting node INIT. 2. Repeat the followings until INIT is labeled SOLVED

or h(INIT) > FUTILITY a) Select an unexpanded node from the most promising

path from INIT (call it NODE) b) Generate successors of NODE. If there are none, set

h(NODE) = FUTILITY (i.e., NODE is unsolvable); otherwise for each SUCCESSOR that is not an ancestor of NODE do the following:

i. Add SUCCESSSOR to G. ii. If SUCCESSOR is a terminal node, label it SOLVED and

set h(SUCCESSOR) = 0. iii. If SUCCESSPR is not a terminal node, compute its h

Page 18: LAO* Paper Presentation - Teaching Labscsc2542h/fall/material/csc2542f16_lao... · The slides on AO* are those of Gholamreza Ghassem-Sani. ... Mark the best path out of CURRENT by

AO* algorithm (Cont.)c) Propagate the newly discovered information up the

graph by doing the following: let S be set of SOLVED nodes or nodes whose h values have been changed and need to have values propagated back to their parents. Initialize S to Node. Until S is empty repeat the followings:

i. Remove a node from S and call it CURRENT. ii. Compute the cost of each of the arcs emerging from

CURRENT. Assign minimum cost of its successors as its h. iii. Mark the best path out of CURRENT by marking the arc that

had the minimum cost in step ii iv. Mark CURRENT as SOLVED if all of the nodes connected to it

through new labeled arc have been labeled SOLVED v. If CURRENT has been labeled SOLVED or its cost was just

changed, propagate its new cost back up through the graph. So add all of the ancestors of CURRENT to S.

Page 19: LAO* Paper Presentation - Teaching Labscsc2542h/fall/material/csc2542f16_lao... · The slides on AO* are those of Gholamreza Ghassem-Sani. ... Mark the best path out of CURRENT by

An Example

Page 20: LAO* Paper Presentation - Teaching Labscsc2542h/fall/material/csc2542f16_lao... · The slides on AO* are those of Gholamreza Ghassem-Sani. ... Mark the best path out of CURRENT by

An ExampleA(8)

Page 21: LAO* Paper Presentation - Teaching Labscsc2542h/fall/material/csc2542f16_lao... · The slides on AO* are those of Gholamreza Ghassem-Sani. ... Mark the best path out of CURRENT by

An Example

CDB

A

(8)(1)

(2)

[12]4 5

5

[13]

Page 22: LAO* Paper Presentation - Teaching Labscsc2542h/fall/material/csc2542f16_lao... · The slides on AO* are those of Gholamreza Ghassem-Sani. ... Mark the best path out of CURRENT by

An Example

CDB

A

(8)(4)

(2)

[15]4 5

5

[13]

2

Page 23: LAO* Paper Presentation - Teaching Labscsc2542h/fall/material/csc2542f16_lao... · The slides on AO* are those of Gholamreza Ghassem-Sani. ... Mark the best path out of CURRENT by

An Example

CDB

A

(3)(4)

G

E

(2)

(1)

(0)

[15]4 5

52

24

[8]

Page 24: LAO* Paper Presentation - Teaching Labscsc2542h/fall/material/csc2542f16_lao... · The slides on AO* are those of Gholamreza Ghassem-Sani. ... Mark the best path out of CURRENT by

An Example

CDB

A

(4)(4)

G

E

(2)

(3)

(0)

[15]4 5

52

22

4

[9]

3

Page 25: LAO* Paper Presentation - Teaching Labscsc2542h/fall/material/csc2542f16_lao... · The slides on AO* are those of Gholamreza Ghassem-Sani. ... Mark the best path out of CURRENT by

An Example

CDB

A

(4)

G

E

(2)

(3)

(0)

[15]4 5

52

22

4

Solved

3

Solved

Solved

Page 26: LAO* Paper Presentation - Teaching Labscsc2542h/fall/material/csc2542f16_lao... · The slides on AO* are those of Gholamreza Ghassem-Sani. ... Mark the best path out of CURRENT by

▪ Considers the cost (> 0) for switching from one branch to another in the search

▪ Example: path finding in real life

A CBDF E G

11 4 1 2 167

f(B) = 1 + 1 = 2 (A) f(C) = 1 + 2 = 3 f(D) = 1 + 4 = 5 (B) f(A) = 1 + 3 = 4 f(B) = 1 + 5 = 6 (A) f(C) = 1 + 2 = 3 f(A) = 1 + 6 = 7 (C) f(E) = 1 + 7 = 8 f(B) = 1 + 5 = 6 (A) f(C) = 1 + 8 = 9 f(D) = 1 + 4 = 5 (B) f(A) = 1 + 9 = 10 f(F)=1+11= 12 (D) f(B) = 1 + 10 = 11

Real Time A*▪ Considers the cost (> 0) for switching from one branch to

another in the search ▪ Example: path finding in real life

A CBDF E G

11 4 1 2 167

f(B) = 1 + 1 = 2 (A) f(C) = 1 + 2 = 3 f(D) = 1 + 4 = 5 (B) f(A) = 1 + 3 = 4 f(B) = 1 + 5 = 6 (A) f(C) = 1 + 2 = 3 f(A) = 1 + 6 = 7 (C) f(E) = 1 + 7 = 8 f(B) = 1 + 5 = 6 (A) f(C) = 1 + 8 = 9 f(D) = 1 + 4 = 5 (B) f(A) = 1 + 9 = 10 f(F)=1+11= 12 (D) f(B) = 1 + 10 = 11

Page 27: LAO* Paper Presentation - Teaching Labscsc2542h/fall/material/csc2542f16_lao... · The slides on AO* are those of Gholamreza Ghassem-Sani. ... Mark the best path out of CURRENT by

Another ExampleCurrent State = S f(A) = 3 + 5 = 8 f(B) = 2 + 4 = 6

Current State = B f(S) = 2 + 8 = 10

f(A) = 4 + 5 = 9 f(C) = 1 + 5 = 6 f(E) = 4 + 2 = 6

Current State = C f(H) = 2 + 4 = 6 f(B) = 1 + 6 = 7

A B

C

S

(4)

(4) D

(5)2

(2)

G

(1)

E

(2)H

(0)

34

41

(5)

F2 13

2

Page 28: LAO* Paper Presentation - Teaching Labscsc2542h/fall/material/csc2542f16_lao... · The slides on AO* are those of Gholamreza Ghassem-Sani. ... Mark the best path out of CURRENT by

Current State = H f(C) = 2 + 7 = 9

Current State = C f(B) = 1 + 6 = 7 f(H) = ∞ Current State = B f(S) = 2 + 8 = 10 f(A) = 4 + 5 = 9 f(E) = 4 + 2 = 6 f(C) = ∞

Current State = E f(B) = 4 + 9 = 13 f(D) = 3 + 2 = 5 f(F) = 1 + 1 = 2

A B

C

S

(4)

(4) D

(5)2

(2)

G

(1)

E

(2)H

(0)

34

41

(5)

F2 13

2

Another Example

Page 29: LAO* Paper Presentation - Teaching Labscsc2542h/fall/material/csc2542f16_lao... · The slides on AO* are those of Gholamreza Ghassem-Sani. ... Mark the best path out of CURRENT by

A B

C

S

(4)

(4) D

(5)2

(2)

G

(1)

E

(2)H

(0)

34

41

(5)

F2 13

2

Current State = F f(E) = 1 + 5 = 6

Current State = E f(D) = 3 + 2 = 5 f(B) = 4 + 9 = 13 f(F) = ∞ Current State = D f(G) = 2 + 0 = 2 f(E) = 3 + 13 = 16

Visited Nodes = S, B, C, H, C, B, E, F, E, D, G

Path = S, B, E, D, G

Another Example

Page 30: LAO* Paper Presentation - Teaching Labscsc2542h/fall/material/csc2542f16_lao... · The slides on AO* are those of Gholamreza Ghassem-Sani. ... Mark the best path out of CURRENT by

CS886 (c) 2013 Pascal Poupart

7

Terminology • 𝑆: state space

• 𝑆𝐸 ⊆ 𝑆: envelope – Growing set of states

• 𝑆𝑇 ⊆ 𝑆𝐸: terminal states – States whose children are not in the envelope

• 𝑆𝑠0𝜋 ⊆ 𝑆𝐸: states reachable from 𝑠0 by following 𝜋

• ℎ(𝑠): heuristic such that ℎ 𝑠 ≥ 𝑉∗ 𝑠 ∀𝑠 – E.g., ℎ 𝑠 = max

𝑠,𝑎𝑅(𝑠, 𝑎)/(1 − 𝛾)

Page 31: LAO* Paper Presentation - Teaching Labscsc2542h/fall/material/csc2542f16_lao... · The slides on AO* are those of Gholamreza Ghassem-Sani. ... Mark the best path out of CURRENT by

CS886 (c) 2013 Pascal Poupart

8

LAO* Algorithm

LAO*(MDP, heuristic ℎ) 𝑆𝐸 ← {𝑠0}, 𝑆𝑇 ← {𝑠0} Repeat

Let 𝑅𝐸 𝑠, 𝑎 = ℎ(𝑠) 𝑠 ∈ 𝑆𝑇𝑅(𝑠, 𝑎) otherwise

Let 𝑇𝐸(𝑠′|𝑠, 𝑎) = 0 𝑠 ∈ 𝑆𝑇

Pr (𝑠′|𝑠, 𝑎) otherwise

Find optimal policy 𝜋 for 𝑆𝐸, 𝑅𝐸, 𝑇𝐸 Find reachable states 𝑆𝑠0

𝜋 Select reachable terminal states s1, … , sk ⊆ 𝑆𝑠0

𝜋 ∩ 𝑆𝑇 𝑆𝑇 ← (𝑆𝑇 ∖ 𝑠1, … , 𝑠𝑘 ) ∪ (𝑐ℎ𝑖𝑙𝑑𝑟𝑒𝑛 𝑠1, … , 𝑠𝑘 ∖ 𝑆𝐸) 𝑆𝐸 ← 𝑆𝐸 ∪ 𝑐ℎ𝑖𝑙𝑑𝑟𝑒𝑛( 𝑠1, … , 𝑠𝑘 ) Until 𝑆𝑠0

𝜋 ∩ 𝑆𝑇 is empty

Page 32: LAO* Paper Presentation - Teaching Labscsc2542h/fall/material/csc2542f16_lao... · The slides on AO* are those of Gholamreza Ghassem-Sani. ... Mark the best path out of CURRENT by

CS886 (c) 2013 Pascal Poupart

9

Efficiency

Efficiency influenced by

1. Choice of terminal states to add to envelope

2. Algorithm to find optimal policy – Can use value iteration, policy iteration, modified

policy iteration, linear programming – Key: reuse previous computation

• E.g., start with previous policy or value function at each iteration

Page 33: LAO* Paper Presentation - Teaching Labscsc2542h/fall/material/csc2542f16_lao... · The slides on AO* are those of Gholamreza Ghassem-Sani. ... Mark the best path out of CURRENT by

CS886 (c) 2013 Pascal Poupart

10

Convergence • Theorem: LAO* converges to the optimal policy • Proof:

– Fact: At each iteration, the value function 𝑉 is an upper bound on 𝑉∗ due to the heuristic function ℎ

– Proof by contradiction: suppose the algorithm stops, but 𝜋 is not optimal.

• Since the algorithm stopped, all states reachable by 𝜋 are in 𝑆𝐸 ∖ 𝑆𝑇

• Hence, the value function 𝑉 is the value of 𝜋 and since 𝜋 is suboptimal then 𝑉 < 𝑉∗, which contradicts the fact that 𝑉 is an upper bound on 𝑉∗

Page 34: LAO* Paper Presentation - Teaching Labscsc2542h/fall/material/csc2542f16_lao... · The slides on AO* are those of Gholamreza Ghassem-Sani. ... Mark the best path out of CURRENT by

CS886 (c) 2013 Pascal Poupart

11

Summary

• LAO* – Extension of basic solution algorithms (value iteration,

policy iteration, linear programming) – Exploit initial state and heuristic function – Gradually grow an envelope of states – Complexity depends on # of reachable states instead

of size of state space