
Fitted/batch/model-based RL: A (sketchy, biased) overview(?)

Csaba Szepesvári, University of Alberta


Contents

What, why?
Constraints
How?
Model-based learning
  Model learning
  Planning
Model-free learning
  Averagers
  Fitted RL


Motto

“Nothing is more practical than a good theory” [Lewin]

“He who loves practice without theory is like the sailor who boards ship without a rudder and compass and never knows where he may cast.” [Leonardo da Vinci]


What? Why?

What is batch RL? (see the interface sketch below)
  Input: samples (the algorithm cannot influence the samples)
  Output: a good policy
Why?
  Common problem
  Sample efficiency: data is expensive
  Building block
Why not?
  Too much work (for nothing?): “Don’t worry, be lazy!”
  Old samples are irrelevant
  Missed opportunities (evaluate a policy!?)
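To make the input/output contract concrete, here is a minimal, hypothetical sketch of the batch setting (the names Transition and batch_rl_solver are illustrative, not from the talk): a fixed set of transitions comes in, a policy comes out.

```python
# Hypothetical sketch of the batch-RL interface: a fixed dataset in, a policy out.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Transition:
    state: tuple       # observed state (or feature vector)
    action: int        # action taken by the behaviour policy (not chosen by the learner)
    reward: float      # immediate reward
    next_state: tuple  # successor state

Policy = Callable[[tuple], int]  # maps a state to an action

def batch_rl_solver(data: List[Transition]) -> Policy:
    """Placeholder: any batch method (FQI, LSPI, model learning + planning)
    consumes `data` -- which it cannot influence -- and returns a policy."""
    def policy(state: tuple) -> int:
        return 0  # a real solver would return a greedy action w.r.t. its value estimate
    return policy
```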


Constraints

Large (infinite) state/action spaces
Limits on
  computation
  memory use


How?

Model learning + planning
Model-free
  Policy search
  DP
    Policy iteration
    Value iteration


Model-based learning


Model learning


Model-based methods

Model learning: how?
  Model: what happens if ..?
  Features vs. observations vs. states
  System identification? (Satinder! Carlos! Eric! …)
Planning: how?
  Sample + learning! (batch RL? ..but you can influence the samples)
  What else? (Discretize? Nay..)
Pro: a model is good for multiple things
Contra: the problem is doubled: we need high-fidelity models and good planning

Problem 1: Should planning take into account the uncertainties in the model? (“robustification”)

Problem 2: How to learn relevant, compact models? For example: How to reject irrelevant features and keep the relevant ones?

Need: Tight integration of planning and learning!


Planning


Bad news..

Theorem (Chow & Tsitsiklis, ’89): Consider Markovian Decision Problems with a d-dimensional state space, bounded transition probabilities and rewards, and Lipschitz-continuous transition probabilities and rewards. Any algorithm computing an ε-approximation of the optimal value function needs Ω(ε^{-d}) values of p and r.

What’s next then??
Open: Policy approximation?


The joy of laziness

Don’t worry, be lazy:
  “If something is too hard to do, then it’s not worth doing”
Luckiness factor:
  “If you really want something in this life, you have to work for it. Now quiet, they’re about to announce the lottery numbers!”


Sparse lookahead trees [Kearns et al., ’02]

Idea: computing a good action ≡ planning, so build a lookahead tree (sketched below)
Size of the tree: S = c·|A|^{H(ε)} (unavoidable), where H(ε) = K_r/(ε(1-γ))
Good news: S is independent of d!
Bad news: S is exponential in H(ε)
Still attractive: generic, easy to implement
Problem: not really practical
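A minimal sketch of the sparse-sampling idea, assuming a generative model sample(state, action) -> (reward, next_state); the per-action branching C, depth, and discount are parameters here, and the code is illustrative rather than the algorithm as published.

```python
def sparse_sampling_value(state, actions, sample, depth, C, gamma):
    """Estimate V*(state) with a depth-limited sparse lookahead tree
    (in the spirit of Kearns et al. '02). `sample(s, a)` is a generative
    model returning (reward, next_state); C successors are drawn per action."""
    if depth == 0:
        return 0.0
    best = float("-inf")
    for a in actions:
        total = 0.0
        for _ in range(C):  # C sampled children per action -> tree size ~ (C*|A|)^depth
            r, s2 = sample(state, a)
            total += r + gamma * sparse_sampling_value(s2, actions, sample,
                                                       depth - 1, C, gamma)
        best = max(best, total / C)
    return best

def sparse_sampling_action(state, actions, sample, depth, C, gamma):
    """Pick the action whose sampled one-step lookahead value is largest."""
    def q(a):
        total = 0.0
        for _ in range(C):
            r, s2 = sample(state, a)
            total += r + gamma * sparse_sampling_value(s2, actions, sample,
                                                       depth - 1, C, gamma)
        return total / C
    return max(actions, key=q)
```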


Idea.. Be more lazy

Need to propagate values from good leaves as early as possible
Why sample suboptimal actions at all?
Breadth-first → depth-first!
Bandit algorithms → Upper Confidence Bounds → UCT [KoSze ’06] (Remi!) (a compressed sketch follows below)
Similar ideas: [Peret and Garcia, ’04], [Chang et al., ’05]
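A compressed, hypothetical UCT-style planner sketch: a UCB1 score decides which action to expand at every visited state, and returns are averaged back up along the sampled path. It assumes a generative model sample(s, a) -> (reward, next_state) with hashable states, and it omits most of the practical detail of [KoSze ’06].

```python
import math
from collections import defaultdict

class UCTPlanner:
    """Minimal UCT-flavoured planner sketch (not the reference implementation).
    `sample(s, a)` is a generative model returning (reward, next_state);
    `actions(s)` lists the available actions; rollouts are depth-truncated."""

    def __init__(self, sample, actions, gamma=0.99, depth=50, c=1.4):
        self.sample, self.actions = sample, actions
        self.gamma, self.depth, self.c = gamma, depth, c
        self.N = defaultdict(int)    # visit counts of (state, action)
        self.Ns = defaultdict(int)   # visit counts of state
        self.Q = defaultdict(float)  # running mean return of (state, action)

    def _select(self, s):
        # UCB1: prefer actions with a high mean return or high uncertainty.
        def ucb(a):
            if self.N[(s, a)] == 0:
                return float("inf")
            return self.Q[(s, a)] + self.c * math.sqrt(math.log(self.Ns[s]) / self.N[(s, a)])
        return max(self.actions(s), key=ucb)

    def _rollout(self, s, d):
        if d == 0:
            return 0.0
        a = self._select(s)
        r, s2 = self.sample(s, a)
        ret = r + self.gamma * self._rollout(s2, d - 1)
        # Update running averages along the sampled path ("propagate values early").
        self.Ns[s] += 1
        self.N[(s, a)] += 1
        self.Q[(s, a)] += (ret - self.Q[(s, a)]) / self.N[(s, a)]
        return ret

    def plan(self, s, n_simulations=1000):
        for _ in range(n_simulations):
            self._rollout(s, self.depth)
        return max(self.actions(s), key=lambda a: self.Q[(s, a)])
```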


Results: Sailing

‘Sailing’: stochastic shortest path
State-space size = 24 * problem-size
Extension to two-player, full-information games
Good results in Go! (Remi, David!)
Open: Why (when) does UCT work so well?
  Conjecture: when being (very) optimistic does not abuse the search
How to improve UCT?


Random Discretization Method [Rust ’97]

Method (a rough sketch follows below):
  Random base points
  Value function computed at these points (weighted importance sampling)
  Compute values at other points at run-time (“half-lazy method”)
  Why Monte Carlo? Avoid grids!
Result:
  State space: [0,1]^d
  Action space: finite
  p(y|x,a), r(x,a) Lipschitz continuous, bounded
  Theorem [Rust ’97]: E[ ‖V_N − V*‖_∞ ] ≤ C · d · |A|^{5/4} / ((1−γ)² · N^{1/4})
  Theorem [Sze ’01]: Polynomially many samples are enough to come up with ε-optimal actions (poly dependence on H). Smoothness of the value function is not required.

Open: Can we improve the result by changing the distribution of samples?
  Idea: presample + follow the obtained policy
Open: Can we get poly dependence on both d and H without representing a value function? (e.g. lookahead trees)
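A rough numerical sketch of Rust-style random discretization, assuming the transition density p(y|x,a) and the reward r(x,a) can be evaluated pointwise on [0,1]^d: the value function is computed only at N random base points via self-normalized importance weights, and values elsewhere are interpolated at run time (“half-lazy”). All names and constants are placeholders.

```python
import numpy as np

def random_discretization_vi(p_density, reward, n_actions, d, N=500,
                             gamma=0.95, n_iters=200, rng=None):
    """Sketch of value iteration on random base points (Rust '97 flavour).
    p_density(y, x, a): transition density, reward(x, a): reward, both
    evaluable at arbitrary points of the state space [0,1]^d."""
    rng = np.random.default_rng() if rng is None else rng
    X = rng.random((N, d))                           # random base points in [0,1]^d
    # Self-normalized (importance-weighted) transition matrices on the base points.
    W = np.empty((n_actions, N, N))
    R = np.empty((N, n_actions))
    for a in range(n_actions):
        W[a] = np.array([[p_density(X[j], X[i], a) for j in range(N)] for i in range(N)])
        W[a] /= W[a].sum(axis=1, keepdims=True)
        R[:, a] = [reward(X[i], a) for i in range(N)]
    V = np.zeros(N)
    for _ in range(n_iters):                         # approximate value iteration on the base points
        Q = R + gamma * np.stack([W[a] @ V for a in range(n_actions)], axis=1)
        V = Q.max(axis=1)
    def value_at(x):
        """'Half-lazy': interpolate the value of an arbitrary point at run time."""
        q = [reward(x, a) + gamma * np.average(V, weights=[p_density(X[j], x, a) for j in range(N)])
             for a in range(n_actions)]
        return max(q)
    return X, V, value_at
```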


Pegasus [Ng & Jordan ’00]

Idea: policy search + method of common random numbers (“scenarios”); a sketch follows below
Results:
  Condition: deterministic simulative model
  Thm: finite action space, finite-complexity policy class ⇒ polynomial sample complexity
  Thm: infinite action spaces, Lipschitz continuity of transition probabilities + rewards ⇒ polynomial sample complexity
  Thm: finitely computable models + policies ⇒ polynomial sample complexity
Pro: nice results
Contra: global search? What policy space?

Problem 1: How to avoid global search?

Problem 2: When can we find a good policy efficiently? How?

Problem 3: How to choose the policy class?
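A sketch of the Pegasus trick under the stated condition (a deterministic simulative model): the random numbers driving the simulator are frozen up front (the “scenarios”), so the estimated policy value becomes a deterministic function of the policy parameters that any optimizer can search over. The names (step, policy, noise_dim) are illustrative assumptions, not the published interface.

```python
import numpy as np

def make_pegasus_objective(step, init_states, policy, horizon, gamma,
                           n_scenarios=50, noise_dim=4, seed=0):
    """Pegasus-style objective (Ng & Jordan '00 flavour): fix the scenarios
    (initial states and all simulator noise) once, then evaluate every policy
    on exactly the same random numbers.

    step(x, a, u) -> (reward, next_state) is a deterministic simulative model,
    with all randomness supplied through the pre-drawn noise vector u.
    policy(theta, x) -> action."""
    rng = np.random.default_rng(seed)
    x0s = [init_states[i % len(init_states)] for i in range(n_scenarios)]
    noises = rng.random((n_scenarios, horizon, noise_dim))   # frozen "scenarios"

    def estimated_return(theta):
        total = 0.0
        for x0, noise in zip(x0s, noises):
            x, disc = x0, 1.0
            for t in range(horizon):
                r, x = step(x, policy(theta, x), noise[t])
                total += disc * r
                disc *= gamma
        return total / n_scenarios   # deterministic in theta: plain policy search applies

    return estimated_return
```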


Other planning methods

Your favorite RL method!
  + Planning is easier than learning: you can reset the state!
Dyna-style planning with prioritized sweeping (Rich!)
Conservative policy iteration
  Problem: policy search with guaranteed improvement in every iteration
  [K&L ’00]: bound for finite MDPs, policy class ≡ all policies
  [K ’03]: arbitrary policies, reduction-style result
Policy search by DP [Bagnell, Kakade, Ng & Schneider ’03]
  Similar to [K ’03], finite-horizon problems
Fitted value iteration ..


Model-free: Policy Search

????

Open: How to do it??

(I am serious)

Open: How to evaluate a policy/policy gradient given some samples?

(partial result: In the limit, under some conditions, policies can be evaluated [AnSzeMu’08])


Model-free: Dynamic Programming

Policy iteration
  How to evaluate policies?
  Do good value functions give rise to good policies?
Value iteration
  Use action-value functions
  How to represent value functions?
  How to do the updates?


Value-function based methods

Questions:
  What representation to use?
  How are errors propagated?
Averagers [Gordon ’95] (~ kernel methods): V_{t+1} = Π_F T V_t (a toy update is sketched below)
  L∞ theory
  Can we have an L2 (Lp) theory?
  Counterexamples [Boyan & Moore ’95, Baird ’95, BeTsi ’96]
  L2 error propagation [Munos ’03, ’05]
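A toy sketch of one averager-style update V_{t+1} = Π_F T V_t: the value at a query point is a convex combination of stored values (normalized kernel weights), which is the nonexpansiveness that the sup-norm contraction argument relies on. The Gaussian kernel, the anchor set, and the pre-sampled transitions are all illustrative assumptions.

```python
import numpy as np

def averager_fitted_vi_step(anchors, V, transitions, rewards, gamma, bandwidth=0.1):
    """One sketchy step of V_{t+1} = Pi_F T V_t with an 'averager' Pi_F:
    values at next states are *convex combinations* of anchor values, so the
    composed operator stays a sup-norm contraction (Gordon '95 flavour).

    anchors:     (N, d) array of representative states
    V:           (N,) current values at the anchors
    transitions: (N, A, d) one sampled next state per anchor and action
    rewards:     (N, A) corresponding rewards
    """
    N, A, _ = transitions.shape
    backups = np.empty((N, A))
    for i in range(N):
        for a in range(A):
            # Evaluate V at the sampled next state by kernel averaging over anchors.
            w = np.exp(-np.sum((anchors - transitions[i, a]) ** 2, axis=1) / bandwidth)
            w /= w.sum()                     # convex weights: the "averager" property
            backups[i, a] = rewards[i, a] + gamma * w @ V
    return backups.max(axis=1)               # greedy backup (T V_t) evaluated at the anchors
```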


Fitted methods

Idea: use regression/classification with value/policy iteration
Notable examples:
  Fitted Q-iteration (a compact sketch follows below)
    Use trees (averagers; Damien!)
    Use neural nets (L2; Martin!)
  Policy iteration
    LSTD [Bradtke & Barto ’96, Boyan ’99]
    BRM [AnSzeMu ’06, ’08]
    LSPI: use action-value functions + iterate [Lagoudakis & Parr ’01, ’03]
    RL as classification [La & Pa ’03]
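A compact fitted Q-iteration sketch in the spirit of the tree-based variant above, assuming a scikit-learn-style regressor with fit/predict is available; the choice of ExtraTreesRegressor and every hyperparameter here are placeholders rather than the methods cited on the slide.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor  # assumed available; any regressor works

def fitted_q_iteration(S, A, R, S2, n_actions, gamma=0.99, n_iters=50):
    """Fitted Q-iteration sketch on a batch of transitions (S, A, R, S2).
    S, S2: (n, d) state arrays; A: (n,) integer actions; R: (n,) rewards.
    Returns a function Q(states, action) usable for greedy action selection."""
    X = np.hstack([S, A.reshape(-1, 1)])   # regress Q on (state, action) pairs
    y = R.copy()                           # first iteration: Q_1 = r
    model = None
    for _ in range(n_iters):
        model = ExtraTreesRegressor(n_estimators=50).fit(X, y)
        # Bellman targets: r + gamma * max_a' Q(s', a'), with Q given by the last fit.
        q_next = np.column_stack([
            model.predict(np.hstack([S2, np.full((len(S2), 1), a)]))
            for a in range(n_actions)
        ])
        y = R + gamma * q_next.max(axis=1)
    def q_fn(states, action):
        return model.predict(np.hstack([states, np.full((len(states), 1), action)]))
    return q_fn
```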


Results for fitted algorithms

Results for LSPI/BRM-PI, FQI [AnSzeMu ’06-’08]:
  Finite action space, continuous state space
  Smoothness conditions on the MDP
  Representative training set
  Function class F large (Bellman error of F is small), but controlled complexity
  Polynomial rates (similar to supervised learning)
FQI, continuous action spaces:
  Similar conditions + a restricted policy class
  Polynomial rates, but bad scaling with the dimension of the action space

Open: How to choose the function space in an adaptive way? (~ model selection in supervised learning)
  Supervised learning does not work without model selection. Why would RL?
  NO, IT DOES NOT.
Idea: Regularize! (a toy regularized regression step is sketched below)
Problem: How to evaluate policies?
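As a hedged illustration of the “regularize” idea, the unconstrained regression inside a fitted update can be replaced by a penalized one; the ridge (L2) version below over a fixed feature matrix is only one possible instantiation, with the feature map and λ as placeholders rather than the method from the talk.

```python
import numpy as np

def ridge_fitted_step(Phi, targets, lam=1e-2):
    """One L2-regularized regression step inside a fitted method:
    w = argmin_w ||Phi w - targets||^2 + lam * ||w||^2.
    Phi: (n, k) feature matrix of (state, action) pairs; targets: (n,) Bellman backups."""
    k = Phi.shape[1]
    A = Phi.T @ Phi + lam * np.eye(k)
    return np.linalg.solve(A, Phi.T @ targets)

# The penalty lam trades data fit against the effective size of F; sweeping lam on
# held-out Bellman residuals is one (hypothetical) stand-in for model selection here.
```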


Regularization


Final thoughts

Batch RL: a flourishing area
Many open questions; more results should come soon!
Some good results in practice
Take computation cost seriously?
Connect to on-line RL?


Batch RL

Let’s switch to that policy – after all the paper says that learning converges at an optimal rate!