1
Fitted/batch/model-based RL: A (sketchy, biased) overview(?)
Csaba Szepesvári, University of Alberta
2
Contents
What, why?
Constraints
How?
Model-based learning: model learning, planning
Model-free learning: averagers, fitted RL
3
Motto
“Nothing is more practical than a good theory” [Lewin]
“He who loves practice without theory is like the sailor who boards ship without a rudder and compass and never knows where he may cast.” [Leonardo da Vinci]
4
What? Why?
What is batch RL?
  Input: samples (the algorithm cannot influence the samples)
  Output: a good policy
Why?
  Common problem
  Sample efficiency -- data is expensive
  Building block
Why not?
  Too much work (for nothing?) -- “Don’t worry, be lazy!”
  Old samples are irrelevant
  Missed opportunities (evaluate a policy!?)
5
Constraints
Large (infinite) state/action spaces
Limits on:
  computation
  memory use
6
How?
Model learning + planning
Model-free:
  Policy search
  DP: policy iteration, value iteration
7
Model-based learning
8
Model learning
9
Model-based methods
Model learning: How? A model answers “what happens if ..?” Features vs. observations vs. states. System identification? (Satinder! Carlos! Eric! ...)
Planning: How? Sample + learning! (batch RL? .. but here you can influence the samples) What else? (Discretize? Nay..)
Pro: a model is good for multiple things
Contra: the problem is doubled: we need high-fidelity models and good planning
Problem 1: Should planning take into account the uncertainties in the model? (“robustification”)
Problem 2: How to learn relevant, compact models? For example: how to reject irrelevant features and keep the relevant ones?
Need: tight integration of planning and learning!
10
Planning
11
Bad news..
Theorem (Chow & Tsitsiklis ’89): For Markovian Decision Problems with a d-dimensional state space, bounded transition probabilities and rewards, and Lipschitz-continuous transition probabilities and rewards, any algorithm computing an ε-approximation of the optimal value function needs Ω(ε^(-d)) values of p and r.
What’s next then??
Open: Policy approximation?
12
The joy of laziness
Don’t worry, be lazy: “If something is too hard to do, then it's not worth doing.”
Luckiness factor: “If you really want something in this life, you have to work for it. Now quiet, they're about to announce the lottery numbers!”
13
Sparse lookahead trees [Kearns et al., ’02]
Idea: computing a good action ≡ planning → build a lookahead tree
Size of the tree: S = c·|A|^H(ε) (unavoidable), where H(ε) = K_r/(ε(1−γ))
Good news: S is independent of d!
Bad news: S is exponential in H(ε)
Still attractive: generic, easy to implement
Problem: not really practical
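A minimal sketch of the sparse-sampling idea behind such lookahead trees, assuming a generative model simulate(state, action) -> (next_state, reward); the function names and the width/depth parameters are illustrative, not the exact pseudocode of Kearns et al.:

```python
def sparse_sampling_value(simulate, actions, state, depth, width, gamma):
    """Estimate the optimal value of `state` with a sparse lookahead tree.

    simulate(s, a) -> (s_next, r) is an assumed generative model;
    `width` next-state samples are drawn per action, `depth` levels deep.
    """
    if depth == 0:
        return 0.0
    best = float("-inf")
    for a in actions:
        total = 0.0
        for _ in range(width):
            s_next, r = simulate(state, a)
            total += r + gamma * sparse_sampling_value(
                simulate, actions, s_next, depth - 1, width, gamma)
        best = max(best, total / width)
    return best


def sparse_sampling_action(simulate, actions, state, depth, width, gamma):
    """Return the greedy action at the root of the sparse lookahead tree."""
    def q(a):
        total = 0.0
        for _ in range(width):
            s_next, r = simulate(state, a)
            total += r + gamma * sparse_sampling_value(
                simulate, actions, s_next, depth - 1, width, gamma)
        return total / width
    return max(actions, key=q)
```

The cost is roughly (width·|A|)^depth simulator calls: independent of the state dimension d, but exponential in the horizon, exactly as the slide notes.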
14
Idea: be more lazy
Need to propagate values from good leaves as early as possible
Why sample suboptimal actions at all?
Breadth-first → depth-first!
Bandit algorithms → Upper Confidence Bounds → UCT [KoSze ’06] (Rémi!)
Similar ideas: [Peret and Garcia, ’04], [Chang et al., ’05]
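A compact UCT-style sketch (Monte-Carlo tree search with UCB1 action selection), again assuming a generative model simulate(s, a) -> (s_next, r) and hashable states; the node structure, random rollout policy, and exploration constant c are illustrative choices, not the exact algorithm of [KoSze ’06]:

```python
import math
import random


class UCTNode:
    def __init__(self, actions):
        self.counts = {a: 0 for a in actions}    # visits per action
        self.values = {a: 0.0 for a in actions}  # running mean return per action
        self.total = 0                           # total visits of this node


def uct_search(simulate, actions, root_state, n_rollouts, horizon, gamma, c=1.0):
    """Monte-Carlo tree search with UCB1 at the tree nodes (UCT-style sketch)."""
    tree = {}

    def rollout(state, depth):
        # Default (uniformly random) policy below the tree frontier.
        ret, discount = 0.0, 1.0
        for _ in range(depth):
            state, r = simulate(state, random.choice(actions))
            ret += discount * r
            discount *= gamma
        return ret

    def search(state, depth):
        if depth == 0:
            return 0.0
        node = tree.get(state)
        if node is None:                 # expand a new leaf, then do a rollout
            tree[state] = UCTNode(actions)
            return rollout(state, depth)

        def ucb(a):
            # Favour actions with a high mean return or a low visit count.
            if node.counts[a] == 0:
                return float("inf")
            bonus = c * math.sqrt(math.log(node.total + 1) / node.counts[a])
            return node.values[a] + bonus

        a = max(actions, key=ucb)
        s_next, r = simulate(state, a)
        q = r + gamma * search(s_next, depth - 1)
        node.counts[a] += 1
        node.total += 1
        node.values[a] += (q - node.values[a]) / node.counts[a]  # incremental mean
        return q

    for _ in range(n_rollouts):
        search(root_state, horizon)
    root = tree[root_state]
    return max(actions, key=lambda a: root.values[a])
```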
15
Results: Sailing
‘Sailing’: stochastic shortest path
State-space size = 24 * problem-size
Extension to two-player, full-information games: good results in Go! (Rémi, David!)
Open: Why (when) does UCT work so well? Conjecture: when being (very) optimistic does not abuse search
Open: How to improve UCT?
16
Random Discretization Method [Rust ’97]
Method: random base points; the value function is computed at these points (weighted importance sampling); values at other points are computed at run time (“half-lazy method”)
Why Monte Carlo? Avoid grids!
Result:
  State space: [0,1]^d
  Action space: finite
  p(y|x,a), r(x,a) Lipschitz continuous, bounded
Theorem [Rust ’97]: E[ ‖V_N − V*‖_∞ ] ≤ C d |A|^{5/4} / ((1−γ)² N^{1/4})
Theorem [Sze ’01]: polynomially many samples are enough to come up with ε-optimal actions (poly dependence on H); smoothness of the value function is not required
Open: Can we improve the result by changing the distribution of samples? Idea: presample + follow the obtained policy
Open: Can we get poly dependence on both d and H without representing a value function? (e.g., lookahead trees)
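A sketch of the random-discretization idea: run value iteration on random base points, replacing the expectation over next states by a normalized (importance-weighted) sum over those same points. The interfaces p_density(y, x, a) and reward(x, a) and all parameter names are assumptions for illustration, not Rust's original formulation:

```python
import numpy as np


def random_discretization_vi(p_density, reward, actions, dim, n_points,
                             gamma, n_iters, rng=None):
    """Rust-style random discretization sketch on the unit cube [0,1]^dim.

    p_density(y, x, a) -> transition density p(y | x, a), reward(x, a) -> r(x, a);
    both are assumed known and cheap to evaluate.
    """
    rng = rng or np.random.default_rng(0)
    X = rng.uniform(size=(n_points, dim))        # random base points
    V = np.zeros(n_points)

    # Normalized transition weights W[a][i, j] ~ p(X_j | X_i, a) (self-approximation).
    W = {}
    for a in actions:
        dens = np.array([[p_density(X[j], X[i], a) for j in range(n_points)]
                         for i in range(n_points)])
        W[a] = dens / np.maximum(dens.sum(axis=1, keepdims=True), 1e-12)

    R = {a: np.array([reward(X[i], a) for i in range(n_points)]) for a in actions}

    # Value iteration restricted to the base points.
    for _ in range(n_iters):
        V = np.max(np.stack([R[a] + gamma * W[a] @ V for a in actions]), axis=0)

    def value_at(x):
        """'Half-lazy' evaluation at an arbitrary state x at run time."""
        q = []
        for a in actions:
            w = np.array([p_density(X[j], x, a) for j in range(n_points)])
            w = w / max(w.sum(), 1e-12)
            q.append(reward(x, a) + gamma * w @ V)
        return max(q)

    return X, V, value_at
```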
17
Pegasus [Ng & Jordan ’00]
Idea: policy search + the method of common random numbers (“scenarios”)
Results:
  Condition: deterministic simulative model
  Thm: finite action space, finite-complexity policy class ⇒ polynomial sample complexity
  Thm: infinite action spaces, Lipschitz continuity of transition probabilities + rewards ⇒ polynomial sample complexity
  Thm: finitely computable models + policies ⇒ polynomial sample complexity
Pro: nice results
Contra: global search? What policy space?
Problem 1: How to avoid global search?
Problem 2: When can we find a good policy efficiently? How?
Problem 3: How to choose the policy class?
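A sketch of the common-random-numbers (“scenarios”) idea: freeze the random numbers once, feed them explicitly into a deterministic simulative model step(x, a, u) -> (x_next, r), and score every candidate policy on the same scenarios, so the search objective becomes a deterministic function of the policy. The names and the crude search over a finite candidate list are illustrative assumptions; Pegasus itself does not prescribe the optimizer:

```python
import numpy as np


def pegasus_objective(step, policy, x0s, noise, horizon, gamma):
    """Average discounted return of `policy` on a fixed set of scenarios.

    `noise` (shape: n_scenarios x horizon x noise_dim) is drawn once and then
    reused for every policy, which is the key Pegasus trick.
    """
    total = 0.0
    for x0, us in zip(x0s, noise):
        x, discount = x0, 1.0
        for t in range(horizon):
            x, r = step(x, policy(x), us[t])   # deterministic given u
            total += discount * r
            discount *= gamma
    return total / len(x0s)


def pegasus_search(step, make_policy, candidate_params, x0s, horizon, gamma,
                   n_scenarios, noise_dim, seed=0):
    """Pick the candidate parameters scoring best on the shared scenarios."""
    rng = np.random.default_rng(seed)
    noise = rng.uniform(size=(n_scenarios, horizon, noise_dim))  # frozen once
    starts = x0s[:n_scenarios]
    return max(candidate_params,
               key=lambda th: pegasus_objective(step, make_policy(th),
                                                starts, noise, horizon, gamma))
```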
18
Other planning methods
Your favorite RL method! (+ planning is easier than learning: you can reset the state!)
Dyna-style planning with prioritized sweeping (Rich!) -- see the sketch after this list
Conservative policy iteration
  Problem: policy search with guaranteed improvement in every iteration
  [K&L’00]: bound for finite MDPs, policy class ≡ all policies
  [K’03]: arbitrary policies, reduction-style result
Policy search by DP [Bagnell, Kakade, Ng & Schneider ’03]: similar to [K’03], finite-horizon problems
Fitted value iteration ..
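For the prioritized-sweeping item above, here is a small planning-flavoured sketch for a known finite MDP: states with large Bellman error are backed up first and their predecessors are re-prioritized. This is an illustrative variant under those assumptions, not the Dyna-style learning algorithm itself:

```python
import heapq
import numpy as np


def prioritized_sweeping_vi(P, R, gamma, theta=1e-6):
    """Prioritized sweeping on a known finite MDP (illustrative sketch).

    P[s][a] is a list of (prob, s_next) pairs, R[s][a] a scalar reward.
    """
    n_states, n_actions = len(P), len(P[0])
    V = np.zeros(n_states)

    # Predecessor sets: which states can transition into s?
    preds = [set() for _ in range(n_states)]
    for s in range(n_states):
        for a in range(n_actions):
            for p, s2 in P[s][a]:
                if p > 0:
                    preds[s2].add(s)

    def backup(s):
        return max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                   for a in range(n_actions))

    # Max-priority queue keyed by Bellman error (negated for heapq).
    pq = [(-abs(backup(s) - V[s]), s) for s in range(n_states)]
    heapq.heapify(pq)
    while pq:
        neg_err, s = heapq.heappop(pq)
        if -neg_err < theta:
            break
        V[s] = backup(s)
        for sp in preds[s]:                      # re-prioritize predecessors
            err = abs(backup(sp) - V[sp])
            if err > theta:
                heapq.heappush(pq, (-err, sp))
    return V
```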
19
Model-free: Policy Search
????
Open: How to do it??
(I am serious)
Open: How to evaluate a policy/policy gradient given some samples?
(partial result: In the limit, under some conditions, policies can be evaluated [AnSzeMu’08])
20
Model-free: Dynamic Programming
Policy iteration: How to evaluate policies? Do good value functions give rise to good policies?
Value iteration: Use action-value functions. How to represent value functions? How to do the updates?
21
Value-function based methods
Questions: What representation to use? How are errors propagated?
Averagers [Gordon ’95] ~ kernel methods: V_{t+1} = Π_F T V_t (sketch below)
L∞ theory; can we have an L2 (Lp) theory?
Counterexamples [Boyan & Moore ’95, Baird ’95, BeTsi ’96]
L2 error propagation [Munos ’03, ’05]
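A toy sketch of V_{t+1} = Π_F T V_t with a Gaussian-kernel averager playing the role of Π_F. The known-model interface model(x, a) -> [(prob, next_state), ...], reward(x, a), the bandwidth, and all names are illustrative assumptions (the sample-based variants appear on the next slides):

```python
import numpy as np


def gaussian_averager(X, targets, bandwidth):
    """f(x) = sum_i w_i(x) * targets_i with non-negative weights summing to 1.

    Because the prediction is a convex combination of the targets, the fit
    operator is a non-expansion in the sup norm -- the property behind the
    L-infinity theory of averagers.
    """
    def f(x):
        d2 = np.sum((X - x) ** 2, axis=1)
        w = np.exp(-d2 / (2 * bandwidth ** 2))
        w = w / w.sum()
        return float(w @ targets)
    return f


def projected_value_iteration(X, actions, model, reward, gamma, n_iters, bandwidth):
    """Iterate V_{t+1} = Pi_F T V_t on a set of base points X."""
    V = lambda x: 0.0
    for _ in range(n_iters):
        # Bellman backups (T V_t) evaluated at the base points.
        backups = np.array([
            max(reward(x, a) + gamma * sum(p * V(y) for p, y in model(x, a))
                for a in actions)
            for x in X
        ])
        # Projection Pi_F: fit the averager through the backed-up values.
        V = gaussian_averager(X, backups, bandwidth)
    return V
```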
22
Fitted methods
Idea: use regression/classification with value/policy iteration
Notable examples:
  Fitted Q-iteration (sketch below)
    Use trees (→ averagers; Damien!)
    Use neural nets (→ L2; Martin!)
  Policy iteration
    LSTD [Bradtke & Barto ’96, Boyan ’99]
    BRM [AnSzeMu ’06, ’08]
    LSPI: use action-value functions + iterate [Lagoudakis & Parr ’01, ’03]
  RL as classification [La & Pa ’03]
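A fitted Q-iteration sketch on a fixed batch of transitions, using extremely randomized trees as the regressor (in the spirit of the tree-based variant). The (x, a, r, x') batch format, the encoding of (x, a) as [features, action index], and the regressor settings are illustrative assumptions:

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor  # tree-based regressor


def fitted_q_iteration(batch, actions, gamma, n_iters):
    """Fitted Q-iteration on a fixed batch of (x, a, r, x_next) transitions.

    x, x_next are 1-D feature arrays and a is an index into `actions`.
    """
    X = np.array([np.append(x, a) for x, a, _, _ in batch])
    rewards = np.array([r for _, _, r, _ in batch])
    next_states = [x_next for _, _, _, x_next in batch]

    q = None
    for _ in range(n_iters):
        if q is None:
            targets = rewards                      # first fit: Q_1 ~ r
        else:
            # Bellman targets: r + gamma * max_a' Q_k(x', a')
            next_q = np.column_stack([
                q.predict(np.array([np.append(x, a) for x in next_states]))
                for a in range(len(actions))
            ])
            targets = rewards + gamma * next_q.max(axis=1)
        q = ExtraTreesRegressor(n_estimators=50).fit(X, targets)

    def greedy_action(x):
        values = [q.predict(np.append(x, a).reshape(1, -1))[0]
                  for a in range(len(actions))]
        return actions[int(np.argmax(values))]

    return q, greedy_action
```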
23
Results for fitted algorithms
Results for LSPI/BRM-PI and FQI [AnSzeMu ’06-’08]:
  Finite action space, continuous state space
  Smoothness conditions on the MDP
  Representative training set
  Function class F large (the Bellman error of F is small), but of controlled complexity
  ⇒ polynomial rates (similar to supervised learning)
FQI with continuous action spaces: similar conditions + restricted policy class ⇒ polynomial rates, but bad scaling with the dimension of the action space
Open: How to choose the function space in an adaptive way? (~ model selection in supervised learning)
Supervised learning does not work without model selection. Why would RL work? NO, IT DOES NOT.
Idea: Regularize! Problem: How to evaluate policies?
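One simple way to act on “regularize” for the policy-evaluation step is a ridge-penalized LSTD. The sketch below is an illustrative choice under that assumption, not the specific regularized estimator of any one paper:

```python
import numpy as np


def regularized_lstd(transitions, phi, gamma, lam):
    """Ridge-regularized LSTD for evaluating a fixed policy from a batch.

    `transitions` is a list of (x, r, x_next) generated by the policy being
    evaluated; phi(x) -> feature vector of dimension d.
    """
    d = len(phi(transitions[0][0]))
    A = np.zeros((d, d))
    b = np.zeros(d)
    for x, r, x_next in transitions:
        f, f_next = phi(x), phi(x_next)
        A += np.outer(f, f - gamma * f_next)    # empirical Phi^T (Phi - gamma Phi')
        b += r * f                              # empirical Phi^T r
    w = np.linalg.solve(A + lam * np.eye(d), b)  # ridge-regularized solution
    return lambda x: float(phi(x) @ w)           # V_hat(x) = phi(x)^T w
```

The penalty lam controls the complexity of the fitted value function, which is exactly the model-selection knob the slide asks for.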
24
Regularization
25
Final thoughts
Batch RL: a flourishing area
Many open questions; more should come soon!
Some good results in practice
Take computation cost seriously?
Connect to on-line RL?
26
Batch RL
Let’s switch to that policy – after all the paper says that learning converges at an optimal rate!