Optimistic Policy Optimization via Multiple Importance Sampling
Matteo Papini, Alberto Maria Metelli, Lorenzo Lupo, Marcello Restelli
19th September 2019, Markets, Algorithms, Prediction and Learning (MAPLE) Workshop, Politecnico di Milano, Milano, Italy
Reinforcement Learning [Sutton and Barto, 2018]
Policy $\pi : \mathcal{S} \to \Delta(\mathcal{A})$
Trajectories $\tau = (s_0, a_0, r_1, s_1, \dots)$
Return $R(\tau) = \sum_{t=0}^{T-1} \gamma^t r_{t+1}$
Goal: $\max_\pi \mathbb{E}_\pi[R(\tau)]$
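To make the return concrete, here is a minimal Python sketch (not from the talk) computing $R(\tau)$ from a recorded reward sequence; the function name is ours:

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted return R(tau) = sum_t gamma^t * r_{t+1} of one trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# A 3-step trajectory with rewards (r_1, r_2, r_3) = (0, 0, 1) and gamma = 0.9
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.81 (up to float rounding)
```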
Exploration vs Exploitation
Continuous Reinforcement Learning
Policy Optimization
Parameter space $\Theta \subseteq \mathbb{R}^d$
A parametric policy $\pi_\theta$ for each $\theta \in \Theta$
Each inducing a distribution $p_\theta$ over trajectories
Goal: $\max_{\theta \in \Theta} J(\theta) = \mathbb{E}_{\tau \sim p_\theta}[R(\tau)]$
[Figure: each parameter $\theta \in \Theta$ maps to a trajectory distribution $p_\theta$ over $\mathcal{T}$ and a value $J(\theta)$]
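As a hedged illustration of this parameter-based view, a self-contained Python sketch with a made-up 1-D environment and a linear-Gaussian policy; `rollout`, `J_hat`, the dynamics, and the reward are all hypothetical, and $J(\theta)$ is estimated by plain Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(theta, horizon=20, gamma=0.99):
    """Sample one trajectory tau ~ p_theta in a made-up 1-D environment
    and return R(tau). Dynamics and reward are hypothetical."""
    s, ret = 1.0, 0.0
    for t in range(horizon):
        a = theta[0] * s + theta[1] + rng.normal(0.0, 0.1)  # linear-Gaussian policy
        ret += gamma ** t * (-s ** 2)        # reward: stay close to the origin
        s = s + a + rng.normal(0.0, 0.01)    # toy transition
    return ret

def J_hat(theta, n=100):
    """Monte Carlo estimate of J(theta) = E_{tau ~ p_theta}[R(tau)]."""
    return np.mean([rollout(theta) for _ in range(n)])

print(J_hat(np.array([-0.5, 0.0])))  # higher is better
```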
Policy Gradient Methods
Gradient ascent on $J(\theta)$
Popular algorithms: REINFORCE [Williams, 1992], PGPE [Sehnke et al., 2008], TRPO [Schulman et al., 2015], PPO [Schulman et al., 2017]
Applications: Dota 2 [OpenAI, 2018], dexterous manipulation [Andrychowicz et al., 2018]
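A minimal sketch of what gradient ascent on $J(\theta)$ can look like, using the plain REINFORCE estimator $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_i R(\tau_i) \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ in the same toy linear-Gaussian setting as above (illustrative only; practical implementations add baselines and per-step credit assignment):

```python
import numpy as np

rng = np.random.default_rng(0)
SIGMA = 0.2  # fixed policy standard deviation

def grad_log_pi(theta, s, a):
    """grad_theta of log N(a; theta[0]*s + theta[1], SIGMA^2)."""
    mu = theta[0] * s + theta[1]
    return (a - mu) / SIGMA ** 2 * np.array([s, 1.0])

def reinforce_gradient(theta, n_traj=50, horizon=20, gamma=0.99):
    """Estimate grad J(theta) as the average of R(tau) * sum_t grad log pi(a_t | s_t)."""
    grads = []
    for _ in range(n_traj):
        s, ret, score = 1.0, 0.0, np.zeros(2)
        for t in range(horizon):
            a = theta[0] * s + theta[1] + rng.normal(0.0, SIGMA)
            score += grad_log_pi(theta, s, a)
            ret += gamma ** t * (-s ** 2)       # same toy reward as above
            s = s + a + rng.normal(0.0, 0.01)   # same toy dynamics as above
        grads.append(ret * score)
    return np.mean(grads, axis=0)

theta = np.array([-0.5, 0.0])
theta = theta + 0.01 * reinforce_gradient(theta)  # one gradient-ascent step
```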
Exploration in Policy Optimization
Policy gradient fails with sparse rewards [Kakade and Langford, 2002]
Non-convex objective ⟹ local optima
Entropy bonus [Haarnoja et al., 2018]:
- Undirected
- Unsafe
- Little theoretical understanding [Ahmed et al., 2018]
Multi-Armed Bandit (MAB)
Arms $a \in \mathcal{A}$
Expected payoff $\mu(a)$
Goal: $\min \mathrm{Regret}(T) = \sum_{t=1}^{T} \left[ \mu(a^\star) - \mu(a_t) \right]$
Wide literature on directed exploration [Bubeck et al., 2012, Lattimore and Szepesvári, 2019]
Optimism in the Face of Uncertainty
OFU strategy (e.g., UCB [Lai and Robbins, 1985]):
$a_t = \arg\max_{a \in \mathcal{A}} \underbrace{\hat{\mu}(a)}_{\text{estimate}} + C \underbrace{\sqrt{\frac{\log(1/\delta)}{\#a}}}_{\text{exploration bonus}}$
Idea: be optimistic about unknown arms
Can be applied to RL (e.g., Jaksch et al. [2010])
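A sketch of an OFU bandit loop implementing an index of this form, with made-up Bernoulli arms; it uses the fixed-confidence bonus shown on the slide rather than the classic UCB1 $\log t$ bonus, and also tracks the regret defined two slides back:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.3, 0.5, 0.7])     # true (unknown) expected payoffs, made up
K, T, C, delta = len(mu), 2000, 1.0, 0.05

counts = np.zeros(K)               # #a: number of pulls per arm
sums = np.zeros(K)                 # running payoff sums
regret = 0.0

for t in range(T):
    if t < K:
        a = t                      # pull each arm once to initialize
    else:
        ucb = sums / counts + C * np.sqrt(np.log(1 / delta) / counts)
        a = int(np.argmax(ucb))    # optimism: estimate + exploration bonus
    reward = float(rng.random() < mu[a])   # Bernoulli payoff
    counts[a] += 1
    sums[a] += reward
    regret += mu.max() - mu[a]     # expected regret of this pull

print(f"cumulative regret after {T} pulls: {regret:.1f}")
```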
Policy Optimization as a MAB
Arms: parameters $\theta$
Payoff: expected return $J(\theta)$
Continuous MAB: we need structure [Kleinberg et al., 2013]
$\theta_t = \arg\max_{\theta \in \Theta} \hat{J}(\theta) + C \sqrt{\frac{\log(1/\delta)}{\#\theta}}$
[Figure: parameters $\theta_A, \theta_B \in \Theta$ viewed as bandit arms with payoffs $J(\theta_A)$, $J(\theta_B)$]
Exploiting Arm Correlation
Arms correlate through overlapping trajectory distributions
Use Importance Sampling (IS) to transfer information:
$J(\theta_B) = \mathbb{E}_{\tau \sim p_{\theta_A}}\!\left[ \frac{p_{\theta_B}(\tau)}{p_{\theta_A}(\tau)} R(\tau) \right]$
[Figure: overlapping trajectory distributions $p_{\theta_A}$ and $p_{\theta_B}$ in trajectory space $\mathcal{T}$]
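A toy illustration of this transfer (an assumption-laden simplification, not the paper's setting): trajectories are collapsed to one-dimensional Gaussian samples so that the densities $p_\theta$ are available in closed form, and $J(\theta_B)$ is estimated from samples drawn under $\theta_A$:

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mu, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def R(tau):
    # Made-up payoff standing in for the trajectory return R(tau)
    return np.clip(tau, -2.0, 2.0)

theta_A, theta_B = 0.0, 0.5          # each theta indexes a Gaussian over "trajectories"
tau = rng.normal(theta_A, 1.0, size=10_000)   # behavioral samples tau ~ p_{theta_A}

w = gauss_pdf(tau, theta_B) / gauss_pdf(tau, theta_A)   # importance weights
J_B_is = np.mean(w * R(tau))         # IS estimate of J(theta_B) from p_{theta_A} data
J_B_mc = np.mean(R(rng.normal(theta_B, 1.0, size=10_000)))  # on-policy sanity check
print(J_B_is, J_B_mc)
```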
The OPTIMIST index [Papini et al., 2019]
A UCB-like index:
$\theta_t = \arg\max_{\theta \in \Theta} \underbrace{\check{J}_t(\theta)}_{\text{estimate}} + C \underbrace{\sqrt{\frac{d_2(p_\theta \| \Phi_t) \log(1/\delta_t)}{t}}}_{\text{exploration bonus}}$
where the estimate is a robust multiple importance sampling estimator and the exploration bonus measures the distributional distance from previous solutions.
[Figure: the MIS mixture combines the past distributions $p_{\theta_1}, \dots, p_{\theta_{t-1}}$ in trajectory space $\mathcal{T}$]
Robust Multiple Importance Sampling Estimator
Use Multiple Importance Sampling (MIS) [Veach and Guibas, 1995] to reuse all past experience
Use dynamic truncation to prevent heavy tails [Bubeck et al., 2013, Metelli et al., 2018]
Plain MIS estimator:
$\hat{J}_t(\theta) = \frac{1}{t} \sum_{k=0}^{t-1} \underbrace{\frac{p_\theta(\tau_k)}{\Phi_t(\tau_k)}}_{\text{MIS weight}} R(\tau_k), \qquad \Phi_t(\tau) = \underbrace{\frac{1}{t} \sum_{k=0}^{t-1} p_{\theta_k}(\tau)}_{\text{mixture}}$
Truncated MIS estimator:
$\check{J}_t(\theta) = \frac{1}{t} \sum_{k=0}^{t-1} \min\left\{ M_t, \frac{p_\theta(\tau_k)}{\Phi_t(\tau_k)} \right\} R(\tau_k), \qquad M_t = \underbrace{\sqrt{\frac{t\, d_2(p_\theta \| \Phi_t)}{\log(1/\delta_t)}}}_{\text{threshold}}$
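A sketch of the truncated estimator in a toy Gaussian setting, with one trajectory per past parameter; `J_check` and `gauss_pdf` are our names, and $d_2$ is passed in as a known constant, whereas in practice it is computed or bounded for the policy class at hand:

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mu, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def J_check(theta, thetas_past, taus, returns, d2, delta_t=0.1):
    """Truncated MIS estimate of J(theta): one trajectory tau_k per past theta_k,
    weights p_theta(tau_k) / Phi_t(tau_k) clipped at the threshold M_t."""
    t = len(thetas_past)
    phi = np.mean([gauss_pdf(taus, th) for th in thetas_past], axis=0)  # Phi_t(tau_k)
    w = gauss_pdf(taus, theta) / phi                                    # MIS weights
    M_t = np.sqrt(t * d2 / np.log(1.0 / delta_t))                       # truncation
    return np.mean(np.minimum(M_t, w) * returns)

# Hypothetical history: parameters tried so far, one sampled "trajectory" each
thetas_past = [0.0, 0.2, 0.4]
taus = np.array([rng.normal(th, 1.0) for th in thetas_past])
returns = np.clip(taus, -2.0, 2.0)
print(J_check(0.5, thetas_past, taus, returns, d2=1.3))
```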
Exploration Bonus
Measure novelty with the exponentiated Rényi divergence [Cortes et al., 2010, Metelli et al., 2018]:
$d_2(p_\theta \| \Phi_t) = \int \left( \frac{\mathrm{d}p_\theta}{\mathrm{d}\Phi_t} \right)^2 \mathrm{d}\Phi_t$
Used to upper bound the true value (OFU): with high probability,
$J(\theta) \le \check{J}_t(\theta) + C \sqrt{\frac{d_2(p_\theta \| \Phi_t) \log(1/\delta_t)}{t}}$
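For equal-variance Gaussian parameter distributions (the parameter-based setting), $d_2$ has a simple closed form, $d_2(P \| Q) = \exp((\mu_P - \mu_Q)^2 / \sigma^2)$; a sketch of the divergence and the resulting bonus, taken against a single past Gaussian rather than the mixture $\Phi_t$, which in general must be bounded:

```python
import numpy as np

def d2_gauss(mu_p, mu_q, sigma=1.0):
    """Exponentiated 2-Renyi divergence d_2(P || Q) = E_Q[(dP/dQ)^2]
    for P = N(mu_p, sigma^2) and Q = N(mu_q, sigma^2)."""
    return np.exp((mu_p - mu_q) ** 2 / sigma ** 2)

def bonus(d2, t, delta_t, C=1.0):
    """Exploration bonus C * sqrt(d_2 * log(1/delta_t) / t)."""
    return C * np.sqrt(d2 * np.log(1.0 / delta_t) / t)

# Novel parameters (large divergence) get a large bonus; familiar ones a small one
print(bonus(d2_gauss(0.5, 0.0), t=10, delta_t=0.1))
print(bonus(d2_gauss(0.05, 0.0), t=10, delta_t=0.1))
```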
Sublinear Regret
$\mathrm{Regret}(T) = \sum_{t=0}^{T} \left[ J(\theta^\star) - J(\theta_t) \right]$
Compact, $d$-dimensional parameter space $\Theta$
Under mild assumptions on the policy class, with high probability:
$\mathrm{Regret}(T) = \tilde{O}\left( \sqrt{dT} \right)$
OPTIMIST in Practice
Easy implementation only for parameter-based exploration [Sehnke et al., 2008]
Difficult index optimization ⟹ discretization
Computational time can be traded off with regret:
$\tilde{O}\left( d T^{1-\varepsilon/d} \right)$ regret ⟹ $O\left( t^{1+\varepsilon} \right)$ time
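Putting the pieces together, a self-contained toy sketch of an OPTIMIST-style loop on a discretized 1-D parameter grid; the distributions, the payoff, and the single-Gaussian stand-in for $d_2(p_\theta \| \Phi_t)$ are all simplifying assumptions, so this shows the shape of the algorithm rather than the paper's implementation (see the repository linked below):

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mu, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def d2_gauss(mu_p, mu_q, sigma=1.0):
    # Exponentiated 2-Renyi divergence between equal-variance Gaussians
    return np.exp((mu_p - mu_q) ** 2 / sigma ** 2)

grid = np.linspace(-1.0, 1.0, 41)          # discretized parameter space Theta
thetas, taus, rets = [], [], []
theta = 0.0                                 # initial parameter

for t in range(1, 200):
    tau = rng.normal(theta, 1.0)            # one "trajectory" from p_theta
    thetas.append(theta)
    taus.append(tau)
    rets.append(float(np.clip(tau, -2.0, 2.0)))   # made-up payoff R(tau)
    ts, rs = np.array(taus), np.array(rets)
    delta_t = 1.0 / (t + 1)
    phi = np.mean([gauss_pdf(ts, th) for th in thetas], axis=0)  # mixture Phi_t
    index = []
    for th in grid:
        d2 = d2_gauss(th, float(np.mean(thetas)))  # crude stand-in for d2(p_th || Phi_t)
        M = np.sqrt(t * d2 / np.log(1 / delta_t))                 # truncation threshold M_t
        J = np.mean(np.minimum(M, gauss_pdf(ts, th) / phi) * rs)  # truncated MIS estimate
        index.append(J + np.sqrt(d2 * np.log(1 / delta_t) / t))   # + exploration bonus (C = 1)
    theta = float(grid[np.argmax(index)])   # optimistic choice for the next round

print("final theta:", theta)
```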
Empirical Results
[Plot: River Swim, cumulative return vs. episodes (0 to 5,000); OPTIMIST vs. PGPE]
[Plot: Mountain Car, cumulative return vs. episodes (0 to 5,000); OPTIMIST vs. PGPE and PB-POIS]
Future Work
Extend to action-based exploration
Improve index optimization
Posterior sampling [Thompson, 1933]
The Abstract Problem [Chu et al., 2019]
Outcome space $\mathcal{Z}$
Decision set $\mathcal{P} \subseteq \Delta(\mathcal{Z})$
Payoff $f : \mathcal{Z} \to \mathbb{R}$
Goal: $\max_{p \in \mathcal{P}} \mathbb{E}_{z \sim p}[f(z)]$
Thank you for your attention!
Papini, Matteo, Alberto Maria Metelli, Lorenzo Lupo, and Marcello Restelli. "Optimistic Policy Optimization via Multiple Importance Sampling." In International Conference on Machine Learning, pp. 4989-4999. 2019.
Code: github.com/WolfLo/optimist
Contact: [email protected]
Web page: t3p.github.io/icml19
References
Ahmed, Z., Roux, N. L., Norouzi, M., and Schuurmans, D. (2018). Understanding the impact of entropy on policy optimization. arXiv preprint arXiv:1811.11214.
Andrychowicz, M., Baker, B., Chociej, M., Jozefowicz, R., McGrew, B., Pachocki, J., Petron, A., Plappert, M., Powell, G., Ray, A., et al. (2018). Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177.
Bubeck, S., Cesa-Bianchi, N., et al. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122.
Bubeck, S., Cesa-Bianchi, N., and Lugosi, G. (2013). Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717.
Chu, C., Blanchet, J., and Glynn, P. (2019). Probability functional descent: A unifying perspective on GANs, variational inference, and reinforcement learning. In International Conference on Machine Learning, pages 1213–1222.
Cortes, C., Mansour, Y., and Mohri, M. (2010). Learning bounds for importance weighting. In Lafferty, J. D., Williams, C. K. I., Shawe-Taylor, J., Zemel, R. S., and Culotta, A., editors, Advances in Neural Information Processing Systems 23, pages 442–450. Curran Associates, Inc.
Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 1856–1865.
Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600.
References (cont.)
Kakade, S. and Langford, J. (2002). Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning.
Kleinberg, R., Slivkins, A., and Upfal, E. (2013). Bandits and experts in metric spaces. arXiv preprint arXiv:1312.1277.
Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22.
Lattimore, T. and Szepesvári, C. (2019). Bandit Algorithms. Cambridge University Press (preprint).
Metelli, A. M., Papini, M., Faccio, F., and Restelli, M. (2018). Policy optimization via importance sampling. In Advances in Neural Information Processing Systems, pages 5447–5459.
OpenAI (2018). OpenAI Five. https://blog.openai.com/openai-five/.
Papini, M., Metelli, A. M., Lupo, L., and Restelli, M. (2019). Optimistic policy optimization via multiple importance sampling. In Chaudhuri, K. and Salakhutdinov, R., editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 4989–4999, Long Beach, California, USA. PMLR.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
References (cont.)
Sehnke, F., Osendorfer, C., Rückstieß, T., Graves, A., Peters, J., and Schmidhuber, J. (2008). Policy gradients with parameter-based exploration for control. In International Conference on Artificial Neural Networks, pages 387–396. Springer.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.
Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294.
Veach, E. and Guibas, L. J. (1995). Optimally combining sampling techniques for Monte Carlo rendering. In Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '95, pages 419–428. ACM Press.
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.