Optimistic Policy Optimization via Multiple Importance Sampling
Matteo Papini, Alberto Maria Metelli, Lorenzo Lupo, Marcello Restelli
19th September 2019, Markets, Algorithms, Prediction and Learning (MAPLE) Workshop, Politecnico di Milano, Milano, Italy
Reinforcement Learning [Sutton and Barto, 2018]
Policy $\pi : \mathcal{S} \to \Delta(\mathcal{A})$
Trajectories $\tau = (s_0, a_0, r_1, s_1, \dots)$
Return $R(\tau) = \sum_{t=0}^{T-1} \gamma^t r_{t+1}$
Goal: $\max_\pi \mathbb{E}_\pi[R(\tau)]$
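To make the return concrete, here is a minimal Python sketch (not from the talk) computing $R(\tau)$ from a recorded reward sequence; the function name is ours:

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted return R(tau) = sum_t gamma^t * r_{t+1} of one trajectory."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# A 3-step trajectory with rewards (r_1, r_2, r_3) = (0, 0, 1) and gamma = 0.9
print(discounted_return([0.0, 0.0, 1.0], gamma=0.9))  # 0.81 (up to float rounding)
```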
Exploration vs Exploitation
Continuous Reinforcement Learning
Policy Optimization
Parameter space $\Theta \subseteq \mathbb{R}^d$
A parametric policy $\pi_\theta$ for each $\theta \in \Theta$
Each inducing a distribution $p_\theta$ over trajectories
Goal: $\max_{\theta \in \Theta} J(\theta) = \mathbb{E}_{\tau \sim p_\theta}[R(\tau)]$
[Figure: each parameter $\theta \in \Theta$ maps to a trajectory distribution $p_\theta$ over $\mathcal{T}$ and a value $J(\theta)$]
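As a hedged illustration of this parameter-based view, a self-contained Python sketch with a made-up 1-D environment and a linear-Gaussian policy; `rollout`, `J_hat`, the dynamics, and the reward are all hypothetical, and $J(\theta)$ is estimated by plain Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout(theta, horizon=20, gamma=0.99):
    """Sample one trajectory tau ~ p_theta in a made-up 1-D environment
    and return R(tau). Dynamics and reward are hypothetical."""
    s, ret = 1.0, 0.0
    for t in range(horizon):
        a = theta[0] * s + theta[1] + rng.normal(0.0, 0.1)  # linear-Gaussian policy
        ret += gamma ** t * (-s ** 2)        # reward: stay close to the origin
        s = s + a + rng.normal(0.0, 0.01)    # toy transition
    return ret

def J_hat(theta, n=100):
    """Monte Carlo estimate of J(theta) = E_{tau ~ p_theta}[R(tau)]."""
    return np.mean([rollout(theta) for _ in range(n)])

print(J_hat(np.array([-0.5, 0.0])))  # higher is better
```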
Policy Gradient Methods
Gradient ascent on $J(\theta)$
Popular algorithms: REINFORCE [Williams, 1992], PGPE [Sehnke et al., 2008], TRPO [Schulman et al., 2015], PPO [Schulman et al., 2017]
Applications: Dota 2 [OpenAI, 2018], dexterous manipulation [Andrychowicz et al., 2018]
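A minimal sketch of what gradient ascent on $J(\theta)$ can look like, using the plain REINFORCE estimator $\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_i R(\tau_i) \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ in the same toy linear-Gaussian setting as above (illustrative only; practical implementations add baselines and per-step credit assignment):

```python
import numpy as np

rng = np.random.default_rng(0)
SIGMA = 0.2  # fixed policy standard deviation

def grad_log_pi(theta, s, a):
    """grad_theta of log N(a; theta[0]*s + theta[1], SIGMA^2)."""
    mu = theta[0] * s + theta[1]
    return (a - mu) / SIGMA ** 2 * np.array([s, 1.0])

def reinforce_gradient(theta, n_traj=50, horizon=20, gamma=0.99):
    """Estimate grad J(theta) as the average of R(tau) * sum_t grad log pi(a_t | s_t)."""
    grads = []
    for _ in range(n_traj):
        s, ret, score = 1.0, 0.0, np.zeros(2)
        for t in range(horizon):
            a = theta[0] * s + theta[1] + rng.normal(0.0, SIGMA)
            score += grad_log_pi(theta, s, a)
            ret += gamma ** t * (-s ** 2)       # same toy reward as above
            s = s + a + rng.normal(0.0, 0.01)   # same toy dynamics as above
        grads.append(ret * score)
    return np.mean(grads, axis=0)

theta = np.array([-0.5, 0.0])
theta = theta + 0.01 * reinforce_gradient(theta)  # one gradient-ascent step
```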
Exploration in Policy Optimization
Policy gradient fails with sparse rewards [Kakade and Langford, 2002]
Non-convex objective ⟹ local optima
Entropy bonus [Haarnoja et al., 2018]:
- Undirected
- Unsafe
- Little theoretical understanding [Ahmed et al., 2018]
Multi-Armed Bandit (MAB)
Arms $a \in \mathcal{A}$
Expected payoff $\mu(a)$
Goal: $\min \mathrm{Regret}(T) = \sum_{t=1}^{T} \left[ \mu(a^\star) - \mu(a_t) \right]$
Wide literature on directed exploration [Bubeck et al., 2012, Lattimore and Szepesvári, 2019]
Optimism in the Face of Uncertainty
OFU strategy (e.g., UCB [Lai and Robbins, 1985]):
$a_t = \arg\max_{a \in \mathcal{A}} \underbrace{\hat{\mu}(a)}_{\text{estimate}} + C \underbrace{\sqrt{\frac{\log(1/\delta)}{\#a}}}_{\text{exploration bonus}}$
Idea: be optimistic about unknown arms
Can be applied to RL (e.g., Jaksch et al. [2010])
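A sketch of an OFU bandit loop implementing an index of this form, with made-up Bernoulli arms; it uses the fixed-confidence bonus shown on the slide rather than the classic UCB1 $\log t$ bonus, and also tracks the regret defined two slides back:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.3, 0.5, 0.7])     # true (unknown) expected payoffs, made up
K, T, C, delta = len(mu), 2000, 1.0, 0.05

counts = np.zeros(K)               # #a: number of pulls per arm
sums = np.zeros(K)                 # running payoff sums
regret = 0.0

for t in range(T):
    if t < K:
        a = t                      # pull each arm once to initialize
    else:
        ucb = sums / counts + C * np.sqrt(np.log(1 / delta) / counts)
        a = int(np.argmax(ucb))    # optimism: estimate + exploration bonus
    reward = float(rng.random() < mu[a])   # Bernoulli payoff
    counts[a] += 1
    sums[a] += reward
    regret += mu.max() - mu[a]     # expected regret of this pull

print(f"cumulative regret after {T} pulls: {regret:.1f}")
```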
Policy Optimization as a MAB
Arms: parameters $\theta$
Payoff: expected return $J(\theta)$
Continuous MAB: we need structure [Kleinberg et al., 2013]
$\theta_t = \arg\max_{\theta \in \Theta} \hat{J}(\theta) + C \sqrt{\frac{\log(1/\delta)}{\#\theta}}$
[Figure: parameters $\theta_A, \theta_B \in \Theta$ viewed as bandit arms with payoffs $J(\theta_A)$, $J(\theta_B)$]
Exploiting Arm Correlation
Arms correlate through overlapping trajectory distributions
Use Importance Sampling (IS) to transfer information:
$J(\theta_B) = \mathbb{E}_{\tau \sim p_{\theta_A}}\!\left[ \frac{p_{\theta_B}(\tau)}{p_{\theta_A}(\tau)} R(\tau) \right]$
[Figure: overlapping trajectory distributions $p_{\theta_A}$ and $p_{\theta_B}$ in trajectory space $\mathcal{T}$]
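A toy illustration of this transfer (an assumption-laden simplification, not the paper's setting): trajectories are collapsed to one-dimensional Gaussian samples so that the densities $p_\theta$ are available in closed form, and $J(\theta_B)$ is estimated from samples drawn under $\theta_A$:

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mu, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def R(tau):
    # Made-up payoff standing in for the trajectory return R(tau)
    return np.clip(tau, -2.0, 2.0)

theta_A, theta_B = 0.0, 0.5          # each theta indexes a Gaussian over "trajectories"
tau = rng.normal(theta_A, 1.0, size=10_000)   # behavioral samples tau ~ p_{theta_A}

w = gauss_pdf(tau, theta_B) / gauss_pdf(tau, theta_A)   # importance weights
J_B_is = np.mean(w * R(tau))         # IS estimate of J(theta_B) from p_{theta_A} data
J_B_mc = np.mean(R(rng.normal(theta_B, 1.0, size=10_000)))  # on-policy sanity check
print(J_B_is, J_B_mc)
```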
The OPTIMIST index [Papini et al., 2019]
A UCB-like index:
$\theta_t = \arg\max_{\theta \in \Theta} \underbrace{\check{J}_t(\theta)}_{\text{estimate}} + C \underbrace{\sqrt{\frac{d_2(p_\theta \| \Phi_t) \log(1/\delta_t)}{t}}}_{\text{exploration bonus}}$
where the estimate is a robust multiple importance sampling estimator and the exploration bonus measures the distributional distance from previous solutions.
[Figure: the MIS mixture combines the past distributions $p_{\theta_1}, \dots, p_{\theta_{t-1}}$ in trajectory space $\mathcal{T}$]
Robust Multiple Importance Sampling Estimator
Use Multiple Importance Sampling (MIS) [Veach and Guibas, 1995] to reuse all past experience
Use dynamic truncation to prevent heavy tails [Bubeck et al., 2013, Metelli et al., 2018]
Plain MIS estimator:
$\hat{J}_t(\theta) = \frac{1}{t} \sum_{k=0}^{t-1} \underbrace{\frac{p_\theta(\tau_k)}{\Phi_t(\tau_k)}}_{\text{MIS weight}} R(\tau_k), \qquad \Phi_t(\tau) = \underbrace{\frac{1}{t} \sum_{k=0}^{t-1} p_{\theta_k}(\tau)}_{\text{mixture}}$
Truncated MIS estimator:
$\check{J}_t(\theta) = \frac{1}{t} \sum_{k=0}^{t-1} \min\left\{ M_t, \frac{p_\theta(\tau_k)}{\Phi_t(\tau_k)} \right\} R(\tau_k), \qquad M_t = \underbrace{\sqrt{\frac{t\, d_2(p_\theta \| \Phi_t)}{\log(1/\delta_t)}}}_{\text{threshold}}$
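A sketch of the truncated estimator in a toy Gaussian setting, with one trajectory per past parameter; `J_check` and `gauss_pdf` are our names, and $d_2$ is passed in as a known constant, whereas in practice it is computed or bounded for the policy class at hand:

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mu, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def J_check(theta, thetas_past, taus, returns, d2, delta_t=0.1):
    """Truncated MIS estimate of J(theta): one trajectory tau_k per past theta_k,
    weights p_theta(tau_k) / Phi_t(tau_k) clipped at the threshold M_t."""
    t = len(thetas_past)
    phi = np.mean([gauss_pdf(taus, th) for th in thetas_past], axis=0)  # Phi_t(tau_k)
    w = gauss_pdf(taus, theta) / phi                                    # MIS weights
    M_t = np.sqrt(t * d2 / np.log(1.0 / delta_t))                       # truncation
    return np.mean(np.minimum(M_t, w) * returns)

# Hypothetical history: parameters tried so far, one sampled "trajectory" each
thetas_past = [0.0, 0.2, 0.4]
taus = np.array([rng.normal(th, 1.0) for th in thetas_past])
returns = np.clip(taus, -2.0, 2.0)
print(J_check(0.5, thetas_past, taus, returns, d2=1.3))
```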
Exploration Bonus
Measure novelty with the exponentiated Rényi divergence [Cortes et al., 2010, Metelli et al., 2018]:
$d_2(p_\theta \| \Phi_t) = \int \left( \frac{\mathrm{d}p_\theta}{\mathrm{d}\Phi_t} \right)^2 \mathrm{d}\Phi_t$
Used to upper bound the true value (OFU): with high probability,
$J(\theta) \le \check{J}_t(\theta) + C \sqrt{\frac{d_2(p_\theta \| \Phi_t) \log(1/\delta_t)}{t}}$
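For equal-variance Gaussian parameter distributions (the parameter-based setting), $d_2$ has a simple closed form, $d_2(P \| Q) = \exp((\mu_P - \mu_Q)^2 / \sigma^2)$; a sketch of the divergence and the resulting bonus, taken against a single past Gaussian rather than the mixture $\Phi_t$, which in general must be bounded:

```python
import numpy as np

def d2_gauss(mu_p, mu_q, sigma=1.0):
    """Exponentiated 2-Renyi divergence d_2(P || Q) = E_Q[(dP/dQ)^2]
    for P = N(mu_p, sigma^2) and Q = N(mu_q, sigma^2)."""
    return np.exp((mu_p - mu_q) ** 2 / sigma ** 2)

def bonus(d2, t, delta_t, C=1.0):
    """Exploration bonus C * sqrt(d_2 * log(1/delta_t) / t)."""
    return C * np.sqrt(d2 * np.log(1.0 / delta_t) / t)

# Novel parameters (large divergence) get a large bonus; familiar ones a small one
print(bonus(d2_gauss(0.5, 0.0), t=10, delta_t=0.1))
print(bonus(d2_gauss(0.05, 0.0), t=10, delta_t=0.1))
```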
Sublinear Regret
$\mathrm{Regret}(T) = \sum_{t=0}^{T} \left[ J(\theta^\star) - J(\theta_t) \right]$
Compact, $d$-dimensional parameter space $\Theta$
Under mild assumptions on the policy class, with high probability:
$\mathrm{Regret}(T) = \tilde{O}\left( \sqrt{dT} \right)$
OPTIMIST in Practice
Easy implementation only for parameter-based exploration [Sehnke et al., 2008]
Difficult index optimization ⟹ discretization
Computational time can be traded off with regret:
$\tilde{O}\left( d T^{1-\varepsilon/d} \right)$ regret ⟹ $O\left( t^{1+\varepsilon} \right)$ time
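Putting the pieces together, a self-contained toy sketch of an OPTIMIST-style loop on a discretized 1-D parameter grid; the distributions, the payoff, and the single-Gaussian stand-in for $d_2(p_\theta \| \Phi_t)$ are all simplifying assumptions, so this shows the shape of the algorithm rather than the paper's implementation (see the repository linked below):

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mu, sigma=1.0):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def d2_gauss(mu_p, mu_q, sigma=1.0):
    # Exponentiated 2-Renyi divergence between equal-variance Gaussians
    return np.exp((mu_p - mu_q) ** 2 / sigma ** 2)

grid = np.linspace(-1.0, 1.0, 41)          # discretized parameter space Theta
thetas, taus, rets = [], [], []
theta = 0.0                                 # initial parameter

for t in range(1, 200):
    tau = rng.normal(theta, 1.0)            # one "trajectory" from p_theta
    thetas.append(theta)
    taus.append(tau)
    rets.append(float(np.clip(tau, -2.0, 2.0)))   # made-up payoff R(tau)
    ts, rs = np.array(taus), np.array(rets)
    delta_t = 1.0 / (t + 1)
    phi = np.mean([gauss_pdf(ts, th) for th in thetas], axis=0)  # mixture Phi_t
    index = []
    for th in grid:
        d2 = d2_gauss(th, float(np.mean(thetas)))  # crude stand-in for d2(p_th || Phi_t)
        M = np.sqrt(t * d2 / np.log(1 / delta_t))                 # truncation threshold M_t
        J = np.mean(np.minimum(M, gauss_pdf(ts, th) / phi) * rs)  # truncated MIS estimate
        index.append(J + np.sqrt(d2 * np.log(1 / delta_t) / t))   # + exploration bonus (C = 1)
    theta = float(grid[np.argmax(index)])   # optimistic choice for the next round

print("final theta:", theta)
```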
Empirical Results
[Plot: River Swim, cumulative return vs. episodes (0 to 5,000); OPTIMIST vs. PGPE]
[Plot: Mountain Car, cumulative return vs. episodes (0 to 5,000); OPTIMIST vs. PGPE and PB-POIS]
Future Work
Extend to action-based exploration
Improve index optimization
Posterior sampling [Thompson, 1933]
The Abstract Problem [Chu et al., 2019]
Outcome space $\mathcal{Z}$
Decision set $\mathcal{P} \subseteq \Delta(\mathcal{Z})$
Payoff $f : \mathcal{Z} \to \mathbb{R}$
Goal: $\max_{p \in \mathcal{P}} \mathbb{E}_{z \sim p}[f(z)]$
Thank you for your attention!
Papini, Matteo, Alberto Maria Metelli, Lorenzo Lupo, and Marcello Restelli. "Optimistic Policy Optimization via Multiple Importance Sampling." In International Conference on Machine Learning, pp. 4989-4999. 2019.
Code: github.com/WolfLo/optimist
Contact: [email protected]
Web page: t3p.github.io/icml19
References
Ahmed, Z., Roux, N. L., Norouzi, M., and Schuurmans, D. (2018). Understanding the impact of entropy on policy optimization. arXiv preprint arXiv:1811.11214.
Andrychowicz, M., Baker, B., Chociej, M., Jozefowicz, R., McGrew, B., Pachocki, J., Petron, A., Plappert, M., Powell, G., Ray, A., et al. (2018). Learning dexterous in-hand manipulation. arXiv preprint arXiv:1808.00177.
Bubeck, S., Cesa-Bianchi, N., et al. (2012). Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends® in Machine Learning, 5(1):1–122.
Bubeck, S., Cesa-Bianchi, N., and Lugosi, G. (2013). Bandits with heavy tail. IEEE Transactions on Information Theory, 59(11):7711–7717.
Chu, C., Blanchet, J., and Glynn, P. (2019). Probability functional descent: A unifying perspective on GANs, variational inference, and reinforcement learning. In International Conference on Machine Learning, pages 1213–1222.
Cortes, C., Mansour, Y., and Mohri, M. (2010). Learning bounds for importance weighting. In Lafferty, J. D., Williams, C. K. I., Shawe-Taylor, J., Zemel, R. S., and Culotta, A., editors, Advances in Neural Information Processing Systems 23, pages 442–450. Curran Associates, Inc.
Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, pages 1856–1865.
Jaksch, T., Ortner, R., and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600.
References (cont.)
Kakade, S. and Langford, J. (2002). Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning.
Kleinberg, R., Slivkins, A., and Upfal, E. (2013). Bandits and experts in metric spaces. arXiv preprint arXiv:1312.1277.
Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22.
Lattimore, T. and Szepesvári, C. (2019). Bandit Algorithms. Cambridge University Press (preprint).
Metelli, A. M., Papini, M., Faccio, F., and Restelli, M. (2018). Policy optimization via importance sampling. In Advances in Neural Information Processing Systems, pages 5447–5459.
OpenAI (2018). OpenAI Five. https://blog.openai.com/openai-five/.
Papini, M., Metelli, A. M., Lupo, L., and Restelli, M. (2019). Optimistic policy optimization via multiple importance sampling. In Chaudhuri, K. and Salakhutdinov, R., editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 4989–4999, Long Beach, California, USA. PMLR.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
References (cont.)
Sehnke, F., Osendorfer, C., Rückstieß, T., Graves, A., Peters, J., and Schmidhuber, J. (2008). Policy gradients with parameter-based exploration for control. In International Conference on Artificial Neural Networks, pages 387–396. Springer.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.
Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294.
Veach, E. and Guibas, L. J. (1995). Optimally combining sampling techniques for Monte Carlo rendering. In Proceedings of the 22nd Annual Conference on Computer Graphics and Interactive Techniques, SIGGRAPH '95, pages 419–428. ACM Press.
Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.