1. Learning Contact-Rich Manipulation Skills with Guided Policy
Search Sergey Levine, Nolan Wagener, and Pieter Abbeel ICRA 2015
Presenter: Sungjoon Choi
2. Recent trends in Reinforcement Learning: Deep Neural Policy
Learning (based on my personal opinion, which may be somewhat
misleading). Presenter: Sungjoon Choi
3. Learning Contact-Rich Manipulation Skills with Guided Policy
Search http://rll.berkeley.edu/icra2015gps/
4. Learning Contact-Rich Manipulation Skills with Guided Policy
Search. This paper won the ICRA 2015 Best Manipulation Paper Award!
But why? What's so great about this paper? Personally, I think the
main contribution of this paper is a direct policy learning method
that can actually train a real-world robot to perform manipulation
tasks. That's it?? I guess so! By the way, actually training a
real-world robot is harder than you might imagine! You will see how
brilliant this paper is!
5. Brief review of MDP and RL
[Diagram: the agent sends an action to the environment and receives
an observation and a reward.]
6. Brief review of MDP and RL
[Diagram: the key ingredients of an MDP: state, action, reward,
value, policy, and model.]
7. Brief review of MDP and RL
Remember! The goal of MDP and RL is to find an optimal policy! It is
like saying "I will find a function which best satisfies the given
conditions." However, learning a function is not an easy problem.
(In fact, it is impossible unless we use some prior knowledge!) So,
instead of learning a function itself, most works find the
parameters of a function by restricting the solution space to a
space of certain parametric functions, such as linear functions.
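To make the "restrict to a parametric family" idea concrete, here is
a minimal Python sketch (my own illustration, not from the paper) of
a linear-Gaussian policy class; the names and dimensions are
hypothetical. Learning now means finding a finite set of parameters
rather than an arbitrary function.

```python
import numpy as np

class LinearGaussianPolicy:
    """Parametric policy: u ~ N(K x + k, sigma^2 I).

    Instead of searching over all possible functions pi(u|x),
    we only search over the parameters (K, k, sigma).
    """

    def __init__(self, state_dim, action_dim, sigma=0.1):
        self.K = np.zeros((action_dim, state_dim))  # feedback gain matrix
        self.k = np.zeros(action_dim)               # feedforward term
        self.sigma = sigma                          # exploration noise scale
        self._rng = np.random.default_rng()

    def act(self, x):
        mean = self.K @ x + self.k
        return mean + self.sigma * self._rng.standard_normal(mean.shape)

# Example: a 4-dimensional state and a 2-dimensional action.
policy = LinearGaussianPolicy(state_dim=4, action_dim=2)
u = policy.act(np.zeros(4))
```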
8. Brief review of MDP and RL
What are typical impediments in reinforcement learning? In other
words, why is it so HARD to find an optimal policy??
1. We are living in a continuous world, not a discrete grid world.
   - In a continuous world, the standard (tabular) MDP formulation
     cannot be applied directly.
   - So instead, we usually use function approximation to handle
     this issue.
2. However, linear functions do not work well in practice.
   - And, of course, nonlinear functions are hard to optimize.
3. The (dynamics) model, which is often required, is HARD to obtain.
   - Recall that the definition of value is the "expected sum of
     rewards", and computing this expectation requires a model.
Today's paper tackles all three problems listed above!!
9. Big Picture (which might be wrong)
RL: Reinforcement Learning; IRL: Inverse Reinforcement Learning;
IOC: Inverse Optimal Control; DPL: Direct Policy Learning;
LfD: Learning from Demonstration.

|             | Objective?                                  | What's given?                                    | What's NOT given?           | Algorithms                                                 |
|-------------|---------------------------------------------|--------------------------------------------------|-----------------------------|------------------------------------------------------------|
| RL          | Find optimal policy                         | Reward, dynamics model                           | Policy                      | Policy iteration, value iteration, TD learning, Q-learning |
| IRL (= IOC) | Find underlying reward, then optimal policy | Expert's demonstrations (often a dynamics model) | Reward, policy              | MaxEnt IRL, MaxMargin planning, apprenticeship learning    |
| DPL         | Find optimal policy                         | Expert's demonstrations, reward                  | Dynamics model (not always) | Guided policy search, constrained guided policy search     |
| LfD         | Find underlying reward, then optimal policy | Expert's demonstrations + others                 | Dynamics model (not always) | GP motion controller                                       |
10. The story so far (big picture):
- MDP is powerful, but it requires heavy computation to find the
  value function. Linearly solvable MDPs (LMDPs) address this. [1]
- Let's use the LMDP in the inverse optimal control problem! [2]
- How can we measure the probability of an expert's state-action
  sequence? [3]
- Can we learn a nonlinear reward function? [4]
- Can we do that with locally optimal examples? [5]
- Note that from here on the reward is given!! How can we
  effectively search for the optimal policy? Guided policy search! [6]
- Re-formalize guided policy search. [7]
- Let's learn both the dynamics model and the policy!! [8]
- Image-based control with a CNN!! [9]
- Applied to a real-world robot, PR2!! [10]
- The beginning of a new era (RL + deep learning)! [11] (latest)

References:
[1] Emanuel Todorov. "Linearly-solvable Markov decision problems." NIPS 2006.
[2] Krishnamurthy Dvijotham and Emanuel Todorov. "Inverse optimal control with linearly-solvable MDPs." ICML 2010.
[3] Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. "Maximum Entropy Inverse Reinforcement Learning." AAAI 2008.
[4] Sergey Levine, Zoran Popovic, and Vladlen Koltun. "Nonlinear inverse reinforcement learning with Gaussian processes." NIPS 2011.
[5] Sergey Levine and Vladlen Koltun. "Continuous inverse optimal control with locally optimal examples." ICML 2012.
[6] Sergey Levine and Vladlen Koltun. "Guided policy search." ICML 2013.
[7] Sergey Levine and Vladlen Koltun. "Learning complex neural network policies with trajectory optimization." ICML 2014.
[8] Sergey Levine and Pieter Abbeel. "Learning neural network policies with guided policy search under unknown dynamics." NIPS 2014.
[9] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. "End-to-End Training of Deep Visuomotor Policies." ICRA 2015.
[10] Sergey Levine, Nolan Wagener, and Pieter Abbeel. "Learning Contact-Rich Manipulation Skills with Guided Policy Search." ICRA 2015.
[11] Bradly C. Stadie, Sergey Levine, and Pieter Abbeel. "Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models." arXiv 2015.
11. Learning Contact-Rich Manipulation Skills with Guided Policy
Search
The main building block is Guided Policy Search (GPS), a direct
policy search algorithm that can effectively scale to
high-dimensional systems. GPS is a two-stage algorithm consisting of
a trajectory optimization stage and a policy learning stage.
Levine, Sergey, and Vladlen Koltun. "Guided policy search." ICML 2013
Levine, Sergey, and Vladlen Koltun. "Learning complex neural network
policies with trajectory optimization." ICML 2014
Levine, Sergey, and Pieter Abbeel. "Learning neural network policies
with guided policy search under unknown dynamics." NIPS 2014
12. Guided Policy Search
Stage 1) Trajectory optimization (iterative LQR). Given a reward
function and a dynamics model, this stage produces guiding
trajectories; each trajectory consists of (state, action) pairs.
Levine, Sergey, and Vladlen Koltun. "Guided policy search." ICML 2013
13. Iterative LQR
The iterative linear quadratic regulator optimizes a trajectory by
repeatedly solving for the optimal policy under linear-quadratic
assumptions: linear dynamics, $x_{t+1} = A_t x_t + B_t u_t$, and a
quadratic reward, $r(x_t, u_t) = x_t^\top Q_t x_t + u_t^\top R_t u_t$.
Levine, Sergey, and Vladlen Koltun. "Guided policy search." ICML 2013
14. Iterative LQR (continued)
Levine, Sergey, and Vladlen Koltun. "Guided policy search." ICML 2013
15. Iterative LQR
Iteratively compute a trajectory, fit a deterministic policy to that
trajectory, and recompute the trajectory until convergence. But this
only yields a deterministic policy, and we need something
stochastic! By exploiting the concepts of linearly solvable MDPs and
maximum entropy control, one can derive the following stochastic
policy.
Levine, Sergey, and Vladlen Koltun. "Guided policy search." ICML 2013
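A sketch of the standard maximum-entropy derivation behind that
policy (the notation here is mine): under the linear-quadratic
assumptions the Q-function is quadratic in the action, so
exponentiating it yields a Gaussian.

```latex
% Maximum-entropy control takes the policy proportional to the
% exponentiated Q-function. Since Q_t is quadratic in u_t under the
% linear-quadratic assumptions, the exponent is quadratic in u_t and
% the resulting policy is Gaussian, centered at the deterministic
% iLQR action, with covariance set by the curvature of Q_t in u_t:
\pi(u_t \mid x_t) \propto \exp\big(Q_t(x_t, u_t)\big)
\quad\Longrightarrow\quad
\pi(u_t \mid x_t) = \mathcal{N}\big(K_t x_t + k_t,\; -Q_{uu,t}^{-1}\big)
% (Q_{uu,t} is negative definite when maximizing reward, so
% -Q_{uu,t}^{-1} is a valid covariance.)
```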
17. Importance Sampled Policy Search
Importance sampled policy search finds the policy parameters
$\theta$ that maximize the following objective:

$$\Phi(\theta) = \sum_{t=1}^{T} \frac{1}{Z_t(\theta)} \sum_{i=1}^{m}
\frac{\pi_\theta(\zeta_{i,1:t})}{q(\zeta_{i,1:t})}\, r(x_t^i, u_t^i),
\qquad
Z_t(\theta) = \sum_{i=1}^{m}
\frac{\pi_\theta(\zeta_{i,1:t})}{q(\zeta_{i,1:t})}$$

Here $r$ is the reward (cost), $\pi_\theta$ is the neural policy
(the data-fitting term), and $q$ is the average of the guiding
distributions and/or the previous policy: it compensates for the
fact that the samples were not drawn from $\pi_\theta$, and the
normalization by $Z_t(\theta)$ lowers the variance of the estimate
(aiding exploration). The gradient of $\Phi(\theta)$ is analytic, so
the neural network can be trained by back-propagation.
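A minimal sketch of the self-normalized importance-sampling estimate
at the heart of this objective. This is my own simplification: it
weights whole trajectories rather than per-time-step prefixes as the
paper does, and all numbers are hypothetical.

```python
import numpy as np

def is_objective(logp_pi, logp_q, rewards):
    """Self-normalized importance-sampled return estimate.

    logp_pi[i] : log pi_theta(zeta_i) for sampled trajectory i
    logp_q[i]  : log q(zeta_i) under the sampling distribution
                 (guiding distributions / previous policies)
    rewards[i] : total reward of trajectory i

    Because samples come from q rather than pi_theta, each one is
    re-weighted by pi_theta/q; normalizing by the sum of weights
    keeps the variance of the estimate low.
    """
    logw = logp_pi - logp_q
    logw -= logw.max()              # subtract max for numerical stability
    w = np.exp(logw)
    return np.sum(w * rewards) / np.sum(w)

# Hypothetical numbers: 4 sampled trajectories.
phi = is_objective(np.array([-10.0, -12.0, -9.0, -11.0]),
                   np.array([-11.0, -11.0, -10.0, -10.0]),
                   np.array([5.0, 3.0, 6.0, 2.0]))
print(phi)
```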
18. Constrained Guided Policy Search
What if we don't know the dynamics of the robot? We can use
real-world trajectories to locally approximate the dynamics model.
Levine, Sergey, and Pieter Abbeel. "Learning neural network policies
with guided policy search under unknown dynamics." NIPS 2014
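A minimal sketch of the idea of fitting a local linear dynamics
model from rollout data. This is a simplification of mine: the paper
fits time-varying linear-Gaussian dynamics with a GMM prior, while
this sketch fits a single time-invariant model by least squares on
hypothetical data.

```python
import numpy as np

def fit_linear_dynamics(X, U, X_next):
    """Fit x_{t+1} ~ A x_t + B u_t + c by least squares on rollout data.

    X, U, X_next have shapes (N, dx), (N, du), (N, dx), collected from
    real-world trajectories around the current controller.
    """
    N, dx = X.shape
    du = U.shape[1]
    Z = np.hstack([X, U, np.ones((N, 1))])          # regressors [x; u; 1]
    W, *_ = np.linalg.lstsq(Z, X_next, rcond=None)  # (dx + du + 1, dx)
    A, B, c = W[:dx].T, W[dx:dx + du].T, W[-1]
    return A, B, c

# Hypothetical data: noisy rollouts of a true linear system.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
U = rng.standard_normal((200, 1))
A_true = np.array([[1.0, 0.1], [0.0, 1.0]])
B_true = np.array([[0.0], [0.1]])
X_next = X @ A_true.T + U @ B_true.T + 0.01 * rng.standard_normal((200, 2))
A, B, c = fit_linear_dynamics(X, U, X_next)     # recovers A_true, B_true
```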
19. Constrained Guided Policy Search
However, since this is only a local approximation, large deviations
from the previous trajectories might lead to disastrous optimization
results. A Gaussian mixture model is further used as a prior to
reduce the number of samples needed to fit the dynamics model. The
key idea: impose a constraint on the KL-divergence between the old
and new trajectory distributions!
Levine, Sergey, and Pieter Abbeel. "Learning neural network policies
with guided policy search under unknown dynamics." NIPS 2014
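In symbols, the trajectory optimization step then becomes a
trust-region problem of roughly this form (a sketch; here $\ell$ is
the trajectory cost, $\hat{p}$ the previous trajectory distribution,
and $\epsilon$ a step-size parameter):

```latex
% The new trajectory distribution p must stay close to the previous
% one, so that the locally fitted dynamics remain valid where the
% new controller will actually visit.
\min_{p}\; \mathbb{E}_{p}\big[\ell(\tau)\big]
\quad \text{s.t.} \quad
D_{\mathrm{KL}}\big(p(\tau)\,\|\,\hat{p}(\tau)\big) \le \epsilon
```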
20. Learning Contact-Rich Manipulation Skills with Guided Policy
Search
This paper uses constrained guided policy search to learn
contact-rich manipulation skills.
21. Learning Contact-Rich Manipulation Skills with Guided Policy
Search
Tasks: (a) stacking large lego blocks on a fixed base, (b) onto a
free-standing block, and (c) held in both grippers; (d) threading
wooden rings onto a tight-fitting peg; (e) assembling a toy airplane
by inserting the wheels into a slot; (f) inserting a shoe tree into
a shoe; (g, h) screwing caps onto pill bottles and (i) onto a water
bottle.
22. Learning Contact-Rich Manipulation Skills with Guided Policy
Search
The policy maps the state to 7 joint torques. The state consists of
(see the sketch after this list):
1. the current joint angles and velocities;
2. the Cartesian velocities of two or three points on the
   manipulated object;
3. the vectors from the target positions of these points to their
   current positions;
4. the torques applied at the previous time step.
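A small sketch of how such a state vector might be assembled. The
feature list is from the slide, but the ordering, the point count,
and the function name are my assumptions, chosen only to make the
dimensions concrete.

```python
import numpy as np

def build_state(q, qd, pt_vel, pt_err, prev_torque):
    """Concatenate the state features listed above into one vector.

    q, qd       : 7 joint angles and 7 joint velocities (PR2 arm)
    pt_vel      : Cartesian velocities of 2-3 points on the object
    pt_err      : vectors from the target positions of those points
                  to their current positions
    prev_torque : 7 torques applied at the previous time step
    """
    return np.concatenate([q, qd, pt_vel.ravel(),
                           pt_err.ravel(), prev_torque])

# With 3 tracked points: 7 + 7 + 9 + 9 + 7 = 39 state dimensions.
x = build_state(np.zeros(7), np.zeros(7),
                np.zeros((3, 3)), np.zeros((3, 3)), np.zeros(7))
```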
23. Conclusion
Constrained guided policy search is used to train a real-world PR2
robot to perform contact-rich tasks. The policy is modeled with a
neural network, and prior knowledge about the dynamics is NOT
required. Iterative LQR is used to define the guiding distributions,
which serve as proposal distributions in importance sampled policy
search.