1. Learning Contact-Rich Manipulation Skills with Guided Policy
Search Sergey Levine, Nolan Wagener, and Pieter Abbeel ICRA 2015
Presenter: Sungjoon Choi
2. Recent trends in Reinforcement Learning: Deep Neural Policy
Learning (based on my personal opinion, which may be somewhat
misleading). Presenter: Sungjoon Choi
3. Learning Contact-Rich Manipulation Skills with Guided Policy
Search http://rll.berkeley.edu/icra2015gps/
4. Learning Contact-Rich Manipulation Skills with Guided Policy
Search. This paper won the ICRA 2015 Best Manipulation Paper Award!
But why? What's so great about this paper? Personally, I think the
main contribution of this paper is a direct policy learning method
that can actually train a real-world robot to perform manipulation
tasks. That's it?? I guess so! By the way, actually training a
real-world robot is harder than you might imagine! You will see how
brilliant this paper is!
5. Brief review of MDP and RL
[Diagram: the agent sends an action to the environment and receives
an observation and a reward.]
6. Brief review of MDP and RL
[Diagram: the key ingredients of an MDP: state, action, reward,
value, policy, and model.]
7. Brief review of MDP and RL
Remember! The goal of MDP and RL is to find an optimal policy! It is
like saying "I will find a function which best satisfies the given
conditions." However, learning a function is not an easy problem.
(In fact, it is impossible unless we use some prior knowledge!) So,
instead of learning a function itself, most works find the
parameters of a function by restricting the solution space to a
space of certain parametric functions, such as linear functions.
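To make the "restrict to a parametric family" idea concrete, here is
a minimal Python sketch (my own illustration, not from the paper) of
a linear-Gaussian policy class; the names and dimensions are
hypothetical. Learning now means finding a finite set of parameters
rather than an arbitrary function.

```python
import numpy as np

class LinearGaussianPolicy:
    """Parametric policy: u ~ N(K x + k, sigma^2 I).

    Instead of searching over all possible functions pi(u|x),
    we only search over the parameters (K, k, sigma).
    """

    def __init__(self, state_dim, action_dim, sigma=0.1):
        self.K = np.zeros((action_dim, state_dim))  # feedback gain matrix
        self.k = np.zeros(action_dim)               # feedforward term
        self.sigma = sigma                          # exploration noise scale
        self._rng = np.random.default_rng()

    def act(self, x):
        mean = self.K @ x + self.k
        return mean + self.sigma * self._rng.standard_normal(mean.shape)

# Example: a 4-dimensional state and a 2-dimensional action.
policy = LinearGaussianPolicy(state_dim=4, action_dim=2)
u = policy.act(np.zeros(4))
```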
8. Brief review of MDP and RL
What are typical impediments in reinforcement learning? In other
words, why is it so HARD to find an optimal policy??
1. We are living in a continuous world, not a discrete grid world.
   - In a continuous world, the standard (tabular) MDP formulation
     cannot be applied directly.
   - So instead, we usually use function approximation to handle
     this issue.
2. However, linear functions do not work well in practice.
   - And, of course, nonlinear functions are hard to optimize.
3. The (dynamics) model, which is often required, is HARD to obtain.
   - Recall that the definition of value is the "expected sum of
     rewards", and computing this expectation requires a model.
Today's paper tackles all three problems listed above!!
9. Big Picture (which might be wrong)
RL: Reinforcement Learning; IRL: Inverse Reinforcement Learning;
IOC: Inverse Optimal Control; DPL: Direct Policy Learning;
LfD: Learning from Demonstration.

|             | Objective?                                  | What's given?                                    | What's NOT given?           | Algorithms                                                 |
|-------------|---------------------------------------------|--------------------------------------------------|-----------------------------|------------------------------------------------------------|
| RL          | Find optimal policy                         | Reward, dynamics model                           | Policy                      | Policy iteration, value iteration, TD learning, Q-learning |
| IRL (= IOC) | Find underlying reward, then optimal policy | Expert's demonstrations (often a dynamics model) | Reward, policy              | MaxEnt IRL, MaxMargin planning, apprenticeship learning    |
| DPL         | Find optimal policy                         | Expert's demonstrations, reward                  | Dynamics model (not always) | Guided policy search, constrained guided policy search     |
| LfD         | Find underlying reward, then optimal policy | Expert's demonstrations + others                 | Dynamics model (not always) | GP motion controller                                       |
10. The story so far (big picture):
- MDP is powerful, but it requires heavy computation to find the
  value function. Linearly solvable MDPs (LMDPs) address this. [1]
- Let's use the LMDP in the inverse optimal control problem! [2]
- How can we measure the probability of an expert's state-action
  sequence? [3]
- Can we learn a nonlinear reward function? [4]
- Can we do that with locally optimal examples? [5]
- Note that from here on the reward is given!! How can we
  effectively search for the optimal policy? Guided policy search! [6]
- Re-formalize guided policy search. [7]
- Let's learn both the dynamics model and the policy!! [8]
- Image-based control with a CNN!! [9]
- Applied to a real-world robot, PR2!! [10]
- The beginning of a new era (RL + deep learning)! [11] (latest)

References:
[1] Emanuel Todorov. "Linearly-solvable Markov decision problems." NIPS 2006.
[2] Krishnamurthy Dvijotham and Emanuel Todorov. "Inverse optimal control with linearly-solvable MDPs." ICML 2010.
[3] Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. "Maximum Entropy Inverse Reinforcement Learning." AAAI 2008.
[4] Sergey Levine, Zoran Popovic, and Vladlen Koltun. "Nonlinear inverse reinforcement learning with Gaussian processes." NIPS 2011.
[5] Sergey Levine and Vladlen Koltun. "Continuous inverse optimal control with locally optimal examples." ICML 2012.
[6] Sergey Levine and Vladlen Koltun. "Guided policy search." ICML 2013.
[7] Sergey Levine and Vladlen Koltun. "Learning complex neural network policies with trajectory optimization." ICML 2014.
[8] Sergey Levine and Pieter Abbeel. "Learning neural network policies with guided policy search under unknown dynamics." NIPS 2014.
[9] Sergey Levine, Chelsea Finn, Trevor Darrell, and Pieter Abbeel. "End-to-End Training of Deep Visuomotor Policies." ICRA 2015.
[10] Sergey Levine, Nolan Wagener, and Pieter Abbeel. "Learning Contact-Rich Manipulation Skills with Guided Policy Search." ICRA 2015.
[11] Bradly C. Stadie, Sergey Levine, and Pieter Abbeel. "Incentivizing Exploration In Reinforcement Learning With Deep Predictive Models." arXiv 2015.
11. Learning Contact-Rich Manipulation Skills with Guided Policy
Search
The main building block is Guided Policy Search (GPS), a direct
policy search algorithm that can effectively scale to
high-dimensional systems. GPS is a two-stage algorithm consisting of
a trajectory optimization stage and a policy learning stage.
Levine, Sergey, and Vladlen Koltun. "Guided policy search." ICML 2013
Levine, Sergey, and Vladlen Koltun. "Learning complex neural network
policies with trajectory optimization." ICML 2014
Levine, Sergey, and Pieter Abbeel. "Learning neural network policies
with guided policy search under unknown dynamics." NIPS 2014
12. Guided Policy Search
Stage 1) Trajectory optimization (iterative LQR). Given a reward
function and a dynamics model, this stage produces guiding
trajectories; each trajectory consists of (state, action) pairs.
Levine, Sergey, and Vladlen Koltun. "Guided policy search." ICML 2013
13. Iterative LQR
The iterative linear quadratic regulator optimizes a trajectory by
repeatedly solving for the optimal policy under linear-quadratic
assumptions: linear dynamics, $x_{t+1} = A_t x_t + B_t u_t$, and a
quadratic reward, $r(x_t, u_t) = x_t^\top Q_t x_t + u_t^\top R_t u_t$.
Levine, Sergey, and Vladlen Koltun. "Guided policy search." ICML 2013
14. Iterative LQR (continued)
Levine, Sergey, and Vladlen Koltun. "Guided policy search." ICML 2013
15. Iterative LQR
Iteratively compute a trajectory, fit a deterministic policy to that
trajectory, and recompute the trajectory until convergence. But this
only yields a deterministic policy, and we need something
stochastic! By exploiting the concepts of linearly solvable MDPs and
maximum entropy control, one can derive the following stochastic
policy.
Levine, Sergey, and Vladlen Koltun. "Guided policy search." ICML 2013
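A sketch of the standard maximum-entropy derivation behind that
policy (the notation here is mine): under the linear-quadratic
assumptions the Q-function is quadratic in the action, so
exponentiating it yields a Gaussian.

```latex
% Maximum-entropy control takes the policy proportional to the
% exponentiated Q-function. Since Q_t is quadratic in u_t under the
% linear-quadratic assumptions, the exponent is quadratic in u_t and
% the resulting policy is Gaussian, centered at the deterministic
% iLQR action, with covariance set by the curvature of Q_t in u_t:
\pi(u_t \mid x_t) \propto \exp\big(Q_t(x_t, u_t)\big)
\quad\Longrightarrow\quad
\pi(u_t \mid x_t) = \mathcal{N}\big(K_t x_t + k_t,\; -Q_{uu,t}^{-1}\big)
% (Q_{uu,t} is negative definite when maximizing reward, so
% -Q_{uu,t}^{-1} is a valid covariance.)
```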
17. Importance Sampled Policy Search
Importance sampled policy search finds the policy parameters
$\theta$ that maximize the following objective:

$$\Phi(\theta) = \sum_{t=1}^{T} \frac{1}{Z_t(\theta)} \sum_{i=1}^{m}
\frac{\pi_\theta(\zeta_{i,1:t})}{q(\zeta_{i,1:t})}\, r(x_t^i, u_t^i),
\qquad
Z_t(\theta) = \sum_{i=1}^{m}
\frac{\pi_\theta(\zeta_{i,1:t})}{q(\zeta_{i,1:t})}$$

Here $r$ is the reward (cost), $\pi_\theta$ is the neural policy
(the data-fitting term), and $q$ is the average of the guiding
distributions and/or the previous policy: it compensates for the
fact that the samples were not drawn from $\pi_\theta$, and the
normalization by $Z_t(\theta)$ lowers the variance of the estimate
(aiding exploration). The gradient of $\Phi(\theta)$ is analytic, so
the neural network can be trained by back-propagation.
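A minimal sketch of the self-normalized importance-sampling estimate
at the heart of this objective. This is my own simplification: it
weights whole trajectories rather than per-time-step prefixes as the
paper does, and all numbers are hypothetical.

```python
import numpy as np

def is_objective(logp_pi, logp_q, rewards):
    """Self-normalized importance-sampled return estimate.

    logp_pi[i] : log pi_theta(zeta_i) for sampled trajectory i
    logp_q[i]  : log q(zeta_i) under the sampling distribution
                 (guiding distributions / previous policies)
    rewards[i] : total reward of trajectory i

    Because samples come from q rather than pi_theta, each one is
    re-weighted by pi_theta/q; normalizing by the sum of weights
    keeps the variance of the estimate low.
    """
    logw = logp_pi - logp_q
    logw -= logw.max()              # subtract max for numerical stability
    w = np.exp(logw)
    return np.sum(w * rewards) / np.sum(w)

# Hypothetical numbers: 4 sampled trajectories.
phi = is_objective(np.array([-10.0, -12.0, -9.0, -11.0]),
                   np.array([-11.0, -11.0, -10.0, -10.0]),
                   np.array([5.0, 3.0, 6.0, 2.0]))
print(phi)
```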
18. Constrained Guided Policy Search
What if we don't know the dynamics of the robot? We can use
real-world trajectories to locally approximate the dynamics model.
Levine, Sergey, and Pieter Abbeel. "Learning neural network policies
with guided policy search under unknown dynamics." NIPS 2014
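A minimal sketch of the idea of fitting a local linear dynamics
model from rollout data. This is a simplification of mine: the paper
fits time-varying linear-Gaussian dynamics with a GMM prior, while
this sketch fits a single time-invariant model by least squares on
hypothetical data.

```python
import numpy as np

def fit_linear_dynamics(X, U, X_next):
    """Fit x_{t+1} ~ A x_t + B u_t + c by least squares on rollout data.

    X, U, X_next have shapes (N, dx), (N, du), (N, dx), collected from
    real-world trajectories around the current controller.
    """
    N, dx = X.shape
    du = U.shape[1]
    Z = np.hstack([X, U, np.ones((N, 1))])          # regressors [x; u; 1]
    W, *_ = np.linalg.lstsq(Z, X_next, rcond=None)  # (dx + du + 1, dx)
    A, B, c = W[:dx].T, W[dx:dx + du].T, W[-1]
    return A, B, c

# Hypothetical data: noisy rollouts of a true linear system.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
U = rng.standard_normal((200, 1))
A_true = np.array([[1.0, 0.1], [0.0, 1.0]])
B_true = np.array([[0.0], [0.1]])
X_next = X @ A_true.T + U @ B_true.T + 0.01 * rng.standard_normal((200, 2))
A, B, c = fit_linear_dynamics(X, U, X_next)     # recovers A_true, B_true
```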
19. Constrained Guided Policy Search
However, since this is only a local approximation, large deviations
from the previous trajectories might lead to disastrous optimization
results. A Gaussian mixture model is further used as a prior to
reduce the number of samples needed to fit the dynamics model. The
key idea: impose a constraint on the KL-divergence between the old
and new trajectory distributions!
Levine, Sergey, and Pieter Abbeel. "Learning neural network policies
with guided policy search under unknown dynamics." NIPS 2014
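In symbols, the trajectory optimization step then becomes a
trust-region problem of roughly this form (a sketch; here $\ell$ is
the trajectory cost, $\hat{p}$ the previous trajectory distribution,
and $\epsilon$ a step-size parameter):

```latex
% The new trajectory distribution p must stay close to the previous
% one, so that the locally fitted dynamics remain valid where the
% new controller will actually visit.
\min_{p}\; \mathbb{E}_{p}\big[\ell(\tau)\big]
\quad \text{s.t.} \quad
D_{\mathrm{KL}}\big(p(\tau)\,\|\,\hat{p}(\tau)\big) \le \epsilon
```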
20. Learning Contact-Rich Manipulation Skills with Guided Policy
Search
This paper uses constrained guided policy search to learn
contact-rich manipulation skills.
21. Learning Contact-Rich Manipulation Skills with Guided Policy
Search
Tasks: (a) stacking large lego blocks on a fixed base, (b) onto a
free-standing block, and (c) held in both grippers; (d) threading
wooden rings onto a tight-fitting peg; (e) assembling a toy airplane
by inserting the wheels into a slot; (f) inserting a shoe tree into
a shoe; (g, h) screwing caps onto pill bottles and (i) onto a water
bottle.
22. Learning Contact-Rich Manipulation Skills with Guided Policy
Search
The policy maps the state to 7 joint torques. The state consists of
(see the sketch after this list):
1. the current joint angles and velocities;
2. the Cartesian velocities of two or three points on the
   manipulated object;
3. the vectors from the target positions of these points to their
   current positions;
4. the torques applied at the previous time step.
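A small sketch of how such a state vector might be assembled. The
feature list is from the slide, but the ordering, the point count,
and the function name are my assumptions, chosen only to make the
dimensions concrete.

```python
import numpy as np

def build_state(q, qd, pt_vel, pt_err, prev_torque):
    """Concatenate the state features listed above into one vector.

    q, qd       : 7 joint angles and 7 joint velocities (PR2 arm)
    pt_vel      : Cartesian velocities of 2-3 points on the object
    pt_err      : vectors from the target positions of those points
                  to their current positions
    prev_torque : 7 torques applied at the previous time step
    """
    return np.concatenate([q, qd, pt_vel.ravel(),
                           pt_err.ravel(), prev_torque])

# With 3 tracked points: 7 + 7 + 9 + 9 + 7 = 39 state dimensions.
x = build_state(np.zeros(7), np.zeros(7),
                np.zeros((3, 3)), np.zeros((3, 3)), np.zeros(7))
```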
23. Conclusion
Constrained guided policy search is used to train a real-world PR2
robot to perform contact-rich tasks. The policy is modeled with a
neural network, and prior knowledge about the dynamics is NOT
required. Iterative LQR is used to define the guiding distributions,
which serve as proposal distributions in importance sampled policy
search.