Overview
Policy Gradient Algorithms: RL for Quadruped Locomotion, the PEGASUS Algorithm, Autonomous Helicopter Flight, High-Speed Obstacle Avoidance
RL for Biped Locomotion: Poincare-Map RL, Dynamic Planning
Hierarchical Approach: RL for Acquisition of a Robot Stand-Up Behavior
RL for Quadruped Locomotion [Kohl04]
A simple policy-gradient example: optimize the gait of the Sony Aibo robot.
Use a parameterized policy with 12 parameters:
Front and rear locus (height, x-pos, y-pos); height of the front and the rear of the body; …
Quadruped Locomotion
Policy: no notion of state, i.e. open-loop control. Start with an initial policy. Generate t = 15 random perturbations R_i of the current policy.
Evaluate the value of each perturbed policy on the real robot, estimate the gradient for each parameter, and update the policy in the direction of the gradient.
Quadruped Locomotion
The value of a policy is its walking speed, estimated by an automated process on the Aibos. Each policy is evaluated 3 times; one iteration (3 x 15 evaluations) takes 7.5 minutes.
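A minimal Python sketch of this finite-difference policy-gradient loop, following [Kohl04]; the perturbation sizes EPS, the step size STEP, and all function names are assumptions:

```python
import random

EPS = [0.05] * 12   # assumed per-parameter perturbation size
STEP = 0.1          # assumed step size along the estimated gradient

def estimate_gradient(theta, evaluate, t=15):
    # Generate t random perturbations: each parameter moves by -eps, 0 or +eps.
    trials = []
    for _ in range(t):
        signs = [random.choice((-1, 0, 1)) for _ in theta]
        policy = [p + s * e for p, s, e in zip(theta, signs, EPS)]
        # evaluate() returns the measured walking speed; on the real
        # Aibo each policy was evaluated 3 times and averaged.
        trials.append((signs, evaluate(policy)))

    # Per parameter, compare the average score of the +eps, 0 and -eps groups.
    grad = []
    for i in range(len(theta)):
        groups = {-1: [], 0: [], 1: []}
        for signs, score in trials:
            groups[signs[i]].append(score)
        avg = {s: sum(v) / len(v) if v else 0.0 for s, v in groups.items()}
        if avg[0] > avg[1] and avg[0] > avg[-1]:
            grad.append(0.0)   # leaving the parameter unchanged scored best
        else:
            grad.append(avg[1] - avg[-1])
    return grad

def update_policy(theta, grad):
    # Normalize the gradient and take a fixed-size step in its direction.
    norm = sum(g * g for g in grad) ** 0.5 or 1.0
    return [p + STEP * g / norm for p, g in zip(theta, grad)]
```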
PEGASUS [Ng00]
Policy gradient algorithms use a finite time horizon and evaluate the value of a policy directly. In a stochastic environment this value is hard to estimate, so policy search becomes a stochastic optimization process.
PEGASUS: for all policy evaluation trials, use a fixed set of start states (scenarios) and a fixed randomization (fixed random seeds) for the policy evaluation.
This only works in simulation! Identical conditions for each evaluation trial turn policy search into a deterministic optimization process, which can be solved by any optimization method. Commonly used: gradient ascent, random hill climbing.
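A minimal Python sketch of the idea; `simulate` and its arguments are assumptions. The point is that every trial reuses a fixed scenario (start state plus random seed), so the estimated value becomes a deterministic function of the policy:

```python
import random

def make_value_estimator(simulate, start_states, n_scenarios=30, horizon=100):
    # Fix the scenarios once: a start state and a random seed per trial.
    master = random.Random(0)
    scenarios = [(master.choice(start_states), master.randrange(2**31))
                 for _ in range(n_scenarios)]

    def value(policy):
        total = 0.0
        for start, seed in scenarios:
            # simulate() must draw ALL of its randomness from this rng;
            # re-seeding makes each trial reproduce the same noise.
            rng = random.Random(seed)
            total += simulate(policy, start, rng, horizon)
        return total / n_scenarios

    return value  # deterministic: any optimizer (e.g. hill climbing) applies
```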
Autonomous Helicopter Flight [Ng04a, Ng04b]
Autonomously learn to fly an unmanned helicopter worth $70,000: exploration can be catastrophic!
Learn the dynamics from observations of a human pilot.
Use PEGASUS to learn to hover, to fly complex maneuvers, and to fly inverted.
Helicopter Flight: Model Identification
12-dimensional state space: world coordinates (position + rotation) and their velocities.
4-dimensional actions: rotor-plane pitch in two directions (cyclic), rotor blade tilt (collective), and tail rotor tilt.
Actions are selected every 20 ms.
Helicopter Flight: Model Identification
A human pilot flies the helicopter while the data is logged. 391 s of training data; the state is reduced to 8 dimensions (position can be estimated from the velocities).
Learn the transition probabilities P(s_{t+1} | s_t, a_t) by supervised learning with locally weighted linear regression, and model Gaussian noise to obtain a stochastic model. A simulator was implemented for model validation.
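A sketch of one-step prediction with locally weighted linear regression, the model class used in [Ng04b]; the Gaussian kernel and bandwidth are assumptions:

```python
import numpy as np

def lwlr_predict(query, X, Y, bandwidth=1.0):
    """Predict the next state for `query` = [state, action] from logged data.

    X: (N, d) logged state-action vectors, Y: (N, d_s) observed next states.
    """
    # Gaussian kernel weights centered on the query point.
    w = np.exp(-np.sum((X - query) ** 2, axis=1) / (2.0 * bandwidth ** 2))
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # add a bias column
    WX = Xb * w[:, None]
    # Weighted least squares: beta = (Xb' W Xb)^-1 Xb' W Y
    beta = np.linalg.solve(Xb.T @ WX, Xb.T @ (Y * w[:, None]))
    mean = np.append(query, 1.0) @ beta
    # [Ng04b] adds Gaussian noise to this mean for the stochastic model.
    return mean
```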
Helicopter Flight: Hover Control
Hover at a desired position using a very simple policy class: the edges (connections) of the policy network are chosen using human prior knowledge, so learning essentially tunes the more or less linear gains of the controller.
Quadratic reward function: punishment for deviation from the desired position and orientation.
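A sketch of such a quadratic reward in Python; the 12-D state layout and the weights are assumptions:

```python
import numpy as np

# Assumed weights: penalize position/rotation errors more than velocities.
W = np.array([1.0] * 3 + [1.0] * 3 + [0.1] * 6)

def hover_reward(state, target):
    """state, target: 12-D vectors [position(3), rotation(3), velocities(6)]."""
    err = state - target          # deviation from the desired hover point
    return -float(np.dot(W, err ** 2))
```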
Helicopter Flight: Flying Maneuvers
Fly 3 maneuvers from the most difficult RC helicopter competition class.
Trajectory following: punish the distance from the projected point on the trajectory, with an additional reward for making progress along the trajectory.
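A sketch of a trajectory-following reward of this shape; the projection onto a piecewise-linear trajectory and the weights are assumptions:

```python
import numpy as np

def project_on_trajectory(pos, waypoints):
    """Return (distance to the projected point, arc length of that point)."""
    best_d, best_arc, arc = np.inf, 0.0, 0.0
    for a, b in zip(waypoints[:-1], waypoints[1:]):
        seg = b - a
        seg_len = np.linalg.norm(seg)
        t = np.clip(np.dot(pos - a, seg) / np.dot(seg, seg), 0.0, 1.0)
        d = np.linalg.norm(pos - (a + t * seg))
        if d < best_d:
            best_d, best_arc = d, arc + t * seg_len
        arc += seg_len
    return best_d, best_arc

def trajectory_reward(pos, waypoints, prev_arc, w_dist=1.0, w_prog=0.1):
    # Punish distance from the projected point, reward progress along it.
    dist, arc = project_on_trajectory(pos, waypoints)
    return -w_dist * dist ** 2 + w_prog * (arc - prev_arc), arc
```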
Helicopter Flight: Inverted Flight
Very difficult for humans: the inverted helicopter is unstable!
Data was recollected for inverted flight, using the same methods as before. Learned in 4 days, from data collection to flight experiment.
The result is a stable inverted flight controller with sustained position.
Video
High Speed Obstacle Avoidance [Michels05]
Obstacle avoidance with an RC car in unstructured environments.
Estimate depth information from monocular cues, and learn a controller for obstacle avoidance with PEGASUS in a graphical simulation. Does it work in the real environment?
Estimating Depth Information
Supervised learning: divide the image into 16 vertical stripes and use the features of each stripe and its neighboring stripes as input vectors.
Target values (the shortest distance within a stripe) come either from the simulation or from laser range finders; the model is a linear regression.
Output of the vision system: the angle of the stripe with the largest distance, and the distance of that stripe.
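A sketch of the two stages in Python; the feature extraction is omitted, and the mapping from stripe index to steering angle (via an assumed field of view) is an assumption:

```python
import numpy as np

def train_depth_model(features, distances):
    """Linear regression from stripe features to the shortest distance in
    the stripe; targets come from simulation or a laser range finder.
    features: (N, d) vectors (own stripe + neighbors), distances: (N,)."""
    Xb = np.hstack([features, np.ones((features.shape[0], 1))])
    beta, *_ = np.linalg.lstsq(Xb, distances, rcond=None)
    return beta

def vision_output(stripe_features, beta, fov=np.radians(90)):
    """Return the angle of the stripe with the largest predicted distance
    and that distance, for stripes spanning an assumed field of view."""
    n = stripe_features.shape[0]
    Xb = np.hstack([stripe_features, np.ones((n, 1))])
    pred = Xb @ beta
    best = int(np.argmax(pred))
    angle = (best + 0.5) / n * fov - fov / 2.0   # stripe index -> heading
    return angle, float(pred[best])
```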
Obstacle Avoidance: Control
Again, a very simple policy with only 6 parameters is used.
Reward: deviation from the desired speed and the number of crashes.
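A sketch of a reward with these two terms; the weights are assumptions:

```python
def driving_reward(speed, desired_speed, crashed,
                   w_speed=1.0, crash_penalty=10.0):
    # Penalize deviation from the desired speed, and penalize crashes.
    reward = -w_speed * abs(speed - desired_speed)
    if crashed:
        reward -= crash_penalty
    return reward
```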
Obstacle Avoidance: Results
Using a graphical simulation to train the vision system also works for outdoor environments
Video
RL for Biped Robots
RL for biped robots is often applied only to simplified planar models.
Poincare-map based RL [Morimoto04]; dynamic planning [Stilman05].
Other examples of RL on real robots strongly simplify the problem, e.g. [Zhou03].
Poincare Map-Based RL
Improve walking controllers with RL. A Poincare map relates the intersection points of an n-dimensional trajectory with an (n-1)-dimensional hyperplane; here it is used to predict the state of the biped a half cycle ahead, at fixed phases of the gait cycle.
Poincare Map
Learn the mapping:
Input space: x = (d, d'), the distance between stance foot and body and its velocity.
Action space: modulate the via-points of the joint trajectories.
Function approximator: Receptive Field Weighted Regression (RFWR) with a fixed grid, as sketched below.
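A sketch of prediction with receptive fields on a fixed grid; the incremental update rules of full RFWR are omitted, and the Gaussian receptive fields and grid layout are assumptions:

```python
import numpy as np

class FixedGridRFWR:
    def __init__(self, centers, bandwidth, dim):
        self.centers = np.asarray(centers)   # (K, dim) fixed grid of centers
        self.bandwidth = bandwidth
        # One local linear model (weights + bias) per receptive field.
        self.beta = np.zeros((len(centers), dim + 1))

    def predict(self, x):
        # Gaussian activation of each receptive field.
        w = np.exp(-np.sum((self.centers - x) ** 2, axis=1)
                   / (2.0 * self.bandwidth ** 2))
        local = self.beta @ np.append(x, 1.0)   # local model predictions
        return float(np.dot(w, local) / (np.sum(w) + 1e-9))
```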
Via Points
Nominal trajectories are taken from human walking patterns. The control output is used to modulate hand-selected via-points; the via-points of one joint are all incremented by the same amount.
Learning the Value Function
Reward function: 0.1 if the height of the robot is above 0.35 m, -0.1 otherwise.
Standard semi-MDP update rules; the value function only needs to be learned at the Poincare sections.
Model-based actor-critic approach: the actor A is updated using the learned Poincare map and the gradient of the value function.
Dynamic Programming for Biped Locomotion [Stilman05]
A 4-link planar robot; dynamic programming in reduced dimensional spaces.
Manual temporal decomposition of the problem into phases of single support (SS) and double support (DS), with intuitive reductions of the state space for each phase.
State-Increment Dynamic Programming
8-dimensional state space (4 joint angles and their velocities).
Discretize the state space with a coarse grid and apply dynamic programming, where the interval ε is defined as the minimum time interval required for any state index to change.
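A generic sketch of DP on such a grid; `next_index` and `reward` stand for a precomputed model (simulate each cell-action pair for the interval ε) and are assumptions:

```python
import numpy as np

def value_iteration(next_index, reward, n_states, n_actions,
                    gamma=0.99, sweeps=100):
    """next_index(s, a): successor cell after simulating for epsilon;
    reward(s, a): reward accumulated over that interval."""
    V = np.zeros(n_states)
    policy = np.zeros(n_states, dtype=int)
    for _ in range(sweeps):
        for s in range(n_states):
            q = [reward(s, a) + gamma * V[next_index(s, a)]
                 for a in range(n_actions)]
            policy[s] = int(np.argmax(q))
            V[s] = q[policy[s]]
    return V, policy
```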
State Space Considerations
Decompose the state space into 2 components (DS + SS); there are important distinctions between the dynamics of DS and SS.
Since the system is periodic, DP cannot be applied separately to the two components: a mapping between them must be established for the DS and SS transitions.
State Space Reduction
Double support: constant step length d_f, which cannot change during DS but can change after the robot completes SS.
DS is equivalent to a 5-bar linkage model, so the entire state space can be described by 2 DoF (the knee angles k1 and k2).
5-D state space, 10x16x16x12x12 grid => 368,640 states.
State Space Reduction
Single support: a compass 2-link model, assuming k1 and k2 are constant.
The stance knee angle k1 has a small range in human walking; the swing knee k2 has a strong effect on d_f, but can be prescribed in accordance with h2 with little effect on the robot's CoM.
4-D state space, 35x35x18x18 grid => 396,900 states.
State Space Reduction
Phase transitions: the DS-to-SS transition occurs when the rear foot leaves the ground; the SS-to-DS transition occurs when the swing leg makes contact. For each transition, a mapping between the two reduced state spaces is defined.
Action Space, Rewards
Use discretized torques.
DS: the hip and both knee joints can accelerate the CoM; the hip action is fixed to zero to gain better resolution for the knee joints. The 2-D action space is discretized from ±5.4 Nm into 7x7 intervals.
SS: only the hip torque is chosen, with 17 intervals in the range ±1.8 Nm.
States x actions: 368,640 x 49 + 396,900 x 17 = 24,810,660 cells (!!)
Reward:
Performance under Error
Different properties of the robot are altered in simulation without relearning the policy. A wide range of disturbances is tolerated, even when the model of the dynamics used for planning is incorrect: the wide set of acceptable states allows the actual trajectory to be distinct from the expected limit cycle.
Learning of a Stand-up Behavior [Morimoto00]
Learning to stand up with a 3-link planar robot; 6-D state space (angles + velocities).
Hierarchical reinforcement learning: task decomposition by sub-goals. The task is decomposed into a non-linear problem in a lower-dimensional space and a nearly-linear problem in a high-dimensional space.
Upper-Level Learning
Coarse discretization of postures, with no speed information in the state space (3-D state space).
Actions: select the next sub-goal posture.
Upper-Level Learning
Reward function: reward the success of the complete stand-up, and also the success of each sub-goal; choosing sub-goals that are easier to reach from the current state is thus preferred.
Use Q(λ) learning to learn the sequence of sub-goals, as sketched below.
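One tabular Q(λ) update step as a sketch; the hyper-parameters and the accumulating-trace variant are assumptions:

```python
import numpy as np

def q_lambda_step(Q, E, s, a, r, s_next,
                  alpha=0.1, gamma=0.95, lam=0.7):
    """Q: (n_states, n_actions) action values, E: eligibility traces."""
    delta = r + gamma * np.max(Q[s_next]) - Q[s, a]   # one-step TD error
    E[s, a] += 1.0                # accumulating eligibility trace
    Q += alpha * delta * E        # credit all recently visited pairs
    E *= gamma * lam              # decay the traces
```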
Lower-Level Learning
The lower level is free to choose at which speed to reach the sub-goal (desired posture); 6-D state space.
Use incremental normalized Gaussian networks (ING-nets) as function approximators: an RBF network with a rule for allocating new RBF centers.
Action space: the torque vector applied at the joints.
Lower-Level Learning
Reward: -1.5 if the robot falls down.
Continuous-time actor-critic learning [Doya99]; actor and critic are both learned with ING-nets.
Control output: a combination of a linear servo controller and the learned non-linear feedback controller.
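A sketch of how the control output could combine the two terms; the gains and the state layout (3 angles + 3 velocities) are assumptions, and `actor` stands for the learned non-linear feedback controller:

```python
import numpy as np

def control_torque(state, subgoal_posture, actor, kp=20.0, kd=2.0):
    """state: 6-D [angles(3), velocities(3)]; returns a torque vector."""
    angles, velocities = state[:3], state[3:]
    servo = kp * (subgoal_posture - angles) - kd * velocities  # linear servo
    return servo + actor(state)   # plus the learned non-linear feedback
```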
Results
Simulation: the hierarchical architecture learns about 2x faster than the plain architecture.
Real robot (video): before learning, during learning, after learning.
Learned on average in 749 trials (7 of 10 learning runs succeeded), using 4.3 sub-goals on average.
The end
For people interested in using RL: the RL-Toolbox
www.igi.tu-graz.ac.at/ril-toolbox
Thank you
Literature
[Kohl04] Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion, N. Kohl and P. Stone, 2004
[Ng00] PEGASUS: A policy search method for large MDPs and POMDPs, A. Ng and M. Jordan, 2000
[Ng04a] Autonomous inverted helicopter flight via reinforcement learning, A. Ng et al., 2004
[Ng04b] Autonomous helicopter flight via reinforcement learning, A. Ng et al., 2004
[Michels05] High Speed Obstacle Avoidance using Monocular Vision and Reinforcement Learning, J. Michels, A. Saxena and A. Ng, 2005
[Morimoto04] A Simple Reinforcement Learning Algorithm For Biped Walking, J. Morimoto and C. Atkeson, 2004
Literature
[Stilman05] Dynamic Programming in Reduced Dimensional Spaces: Dynamic Planning for Robust Biped Locomotion, M. Stilman, C. Atkeson and J. Kuffner, 2005
[Morimoto00] Acquisition of Stand-Up Behavior by a Real Robot using Hierarchical Reinforcement Learning, J. Morimoto and K. Doya, 2000
[Morimoto98] Hierarchical Reinforcement Learning of Low-Dimensional Subgoals and High-Dimensional Trajectories, J. Morimoto and K. Doya, 1998
[Zhou03] Dynamic Balance of a Biped Robot Using Fuzzy Reinforcement Learning Agents, C. Zhou and Q. Meng, 2003
[Doya99] Reinforcement Learning in Continuous Time and Space, K. Doya, 1999