
Reinforcement Learning Applications in Robotics

Gerhard Neumann, Seminar A, SS 2006

Overview

Policy Gradient Algorithms
  RL for Quadruped Locomotion
PEGASUS Algorithm
  Autonomous Helicopter Flight
  High Speed Obstacle Avoidance
RL for Biped Locomotion
  Poincare-Map RL
  Dynamic Planning
Hierarchical Approach
  RL for Acquisition of Robot Stand-Up Behavior

RL for Quadruped Locomotion [Kohl04]

Simple policy-gradient example: optimize the gait of a Sony Aibo robot

Use a parameterized policy with 12 parameters:
Front + rear locus (height, x-pos, y-pos)
Height of the front and the rear of the body
…

Quadruped Locomotion

Policy: no notion of state – open-loop control!
Start with an initial policy
Generate t = 15 random policies R_i
Evaluate the value of each policy on the real robot
Estimate the gradient for each parameter
Update the policy in the direction of the gradient (a sketch follows below)
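A minimal sketch of this kind of finite-difference policy-gradient iteration, in the spirit of [Kohl04]; the perturbation size, step size and the evaluation stub are assumptions, not values from the paper:

```python
import numpy as np

def finite_difference_policy_gradient(theta, evaluate, n_policies=15,
                                      epsilon=0.05, step_size=0.1):
    """One policy-gradient iteration: perturb the gait parameters,
    score each perturbed policy, and step along the estimated gradient.

    theta    : current policy parameters (e.g. the 12 gait parameters)
    evaluate : callable returning the measured value (e.g. walking speed)
    """
    d = len(theta)
    # Generate t random policies by perturbing each parameter by -eps, 0 or +eps
    perturbations = np.random.choice([-epsilon, 0.0, epsilon], size=(n_policies, d))
    scores = np.array([evaluate(theta + p) for p in perturbations])

    gradient = np.zeros(d)
    for j in range(d):
        plus = scores[perturbations[:, j] > 0]
        zero = scores[perturbations[:, j] == 0]
        minus = scores[perturbations[:, j] < 0]
        # Guard against empty groups for very small sample counts
        avg_plus = plus.mean() if plus.size else scores.mean()
        avg_zero = zero.mean() if zero.size else scores.mean()
        avg_minus = minus.mean() if minus.size else scores.mean()
        # If leaving the parameter unchanged scored best, do not move it
        if avg_zero > avg_plus and avg_zero > avg_minus:
            gradient[j] = 0.0
        else:
            gradient[j] = avg_plus - avg_minus

    # Normalize and take a fixed-size step along the estimated gradient
    norm = np.linalg.norm(gradient)
    if norm > 0:
        theta = theta + step_size * gradient / norm
    return theta
```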

Quadruped Locomotion

Estimation of the walking speed of a policy:
Automated process of the Aibos
Each policy is evaluated 3 times
One iteration (3 x 15 evaluations) takes 7.5 minutes

Quadruped Gait: Results

Better than the best known gait for AIBO!

Pegasus [Ng00]

Policy gradient algorithms: use a finite time horizon, evaluate the value
The value of a policy in a stochastic environment is hard to estimate
=> stochastic optimization process

PEGASUS:
For all policy evaluation trials, use a fixed set of start states (scenarios)
Use "fixed randomization" for policy evaluation
Only works for simulations!
The same conditions for each evaluation trial => deterministic optimization process!
Can be solved by any optimization method
Commonly used: gradient ascent, random hill climbing (a sketch follows below)
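A minimal sketch of the PEGASUS idea, assuming a simulator that accepts an explicit random seed; the scenario count, horizon and hill-climbing details are assumptions:

```python
import numpy as np

def pegasus_value(policy_params, simulate, scenarios, horizon=100):
    """Deterministic value estimate: every evaluation reuses the same start
    states and the same random seeds ("fixed randomization")."""
    returns = []
    for start_state, seed in scenarios:
        rng = np.random.default_rng(seed)      # fixed randomness per scenario
        returns.append(simulate(policy_params, start_state, rng, horizon))
    return np.mean(returns)

def random_hill_climbing(theta, simulate, scenarios, iters=200, sigma=0.05):
    """Because the value estimate is deterministic, simple hill climbing works."""
    best_value = pegasus_value(theta, simulate, scenarios)
    for _ in range(iters):
        candidate = theta + sigma * np.random.randn(len(theta))
        value = pegasus_value(candidate, simulate, scenarios)
        if value > best_value:                 # keep the candidate only if it improves
            theta, best_value = candidate, value
    return theta, best_value

# Scenarios: fixed start states paired with fixed seeds, drawn once up front, e.g.
# scenarios = [(sample_start_state(), seed) for seed in range(30)]
```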

Autonomous Helicopter Flight [Ng04a, Ng04b]

Autonomously learn to fly an unmanned helicopter
The helicopter costs $70,000 => exploration on the real system can be catastrophic!
Learn the dynamics from observations of a human pilot
Use PEGASUS to:
Learn to hover
Learn to fly complex maneuvers
Inverted helicopter flight

Helicopter Flight: Model Identification

12-dimensional state space:
World coordinates (position + rotation) + velocities
4-dimensional actions:
2 rotor-plane pitch angles
Rotor blade tilt
Tail rotor tilt
Actions are selected every 20 ms

Helicopter Flight: Model Identification

A human pilot flies the helicopter, data is logged
391 s of training data, reduced to 8 dimensions (the position can be estimated from the velocities)
Learn the transition probabilities P(s_t+1 | s_t, a_t) by supervised learning with locally weighted linear regression (a sketch follows below)
Model Gaussian noise for a stochastic model
Implemented a simulator for model validation
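A minimal sketch of a locally weighted linear regression prediction for one state dimension, assuming logged (state, action) -> next-state data; the Gaussian kernel bandwidth and the ridge term are assumptions:

```python
import numpy as np

def lwr_predict(query, X, y, bandwidth=0.5, ridge=1e-6):
    """Locally weighted linear regression: fit a linear model around the query
    point, weighting each logged sample by a Gaussian kernel on its distance.

    query : concatenated (state, action) vector at which to predict
    X     : (N, d) matrix of logged (state, action) vectors
    y     : (N,) vector of one next-state component (one regression per dimension)
    """
    diffs = X - query
    weights = np.exp(-np.sum(diffs**2, axis=1) / (2.0 * bandwidth**2))

    # Weighted least squares with a bias term
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    W = np.diag(weights)
    A = Xb.T @ W @ Xb + ridge * np.eye(Xb.shape[1])
    b = Xb.T @ W @ y
    beta = np.linalg.solve(A, b)
    mean = np.append(query, 1.0) @ beta

    # Gaussian noise around the local fit gives the stochastic model
    residuals = y - Xb @ beta
    sigma = np.sqrt(np.average(residuals**2, weights=weights))
    return mean, sigma
```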

Helicopter Flight: Hover Control

Desired hovering position:
Very simple policy class
Edges are obtained from human prior knowledge
Learns more or less the linear gains of the controller
Quadratic reward function: punishment for deviation from the desired position and orientation (a sketch follows below)
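A minimal sketch of such a quadratic reward, assuming a state layout of position, velocity and yaw; the weighting coefficients are assumptions, not the values used in [Ng04b]:

```python
import numpy as np

def hover_reward(state, desired, weights=None):
    """Quadratic punishment for deviating from the desired hover state.

    state, desired : dicts with 'position' (x, y, z), 'velocity' (vx, vy, vz)
                     and 'yaw' entries
    """
    if weights is None:
        # Relative importance of position / velocity / heading errors (assumed)
        weights = {"position": 1.0, "velocity": 0.1, "yaw": 1.0}

    pos_err = np.asarray(state["position"]) - np.asarray(desired["position"])
    vel_err = np.asarray(state["velocity"]) - np.asarray(desired["velocity"])
    yaw_err = state["yaw"] - desired["yaw"]

    return -(weights["position"] * pos_err @ pos_err
             + weights["velocity"] * vel_err @ vel_err
             + weights["yaw"] * yaw_err**2)
```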

Helicopter Flight: Hover Control

Results: Better performance than Human Expert (red)

Helicopter Flight: Flying maneuvers

Fly 3 maneuvers from the most difficult RC helicopter competition class
Trajectory following: punish the distance from the projected point on the trajectory
Additional reward for making progress along the trajectory

Helicopter Flight: Results

Videos:

Video1 Video2

Helicopter Flight: Inverted Flight

Very difficult for humans: unstable!
Recollect data for inverted flight
Use the same methods as before
Learned in 4 days, from data collection to flight experiment!
Stable inverted flight controller, sustained position

Video

High Speed Obstacle Avoidance [Michels05]

Obstacle avoidance with an RC car in unstructured environments
Estimate depth information from monocular cues
Learn a controller for obstacle avoidance with PEGASUS in a graphical simulation
Does it work in the real environment?

Estimate depth information: supervised learning
Divide the image into 16 vertical stripes
Use features of the stripe and of the neighboring stripes as input vectors
Target values (shortest distance within a stripe) either from the simulation or from laser range finders
Linear regression
Output of the vision system (a sketch follows below):
Angle of the stripe with the largest distance
Distance of that stripe
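A minimal sketch of the stripe-based depth estimate and the resulting steering cue, assuming precomputed per-stripe feature vectors, a trained weight vector, and a camera field of view (all hypothetical here):

```python
import numpy as np

def estimate_stripe_distances(stripe_features, weights):
    """Predict the shortest distance in each image stripe with linear regression.

    stripe_features : (n_stripes, d) array; features of each stripe concatenated
                      with those of its neighbors
    weights         : (d,) regression weights learned from simulated or
                      laser-range-finder target distances
    """
    return stripe_features @ weights

def steering_cue(stripe_features, weights, field_of_view_deg=60.0):
    """Return the angle of the stripe with the largest predicted distance,
    plus that distance, as the output of the vision system."""
    distances = estimate_stripe_distances(stripe_features, weights)
    n = len(distances)
    best = int(np.argmax(distances))
    # Map the stripe index to a heading angle across the (assumed) field of view
    angle = (best + 0.5) / n * field_of_view_deg - field_of_view_deg / 2.0
    return angle, distances[best]
```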

Obstacle Avoidance: Control

Policy: 6 parameters
Again, a very simple policy is used
Reward: deviation from the desired speed, number of crashes

Obstacle Avoidance: Results

Using a graphical simulation to train the vision system also works for outdoor environments

Video

RL for Biped Robots

Often used only for simplified planar models

Poincare-map based RL [Morimoto04]
Dynamic Planning [Stilman05]

Other examples of RL on real robots strongly simplify the problem: [Zhou03]

Poincare Map-Based RL

Improve walking controllers with RL
Poincare map: intersection points of an n-dimensional trajectory with an (n-1)-dimensional hyperplane
Predict the state of the biped a half cycle ahead at the phases:

Poincare Map

Learn the mapping:
Input space: x = (d, d'), the distance between the stance foot and the body, and its velocity
Action space: modulate the via-points of the joint trajectories

Function Approximator: Receptive Field Locally Weighted Regression (RFWR) with a fixed grid
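A minimal sketch of a receptive-field weighted regression approximator on a fixed grid of Gaussian receptive fields, each holding a local linear model; the grid spacing, bandwidth and the simple gradient-style update shown here are assumptions:

```python
import numpy as np

class FixedGridRFWR:
    """Fixed grid of Gaussian receptive fields, each with a local linear model.
    Predictions are the receptive-field-weighted average of the local models."""

    def __init__(self, centers, bandwidth=0.2, lr=0.1):
        self.centers = np.asarray(centers)          # (K, d) grid of centers
        self.bandwidth = bandwidth
        self.lr = lr
        d = self.centers.shape[1]
        self.beta = np.zeros((len(self.centers), d + 1))  # local linear weights

    def _activations(self, x):
        sq_dist = np.sum((self.centers - x) ** 2, axis=1)
        return np.exp(-sq_dist / (2.0 * self.bandwidth ** 2))

    def predict(self, x):
        w = self._activations(x)
        xb = np.append(x, 1.0)
        local = self.beta @ xb                      # prediction of each local model
        return np.sum(w * local) / (np.sum(w) + 1e-12)

    def update(self, x, target):
        # Gradient step on each local model, weighted by its activation
        w = self._activations(x)
        xb = np.append(x, 1.0)
        errors = target - self.beta @ xb
        self.beta += self.lr * (w * errors)[:, None] * xb[None, :]
```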

Via Points

Nominal trajectories from human walking patterns
The control output is used to modulate the via-points marked with a circle
Hand-selected via-points
The via-points of one joint are incremented by the same amount

Learning the Value function

Reward function: 0.1 if the height of the robot > 0.35 m, -0.1 otherwise
Standard semi-MDP update rules
Only need to learn the value function at the two Poincare-section phases
Model-based actor-critic approach
A … actor. Update rule:
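A minimal sketch of a TD-style critic and actor update on the Poincare-map states; the discount per half cycle, the learning rates and the exploration-based actor rule are assumptions, and the value/actor objects stand in for the RFWR approximators:

```python
def half_cycle_update(value, actor, x, action, reward, x_next,
                      gamma=0.95, critic_lr=0.2, actor_lr=0.1):
    """value, actor : approximators with predict(x) / update(x, target)
    x, x_next     : Poincare-section states (d, d') half a cycle apart
    action        : executed via-point modulation (actor output + exploration)
    reward        : 0.1 if the robot stayed above 0.35 m, otherwise -0.1
    """
    td_error = reward + gamma * value.predict(x_next) - value.predict(x)

    # Critic: move V(x) toward the better estimate
    value.update(x, value.predict(x) + critic_lr * td_error)

    # Actor: shift its output toward the executed modulation when the
    # TD error is positive (stochastic actor-critic style update)
    exploration = action - actor.predict(x)
    actor.update(x, actor.predict(x) + actor_lr * td_error * exploration)
    return td_error
```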

Results:

Stable walking performance after 80 trials
Beginning of learning / end of learning

Dynamic Programming for Biped Locomotion [Stilman05]

4-link planar robot
Dynamic programming for reduced-dimensional spaces
Manual temporal decomposition of the problem into phases of single support (SS) and double support (DS)
Use intuitive reductions of the state space for both phases

State-Increment Dynamic Programming

8-dimensional state space:
Discretize the state space with a coarse grid
Use dynamic programming:
The interval ε is defined as the minimum time interval required for any state index to change (a sketch follows below)
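A minimal sketch of value iteration over a coarse state grid with a state-increment time step, assuming a known state-derivative function, a reward function and a discrete action set (all hypothetical placeholders here):

```python
import itertools
import numpy as np

def value_iteration_on_grid(grid_axes, state_derivative, reward, actions,
                            cell_size, sweeps=50, gamma=0.99):
    """Tabular value iteration over a coarse state grid.

    grid_axes        : list of 1-D arrays, one per state dimension
    state_derivative : f(state, action) -> time derivative of the state
    reward           : r(state, action) -> float
    cell_size        : grid resolution per dimension (array-like)
    """
    states = list(itertools.product(*grid_axes))
    index = {s: i for i, s in enumerate(states)}
    V = np.zeros(len(states))
    cell_size = np.asarray(cell_size, dtype=float)

    def nearest(state):
        # Snap a continuous state back onto the coarse grid
        snapped = tuple(axis[np.argmin(np.abs(axis - x))]
                        for axis, x in zip(grid_axes, state))
        return index[snapped]

    for _ in range(sweeps):
        for i, s in enumerate(states):
            best = -np.inf
            for a in actions:
                xdot = np.asarray(state_derivative(s, a), dtype=float)
                # State-increment step: the minimum time for any state index
                # to move one cell along its axis
                dt = np.min(cell_size / (np.abs(xdot) + 1e-9))
                s_next = np.asarray(s) + dt * xdot          # Euler step
                best = max(best, reward(s, a) + gamma * V[nearest(s_next)])
            V[i] = best
    return V, states
```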

State Space Considerations

Decompose into 2 state-space components (DS + SS)
Important distinctions between the dynamics of DS and SS
Periodic system: DP cannot be applied separately to the state-space components
Establish a mapping between the components for the DS and SS transitions

State Space Reduction

Double support:
Constant step length (df)
Cannot change during DS; can change after the robot completes SS
Equivalent to a 5-bar linkage model
The entire state space can be described by 2 DoF (use k1 and k2)
5-D state space, 10x16x16x12x12 grid => 368,640 states

State Space Reduction

Single support:
Compass 2-link model: assume k1 and k2 are constant
The stance knee angle k1 has a small range in human walking
The swing knee k2 has a strong effect on df, but can be prescribed in accordance with h2 with little effect on the robot's CoM
4-D state space, 35x35x18x18 grid => 396,900 states

State-Space Reduction

Phase transitions:
The DS-to-SS transition occurs when the rear foot leaves the ground. Mapping:
The SS-to-DS transition occurs when the swing leg makes contact. Mapping:

Action Space, Rewards

Use discretized torques
DS: the hip and both knee joints can accelerate the CoM
Fix the hip action to zero to gain better resolution for the knee joints
Discretize the 2-D action space from ±5.4 Nm into 7x7 intervals
SS: only choose the hip torque, 17 intervals in the range of ±1.8 Nm (a sketch follows below)
States x actions: 398640 x 49 + 396900 x 17 = 26,280,660 cells (!!)
Reward:
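A minimal sketch of the torque discretization described above; the ranges follow the slide, the variable names are placeholders:

```python
import numpy as np
from itertools import product

# Double support: 2-D knee-torque action space, +-5.4 Nm in 7 steps per joint
knee_torques = np.linspace(-5.4, 5.4, 7)
ds_actions = list(product(knee_torques, knee_torques))   # 7 x 7 = 49 actions

# Single support: only the hip torque, +-1.8 Nm in 17 steps
ss_actions = np.linspace(-1.8, 1.8, 17)                  # 17 actions

print(len(ds_actions), len(ss_actions))                  # 49 17
```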

Results

11 hours of computation

The computed policy locates a limit cycle through the space.

Performance under error

Alter different properties of the robot in simulation
Do not relearn the policy
A wide range of disturbances is accepted, even if the used model of the dynamics is incorrect!
The wide set of acceptable states allows the actual trajectory to be distinct from the expected limit cycle

Learning of a Stand-up Behavior [Morimoto00]

Learning to stand up with a 3-link planar robot

6-D state space: angles + velocities

Hierarchical reinforcement learning: task decomposition by sub-goals
Decompose the task into:
A non-linear problem in a lower-dimensional space
A nearly-linear problem in a high-dimensional space

Upper-level Learning

Coarse discretization of postures
No speed information in the state space (3-D state space)
Actions: select a new sub-goal (desired posture)

Upper-Level Learning

Reward function:
Reward the success of the stand-up
Reward also the success of a sub-goal
Choosing sub-goals which are easier to reach from the current state is preferred

Use Q(λ)-learning to learn the sequence of sub-goals (a sketch follows below)
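A minimal sketch of tabular Watkins' Q(λ) with replacing eligibility traces for the sub-goal sequence, assuming a small discrete posture space; the learning rate, discount and trace decay are assumptions:

```python
import numpy as np

def q_lambda_update(Q, E, state, action, reward, next_state, next_action,
                    alpha=0.1, gamma=0.95, lam=0.8):
    """One step of Watkins' Q(lambda) for the upper level
    (states = coarse postures, actions = candidate sub-goals).

    Q, E : (n_states, n_actions) arrays of action values and eligibility traces
    """
    td_error = reward + gamma * np.max(Q[next_state]) - Q[state, action]

    # Replacing eligibility trace for the visited state-action pair
    E[state, action] = 1.0
    Q += alpha * td_error * E

    if Q[next_state, next_action] == np.max(Q[next_state]):
        E *= gamma * lam          # next action is greedy: keep decaying traces
    else:
        E[:] = 0.0                # exploratory action: cut the traces
    return td_error
```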

Lower-level learning

The lower level is free to choose at which speed to reach the sub-goal (desired posture)
6-D state space
Use incremental normalized Gaussian networks (ING-nets) as function approximator: an RBF network with a rule for allocating new RBF centers
Action space: torque vector

Lower-level learning

Reward: -1.5 if the robot falls down
Continuous-time actor-critic learning [Doya99]; actor and critic are learned with ING-nets (a sketch follows below)
Control output: combination of a linear servo controller and a non-linear feedback controller
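A minimal Euler-discretized sketch of the continuous-time TD error from [Doya99] and the resulting critic/actor updates; the time constant, time step and learning rates are assumptions, and the approximators stand in for the ING-nets:

```python
def continuous_td_step(value, actor, x, u_executed, reward, x_next,
                       dt=0.01, tau=1.0, critic_lr=0.5, actor_lr=0.2):
    """Euler-discretized continuous-time TD error:
        delta(t) = r(t) - V(x)/tau + dV/dt
    value, actor : approximators with predict(x) / update(x, target)
    u_executed   : torque actually applied (actor output + exploration noise)
    """
    v = value.predict(x)
    v_next = value.predict(x_next)
    v_dot = (v_next - v) / dt                      # finite-difference dV/dt

    td_error = reward - v / tau + v_dot

    # Critic: move V(x) in the direction of the TD error
    value.update(x, v + critic_lr * td_error * dt)

    # Actor: reinforce the exploration component when the TD error is positive
    exploration = u_executed - actor.predict(x)
    actor.update(x, actor.predict(x) + actor_lr * td_error * exploration * dt)
    return td_error
```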

Results:

Simulation results: the hierarchical architecture learns 2x faster than the plain architecture
Real robot: before learning / during learning / after learning
Learned on average in 749 trials (7/10 learning runs)
Used on average 4.3 sub-goals

The end

For People who are interested in using RL: RL-Toolbox

www.igi.tu-graz.ac.at/ril-toolbox

Thank you

Literature

[Kohl04] Policy Gradient Reinforcement Learning for Fast Quadrupedal Locomotion, N. Kohl and P. Stone, 2005

[Ng00] PEGASUS : A policy search method for large MDPs and POMDPs, A. Ng and M. Jordan, 2000

[Ng04a] Autonomous inverted helicopter flight via reinforcement learning, A. Ng et al., 2004

[Ng04b] Autonomous helicopter flight via reinforcement learning, A. Ng et al., 2004

[Michels05] High Speed Obstacle Avoidance using Monocular Vision and Reinforcement Learning, J. Michels, A. Saxena and A. Ng, 2005

[Morimoto04] A Simple Reinforcement Learning Algorithm For Biped Walking, J. Morimoto and C. Atkeson, 2004

Literature

[Stilman05] Dynamic Programming in Reduced Dimensional Spaces: Dynamic Planning for Robust Biped Locomotion, M. Stilman, C. Atkeson and J. Kuffner, 2005

[Morimoto00] Acquisition of Stand-Up Behavior by a Real Robot using Hierarchical Reinforcement Learning, J. Morimoto and K. Doya, 2000

[Morimoto98] Hierarchical Reinforcement Learning of Low-Dimensional Subgoals and High-Dimensional Trajectories, J. Morimoto and K. Doya, 1998

[Zhou03] Dynamic Balance of a Biped Robot using Fuzzy Reinforcement Learning Agents, C. Zhou and Q. Meng, 2003

[Doya99] Reinforcement Learning in Continuous Time and Space, K. Doya, 1999