
Page 1: Function Approximation for Imitation Learning in Humanoid Robots

Rajesh P. N. Rao
Dept of Computer Science and Engineering
University of Washington, Seattle
neural.cs.washington.edu

Students: Rawichote Chalodhorn, David Grimes
Funding: ONR, NSF, Packard Foundation

Page 2: The Problem: Robotic Imitation of Human Actions

Teacher (David Grimes)

HOAP-2 Humanoid Robot (Morpheus, or "Mo")

Page 3: Example of Motion Capture Data

Motion Capture Sequence | Attempted Imitation

Page 4: Goals

Learn from only observations of teacher states: the expert does not control the robot. This is also called "implicit imitation" (Price & Boutilier, 1999)

Similar to how humans learn from imitation

Avoid hand-coded physics-based models

Learn dynamics in terms of the sensory consequences of executed actions

Use teacher demonstration to restrict the search space of feasible actions

Page 5: Step 1: Kinematic mapping

Need to solve the "correspondence problem"

Solved by assuming the markers lie on a scaled version of the robot body

Standard inverse kinematics recovers the joint angles for the motion
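To make this step concrete, here is a minimal sketch of marker scaling followed by numeric inverse kinematics. The scale factor, the marker name, and the planar 2-link arm are illustrative assumptions, not the actual HOAP-2 kinematic chain.

```python
# Sketch: scale teacher markers onto the robot body, then recover
# joint angles by numerically inverting a (toy) forward-kinematics map.
import numpy as np
from scipy.optimize import least_squares

def scale_markers(teacher_markers, scale):
    """Map human marker positions onto a scaled robot body."""
    return {name: scale * pos for name, pos in teacher_markers.items()}

def forward_kinematics(angles, link_lengths):
    """End-effector position of a planar 2-link arm (toy stand-in)."""
    t1, t2 = angles
    l1, l2 = link_lengths
    return np.array([l1 * np.cos(t1) + l2 * np.cos(t1 + t2),
                     l1 * np.sin(t1) + l2 * np.sin(t1 + t2)])

def inverse_kinematics(target, link_lengths):
    """Joint angles whose forward kinematics best match the marker."""
    res = least_squares(
        lambda q: forward_kinematics(q, link_lengths) - target,
        x0=np.zeros(2))
    return res.x

# Example: track a scaled "wrist" marker with the toy arm.
markers = scale_markers({"wrist": np.array([0.9, 0.4])}, scale=0.45)
angles = inverse_kinematics(markers["wrist"], link_lengths=(0.25, 0.25))
```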

Page 6: Step 2: Dimensionality Reduction

Humanoid robots have many degrees of freedom (DOF), making action optimization intractable; HOAP-2 has 25 DOF

Fortunately, most actions are highly redundant

Can use dimensionality reduction techniques (e.g., PCA) to represent states and actions
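A minimal sketch of the eigenpose representation, assuming synthetic stand-in data in place of the real motion-capture recordings:

```python
# Sketch: compress 25-D joint-angle postures into a low-D latent
# space with PCA; states and actions then live in this subspace.
import numpy as np
from sklearn.decomposition import PCA

n_frames, n_joints = 500, 25                    # HOAP-2 has 25 DOF
postures = np.random.default_rng(0).normal(size=(n_frames, n_joints))

pca = PCA(n_components=3)                       # 3-D "eigenpose" space
latent = pca.fit_transform(postures)            # latent states/actions

# A latent action maps back to full joint-angle commands:
reconstructed = pca.inverse_transform(latent[:1])
print("variance captured:", pca.explained_variance_ratio_.sum())
```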

Page 7: Posture Representation using Eigenposes

Page 8: Eigenposes for Walking

Page 9: Step 3: Learning Forward Models using Function Approximation

Basic Idea:

1. Learn a forward model in the neighborhood of the teacher demonstration, using function approximation techniques to map actions to observed sensory consequences

2. Use the learned model to infer stable actions for imitation

3. Iterate between 1 and 2 for higher accuracy
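A runnable toy version of this loop, under loudly simplified assumptions: a hidden 1-D linear plant stands in for the robot, a quadratic state cost stands in for instability, and a refit linear model stands in for the function approximator.

```python
import numpy as np

rng = np.random.default_rng(0)
true_dyn = lambda s, a: 0.8 * s + 0.5 * a      # hidden "robot" dynamics
cost = lambda s: s ** 2                        # instability of a state

actions = rng.normal(size=20)                  # initial kinematic attempt
for _ in range(5):                             # step 3: iterate 1 and 2
    # execute on the "robot" and record the sensory consequences
    s, states = 0.0, []
    for a in actions:
        s = true_dyn(s, a) + rng.normal(scale=0.01)
        states.append(s)
    states = np.array(states)
    # step 1: fit a local forward model (s_t, a_t) -> s_{t+1}
    prev = np.concatenate(([0.0], states[:-1]))
    w, *_ = np.linalg.lstsq(np.stack([prev, actions], axis=1),
                            states, rcond=None)
    # step 2: infer more stable actions by local search under the model
    s_hat = 0.0
    for t in range(len(actions)):
        cand = actions[t] + np.linspace(-0.2, 0.2, 21)
        actions[t] = cand[np.argmin(cost(w[0] * s_hat + w[1] * cand))]
        s_hat = w[0] * s_hat + w[1] * actions[t]
print("model-predicted terminal state:", s_hat)
```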

Page 10: Approach 1: RBF Networks for Deterministic Action Selection

A Radial Basis Function (RBF) network is used to learn the n-th order Markov function

$\mathbf{s}_{t+1} = F(\mathbf{s}_t, \ldots, \mathbf{s}_{t-n+1}, \mathbf{a}_t, \ldots, \mathbf{a}_{t-n+1})$

where $\mathbf{s}_t$ is the sensory state vector, e.g., $\mathbf{s}_t = \boldsymbol{\omega}_t$ (the 3D gyroscope signal), and $\mathbf{a}_t$ is the action vector in latent space, e.g., servo joint-angle commands in latent space.
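A minimal RBF-network regressor for this forward map in the first-order case (n = 1); the training pairs, dimensions, centers, and kernel width below are synthetic placeholders.

```python
import numpy as np

class RBFNetwork:
    def __init__(self, centers, width):
        self.centers, self.width = centers, width

    def _features(self, X):
        # Gaussian basis responses of each input to each center
        d2 = ((X[:, None, :] - self.centers[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * self.width ** 2))

    def fit(self, X, Y):
        # linear readout weights by least squares
        self.W, *_ = np.linalg.lstsq(self._features(X), Y, rcond=None)
        return self

    def predict(self, X):
        return self._features(X) @ self.W

# Stand-in training pairs: inputs [s_t, a_t], targets s_{t+1}.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))     # 3-D gyro state + 2-D latent action
Y = rng.normal(size=(200, 3))     # next gyro reading
centers = X[rng.choice(len(X), size=30, replace=False)]
model = RBFNetwork(centers, width=1.0).fit(X, Y)
s_next = model.predict(X[:1])     # one-step prediction
```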

Page 11: Action Selection using the Learned Function

Select the optimal action for the next time step $t$:

$\mathbf{a}_t^* = \arg\min_{\mathbf{a}_t} \Phi\big(F(\mathbf{s}_t, \ldots, \mathbf{s}_{t-n+1}, \mathbf{a}_t, \ldots, \mathbf{a}_{t-n+1})\big)$

$\Phi$ measures torso stability based on the predicted gyroscope signals:

$\Phi = \alpha_x \omega_x^2 + \alpha_y \omega_y^2 + \alpha_z \omega_z^2$

The search for the optimal action $\mathbf{a}_t^*$ is limited to a local region around the teacher trajectory in the subspace; a minimal search sketch follows the citation below.

(Chalodhorn et al., Humanoids, 2005; IJCAI 2007; IROS, 2009)
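A sketch of the constrained argmin, reusing the RBFNetwork, model, and stand-in data from the previous sketch; the weights $\alpha$ and the search radius are illustrative assumptions.

```python
import numpy as np

alpha = np.array([1.0, 1.0, 1.0])              # per-axis gyro weights

def stability_cost(omega):
    """Phi = alpha_x*wx^2 + alpha_y*wy^2 + alpha_z*wz^2."""
    return float(alpha @ (omega ** 2))

def select_action(model, s_t, teacher_action, radius=0.1, n_samples=200):
    rng = np.random.default_rng(2)
    # candidates confined to a local region around the teacher trajectory
    cands = teacher_action + rng.uniform(
        -radius, radius, size=(n_samples, teacher_action.size))
    inputs = np.hstack([np.tile(s_t, (n_samples, 1)), cands])
    omega_pred = model.predict(inputs)         # predicted gyro signals
    costs = [stability_cost(w) for w in omega_pred]
    return cands[int(np.argmin(costs))]

a_star = select_action(model, s_t=X[0, :3], teacher_action=X[0, 3:])
```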

Page 12: Example: Learning to Walk

Human motion capture data | Unoptimized (kinematic) imitation

Page 13: Example: Learning to Walk

Motion scaling: take baby steps first (literally!)

Final Result

(Chalodhorn et al., IJCAI 2007)

Page 14: Result: Learning to Walk

Optimized Stable Walk | Human Motion Capture

Page 15: Approach 2: Gaussian Processes for Probabilistic Action Selection

Dynamic Bayesian Network (DBN) for Imitation

[Slice at time t]

Ot are observations of states St

St = low-D joint space, gyro, and foot pressure readings

Ct are constraints on states (e.g., gyroscope values near zero)

(Grimes et al., RSS 2006; NIPS 2007; IROS 2007; IROS 2008)

Page 16: DBN for Imitative Learning

Gaussian Process-based Forward Model (input $[\mathbf{s}_{t-1}, \mathbf{a}_t]$), with one GP per state dimension $i$:

$P\big(s_t^{(i)} \mid \mathbf{s}_{t-1}, \mathbf{a}_t\big) = \mathcal{N}\big(\mu_i^*([\mathbf{s}_{t-1}, \mathbf{a}_t]),\ \sigma_i^*([\mathbf{s}_{t-1}, \mathbf{a}_t])\big)$

(Grimes, Chalodhorn, & Rao, RSS 2006)
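A sketch of such a forward model with synthetic data: one GaussianProcessRegressor per state dimension, trained on inputs $[\mathbf{s}_{t-1}, \mathbf{a}_t]$ and returning a predictive mean and standard deviation. The shapes and kernel choice are illustrative assumptions.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(3)
S_prev = rng.normal(size=(150, 4))   # previous low-D states
A = rng.normal(size=(150, 2))        # latent actions
S_next = rng.normal(size=(150, 4))   # resulting states

X = np.hstack([S_prev, A])           # GP input: [s_{t-1}, a_t]
kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gps = [GaussianProcessRegressor(kernel=kernel).fit(X, S_next[:, i])
       for i in range(S_next.shape[1])]        # one GP per dimension

x_query = X[:1]
mu = [gp.predict(x_query)[0] for gp in gps]    # predictive means
sd = [gp.predict(x_query, return_std=True)[1][0] for gp in gps]
```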

Page 17: Action Inference using Nonparametric Belief Propagation

Maximum marginal posterior actions

Evidence (blue nodes)

Page 18: Summary of Approach

Learning and action inference are interleaved to yield progressively more accurate forward models and actions

Page 19: Example of Learning

Page 20: Progression of Imitative Learning

Page 21: Human Action Imitation

Result after Learning

(Grimes, Rashid, & Rao, NIPS 2007)

Page 22: Other Examples

Page 23: From Planning to Policy Learning

Behaviors shown in the previous slides were open-loop, based on planning by inference

Can we learn closed-loop "reactive" behaviors?

Idea: learn state-to-action mappings ("policies") based on the final optimized output of the planner and the resulting sensory measurements

Page 24: Policy Learning using Gaussian Processes

For a parameterized task $T(\theta)$, watch demonstrations for particular values of $\theta$
E.g., teacher lifting objects of different weights
The parameter $\theta$ is not given but is intrinsically encoded in the sensory measurements

Use inference-based planning to infer stable actions $\mathbf{a}_t$ and states $\mathbf{s}_t$ for the demonstrated values of $\theta$

Learn a Gaussian process policy $\mathbf{a}_t = \pi(\mathbf{s}_t)$ based on $\{\mathbf{s}_t, \mathbf{a}_t\}$; a sketch follows the citation below

(Grimes & Rao, IROS 2008)
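A sketch of this policy-learning step with synthetic placeholder data: planned state-action pairs from several demonstrated weights are pooled and a GP regresses actions on states, so at run time the policy can be queried for a weight that was never demonstrated, since the weight is implicit in the sensed state.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(4)
S_list, A_list = [], []
for weight in (0.1, 0.3, 0.5):                 # demonstrated task values
    S_list.append(rng.normal(loc=weight, size=(100, 4)))  # sensed states
    A_list.append(rng.normal(size=(100, 2)))   # planner's optimized actions
S, A = np.vstack(S_list), np.vstack(A_list)

kernel = RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
policy = GaussianProcessRegressor(kernel=kernel).fit(S, A)

# Closed-loop use: map the current sensed state to an action, even
# for an object weight never demonstrated (e.g., 0.4).
a_t = policy.predict(rng.normal(loc=0.4, size=(1, 4)))
```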

Page 25: Example: Learning to Lift Objects of Different Weights

Page 26: Generalization by Gaussian Process Policy

Page 27: Generalizing to a Novel Object

Page 28: Summary and Conclusions

Stable full-body human imitation in a humanoid robot may be achievable without a physics-based model

Function approximation techniques (RBF networks, Gaussian processes) play a crucial role in learning a forward model and in action inference

Function approximation is also used to learn policies for reactive behavior

Dimensionality reduction using PCA (via "eigenposes") helps keep learning and inference tractable

Challenges: scaling up to a large number of actions, smooth transitions between actions, hierarchical control