Learning the Locomotion of
Different Body Sizes Using Deep
Reinforcement Learning
Wenying Wu
Master of Science
Artificial Intelligence
School of Informatics
University of Edinburgh
2019
Abstract
Ensuring the realism of 3D animation is essential to maintaining audience immersion.
Motion synthesis aims to generate realistic motion from only high-level control pa-
rameters (e.g. the trajectory to follow). In particular, physical motion synthesis uses
physically-based animation, where all motion is the result of simulated forces. To
ensure realism, these techniques can aim to imitate reference motion capture data.
The advantage of physically-based animation is the ability to produce emergent, and
physically-realistic, responses to unexpected external disturbances e.g. different ter-
rains.
An important consideration for methods that use motion capture data is retarget-
ing, i.e. how to apply the resulting animations to characters whose physical properties
do not match those of the motion capture actor. Existing physical motion synthesis re-
search does not place much focus on efficient retargeting, and thus existing techniques
can take many hours to train controllers for a modified character model.
In this dissertation, we look at whether it is possible to train adaptive controllers
that can quickly adapt to different body parameters. This could be used to develop a
tool for physically-based animation where animators can edit characters interactively.
We build our experiments around DeepMimic, a recent method for physical motion
synthesis that uses deep Reinforcement Learning to imitate reference motion capture
data. The DeepMimic authors show it is possible to retarget the motion data to different
models with DeepMimic, but this requires many hours of training. We use this method
as our baseline.
We implement and compare two approaches (both inspired by sim-to-real transfer
research for robotics): linearly combining base controllers to produce a new controller
fitted to the desired body parameters, and domain randomisation. We evaluate these
methods on a humanoid model with varying scales and limb lengths, and demonstrate
that it is indeed possible to train controllers that can adapt to different body parameters
with no online computation as opposed to needing many hours. These controllers
follow the reference motion as well as the baseline can, whilst also demonstrating
comparable robustness to external forces.
Acknowledgements
I firstly would like to thank my supervisor Taku Komura for suggesting this very in-
teresting research project and for his input throughout the process. I would also like
to thank Sebastian Starke, Levi Fussell and Robin Trute for their valuable ideas and
insights at the start of the project. I would like to thank Erwin Coumans and Yunfei
Bai for the open-source PyBullet simulator which the project uses. Finally, I would
like to thank my friends here in Edinburgh (especially my fellow Floor 6 frequenters)
for their company, and my parents for their continual support.
Table of Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Outline of dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Background and Related Work 4
2.1 Character animation . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.1 Motion synthesis . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Retargeting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1 Retargeting kinematic animation . . . . . . . . . . . . . . . . 6
2.2.2 Retargeting physically-based animation . . . . . . . . . . . . 6
2.3 Controller transfer in robotics . . . . . . . . . . . . . . . . . . . . . . 7
2.3.1 Domain adaptation and transfer learning . . . . . . . . . . . . 7
2.3.2 Domain randomisation . . . . . . . . . . . . . . . . . . . . . 8
2.3.3 Linear combination of policies . . . . . . . . . . . . . . . . . 8
3 DeepMimic 9
3.1 Reinforcement learning . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1.1 Proximal policy optimisation . . . . . . . . . . . . . . . . . . 10
3.1.2 Algorithm and hyperparameters . . . . . . . . . . . . . . . . 11
3.2 Problem representation . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.3 Physics simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.4 Character and reference data . . . . . . . . . . . . . . . . . . . . . . 13
3.4.1 Modifying body parameters . . . . . . . . . . . . . . . . . . 14
3.4.2 Retargeting motion capture data . . . . . . . . . . . . . . . . 15
3.5 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4 Combination of Controllers 16
4.1 Base policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.1.1 Transfer learning . . . . . . . . . . . . . . . . . . . . . . . . 16
4.2 Combining policies . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.3 Interpolating two policies . . . . . . . . . . . . . . . . . . . . . . . . 18
4.4 Linearly combining multiple policies . . . . . . . . . . . . . . . . . . 19
4.4.1 Black-box optimisation . . . . . . . . . . . . . . . . . . . . . 20
4.4.2 Corrective base policies . . . . . . . . . . . . . . . . . . . . 22
4.4.3 Bilinear interpolation . . . . . . . . . . . . . . . . . . . . . . 24
4.5 Model fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.5.1 Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.5.2 K-Nearest Neighbours . . . . . . . . . . . . . . . . . . . . . 27
4.5.3 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5 Domain Randomisation 29
5.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.1.1 Network architecture . . . . . . . . . . . . . . . . . . . . . . 29
5.1.2 PPO hyperparameters . . . . . . . . . . . . . . . . . . . . . . 30
5.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
6 Results and Evaluation 33
6.1 Training time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
6.1.1 Offline time . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
6.1.2 Online time . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
6.2 Episode return . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
6.3 Controller robustness . . . . . . . . . . . . . . . . . . . . . . . . . . 37
7 Conclusions and Future Work 39
Bibliography 41
A Reward Function Terms 46
B Training Graphs 48
B.1 Domain randomisation . . . . . . . . . . . . . . . . . . . . . . . . . 48
B.2 Baseline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
C Combination of Policies 50
C.1 Transferred vs non-transferred base policies . . . . . . . . . . . . . . 50
C.2 Segmenting the action vector . . . . . . . . . . . . . . . . . . . . . . 51
D Domain Randomisation 53
D.1 Neural network architecture diagrams . . . . . . . . . . . . . . . . . 53
D.2 Adjusting output of a frozen network . . . . . . . . . . . . . . . . . . 53
Chapter 1
Introduction
1.1 Motivation
3D animation is ubiquitous in films and video games. In order to ensure the audience
remains immersed, it is essential that animation appears believable and realistic. In the
last couple of decades, motion capture has become a widely-used technique for ani-
mation [19]; however, it is expensive, requires manual post-processing, and is limited
in terms of flexibility and controllability. Motion synthesis research aims to address
these issues by looking at how to generate realistic-looking motions according to only
high-level control inputs, often relying on a database of motion capture data [33].
Most character animation in industry is kinematic (operating directly on positions
and velocities); however, there has also long been interest in refining physically-based
animation techniques where all motion is the result of simulated forces [8]. Physical
motion synthesis is an exciting avenue of research that offers a lot of potential for pro-
ducing characters that can react realistically to unseen environmental changes, a feature
that would be incredibly useful in interactive applications. To ensure the synthesised
motions appear realistic, physical motion synthesis methods may use techniques such
as space-time optimisation [44, 2] or reinforcement learning (RL) [34] to minimise the
difference between the simulated motion and reference motion capture data.
One issue with motion capture data is that often we wish to animate characters
whose physical characteristics do not match those of the motion capture actor. For
example, we may want to animate non-human characters, or characters with artistic
but unrealistic proportions, or simply to re-use the same data for different characters
to reduce costs. In these situations, we must take care to avoid undesirable artefacts
such as foot skating, floor penetration, or unnatural movement when transferring the
motion. This transferring is known as retargeting, and established techniques exist
both in the literature and in industry [11].
Most of these retargeting methods are for kinematic animation rather than physically-
based animation. It has been demonstrated that one can use motion capture data to
produce physical controllers for character models with different body parameters, but
as it stands, these methods require long and compute-heavy training/fitting procedures
for each different model [34, 2]. Depending on the method, this can take up to several
days, ruling out the possibility of an interactive editor. Ideally, we would like to be
able to develop tools which allow animators to adjust character models without being
exposed to long waiting times or the under-the-hood details of training.
1.2 Objective
This dissertation looks at whether it is possible to train adaptive physical controllers
that can quickly be used for models with different body parameters. This would al-
low animators to be more experimental and flexible with character designs as they
work with physically-based animation and motion capture data. In particular, we ad-
dress whether techniques proposed in robotics research for overcoming the Reality
Gap (the discrepancy between simulated training environments and the real world) can
be adapted to physically-based animation; and if so, which techniques are the most
promising with respect to the quality of the motion produced, the amount of online and
offline training time, and the controllers’ robustness to environmental perturbations.
1.3 Contributions
In this dissertation, we look at training adaptive controllers from DeepMimic con-
trollers. DeepMimic is a state-of-the-art physical motion synthesis method which is
able to train physically-animated characters to imitate motion capture clips [34] by
using Proximal Policy Optimisation (PPO), a deep RL algorithm [39]. Peng et al.
demonstrate that their method can train controllers for an Atlas robot model using hu-
man motion capture data, despite the fact that Atlas has different proportions and mass
distribution to the human actor. The issue is that this takes 1-2 days to train.
We design and implement two possible approaches for producing adaptive con-
trollers, and evaluate them on humanoid character models with varying body parame-
ters (e.g. limb length). The approaches are:
• Learning how to linearly combine multiple controllers in order to produce a new controller that is best fitted to the given body parameters. We show that
linear interpolation does not always find the best combination weights and that
black-box optimisation methods such as Bayesian optimisation can be applied
instead. However we also demonstrate that once we have the optimised weights
for a set of body parameters, one can use methods such as K-Nearest Neighbours
and cubic interpolation to cover the whole body parameter space.
• Using domain randomisation to train controllers that adjust their outputs depending on the inputted body parameters. This involves trying different
neural network architectures and tuning the PPO hyperparameters. We also ran
some preliminary experiments using progressive networks [36].
Both approaches are inspired by robotics research [49, 50] and, as far as we are
aware, have not been applied to DeepMimic or any other physical motion synthesis
method. Our results show that it is possible to produce adaptive controllers using
both approaches, which is a novel discovery for computer graphics. We also suggest
directions for future research and improvement, and point out the observed limitations.
1.4 Outline of dissertation
We begin in Chapter 2 by presenting an overview of relevant concepts for the project,
relating them to the graphics literature, and further motivating the work carried out. We
discuss existing research into retargeting for physically-based animation, as well as the
robotics research that inspired the methods we use. Chapter 3 explains RL, PPO and
DeepMimic in more detail. We also describe the humanoid character model we use
in our experiments. Chapters 4 and 5 describe our implementations of linearly com-
bining controllers and domain randomisation respectively. The experimental results
that motivated design choices are also presented. The main evaluation and discussion
takes place in Chapter 6, where we take the best methods from Chapters 4 and 5 and
compare them against each other, as well as against the baseline of basic DeepMimic,
in order to address the questions in the objective. Finally, Chapter 7 draws conclusions
about the different methods and suggests ideas for future work.
Supplementary video We provide a video of some of the resulting animations:
https://drive.google.com/open?id=1Ff88lhvs8mLrOH0vzHBi3X75sMvoX73o
Chapter 2
Background and Related Work
In this chapter, we discuss the relevant background. We begin by explaining the moti-
vation behind physical motion synthesis methods such as DeepMimic. We then discuss
existing retargeting techniques. Finally, we outline the robotics research into overcom-
ing the Reality Gap that provided inspiration for our methods in later chapters.
2.1 Character animation
This dissertation looks at 3D animation. The character being animated is represented as
a hierarchy of bones linked together by joints; this structure is known as the skeleton.
Character poses can be represented by specifying the position of the root body part
(often the pelvis) and the rotations of each joint. Forward kinematics can then compute
the position of end-effectors (hands, feet). One way to create animations is to manually
design a sequence of poses (keyframes), and use interpolation on the joint rotations to
smoothly transition the character from one keyframe to the next. These keyframes can
also be obtained from motion capture data. These approaches to animation are purely
kinematic - we operate directly on the positions and rotations of joints, and are not
constrained to following the laws of physics. This is advantageous in that it offers the
animator more control, but this can also be a drawback: if we want to adapt a motion
to a different character, or to react to an external force, we have to do this manually.
Physically-based animation is an alternative where all motion is the result of sim-
ulated forces, and we design controllers that drive the character’s joints. This is less
intuitive than designing kinematic motions; however, unlike kinematic methods, char-
acters can react to perturbations in a physically realistic way. This is an especially
desirable property for interactive applications such as video games [8].
2.1.1 Motion synthesis
The idea of motion synthesis is to artificially synthesise realistic character motion that
follows high-level control inputs (such as the trajectory to be followed), avoiding the
time-consuming task of designing motions frame-by-frame at a low level. Such tech-
niques would be especially useful in the development of interactive video games where
character motion depends on real-time input and it is thus impossible to pre-design
every possible movement. Industry currently relies on methods such as inverse kine-
matics [4] to adapt animation clips to react to control inputs, but this often produces
undesirable visual artefacts such as foot skating and terrain penetration [33].
Motion synthesis methods can be broadly categorised into two approaches: example-
based and simulation-based. Example-based (i.e. data-driven) methods use a large
database of reference motions and apply statistical methods such as deep learning and
autoencoders [18] to produce different motions from the data. On the other hand,
simulation-based methods use physically-based animation to synthesise the motions.
Geijtenbeek et al. [8] provide a survey of physical motion synthesis. Older research
mostly uses motion tracking and model-based methods. Tracking methods however
require motion clips that are feasible to track, and the ability to generalise beyond
the clip is limited [42]. Model-based methods require expert insight into designing
the controller, limiting the range of motions they are applicable for, e.g. SIMBICON
only works for biped locomotion [47]. Stimulus-response network controllers aim to
be more general and reduce domain engineering [8]. The controller is represented
as a parameterised function, e.g. a neural network, and trained using methods such
as RL. DeepMimic [34] is a recent method that uses deep RL to train physical con-
trollers to imitate reference motion data. SAMCON is another method that instead
uses sampling-based optimisation [29]. Other recent research includes [28, 45, 30].
We chose to use DeepMimic in our research as its framework is simple and general;
the controller is just a neural network, and the method has been shown to work on
motions from crawling to doing backflips, as well as on a variety of characters such
as dragons. One can also specify a task during training (e.g. to move in a certain
direction, or throw a ball at a target), and the training will produce a controller that
satisfies this whilst still staying faithful to the reference motion. Being able to quickly
retarget DeepMimic controllers would thus open the doors to interactive tools for this
large range of motions. Furthermore, the controllers also demonstrate robustness to
uneven terrain or unexpected forces. DeepMimic is explained in detail in Chapter 3.
2.2 Retargeting
Retargeting is where we adapt motion that was recorded (or designed manually using
keyframes) for one character model so it can be used by another character model.
This means we can re-use the motion, saving time and cost. Sometimes retargeting is
unavoidable, e.g. using motion-capture data to animate models that do not match the
motion capture actor because they have unrealistic proportions, or are non-human [46].
We focus on retargeting between models with the same topological structure (i.e. the
skeleton parts are connected in the same way, but may differ in dimension).
2.2.1 Retargeting kinematic animation
Most research into retargeting looks at kinematic animation; Guo et al. [11] provide a
survey. Pioneering work by Gleicher et al. [10] describes how to use space-time op-
timisation to perform retargeting. They propose an objective function that minimises
the pose difference in each frame, combined with constraints on the character’s joints
e.g. ‘elbows cannot bend backwards’, ‘this foot should be stationary in this time in-
terval’. The difficulty with retargeting kinematic animation is that constraints must be
designed manually and do not necessarily generalise to all models and motions.
2.2.2 Retargeting physically-based animation
Retargeting physically-based animation can in some ways be more straightforward
than retargeting kinematic animation since we do not need to explicitly design con-
straints to avoid artefacts like foot skating or floor penetration.
Peng et al. [34] demonstrate that DeepMimic is able to successfully train an Atlas
robot model to imitate human motion capture data, despite the differing physical shape
and mass distribution. They do not modify the motion capture clip at all, and train from
scratch from a randomly initialised policy. The reported training times for the Atlas
model are roughly the same as those for the human model (40-70 million samples,
depending on the motion clip). This takes 1-2 days even when using many parallel
cores. The authors did not explore techniques for speeding up the retargeting process.
Al Borno et al. [2] propose a novel method for performing physics-based motion
retargeting for different human body shapes. They use space-time optimisation to per-
form the retargeting [1]; the cost function encourages the physically-simulated char-
acter to follow a motion capture clip as closely as possible. The authors then use Co-
variance Matrix Adaptation Evolutionary Strategy (CMA-ES) [14], a sampling-based
optimisation method, to find the best control trajectories according to the cost func-
tion. In order to make a controller that is robust to external perturbations rather than
just tracking the motion clip, the authors build a random tree of Linear Quadratic Reg-
ulators (LQRs). The authors state that constructing the tree can take multiple days.
Although both of these methods were demonstrated to work, the issue lies in the
long training/optimisation process needed for each new model and motion clip. Our
work explores approaches for creating adaptive controllers that can quickly be applied
to modified models without extensive retraining, something that has not been addressed
for DeepMimic or other comparably-general physical motion synthesis methods.
2.3 Controller transfer in robotics
In robotics, the Reality Gap is the discrepancy between simulated training environ-
ments and the real world. This often results in controllers that were trained in simula-
tion not being able to perform in reality. Sim-to-real transfer is important since train-
ing robots in simulation is faster, cheaper and safer than training in reality. We look
at whether approaches from robotics can be used for obtaining adaptive controllers for
physical animation. To our knowledge, this has not been looked at yet in the literature.
2.3.1 Domain adaptation and transfer learning
Domain adaptation (also referred to as transfer learning) looks at how to transfer mod-
els between different conditions (i.e. domains). The motivation is to reduce the training
time needed in the target domain by bringing over knowledge from a source domain.
It is generally assumed that the different domains are related and thus representations
or behaviours learned in one domain will facilitate learning in the other domain.
There is much domain adaptation research for computer vision [31]. There has also
been work into adapting control policies in the RL community: approaches include
learning invariant features between the two domains [12], transferring training samples
[26], and fine-tuning neural networks that were pre-trained on the source domain [9,
36]. We focus on the final approach. Glatt et al. [9] show that initialising a network
with weights from a network trained on the source domain allows knowledge transfer
between Atari games, whilst Rusu et al. [36] propose adding new layers to a network
trained on the source domain and freezing the layers belonging to the source network.
2.3.2 Domain randomisation
The idea of domain randomisation is to randomly vary the simulation conditions dur-
ing training, such that the resulting controller is exposed to a range of conditions that
hopefully encompasses the conditions of the target domain (e.g. reality). Unlike do-
main adaptation, no training needs to be done in the target domain.
There are two different approaches: we can train robust policies that are able to
handle different environments without needing to identify them [35], or we can train
adaptive policies that first identify the environment (e.g. by performing calibrating ac-
tions) and then adjust themselves accordingly [48]. Since we will know the properties
of the character model we are retargeting to, the latter fits our problem better.
Domain randomisation has been successfully combined with RL algorithms to train
adaptive robotic controllers. For example, Yu et al. [49] use Trust Region Policy Op-
timisation (TRPO), a deep RL algorithm, and explicitly augment the state with the
domain parameters - we take the same approach in our experiments. They demonstrate
their algorithm on simulated continuous control problems such as a robot arm throwing
a ball whose mass varies. Heess et al. [16] look at training controllers for traversing
different terrains; they design curricula for training rather than entirely randomising.
2.3.3 Linear combination of policies
Another way of obtaining adaptive controllers is to combine the outputs of multiple
base controllers, where each base controller has been trained on a different domain.
Zhang et al. [50] look at obtaining policies for robotic arms, and they vary the
physical properties such as the length of the links. They propose Policy Self Evolu-
tion by Calibration (POSEC) for learning how to linearly combine base controllers for
different arm properties. The first step after obtaining the base controllers (trained us-
ing TRPO) is to use CMA-ES to find the optimal weights for a set of different arms.
These weights are then used to train a regression model to take in a feature vector
(representing the arm properties) and output the base controller combination weights.
When designing our experiments, we also took inspiration from graphics research
such as motion blending [23, 24] and facial animation using blend shapes [21]. Both
methods use linear combinations to produce new motions/facial expressions. In partic-
ular, the intuition provided by Joshi et al. [21] of placing blend shapes on the hypercube
formed by the parameters being varied is beneficial, and Kovar et al. [24] inspired the
idea of using K-Nearest Neighbours to compute the combination weights.
Chapter 3
DeepMimic
DeepMimic [34] is a state-of-the-art method for training physically-simulated virtual
characters to imitate motion capture or keyframe data, and it is the method our exper-
iments are built around. Given a character model and a reference motion, DeepMimic
uses deep RL to produce a physical controller for the character. We first formally in-
troduce RL and PPO, which is the particular algorithm used by DeepMimic. We then
explain how the animation problem is defined as an RL problem. We also describe the
character model and motion capture data, and finally we discuss the training process.
3.1 Reinforcement learning
Reinforcement Learning (RL) is a type of Machine Learning which is inspired by how
humans and animals learn through experimentation and interaction with the real world
[43]. It has been applied in many areas including robotics and computer animation.
In an RL problem, there is an agent which can take actions in an environment.
These actions affect the state of the agent and its environment, as specified by a transi-
tion function. We additionally specify a reward function that dictates the scalar reward
rt that the agent receives at each step t of its interaction with the environment. These
properties can be formalised as a Markov decision process. We define return as the
sum of rewards obtained in an episode of interaction. Expected return is denoted J.
The goal is for the agent to learn a policy π(a|s) (a mapping from the currently
observed state s to the action a to be performed) that maximises expected return. This
is done by allowing the agent to explore and gather experience (samples) in the envi-
ronment, and adjust its policy accordingly from the rewards it receives.
Many different RL algorithms exist, however we will focus on policy gradient
methods. These methods use a parametric representation of the policy πθ (e.g. a neural
network) and estimate the gradient of the expected return ∇θJ(θ) by collecting samples
in the environment. The sequence of states, actions and rewards observed in an episode
of interaction is called a trajectory.
We can also use a parametric function to represent the value function V(s_t), which
predicts the expected future return (value) if we are currently in state s_t. Actor-critic
methods are those that learn both the policy π (actor) and V (critic). The critic uses
samples gathered by the actor to improve its predictions to match the observed returns.
The return R_t obtained in a trajectory from timestep t onward is defined as:

R_t = Σ_{k=t}^{T} γ^k r_k    (3.1)

γ ∈ [0,1] denotes the discount factor - this controls how much emphasis the agent
should put on immediate rewards. T denotes the timestep at which the episode ends.
We can then define the advantage A_t, which measures the return we obtained from this
state onward compared to what the critic V predicts: A_t = R_t − V(s_t).
Schulman et al. [38] propose another way of computing A_t known as the Gener-
alised Advantage Estimator (GAE); it uses λ-returns [43] instead of the definition in
Equation 3.1, where λ is a hyperparameter ∈ [0,1] trading off the bias and variance in
the estimates (Equation 3.1 corresponds to setting λ = 1).
The Policy Gradient Theorem [43] defines how to make the gradient estimate:
∇_θ J(θ) = E_π[ A_t ∇_θ log π_θ(a_t|s_t) ]    (3.2)
Policy gradient algorithms alternate between sampling and optimising; we first gather
a batch of samples, then estimate the gradient from these samples and apply gradient
ascent to update θ such that J(θ) improves. To improve efficiency, we can use impor-
tance sampling to estimate the gradient using samples collected by a previous policy
rather than having to re-collect experience after every update. This means multiplying
by importance weights: ∇_θ J(θ) = E_π[ w_t(θ) A_t ∇_θ log π_θ(a_t|s_t) ], where w_t(θ) denotes
the importance weight π_θ(a_t|s_t) / π_θ_old(a_t|s_t) (so w_t(θ) = 1 means the policy hasn’t changed). This
can be interpreted as optimising the following objective function: L(θ) = E_t[ w_t(θ) A_t ].
3.1.1 Proximal policy optimisation
PPO is an example of an actor-critic policy gradient RL method. Its proposing paper
[39] presents it as a modification to TRPO [37] that is simpler to implement. The idea
of TRPO is to maintain a trust region within which policy updates are restricted i.e.
we prevent the policy from changing too much in a single optimisation step, hopefully
avoiding poor updates that we are unable to recover from. The difference between
policies is measured by Kullback-Leibler (KL) divergence. Thus, TRPO optimises
L(θ) subject to a hard constraint (δ is a hyperparameter defining the trust region):
max_θ L(θ) = E_t[ w_t(θ) A_t ]   subject to   E_t[ KL[π_θ_old(·|s_t), π_θ(·|s_t)] ] ≤ δ    (3.3)
The issue with TRPO is the difficulty in implementing the hard constraint as it re-
quires expensive second-order derivatives. PPO instead uses unconstrained first-order
optimisation and a clipped objective to de-incentivise large policy updates:
L^CLIP(θ) = E_t[ min( w_t(θ) A_t , clip(w_t(θ), 1−ε, 1+ε) A_t ) ]    (3.4)
where ε is the clipping factor. We provide some intuition: suppose a_t has A_t > 0,
so we wish to increase its probability. This means increasing w_t(θ), but clip stops us
increasing it beyond 1 + ε. The min allows us to ‘undo’ mistakes - suppose we reduced the
probability of a due to a noisy sample, so w_t(θ) became < 1. When we see a sample
where a has positive advantage, we will be able to increase w_t(θ) again, up to 1 + ε.
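As a concrete illustration, a minimal sketch of computing Equation 3.4 over a batch of samples is shown below; this is illustrative rather than the exact PyBullet implementation, and the log-probability and advantage arrays are assumed to be computed elsewhere.

```python
import numpy as np

def ppo_clipped_objective(logp_new, logp_old, advantages, epsilon=0.2):
    """Clipped surrogate objective L^CLIP (Equation 3.4), averaged over a batch.

    logp_new:   log pi_theta(a_t|s_t) under the current policy
    logp_old:   log pi_theta_old(a_t|s_t) under the policy that collected the data
    advantages: advantage estimates A_t (e.g. from GAE)
    """
    w = np.exp(logp_new - logp_old)                      # importance weights w_t(theta)
    w_clipped = np.clip(w, 1.0 - epsilon, 1.0 + epsilon)
    return np.mean(np.minimum(w * advantages, w_clipped * advantages))
```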
3.1.2 Algorithm and hyperparameters
For this project, we used the DeepMimic implementation provided by PyBullet [5].
The training algorithm alternates between gathering batches of data and making neu-
ral network updates. Data is gathered in episodes: each episode begins from some
state sss and we simulate until a fixed time horizon (20 seconds) or early termination
is triggered (a body part that is not a foot touches the floor). The starting states are
chosen according to reference state initialisation (RSI): rather than always starting at
the beginning of the motion, we sample from all the poses in the reference motion clip.
The actor and the critic each have their own network. To speed-up training, one
can have N agents collecting samples in parallel, sharing the same actor and critic
networks. Once a full batch of samples is obtained, we update the network parame-
ters by sampling minibatches from the data to estimate gradients. The actor computes
gradients of L(θ) (Equation 3.4); for the critic, we use the TD(λ) algorithm (gradient
descent to minimise difference from target values computed using λ-returns) [43].
We use the same architecture as Peng et al. [34]: two fully-connected layers with
1024 and 512 ReLU units. In the actor network, these layers are followed by a linear
output layer (same dimension as the action space). For the critic network, it is a single
linear unit (the value estimate). The actor network output is taken to be the mean µ(s)
of a Gaussian representing the policy: π(a|s) = N(µ(s), Σ). The covariance matrix Σ
is diagonal and fixed for all s, a; it is treated as a hyperparameter (exploration noise).
Hyperparameters The hyperparameter settings we use are as follows:
batch size: 4096 (no. of samples collected before performing policy updates)
minibatch size: 256 (no. of samples used for each gradient estimate)
λ: 0.95 (used in GAE(λ) and TD(λ) calculations)
discount factor γ: 0.95
clipping factor ε: 0.2
actor step-size: 2.5×10^−6 with momentum 0.9 (how far to move in each update)
critic step-size: 10^−3 with momentum 0.9
exploration noise: linearly interpolated from 1 to 0 over 40 million samples
3.2 Problem representation
State and action space The character is the agent, and its controller is the policy.
The state of the character at time t is denoted s_t, and it is a vector containing the global
co-ordinates of the root, the positions of each joint (relative to the character’s root), the
rotations of each joint (as quaternions), and the angular and linear velocities of each
joint. For the humanoid character, the state has 197 features. The action at time t is
denoted a_t. It is a vector containing target rotations for each actuable joint; these values
are fed into a proportional-derivative (PD) controller at each joint which computes the
forces to apply. For the humanoid character, the action space has 36 dimensions.
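To make the observation layout concrete, the following is a schematic sketch of assembling the state vector; the accessor names on the character object are placeholders, not the PyBullet API.

```python
import numpy as np

def build_state(character):
    """Assemble the observation described above: global root co-ordinates, then
    per-joint relative positions, rotations (quaternions) and velocities."""
    features = [character.root_position()]                   # global root co-ordinates
    for joint in character.joints():
        features.append(joint.position_relative_to_root())   # 3D position
        features.append(joint.rotation_quaternion())          # 4D rotation
        features.append(joint.linear_velocity())              # 3D linear velocity
        features.append(joint.angular_velocity())             # 3D angular velocity
    return np.concatenate(features)    # 197 features for the humanoid character
```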
Reference motion DeepMimic uses a reference motion specified as a sequence of
target poses {q_t}. Each q_t is a vector of all the reference motion’s joint rotations at
timestep t, and also the global co-ordinates of the reference motion’s root.
Reward function The reward function is designed to encourage the character to
mimic the reference motion. The reward at each timestep t is computed as follows:
r_t = ω^p r^p_t + ω^v r^v_t + ω^e r^e_t + ω^c r^c_t    (3.5)

r^p_t is a pose reward, computed from the difference between the joint rotations (quater-
nions) of the simulated character and the reference motion. r^v_t is a velocity reward,
computed from the difference between the simulated character’s joint velocities (an-
gular) and the reference motion’s. r^e_t is the end-effector reward, computed from the
difference in global positions of the simulated character and reference motion’s hands
and feet. Finally r^c_t is computed from the difference in global position of the simulated
character and reference motion’s centre of mass (in our implementation, we just use
the root which is the pelvis). The full definitions are in the Appendix (Section A).
The ω terms weight the different rewards. After some experimentation, we fixed
the values: ω^p = 0.5, ω^v = 0.05, ω^e = 0.15, ω^c = 0.3. These sum to 1, so the maximum
possible return in a 20s episode equals the number of timesteps (≈ 600). In practice,
we did not observe returns this high since the character eventually deviates from the
straight trajectory followed by the reference motion.
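As a sketch, the weighted reward of Equation 3.5 can be assembled from the individual terms (whose full definitions are in Appendix A) as follows:

```python
REWARD_WEIGHTS = {"pose": 0.5, "velocity": 0.05, "end_effector": 0.15, "com": 0.3}

def imitation_reward(r_pose, r_velocity, r_end_effector, r_com):
    """Weighted imitation reward r_t (Equation 3.5); the weights sum to 1."""
    return (REWARD_WEIGHTS["pose"] * r_pose
            + REWARD_WEIGHTS["velocity"] * r_velocity
            + REWARD_WEIGHTS["end_effector"] * r_end_effector
            + REWARD_WEIGHTS["com"] * r_com)
```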
3.3 Physics simulation
The character is represented as a skeleton of rigid parts connected by joints. Most of
these joints are actuable (can be controlled). These joints are driven by PD controllers
which essentially abstract away the low-level control and allow us to specify actions
as target angles rather than forces. Forces are computed as
τ_t = −k_p(θ_t − θ̂_t) − k_d θ̇_t    (3.6)

where t is the timestep, τ_t is the resulting control force for the joint, θ_t and θ̇_t are
the current joint angle and velocity, θ̂_t is the target joint angle, and k_p and k_d are the
proportional and derivative gains (parameters that control how quickly to reach θ̂_t).
The physics simulation is carried out using PyBullet [5] at 240Hz. The character’s
action is updated at 30Hz (i.e. how often we query the policy).
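A minimal sketch of the per-joint PD computation in Equation 3.6 is given below; the variable names are illustrative, with the target angle supplied by the policy at 30Hz while the PD control runs at 240Hz.

```python
def pd_torque(theta, theta_dot, theta_target, kp, kd):
    """PD control torque for one joint (Equation 3.6)."""
    return -kp * (theta - theta_target) - kd * theta_dot
```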
3.4 Character and reference data
In this dissertation, we focus on animating humans to run. We use a running clip
from the freely-available CMU human motion capture dataset (http://mocap.cs.
cmu.edu/), which is also what Peng et al. [34] used. The humanoid model for this
data, shown in Figure 3.1, has 43 degrees of freedom (DOF). 36 of these are actuable,
corresponding to the action space. By default it is 1.5 metres tall and weighs 45 kg.
The motion capture clip is encoded as a series of frames, where each frame contains
the global co-ordinates of the root and the rotations of each joint. One can playback
Figure 3.1: The humanoid model for different body parameter vectors ρ. The left image
shows different scales (1.0, 1.2, 1.5). The central image shows an added arm length of
1.0 (80% increase). The right image shows an added leg length of 1.0 (40% increase).
the motion by using forward kinematics to compute the pose at each frame, and then
interpolating between frames (see our Supplementary Video, linked in Section 1.4).
3.4.1 Modifying body parameters
We treat the humanoid model on the very left of Figure 3.1 as our default model. In
this dissertation, we consider the following adjustments (all pictured in Figure 3.1):
• Modifying the overall scale of the model: this means uniformly scaling the di-
mensions and mass of all the body parts. The default scale is defined as 1.
• Extending the length of the model’s arms: this means adding an equal length
to the upper and lower arm segments, and also increasing their mass such that
the density does not change. We vary the increase between 0% and 80% - 80%
corresponds to an added length of 1.
• Extending the length of the model’s legs: this means adding an equal length to
the upper and lower leg segments, and again keeping density constant. We vary
the increase between 0% and 40% - 40% corresponds to an added length of 1.
The modified humanoid models that we use can all be characterised by a vector of
body parameters which we denote as ρ = [scale, added arm length, added leg length].
The default ρ is thus [s = 1, a = 0, l = 0]; we will use this notation throughout.
We note that a controller trained on the default model produces unsteady, unnatural-
looking motion when used with scale > 1.1, added arm length > 0.1 or added leg
length > 0.1. If values are increased further, the model will eventually fall over.
3.4.2 Retargeting motion capture data
One advantage of physical motion synthesis is that it is generally less reliant than
kinematic approaches on using manually-engineered heuristics to ensure e.g. feet do
not penetrate the floor. However, when training a modified model, unlike Peng et al.
[34] who make no modifications, we apply adjustments to the motion capture data
(which is for the default model) to ensure that accelerations stay constant. This is
explained in [17], which looks at the adjustments that need to be made to motion data
to account for modifying a model’s overall scale. By looking at units, it can be derived
that if a model is scaled by factor L, then time should scale as L^(1/2) and displacement
should scale as L. Thus, when training a model with scale L, we multiply the length
of each frame in the reference motion clip by L^(1/2), and the global co-ordinates of the
model’s root in each frame by L.
These scaling rules assume all parts of the model are scaled uniformly, however
we wish to vary limb length independently. After some experimentation, we decided
not to modify the motion capture data when modifying arm length, but it is essential to
modify the data when modifying leg length otherwise the model will penetrate (or
levitate above) the floor. We therefore compute the effective scale factor as: L =
modified leg length / default leg length (i.e. as if the whole model had been scaled).
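A sketch of these adjustments is shown below, assuming for illustration that the clip is stored as per-frame durations and global root positions (a simplification of the actual motion file format).

```python
import numpy as np

def rescale_reference_clip(frame_durations, root_positions, L):
    """Adjust the reference clip for a model with effective scale factor L:
    time scales as L^(1/2) and root displacement scales as L (Section 3.4.2).
    When only leg length changes, L = modified leg length / default leg length."""
    new_durations = np.asarray(frame_durations) * np.sqrt(L)
    new_root_positions = np.asarray(root_positions) * L
    return new_durations, new_root_positions
```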
3.5 Training
During the training process, we monitor two performance metrics: training return and
test return. Training return is the average episode return obtained over the parallel
workers during an iteration (i.e. after collecting a batch of samples). After every 400
iterations, we pause training and run 32 test episodes. For these test episodes, we
disable RSI such that each episode always begins at the start of the motion clip. The
test return is the average episode return obtained in these test episodes.
When reporting returns to compare trained controllers, we also disable RSI as well
as setting exploration noise to 0 (meaning the episode proceeds deterministically).
Episode return is a quantitative way of measuring how closely the controller is fol-
lowing the reference motion, which is an indicator of how realistic the motion is.
We train using 12-16 cores (each running an agent), and stop when the training
return has stopped noticeably increasing and we have collected more than 40 million samples.
Chapter 4
Combination of Controllers
This chapter investigates linearly combining separately-trained base policies (con-
trollers) to produce a novel policy fitted to the desired body parameters ρ. The idea is
inspired by Zhang et al. [50] who were addressing the Reality Gap, and by graphics
research such as facial blendshapes [21] and motion blending [23] (Section 2.3.3). We
first describe the base policies used, before describing the method for combining poli-
cies. We explain using black-box optimisation on combination weights, which we also
compare against linear interpolation. We conclude by looking at using model fitting
methods on a training set to make weight predictions across the whole body parameter
space. We include the experimental results that motivated our design choices.
4.1 Base policies
We began by training a policy on the default ρ [s = 1, a = 0, l = 0]. We then used
transfer learning (discussed next) to obtain policies for the following ρ’s: [s = 1.5, a =
0, l = 0] (large), [s = 1, a = 1, l = 0] (long arms), [s = 1, a = 0, l = 1] (long legs).
Each one represents the extreme of one dimension of ρ.
4.1.1 Transfer learning
As discussed in Section 2.3.1, it is possible to apply transfer learning ideas to RL in
order to speed-up training times on difficult tasks by bringing over knowledge from
a source task. One way to transfer knowledge to a target task is to carry over all the
weights from a network trained on the source task i.e. rather than randomly initialising
the actor and critic network weights when we begin training on the target task, we
initialise them to match the networks trained on the source task [9]. We investigated
whether this could be used to transfer a policy trained on the default body parameters
to a different ρ. We do not alter any of the hyperparameters, and the exploration noise
is linearly decreased in the same way. We experimented with two different target ρ’s:
[s = 1.5, a = 0, l = 0] and [s = 1, a = 1, l = 1].
[Plots: return against samples (×10^7); legend: Random Initialisation, Transfer From Default, Transfer From Long Arms, Transfer From Long Legs.]
Figure 4.1: Training graphs comparing policies trained using transfer against policies
trained from random initialisation. The left graph is for [s = 1.5,a = 0, l = 0], the right
is for [s = 1,a = 1, l = 1]. The solid lines show training return whilst the dashed lines
show test return. (To reduce clutter, we only show training return on the right)
The training curves are compared in Figure 4.1. It can be seen in both graphs that
the transferred policies’ returns increase more quickly than the randomly initialised
policies’ at the start of training, achieving similar performance at 2×10^7 samples as
the randomly-initialised policies achieve at 4×10^7 samples. Furthermore, we observed
that combining policies trained using transfer could sometimes achieve higher returns
than policies trained from scratch - this is discussed in the Appendix (Section C.1). We
therefore use these base policies trained using transfer for the rest of the chapter.
4.2 Combining policies
As explained in Section 3.1, a policy π(a|s) maps the input state s to the probability
of taking action a. Suppose we have a set of base policies π_0, ..., π_n that we wish to
linearly combine into a single policy π. We do this by passing s through each base
[Plots: three panels varying Added Arm Length, Added Leg Length and Scale on the x-axes; each shows episode return (left axis) and weight w (right axis).]
Figure 4.2: Comparison between using linear interpolation to compute the weight w and
using brute-force (at resolution 0.1). The interpolation results are shown in blue and the
brute-force results are shown in red. The solid lines show episode return (left axis) and
the dashed lines show w (right axis).
policy, obtaining actions a_0, ..., a_n, and then linearly combining these actions to obtain
our final action a. We have one combination weight w_i for each base policy.

π(a|s) = w_0 π_0(a|s) + ... + w_n π_n(a|s),   where Σ_i w_i = 1 and ∀i. w_i ∈ [0,1]    (4.1)
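In practice we combine the base policies’ deterministic mean actions, as described above; a minimal sketch is given below, treating each base policy as a callable from state to mean action and assuming the weights already satisfy the constraints in Equation 4.1.

```python
import numpy as np

def combined_action(state, base_policies, weights):
    """Blend base-controller outputs into a single action (Equation 4.1)."""
    actions = np.stack([policy(state) for policy in base_policies])  # one row per base policy
    return np.asarray(weights) @ actions      # weighted sum of target joint rotations
```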
4.3 Interpolating two policies
Suppose we have two base policies π_0, π_1 which were trained for body parameters
ρ_0, ρ_1. The equation for interpolating between these two policies is as follows:

π(a|s) = (1 − w) π_0(a|s) + w π_1(a|s)   (w ∈ [0,1])    (4.2)
Since we are assuming the weights for both policies should add up to one, we only
need to explicitly calculate one weight w. Intuitively, by increasing w from 0 to 1, we
should be able to obtain policies for all ρ lying along the line from ρ_0 to ρ_1 in body
parameter space (i.e. for all ρ’s where ∃c ∈ [0,1]. ρ = ρ_0 + c(ρ_1 − ρ_0)).
Clearly, a simple approach for computing w for such a ρ would be to compute
how far along the line ρ lies, i.e. computing the value of c for that ρ:

c = (ρ − ρ_0) / (ρ_1 − ρ_0)    (4.3)
However, this is making the assumption that there is an underlying linear relationship.
Since the search space for w is one-dimensional, it is possible to simply exhaus-
tively evaluate the resulting policy for all values between 0 and 1 at some resolution. In
order to assess whether a simple linear interpolation calculation (as in Equation 4.3) is
sufficient for finding the optimal w, we performed this exhaustive search at a resolution
of 0.1 for several pairs of base policies. The results are shown in Figure 4.2.
It can be seen that linear interpolation does not always find the best weight; the per-
formance of linear interpolation for changing the arm length is especially poor com-
pared to brute-force. Interestingly, the trend of best weights found by brute-force is
not monotonically increasing when changing leg length and scale, and furthermore the
best weight never reaches 1.0 in any of the graphs. This could be explained by the fact
that PPO is not guaranteed to converge to a globally optimal policy.
Nevertheless, it seems possible to interpolate between two policies in order to cover
the range of ρ’s lying along the line connecting them in body parameter space. We now
look at extending this idea to multiple base policies.
4.4 Linearly combining multiple policies
The idea can be extended to more than two base policies. The hope is that this in turn
enables multiple dimensions of the body parameters ρ to be successfully varied by
having a base policy for each dimension. This approach is inspired by research into
using blend shapes for facial animation [21]. The difficulty this introduces is that the
weight search space becomes multidimensional; e.g. in order to combine three base
policies, we require two combination weights (assuming the weights sum to one). For
multidimensional search spaces, brute-force is no longer sensible.
In this section we first look at using black-box optimisation methods for finding the
best weights. We then discuss the need for corrective base policies, before looking at
whether bilinear interpolation is effective for finding good weights. Finally, we look
at whether it is possible to apply model fitting methods such as K-Nearest Neighbours
to cover the space of ρ. We also explored the idea of segmenting the action vector and
using multiple weights per base policy to provide finer control, but since this did not
have strong results we discuss it in the Appendix (Section C.2).
4.4.1 Black-box optimisation
In this section we present the problem of finding the best combination weights for a
given ρ as an optimisation problem. The function we wish to maximise takes a vector
w and outputs the return achieved by running an episode (using ρ) with that w.
The optimisation of functions is an extensively studied area, however many meth-
ods assume access to derivative information, cannot handle noise, or require many
function evaluations, making them unsuitable for our problem of finding combination
weights (where a function evaluation means simulating a full episode, taking around 30
seconds). We therefore look at methods designed for performing global optimisation
on expensive black-box functions - in particular, Bayesian optimisation and CMA-ES.
We first explain the two methods, and then present our empirical comparison.
4.4.1.1 Bayesian optimisation
Bayesian optimisation uses evaluations of the function being optimised to build an ap-
proximation to it, which then guides the search [7]. It is often used for hyperparameter
tuning [41]. We use the Scikit-Optimize implementation [25].
Bayesian optimisation begins with a prior probability distribution representing our
beliefs about the possible functions. After each evaluation of the black-box function,
these beliefs can be updated to form a posterior distribution. The beliefs can be repre-
sented as a Gaussian Process (GP). GPs model the function as a multivariate Gaus-
sian, and use a kernel function to compute the covariance between function values (we
use the Matern kernel). This approach can also account for noise in the observations
- we can manually set the noise amount, or the value can be learned such that the
observations are best explained (this option is indicated in Figure 4.3 as ‘Gaussian’).
Bayesian optimisation uses an acquisition function to decide where to next evalu-
ate the black-box function, trading-off between exploring high-uncertainty regions and
exploiting the best region found so far. We tried the following acquisition functions:
Upper Confidence Bound: UCB(w) = µ(w) + κσ(w) (κ is a hyperparameter,
where a higher value favours exploration over exploitation)
Expected Improvement: EI(w) = E[ f(w) − f(w_best) ]
Probability of Improvement: PI(w) = P( f(w) ≥ f(w_best) + κ )
Hedge: stochastically selects one of the above three acquisition functions
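As an illustration of how the weight fitting can be set up with Scikit-Optimize’s gp_minimize, a sketch is shown below; run_episode is a placeholder for our simulation (returning the episode return for given weights and body parameters), and for simplicity the last weight is taken as one minus the sum of the others (combinations with a negative remainder are not handled here).

```python
from skopt import gp_minimize

def fit_weights(run_episode, rho, n_base_policies, n_calls=40):
    """Fit combination weights for body parameters rho via Bayesian optimisation."""
    def objective(w):
        w_full = list(w) + [1.0 - sum(w)]      # remaining weight so they sum to one
        return -run_episode(w_full, rho)        # gp_minimize minimises, so negate return

    bounds = [(0.0, 1.0)] * (n_base_policies - 1)
    result = gp_minimize(objective, bounds, n_calls=n_calls,
                         acq_func="LCB", kappa=10, noise=1.0)
    return list(result.x) + [1.0 - sum(result.x)]
```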
4.4.1.2 CMA-ES
Covariance Matrix Adaptation Evolutionary Strategy (CMA-ES) is a stochastic method
for optimisation [14], and it is the method used by Zhang et al. [50] to fit combination
weights. We use the PyCMA implementation [15].
Evolutionary strategies (ES’s) operate by randomly generating populations of solu-
tions (e.g. by sampling from a multivariate Gaussian), and applying ‘natural selection’
such that the best solutions survive into the next generation. CMA-ES adaptively ad-
justs the covariance matrix used to generate the next generation according to the results
of the previous generation, in order to hopefully move towards the optimum.
There are various hyperparameters; we investigate the following:
initial standard deviation: how the first population is sampled
damping: dampens how quickly the covariance changes between populations
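The equivalent search with PyCMA’s ask/tell interface might look as follows; run_episode is again a placeholder, and only the initial standard deviation is exposed (the damping setting is omitted from this sketch).

```python
import cma

def fit_weights_cmaes(run_episode, rho, n_base_policies, initial_std=1.0, generations=8):
    """Search the first n-1 combination weights with CMA-ES; the last weight
    is taken as one minus the sum of the others."""
    x0 = [1.0 / n_base_policies] * (n_base_policies - 1)
    es = cma.CMAEvolutionStrategy(x0, initial_std, {"bounds": [0.0, 1.0]})
    for _ in range(generations):
        candidates = es.ask()
        # CMA-ES minimises, so pass the negated episode returns as fitness values
        fitness = [-run_episode(list(w) + [1.0 - sum(w)], rho) for w in candidates]
        es.tell(candidates, fitness)
    xbest = es.result.xbest
    return list(xbest) + [1.0 - sum(xbest)]
```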
4.4.1.3 Comparison
We ran experiments to empirically compare both methods and a variety of hyperpa-
rameter settings. For these experiments we varied arm length and leg length, and used
three base policies: default [s = 1, a = 0, l = 0], long arms [s = 1, a = 1, l = 0], and long
legs [s = 1, a = 0, l = 1]. We selected 9 different ρ values and first used brute-force to
find the optimum return for each. The brute-force search was performed at a resolution
of 0.1, meaning 11^2 = 121 evaluations since there are two weights which can each take
11 values. This took around an hour for each ρρρ.
We then applied each black-box optimisation method on the same 9 ρ’s, allowing
40 function evaluations for each problem. On a machine with a 2.3GHz processor, 40
evaluations takes around 15 minutes. Due to the stochasticity involved in each method,
we solve each ρ 3 times. We compare the methods in terms of how far the returns lie
from the optimal solution found by brute-force. We report the decrease, so smaller
values are better - indeed, sometimes the optimisation methods can improve on the
return found by brute-force (in which case the decrease becomes negative) since they
are not constrained to a 0.1 resolution. The results are shown in Figure 4.3.
CMA-ES was able to achieve better (i.e. lower) means than Bayesian optimisation
but demonstrated higher variability across the 9 problems. For the rest of this chapter,
we use Bayesian optimisation with LCB as the acquisition function, κ = 10 and noise
= 1 (the lowermost bar in Figure 4.3) as it appears to provide the best balance between
a low mean and a small standard deviation. We use 40 function evaluations.
[Bar chart: x-axis is the decrease from the optimum found using brute-force. Bayesian optimisation settings: UCB/EI/PI/Hedge with kappa ∈ {2, 10} and noise ∈ {1, gaussian}; CMA-ES settings: initial std ∈ {1, 3, 5} with damping ∈ {0.5, 1}.]
Figure 4.3: Comparison of different black-box optimisation methods. Bayesian optimi-
sation using GPs is shown in red, CMA-ES is shown in blue. The different hyperpa-
rameters are shown on the y-axis. The x-axis displays the mean decrease from the
optimum (a small decrease is better). The means and standard errors are computed
over 9 different problems that were each solved 3 times from different random seeds.
4.4.2 Corrective base policies
It may be the case that having one base policy for each dimension of body parameter
space is insufficient. When blendshapes are used in facial animation, it is common to
use corrective blendshapes (also known as combination blendshapes) which become
active according to the extent that certain combinations of other blendshape weights are
active [27]. This idea may be important for retargeting too; for example, lengthening
both the arms and legs may require policy adjustments that cannot be obtained by
combining policies trained on long arms and long legs separately.
We can visualise the problem as a d-dimensional hypercube, where d is the dimen-
sion of ρ, and each vertex represents a base policy [21]. For linear interpolation this
is a line, for bilinear interpolation it is a plane, etc. The number of vertices increases
as 2^d, so it is an important question whether we require a policy at each vertex.
To investigate this, we ran experiments that compare the performance obtainable
with and without corrective base policies. The first experiment looks at varying scale
and arm length (base policies: [s = 1,a = 0, l = 0], [s = 1,a = 1, l = 0], [s = 1.5,a =
0, l = 0], corrective: [s = 1.5,a = 1, l = 0]). The second looks at varying arm and leg
length (base policies: [s = 1,a = 0, l = 0], [s = 1,a = 1, l = 0], [s = 1,a = 0, l = 1],
corrective: [s = 1,a = 1, l = 1]). The results are shown in Figure 4.4.
In both experiments, including a corrective base policy achieves much better return
when both of the dimensions being varied are large, indicating that corrective base
policies are indeed needed. In theory, having a corrective base policy should guarantee
returns at least as large as not having one (we can simply set the corrective base policy’s
weight to zero); however, since Bayesian search is not guaranteed to find the global
optimum, performance is sometimes worse when there is a corrective base policy. To
avoid this issue, it may be possible to design heuristics for when to constrain corrective
base policy weights to be zero, or alternatively to force the Bayesian search to first
explore the weight subspace where corrective base policy weights are zero.
The implication from these experiments, that corrective base policies are needed,
unfortunately suggests that combining base policies may not scale well; as the num-
ber of dimensions of ρρρ increases, the number of base policies needed grows expo-
nentially. Since the training of base policies only needs to happen once (for a single
reference motion), this may not be too serious. However, it may be vital to design
good heuristics for weight optimisation as the search space dimensions grow.
scale\arm 0.0 0.5 1.0
1.00 a b c
1.08 d e f
1.16 g h i
1.24 j k l
1.32 m n o
1.40 p q r
1.50 s t u
arm\leg 0.0 0.5 1.0
0.0 a b c
0.5 d e f
1.0 g h i
Figure 4.4: Episode returns obtained when using a corrective base policy compared
against not using one. The ρρρ values for the experiments are shown in the tables on the
right. Since Bayesian search involves randomness, weights were fitted three times for
each ρρρ; we show mean and standard error.
4.4.3 Bilinear interpolation
Suppose we have a base policy at every corner of the hypercube formed by the dimen-
sions of ρρρ. It is then possible to apply standard interpolation formulae to compute
the combination weights by making the assumption that, for each vertex of the hy-
percube, the best solution is given by weighting the corresponding base policy by 1.
Linear interpolation can be extended to multiple dimensions; in two dimensions it is
known as bilinear interpolation. We experimented with using bilinear interpolation in
order to compute the combination weights when we have four base policies: default,
long arms, long legs, and a corrective policy with long arms and legs.
Bilinear interpolation assumes we know the values of $f$ at 4 datapoints: $Q_{00} = (x_0, y_0)$, $Q_{01} = (x_0, y_1)$, $Q_{10} = (x_1, y_0)$ and $Q_{11} = (x_1, y_1)$. Given a query point $(x, y)$, we estimate $f(x, y) \approx a_0 + a_1 x + a_2 y + a_3 x y$. The coefficients $a_i$ are found by solving a linear system such that inputting $Q_{00}, \dots, Q_{11}$ returns the correct values.
In our case, $Q_{00}, \dots, Q_{11}$ are 4 datapoints in body parameter space. $\boldsymbol{f}$ is a function returning the combination weights (only 3 since we assume the 4 weights sum to one):
$$\boldsymbol{f}(Q_{00}) = \boldsymbol{f}([s = 1, a = 0, l = 0]) = [1, 0, 0],$$
$$\boldsymbol{f}(Q_{01}) = \boldsymbol{f}([s = 1, a = 1, l = 0]) = [0, 1, 0],$$
$$\boldsymbol{f}(Q_{10}) = \boldsymbol{f}([s = 1, a = 0, l = 1]) = [0, 0, 1],$$
$$\boldsymbol{f}(Q_{11}) = \boldsymbol{f}([s = 1, a = 1, l = 1]) = [0, 0, 0].$$
We perform bilinear interpolation independently for each of the 3 combination weights, i.e. we have three functions $f_0(\boldsymbol{\rho}) = w_0$, $f_1(\boldsymbol{\rho}) = w_1$ and $f_2(\boldsymbol{\rho}) = w_2$ such that $\boldsymbol{f}(\boldsymbol{\rho}) = [w_0, w_1, w_2]$.
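For this particular corner assignment the bilinear formulae reduce to a simple closed form; the sketch below (our own illustration, with arm and leg denoting the added arm and leg lengths in [0, 1]) evaluates it directly.

import numpy as np

def bilinear_weights(arm, leg):
    # Weights for [default, long arms, long legs]; the corrective policy receives the remainder.
    w_default = (1 - arm) * (1 - leg)
    w_long_arms = arm * (1 - leg)
    w_long_legs = (1 - arm) * leg
    w_corrective = 1.0 - (w_default + w_long_arms + w_long_legs)   # equals arm * leg
    return np.array([w_default, w_long_arms, w_long_legs, w_corrective])

# e.g. bilinear_weights(0.5, 0.25) -> [0.375, 0.375, 0.125, 0.125]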
We compare the results against what is found by Bayesian optimisation (we use
the hyperparameter settings found in Section 4.4.1 and solve each ρρρ three times). Fig-
ure 4.5 shows the results. As was observed in the 1D case (Section 4.3), Bayesian
optimisation almost always outperforms the solution found by bilinear interpolation,
indicating that the underlying relationship between optimal combination weights and
ρρρ values is more complex. We now see if this can be learned using model fitting.
4.5 Model fitting
Computing the weights for a given ρρρ is expensive, taking around 15 minutes if 40 iter-
ations of Bayesian optimisation are used, and this may not be sufficient for finding the
arm\leg 0.00 0.25 0.50 0.75 1.00
0.00 a b c d e
0.25 f g h i j
0.50 k l m n o
0.75 p q r s t
1.00 u v w x y
Figure 4.5: Comparison of bilinear interpolation against Bayesian optimisation on 25
ρρρ’s (shown in table). Four base policies were used. For each ρρρ we applied Bayesian
optimisation 3 times: we show the mean and standard error.
best solution as the number of base policies increases. Since we want an approach
that can obtain policies for any ρρρ quickly, such that it could be incorporated into an
interactive tool for animators, we decided to investigate whether model fitting methods such
as interpolation and K-Nearest Neighbours (KNN) could be applied to learning the
best weights for every ρρρ. This is inspired by Zhang et al.’s work in [50] (where regres-
sion is used to predict base policy weights) and by motion blending research (Kovar et
al. [24] use KNN to predict blending weights).
The high-level steps are as follows:
1. Construct a training set {(ρρρ1,www1), ...,(ρρρM,wwwM)} of M samples. This is done by
using Bayesian optimisation to compute the best interpolation weights wwwi for
each ρρρi, i ∈ [1,M]. This happens offline and only needs to occur once, so many
iterations of optimisation can be used if desired.
2. This training set can be fed into a model fitting method such as KNN or regres-
sion to learn how to predict the best weights for any given ρρρ (in the space covered
by the training set).
We experimented with model fitting when varying added arm and leg length be-
tween 0 and 1. We used four base policies: default, long arms, long legs, and long
arms and legs. We used Bayesian optimisation with 80 function evaluations and the
best settings from Section 4.4.1. We constructed a training set PPPtr of size 36, where:
$$\boldsymbol{P}_{tr} = \{0.0, 0.2, \dots, 1.0\} \times \{0.0, 0.2, \dots, 1.0\}, \tag{4.4}$$
where $\times$ denotes taking the Cartesian product of two sets.
We then constructed a validation set PPPv (size 10) and test set PPPte (size 20) by sam-
pling each element uniformly from [0, 1]. We note that these set sizes are
relatively small, however the domain of ρρρ is also fairly limited. Zhang et al. [50]
similarly only use small sets - their training and test sets are 20 samples each.
Metrics Most model fitting tasks use a cost function such as squared distance from
the target value to measure performance. However, for the task at hand, we are more
interested in finding weights that give high episode returns than necessarily matching
the fitted weights in PPPv and PPPte as closely as possible. Therefore, in this section, we re-
port two performance metrics: the Euclidean distance between the vector of predicted
weights and the vector of fitted weights (we refer to this as the weight distance), and
the decrease in return when using the predicted weights compared to when using the
fitted weights. For both metrics, a smaller value is better.
4.5.1 Interpolation
In earlier sections we looked at using interpolation to find combination weights by
making assumptions that each base policy corresponded to the best solution for one
corner of the body parameter hypercube. We now instead look at whether interpolation
can be used to essentially “fill in the spaces” between the ρρρ’s in a training set. We only
vary two dimensions of ρρρ (arm and leg length) so this is a 2D interpolation problem.
We try both linear interpolation and cubic interpolation. We use the griddata func-
tion provided by SciPy [20]. Linear interpolation works by triangulating the input
data (they use Delaunay triangulation [40]), and then performing linear interpolation
on each triangle. Cubic interpolation also involves triangulating the data, and then
constructing a cubic Bézier polynomial on each triangle using the Clough-Tocher scheme [3].
As in Section 4.4.3, we perform interpolation on each combination weight inde-
pendently. We explicitly fit 3 weights (clipping predictions to lie between 0 and 1)
and compute the final weight using the assumption they should all sum to one. If
$w_0 + w_1 + w_2 \geq 1$, we set $w_3 = 0$ and normalise: $w_i = \frac{w_i}{w_0 + w_1 + w_2}$ for $i = 0, 1, 2$.
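A minimal sketch of this prediction step, assuming the fitted training data is available as NumPy arrays (train_rhos of shape (M, 2) holding [arm, leg] values and train_weights of shape (M, 3) holding the fitted weights for the first three base policies):

import numpy as np
from scipy.interpolate import griddata

def predict_weights(train_rhos, train_weights, query_rho, method="cubic"):
    # Interpolate each of the 3 fitted weights independently at the query point
    query = np.asarray(query_rho).reshape(1, -1)
    w = np.array([
        griddata(train_rhos, train_weights[:, k], query, method=method)[0]
        for k in range(3)
    ])
    w = np.clip(w, 0.0, 1.0)              # clip predictions to [0, 1]
    if w.sum() >= 1.0:                    # set w3 = 0 and renormalise
        return np.concatenate([w / w.sum(), [0.0]])
    return np.concatenate([w, [1.0 - w.sum()]])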
The validation set results are shown in Figure 4.6. Although both do similarly in
terms of weight distance, cubic interpolation significantly outperforms linear interpo-
lation in terms of return so it is the better method for our purposes.
Figure 4.6: Validation set results of linear and cubic interpolation.
4.5.2 K-Nearest Neighbours
KNN is a Machine Learning algorithm that can be used for both classification and
regression [6]. We use the Scikit-Learn implementation [32].
The idea behind KNN is simple: in order to predict the combination weights www for
a query point ρρρ, we find the K training samples that are closest to ρρρ in body parameter
space. We then use these K neighbours {(ρρρ1,www1), ...,(ρρρK,wwwK)} to make our prediction
www. We try two different neighbour-weighting methods:
• uniform weights: $\boldsymbol{w} = \frac{1}{K}\sum_{i=1}^{K} \boldsymbol{w}_i$ (the same as computing the mean of the neighbours)
• distance-based weights: $\boldsymbol{w} = \sum_{i=1}^{K} a_i \boldsymbol{w}_i$, where $a_i = \frac{1/d_i}{\sum_{j=1}^{K} 1/d_j}$ and $d_i$ is the Euclidean distance from $\boldsymbol{\rho}$ to $\boldsymbol{\rho}_i$ (the closer the neighbour, the higher its weighting)
The only parameter to be set is K, the number of neighbours to use for each pre-
diction. We try values from 2 to 6; the validation set results are shown in Figure 4.7.
The performance in terms of return appears to depend more on K than the weighting
scheme. K = 2 with distance-based weights performed best.
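A small sketch of the K = 2, distance-weighted model with scikit-learn is given below; the arrays shown are illustrative placeholders, not our actual training set.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Illustrative placeholder data: (arm, leg) values and their fitted combination weights
train_rhos = np.array([[0.0, 0.0], [0.0, 0.5], [0.5, 0.0], [1.0, 1.0]])
train_weights = np.array([[1.0, 0.0, 0.0], [0.6, 0.0, 0.4], [0.6, 0.4, 0.0], [0.0, 0.0, 0.0]])

knn = KNeighborsRegressor(n_neighbors=2, weights="distance")   # or weights="uniform"
knn.fit(train_rhos, train_weights)                             # multi-output regression

query = np.array([[0.3, 0.7]])                  # a new rho = [arm, leg]
w = np.clip(knn.predict(query)[0], 0.0, 1.0)    # predicted weights for the first 3 base policies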
4.5.3 Comparison
We evaluated the best setting from each of the two methods on the test set; the results
are shown in Table 4.1. On the full training set, cubic interpolation performed surpris-
ingly well, achieving a mean decrease in return of -11: in other words, on average it
Figure 4.7: Validation set results of different KNN models. K indicates the number of
neighbours used for each prediction. The non-hatched bars (left) weight all neighbours
uniformly; the hatched bars (right) weight neighbours according to distance.
                          Decrease in Return          Weight Distance
                          Full (36)    Reduced (9)    Full (36)      Reduced (9)
KNN (K=2, Weighted)       1 ± 2        15 ± 5         0.26 ± 0.04    0.22 ± 0.03
Interpolation (Cubic)     -11 ± 13     10 ± 3         0.27 ± 0.05    0.24 ± 0.02
Table 4.1: Test set results of the two models that performed best on the validation set.
actually improved on the returns achieved when using Bayesian optimisation to fit the
test set. KNN also performed well, with a mean decrease of only 1.
Due to the good performance, we also decided to look at the performance of these
models when trained on a reduced training set that contains only 9 samples instead
of 36: PPP′tr = {0.0,0.5,1.0}×{0.0,0.5,1.0}. We evaluate them on the same test set.
As expected, the performances in terms of return degrade, but interpolation still out-
performs KNN. The mean weight distances are lower on the reduced set than the full
set, indicating that reducing weight distance does not guarantee the best returns.
Overall, cubic interpolation was the best method tried. Neither method requires
training, however, so in practice it would be straightforward to try multiple methods. A
caveat is that the method used for cubic interpolation only applies to 2D problems (i.e.
only varying 2 elements of ρρρ). Future work would be to look at a multidimensional
extension to cubic spline interpolation [13] and compare this to KNN. Other model
fitting methods that have been used in facial animation and motion blending (and can
extend beyond 2 dimensions) include Radial Basis Functions and neural networks [23].
Chapter 5
Domain Randomisation
In this chapter, we investigate whether domain randomisation can be used to train
adaptive DeepMimic controllers. It is an approach used in robotics for tackling the
Reality Gap [48], as discussed in Section 2.3.2.
The method works as follows. We use randomly-initialised neural networks for the
actor and critic. To enable domain adaptation, the neural networks’ inputs now include
the body parameters ρρρ as well as the character state ssst . This means that the policy
will change if ρρρ changes - it adapts. During training, ρρρ is changed at the start of every
episode, and the properties of the simulated character model are changed accordingly.
In our implementation, each element of ρρρ is sampled uniformly at random from a
continuous interval. This corresponds to the approach in [49], although Yu et al. use
TRPO rather than PPO. No other adjustments to the PPO algorithm were made.
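A schematic sketch of this episode loop is shown below; the environment helpers (rebuild_character, reset, step) are hypothetical names standing in for our PyBullet wrapper, and the collected transitions are consumed by the usual PPO update.

import numpy as np

RHO_RANGES = [(1.0, 1.5)]   # e.g. only the overall scale is randomised here

def collect_episode(env, policy):
    # Sample a new rho at the start of every episode and rebuild the simulated model
    rho = np.array([np.random.uniform(lo, hi) for lo, hi in RHO_RANGES])
    env.rebuild_character(rho)            # hypothetical: re-create the character for this rho
    state, done = env.reset(), False
    transitions = []
    while not done:
        action = policy(np.concatenate([state, rho]))   # rho is appended to the policy input
        state_next, reward, done = env.step(action)
        transitions.append((state, rho, action, reward, state_next, done))
        state = state_next
    return transitions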
In this section we first discuss design aspects such as the neural network architec-
ture and hyperparameters, before presenting our experimental results.
5.1 Design
5.1.1 Network architecture
We tried three different architectures: Default, Split A and Split B. Diagrams are in
the Appendix (Section D.1). The first is the default architecture from Peng et al. [34]
with two hidden layers; we simply append ρρρ to the first layer’s input. We then took
inspiration from the split architecture used by Peng et al. for vision-based tasks, which
joins a convolutional network (processing image data) to the default architecture.
Similarly, Split A and Split B process sss and ρρρ separately before combining to form aaa.
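Purely as an illustration (the project does not necessarily use PyTorch, and the exact layer sizes are assumed here from the DeepMimic defaults), the Default adaptive actor can be sketched as:

import torch
import torch.nn as nn

class DefaultAdaptiveActor(nn.Module):
    """rho is simply concatenated with the character state before the first hidden layer."""
    def __init__(self, state_dim, rho_dim, action_dim, hidden=(1024, 512)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + rho_dim, hidden[0]), nn.ReLU(),
            nn.Linear(hidden[0], hidden[1]), nn.ReLU(),
            nn.Linear(hidden[1], action_dim),   # linear output layer for the action
        )

    def forward(self, state, rho):
        return self.net(torch.cat([state, rho], dim=-1))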
5.1.2 PPO hyperparameters
PPO is an algorithm with many hyperparameters. Although it was designed to be
more robust to hyperparameter settings than other policy gradient methods [39], tuning
is still important. Due to long training times, there was not time for comprehensive
tuning, but after inspecting the training metrics of unsuccessful runs, we chose to focus
on the following. (Other hyperparameters were kept at the values in Section 3.1.2.)
Critic learning rate When training policies across particularly large ranges (e.g.
varying scale between 0.5 and 1.5), we observed that the critic loss began to diverge.
This implied that the default critic learning rate (0.01) might be too large (and causing
overshooting) so we tried smaller values.
Clipping factor A key property of PPO is the clipped objective function to prevent
harmfully-large policy updates (Section 3.1.1). The clipping factor (ε) determines how
large an update we allow. When training policies across large ranges, we observed that
the clipping fraction (how often we have to apply clipping) was lower than usual,
suggesting the default (0.2) might be too large and that the allowed policy update
amount should be further restricted.
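As a reminder (a simple sketch, not the project's training code), the per-sample clipped surrogate term that ε controls is:

import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    # ratio = pi_new(a|s) / pi_old(a|s); eps is the clipping factor discussed above
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)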
5.2 Experiments
In this section, we discuss our experimental results. Each policy was trained until train-
ing return stopped noticeably improving, which normally took > 60 million samples.
Overall Scaling
We first investigated whether it was possible to train a policy that adapted to different
scales, so the ρρρ fed into the network is a 1D vector [s]. Our first experiment tried the
three architectures with ρρρ varying between 1.0 and 1.2, whilst keeping all the PPO
hyperparameters as their defaults. The results are shown in Figure 5.1 (top). All three
architectures achieve returns > 200 across the range, with Split A performing the best.
Since the networks performed well on the [1.0,1.2] range, we then ran the same
experiments on the [1.0,1.5] range, corresponding to our scale experiments in Chapter
4. The results are shown in Figure 5.1 (middle). On this larger range, the adaptive
policies perform much worse, failing to achieve high returns consistently across the
Figure 5.1: The episode returns of the different policies evaluated at different scales.
For the top graph, the policies were trained on the range [1, 1.2] using default hyperpa-
rameters. For the middle graph, we used [1,1.5] and default hyperparameters. For the
bottom graph, we used [1,1.5], halved the critic learning rate to be 0.005, and tried two
different clipping factor (CF) values (dashed is CF=0.2 (default), solid is CF=0.1).
whole range. The two split networks struggle at the higher end whilst the default
network performs worse at the lower end; however, it is difficult to draw concrete
conclusions due to the randomness involved in training. The issue of not being able
to handle the whole range may be related to catastrophic forgetting [22] - when neural
networks lose knowledge about previously-learned tasks when trained on new tasks.
For the next set of experiments, we made the hyperparameter adjustments discussed
in Section 5.1.2 to see if these performed better on the [1.0,1.5] range. We halved the
critic learning rate to be 0.005, and tried two different clipping factors 0.1 and 0.2.
The results are shown in Figure 5.1 (bottom). The default architecture still struggles,
however the two split architectures now achieve more consistent performance. Split A
with clipping factor 0.1 achieved the highest average return across the range.
Leg Length
Having seen that domain randomisation can successfully train policies that adapt to
different scales, we turned to look at varying leg length. As in Chapter 4, we look at
adding up to 40% length (referred to as 1) to each leg. For these experiments we again
used the halved critic learning rate (0.005). The results are shown in Figure 5.2.
One can see that unfortunately all the policies failed to perform well for increased
leg lengths, all falling below 150 when adding more than 0.2. (Our Chapter 4 demon-
strated that it is possible to achieve returns above 240 for added leg length up to 1.) It
remains as future work to see whether different architectures, hyperparameter settings,
or ideas from curriculum learning [16] could successfully produce an adaptive policy
for leg length. Since we did not manage to use domain randomisation to train an adap-
tive controller for leg length, we did not investigate varying multiple elements of ρρρ as
we did in Chapter 4.
Figure 5.2: Episode returns of different policies evaluated at different added leg lengths: Default (CLR=0.005, CF=0.1), Split A (CLR=0.005, CF=0.2), Split A (CLR=0.005, CF=0.1) and Split B (CLR=0.005, CF=0.1).
Adjusting output of a frozen network
We ran some experiments using an idea similar to progressive networks [36] for train-
ing adaptive policies. We did not produce strong results (we did not have time for
trying many architectures) so we discuss this in the Appendix (Section D.2).
Chapter 6
Results and Evaluation
In this chapter, we compare the methods explored in Chapters 4 and 5. The basic
DeepMimic retargeting method presented by Peng et al. [34] is our baseline. We look
at training time, episode return, and controller robustness. For the baseline, we ran-
domly initialise weights, use the hyperparameters listed in Section 3.1.2, and train until
training return stops clearly increasing. (True convergence is hard to observe due to the
noisiness of deep RL.) We compare to Peng et al.’s values, though we use a different
physics simulator so we do not expect to match exactly. We also quote values from Al
Borno et al. [2] who use LQR-trees to represent physical controllers that track motion.
6.1 Training time
In this section, we compare the different approaches in terms of their offline and on-
line computation times. One of the main motivations behind this dissertation was the
time-consuming nature of DeepMimic. Training a new policy from scratch for each
character model takes many hours, ruling out the possibility of an interactive editor.
All DeepMimic training was run either on Edinburgh’s ECDF cluster (https://
www.ed.ac.uk/information-services/research-support/research-computing/
ecdf/), using 16 cores with 64GB RAM; or on a machine with 12 cores and 64GB
RAM. Bayesian optimisation was run with a 2.3GHz CPU and 16GB RAM.
6.1.1 Offline time
Offline time is the work done in order to reduce the online time. It should only be
needed once for each character type and motion clip.
Baseline For the baseline, we assume no offline computation.
LQR-Trees Since all training is model-specific, we assume no offline computation.
Combination of policies Training base policies takes place offline. This is an expen-
sive process - at least one base policy is required for each dimension of ρρρ, and perhaps
more if the range to be covered is large (e.g. varying scale between 1 and 2 may re-
quire a base policy at 1.5 as well as ones at 1 and 2). Furthermore, if corrective policies
are used, the number of base policies in fact grows exponentially with ρρρ’s dimensions
(Section 4.4.2). However, training time can be reduced by using transfer learning, as
shown by the experiments in Section 4.1.1, meaning each policy only takes around 20
million samples (≈ 10 hours using 16 cores) rather than 40 million. If model fitting
methods such as KNN are used, we must construct a training set. Our Section 4.5
experiments showed that 36 samples (each taking ≈ 30 min. to fit) were sufficient for
good predictions when varying arm length up to 180% and leg length up to 140%.
Domain randomisation The training can take place offline, since it is not specific to
any ρρρ. We trained the policies for 50-60 million samples (25-30 hours using 16 cores),
which is not much more than the 40-60 million samples the baseline policies took, and
yet this was sufficient to produce adaptive controllers that adapt to scales up to 1.5.
It could be that reducing exploration noise more slowly and training for longer could
improve performance further. Training graphs are in the Appendix (Chapter B).
6.1.2 Online time
Online time is the time taken to produce a controller for a new modified character. Our
main motivation was to investigate methods that reduce this as much as possible.
Baseline The baseline takes at least 40 million samples before the training return
stops clearly improving. When using 16 cores, this is at least 20 hours of training
time. Peng et al. [34] similarly report that their running controller was trained for ≈ 50
million samples. Our training graphs are shown in the Appendix (Section B.2).
LQR-Trees The first part of the method uses CMA-ES to do space-time optimisation
and produce a controller that tracks the reference motion; this takes around 5 hours.
The next part builds the LQR-tree to add robustness to external forces; this takes 2 days.
Combination of policies When combining policies, online computation is required
for finding the best interpolation weights. In Section 4, we used Bayesian optimisa-
tion for this search. One can specify how many function evaluations the optimisation
routine should use; when using 4 base policies, we observed that increasing this be-
yond 40 stopped yielding improvements. Figure 6.1 shows the returns achieved on 5
different ρρρ’s when using different numbers of function evaluations - the lines become
mostly flat above 40. On a 2.3GHz CPU, 40 iterations takes around 15 minutes.
One way to avoid this online search is to perform more offline computation, namely
to use model fitting to predict the weights (as explored in Section 4.5). In this case, the
online time of running ρρρ through the model (e.g. KNN) is practically negligible.
Figure 6.1: We apply Bayesian optimisation to 5 different ρρρ’s, varying the number of
function evaluations allowed from 2 to 60. For each ρρρ and number of function evaluations we
solve 3 times; we show mean and standard error. Each line represents a different ρρρ.
Domain randomisation No online computation is needed - we simply feed ρρρ into
the adaptive policy at runtime.
6.2 Episode return
The total reward obtained during an episode (the return) can be used as a quantitative
measure of how successfully a policy is following the motion capture data. It should
only be used to compare policies trained on the same ρρρ, since the maximum achievable
return varies. We note however that a high reward doesn’t always mean good-looking
motion, and some motions still appear unsteady. With more time, we would experiment
further with hyperparameters and reward function weights to try and address this.
Uniform character scaling
We first look at changing the overall scale of the model between 1 and 1.5. We compare
the following methods: a) the baseline, b) the combination of two base policies ([s =
1,a = 0, l = 0] and [s = 1.5,a = 0, l = 0]) where weights are found using brute-force,
and c) an adaptive neural network trained using domain randomisation (using the Split
A architecture which performed best in Chapter 5). Figure 6.2 shows the results.
Domain randomisation (with CF=0.1) achieved high returns across most of the
range, although it performed poorly for the default scale 1. Future work would be to
repeat the training multiple times with different random seeds to see if we were simply
‘unlucky’ here. At higher scales it outperforms the baseline, which was unexpected -
this could perhaps be due to the reduced CF, since we used CF=0.2 for all the baseline
models. (For comparison, we also show domain randomisation with CF=0.2, which
performs worse than the baseline.) Combination performs similarly to the baseline at
1 and 1.5 (where it is essentially using the same policies), but struggles between 1 and
1.3. Using another base policy (e.g. at 1.25) may be one way to improve this.
Overall, in these experiments it was domain randomisation that performed best.
Figure 6.2: Episode returns achieved at different character scales.
Scaling individual limbs
We now look at increasing the length of the arms and legs. We did not succeed at
training policies for this using domain randomisation, so we only compare the base-
line and combination. We fitted the combination weights using Bayesian optimisation
(using the hyperparameters identified in Section 4.4.1) and 40 function evaluations.
The results are in Figure 6.5, with the corresponding ρρρ’s shown in Table 6.1.
Combination performs similarly to the baseline, even demonstrating notable im-
provements for some ρρρ’s. This indicates that DeepMimic is not always reliable at
arm \leg 0.00 0.25 0.50 0.75 1.00
0.00 a b c d e
0.25 f - - j -
0.50 g - k - l
0.75 h m - - -
1.00 i - n - o
Table 6.1: Body parameter (ρρρ) values used in Figures 6.3 and 6.5.
finding an optimal policy when retargeting to models that do not match the motion
capture data, and that the flexibility offered by linearly combining base policies can
outperform it. It could be that we were unlucky with our weight initialisations, and
future work would be to train each baseline model multiple times to account for this.
Nevertheless, it can be seen that combination can achieve high returns.
Figure 6.3: Episode returns achieved for different arm/leg lengths (see Table 6.1).
6.3 Controller robustness
In this section, we evaluate the robustness of the different controllers to externally-
applied forces. One of the main advantages of using physically-based animation over
kinematic animation is the ability to produce emergent reactions to perturbations.
We follow the methodology of Peng et al. in the DeepMimic paper [34]. They
apply an external force halfway through the running motion cycle. It acts on the char-
acter’s pelvis in the direction perpendicular to forward movement, and is applied for
0.2 seconds. We begin with a 50N force, and if the character does not fall, we restart
the episode and increase the force applied by 25N, repeating until the character falls.
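A small sketch of this test procedure is shown below; run_episode_with_push is a hypothetical helper that applies the push described above and reports whether the character fell.

def max_tolerated_force(env, policy, start_n=50.0, step_n=25.0):
    force = start_n
    while True:
        # Push the pelvis sideways for 0.2 s halfway through the running cycle
        fell = env.run_episode_with_push(policy, force_newtons=force, duration_s=0.2)
        if fell:
            return force - step_n   # largest force that was survived
        force += step_n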
Peng et al. reported that their running controller could tolerate 300 N (i.e. a 300 × 0.2 = 60 Ns impulse), and stated this was comparable to other physical motion synthe-
sis methods such as SAMCON [29] (whose running controller can withstand a 50Ns
impulse). It is worth noting that no external forces are applied during training; Peng
et al. hypothesise that robustness comes from the exploration noise exposing the char-
acter to more states when training. Al Borno et al. [2] train using external forces, yet
report worse figures (their controller can withstand a 50 Ns impulse ≈ 30% of the time).
Uniform character scaling
Figure 6.4 shows the results when varying scale. Although we saw that domain ran-
domisation had the highest returns, the methods do similarly to each other here. (Means:
340±5N (baseline), 350±4N (combination), 340±10N (domain randomisation))
Figure 6.4: Maximum sideways force tolerated by controllers for different scales.
Scaling individual limbs
Figure 6.5 shows the results when arm and leg length are varied. As may be ex-
pected, extending limb length results in less robust controllers, however the methods
still achieve good means: 310±30N (baseline), 330±15N (combination).
Figure 6.5: Sideways force tolerated by controllers for different limb lengths (Table 6.1).
Chapter 7
Conclusions and Future Work
Conclusions
We investigated how to train adaptive controllers for physically-based animation which
can be quickly applied to character models whose body parameters have been modi-
fied. Two approaches were implemented: linearly combining controllers; and domain
randomisation. Both were inspired by sim-to-real research in robotics, and to our
knowledge have not been used for animation. We built our methods around Deep-
Mimic [34], a physical motion synthesis method that uses deep RL. We evaluated
them on animating a humanoid (with variable scale and limb length) to run.
Combination of controllers Our experiments used up to 4 base controllers, and
showed that it was possible to produce good controllers for the body parameter space
spanned by these controllers. We observed that linear interpolation did not find the op-
timal combination weights, and therefore we proposed using black-box optimisation.
Finally, we demonstrated that it is possible to apply model fitting methods (cubic inter-
polation and KNN) to learn from a training set of fitted combination weights in order
to cover the whole body parameter space and remove the need for online computation.
Domain randomisation We tried three different neural network architectures
and several different PPO hyperparameter settings. By inputting the character model’s
scale to the neural network, and randomly varying this during training, we demonstrate
that the resulting controller is able to adapt to different character model scales.
Comparison Both approaches could produce adaptive controllers for varying
character scales up to 150%. No online computation is required (if model fitting is
used for the linear combination approach), a substantial improvement over the > 20
hours needed by the baseline (using DeepMimic to train from scratch). In terms of
offline time, domain randomisation did not require much longer than the baseline’s
online time. The combination approach required training two base policies (speed-up
can be obtained using transfer learning). In terms of episode return (i.e. how closely
the controller follows the reference motion), domain randomisation outperformed both
linear combination and the baseline. In terms of robustness to external forces, all
approaches performed similarly, and comparably to results in the literature [34, 29, 2].
The combination approach could produce controllers for varying arm/leg length
(increasing by up to 80% and 40% respectively), achieving similar or better returns and
robustness to the baseline. Domain randomisation was unable to handle varying leg
length, indicating it is a harder approach to tune. However, the combination procedure
is more complex, and as we vary more body parameter dimensions, the number of base
policies needed (and weight search space dimensions) grows exponentially.
Future work
We showed that training adaptive controllers using DeepMimic is possible. They re-
quire no online computation, whilst generating motion that is robust and mostly fol-
lows the reference motion as well as the DeepMimic baseline. However, the motions
do not quite look consistent enough to be used in practice. Given more time, we would
try to reduce the unsteadiness in the motions by tuning the reward function and hyper-
parameters (e.g. PD gains) better. We also note that PyBullet [5] currently does not
allow setting rotation limits on joints with 4 DOF, which can lead to strange motions.
We would like to try other body variations and motions, and also non-humans. It
would also be important to see if our controllers can be made interactively-controllable.
We note the scalability of the combination approach to more body parameter di-
mensions as a potential issue; future work could investigate this. In particular, it should
be possible to design heuristics to guide the weight optimisation process. The model
fitting method that works best may also change as dimensions grow.
We believe that tuning the hyperparameters and architecture more thoroughly is key
to improving domain randomisation’s performance. We would also have liked to carry
out more repeats to assess domain randomisation’s reliability across random seeds. We
only performed preliminary experiments using progressive networks (Section D.2) but
believe they are an interesting avenue; we simply did not find the right architecture in
the time we had. Another idea is curriculum learning [16], in which one designs
how the domain varies during training rather than using pure randomisation.
Bibliography
[1] Mazen Al Borno, Martin De Lasa, and Aaron Hertzmann. Trajectory optimiza-
tion for full-body movements with complex contacts. IEEE transactions on visu-
alization and computer graphics, 19(8):1405–1414, 2012.
[2] Mazen Al Borno, Ludovic Righetti, Michael J Black, Scott L Delp, Eugene Fi-
ume, and Javier Romero. Robust physics-based motion retargeting with realistic
body shapes. In Computer Graphics Forum, volume 37, pages 81–92. Wiley
Online Library, 2018.
[3] Peter Alfeld. A trivariate Clough-Tocher scheme for tetrahedral data. Computer
Aided Geometric Design, 1(2):169–181, 1984.
[4] Bobby Bodenheimer, Chuck Rose, Seth Rosenthal, and John Pella. The process
of motion capture: Dealing with the data. In Computer Animation and Simulation '97, pages 3–18. Springer, 1997.
[5] Erwin Coumans and Yunfei Bai. Pybullet, a python module for physics simula-
tion for games, robotics and machine learning. GitHub repository, 2016.
[6] T Cover and P Hart. Nearest neighbor pattern classification. IEEE Transactions
on Information Theory, 13(1):21–27, 1967.
[7] Peter I Frazier. A tutorial on bayesian optimization. arXiv preprint
arXiv:1807.02811, 2018.
[8] Thomas Geijtenbeek and Nicolas Pronost. Interactive character animation us-
ing simulated physics: A state-of-the-art review. In Computer graphics forum,
volume 31, pages 2492–2515. Wiley Online Library, 2012.
[9] Ruben Glatt, Felipe Leno Da Silva, and Anna Helena Reali Costa. Towards
knowledge transfer in deep reinforcement learning. In Intelligent Systems
(BRACIS), 2016 5th Brazilian Conference on, pages 91–96. IEEE, 2016.
[10] Michael Gleicher. Retargetting motion to new characters. In Proceedings of the
25th annual conference on Computer graphics and interactive techniques, pages
33–42. ACM, 1998.
[11] Shihui Guo, Richard Southern, Jian Chang, David Greer, and Jian Zhang. Adap-
tive motion synthesis for virtual characters: a survey. The Visual Computer,
31(5):497–512, 2015.
[12] Abhishek Gupta, Coline Devin, YuXuan Liu, Pieter Abbeel, and Sergey Levine.
Learning invariant feature spaces to transfer skills with reinforcement learning.
arXiv preprint arXiv:1703.02949, 2017.
[13] Christian Habermann and Fabian Kindermann. Multidimensional spline inter-
polation: Theory and applications. Computational Economics, 30(2):153–169,
2007.
[14] Nikolaus Hansen. The cma evolution strategy: A tutorial. arXiv preprint
arXiv:1604.00772, 2016.
[15] Nikolaus Hansen, Youhei Akimoto, and Petr Baudis. CMA-ES/pycma on Github.
Zenodo, DOI:10.5281/zenodo.2559634, February 2019.
[16] Nicolas Heess, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval
Tassa, Tom Erez, Ziyu Wang, SM Eslami, Martin Riedmiller, et al. Emergence of
locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286,
2017.
[17] Jessica K Hodgins and Nancy S Pollard. Adapting simulated behaviors for new
characters. In Proceedings of the 24th annual conference on Computer graphics
and interactive techniques, pages 153–162. ACM Press/Addison-Wesley Pub-
lishing Co., 1997.
[18] Daniel Holden, Jun Saito, and Taku Komura. A deep learning framework for
character motion synthesis and editing. ACM Transactions on Graphics (TOG),
35(4):138, 2016.
[19] Mohd Izani, AR Eshaq, et al. Keyframe animation and motion capture for cre-
ating animation: a survey and perception from industry people. In Proceedings.
Student Conference on Research and Development, 2003. SCORED 2003., pages
154–159. IEEE, 2003.
[20] Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific
tools for Python, 2001-2019.
[21] Pushkar Joshi, Wen C Tien, Mathieu Desbrun, and Frederic Pighin. Learning
controls for blend shape based realistic facial animation. In ACM Siggraph 2006
Courses, page 17. ACM, 2006.
[22] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume
Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Ag-
nieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural
networks. Proceedings of the national academy of sciences, 114(13):3521–3526,
2017.
[23] Taku Komura, Ikhsanul Habibie, Jonathan Schwarz, and Daniel Holden. Data-
driven character animation synthesis. Handbook of Human Motion, pages 1–29,
2017.
[24] Lucas Kovar and Michael Gleicher. Automated extraction and parameterization
of motions in large data sets. In ACM Transactions on Graphics (ToG), vol-
ume 23, pages 559–568. ACM, 2004.
[25] GL Manoj Kumar and Tim Head. Scikit-optimize. Tim Head and contributors,
2017.
[26] Alessandro Lazaric, Marcello Restelli, and Andrea Bonarini. Transfer of samples
in batch reinforcement learning. In Proceedings of the 25th International Con-
ference on Machine Learning, ICML ’08, pages 544–551, New York, NY, USA,
2008. ACM.
[27] John P Lewis, Ken Anjyo, Taehyun Rhee, Mengjie Zhang, Frederic H Pighin, and
Zhigang Deng. Practice and theory of blendshape facial models. Eurographics
(State of the Art Reports), 1(8):2, 2014.
[28] Libin Liu and Jessica Hodgins. Learning basketball dribbling skills using trajec-
tory optimization and deep reinforcement learning. ACM Transactions on Graph-
ics (TOG), 37(4):142, 2018.
[29] Libin Liu, Michiel Van De Panne, and KangKang Yin. Guided learning of con-
trol graphs for physics-based characters. ACM Transactions on Graphics (TOG),
35(3):29, 2016.
[30] Josh Merel, Yuval Tassa, Sriram Srinivasan, Jay Lemmon, Ziyu Wang, Greg
Wayne, and Nicolas Heess. Learning human behaviors from motion capture by
adversarial imitation. arXiv preprint arXiv:1707.02201, 2017.
[31] Vishal M Patel, Raghuraman Gopalan, Ruonan Li, and Rama Chellappa. Vi-
sual domain adaptation: A survey of recent advances. IEEE signal processing
magazine, 32(3):53–69, 2015.
[32] Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, Vincent Michel,
Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron
Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. Journal
of machine learning research, 12(Oct):2825–2830, 2011.
[33] Tomislav Pejsa and Igor S Pandzic. State of the art in example-based motion
synthesis for virtual characters in interactive applications. In Computer Graphics
Forum, volume 29, pages 202–226. Wiley Online Library, 2010.
[34] Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. Deep-
mimic: Example-guided deep reinforcement learning of physics-based character
skills. ACM Trans. Graph., 37(4):143:1–143:14, 2018.
[35] Aravind Rajeswaran, Sarvjeet Ghotra, Balaraman Ravindran, and Sergey Levine.
Epopt: Learning robust neural network policies using model ensembles. arXiv
preprint arXiv:1610.01283, 2016.
[36] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James
Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progres-
sive neural networks. arXiv preprint arXiv:1606.04671, 2016.
[37] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp
Moritz. Trust region policy optimization. In International Conference on Ma-
chine Learning, pages 1889–1897, 2015.
[38] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter
Abbeel. High-dimensional continuous control using generalized advantage es-
timation. arXiv preprint arXiv:1506.02438, 2015.
[39] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov.
Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,
2017.
[40] Jonathan Richard Shewchuk. Delaunay refinement algorithms for triangular mesh
generation. Computational geometry, 22(1-3):21–74, 2002.
[41] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian opti-
mization of machine learning algorithms. In Advances in neural information
processing systems, pages 2951–2959, 2012.
[42] Kwang Won Sok, Manmyung Kim, and Jehee Lee. Simulating biped behaviors
from human motion data. In ACM Transactions on Graphics (TOG), volume 26,
page 107. ACM, 2007.
[43] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction.
MIT press, 2018.
[44] Andrew Witkin and Michael Kass. Spacetime constraints. ACM Siggraph Com-
puter Graphics, 22(4):159–168, 1988.
[45] Jungdam Won, Jongho Park, Kwanyu Kim, and Jehee Lee. How to train your
dragon: example-guided control of flapping flight. ACM Transactions on Graph-
ics (TOG), 36(6):198, 2017.
[46] Katsu Yamane, Yuka Ariki, and Jessica Hodgins. Animating non-humanoid
characters with human motion data. In Proceedings of the 2010 ACM SIG-
GRAPH/Eurographics Symposium on Computer Animation, pages 169–178. Eu-
rographics Association, 2010.
[47] KangKang Yin, Kevin Loken, and Michiel Van de Panne. Simbicon: Simple
biped locomotion control. In ACM Transactions on Graphics (TOG), volume 26,
page 105. ACM, 2007.
[48] Wenhao Yu, C Karen Liu, and Greg Turk. Policy transfer with strategy optimiza-
tion. arXiv preprint arXiv:1810.05751, 2018.
[49] Wenhao Yu, Jie Tan, C Karen Liu, and Greg Turk. Preparing for the unknown:
Learning a universal policy with online system identification. arXiv preprint
arXiv:1702.02453, 2017.
[50] Chao Zhang, Yang Yu, and Zhi-Hua Zhou. Learning environmental calibration
actions for policy self-evolution. In IJCAI, pages 3061–3067, 2018.
Appendix A
Reward Function Terms
These are the terms used to compute the reward (Section 3.2). We use the same defini-
tions as given in the DeepMimic paper [34]. Each joint j additionally has an associated
weight w j to denote its relative importance. These are displayed in Figure A.1.
$$r^p_t = \exp\!\left[-2\left(\sum_j w_j \left\| \hat{q}^j_t \ominus q^j_t \right\|^2\right)\right] \tag{A.1}$$
$q^j_t$ denotes the angle of the simulated character's joint $j$ at time $t$ (represented as a quaternion). $\hat{q}^j_t$ denotes the corresponding target angle from the reference motion. $\ominus$ takes the quaternion difference, and $\|\cdot\|$ turns a quaternion into the scalar rotation about its axis.
$$r^v_t = \exp\!\left[-0.1\left(\sum_j w_j \left\| \dot{\hat{q}}^j_t - \dot{q}^j_t \right\|^2\right)\right] \tag{A.2}$$
$\dot{q}^j_t$ and $\dot{\hat{q}}^j_t$ denote the simulated character and reference motion's joint velocities respectively. $\dot{\hat{q}}^j_t$ is computed from the reference motion using finite differences.
$$r^e_t = \exp\!\left[-40\left(\sum_e w_e \left\| \hat{p}^e_t - p^e_t \right\|^2\right)\right] \tag{A.3}$$
$p^e_t$ and $\hat{p}^e_t$ represent the global positions of end-effector $e$ for the simulated character and reference motion respectively.
$$r^c_t = \exp\!\left[-10\left(w_c \left\| \hat{p}^c_t - p^c_t \right\|^2\right)\right] \tag{A.4}$$
$p^c_t$ and $\hat{p}^c_t$ represent the global position of the root (pelvis) for the simulated char-
acter and reference motion respectively.
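As an illustration of how the pose term (A.1) can be evaluated, here is a small self-contained sketch; the quaternion helpers are our own, not taken from the simulator.

import numpy as np

def quat_conjugate(q):
    w, x, y, z = q
    return np.array([w, -x, -y, -z])

def quat_multiply(a, b):
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ])

def pose_reward(q_sim, q_ref, joint_weights):
    """q_sim, q_ref: lists of unit quaternions (w, x, y, z) per joint; joint_weights: the w_j."""
    total = 0.0
    for q, q_hat, w in zip(q_sim, q_ref, joint_weights):
        diff = quat_multiply(q_hat, quat_conjugate(q))              # the quaternion difference
        angle = 2.0 * np.arccos(np.clip(abs(diff[0]), 0.0, 1.0))    # scalar rotation about its axis
        total += w * angle ** 2
    return np.exp(-2.0 * total)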
Figure A.1: Diagram of the actuable joints in the humanoid model. Each joint is labelled
with a black circle, along with its degrees-of-freedom (DOF) and reward function weight
w j.
Appendix B
Training Graphs
B.1 Domain randomisation
Figure B.1: Training graphs (return against number of samples) for the domain randomisation experiments in Section 5.2 (the bottom graph in Figure 5.1): Default, Split A and Split B, each trained with CF=0.2 and CF=0.1. Split A (CF=0.1) corresponds to the domain randomisation controller used for evaluation in Chapter 6.
B.2 Baseline
Figure B.2: Training graphs (return against number of samples) for the baseline policies used for evaluation in Chapter 6: scales s = 1.00, 1.12, 1.24, 1.36 and 1.50, and body parameters a–o from Table 6.1 (a is the same as s = 1.00). The solid lines show training return, the dashed lines show test return.
Appendix C
Combination of Policies
C.1 Transferred vs non-transferred base policies
In Section 4.1.1, we observed that using transfer learning enabled base policies to be
trained more quickly. An interesting question is whether interpolation works better
when using base policies that were all trained using transfer from the default base
policy than when using base policies that were trained from a random initialisation.
Since the policy space in which DeepMimic operates will contain many local optima,
policies trained from different random initialisations are likely to end up in different
local optima, which may mean they combine together less effectively.
We ran two sets of experiments to test this hypothesis. The first looks at varying
overall scale (base policies: [s = 1,a = 0, l = 0] and [s = 1.5,a = 0, l = 0]); the second
looks at varying both arm and leg length at once (base policies: [s = 1,a = 0, l = 0]
and [s = 1,a = 1, l = 1]). All experiments use the same default model policy. We then
compare the results of training the second policy from scratch against training it using
transfer from the default policy. Figure C.1 shows the results.
In both graphs, neither scheme consistently outperforms the other. When varying
scale, the transfer learning policy does much better than the randomly-initialised policy
for larger scales; when varying limb length, it performs better for smaller lengths.
Overall, there is not enough evidence to strongly support our hypothesis (that policies
initialised from the same default policy combine together better), but since it does not
make performance notably worse (and does better at high scales), and it also speeds up
training times, we used transferred policies in our Chapter 4 and 6 experiments.
Figure C.1: Comparison of episode returns for different ρρρ’s when using the default policy
and another policy trained from scratch, against using the default policy and another
policy initialised from the default policy. In the top graph we vary scale, in the bottom
graph we modify arm and leg length simultaneously (so e.g. 0.5 on the x-axis denotes
[s = 1,a = 0.5, l = 0.5]).
C.2 Segmenting the action vector
An experimental idea was to split the action vector into segments, and use a different
interpolation weight for each segment of the vector. A similar idea has been proposed
for facial animation using blend shapes where the face is split into three areas that
can be manipulated independently [21]. However, the finer-grain control over what
is taken from each base policy comes at the cost of increasing the dimensions of the
weight search space.
Four different ways of segmenting the action vector were tried:
• 1 segment
• 3 segments: head and torso, arms, legs
• 5 segments (a): head and torso, right arm, right leg, left arm, left leg
• 5 segments (b): head and torso, shoulders, elbows and wrists, hips, knees and
ankles
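To make the idea concrete, the sketch below shows per-segment weighting for the 3-segment scheme; the index ranges are purely illustrative, and we assume here (for illustration only) that the blend is applied to the base policies' action outputs.

import numpy as np

# Illustrative index ranges into the action vector for the 3-segment scheme
SEGMENTS = {"head_torso": slice(0, 8), "arms": slice(8, 20), "legs": slice(20, 36)}

def combine_per_segment(base_actions, segment_weights):
    """base_actions: list of action vectors, one per base policy;
    segment_weights[name]: one weight per base policy for that segment (summing to 1)."""
    combined = np.zeros_like(base_actions[0])
    for name, idx in SEGMENTS.items():
        for action, weight in zip(base_actions, segment_weights[name]):
            combined[idx] += weight * action[idx]
    return combined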
In order to assess the performance of these different segmentation approaches, we
ran experiments using each approach where we used Bayesian optimisation to find the
best interpolation weights. We ran one set of experiments that varies arm length (base
policies: [s = 1,a = 0, l = 0], [s = 1,a = 1, l = 0]), and another that varies leg length
(base policies: [s= 1,a= 0, l = 0], [s= 1,a= 0, l = 1]). We used the parameter settings
that performed best in Section 4.4.1 and used 40 function evaluations. Figure C.2 shows
the results.
Figure C.2: Episode returns obtained using different approaches for segmenting the ac-
tion vector. For the single-segment approach, the best weights were found using brute-
force at a resolution of 0.1. For the multiple-segment approaches, Bayesian search was
run three times for each ρρρ; the bar plots show the mean and standard error. In the left
plot, the arm length was varied. In the right plot, the leg length was varied.
It can be seen that segmenting the action vector does not offer any clear improve-
ments, with the returns of the different approaches all being similar. For some ρρρ’s,
the segmented approaches even do worse than using a single segment; this is an issue
with Bayesian optimisation not finding the true optimum (since the multiple-segment
approaches can represent every policy that the single-segment approach can). This il-
lustrates the trade-off between increasing return and increasing the dimensions of the
search space.
Future work could look at whether allowing more function evaluations during op-
timisation could improve the performance of segmented approaches. Heuristics could
also be investigated e.g. first optimising using the single-segment approach, and using
the weights found to initialise the search process for a multiple-segment approach.
Appendix D
Domain Randomisation
D.1 Neural network architecture diagrams
Figure D.1: Actor network architectures used in Chapter 5. Left to right: Default, Split A,
and Split B. The critic network architectures are the same except that the final aaa linear
output layer is replaced by a single linear unit. We also show whether layers use the
ReLU activation or a simple linear activation.
D.2 Adjusting output of a frozen network
In this section, we investigate whether it is possible to learn how to adjust the action
vector aaa outputted by a non-adaptive policy in order to handle different ρρρ’s. The idea
is related to progressive neural networks [36] which were proposed as a domain adap-
tation method for RL; Rusu et al. first train a policy on a source task, and then freeze
those policy layers and connect them to new layers that are trained on the target task.
Our approach is to first train a non-adaptive policy with the default two-layer ar-
chitecture on the default ρρρ. This default architecture only takes the character state sss as
input. Once this policy has converged, we then freeze those layers, and add new layers
Figure D.2: Actor network architectures. Left to right: A, B, C. The critic network archi-
tectures are the same except that the final aaa linear output layer is replaced by a single
linear unit. The greyed-out layers (in the top left of each network) represent the frozen
default network - these have first been trained on the default character model, and are
then frozen during the domain randomisation training.
that process ρρρ in order to learn how to adjust the action aaa outputted by the default archi-
tecture into a new action aaa′′′. These new layers are trained using domain randomisation
- we randomly sample the elements of ρρρ at the start of each episode.
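Purely as an illustration of the general idea (not a reproduction of architectures A, B or C, whose exact wiring is shown in Figure D.2, and with PyTorch used only for the sketch), a frozen default actor whose output is adjusted by new ρρρ-conditioned layers might look like this:

import torch
import torch.nn as nn

class FrozenAdjustedActor(nn.Module):
    def __init__(self, frozen_actor, rho_dim, action_dim, hidden=36):
        super().__init__()
        self.frozen = frozen_actor
        for p in self.frozen.parameters():
            p.requires_grad = False            # the pre-trained default policy stays fixed
        self.adjust = nn.Sequential(           # new trainable layers that process rho and a
            nn.Linear(rho_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, rho):
        a = self.frozen(state)                               # action from the frozen network
        return a + self.adjust(torch.cat([rho, a], dim=-1))  # adjusted action a'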
We tried three different architectures (A, B, C), shown in Figure D.2. In our ex-
periments, we tried training an adaptive controller for scale varying between 1 and
1.2. The significantly different network architecture meant that the hyperparameters
we used in other experiments (Section 3.1.2) were unsuitable, in particular the learn-
ing rates. Long training times prevented a thorough search for suitable new values,
but the best results we achieved are shown in Figure D.3. The hyperparameters that
we changed from the default values are the following: actor learning rate: $10^{-4}$, critic
learning rate: $10^{-2}$, clipping factor ε: 0.05. It can be seen that none of the archi-
tectures achieves consistently high returns (it is possible to achieve above 280 for all
scales between 1 and 1.5). We leave the investigation of different architectures and
hyperparameters as future work.
Figure D.3: Episode returns of the different architectures, A (36 units), B (72 units) and C (36 + 36 units), evaluated at different scales.