Learning the Locomotion of
Different Body Sizes Using Deep
Reinforcement Learning
Wenying Wu
Master of Science
Artificial Intelligence
School of Informatics
University of Edinburgh
2019
Abstract
Ensuring the realism of 3D animation is essential to maintaining audience immersion.
Motion synthesis aims to generate realistic motion from only high-level control pa-
rameters (e.g. the trajectory to follow). In particular, physical motion synthesis uses
physically-based animation, where all motion is the result of simulated forces. To
ensure realism, these techniques can aim to imitate reference motion capture data.
The advantage of physically-based animation is the ability to produce emergent, and
physically-realistic, responses to unexpected external disturbances e.g. different ter-
rains.
An important consideration for methods that use motion capture data is retarget-
ing, i.e. how to apply the resulting animations to characters whose physical properties
do not match those of the motion capture actor. Existing physical motion synthesis re-
search does not place much focus on efficient retargeting, and thus existing techniques
can take many hours to train controllers for a modified character model.
In this dissertation, we look at whether it is possible to train adaptive controllers
that can quickly adapt to different body parameters. This could be used to develop a
tool for physically-based animation where animators can edit characters interactively.
We build our experiments around DeepMimic, a recent method for physical motion
synthesis that uses deep Reinforcement Learning to imitate reference motion capture
data. The DeepMimic authors show it is possible to retarget the motion data to different
models with DeepMimic, but this requires many hours of training. We use this method
as our baseline.
We implement and compare two approaches (both inspired by sim-to-real transfer
research for robotics): linearly combining base controllers to produce a new controller
fitted to the desired body parameters, and domain randomisation. We evaluate these
methods on a humanoid model with varying scales and limb lengths, and demonstrate
that it is indeed possible to train controllers that can adapt to different body parameters
with no online computation as opposed to needing many hours. These controllers
follow the reference motion as well as the baseline can, whilst also demonstrating
comparable robustness to external forces.
Acknowledgements
I firstly would like to thank my supervisor Taku Komura for suggesting this very in-
teresting research project and for his input throughout the process. I would also like
to thank Sebastian Starke, Levi Fussell and Robin Trute for their valuable ideas and
insights at the start of the project. I would like to thank Erwin Coumans and Yunfei
Bai for the open-source PyBullet simulator which the project uses. Finally, I would
like to thank my friends here in Edinburgh (especially my fellow Floor 6 frequenters)
for their company, and my parents for their continual support.
Table of Contents
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.4 Outline of dissertation . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 Background and Related Work 4
2.1 Character animation . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.1 Motion synthesis . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Retargeting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.1 Retargeting kinematic animation . . . . . . . . . . . . . . . . 6
2.2.2 Retargeting physically-based animation . . . . . . . . . . . . 6
2.3 Controller transfer in robotics . . . . . . . . . . . . . . . . . . . . . . 7
2.3.1 Domain adaptation and transfer learning . . . . . . . . . . . . 7
2.3.2 Domain randomisation . . . . . . . . . . . . . . . . . . . . . 8
2.3.3 Linear combination of policies . . . . . . . . . . . . . . . . . 8
3 DeepMimic 9
3.1 Reinforcement learning . . . . . . . . . . . . . . . . . . . . . . . . . 9
3.1.1 Proximal policy optimisation . . . . . . . . . . . . . . . . . . 10
3.1.2 Algorithm and hyperparameters . . . . . . . . . . . . . . . . 11
3.2 Problem representation . . . . . . . . . . . . . . . . . . . . . . . . . 12
3.3 Physics simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3.4 Character and reference data . . . . . . . . . . . . . . . . . . . . . . 13
3.4.1 Modifying body parameters . . . . . . . . . . . . . . . . . . 14
3.4.2 Retargeting motion capture data . . . . . . . . . . . . . . . . 15
3.5 Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
4 Combination of Controllers 16
4.1 Base policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4.1.1 Transfer learning . . . . . . . . . . . . . . . . . . . . . . . . 16
4.2 Combining policies . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
4.3 Interpolating two policies . . . . . . . . . . . . . . . . . . . . . . . . 18
4.4 Linearly combining multiple policies . . . . . . . . . . . . . . . . . . 19
4.4.1 Black-box optimisation . . . . . . . . . . . . . . . . . . . . . 20
4.4.2 Corrective base policies . . . . . . . . . . . . . . . . . . . . 22
4.4.3 Bilinear interpolation . . . . . . . . . . . . . . . . . . . . . . 24
4.5 Model fitting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
4.5.1 Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . 26
4.5.2 K-Nearest Neighbours . . . . . . . . . . . . . . . . . . . . . 27
4.5.3 Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
5 Domain Randomisation 29
5.1 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
5.1.1 Network architecture . . . . . . . . . . . . . . . . . . . . . . 29
5.1.2 PPO hyperparameters . . . . . . . . . . . . . . . . . . . . . . 30
5.2 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
6 Results and Evaluation 33
6.1 Training time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
6.1.1 Offline time . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
6.1.2 Online time . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
6.2 Episode return . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
6.3 Controller robustness . . . . . . . . . . . . . . . . . . . . . . . . . . 37
7 Conclusions and Future Work 39
Bibliography 41
A Reward Function Terms 46
B Training Graphs 48
B.1 Domain randomisation . . . . . . . . . . . . . . . . . . . . . . . . . 48
B.2 Baseline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
C Combination of Policies 50
C.1 Transferred vs non-transferred base policies . . . . . . . . . . . . . . 50
C.2 Segmenting the action vector . . . . . . . . . . . . . . . . . . . . . . 51
D Domain Randomisation 53
D.1 Neural network architecture diagrams . . . . . . . . . . . . . . . . . 53
D.2 Adjusting output of a frozen network . . . . . . . . . . . . . . . . . . 53
Chapter 1
Introduction
1.1 Motivation
3D animation is ubiquitous in films and video games. In order to ensure the audience
remains immersed, it is essential that animation appears believable and realistic. In the
last couple of decades, motion capture has become a widely-used technique for ani-
mation [19]; however, it is expensive, requires manual post-processing, and is limited
in terms of flexibility and controllability. Motion synthesis research aims to address
these issues by looking at how to generate realistic-looking motions according to only
high-level control inputs, often relying on a database of motion capture data [33].
Most character animation in industry is kinematic (operating directly on positions
and velocities); however, there has also long been interest in refining physically-based
animation techniques where all motion is the result of simulated forces [8]. Physical
motion synthesis is an exciting avenue of research that offers a lot of potential for pro-
ducing characters that can react realistically to unseen environmental changes, a feature
that would be incredibly useful in interactive applications. To ensure the synthesised
motions appear realistic, physical motion synthesis methods may use techniques such
as space-time optimisation [44, 2] or reinforcement learning (RL) [34] to minimise the
difference between the simulated motion and reference motion capture data.
One issue with motion capture data is that often we wish to animate characters
whose physical characteristics do not match those of the motion capture actor. For
example, we may want to animate non-human characters, or characters with artistic
but unrealistic proportions, or simply to re-use the same data for different characters
to reduce costs. In these situations, we must take care to avoid undesirable artefacts
such as foot skating, floor penetration, or unnatural movement when transferring the
motion. This transferring is known as retargeting, and established techniques exist
both in the literature and in industry [11].
Most of these retargeting methods are for kinematic animation rather than physically-
based animation. It has been demonstrated that one can use motion capture data to
produce physical controllers for character models with different body parameters, but
as it stands, these methods require long and compute-heavy training/fitting procedures
for each different model [34, 2]. Depending on the method, this can take up to several
days, ruling out the possibility of an interactive editor. Ideally, we would like to be
able to develop tools which allow animators to adjust character models without being
exposed to long waiting times or the under-the-hood details of training.
1.2 Objective
This dissertation looks at whether it is possible to train adaptive physical controllers
that can quickly be used for models with different body parameters. This would al-
low animators to be more experimental and flexible with character designs as they
work with physically-based animation and motion capture data. In particular, we ad-
dress whether techniques proposed in robotics research for overcoming the Reality
Gap (the discrepancy between simulated training environments and the real world) can
be adapted to physically-based animation; and if so, which techniques are the most
promising with respect to the quality of the motion produced, the amount of online and
offline training time, and the controllers’ robustness to environmental perturbations.
1.3 Contributions
In this dissertation, we look at training adaptive controllers from DeepMimic con-
trollers. DeepMimic is a state-of-the-art physical motion synthesis method which is
able to train physically-animated characters to imitate motion capture clips [34] by
using Proximal Policy Optimisation (PPO), a deep RL algorithm [39]. Peng et al.
demonstrate that their method can train controllers for an Atlas robot model using hu-
man motion capture data, despite the fact that Atlas has different proportions and mass
distribution to the human actor. The issue is that this takes 1-2 days to train.
We design and implement two possible approaches for producing adaptive con-
trollers, and evaluate them on humanoid character models with varying body parame-
ters (e.g. limb length). The approaches are:
• Learning how to linearly combine multiple controllers in order to produce a new controller that is best fitted to the given body parameters. We show that
linear interpolation does not always find the best combination weights and that
black-box optimisation methods such as Bayesian optimisation can be applied
instead. However we also demonstrate that once we have the optimised weights
for a set of body parameters, one can use methods such as K-Nearest Neighbours
and cubic interpolation to cover the whole body parameter space.
• Using domain randomisation to train controllers that adjust their outputs depending on the inputted body parameters. This involves trying different
neural network architectures and tuning the PPO hyperparameters. We also ran
some preliminary experiments using progressive networks [36].
Both approaches are inspired by robotics research [49, 50] and, as far as we are
aware, have not been applied to DeepMimic or any other physical motion synthesis
method. Our results show that it is possible to produce adaptive controllers using
both approaches, which is a novel discovery for computer graphics. We also suggest
directions for future research and improvement, and point out the observed limitations.
1.4 Outline of dissertation
We begin in Chapter 2 by presenting an overview of relevant concepts for the project,
relating them to the graphics literature, and further motivating the work carried out. We
discuss existing research into retargeting for physically-based animation, as well as the
robotics research that inspired the methods we use. Chapter 3 explains RL, PPO and
DeepMimic in more detail. We also describe the humanoid character model we use
in our experiments. Chapters 4 and 5 describe our implementations of linearly com-
bining controllers and domain randomisation respectively. The experimental results
that motivated design choices are also presented. The main evaluation and discussion
takes place in Chapter 6, where we take the best methods from Chapters 4 and 5 and
compare them against each other, as well as against the baseline of basic DeepMimic,
in order to address the questions in the objective. Finally, Chapter 7 draws conclusions
about the different methods and suggests ideas for future work.
Supplementary video We provide a video of some of the resulting animations:
https://drive.google.com/open?id=1Ff88lhvs8mLrOH0vzHBi3X75sMvoX73o
Chapter 2
Background and Related Work
In this chapter, we discuss the relevant background. We begin by explaining the moti-
vation behind physical motion synthesis methods such as DeepMimic. We then discuss
existing retargeting techniques. Finally, we outline the robotics research into overcom-
ing the Reality Gap that provided inspiration for our methods in later chapters.
2.1 Character animation
This dissertation looks at 3D animation. The character being animated is represented as
a hierarchy of bones linked together by joints; this structure is known as the skeleton.
Character poses can be represented by specifying the position of the root body part
(often the pelvis) and the rotations of each joint. Forward kinematics can then compute
the position of end-effectors (hands, feet). One way to create animations is to manually
design a sequence of poses (keyframes), and use interpolation on the joint rotations to
smoothly transition the character from one keyframe to the next. These keyframes can
also be obtained from motion capture data. These approaches to animation are purely
kinematic - we operate directly on the positions and rotations of joints, and are not
constrained to following the laws of physics. This is advantageous in that it offers the
animator more control, but this can also be a drawback: if we want to adapt a motion
to a different character, or to react to an external force, we have to do this manually.
Physically-based animation is an alternative where all motion is the result of sim-
ulated forces, and we design controllers that drive the character’s joints. This is less
intuitive than designing kinematic motions; however, unlike kinematic methods, char-
acters can react to perturbations in a physically realistic way. This is an especially
desirable property for interactive applications such as video games [8].
2.1.1 Motion synthesis
The idea of motion synthesis is to artificially synthesise realistic character motion that
follows high-level control inputs (such as the trajectory to be followed), avoiding the
time-consuming task of designing motions frame-by-frame at a low level. Such tech-
niques would be especially useful in the development of interactive video games where
character motion depends on real-time input and it is thus impossible to pre-design
every possible movement. Industry currently relies on methods such as inverse kine-
matics [4] to adapt animation clips to react to control inputs, but this often produces
undesirable visual artefacts such as foot skating and terrain penetration [33].
Motion synthesis methods can be broadly categorised into two approaches: example-
based and simulation-based. Example-based (i.e. data-driven) methods use a large
database of reference motions and apply statistical methods such as deep learning and
autoencoders [18] to produce different motions from the data. On the other hand,
simulation-based methods use physically-based animation to synthesise the motions.
Geijtenbeek et al. [8] provide a survey of physical motion synthesis. Older research
mostly uses motion tracking and model-based methods. Tracking methods however
require motion clips that are feasible to track, and the ability to generalise beyond
the clip is limited [42]. Model-based methods require expert insight into designing
the controller, limiting the range of motions they are applicable for, e.g. SIMBICON
only works for biped locomotion [47]. Stimulus-response network controllers aim to
be more general and reduce domain engineering [8]. The controller is represented
as a parameterised function, e.g. a neural network, and trained using methods such
as RL. DeepMimic [34] is a recent method that uses deep RL to train physical con-
trollers to imitate reference motion data. SAMCON is another method that instead
uses sampling-based optimisation [29]. Other recent research includes [28, 45, 30].
We chose to use DeepMimic in our research as its framework is simple and general;
the controller is just a neural network, and the method has been shown to work on
motions from crawling to doing backflips, as well as on a variety of characters such
as dragons. One can also specify a task during training (e.g. to move in a certain
direction, or throw a ball at a target), and the training will produce a controller that
satisfies this whilst still staying faithful to the reference motion. Being able to quickly
retarget DeepMimic controllers would thus open the doors to interactive tools for this
large range of motions. Furthermore, the controllers also demonstrate robustness to
uneven terrain or unexpected forces. DeepMimic is explained in detail in Chapter 3.
2.2 Retargeting
Retargeting is where we adapt motion that was recorded (or designed manually using
keyframes) for one character model so it can be used by another character model.
This means we can re-use the motion, saving time and cost. Sometimes retargeting is
unavoidable, e.g. using motion-capture data to animate models that do not match the
motion capture actor because they have unrealistic proportions, or are non-human [46].
We focus on retargeting between models with the same topological structure (i.e. the
skeleton parts are connected in the same way, but may differ in dimension).
2.2.1 Retargeting kinematic animation
Most research into retargeting looks at kinematic animation; Guo et al. [11] provide a
survey. Pioneering work by Gleicher et al. [10] describes how to use space-time op-
timisation to perform retargeting. They propose an objective function that minimises
the pose difference in each frame, combined with constraints on the character’s joints
e.g. ‘elbows cannot bend backwards’, ‘this foot should be stationary in this time in-
terval’. The difficulty with retargeting kinematic animation is that constraints must be
designed manually and do not necessarily generalise to all models and motions.
2.2.2 Retargeting physically-based animation
Retargeting physically-based animation can in some ways be more straightforward
than retargeting kinematic animation since we do not need to explicitly design con-
straints to avoid artefacts like foot skating or floor penetration.
Peng et al. [34] demonstrate that DeepMimic is able to successfully train an Atlas
robot model to imitate human motion capture data, despite the differing physical shape
and mass distribution. They do not modify the motion capture clip at all, and train from
scratch from a randomly initialised policy. The reported training times for the Atlas
model are roughly the same as those for the human model (40-70 million samples,
depending on the motion clip). This takes 1-2 days even when using many parallel
cores. The authors did not explore techniques for speeding up the retargeting process.
Al Borno et al. [2] propose a novel method for performing physics-based motion
retargeting for different human body shapes. They use space-time optimisation to per-
form the retargeting [1]; the cost function encourages the physically-simulated char-
acter to follow a motion capture clip as closely as possible. The authors then use Co-
variance Matrix Adaptation Evolutionary Strategy (CMA-ES) [14], a sampling-based
optimisation method, to find the best control trajectories according to the cost func-
tion. In order to make a controller that is robust to external perturbations rather than
just tracking the motion clip, the authors build a random tree of Linear Quadratic Reg-
ulators (LQRs). The authors state that constructing the tree can take multiple days.
Although both of these methods were demonstrated to work, the issue lies in the
long training/optimisation process needed for each new model and motion clip. Our
work explores approaches for creating adaptive controllers that can quickly be applied
to modified models without extensive retraining, something that has not been addressed
for DeepMimic or other comparably-general physical motion synthesis methods.
2.3 Controller transfer in robotics
In robotics, the Reality Gap is the discrepancy between simulated training environ-
ments and the real world. This often results in controllers that were trained in simula-
tion not being able to perform in reality. Sim-to-real transfer is important since train-
ing robots in simulation is faster, cheaper and safer than training in reality. We look
at whether approaches from robotics can be used for obtaining adaptive controllers for
physical animation. To our knowledge, this has not been looked at yet in the literature.
2.3.1 Domain adaptation and transfer learning
Domain adaptation (also referred to as transfer learning) looks at how to transfer mod-
els between different conditions (i.e. domains). The motivation is to reduce the training
time needed in the target domain by bringing over knowledge from a source domain.
It is generally assumed that the different domains are related and thus representations
or behaviours learned in one domain will facilitate learning in the other domain.
There is much domain adaptation research for computer vision [31]. There has also
been work into adapting control policies in the RL community: approaches include
learning invariant features between the two domains [12], transferring training samples
[26], and fine-tuning neural networks that were pre-trained on the source domain [9,
36]. We focus on the final approach. Glatt et al. [9] show that initialising a network
with weights from a network trained on the source domain allows knowledge transfer
between Atari games, whilst Rusu et al. [36] propose adding new layers to a network
trained on the source domain and freezing the layers belonging to the source network.
2.3.2 Domain randomisation
The idea of domain randomisation is to randomly vary the simulation conditions dur-
ing training, such that the resulting controller is exposed to a range of conditions that
hopefully encompasses the conditions of the target domain (e.g. reality). Unlike do-
main adaptation, no training needs to be done in the target domain.
There are two different approaches: we can train robust policies that are able to
handle different environments without needing to identify them [35], or we can train
adaptive policies that first identify the environment (e.g. by performing calibrating ac-
tions) and then adjust themselves accordingly [48]. Since we will know the properties
of the character model we are retargeting to, the latter fits our problem better.
Domain randomisation has been successfully combined with RL algorithms to train
adaptive robotic controllers. For example, Yu et al. [49] use Trust Region Policy Op-
timisation (TRPO), a deep RL algorithm, and explicitly augment the state with the
domain parameters - we take the same approach in our experiments. They demonstrate
their algorithm on simulated continuous control problems such as a robot arm throwing
a ball whose mass varies. Heess et al. [16] look at training controllers for traversing
different terrains; they design curricula for training rather than entirely randomising.
2.3.3 Linear combination of policies
Another way of obtaining adaptive controllers is to combine the outputs of multiple
base controllers, where each base controller has been trained on a different domain.
Zhang et al. [50] look at obtaining policies for robotic arms, and they vary the
physical properties such as the length of the links. They propose Policy Self Evolu-
tion by Calibration (POSEC) for learning how to linearly combine base controllers for
different arm properties. The first step after obtaining the base controllers (trained us-
ing TRPO) is to use CMA-ES to find the optimal weights for a set of different arms.
These weights are then used to train a regression model to take in a feature vector
(representing the arm properties) and output the base controller combination weights.
When designing our experiments, we also took inspiration from graphics research
such as motion blending [23, 24] and facial animation using blend shapes [21]. Both
methods use linear combinations to produce new motions/facial expressions. In partic-
ular, the intuition provided by Joshi et al. [21] of placing blend shapes on the hypercube
formed by the parameters being varied is beneficial, and Kovar et al. [24] inspired the
idea of using K-Nearest Neighbours to compute the combination weights.
Chapter 3
DeepMimic
DeepMimic [34] is a state-of-the-art method for training physically-simulated virtual
characters to imitate motion capture or keyframe data, and it is the method our exper-
iments are built around. Given a character model and a reference motion, DeepMimic
uses deep RL to produce a physical controller for the character. We first formally in-
troduce RL and PPO, which is the particular algorithm used by DeepMimic. We then
explain how the animation problem is defined as an RL problem. We also describe the
character model and motion capture data, and finally we discuss the training process.
3.1 Reinforcement learning
Reinforcement Learning (RL) is a type of Machine Learning which is inspired by how
humans and animals learn through experimentation and interaction with the real world
[43]. It has been applied in many areas including robotics and computer animation.
In an RL problem, there is an agent which can take actions in an environment.
These actions affect the state of the agent and its environment, as specified by a transi-
tion function. We additionally specify a reward function that dictates the scalar reward
rt that the agent receives at each step t of its interaction with the environment. These
properties can be formalised as a Markov decision process. We define return as the
sum of rewards obtained in an episode of interaction. Expected return is denoted J.
The goal is for the agent to learn a policy π(a|s) (a mapping from the currently
observed state s to the action a to be performed) that maximises expected return. This
is done by allowing the agent to explore and gather experience (samples) in the envi-
ronment, and adjust its policy accordingly from the rewards it receives.
Many different RL algorithms exist, however we will focus on policy gradient
methods. These methods use a parametric representation of the policy πθ (e.g. a neural
network) and estimate the gradient of the expected return ∇θJ(θ) by collecting samples
in the environment. The sequence of states, actions and rewards observed in an episode
of interaction is called a trajectory.
We can also use a parametric function to represent the value function V(s_t), which
predicts the expected future return (value) if we are currently in state s_t. Actor-critic
methods are those that learn both the policy π (actor) and V (critic). The critic uses
samples gathered by the actor to improve its predictions to match the observed returns.
The return R_t obtained in a trajectory from timestep t onward is defined as:

R_t = Σ_{k=t}^{T} γ^k r_k    (3.1)

γ ∈ [0,1] denotes the discount factor - this controls how much emphasis the agent
should put on immediate rewards. T denotes the timestep at which the episode ends.
We can then define the advantage A_t, which measures the return we obtained from this
state onward compared to what the critic V predicts: A_t = R_t − V(s_t).
Schulman et al. [38] propose another way of computing A_t known as the Gener-
alised Advantage Estimator (GAE); it uses λ-returns [43] instead of the definition in
Equation 3.1, where λ is a hyperparameter ∈ [0,1] trading off the bias and variance in
the estimates (Equation 3.1 corresponds to setting λ = 1).
The Policy Gradient Theorem [43] defines how to make the gradient estimate:
∇_θ J(θ) = E_π[ A_t ∇_θ log π_θ(a_t|s_t) ]    (3.2)
Policy gradient algorithms alternate between sampling and optimising; we first gather
a batch of samples, then estimate the gradient from these samples and apply gradient
ascent to update θ such that J(θ) improves. To improve efficiency, we can use impor-
tance sampling to estimate the gradient using samples collected by a previous policy
rather than having to re-collect experience after every update. This means multiplying
by importance weights: ∇_θ J(θ) = E_π[ w_t(θ) A_t ∇_θ log π_θ(a_t|s_t) ], where w_t(θ) denotes
the importance weight π_θ(a_t|s_t) / π_θ_old(a_t|s_t) (so w_t(θ) = 1 means the policy hasn’t changed). This
can be interpreted as optimising the following objective function: L(θ) = E_t[ w_t(θ) A_t ].
3.1.1 Proximal policy optimisation
PPO is an example of an actor-critic policy gradient RL method. Its proposing paper
[39] presents it as a modification to TRPO [37] that is simpler to implement. The idea
of TRPO is to maintain a trust region within which policy updates are restricted i.e.
we prevent the policy from changing too much in a single optimisation step, hopefully
avoiding poor updates that we are unable to recover from. The difference between
policies is measured by Kullback-Leibler (KL) divergence. Thus, TRPO optimises
L(θ) subject to a hard constraint (δ is a hyperparameter defining the trust region):
max_θ L(θ) = E_t[ w_t(θ) A_t ]   subject to   E_t[ KL[π_θ_old(·|s_t), π_θ(·|s_t)] ] ≤ δ    (3.3)
The issue with TRPO is the difficulty in implementing the hard constraint as it re-
quires expensive second-order derivatives. PPO instead uses unconstrained first-order
optimisation and a clipped objective to de-incentivise large policy updates:
L^CLIP(θ) = E_t[ min( w_t(θ) A_t , clip(w_t(θ), 1−ε, 1+ε) A_t ) ]    (3.4)
where ε is the clipping factor. We provide some intuition: suppose a_t has A_t > 0,
so we wish to increase its probability. This means increasing w_t(θ), but clip stops us
increasing it beyond 1 + ε. The min allows us to ‘undo’ mistakes - suppose we reduced the
probability of a due to a noisy sample, so w_t(θ) became < 1. When we see a sample
where a has positive advantage, we will be able to increase w_t(θ) again, up to 1 + ε.
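As a concrete illustration, a minimal sketch of computing Equation 3.4 over a batch of samples is shown below; this is illustrative rather than the exact PyBullet implementation, and the log-probability and advantage arrays are assumed to be computed elsewhere.

```python
import numpy as np

def ppo_clipped_objective(logp_new, logp_old, advantages, epsilon=0.2):
    """Clipped surrogate objective L^CLIP (Equation 3.4), averaged over a batch.

    logp_new:   log pi_theta(a_t|s_t) under the current policy
    logp_old:   log pi_theta_old(a_t|s_t) under the policy that collected the data
    advantages: advantage estimates A_t (e.g. from GAE)
    """
    w = np.exp(logp_new - logp_old)                      # importance weights w_t(theta)
    w_clipped = np.clip(w, 1.0 - epsilon, 1.0 + epsilon)
    return np.mean(np.minimum(w * advantages, w_clipped * advantages))
```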
3.1.2 Algorithm and hyperparameters
For this project, we used the DeepMimic implementation provided by PyBullet [5].
The training algorithm alternates between gathering batches of data and making neu-
ral network updates. Data is gathered in episodes: each episode begins from some
state sss and we simulate until a fixed time horizon (20 seconds) or early termination
is triggered (a body part that is not a foot touches the floor). The starting states are
chosen according to reference state initialisation (RSI): rather than always starting at
the beginning of the motion, we sample from all the poses in the reference motion clip.
The actor and the critic each have their own network. To speed-up training, one
can have N agents collecting samples in parallel, sharing the same actor and critic
networks. Once a full batch of samples is obtained, we update the network parame-
ters by sampling minibatches from the data to estimate gradients. The actor computes
gradients of L(θ) (Equation 3.4); for the critic, we use the TD(λ) algorithm (gradient
descent to minimise difference from target values computed using λ-returns) [43].
We use the same architecture as Peng et al. [34]: two fully-connected layers with
1024 and 512 ReLU units. In the actor network, these layers are followed by a linear
output layer (same dimension as the action space). For the critic network, it is a single
linear unit (the value estimate). The actor network output is taken to be the mean µ(s)
of a Gaussian representing the policy: π(a|s) = N(µ(s), Σ). The covariance matrix Σ
is diagonal and fixed for all s, a; it is treated as a hyperparameter (exploration noise).
Hyperparameters The hyperparameter settings we use are as follows:
batch size: 4096 (no. of samples collected before performing policy updates)
minibatch size: 256 (no. of samples used for each gradient estimate)
λ: 0.95 (used in GAE(λ) and TD(λ) calculations)
discount factor γ: 0.95
clipping factor ε: 0.2
actor step-size: 2.5×10^−6 with momentum 0.9 (how far to move in each update)
critic step-size: 10^−3 with momentum 0.9
exploration noise: linearly interpolated from 1 to 0 over 40 million samples
3.2 Problem representation
State and action space The character is the agent, and its controller is the policy.
The state of the character at time t is denoted s_t, and it is a vector containing the global
co-ordinates of the root, the positions of each joint (relative to the character’s root), the
rotations of each joint (as quaternions), and the angular and linear velocities of each
joint. For the humanoid character, the state has 197 features. The action at time t is
denoted a_t. It is a vector containing target rotations for each actuable joint; these values
are fed into a proportional-derivative (PD) controller at each joint which computes the
forces to apply. For the humanoid character, the action space has 36 dimensions.
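To make the observation layout concrete, the following is a schematic sketch of assembling the state vector; the accessor names on the character object are placeholders, not the PyBullet API.

```python
import numpy as np

def build_state(character):
    """Assemble the observation described above: global root co-ordinates, then
    per-joint relative positions, rotations (quaternions) and velocities."""
    features = [character.root_position()]                   # global root co-ordinates
    for joint in character.joints():
        features.append(joint.position_relative_to_root())   # 3D position
        features.append(joint.rotation_quaternion())          # 4D rotation
        features.append(joint.linear_velocity())              # 3D linear velocity
        features.append(joint.angular_velocity())             # 3D angular velocity
    return np.concatenate(features)    # 197 features for the humanoid character
```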
Reference motion DeepMimic uses a reference motion specified as a sequence of
target poses {q_t}. Each q_t is a vector of all the reference motion’s joint rotations at
timestep t, and also the global co-ordinates of the reference motion’s root.
Reward function The reward function is designed to encourage the character to
mimic the reference motion. The reward at each timestep t is computed as follows:
r_t = ω^p r^p_t + ω^v r^v_t + ω^e r^e_t + ω^c r^c_t    (3.5)

r^p_t is a pose reward, computed from the difference between the joint rotations (quater-
nions) of the simulated character and the reference motion. r^v_t is a velocity reward,
computed from the difference between the simulated character’s joint velocities (an-
gular) and the reference motion’s. r^e_t is the end-effector reward, computed from the
difference in global positions of the simulated character and reference motion’s hands
and feet. Finally r^c_t is computed from the difference in global position of the simulated
character and reference motion’s centre of mass (in our implementation, we just use
the root which is the pelvis). The full definitions are in the Appendix (Section A).
The ω terms weight the different rewards. After some experimentation, we fixed
the values: ω^p = 0.5, ω^v = 0.05, ω^e = 0.15, ω^c = 0.3. These sum to 1, so the maximum
possible return in a 20s episode equals the number of timesteps (≈ 600). In practice,
we did not observe returns this high since the character eventually deviates from the
straight trajectory followed by the reference motion.
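As a sketch, the weighted reward of Equation 3.5 can be assembled from the individual terms (whose full definitions are in Appendix A) as follows:

```python
REWARD_WEIGHTS = {"pose": 0.5, "velocity": 0.05, "end_effector": 0.15, "com": 0.3}

def imitation_reward(r_pose, r_velocity, r_end_effector, r_com):
    """Weighted imitation reward r_t (Equation 3.5); the weights sum to 1."""
    return (REWARD_WEIGHTS["pose"] * r_pose
            + REWARD_WEIGHTS["velocity"] * r_velocity
            + REWARD_WEIGHTS["end_effector"] * r_end_effector
            + REWARD_WEIGHTS["com"] * r_com)
```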
3.3 Physics simulation
The character is represented as a skeleton of rigid parts connected by joints. Most of
these joints are actuable (can be controlled). These joints are driven by PD controllers
which essentially abstract away the low-level control and allow us to specify actions
as target angles rather than forces. Forces are computed as
τ_t = −k_p(θ_t − θ̂_t) − k_d θ̇_t    (3.6)

where t is the timestep, τ_t is the resulting control force for the joint, θ_t and θ̇_t are
the current joint angle and velocity, θ̂_t is the target joint angle, and k_p and k_d are the
proportional and derivative gains (parameters that control how quickly to reach θ̂_t).
The physics simulation is carried out using PyBullet [5] at 240Hz. The character’s
action is updated at 30Hz (i.e. how often we query the policy).
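A minimal sketch of the per-joint PD computation in Equation 3.6 is given below; the variable names are illustrative, with the target angle supplied by the policy at 30Hz while the PD control runs at 240Hz.

```python
def pd_torque(theta, theta_dot, theta_target, kp, kd):
    """PD control torque for one joint (Equation 3.6)."""
    return -kp * (theta - theta_target) - kd * theta_dot
```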
3.4 Character and reference data
In this dissertation, we focus on animating humans to run. We use a running clip
from the freely-available CMU human motion capture dataset (http://mocap.cs.
cmu.edu/), which is also what Peng et al. [34] used. The humanoid model for this
data, shown in Figure 3.1, has 43 degrees of freedom (DOF). 36 of these are actuable,
corresponding to the action space. By default it is 1.5 metres tall and weighs 45 kg.
The motion capture clip is encoded as a series of frames, where each frame contains
the global co-ordinates of the root and the rotations of each joint. One can playback
Figure 3.1: The humanoid model for different body parameter vectors ρ. The left image
shows different scales (1.0, 1.2, 1.5). The central image shows an added arm length of
1.0 (80% increase). The right image shows an added leg length of 1.0 (40% increase).
the motion by using forward kinematics to compute the pose at each frame, and then
interpolating between frames (see our Supplementary Video, linked in Section 1.4).
3.4.1 Modifying body parameters
We treat the humanoid model on the very left of Figure 3.1 as our default model. In
this dissertation, we consider the following adjustments (all pictured in Figure 3.1):
• Modifying the overall scale of the model: this means uniformly scaling the di-
mensions and mass of all the body parts. The default scale is defined as 1.
• Extending the length of the model’s arms: this means adding an equal length
to the upper and lower arm segments, and also increasing their mass such that
the density does not change. We vary the increase between 0% and 80% - 80%
corresponds to an added length of 1.
• Extending the length of the model’s legs: this means adding an equal length to
the upper and lower leg segments, and again keeping density constant. We vary
the increase between 0% and 40% - 40% corresponds to an added length of 1.
The modified humanoid models that we use can all be characterised by a vector of
body parameters which we denote as ρ = [scale, added arm length, added leg length].
The default ρ is thus [s = 1, a = 0, l = 0]; we will use this notation throughout.
We note that a controller trained on the default model produces unsteady, unnatural-
looking motion when used with scale > 1.1, added arm length > 0.1 or added leg
length > 0.1. If values are increased further, the model will eventually fall over.
3.4.2 Retargeting motion capture data
One advantage of physical motion synthesis is that it is generally less reliant than
kinematic approaches on using manually-engineered heuristics to ensure e.g. feet do
not penetrate the floor. However, when training a modified model, unlike Peng et al.
[34] who make no modifications, we apply adjustments to the motion capture data
(which is for the default model) to ensure that accelerations stay constant. This is
explained in [17], which looks at the adjustments that need to be made to motion data
to account for modifying a model’s overall scale. By looking at units, it can be derived
that if a model is scaled by factor L, then time should scale as L^(1/2) and displacement
should scale as L. Thus, when training a model with scale L, we multiply the length
of each frame in the reference motion clip by L^(1/2), and the global co-ordinates of the
model’s root in each frame by L.
These scaling rules assume all parts of the model are scaled uniformly, however
we wish to vary limb length independently. After some experimentation, we decided
not to modify the motion capture data when modifying arm length, but it is essential to
modify the data when modifying leg length otherwise the model will penetrate (or
levitate above) the floor. We therefore compute the effective scale factor as: L =
modified leg length / default leg length (i.e. as if the whole model had been scaled).
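A sketch of these adjustments is shown below, assuming for illustration that the clip is stored as per-frame durations and global root positions (a simplification of the actual motion file format).

```python
import numpy as np

def rescale_reference_clip(frame_durations, root_positions, L):
    """Adjust the reference clip for a model with effective scale factor L:
    time scales as L^(1/2) and root displacement scales as L (Section 3.4.2).
    When only leg length changes, L = modified leg length / default leg length."""
    new_durations = np.asarray(frame_durations) * np.sqrt(L)
    new_root_positions = np.asarray(root_positions) * L
    return new_durations, new_root_positions
```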
3.5 Training
During the training process, we monitor two performance metrics: training return and
test return. Training return is the average episode return obtained over the parallel
workers during an iteration (i.e. after collecting a batch of samples). After every 400
iterations, we pause training and run 32 test episodes. For these test episodes, we
disable RSI such that each episode always begins at the start of the motion clip. The
test return is the average episode return obtained in these test episodes.
When reporting returns to compare trained controllers, we also disable RSI as well
as setting exploration noise to 0 (meaning the episode proceeds deterministically).
Episode return is a quantitative way of measuring how closely the controller is fol-
lowing the reference motion, which is an indicator of how realistic the motion is.
We train using 12-16 cores (each running an agent), and stop when the training
return has stopped noticeably increasing and we have collected more than 40 million samples.
Chapter 4
Combination of Controllers
This chapter investigates linearly combining separately-trained base policies (con-
trollers) to produce a novel policy fitted to the desired body parameters ρ. The idea is
inspired by Zhang et al. [50] who were addressing the Reality Gap, and by graphics
research such as facial blendshapes [21] and motion blending [23] (Section 2.3.3). We
first describe the base policies used, before describing the method for combining poli-
cies. We explain using black-box optimisation on combination weights, which we also
compare against linear interpolation. We conclude by looking at using model fitting
methods on a training set to make weight predictions across the whole body parameter
space. We include the experimental results that motivated our design choices.
4.1 Base policies
We began by training a policy on the default ρ [s = 1, a = 0, l = 0]. We then used
transfer learning (discussed next) to obtain policies for the following ρ’s: [s = 1.5, a =
0, l = 0] (large), [s = 1, a = 1, l = 0] (long arms), [s = 1, a = 0, l = 1] (long legs).
Each one represents the extreme of one dimension of ρ.
4.1.1 Transfer learning
As discussed in Section 2.3.1, it is possible to apply transfer learning ideas to RL in
order to speed-up training times on difficult tasks by bringing over knowledge from
a source task. One way to transfer knowledge to a target task is to carry over all the
weights from a network trained on the source task i.e. rather than randomly initialising
the actor and critic network weights when we begin training on the target task, we
initialise them to match the networks trained on the source task [9]. We investigated
whether this could be used to transfer a policy trained on the default body parameters
to a different ρ. We do not alter any of the hyperparameters, and the exploration noise
is linearly decreased in the same way. We experimented with two different target ρ’s:
[s = 1.5, a = 0, l = 0] and [s = 1, a = 1, l = 1].
[Plots: return against samples (×10^7); legend: Random Initialisation, Transfer From Default, Transfer From Long Arms, Transfer From Long Legs.]
Figure 4.1: Training graphs comparing policies trained using transfer against policies
trained from random initialisation. The left graph is for [s = 1.5,a = 0, l = 0], the right
is for [s = 1,a = 1, l = 1]. The solid lines show training return whilst the dashed lines
show test return. (To reduce clutter, we only show training return on the right)
The training curves are compared in Figure 4.1. It can be seen in both graphs that
the transferred policies’ returns increase more quickly than the randomly initialised
policies’ at the start of training, achieving similar performance at 2×10^7 samples as
the randomly-initialised policies achieve at 4×10^7 samples. Furthermore, we observed
that combining policies trained using transfer could sometimes achieve higher returns
than policies trained from scratch - this is discussed in the Appendix (Section C.1). We
therefore use these base policies trained using transfer for the rest of the chapter.
4.2 Combining policies
As explained in Section 3.1, a policy π(a|s) maps the input state s to the probability
of taking action a. Suppose we have a set of base policies π_0, ..., π_n that we wish to
linearly combine into a single policy π. We do this by passing s through each base
[Plots: three panels varying Added Arm Length, Added Leg Length and Scale on the x-axes; each shows episode return (left axis) and weight w (right axis).]
Figure 4.2: Comparison between using linear interpolation to compute the weight w and
using brute-force (at resolution 0.1). The interpolation results are shown in blue and the
brute-force results are shown in red. The solid lines show episode return (left axis) and
the dashed lines show w (right axis).
policy, obtaining actions a_0, ..., a_n, and then linearly combining these actions to obtain
our final action a. We have one combination weight w_i for each base policy.

π(a|s) = w_0 π_0(a|s) + ... + w_n π_n(a|s),   where Σ_i w_i = 1 and ∀i. w_i ∈ [0,1]    (4.1)
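In practice we combine the base policies’ deterministic mean actions, as described above; a minimal sketch is given below, treating each base policy as a callable from state to mean action and assuming the weights already satisfy the constraints in Equation 4.1.

```python
import numpy as np

def combined_action(state, base_policies, weights):
    """Blend base-controller outputs into a single action (Equation 4.1)."""
    actions = np.stack([policy(state) for policy in base_policies])  # one row per base policy
    return np.asarray(weights) @ actions      # weighted sum of target joint rotations
```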
4.3 Interpolating two policies
Suppose we have two base policies π_0, π_1 which were trained for body parameters
ρ_0, ρ_1. The equation for interpolating between these two policies is as follows:

π(a|s) = (1 − w) π_0(a|s) + w π_1(a|s)   (w ∈ [0,1])    (4.2)
Since we are assuming the weights for both policies should add up to one, we only
need to explicitly calculate one weight w. Intuitively, by increasing w from 0 to 1, we
should be able to obtain policies for all ρ lying along the line from ρ_0 to ρ_1 in body
parameter space (i.e. for all ρ’s where ∃c ∈ [0,1]. ρ = ρ_0 + c(ρ_1 − ρ_0)).
Clearly, a simple approach for computing w for such a ρ would be to compute
how far along the line ρ lies, i.e. computing the value of c for that ρ:

c = (ρ − ρ_0) / (ρ_1 − ρ_0)    (4.3)
However, this is making the assumption that there is an underlying linear relationship.
Since the search space for w is one-dimensional, it is possible to simply exhaus-
tively evaluate the resulting policy for all values between 0 and 1 at some resolution. In
order to assess whether a simple linear interpolation calculation (as in Equation 4.3) is
sufficient for finding the optimal w, we performed this exhaustive search at a resolution
of 0.1 for several pairs of base policies. The results are shown in Figure 4.2.
It can be seen that linear interpolation does not always find the best weight; the per-
formance of linear interpolation for changing the arm length is especially poor com-
pared to brute-force. Interestingly, the trend of best weights found by brute-force is
not monotonically increasing when changing leg length and scale, and furthermore the
best weight never reaches 1.0 in any of the graphs. This could be explained by the fact
that PPO is not guaranteed to converge to a globally optimal policy.
Nevertheless, it seems possible to interpolate between two policies in order to cover
the range of ρ’s lying along the line connecting them in body parameter space. We now
look at extending this idea to multiple base policies.
4.4 Linearly combining multiple policies
The idea can be extended to more than two base policies. The hope is that this in turn
enables multiple dimensions of the body parameters ρ to be successfully varied by
having a base policy for each dimension. This approach is inspired by research into
using blend shapes for facial animation [21]. The difficulty this introduces is that the
weight search space becomes multidimensional; e.g. in order to combine three base
policies, we require two combination weights (assuming the weights sum to one). For
multidimensional search spaces, brute-force is no longer sensible.
In this section we first look at using black-box optimisation methods for finding the
best weights. We then discuss the need for corrective base policies, before looking at
whether bilinear interpolation is effective for finding good weights. Finally, we look
at whether it is possible to apply model fitting methods such as K-Nearest Neighbours
to cover the space of ρ. We also explored the idea of segmenting the action vector and
using multiple weights per base policy to provide finer control, but since this did not
have strong results we discuss it in the Appendix (Section C.2).
4.4.1 Black-box optimisation
In this section we present the problem of finding the best combination weights for a
given ρ as an optimisation problem. The function we wish to maximise takes a vector
w and outputs the return achieved by running an episode (using ρ) with that w.
The optimisation of functions is an extensively studied area, however many meth-
ods assume access to derivative information, cannot handle noise, or require many
function evaluations, making them unsuitable for our problem of finding combination
weights (where a function evaluation means simulating a full episode, taking around 30
seconds). We therefore look at methods designed for performing global optimisation
on expensive black-box functions - in particular, Bayesian optimisation and CMA-ES.
We first explain the two methods, and then present our empirical comparison.
4.4.1.1 Bayesian optimisation
Bayesian optimisation uses evaluations of the function being optimised to build an ap-
proximation to it, which then guides the search [7]. It is often used for hyperparameter
tuning [41]. We use the Scikit-Optimize implementation [25].
Bayesian optimisation begins with a prior probability distribution representing our
beliefs about the possible functions. After each evaluation of the black-box function,
these beliefs can be updated to form a posterior distribution. The beliefs can be repre-
sented as a Gaussian Process (GP). GPs model the function as a multivariate Gaus-
sian, and use a kernel function to compute the covariance between function values (we
use the Matern kernel). This approach can also account for noise in the observations
- we can manually set the noise amount, or the value can be learned such that the
observations are best explained (this option is indicated in Figure 4.3 as ‘Gaussian’).
Bayesian optimisation uses an acquisition function to decide where to next evalu-
ate the black-box function, trading-off between exploring high-uncertainty regions and
exploiting the best region found so far. We tried the following acquisition functions:
Upper Confidence Bound: UCB(w) = µ(w) + κσ(w) (κ is a hyperparameter,
where a higher value favours exploration over exploitation)
Expected Improvement: EI(w) = E[ f(w) − f(w_best) ]
Probability of Improvement: PI(w) = P( f(w) ≥ f(w_best) + κ )
Hedge: stochastically selects one of the above three acquisition functions
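As an illustration of how the weight fitting can be set up with Scikit-Optimize’s gp_minimize, a sketch is shown below; run_episode is a placeholder for our simulation (returning the episode return for given weights and body parameters), and for simplicity the last weight is taken as one minus the sum of the others (combinations with a negative remainder are not handled here).

```python
from skopt import gp_minimize

def fit_weights(run_episode, rho, n_base_policies, n_calls=40):
    """Fit combination weights for body parameters rho via Bayesian optimisation."""
    def objective(w):
        w_full = list(w) + [1.0 - sum(w)]      # remaining weight so they sum to one
        return -run_episode(w_full, rho)        # gp_minimize minimises, so negate return

    bounds = [(0.0, 1.0)] * (n_base_policies - 1)
    result = gp_minimize(objective, bounds, n_calls=n_calls,
                         acq_func="LCB", kappa=10, noise=1.0)
    return list(result.x) + [1.0 - sum(result.x)]
```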
4.4.1.2 CMA-ES
Covariance Matrix Adaptation Evolutionary Strategy (CMA-ES) is a stochastic method
for optimisation [14], and it is the method used by Zhang et al. [50] to fit combination
weights. We use the PyCMA implementation [15].
Evolutionary strategies (ES’s) operate by randomly generating populations of solu-
tions (e.g. by sampling from a multivariate Gaussian), and applying ‘natural selection’
such that the best solutions survive into the next generation. CMA-ES adaptively ad-
justs the covariance matrix used to generate the next generation according to the results
of the previous generation, in order to hopefully move towards the optimum.
There are various hyperparameters; we investigate the following:
initial standard deviation: how the first population is sampled
damping: dampens how quickly the covariance changes between populations
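The equivalent search with PyCMA’s ask/tell interface might look as follows; run_episode is again a placeholder, and only the initial standard deviation is exposed (the damping setting is omitted from this sketch).

```python
import cma

def fit_weights_cmaes(run_episode, rho, n_base_policies, initial_std=1.0, generations=8):
    """Search the first n-1 combination weights with CMA-ES; the last weight
    is taken as one minus the sum of the others."""
    x0 = [1.0 / n_base_policies] * (n_base_policies - 1)
    es = cma.CMAEvolutionStrategy(x0, initial_std, {"bounds": [0.0, 1.0]})
    for _ in range(generations):
        candidates = es.ask()
        # CMA-ES minimises, so pass the negated episode returns as fitness values
        fitness = [-run_episode(list(w) + [1.0 - sum(w)], rho) for w in candidates]
        es.tell(candidates, fitness)
    xbest = es.result.xbest
    return list(xbest) + [1.0 - sum(xbest)]
```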
4.4.1.3 Comparison
We ran experiments to empirically compare both methods and a variety of hyperpa-
rameter settings. For these experiments we varied arm length and leg length, and used
three base policies: default [s = 1, a = 0, l = 0], long arms [s = 1, a = 1, l = 0], and long
legs [s = 1, a = 0, l = 1]. We selected 9 different ρ values and first used brute-force to
find the optimum return for each. The brute-force search was performed at a resolution
of 0.1, meaning 11^2 = 121 evaluations since there are two weights which can each take
11 values. This took around an hour for each ρρρ.
We then applied each black-box optimisation method on the same 9 ρ’s, allowing
40 function evaluations for each problem. On a machine with a 2.3GHz processor, 40
evaluations takes around 15 minutes. Due to the stochasticity involved in each method,
we solve each ρ 3 times. We compare the methods in terms of how far the returns lie
from the optimal solution found by brute-force. We report the decrease, so smaller
values are better - indeed, sometimes the optimisation methods can improve on the
return found by brute-force (in which case the decrease becomes negative) since they
are not constrained to a 0.1 resolution. The results are shown in Figure 4.3.
CMA-ES was able to achieve better (i.e. lower) means than Bayesian optimisation
but demonstrated higher variability across the 9 problems. For the rest of this chapter,
we use Bayesian optimisation with LCB as the acquisition function, κ = 10 and noise
= 1 (the lowermost bar in Figure 4.3) as it appears to provide the best balance between
a low mean and a small standard deviation. We use 40 function evaluations.
[Bar chart: x-axis is the decrease from the optimum found using brute-force. Bayesian optimisation settings: UCB/EI/PI/Hedge with kappa ∈ {2, 10} and noise ∈ {1, gaussian}; CMA-ES settings: initial std ∈ {1, 3, 5} with damping ∈ {0.5, 1}.]
Figure 4.3: Comparison of different black-box optimisation methods. Bayesian optimi-
sation using GPs is shown in red, CMA-ES is shown in blue. The different hyperpa-
rameters are shown on the y-axis. The x-axis displays the mean decrease from the
optimum (a small decrease is better). The means and standard errors are computed
over 9 different problems that were each solved 3 times from different random seeds.
4.4.2 Corrective base policies
It may be the case that having one base policy for each dimension of body parameter
space is insufficient. When blendshapes are used in facial animation, it is common to
use corrective blendshapes (also known as combination blendshapes) which become
active according to the extent that certain combinations of other blendshape weights are
active [27]. This idea may be important for retargeting too; for example, lengthening
both the arms and legs may require policy adjustments that cannot be obtained by
combining policies trained on long arms and long legs separately.
We can visualise the problem as a d-dimensional hypercube, where d is the dimen-
sion of ρ, and each vertex represents a base policy [21]. For linear interpolation this
is a line, for bilinear interpolation it is a plane, etc. The number of vertices increases
as 2^d, so it is an important question whether we require a policy at each vertex.
To investigate this, we ran experiments that compare the performance obtainable
with and without corrective base policies. The first experiment looks at varying scale
and arm length (base policies: [s = 1,a = 0, l = 0], [s = 1,a = 1, l = 0], [s = 1.5,a =
0, l = 0], corrective: [s = 1.5,a = 1, l = 0]). The second looks at varying arm and leg
length (base policies: [s = 1,a = 0, l = 0], [s = 1,a = 1, l = 0], [s = 1,a = 0, l = 1],
corrective: [s = 1,a = 1, l = 1]). The results are shown in Figure 4.4.
In both experiments, including a corrective base policy achieves much better return
when both of the dimensions being varied are large, indicating that corrective base
policies are indeed needed. In theory, having a corrective base policy should guarantee
returns at least as large as not having one (we can simply set the corrective base policy’s
weight to zero); however, since Bayesian search is not guaranteed to find the global
optimum, performance is sometimes worse when there is a corrective base policy. To
avoid this issue, it may be possible to design heuristics for when to constrain corrective
base policy weights to be zero, or alternatively to force the Bayesian search to first
explore the weight subspace where corrective base policy weights are zero.
The implication from these experiments, that corrective base policies are needed,
unfortunately suggests that combining base policies may not scale well; as the num-
ber of dimensions of ρρρ increases, the number of base policies needed grows expo-
nentially. Since the training of base policies only needs to happen once (for a single
reference motion), this may not be too serious. However, it may be vital to design
good heuristics for weight optimisation as the search space dimensions grow.
scale\arm 0.0 0.5 1.0
1.00 a b c
1.08 d e f
1.16 g h i
1.24 j k l
1.32 m n o
1.40 p q r
1.50 s t u
arm\leg 0.0 0.5 1.0
0.0 a b c
0.5 d e f
1.0 g h i
Figure 4.4: Episode returns obtained when using a corrective base policy compared
against not using one. The ρρρ values for the experiments are shown in the tables on the
right. Since Bayesian search involves randomness, weights were fitted three times for
each ρρρ; we show mean and standard error.
4.4.3 Bilinear interpolation
Suppose we have a base policy at every corner of the hypercube formed by the dimen-
sions of ρρρ. It is then possible to apply standard interpolation formulae to compute
the combination weights by making the assumption that, for each vertex of the hy-
percube, the best solution is given by weighting the corresponding base policy by 1.
Linear interpolation can be extended to multiple dimensions; in two dimensions it is
known as bilinear interpolation. We experimented with using bilinear interpolation in
order to compute the combination weights when we have four base policies: default,
long arms, long legs, and a corrective policy with long arms and legs.
Bilinear interpolation assumes we know the values of $f$ at 4 datapoints: $Q_{00} = (x_0, y_0)$, $Q_{01} = (x_0, y_1)$, $Q_{10} = (x_1, y_0)$ and $Q_{11} = (x_1, y_1)$. Given a query point $(x, y)$, we estimate $f(x, y) \approx a_0 + a_1 x + a_2 y + a_3 x y$. The coefficients $a_i$ are found by solving a linear system such that inputting $Q_{00}, \dots, Q_{11}$ returns the correct values.
In our case, $Q_{00}, \dots, Q_{11}$ are 4 datapoints in body parameter space. $\boldsymbol{f}$ is a function returning the combination weights (only 3 since we assume the 4 weights sum to one):
$$\boldsymbol{f}(Q_{00}) = \boldsymbol{f}([s = 1, a = 0, l = 0]) = [1, 0, 0],$$
$$\boldsymbol{f}(Q_{01}) = \boldsymbol{f}([s = 1, a = 1, l = 0]) = [0, 1, 0],$$
$$\boldsymbol{f}(Q_{10}) = \boldsymbol{f}([s = 1, a = 0, l = 1]) = [0, 0, 1],$$
$$\boldsymbol{f}(Q_{11}) = \boldsymbol{f}([s = 1, a = 1, l = 1]) = [0, 0, 0].$$
We perform bilinear interpolation independently for each of the 3 combination weights, i.e. we have three functions $f_0(\boldsymbol{\rho}) = w_0$, $f_1(\boldsymbol{\rho}) = w_1$ and $f_2(\boldsymbol{\rho}) = w_2$ such that $\boldsymbol{f}(\boldsymbol{\rho}) = [w_0, w_1, w_2]$.
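For this particular corner assignment the bilinear formulae reduce to a simple closed form; the sketch below (our own illustration, with arm and leg denoting the added arm and leg lengths in [0, 1]) evaluates it directly.

import numpy as np

def bilinear_weights(arm, leg):
    # Weights for [default, long arms, long legs]; the corrective policy receives the remainder.
    w_default = (1 - arm) * (1 - leg)
    w_long_arms = arm * (1 - leg)
    w_long_legs = (1 - arm) * leg
    w_corrective = 1.0 - (w_default + w_long_arms + w_long_legs)   # equals arm * leg
    return np.array([w_default, w_long_arms, w_long_legs, w_corrective])

# e.g. bilinear_weights(0.5, 0.25) -> [0.375, 0.375, 0.125, 0.125]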
We compare the results against what is found by Bayesian optimisation (we use
the hyperparameter settings found in Section 4.4.1 and solve each ρρρ three times). Fig-
ure 4.5 shows the results. As was observed in the 1D case (Section 4.3), Bayesian
optimisation almost always outperforms the solution found by bilinear interpolation,
indicating that the underlying relationship between optimal combination weights and
ρρρ values is more complex. We now see if this can be learned using model fitting.
4.5 Model fitting
Computing the weights for a given ρρρ is expensive, taking around 15 minutes if 40 iter-
ations of Bayesian optimisation are used, and this may not be sufficient for finding the
arm\leg 0.00 0.25 0.50 0.75 1.00
0.00 a b c d e
0.25 f g h i j
0.50 k l m n o
0.75 p q r s t
1.00 u v w x y
Figure 4.5: Comparison of bilinear interpolation against Bayesian optimisation on 25
ρρρ’s (shown in table). Four base policies were used. For each ρρρ we applied Bayesian
optimisation 3 times: we show the mean and standard error.
best solution as the number of base policies increases. Since we want an approach
that can obtain policies for any ρρρ quickly, such that it could be incorporated into an
interactive tool for animators, we decided to investigate whether model fitting methods such
as interpolation and K-Nearest Neighbours (KNN) could be applied to learning the
best weights for every ρρρ. This is inspired by Zhang et al.’s work in [50] (where regres-
sion is used to predict base policy weights) and by motion blending research (Kovar et
al. [24] use KNN to predict blending weights).
The high-level steps are as follows:
1. Construct a training set {(ρρρ1,www1), ...,(ρρρM,wwwM)} of M samples. This is done by
using Bayesian optimisation to compute the best interpolation weights wwwi for
each ρρρi, i ∈ [1,M]. This happens offline and only needs to occur once, so many
iterations of optimisation can be used if desired.
2. This training set can be fed into a model fitting method such as KNN or regres-
sion to learn how to predict the best weights for any given ρρρ (in the space covered
by the training set).
We experimented with model fitting when varying added arm and leg length be-
tween 0 and 1. We used four base policies: default, long arms, long legs, and long
arms and legs. We used Bayesian optimisation with 80 function evaluations and the
best settings from Section 4.4.1. We constructed a training set PPPtr of size 36, where:
$$\boldsymbol{P}_{tr} = \{0.0, 0.2, \dots, 1.0\} \times \{0.0, 0.2, \dots, 1.0\}, \tag{4.4}$$
where $\times$ denotes taking the Cartesian product of two sets.
We then constructed a validation set PPPv (size 10) and test set PPPte (size 20) by sam-
pling each element uniformly from [0, 1]. We note that these set sizes are
relatively small, however the domain of ρρρ is also fairly limited. Zhang et al. [50]
similarly only use small sets - their training and test sets are 20 samples each.
Metrics Most model fitting tasks use a cost function such as squared distance from
the target value to measure performance. However, for the task at hand, we are more
interested in finding weights that give high episode returns than necessarily matching
the fitted weights in PPPv and PPPte as closely as possible. Therefore, in this section, we re-
port two performance metrics: the Euclidean distance between the vector of predicted
weights and the vector of fitted weights (we refer to this as the weight distance), and
the decrease in return when using the predicted weights compared to when using the
fitted weights. For both metrics, a smaller value is better.
4.5.1 Interpolation
In earlier sections we looked at using interpolation to find combination weights by
making assumptions that each base policy corresponded to the best solution for one
corner of the body parameter hypercube. We now instead look at whether interpolation
can be used to essentially “fill in the spaces” between the ρρρ’s in a training set. We only
vary two dimensions of ρρρ (arm and leg length) so this is a 2D interpolation problem.
We try both linear interpolation and cubic interpolation. We use the griddata func-
tion provided by SciPy [20]. Linear interpolation works by triangulating the input
data (they use Delaunay triangulation [40]), and then performing linear interpolation
on each triangle. Cubic interpolation also involves triangulating the data, and then
constructing a cubic Bézier polynomial on each triangle using the Clough-Tocher scheme [3].
As in Section 4.4.3, we perform interpolation on each combination weight inde-
pendently. We explicitly fit 3 weights (clipping predictions to lie between 0 and 1)
and compute the final weight using the assumption they should all sum to one. If
$w_0 + w_1 + w_2 \geq 1$, we set $w_3 = 0$ and normalise: $w_i = \frac{w_i}{w_0 + w_1 + w_2}$ for $i = 0, 1, 2$.
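A minimal sketch of this prediction step, assuming the fitted training data is available as NumPy arrays (train_rhos of shape (M, 2) holding [arm, leg] values and train_weights of shape (M, 3) holding the fitted weights for the first three base policies):

import numpy as np
from scipy.interpolate import griddata

def predict_weights(train_rhos, train_weights, query_rho, method="cubic"):
    # Interpolate each of the 3 fitted weights independently at the query point
    query = np.asarray(query_rho).reshape(1, -1)
    w = np.array([
        griddata(train_rhos, train_weights[:, k], query, method=method)[0]
        for k in range(3)
    ])
    w = np.clip(w, 0.0, 1.0)              # clip predictions to [0, 1]
    if w.sum() >= 1.0:                    # set w3 = 0 and renormalise
        return np.concatenate([w / w.sum(), [0.0]])
    return np.concatenate([w, [1.0 - w.sum()]])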
The validation set results are shown in Figure 4.6. Although both do similarly in
terms of weight distance, cubic interpolation significantly outperforms linear interpo-
lation in terms of return so it is the better method for our purposes.
Figure 4.6: Validation set results of linear and cubic interpolation.
4.5.2 K-Nearest Neighbours
KNN is a Machine Learning algorithm that can be used for both classification and
regression [6]. We use the Scikit-Learn implementation [32].
The idea behind KNN is simple: in order to predict the combination weights www for
a query point ρρρ, we find the K training samples that are closest to ρρρ in body parameter
space. We then use these K neighbours {(ρρρ1,www1), ...,(ρρρK,wwwK)} to make our prediction
www. We try two different neighbour-weighting methods:
• uniform weights: $\boldsymbol{w} = \frac{1}{K}\sum_{i=1}^{K} \boldsymbol{w}_i$ (the same as computing the mean of the neighbours)
• distance-based weights: $\boldsymbol{w} = \sum_{i=1}^{K} a_i \boldsymbol{w}_i$, where $a_i = \frac{1/d_i}{\sum_{j=1}^{K} 1/d_j}$ and $d_i$ is the Euclidean distance from $\boldsymbol{\rho}$ to $\boldsymbol{\rho}_i$ (the closer the neighbour, the higher its weighting)
The only parameter to be set is K, the number of neighbours to use for each pre-
diction. We try values from 2 to 6; the validation set results are shown in Figure 4.7.
The performance in terms of return appears to depend more on K than the weighting
scheme. K = 2 with distance-based weights performed best.
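A small sketch of the K = 2, distance-weighted model with scikit-learn is given below; the arrays shown are illustrative placeholders, not our actual training set.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Illustrative placeholder data: (arm, leg) values and their fitted combination weights
train_rhos = np.array([[0.0, 0.0], [0.0, 0.5], [0.5, 0.0], [1.0, 1.0]])
train_weights = np.array([[1.0, 0.0, 0.0], [0.6, 0.0, 0.4], [0.6, 0.4, 0.0], [0.0, 0.0, 0.0]])

knn = KNeighborsRegressor(n_neighbors=2, weights="distance")   # or weights="uniform"
knn.fit(train_rhos, train_weights)                             # multi-output regression

query = np.array([[0.3, 0.7]])                  # a new rho = [arm, leg]
w = np.clip(knn.predict(query)[0], 0.0, 1.0)    # predicted weights for the first 3 base policies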
4.5.3 Comparison
We evaluated the best setting from each of the two methods on the test set; the results
are shown in Table 4.1. On the full training set, cubic interpolation performed surpris-
ingly well, achieving a mean decrease in return of -11: in other words, on average it
Figure 4.7: Validation set results of different KNN models. K indicates the number of
neighbours used for each prediction. The non-hatched bars (left) weight all neighbours
uniformly; the hatched bars (right) weight neighbours according to distance.
                          Decrease in Return          Weight Distance
                          Full (36)    Reduced (9)    Full (36)      Reduced (9)
KNN (K=2, Weighted)       1 ± 2        15 ± 5         0.26 ± 0.04    0.22 ± 0.03
Interpolation (Cubic)     -11 ± 13     10 ± 3         0.27 ± 0.05    0.24 ± 0.02
Table 4.1: Test set results of the two models that performed best on the validation set.
actually improved on the returns achieved when using Bayesian optimisation to fit the
test set. KNN also performed well, with a mean decrease of only 1.
Due to the good performance, we also decided to look at the performance of these
models when trained on a reduced training set that contains only 9 samples instead
of 36: PPP′tr = {0.0,0.5,1.0}×{0.0,0.5,1.0}. We evaluate them on the same test set.
As expected, the performances in terms of return degrade, but interpolation still out-
performs KNN. The mean weight distances are lower on the reduced set than the full
set, indicating that reducing weight distance does not guarantee the best returns.
Overall, cubic interpolation was the best method tried. Neither method requires
training, however, so in practice it would be straightforward to try multiple methods. A
caveat is that the method used for cubic interpolation only applies to 2D problems (i.e.
only varying 2 elements of ρρρ). Future work would be to look at a multidimensional
extension to cubic spline interpolation [13] and compare this to KNN. Other model
fitting methods that have been used in facial animation and motion blending (and can
extend beyond 2 dimensions) include Radial Basis Functions and neural networks [23].
Chapter 5
Domain Randomisation
In this chapter, we investigate whether domain randomisation can be used to train
adaptive DeepMimic controllers. It is an approach used in robotics for tackling the
Reality Gap [48], as discussed in Section 2.3.2.
The method works as follows. We use randomly-initialised neural networks for the
actor and critic. To enable domain adaptation, the neural networks’ inputs now include
the body parameters ρρρ as well as the character state ssst . This means that the policy
will change if ρρρ changes - it adapts. During training, ρρρ is changed at the start of every
episode, and the properties of the simulated character model are changed accordingly.
In our implementation, each element of ρρρ is sampled uniformly at random from a
continuous interval. This corresponds to the approach in [49], although Yu et al. use
TRPO rather than PPO. No other adjustments to the PPO algorithm were made.
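A schematic sketch of this episode loop is shown below; the environment helpers (rebuild_character, reset, step) are hypothetical names standing in for our PyBullet wrapper, and the collected transitions are consumed by the usual PPO update.

import numpy as np

RHO_RANGES = [(1.0, 1.5)]   # e.g. only the overall scale is randomised here

def collect_episode(env, policy):
    # Sample a new rho at the start of every episode and rebuild the simulated model
    rho = np.array([np.random.uniform(lo, hi) for lo, hi in RHO_RANGES])
    env.rebuild_character(rho)            # hypothetical: re-create the character for this rho
    state, done = env.reset(), False
    transitions = []
    while not done:
        action = policy(np.concatenate([state, rho]))   # rho is appended to the policy input
        state_next, reward, done = env.step(action)
        transitions.append((state, rho, action, reward, state_next, done))
        state = state_next
    return transitions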
In this section we first discuss design aspects such as the neural network architec-
ture and hyperparameters, before presenting our experimental results.
5.1 Design
5.1.1 Network architecture
We tried three different architectures: Default, Split A and Split B. Diagrams are in
the Appendix (Section D.1). The first is the default architecture from Peng et al. [34]
with two hidden layers; we simply append ρρρ to the first layer’s input. We then took
inspiration from the split architecture used by Peng et al. for vision-based tasks, which
joins a convolutional network (processing image data) to the default architecture.
Similarly, Split A and Split B process sss and ρρρ separately before combining to form aaa.
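Purely as an illustration (the project does not necessarily use PyTorch, and the exact layer sizes are assumed here from the DeepMimic defaults), the Default adaptive actor can be sketched as:

import torch
import torch.nn as nn

class DefaultAdaptiveActor(nn.Module):
    """rho is simply concatenated with the character state before the first hidden layer."""
    def __init__(self, state_dim, rho_dim, action_dim, hidden=(1024, 512)):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + rho_dim, hidden[0]), nn.ReLU(),
            nn.Linear(hidden[0], hidden[1]), nn.ReLU(),
            nn.Linear(hidden[1], action_dim),   # linear output layer for the action
        )

    def forward(self, state, rho):
        return self.net(torch.cat([state, rho], dim=-1))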
5.1.2 PPO hyperparameters
PPO is an algorithm with many hyperparameters. Although it was designed to be
more robust to hyperparameter settings than other policy gradient methods [39], tuning
is still important. Due to long training times, there was not time for comprehensive
tuning, but after inspecting the training metrics of unsuccessful runs, we chose to focus
on the following. (Other hyperparameters were kept at the values in Section 3.1.2.)
Critic learning rate When training policies across particularly large ranges (e.g.
varying scale between 0.5 and 1.5), we observed that the critic loss began to diverge.
This implied that the default critic learning rate (0.01) might be too large (and causing
overshooting) so we tried smaller values.
Clipping factor A key property of PPO is the clipped objective function to prevent
harmfully-large policy updates (Section 3.1.1). The clipping factor (ε) determines how
large an update we allow. When training policies across large ranges, we observed that
the clipping fraction (how often we have to apply clipping) was lower than usual,
suggesting the default (0.2) might be too large and that the allowed policy update
amount should be further restricted.
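As a reminder (a simple sketch, not the project's training code), the per-sample clipped surrogate term that ε controls is:

import numpy as np

def clipped_surrogate(ratio, advantage, eps=0.2):
    # ratio = pi_new(a|s) / pi_old(a|s); eps is the clipping factor discussed above
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)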
5.2 Experiments
In this section, we discuss our experimental results. Each policy was trained until train-
ing return stopped noticeably improving, which normally took > 60 million samples.
Overall Scaling
We first investigated whether it was possible to train a policy that adapted to different
scales, so the ρρρ fed into the network is a 1D vector [s]. Our first experiment tried the
three architectures with ρρρ varying between 1.0 and 1.2, whilst keeping all the PPO
hyperparameters as their defaults. The results are shown in Figure 5.1 (top). All three
architectures achieve returns > 200 across the range, with Split A performing the best.
Since the networks performed well on the [1.0,1.2] range, we then ran the same
experiments on the [1.0,1.5] range, corresponding to our scale experiments in Chapter
4. The results are shown in Figure 5.1 (middle). On this larger range, the adaptive
policies perform much worse, failing to achieve high returns consistently across the
Figure 5.1: The episode returns of the different policies evaluated at different scales.
For the top graph, the policies were trained on the range [1, 1.2] using default hyperpa-
rameters. For the middle graph, we used [1,1.5] and default hyperparameters. For the
bottom graph, we used [1,1.5], halved the critic learning rate to be 0.005, and tried two
different clipping factor (CF) values (dashed is CF=0.2 (default), solid is CF=0.1).
whole range. The two split networks struggle at the higher end whilst the default
network performs worse at the lower end; however, it is difficult to draw concrete
conclusions due to the randomness involved in training. The issue of not being able
to handle the whole range may be related to catastrophic forgetting [22] - when neural
networks lose knowledge about previously-learned tasks when trained on new tasks.
For the next set of experiments, we made the hyperparameter adjustments discussed
in Section 5.1.2 to see if these performed better on the [1.0,1.5] range. We halved the
critic learning rate to be 0.005, and tried two different clipping factors 0.1 and 0.2.
The results are shown in Figure 5.1 (bottom). The default architecture still struggles,
however the two split architectures now achieve more consistent performance. Split A
with clipping factor 0.1 achieved the highest average return across the range.
Leg Length
Having seen that domain randomisation can successfully train policies that adapt to
different scales, we turned to look at varying leg length. As in Chapter 4, we look at
adding up to 40% length (referred to as 1) to each leg. For these experiments we again
used the halved critic learning rate (0.005). The results are shown in Figure 5.2.
One can see that unfortunately all the policies failed to perform well for increased
leg lengths, all falling below 150 when adding more than 0.2. (Our Chapter 4 demon-
strated that it is possible to achieve returns above 240 for added leg length up to 1.) It
remains as future work to see whether different architectures, hyperparameter settings,
or ideas from curriculum learning [16] could successfully produce an adaptive policy
for leg length. Since we did not manage to use domain randomisation to train an adap-
tive controller for leg length, we did not investigate varying multiple elements of ρρρ as
we did in Chapter 4.
Figure 5.2: Episode returns of different policies evaluated at different added leg lengths: Default (CLR=0.005, CF=0.1), Split A (CLR=0.005, CF=0.2), Split A (CLR=0.005, CF=0.1) and Split B (CLR=0.005, CF=0.1).
Adjusting output of a frozen network
We ran some experiments using an idea similar to progressive networks [36] for train-
ing adaptive policies. We did not produce strong results (we did not have time for
trying many architectures) so we discuss this in the Appendix (Section D.2).
Chapter 6
Results and Evaluation
In this chapter, we compare the methods explored in Chapters 4 and 5. The basic
DeepMimic retargeting method presented by Peng et al. [34] is our baseline. We look
at training time, episode return, and controller robustness. For the baseline, we ran-
domly initialise weights, use the hyperparameters listed in Section 3.1.2, and train until
training return stops clearly increasing. (True convergence is hard to observe due to the
noisiness of deep RL.) We compare to Peng et al.’s values, though we use a different
physics simulator so we do not expect to match exactly. We also quote values from Al
Borno et al. [2] who use LQR-trees to represent physical controllers that track motion.
6.1 Training time
In this section, we compare the different approaches in terms of their offline and on-
line computation times. One of the main motivations behind this dissertation was the
time-consuming nature of DeepMimic. Training a new policy from scratch for each
character model takes many hours, ruling out the possibility of an interactive editor.
All DeepMimic training was run either on Edinburgh’s ECDF cluster (https://
www.ed.ac.uk/information-services/research-support/research-computing/
ecdf/), using 16 cores with 64GB RAM; or on a machine with 12 cores and 64GB
RAM. Bayesian optimisation was run with a 2.3GHz CPU and 16GB RAM.
6.1.1 Offline time
Offline time is the work done in order to reduce the online time. It should only be
needed once for each character type and motion clip.
Baseline For the baseline, we assume no offline computation.
LQR-Trees Since all training is model-specific, we assume no offline computation.
Combination of policies Training base policies takes place offline. This is an expen-
sive process - at least one base policy is required for each dimension of ρρρ, and perhaps
more if the range to be covered is large (e.g. varying scale between 1 and 2 may re-
quire a base policy at 1.5 as well as ones at 1 and 2). Furthermore, if corrective policies
are used, the number of base policies in fact grows exponentially with ρρρ’s dimensions
(Section 4.4.2). However, training time can be reduced by using transfer learning, as
shown by the experiments in Section 4.1.1, meaning each policy only takes around 20
million samples (≈ 10 hours using 16 cores) rather than 40 million. If model fitting
methods such as KNN are used, we must construct a training set. Our Section 4.5
experiments showed that 36 samples (each taking ≈ 30 min. to fit) were sufficient for
good predictions when varying arm length up to 180% and leg length up to 140%.
Domain randomisation The training can take place offline, since it is not specific to
any ρρρ. We trained the policies for 50-60 million samples (25-30 hours using 16 cores),
which is not much more than the 40-60 million samples the baseline policies took, and
yet this was sufficient to produce adaptive controllers that adapt to scales up to 1.5.
It could be that reducing exploration noise more slowly and training for longer could
improve performance further. Training graphs are in the Appendix (Chapter B).
6.1.2 Online time
Online time is the time taken to produce a controller for a new modified character. Our
main motivation was to investigate methods that reduce this as much as possible.
Baseline The baseline takes at least 40 million samples before the training return
stops clearly improving. When using 16 cores, this is at least 20 hours of training
time. Peng et al. [34] similarly report that their running controller was trained for ≈ 50
million samples. Our training graphs are shown in the Appendix (Section B.2).
LQR-Trees The first part of the method uses CMA-ES to do space-time optimisation
and produce a controller that tracks the reference motion; this takes around 5 hours.
The next part builds the LQR-tree to add robustness to external forces; this takes 2 days.
Combination of policies When combining policies, online computation is required
for finding the best interpolation weights. In Section 4, we used Bayesian optimisa-
tion for this search. One can specify how many function evaluations the optimisation
routine should use; when using 4 base policies, we observed that increasing this be-
yond 40 stopped yielding improvements. Figure 6.1 shows the returns achieved on 5
different ρρρ’s when using different numbers of function evaluations - the lines become
mostly flat above 40. On a 2.3GHz CPU, 40 iterations takes around 15 minutes.
One way to avoid this online search is to perform more offline computation, namely
to use model fitting to predict the weights (as explored in Section 4.5). In this case, the
online time of running ρρρ through the model (e.g. KNN) is practically negligible.
Figure 6.1: We apply Bayesian optimisation to 5 different ρρρ’s, varying the number of
function evaluations allowed from 2 to 60. For each ρρρ and number of function evaluations we
solve 3 times; we show mean and standard error. Each line represents a different ρρρ.
Domain randomisation No online computation is needed - we simply feed ρρρ into
the adaptive policy at runtime.
6.2 Episode return
The total reward obtained during an episode (the return) can be used as a quantitative
measure of how successfully a policy is following the motion capture data. It should
only be used to compare policies trained on the same ρρρ, since the maximum achievable
return varies. We note however that a high reward doesn’t always mean good-looking
motion, and some motions still appear unsteady. With more time, we would experiment
further with hyperparameters and reward function weights to try and address this.
Uniform character scaling
We first look at changing the overall scale of the model between 1 and 1.5. We compare
the following methods: a) the baseline, b) the combination of two base policies ([s =
1,a = 0, l = 0] and [s = 1.5,a = 0, l = 0]) where weights are found using brute-force,
and c) an adaptive neural network trained using domain randomisation (using the Split
A architecture which performed best in Chapter 5). Figure 6.2 shows the results.
Domain randomisation (with CF=0.1) achieved high returns across most of the
range, although it performed poorly for the default scale 1. Future work would be to
repeat the training multiple times with different random seeds to see if we were simply
‘unlucky’ here. At higher scales it outperforms the baseline, which was unexpected -
this could perhaps be due to the reduced CF, since we used CF=0.2 for all the baseline
models. (For comparison, we also show domain randomisation with CF=0.2, which
performs worse than the baseline.) Combination performs similarly to the baseline at
1 and 1.5 (where it is essentially using the same policies), but struggles between 1 and
1.3. Using another base policy (e.g. at 1.25) may be one way to improve this.
Overall, in these experiments it was domain randomisation that performed best.
Figure 6.2: Episode returns achieved at different character scales.
Scaling individual limbs
We now look at increasing the length of the arms and legs. We did not succeed at
training policies for this using domain randomisation, so we only compare the base-
line and combination. We fitted the combination weights using Bayesian optimisation
(using the hyperparameters identified in Section 4.4.1) and 40 function evaluations.
The results are in Figure 6.5, with the corresponding ρρρ’s shown in Table 6.1.
Combination performs similarly to the baseline, even demonstrating notable im-
provements for some ρρρ’s. This indicates that DeepMimic is not always reliable at
arm \leg 0.00 0.25 0.50 0.75 1.00
0.00 a b c d e
0.25 f - - j -
0.50 g - k - l
0.75 h m - - -
1.00 i - n - o
Table 6.1: Body parameter (ρρρ) values used in Figures 6.3 and 6.5.
finding an optimal policy when retargeting to models that do not match the motion
capture data, and that the flexibility offered by linearly combining base policies can
outperform it. It could be that we were unlucky with our weight initialisations, and
future work would be to train each baseline model multiple times to account for this.
Nevertheless, it can be seen that combination can achieve high returns.
Figure 6.3: Episode returns achieved for different arm/leg lengths (see Table 6.1).
6.3 Controller robustness
In this section, we evaluate the robustness of the different controllers to externally-
applied forces. One of the main advantages of using physically-based animation over
kinematic animation is the ability to produce emergent reactions to perturbations.
We follow the methodology of Peng et al. in the DeepMimic paper [34]. They
apply an external force halfway through the running motion cycle. It acts on the char-
acter’s pelvis in the direction perpendicular to forward movement, and is applied for
0.2 seconds. We begin with a 50N force, and if the character does not fall, we restart
the episode and increase the force applied by 25N, repeating until the character falls.
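A small sketch of this test procedure is shown below; run_episode_with_push is a hypothetical helper that applies the push described above and reports whether the character fell.

def max_tolerated_force(env, policy, start_n=50.0, step_n=25.0):
    force = start_n
    while True:
        # Push the pelvis sideways for 0.2 s halfway through the running cycle
        fell = env.run_episode_with_push(policy, force_newtons=force, duration_s=0.2)
        if fell:
            return force - step_n   # largest force that was survived
        force += step_n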
Peng et al. reported that their running controller could tolerate 300 N (i.e. a 300 × 0.2 = 60 Ns impulse), and stated this was comparable to other physical motion synthe-
sis methods such as SAMCON [29] (whose running controller can withstand a 50Ns
impulse). It is worth noting that no external forces are applied during training; Peng
et al. hypothesise that robustness comes from the exploration noise exposing the char-
acter to more states when training. Al Borno et al. [2] train using external forces, yet
report worse figures (their controller can withstand a 50 Ns impulse ≈ 30% of the time).
Uniform character scaling
Figure 6.4 shows the results when varying scale. Although we saw that domain ran-
domisation had the highest returns, the methods do similarly to each other here. (Means:
340±5N (baseline), 350±4N (combination), 340±10N (domain randomisation))
Figure 6.4: Maximum sideways force tolerated by controllers for different scales.
Scaling individual limbs
Figure 6.5 shows the results when arm and leg length are varied. As may be ex-
pected, extending limb length results in less robust controllers, however the methods
still achieve good means: 310±30N (baseline), 330±15N (combination).
Figure 6.5: Sideways force tolerated by controllers for different limb lengths (Table 6.1).
Chapter 7
Conclusions and Future Work
Conclusions
We investigated how to train adaptive controllers for physically-based animation which
can be quickly applied to character models whose body parameters have been modi-
fied. Two approaches were implemented: linearly combining controllers; and domain
randomisation. Both were inspired by sim-to-real research in robotics, and to our
knowledge have not been used for animation. We built our methods around Deep-
Mimic [34], a physical motion synthesis method that uses deep RL. We evaluated
them on animating a humanoid (with variable scale and limb length) to run.
Combination of controllers Our experiments used up to 4 base controllers, and
showed that it was possible to produce good controllers for the body parameter space
spanned by these controllers. We observed that linear interpolation did not find the op-
timal combination weights, and therefore we proposed using black-box optimisation.
Finally, we demonstrated that it is possible to apply model fitting methods (cubic inter-
polation and KNN) to learn from a training set of fitted combination weights in order
to cover the whole body parameter space and remove the need for online computation.
Domain randomisation We tried three different neural network architectures
and several different PPO hyperparameter settings. By inputting the character model’s
scale to the neural network, and randomly varying this during training, we demonstrate
that the resulting controller is able to adapt to different character model scales.
Comparison Both approaches could produce adaptive controllers for varying
character scales up to 150%. No online computation is required (if model fitting is
used for the linear combination approach), a substantial improvement over the > 20
hours needed by the baseline (using DeepMimic to train from scratch). In terms of
offline time, domain randomisation did not require much longer than the baseline’s
online time. The combination approach required training two base policies (speed-up
can be obtained using transfer learning). In terms of episode return (i.e. how closely
the controller follows the reference motion), domain randomisation outperformed both
linear combination and the baseline. In terms of robustness to external forces, all
approaches performed similarly, and comparably to results in the literature [34, 29, 2].
The combination approach could produce controllers for varying arm/leg length
(increasing by up to 80% and 40% respectively), achieving similar or better returns and
robustness to the baseline. Domain randomisation was unable to handle varying leg
length, indicating it is a harder approach to tune. However, the combination procedure
is more complex, and as we vary more body parameter dimensions, the number of base
policies needed (and weight search space dimensions) grows exponentially.
Future work
We showed that training adaptive controllers using DeepMimic is possible. They re-
quire no online computation, whilst generating motion that is robust and mostly fol-
lows the reference motion as well as the DeepMimic baseline. However, the motions
do not quite look consistent enough to be used in practice. Given more time, we would
try to reduce the unsteadiness in the motions by tuning the reward function and hyper-
parameters (e.g. PD gains) better. We also note that PyBullet [5] currently does not
allow setting rotation limits on joints with 4 DOF, which can lead to strange motions.
We would like to try other body variations and motions, and also non-humans. It
would also be important to see if our controllers can be made interactively-controllable.
We note the scalability of the combination approach to more body parameter di-
mensions as a potential issue; future work could investigate this. In particular, it should
be possible to design heuristics to guide the weight optimisation process. The model
fitting method that works best may also change as dimensions grow.
We believe that tuning the hyperparameters and architecture more thoroughly is key
to improving domain randomisation’s performance. We would also have liked to carry
out more repeats to assess domain randomisation’s reliability across random seeds. We
only performed preliminary experiments using progressive networks (Section D.2) but
believe they are an interesting avenue; we simply did not find the right architecture in
the time we had. Another idea is curriculum learning [16], in which one designs
how the domain varies during training rather than using pure randomisation.
Bibliography
[1] Mazen Al Borno, Martin De Lasa, and Aaron Hertzmann. Trajectory optimiza-
tion for full-body movements with complex contacts. IEEE transactions on visu-
alization and computer graphics, 19(8):1405–1414, 2012.
[2] Mazen Al Borno, Ludovic Righetti, Michael J Black, Scott L Delp, Eugene Fi-
ume, and Javier Romero. Robust physics-based motion retargeting with realistic
body shapes. In Computer Graphics Forum, volume 37, pages 81–92. Wiley
Online Library, 2018.
[3] Peter Alfeld. A trivariate Clough-Tocher scheme for tetrahedral data. Computer
Aided Geometric Design, 1(2):169–181, 1984.
[4] Bobby Bodenheimer, Chuck Rose, Seth Rosenthal, and John Pella. The process
of motion capture: Dealing with the data. In Computer Animation and Simulation '97, pages 3–18. Springer, 1997.
[5] Erwin Coumans and Yunfei Bai. Pybullet, a python module for physics simula-
tion for games, robotics and machine learning. GitHub repository, 2016.
[6] T Cover and P Hart. Nearest neighbor pattern classification. IEEE Transactions
on Information Theory, 13(1):21–27, 1967.
[7] Peter I Frazier. A tutorial on bayesian optimization. arXiv preprint
arXiv:1807.02811, 2018.
[8] Thomas Geijtenbeek and Nicolas Pronost. Interactive character animation us-
ing simulated physics: A state-of-the-art review. In Computer graphics forum,
volume 31, pages 2492–2515. Wiley Online Library, 2012.
[9] Ruben Glatt, Felipe Leno Da Silva, and Anna Helena Reali Costa. Towards
knowledge transfer in deep reinforcement learning. In Intelligent Systems
(BRACIS), 2016 5th Brazilian Conference on, pages 91–96. IEEE, 2016.
[10] Michael Gleicher. Retargetting motion to new characters. In Proceedings of the
25th annual conference on Computer graphics and interactive techniques, pages
33–42. ACM, 1998.
[11] Shihui Guo, Richard Southern, Jian Chang, David Greer, and Jian Zhang. Adap-
tive motion synthesis for virtual characters: a survey. The Visual Computer,
31(5):497–512, 2015.
[12] Abhishek Gupta, Coline Devin, YuXuan Liu, Pieter Abbeel, and Sergey Levine.
Learning invariant feature spaces to transfer skills with reinforcement learning.
arXiv preprint arXiv:1703.02949, 2017.
[13] Christian Habermann and Fabian Kindermann. Multidimensional spline inter-
polation: Theory and applications. Computational Economics, 30(2):153–169,
2007.
[14] Nikolaus Hansen. The cma evolution strategy: A tutorial. arXiv preprint
arXiv:1604.00772, 2016.
[15] Nikolaus Hansen, Youhei Akimoto, and Petr Baudis. CMA-ES/pycma on Github.
Zenodo, DOI:10.5281/zenodo.2559634, February 2019.
[16] Nicolas Heess, Srinivasan Sriram, Jay Lemmon, Josh Merel, Greg Wayne, Yuval
Tassa, Tom Erez, Ziyu Wang, SM Eslami, Martin Riedmiller, et al. Emergence of
locomotion behaviours in rich environments. arXiv preprint arXiv:1707.02286,
2017.
[17] Jessica K Hodgins and Nancy S Pollard. Adapting simulated behaviors for new
characters. In Proceedings of the 24th annual conference on Computer graphics
and interactive techniques, pages 153–162. ACM Press/Addison-Wesley Pub-
lishing Co., 1997.
[18] Daniel Holden, Jun Saito, and Taku Komura. A deep learning framework for
character motion synthesis and editing. ACM Transactions on Graphics (TOG),
35(4):138, 2016.
[19] Mohd Izani, AR Eshaq, et al. Keyframe animation and motion capture for cre-
ating animation: a survey and perception from industry people. In Proceedings.
Student Conference on Research and Development, 2003. SCORED 2003., pages
154–159. IEEE, 2003.
[20] Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific
tools for Python, 2001-2019.
[21] Pushkar Joshi, Wen C Tien, Mathieu Desbrun, and Frederic Pighin. Learning
controls for blend shape based realistic facial animation. In ACM Siggraph 2006
Courses, page 17. ACM, 2006.
[22] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume
Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Ag-
nieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural
networks. Proceedings of the national academy of sciences, 114(13):3521–3526,
2017.
[23] Taku Komura, Ikhsanul Habibie, Jonathan Schwarz, and Daniel Holden. Data-
driven character animation synthesis. Handbook of Human Motion, pages 1–29,
2017.
[24] Lucas Kovar and Michael Gleicher. Automated extraction and parameterization
of motions in large data sets. In ACM Transactions on Graphics (ToG), vol-
ume 23, pages 559–568. ACM, 2004.
[25] GL Manoj Kumar and Tim Head. Scikit-optimize. Tim Head and contributors,
2017.
[26] Alessandro Lazaric, Marcello Restelli, and Andrea Bonarini. Transfer of samples
in batch reinforcement learning. In Proceedings of the 25th International Con-
ference on Machine Learning, ICML ’08, pages 544–551, New York, NY, USA,
2008. ACM.
[27] John P Lewis, Ken Anjyo, Taehyun Rhee, Mengjie Zhang, Frederic H Pighin, and
Zhigang Deng. Practice and theory of blendshape facial models. Eurographics
(State of the Art Reports), 1(8):2, 2014.
[28] Libin Liu and Jessica Hodgins. Learning basketball dribbling skills using trajec-
tory optimization and deep reinforcement learning. ACM Transactions on Graph-
ics (TOG), 37(4):142, 2018.
[29] Libin Liu, Michiel Van De Panne, and KangKang Yin. Guided learning of con-
trol graphs for physics-based characters. ACM Transactions on Graphics (TOG),
35(3):29, 2016.
[30] Josh Merel, Yuval Tassa, Sriram Srinivasan, Jay Lemmon, Ziyu Wang, Greg
Wayne, and Nicolas Heess. Learning human behaviors from motion capture by
adversarial imitation. arXiv preprint arXiv:1707.02201, 2017.
[31] Vishal M Patel, Raghuraman Gopalan, Ruonan Li, and Rama Chellappa. Vi-
sual domain adaptation: A survey of recent advances. IEEE signal processing
magazine, 32(3):53–69, 2015.
[32] Fabian Pedregosa, Gael Varoquaux, Alexandre Gramfort, Vincent Michel,
Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron
Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in python. Journal
of machine learning research, 12(Oct):2825–2830, 2011.
[33] Tomislav Pejsa and Igor S Pandzic. State of the art in example-based motion
synthesis for virtual characters in interactive applications. In Computer Graphics
Forum, volume 29, pages 202–226. Wiley Online Library, 2010.
[34] Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. Deep-
mimic: Example-guided deep reinforcement learning of physics-based character
skills. ACM Trans. Graph., 37(4):143:1–143:14, 2018.
[35] Aravind Rajeswaran, Sarvjeet Ghotra, Balaraman Ravindran, and Sergey Levine.
Epopt: Learning robust neural network policies using model ensembles. arXiv
preprint arXiv:1610.01283, 2016.
[36] Andrei A Rusu, Neil C Rabinowitz, Guillaume Desjardins, Hubert Soyer, James
Kirkpatrick, Koray Kavukcuoglu, Razvan Pascanu, and Raia Hadsell. Progres-
sive neural networks. arXiv preprint arXiv:1606.04671, 2016.
[37] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp
Moritz. Trust region policy optimization. In International Conference on Ma-
chine Learning, pages 1889–1897, 2015.
[38] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter
Abbeel. High-dimensional continuous control using generalized advantage es-
timation. arXiv preprint arXiv:1506.02438, 2015.
[39] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov.
Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,
2017.
[40] Jonathan Richard Shewchuk. Delaunay refinement algorithms for triangular mesh
generation. Computational geometry, 22(1-3):21–74, 2002.
[41] Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical bayesian opti-
mization of machine learning algorithms. In Advances in neural information
processing systems, pages 2951–2959, 2012.
[42] Kwang Won Sok, Manmyung Kim, and Jehee Lee. Simulating biped behaviors
from human motion data. In ACM Transactions on Graphics (TOG), volume 26,
page 107. ACM, 2007.
[43] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction.
MIT press, 2018.
[44] Andrew Witkin and Michael Kass. Spacetime constraints. ACM Siggraph Com-
puter Graphics, 22(4):159–168, 1988.
[45] Jungdam Won, Jongho Park, Kwanyu Kim, and Jehee Lee. How to train your
dragon: example-guided control of flapping flight. ACM Transactions on Graph-
ics (TOG), 36(6):198, 2017.
[46] Katsu Yamane, Yuka Ariki, and Jessica Hodgins. Animating non-humanoid
characters with human motion data. In Proceedings of the 2010 ACM SIG-
GRAPH/Eurographics Symposium on Computer Animation, pages 169–178. Eu-
rographics Association, 2010.
[47] KangKang Yin, Kevin Loken, and Michiel Van de Panne. Simbicon: Simple
biped locomotion control. In ACM Transactions on Graphics (TOG), volume 26,
page 105. ACM, 2007.
[48] Wenhao Yu, C Karen Liu, and Greg Turk. Policy transfer with strategy optimiza-
tion. arXiv preprint arXiv:1810.05751, 2018.
[49] Wenhao Yu, Jie Tan, C Karen Liu, and Greg Turk. Preparing for the unknown:
Learning a universal policy with online system identification. arXiv preprint
arXiv:1702.02453, 2017.
[50] Chao Zhang, Yang Yu, and Zhi-Hua Zhou. Learning environmental calibration
actions for policy self-evolution. In IJCAI, pages 3061–3067, 2018.
Appendix A
Reward Function Terms
These are the terms used to compute the reward (Section 3.2). We use the same defini-
tions as given in the DeepMimic paper [34]. Each joint j additionally has an associated
weight w j to denote its relative importance. These are displayed in Figure A.1.
$$r^p_t = \exp\!\left[-2\left(\sum_j w_j \left\| \hat{q}^j_t \ominus q^j_t \right\|^2\right)\right] \tag{A.1}$$
$q^j_t$ denotes the angle of the simulated character's joint $j$ at time $t$ (represented as a quaternion). $\hat{q}^j_t$ denotes the corresponding target angle from the reference motion. $\ominus$ takes the quaternion difference, and $\|\cdot\|$ turns a quaternion into the scalar rotation about its axis.
$$r^v_t = \exp\!\left[-0.1\left(\sum_j w_j \left\| \dot{\hat{q}}^j_t - \dot{q}^j_t \right\|^2\right)\right] \tag{A.2}$$
$\dot{q}^j_t$ and $\dot{\hat{q}}^j_t$ denote the simulated character and reference motion's joint velocities respectively. $\dot{\hat{q}}^j_t$ is computed from the reference motion using finite differences.
$$r^e_t = \exp\!\left[-40\left(\sum_e w_e \left\| \hat{p}^e_t - p^e_t \right\|^2\right)\right] \tag{A.3}$$
$p^e_t$ and $\hat{p}^e_t$ represent the global positions of end-effector $e$ for the simulated character and reference motion respectively.
$$r^c_t = \exp\!\left[-10\left(w_c \left\| \hat{p}^c_t - p^c_t \right\|^2\right)\right] \tag{A.4}$$
$p^c_t$ and $\hat{p}^c_t$ represent the global position of the root (pelvis) for the simulated char-
acter and reference motion respectively.
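As an illustration of how the pose term (A.1) can be evaluated, here is a small self-contained sketch; the quaternion helpers are our own, not taken from the simulator.

import numpy as np

def quat_conjugate(q):
    w, x, y, z = q
    return np.array([w, -x, -y, -z])

def quat_multiply(a, b):
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return np.array([
        aw * bw - ax * bx - ay * by - az * bz,
        aw * bx + ax * bw + ay * bz - az * by,
        aw * by - ax * bz + ay * bw + az * bx,
        aw * bz + ax * by - ay * bx + az * bw,
    ])

def pose_reward(q_sim, q_ref, joint_weights):
    """q_sim, q_ref: lists of unit quaternions (w, x, y, z) per joint; joint_weights: the w_j."""
    total = 0.0
    for q, q_hat, w in zip(q_sim, q_ref, joint_weights):
        diff = quat_multiply(q_hat, quat_conjugate(q))              # the quaternion difference
        angle = 2.0 * np.arccos(np.clip(abs(diff[0]), 0.0, 1.0))    # scalar rotation about its axis
        total += w * angle ** 2
    return np.exp(-2.0 * total)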
Figure A.1: Diagram of the actuable joints in the humanoid model. Each joint is labelled
with a black circle, along with its degrees-of-freedom (DOF) and reward function weight
w j.
Appendix B
Training Graphs
B.1 Domain randomisation
Figure B.1: Training graphs (return against number of samples) for the domain randomisation experiments in Section 5.2 (the bottom graph in Figure 5.1): Default, Split A and Split B, each trained with CF=0.2 and CF=0.1. Split A (CF=0.1) corresponds to the domain randomisation controller used for evaluation in Chapter 6.
B.2 Baseline
Figure B.2: Training graphs (return against number of samples) for the baseline policies used for evaluation in Chapter 6: scales s = 1.00, 1.12, 1.24, 1.36 and 1.50, and body parameters a–o from Table 6.1 (a is the same as s = 1.00). The solid lines show training return, the dashed lines show test return.
Appendix C
Combination of Policies
C.1 Transferred vs non-transferred base policies
In Section 4.1.1, we observed that using transfer learning enabled base policies to be
trained more quickly. An interesting question is whether interpolation works better
when using base policies that were all trained using transfer from the default base
policy than when using base policies that were trained from a random initialisation.
Since the policy space in which DeepMimic operates will contain many local optima,
policies trained from different random initialisations are likely to end up in different
local optima, which may mean they combine together less effectively.
We ran two sets of experiments to test this hypothesis. The first looks at varying
overall scale (base policies: [s = 1,a = 0, l = 0] and [s = 1.5,a = 0, l = 0]); the second
looks at varying both arm and leg length at once (base policies: [s = 1,a = 0, l = 0]
and [s = 1,a = 1, l = 1]). All experiments use the same default model policy. We then
compare the results of training the second policy from scratch against training it using
transfer from the default policy. Figure C.1 shows the results.
In both graphs, neither scheme consistently outperforms the other. When varying
scale, the transfer learning policy does much better than the randomly-initialised policy
for larger scales; when varying limb length, it performs better for smaller lengths.
Overall, there is not enough evidence to strongly support our hypothesis (that policies
initialised from the same default policy combine together better), but since it does not
make performance notably worse (and does better at high scales), and it also speeds up
training times, we used transferred policies in our Chapter 4 and 6 experiments.
Figure C.1: Comparison of episode returns for different ρρρ’s when using the default policy
and another policy trained from scratch, against using the default policy and another
policy initialised from the default policy. In the top graph we vary scale, in the bottom
graph we modify arm and leg length simultaneously (so e.g. 0.5 on the x-axis denotes
[s = 1,a = 0.5, l = 0.5]).
C.2 Segmenting the action vector
An experimental idea was to split the action vector into segments, and use a different
interpolation weight for each segment of the vector. A similar idea has been proposed
for facial animation using blend shapes where the face is split into three areas that
can be manipulated independently [21]. However, the finer-grain control over what
is taken from each base policy comes at the cost of increasing the dimensions of the
weight search space.
Four different ways of segmenting the action vector were tried:
• 1 segment
• 3 segments: head and torso, arms, legs
• 5 segments (a): head and torso, right arm, right leg, left arm, left leg
• 5 segments (b): head and torso, shoulders, elbows and wrists, hips, knees and
ankles
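To make the idea concrete, the sketch below shows per-segment weighting for the 3-segment scheme; the index ranges are purely illustrative, and we assume here (for illustration only) that the blend is applied to the base policies' action outputs.

import numpy as np

# Illustrative index ranges into the action vector for the 3-segment scheme
SEGMENTS = {"head_torso": slice(0, 8), "arms": slice(8, 20), "legs": slice(20, 36)}

def combine_per_segment(base_actions, segment_weights):
    """base_actions: list of action vectors, one per base policy;
    segment_weights[name]: one weight per base policy for that segment (summing to 1)."""
    combined = np.zeros_like(base_actions[0])
    for name, idx in SEGMENTS.items():
        for action, weight in zip(base_actions, segment_weights[name]):
            combined[idx] += weight * action[idx]
    return combined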
In order to assess the performance of these different segmentation approaches, we
ran experiments using each approach where we used Bayesian optimisation to find the
best interpolation weights. We ran one set of experiments that varies arm length (base
policies: [s = 1,a = 0, l = 0], [s = 1,a = 1, l = 0]), and another that varies leg length
(base policies: [s= 1,a= 0, l = 0], [s= 1,a= 0, l = 1]). We used the parameter settings
that performed best in Section 4.4.1 and used 40 function evaluations. Figure C.2 shows
the results.
Figure C.2: Episode returns obtained using different approaches for segmenting the ac-
tion vector. For the single-segment approach, the best weights were found using brute-
force at a resolution of 0.1. For the multiple-segment approaches, Bayesian search was
run three times for each ρρρ; the bar plots show the mean and standard error. In the left
plot, the arm length was varied. In the right plot, the leg length was varied.
It can be seen that segmenting the action vector does not offer any clear improve-
ments, with the returns of the different approaches all being similar. For some ρρρ’s,
the segmented approaches even do worse than using a single segment; this is an issue
with Bayesian optimisation not finding the true optimum (since the multiple-segment
approaches can represent every policy that the single-segment approach can). This il-
lustrates the trade-off between increasing return and increasing the dimensions of the
search space.
Future work could look at whether allowing more function evaluations during op-
timisation could improve the performance of segmented approaches. Heuristics could
also be investigated e.g. first optimising using the single-segment approach, and using
the weights found to initialise the search process for a multiple-segment approach.
Appendix D
Domain Randomisation
D.1 Neural network architecture diagrams
Figure D.1: Actor network architectures used in Chapter 5. Left to right: Default, Split A,
and Split B. The critic network architectures are the same except that the final aaa linear
output layer is replaced by a single linear unit. We also show whether layers use the
ReLU activation or a simple linear activation.
D.2 Adjusting output of a frozen network
In this section, we investigate whether it is possible to learn how to adjust the action
vector aaa outputted by a non-adaptive policy in order to handle different ρρρ’s. The idea
is related to progressive neural networks [36] which were proposed as a domain adap-
tation method for RL; Rusu et al. first train a policy on a source task, and then freeze
those policy layers and connect them to new layers that are trained on the target task.
Our approach is to first train a non-adaptive policy with the default two-layer ar-
chitecture on the default ρρρ. This default architecture only takes the character state sss as
input. Once this policy has converged, we then freeze those layers, and add new layers
Figure D.2: Actor network architectures. Left to right: A, B, C. The critic network archi-
tectures are the same except that the final aaa linear output layer is replaced by a single
linear unit. The greyed-out layers (in the top left of each network) represent the frozen
default network - these have first been trained on the default character model, and are
then frozen during the domain randomisation training.
that process ρρρ in order to learn how to adjust the action aaa outputted by the default archi-
tecture into a new action aaa′′′. These new layers are trained using domain randomisation
- we randomly sample the elements of ρρρ at the start of each episode.
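Purely as an illustration of the general idea (not a reproduction of architectures A, B or C, whose exact wiring is shown in Figure D.2, and with PyTorch used only for the sketch), a frozen default actor whose output is adjusted by new ρρρ-conditioned layers might look like this:

import torch
import torch.nn as nn

class FrozenAdjustedActor(nn.Module):
    def __init__(self, frozen_actor, rho_dim, action_dim, hidden=36):
        super().__init__()
        self.frozen = frozen_actor
        for p in self.frozen.parameters():
            p.requires_grad = False            # the pre-trained default policy stays fixed
        self.adjust = nn.Sequential(           # new trainable layers that process rho and a
            nn.Linear(rho_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, rho):
        a = self.frozen(state)                               # action from the frozen network
        return a + self.adjust(torch.cat([rho, a], dim=-1))  # adjusted action a'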
We tried three different architectures (A, B, C), shown in Figure D.2. In our ex-
periments, we tried training an adaptive controller for scale varying between 1 and
1.2. The significantly different network architecture meant that the hyperparameters
we used in other experiments (Section 3.1.2) were unsuitable, in particular the learn-
ing rates. Long training times prevented a thorough search for suitable new values,
but the best results we achieved are shown in Figure D.3. The hyperparameters that
we changed from the default values are the following: actor learning rate: $10^{-4}$, critic
learning rate: $10^{-2}$, clipping factor ε: 0.05. It can be seen that none of the archi-
tectures achieves consistently high returns (it is possible to achieve above 280 for all
scales between 1 and 1.5). We leave the investigation of different architectures and
hyperparameters as future work.
Figure D.3: Episode returns of the different architectures, A (36 units), B (72 units) and C (36 + 36 units), evaluated at different scales.