1 Combining Recurrent Neural Networks and Adversarial Training …humanmotion.ict.ac.cn/papers/2019P3_RNN/paper.pdf · 2019-11-18 · 1 Combining Recurrent Neural Networks and Adversarial

1

Combining Recurrent Neural Networks andAdversarial Training for Human Motion

Synthesis and ControlZhiyong Wang, Jinxiang Chai*, and Shihong Xia*

Abstract—This paper introduces a new generative deep learning network for human motion synthesis and control. Our key idea is tocombine recurrent neural networks (RNNs) and adversarial training for human motion modeling. We first describe an efficient methodfor training an RNN model from prerecorded motion data. We implement RNNs with long short-term memory (LSTM) cells becausethey are capable of addressing the nonlinear dynamics and long term temporal dependencies present in human motions. Next, we traina refiner network using an adversarial loss, similar to generative adversarial networks (GANs), such that refined motion sequences areindistinguishable from real mocap data using a discriminative network. The resulting model is appealing for motion synthesis andcontrol because it is compact, contact-aware, and can generate an infinite number of naturally looking motions with infinite lengths. Ourexperiments show that motions generated by our deep learning model are always highly realistic and comparable to high-qualitymotion capture data. We demonstrate the power and effectiveness of our models by exploring a variety of applications, ranging fromrandom motion synthesis, online/offline motion control, and motion filtering. We show the superiority of our generative model bycomparison against baseline models.

Index Terms—Deep learning, Adversarial Training, Human Motion Modeling, Synthesis and Control.

F

1 INTRODUCTION

T HIS paper focuses on constructing a generative model forhuman motion generation and control. Thus far, one of the

most successful solutions to this problem is to build generativemodels from prerecorded motion data. Generative models areappealing for motion generation because they are often compact,have a strong generalization ability to create motions that are notin prerecorded motion data, and can generate an infinite numberof motion variations with a small number of hidden variables.Despite the progress made over the last decade, creating appro-priate generative models for human motion generation remainschallenging because it requires handling the nonlinear dynamicsand long-term temporal dependencies of human motions.

In this paper, we introduce an efficient generative model forhuman motion modeling, generation and control. Our key idea isto combine the power of RNNs and adversarial training for humanmotion generation, in which synthetic motions are generated fromthe generator using RNNs and the generated motion is refinedusing an adversarial neural network, which we call the “refinernetwork”. Fig 2 gives an overview of our method: a motionsequence XRNN is generated with the generator G and is refinedusing the refiner network R. To add realism, we train our refinernetwork using an adversarial loss, similar to GANs [1], such thatthe refined motion sequences Xre f ine are indistinguishable from

• Corresponding Author: Jinxiang Chai ([email protected]) and Shihong X-ia ([email protected])Zhiyong Wang and Shihong Xia are with the Beijing Key Laboratory ofMobile Computing and Pervasive Device, Institute of Computing Technol-ogy, Chinese Academy Of Sciences, Beijing, 100190, China.Zhiyong Wang and Shihong Xia are also with the University of ChineseAcademy of Sciences.E-mail: wangzhiyong, [email protected] Chai is with Texas A&M University.E-mail: [email protected]

real motion capture sequences Xreal using a discriminative networkD. In addition, we embed contact information into the generativemodel to further improve the quality of the generated motions.

We construct the generator G based on RNNs. RNNs are con-nectionist models that capture the dynamics of sequences via cy-cles in the network of nodes. Recurrent neural networks, however,have traditionally been difficult to train because they often containmillions of parameters and have vanishing gradient problems whenthey are applied to handle long term temporal dependencies. Weaddress the challenge by using a LSTM architecture [2], which hasrecently demonstrated impressive performance on tasks as variedas speech recognition [3], [4], language translation [5], [6], [7],and image generation [8].

Our refiner network, similar to GANs [1], is built upon theconcept of game theory, where two models are used to solvea minimax game: a generator that samples synthetic data fromthe model and a discriminator that classifies the data as real orsynthetic. Thus far, GANs have achieved state-of-the-art results ona variety of generative tasks, such as style transfer [9], 3D objectgeneration [10], image super-resolution [11], image translation[12] and image generation [13], [14], [15].

Our final generative model is appealing for human motiongeneration. In our experiments, we show that motions generated byour model are always highly realistic. Our model is also compactbecause we do not need to preserve the original training data oncethe model is trained. In addition, when we acquire new trainingdata, we do not always need to train the model from scratch. If thetotal amount of data does not exceed the capacity of the network,we can utilize the previous networks as a pretrained model andfine-tune the model with the new training data instead.

We have demonstrated the power and effectiveness of ourmodel by exploring a variety of applications, including motiongeneration, motion control and motion filtering. With our model,

2

Fig. 1. Human motion generation and control with our model. (left) random generation of high-quality human motions; (right) realtime synthesis andcontrol of human motions.

RNN generatorG

DiscriminatorD

RefinerR

-

Real vs Refined

Synthesized motion: xRNN

Refined motion: xRefined

Real motion: xreal

Fig. 2. The Pipeline of our system. We first generate a motion sequence from the RNN generator, G, and then refine the output of the generatorwith a refiner neural network, R, that minimizes the combination of an adversarial loss and a “self-regularization” term. The adversarial loss “fools” adiscriminator network, D, which classifies the motion as real or refined. The self-regularization term minimizes the difference between the syntheticmotion and the refined motion.

we can sample an infinite number of natural-looking motions withinfinite lengths, create a desired animation with various formsof control input, such as the direction and speed of a “runningmotion”, and transform unlabeled noisy input motion into high-quality output motion. We show the superiority of our modelthrough comparison against a baseline generative RNN model.We show that our method achieves a more accurate motion syn-thesis result than an alternative method with given root trajectoryconstraints.

1.1 Contributions

Our work is made possible by a number of technical contributions:

• We present a new contact-aware deep learning modelfor data-driven human motion modeling, which combinesthe power of recurrent neural networks and adversarialtraining.

• We train an adversarial refiner network to add realism tothe motions generated by RNNs with LSTM cells, using a

combination of an adversarial loss and a self-regularizationloss.

• We introduce methods for applying the trained deep learn-ing model to motion synthesis, control and filtering.

2 BACKGROUND

Our approach constructs a generative deep learning model froma large set of prerecorded motion data and uses it to createrealistic animation that satisfies various forms of input constraints.Therefore, we will focus our discussion on generative motionmodels and their application in human motion generation andcontrol.

Our work builds upon a significant body of previous workon constructing generative statistical models for human motionanalysis and synthesis. Generative statistical motion models areoften represented as a set of mathematical functions that describehuman movement using a small number of hidden parametersand their associated probability distributions. Previous genera-tive statistical models include Hidden Markov Models (HMMs)

3

[16], variants of statistical dynamic models for modeling spatial-temporal variations within a temporal window [17], [18], [19],[20], [21], and concatenating statistical motion models into finitegraphs of deformable motion models [22].

Most recent work on generative modeling has been focused onemploying deep RNNs to model the dynamic temporal behaviorof human motions for motion prediction [23], [24], [25], [26],[27], [28]. For example, Fragkiadaki and colleagues [23] proposedtwo architectures: LSTM-3LR (3 layers of LSTM cells) andERD (Encoder-Recurrent-Decoder) to concatenate LSTM units tomodel the dynamics of human motions. Jain and colleagues [24]introduced structural RNNs (SRNNs) for human motion predictionand generation by combining high-level spatio-temporal graphswith the sequence modeling success of RNNs. RNNs are appealingfor human motion modeling because they can handle nonlineardynamics and long-term temporal dependencies in human motion-s. However, as observed by other researchers [25], [26], currentdeep RNN based methods often have difficulty obtaining goodperformance in long term motion generation. They tend to failwhen generating long sequences of motion, as the errors in theirprediction are fed back into the input and accumulate. As a result,their long-term results suffer from occasional unrealistic artifacts,such as foot sliding, and gradually converge to a static pose. Zhouet al. [28] successfully trained an auto-conditioned RNN modelthat is able to produce a long motion sequence. However, astheir model does not provide an estimation of the distribution ofthe next frame, their model cannot be directly applied to motioncontrol applications.

To address the challenge, we estimate both the means and stan-dard deviations of the motion parameters. We refine the motionsgenerated by RNN using a “refiner network” with an adversarialloss such that the refined motion sequences are indistinguishablefrom real motion capture data using a discriminative network.Adversarial training allows us to construct a generative motionmodel that can randomly generate an infinite number of high-quality motions with infinite length, a capability that has notbeen demonstrated in previous work. Our user studies show thatthe motions generated by our model have comparable quality tohigh-quality motion capture data and are more realistic than thosegenerated by RNNs. Our goal is also different from theirs becausewe aim to learn generative models for human motion synthesisand control rather than motion prediction for video-based humanmotion tracking. In our experiments, we show that the user cancreate a desired animation with various forms of control input,such as the direction and speed of a running motion.

Our work is relevant to recent efforts on character animationand control using deep learning methods [29], [30]. For instance,Holden et al. [29] trained a convolutional autoencoder on a largemotion database and then learned a regression between highlevel parameters and the character motion using a feedforwardconvolutional neural network. In their more recent work, Holdenet al. [30] constructed a simple three layer neural network to mapthe previous character pose and the current user control to thecurrent pose, as well as the change in the phase, and appliedthe learned regression function for realtime motion control. Ourmodel is significantly different from theirs because we combineRNNs and adversarial training to learn a dynamic temporalfunction that predicts the probability distributions of the currentpose given the character poses and hidden variables in the past.Given the initial state of human characters, as well as the learnedgenerative model, we can randomly generate an infinite number

of high-quality motion sequences without any user input. Anotherdifference is that we formulate the motion synthesis and controlproblem in a Maximum A Posteriori (MAP) framework ratherthan regression framework adopted in their work. The goal of ourmotion synthesis is also different because we aim to generate high-quality human motion from various forms of animation inputs,including constraints from offline motion design, online motioncontrol and motion denoising while their methods are focused onfor realtime motion control based on predefined control inputs.

Lee et al. [31] presented an approach using LSTM whichcan handle foot-ground contacts and non-cyclic motions. Theyintroduced a loss term to penalize foot sliding in training, and adata augmentation method based on motion graphs. Our model isdifferent from theirs in two aspects. The first and most importantdifference is that their model takes control signal and motion fromprevious frame as input, while ours only relies on motion fromprevious frame. This difference in input probably makes the modeldifficult to model non-cyclic motions, which had also been foundby [23], [32], [33]. The second difference is that the output oftheir model is the character’s motion, while our generator givesthe distribution of the character’s motion. As the output of ourgenerator is a distribution but not certain motion, we cannot applythe strategy of Lee et al. [31]. Instead, we make use of adversarialtraining to refine the synthesized motions.

Our idea of using adversarial training to improve the qualityof synthesized motion from RNNs is motivated by the success ofusing an adversarial network to improve the realism of syntheticimages using unlabeled real image data [14]. Specifically, Shri-vastava and colleagues [14] proposed Simulated+Unsupervised(S+U) learning, where the task is to learn a model to improve therealism of a simulator’s output using unlabeled real data, whilepreserving the annotation information from the simulator. Theydeveloped a method for S+U learning that uses an adversarialnetwork similar to GANs, but with synthetic images as inputsinstead of random vectors. We significantly extend their idea forS+U learning for image synthesis to human motion synthesis byimproving the realism of human motion generated by RNNs withadversarial network using prerecorded human motion capture data.

3 OVERVIEW

Our goal herein is to learn generative networks from prerecordedhuman motion data and utilize them to generate natural-lookinghuman motions consistent with various forms of input constraints.The entire system consists of two main components:

Motion analysis. We describe a method for learning the deeplearning model from preprocessed motion data. To be specific,our first step is to learn a generative model based on RNNs withLSTM cells. Next, we train a refiner network using an adversarialloss such that the refined motion sequences are indistinguishablefrom real motion capture data using a discriminative network. Weembed contact information into the GANs to further improve theperformance of the refiner model.

Motion synthesis. We show how to apply the learned GANsto motion generation and control. We formulate the problem in anMAP framework. Given the initial state of human characters, aswell as the learned generative model, we find the most likely hu-man motion sequences that are consistent with control commandsspecified by the user and environmental constraints. We combinesampling-based methods with gradient-based optimization to findan optimal solution to the MAP problem. We discuss how to utilize

4

the contact information embedded in the generative model tofurther improve the quality of output animation. We adopt a similarMAP framework to transform unlabeled noisy input motion intohigh-quality animation.

We describe the details of each component in the next sections.

4 MOTION ANALYSIS

Our goal is to develop a generative model that formulates theprobabilistic distribution of the character state at the next framext+1 given the character state and hidden variables at the currentframe, denoted xt and ht , respectively. Mathematically, we wantto model the following probabilistic distribution:

p(xt+1|xt ,ht) (1)

In the following, we explain how to model the probabilisticdistribution using RNNs in Section 4.1 and how to train a refinernetwork using an adversarial loss such that the refined motionsequences are indistinguishable from real motion capture datain Section 4.2. Section 4.3 discusses how to utilize the contactinformation embedded in the generative model to further improvethe quality of output animation.

4.1 Generative RNN Model

In this section, we first explain our motion feature representationxt . Then, we give a brief introduction to RNNs and LSTM cellsand explain how to apply LSTM to generative motion modeling.In addition, we provide implementation details on how to stabilizetraining of the RNN model.

4.1.1 State Feature RepresentationEach motion sequence contains the trajectory for the absoluteposition and orientation of the root node (pelvis) as well as therelative joint angles of 18 joints. These joints are head, thorax,and left and right clavicle, humerus, radius, hand, femur, tibia,foot and toe. Let qt represent the joint angle pose of a humancharacter at frame t, it can be written as:

q =[tx ty tz rx ry rz θ2 · · · θd

]T (2)

where (tx, ty, tz) is the 3D position of the root joint, rx,ry, and rzare the joint angles of the root joint, and θ2, ...,θd are the jointangles of other joints.

To define the features of the character state, we choose to usethe relative rotation between the current frame and the previousframe for root rotation around the y-axis. We denote it 4ry,which represents the relative global rotation of the character. Ourtranslation features on the x- and z-axes are defined in the localcoordinates of the previous frame. The values of global rotationand translation features do not change when arbitrary rotationsare applied to the character or the root position of the characterchanges, which means that our feature representation is rotationinvariant and translation invariant.

The relationship between joint angle pose q and state featurex can be described as follows:

x =[4tx 4tz 4ry ty rx rz θ2 · · · θd

]T (3)

where the first three parameters 4tx, 4tz, and 4ry are globalfeatures, and ty,rx, and rz are the other three components of theroot joint.

Similarly, we can easily transform the state features x backto the corresponding joint angle pose q. As our global rotationand translation are related to the previous frame, we use ahomogeneous global transformation matrix to maintain the globalstate of the character and initialize it to the identity matrix. Forevery frame, we can obtain the local matrix of the frame fromfeatures and update the global transformation matrix by:

Mt+1 = MtMt+1,local

Mt+1,local =

Rot3×3(∆ry)∆tx0

∆tz0 0 0 1

(4)

where the right column of Mt+1 contains the global root positionon the x-z plane for frame t + 1. The rotation of all the joints canbe directly recovered from the motion features.

4.1.2 Motion Modeling with RNNsRNN is a class of neural networks that has been widely used tomodel dynamic temporal behaviors. It employs parameter sharingover time by using the same group of parameters for every frame.In our application, it takes the hidden states and current featuresas input and is trained to predict the probabilistic distribution ofthe features at the next frame. The hidden states in the RNNmodel carry the information about the history. This is why RNNsare suitable for handling long term temporal dependence. Thederivative of an RNN needs to be computed by Back PropagationThrough Time (BPTT) [34] which is similar to normal backpropagation methods except for its sequential structure. Similarto many other deep neural networks, RNNs also suffer from theproblem of vanishing gradient because the gradient flow must passthrough an activation layer in each frame and thus the magnitudeof the gradient decreases quickly over time. This prevents thenetwork from taking a relatively long history into account. LSTMcells [2] were introduced to address this challenge. In LSTM cells,the hidden state is divided into two parts: C and h. C carries thememory of the network, and h is the output of the network. Pleaserefer to the Appendix for more details of LSTM.

LSTM ensures that gradient flow in Ct no longer needs to passthrough activation functions. The vanishing gradient problem istherefore significantly reduced. As the LSTM model introduces anew variable Ct that carries the long term memory, the formulationof the problem can be rewritten as follows:

p(xt+1|xt ,ht ,Ct) (5)

The structure of our RNN motion model is shown in Fig 3. Topredict the distribution of the next frame, our output should notbe the state features themselves. Similar to [35], we model theprobabilistic distribution of the features in the next frame using aGaussian Mixture Model (GMM). The distribution of the GMMcan be written as follows:

p(xt+1) =M

∑i=1

wiN(xt+1|µi,σi) (6)

where the i-th vector component is characterized by a normaldistributions with weights wi, means µi and standard deviationsσi. To simplify the problem, we assume that the covariance matrixof every Gaussian is a diagonal matrix. The GMM model requires∑

Mi=1 wi = 1, wi > 0, and σi > 0. However, the output of our

network can be in (−∞,+∞). To bridge this gap, we define a

5

input(42)

FC1(500)

FC2(500)

Lstm1(500)

Lstm2(500)

FC3(500)

Output(425)

Fig. 3. The structure of our RNN model. It has three fully connectedlayers (FC1, FC2, and FC3) and two LSTM layers (LSTM1 and LSTM2).The numbers in the brackets are the width of the corresponding layer.

transformation between the network output w j, µi, j, and σi, j andthe GMM parameters wi,µi, j, and σi, j as follows:

wi =ewi

∑Mj=1 ew j

,µi, j = µi, j,σi, j = eσi, j (7)

where σi, j is the standard deviation of the j-th dimension in thei-th Gaussian. By this transformation, we ensure that the weightswi and the standard deviations of the Gaussian distribution σi, j arepositive, and ∑

Mi=1 wi = 1. M is set to 5 in our experiment.

The loss function for every frame can be written as:

argminE =−logp(xt+1|xt ,ht ,Ct)) (8)

During training, we compute ∂E∂ p from the GMM transforma-

tion and backpropagate the derivative to the top layers to obtainall the derivatives of the parameters. Our network is trained withRMSProp [36]. In our implementation, the derivative is clipped inthe range [-5, 5].

4.1.3 Model TrainingAs observed by other researchers [25], [26], RNN based gener-ative models often have difficulty obtaining good performance inlong term motion generation. They tend to fail when generatinglong sequences of motion, as the errors in their prediction are fedback into the input and accumulate. As a result, their long-termresults suffer from occasional unrealistic artifacts, such as footsliding, and gradually converge to a static pose. In the following,we summarize our strategies for training RNNs.

Adding noise into the training process. Because the predic-tion of every frame often has small error, there is a risk that theerror might accumulate over time and cause the model to crashat some point. Similar to [35], [37], we introduce independentidentically distributed Gaussian noise to the state features xt

during training in order to improve the robustness of our modelprediction. By adding noise to the network input, the network to belearned becomes more robust to noise in the prediction process.As a result, the learned model becomes more likely to recoverfrom small error in the prediction process. In our experiment, weset the mean and standard deviation of Gaussian noise to 0 and0.05, respectively.

Downsampling the training data. We find that downsamplingthe training data from 120fps to 30fps can improve the robustnessof the prediction model. We suspect that downsampling allowsthe RNN to handle longer term temporal dependence in humanmotions.

Optimization method. Optimization is critical to the perfor-mance of the learned model. We have found that RMSProp [36]performs better than SGD method adopted in [23].

The size of the training datasets. Training an RNN modelfrom scratch requires considerable training data. Insufficient train-ing data might result in poor convergence. We find that training aninitial model with large diverse datasets, (e.g., the CMU dataset)and then refining the model parameters using a smaller set of high-quality motion data can reduce the convergence error and lead toa better motion synthesis result.

The batch size is set to 20, and the time window size is set to50. Our learning rate is initialized to be 0.001, it is multipliedby 0.95 after every epoch. We train for 300 epochs. We findthat the loss converges after approximately 150 epochs, whichis approximately 30000 iterations.

4.1.4 Motion Generation with RNNsWe now describe how to generate a motion instance with thelearned RNN model. Given the character poses of the initialframes, q0 and q1, we first transform them into the features spacex0 and x1 as we described in Section 4.1.1. For every frame, giventhe current features xt and the current hidden states ht and Ct , weapply them as the input of the RNN model to obtain a GaussianMixture Model for the probabilistic distribution of the features atthe next frame (see Fig 3), the hidden states are updated at thesame time. Next, we sample from the Gaussian Mixture Modelto obtain an instance for the features at the next frame xt+1. Werepeat the process to generate a sequence of state features overtime: x0, x1, x2, ..., xn. Finally, we transform the state featuresback to the corresponding joint angle poses to create a motionsequence: q0, q1, q2, . . . , qn.

4.2 Adversarial Refiner Training

Starting from the motion features xRNN generated by the genera-tive RNN model G, our goal herein is to train a refiner networkR by using an adversarial loss such that the refined motion dataRθ (xRNN) are indistinguishable from the real motion data xrealusing a discriminative network D, where θ is the parameters ofthe refiner network.

To add realism to the motion generated by RNNs, we needto bridge the gap between the distributions of synthesized motiondata and real motion data. An ideal refiner will make it impossibleto classify a given motion sequence as real or refined withhigh confidence. This need motivates the use of an adversarialdiscriminator network, Dφ , which is trained to classify motionsas real vs refined, where φ is the parameters of the discriminatornetwork. The adversarial loss used in training the refiner networkis responsible for “fooling” the network into classifying the refined

6

motions as real. Following the GAN approach, we model this as atwo-player minimax game and update the refiner network Rθ andthe discriminator network Dφ alternately.

4.2.1 Generative Adversarial NetworksThe adversarial framework learns two networks (a generativenetwork and a discriminative network ) with competing losses.The generative model (R) (i.e., the refiner network) transformsthe motion generated by the RNN (xRNN) into the refined motionRθ (xRNN) and tries to “fool” the discriminative model D. The lossfunction for the generative model is described as follows:

argminθ

− log(Dφ (Rθ (xRNN))) (9)

where Dφ (x) is the probability of x being classified by the discrim-inative network as real data. This loss function wants to “fool” thediscriminative model, so that the refined motion Rθ (xRNN) shouldbe indistinguishable from real motion data. Here, we use the -logD loss rather than the original log(1-D) loss to avoid the earlygradient vanishing problem, especially when the discriminator istoo strong.

We define the loss function of the discriminative model asfollows:

argminφ

− log(1−Dφ (Rθ (xRNN)))− log(Dφ (xreal)) (10)

This loss function ensures that the learned discriminative model iscapable of distinguishing “real” or “refined” data.

In our implementation, we focus the refiner network on the fea-tures that are ignored by the generator model. The most importantfeatures lost in the RNN model are the positions and velocitiesof the end effectors. Therefore, we compute the velocities andpositions of the end effectors from the input x, denoted pend(x)and vend(x), respectively, and include them in the input for boththe generative and discriminative models.

Refiner network. Our generative model has two fully connect-ed layers and one LSTM layer (Fig 4). Mathematically, the refinernetwork learns a regression function Rθ that maps the input motionx as well as the positions and velocities of the end effectors to therefined motion Rθ (x):

Rinput =

xpend(x)vend(x)

Rθ (x) = (wr2 · lstm(relu(wr1 ·Rinput +br1))+br2)+x

(11)

where the network parameters θ include the weights of the LSTMlayer (Lstm1) and the weights and biases of two fully connectedlayers, including wr1 ,wr2 , br1 , and br2 .

Discriminative model. For the discriminator, we aim to builda classifier to distinguish the refined motions from the real mo-tions. We apply a bidirectional LSTM [38], which have beendemonstrated to be more powerful [38] than an LSTM, to modelthe discriminative model. The structure of the network is shownin Fig 5. Mathematically, it can be written as follows:

dinput =

[pend(x)vend(x)

]Dφ (x) = wd2 ·bilstmd(relu(wd1 ·dinput +bd1))+bd2

(12)

where the network parameters φ include the weights of thebidirectional LSTM layer and the weights and biases of two fullyconnected layers, including wd1 ,wd2 , bd1 , and bd2 .

Input(54)

FC1(500)

Lstm1(1000)

FC2(54)

Output(54)

+

Fig. 4. The structure of the GAN generator network. It consists of twofully connected layers (FC1 and FC2) and one LSTM layer (Lstm1). Thenumbers in the brackets are the width of the corresponding layer.

Inputframe 1

FC1(500)

Bilstm(500) Bilstm(500) …… Bilstm(500)

FC1(500)

Inputframe 2

FC1(500)

Inputframe n

FC2(2)

Output(1)

Fig. 5. The structure of the GAN discriminator network. It consists of twofully connected layers (FC1 and FC2) and one bidirectional LSTM layer(Bilstm). The numbers in the brackets are the width of the correspondinglayer.

Motion regularization. We add a regularization term to thegenerative loss function to ensure that the difference between theinput motion and refined motion is as small as possible. This leadsto the following generative loss function:

argminθ

E =−log(Dφ (Rθ (xRNN)))+λ ||root(x)− root(Rθ (x))||2

(13)where root(x) and root(Rθ (x)) are the root positions of the inputmotion x and the refined motion Rθ (x), respectively. In addition,the weight λ controls the importance of the regularization term.In our experiment, λ is set to 20.

4.2.2 Adversarial Training DetailsAdversarial training is hard because of the competition betweenthe generative and discriminative networks. It deteriorates fairlyeasily when one of the two models is too strong. Our trainingstrategies for adversarial training are summarized as follows:

Training the generative model more. We have found that ifthe two networks are trained equally in a cycle, the discriminatoroften dominates the generator, leading to crashing of the training.

7

In our practice, the generative model and discriminative model areeach updated 75 times in the beginning. After that, we update thegenerative model five times and the discriminative model once atevery step.

Using a history of refined motions. Another problem ofadversarial training is that the discriminator network only focuseson the latest refined motions. The lack of memory may cause (i)false divergence of the adversarial training, and (ii) the refinernetwork reintroducing the artifacts that the discriminator hasforgotten. To solve this problem, we update the discriminator usinga history of refined motion rather than only the ones generatedby the current network. To this end, similar to [14], we improvethe stability of adversarial training by updating the discriminatorusing a history of refined motions, rather than only the ones inthe current mini-batch. We slightly modify the algorithm to havea buffer of refined motions generated by previous networks. Let Bbe the buffer size, and b be the batch size. We generate B fake datain the very beginning. At each iteration of discriminator training,we sample b

2 motions from the current refiner network, and samplean additional b

2 motions from the buffer to update the parametersof the discriminator. After each training iteration, we randomlyreplace b

2 samples in the buffer with the newly generated refinedmotions. In practice, B is set to 320, and b is set to 32.

Adjusting the training strategy when one of the modelsis too strong. During training, we find that sometimes the dis-criminative model is easily trained to be too strong such thatthe generative model is nearly broken. To balance the training,we multiply the iteration times of the generative model by 2when the discriminator’s softmax loss < 0.01. In contrast, wedivide the iteration times of the generative model by 2 whenthe discriminator’s softmax loss > 1. We find this to be a usefulstrategy to avoid unstable GAN training. This approach is similarto that of [15].

Both the generative and discriminative models are trained byRMSProp [36]. We set the learning rate of the refiner to be 0.002and the learning rate of the discriminator to be 0.005. The decayrate of RMSProp is set to 0.9.

4.2.3 Motion Generation with Adversarial TrainingAfter the refiner network is trained, we can combine it withthe generative RNN model for motion synthesis. To achieve thisgoal, we first choose an initial state that we then use along withthe RNN generative model (G) to randomly generate a motionsequence as described in section 4.1.4. Next, we compute thevelocities and positions of the end effectors for every frame of thegenerated motion sequence. We augment the motion features withthe velocities and positions of the end effectors and input theminto the refiner network (R) to obtain the refined motion features.In the final step, we transform the refined motion features back tothe corresponding joint angle poses to form the output animation.

4.3 Contact-aware Motion Model

In this section, we describe how to embed the contact informationinto our generative model to further improve the quality ofthe generated motion. To this end, we first describe our semi-automatic contact labeling. We start from computing the speed ofthe left and right toes for all the training motion clips. Then, weselect a speed threshold (0.45 m/s in practice) for judging the footcontact. For the frames in which the speed of the left/right toe isbelow the threshold, we label the left/right foot contact state as 1

(a)

(b)

Fig. 6. A comparison between (a) motion generated by the RNN and (b)motion generated by the refinement network. The GAN model eliminatesthe foot sliding problem and obtains a more robust performance.

for the current frame; otherwise, the foot contact state is labeledas 0. After the automatic labeling, we manually check the resultto avoid label mistakes caused by noisy motion data.

In our application, we encode the contact information into a2x1 binary vector c = [cl ,cr]

T . The first bit represents if the leftfoot of the character is on the ground, and the second bit representsif the right foot is on the ground. We augment the motion featureswith contact information for RNN motion modeling. Our motionfeature can be written as:

q =[tx ty tz rx ry rz θ2 · · · θd cl cr

]T (14)

In addition, we also augment both the network input and thenetwork output with the contact vector in adversarial training.Contact awareness further improve the quality of the generatedmotion. Another advantage of contact-awareness is to automatical-ly label every frame of the generated motion with contact informa-tion. This allows us to enforce environmental contact constraintsin a motion generalization process, thereby eliminating noticeablevisual artifacts, such as foot sliding and ground penetration in theoutput animation. Fig 6 shows a comparison before and after therefinement.

8

5 MOTION SYNTHESIS AND CONTROL

In this section, we demonstrate the power of our GAN motionmodel for various applications, including random motion gener-ation, offline motion design, online motion control and motiondenoising.

5.1 Random Motion Generation

To achieve accurate motion control, we require a generative modelthat can generate various motions. We first show our motion’sability to generate different motions. As mentioned in 4.1.4 and4.2.3, our model can sample various motions from any initialframe. Similar to our approach in training, we allow the input ofthe network to have a small noise. Experiment shows that this canexpand the generative ability of the model. This means that we canhave network noise d1, d2,...,dn as another group of variables inaddition to motion features. The generated motion is foot contact-aware because of the contact awareness of our model.

As our model is foot contact-aware, we introduce a postpro-cessing procedure for offline applications after motion generation.We can extract contact information from generated motion fea-tures. We apply an inverse kinematic technique to ensure that thespeed of the contact points is zero, and all the contact points arelocated on the corresponding contact plane. This postprocessingis used for the result of the random motion generation and theoffline motion design in our demo. If the contact information inthe training data is noisy or the model generates wrong contactlabel, the postprocessing may lead to an unnatural gait type.

We show a random motion generation example for stylizedmotions in Fig 7.

5.2 Offline Motion Design

Our generative model is also well suited for generating natural-looking human motion consistent with user-defined input. Weformulate the motion control problem in an MAP estimationframework. We treat the given constraints as the observation froman unknown motion sequence. To ensure that we obtain a motionsequence with high quality even when the given constraints areimpossible to satisfy, we assume that the observations containnoise. We use the trained motion model as the motion prior. Ourgoal is to find a sequence of motion vectors s={si,i=1,2,...T} thatis most likely to appear. This means that we want to maximize theprobability P(s,d|s0,c,e). According to Bayes’ rule, we have:

argmaxs,d

P(s,d|s0,c,e)

= argmaxs,d

P(s,d|s0)P(c,e|s0,s,d)P(c,e|s0)

∝ argmaxs,d

P(s,d|s0)P(c|s0,s,d)P(e|s0,s,d)

(15)

The first term is the prior term, it can be written as:

p(s,d|s0) = P(d1,d2, ...,dT )T

∏i=1

P(si|s0,s1, ...,si−1,d0, ...,di−1)

(16)

The first part of the prior term is Gaussian noise, it obeys aGaussian distribution, and the prior for the noise is:

P(d1,d2, ...,dn) =n

∏i=2

e− d2

i2σ2

noise (17)

(a)

(b)

(c)

Fig. 7. The random motion generation results of (a) walking motion,(b) running motion. (c) motion with different styles from the same initialframe. The yellow circles show the foot contact label.

Here, σnoise is set to 0.05 by experiment. The second part of theprior is the probability from the RNN-based GAN models.

Control term. The second term of equation (15) is the controlterm. Our model enables the user to accurately control a characterat the kinematic level. Low-level kinematic control is importantbecause it allows the user to accurately control motion variationsof particular actions. Our motion control framework is very flexi-ble and supports any kinematic control inputs. The current systemallows the user to control an animation by selecting a point (e.g.,root) on the character and specifying a path for the selected pointto follow. The user could also direct the character by defining thehigh-level control knobs, such as turning angles, step sizes, andlocomotion speeds. More specifically, we allow the user to controlthe root path of an animated character. For root joint projection onthe ground for every frame root(si), we first find the nearest pointscnear,i on the curve. We assume that there is Gaussian noise with astandard deviation of σ f it for the user’s control inputs c. Then, wecan define the likelihood of fitting as follows:

pcurve ∝

T

∏i=1

e−‖root(si)−cnear,i‖2

σ2f it (18)

The standard deviation of the fitting term σ f it indicates the

9

preference of the user’s fitting accuracy. The smaller it is, themore attention the controller pays to the fitting accuracy. When thegiven curve is not achievable and σ2

f it is too small, the synthesizedmotion would be strange. In our experiment, σ2

f it is set to 0.5.Contact awareness term. The last term is the contact aware-

ness term. Due to the contact awareness of our model, our generat-ed motion is automatically annotated with contact information. Wefirst retrieve contact information from the network output. Then,for each frame, if there is a contact between the character andthe environment, we measure two distances: the first one is thedistance between the synthesized contact point on the characterin the current frame and the previous frame; the second one isthe point plane distance between the synthesized contact pointon the character and the corresponding contact plane. We assumeGaussian distributions with standard deviations of σcon for theadjacent frames constraint of the contact-awareness term andσcon,y for the point-plane constraint of the contact-awareness term.Then, the contact awareness term can be written as:

Pcontact = e−‖ f (si−1) f oot − f (si) f oot‖2

σ2con

· e−‖n · ( f (si−1) f oot − pplane)‖2

σ2con,y

(19)

Because the original probability is hard to evaluate, we trans-form it to its -log form, so the overall loss function is

mins1,s2,...,sn,d1,d2,...,dn

E =−log(Pprior ·Pnoise ·Pcurve ·Pcontact) (20)

This loss function is nonconvex because of the complexity ofthe RNN. To solve this optimization problem, we combine thepower of sampling and gradient-based optimization to achieve agood result in an acceptable time. Given a starting frame and auser-defined curve, our experiment shows that synthesizing themotion sequence frame-by-frame will not yield good results. Thisoccurs because in an RNN model, the performance in every frameis highly related to previous frames. Therefore, we use spatialtemporal optimization to make the generated motion sequencesmooth.

In the optimization, the parameters of the neural network arefixed. The optimizer tries to find the solution that best fits theuser-defined constraints while satisfying the motion distributionpredicted from the neural network and minimizing the noise.

We solve the problem through the sliding windows method.We optimize for a sequence of B frames each time, which we calla “window”. For adjacent windows, they share an overlap of bframes. For the initialization of the optimization, the overlappedpart with the last window is set to the optimization result from lastwindow, and the other part is sampled by our motion model. Weset B and b to 34 and 17, respectively.

Because of the complexity of the loss function, the gradientdescent method can be very slow to converge and sometimesleads to local minima. We combine the sampling-based methodand gradient-based method to solve the problem. We first applyparticle swarm optimization [39] for a few iterations to find agood initialization and then switch to gradient-based optimizationto obtain a more precise result. We compute the gradient ∂E

∂ s and∂E∂d by the chain rule in the optimization.

As our motion features are based on the global motion states ofthe previous frame (we use 4ry, 4tx, 4tz as part of the features),our jacobian matrix for the constraint term and foot contact termis more complex than in previous approaches. Therefore, we

analytically evaluate it by BPT T . Because of the RNN prior term,our problem is not in the least squares form. We apply the LBFGS[40] method to solve the problem.

(a)

(b)

Fig. 8. The result of offline motion control. The red line on the groundis the user input constraint, the yellow line is the root trajectory for thesynthesized motion. We draw a few keyframe of the synthesized motionhere. (a): a walking result, (b): a running result. Our results are best seenin video form.

5.3 Online Motion ControlIn addition to offline motion design, our model is also suited foronline motion control. We allow the user to control the speed anddirection of the character (see Fig 9). Similar to the offline case,we also model the problem in an MAP framework.

argmaxs,d

P(s,d|s0)P(c|s0,s,d)P(e|s0,s,d) (21)

We still have the RNN prior term, the control term and thecontact awareness term. One difference lies in the control term,where we need to address direction and speed constraints fromthe user. We also perform the optimization in a sequence of nframes each time. As a result, the control response has a latency.Here, we set n to 5. As the control term is not the same, weformulate the corresponding loss function for the new controlterms. Similarly, our realtime motion control system allows foraccurate and precise motion control at the kinematic level. Ourcurrent implementation allows the user to control the speed anddirection of locomotion, such as walking and running. The onlineconstraints includes speed control and direction control, which canbe written as:

P(c|s0,s,d) = P(cspeed |s0,s,d)P(cdirection|s0,s,d) (22)

We address speed and orientation online control in the following.Speed control. We allow the user to give a speed control to

the character. As the speed of the character naturally changes withtime, we constrain the character’s average speed over the windowas close as possible to the given speed control. Assuming that the

10

control result is a Gaussian distribution around the control inputwith standard deviation σspeed , we have:

P(cspeed |s0,s,d) ∝ e

|| 1T ∑

Ti=1 speedi−cspeed‖

2

σ2speed (23)

Direction control. The user can also control the movingdirection of the character. We define the facing direction of thecharacter as the angle around the y-axis of the root joint. Wemeasure the difference between the direction of the last frameand the control input. We assume a Gaussian distribution for thecontrol input:

P(cdirection|s0,s,d) ∝ e||directT−cdirection‖2 (24)

where directT is the direction of the last frame in the batch, andcdirection is the direction control input. The optimization problemis also solved by gradient-based optimization. Using GPU acceler-ation by a GTX970 graphics card, we have an average frame rateof 30 fps.

Fig. 9. Online motion control. We have speed and direction controlhere. The control speed is shown on the left-top of the screen, and thedirection is shown by the arrow on the ground. Our results are best seenin video form.

5.4 Motion DenoisingMotion denoising takes a noisy and unlabeled motion sequenceas input and generates a natural-looking output motion that isconsistent with the input motion (see Fig.10). We formulate theproblem as an MAP problem:

argmaxs,d

P(s,d|s0)P(c|s0,s,d)P(e|s0,s,d) (25)

The first prior term and the third environment contact constraintterm is the same as those in section 5.2. We model the secondconstraint term in the following.

We assume that the noise consists of two parts: the noise ofthe root position and the noise of the joint angles. We model bothtypes of noise in the original motion as Gaussian noise. Therefore,we have:

P(c|s0,s,d) ∝ e(croot−root(si))

2

2σ2root · e

∑nj=1 ||angle j(c)−angle j(si)||2

2σ2angle (26)

where croot and cangle are the root position and joint angles of theoriginal motion, and root(si) and angle(si) are the root positionand joint angles of the filtered motion. This optimization problemis also solved by LBFGS [40].

Initial state estimation. There is a disadvantage for a tra-ditional RNN-based motion model in the first frame. When weperform motion synthesis, the RNN output would have hopping

occur between the first frame and the second frame. For mostapplications, we can simply discard the first frame to avoid theproblem. However, certain scenarios (e.g., motion filtering) requireus to solve the problem. The traditional RNN-based sequencesynthesis method starts the synthesis with the zero state [3] [23].However, in training, the input states for most frames are nonzero.The network is trained to fit the next frame with previous framefeatures and the appropriate hidden state. Therefore, when thehidden state of the network is set to zero in the first frame,the network is not sufficiently trained to address this situationand may produce wrong synthesis result. To solve the problem,we estimate the appropriate hidden state for the first frame byour RNN motion model. The appropriate hidden state makes themotion in the second frame very likely to appear in the estimatedmotion distribution, and the loss function is:

minh1

E =−logM

∑i=1

wiP(s2|s1,h1) (27)

where h1 is the hidden state for the first frame. We estimate theinitial hidden state by gradient-based optimization. We computethe gradient ∂E

∂hby backpropagation and use the gradient descent

method to perform the optimization. Given appropriate initialstates, the generated motion would be smooth between the firstframe and second frame.

frame index0 20 40 60 80 100

join

t ang

le

-0.15

-0.1

-0.05

0

0.05

0.1

0.15

0.2

dataset motionnoisy motion

(a)

frame index0 20 40 60 80 100

join

t ang

le

-0.15

-0.1

-0.05

0

0.05

0.1

dataset motionfiltered motion

(b)

Fig. 11. A comparison of one joint angle for the motion before and afterthe filtering and a similar motion in the training dataset. The filteredmotion is more smooth and similar to the dataset motion.

6 RESULTS

We have demonstrated the power and effectiveness of our modelin motion generation, motion control and motion denoising. Tothe best of our knowledge, this is the first generative deep learningmodel capable of generating an infinite number of high-quality

11

(a)

(b)

Fig. 10. A comparison of motion before and after filtering. The trajectories of the root joint and foot joints are shown in yellow. (a) the original noisymotion. (b) the motion after applying filtering. The joint trajectory of the motion before filtering is jerky and outlier poses are circled in red. This resultis best seen in video form.

motion sequences with infinite lengths. The user studies showthat motions generated by our model are comparable to motioncapture data obtained by ten Vicon cameras. In addition, we havedemonstrated the superiority of our model over a baseline RNNmodel. Our results are best seen in video form.

We have captured walking and running data of a single actorfor 525 motion sequences. These motion sequences vary in speed,step length, and turning angle. In addition, we use CMU dataset toperform pretraining. Our model is relatively small because we donot need to preserve the original motion sequence once the modelis trained. Our generative model (41 Mb) is much smaller thanthe size of the original training datasets (133 Mb). The originaldataset contains 419488 frames (58.26 minutes) of walking andrunning data that vary in speed, step size, and turning angles.When we acquire new training data, we do not always need totrain the model from scratch. If the total amount of data does notexceed the capacity of the network, we can utilize the previousnetworks as a pretrained model and fine-tune the model with thenew training data instead.

Random Motion Generation. This experiment demonstratesthat our model can generate an infinite number of high-qualityanimation sequences with infinite length. We select a certain start-ing frame and then generate 100 motions following the methodwe described in 4.2.3. The results show that the generated motionvaries in step size, speed, and turning angle. The results can beenseen in Fig 1. To demonstrate that our method can be used tomodel various motions, we also train our model on a stylizedmotion dataset [41]. The result shows that our model samplesvarious motions from the same initial frame. We show the resultin Fig.7. This result is best seen in the supplementary video.

Offline Motion Design. The user can generate a desiredanimation that is consistent with the control input. This experi-ment shows that our model can generate a desired animation byspecifying the projection of the root trajectories on the ground.In the accompanying video, we show the process of creatingdesired, natural-looking animations with both hand-drawn curvesand predefined curves. The results can be seen in Fig 8.

Online Motion Control. The online motion control systemoffers precise, realtime control over human characters, includingthe speed and direction of walking and running. The accompany-ing video shows that the character can make a sharp turn, e.g.,-180 degrees or +180 degrees, or speed up and down with littlelatency. A result can be seen in Fig 9.

Motion Denoising. The accompanying video shows a side-by-side comparison between input noisy motion and output motionafter denoising for both walking and running. In both cases, theinput motion appears very jerky and contains significant foot-sliding artifacts and occasional outlier poses. It is hard to judgewhen and where foot contact occurs in the input motion. Our mo-tion denoising algorithm automatically identifies the foot contactframes and outputs high-quality motion that closely matches theinput motion. One joint angle of the motion before and after thefiltering can been seen in Fig 11. The keyframes of the motionbefore and after motion filtering can be seen in Fig 10.

The advantage of combining RNNs and adversarial train-ing. The accompanying video shows a side-by-side comparisonof generated motions from RNNs with and without adversarialtraining. For both walking and running cases, the results fromRNNs alone are occasionally jerky and contain frequent foot-sliding artifacts, while the combined model produces highly real-istic motion without any noticeable visual artifacts. This can alsobe seen in Fig 6.

Comparison between our method and PFNN [30]. Weevaluate the effectiveness of our method by comparing againstPFNN [30]. Our comparison focuses on offline motion design.Although PFNN is designed an online algorithm, the authorsalso show results of following predefined paths. The purposeof this experiment is to demonstrate that our optimization-basedalgorithm achieves a more accurate result than online algorithm.As our model is a contact-aware model, our result also has higherquality.

In offline motion design, we select a predefined curve as con-straint and then synthesize motions by the two models separately.For our model, the method described in Section 5.2 is used to

12

synthesize a motion that best fits the given constraints whilesatisfying the motion prior from the neural network. For PFNN,the given constraints are used as the trajectory of the networkinput. We use the code and trained model from the authors forPFNN [30] in the experiment. We extract the contact informationfor walking motion clips in the dataset of PFNN, then train ourmodel on these motion clips for the experiment.

We show the given curves and the root trajectories of thesynthesized motions in Fig.12. The difference between the themare measured by mean square error (MSE) between the roottrajectory of the synthesized motion to the given curve. The resultis shown in Table1 and Fig12. The results show that our methodachieves a more accurate result.

(a) (b)

Fig. 12. Comparison against PFNN in offline motion design: (a) resultfrom our method; (b) result from PFNN. In both figures, the given curveis shown in red, and the synthesized root trajectory is in yellow. Fromtop to bottom are the circle curve and sine curve.

TABLE 1MSE error in cm for our method and pfnn.

method circle curve sine curveours 1.8428 1.2704

PFNN 2.4764 2.0886

Motion quality. We evaluate the quality of synthesized mo-tions via user studies. Specifically, we compare the quality ofour synthesized motions against high-quality motion capture data(“Mocap”) and those generated from RNNs (“RNN”). We imple-ment the “RNN” synthesis method based on the algorithm de-scribed in Section 4.1.4. “Mocap” represents high-quality motioncapture training data captured by ten Vicon [42] cameras. “GAN1”and “GAN2” represent motions generated by a combination of ourRNNs and the refiner network with and without automatic footcontact handling, respectively. Note that our contact-aware deeplearning model can automatically label every frame of the gener-ated motion with contact information. This information allows usto automatically enforce environmental contact constraints on theoutput motion.

We have evaluated the quality of motions on 25 users, in-cluding males and females. Most of the users have no previousexperience with 3D animation. For both running and walking,we randomly generate seven animation sequences from eachalgorithm along with seven animation sequences randomly chosenfrom the mocap database. We render animations on a stick figure,similar to Fig 9. Then, we randomly organize all the animationclips. We ask the users to watch each video and provide a scoreof how realistic the motion is, ranging from 1 (“least realistic”)

RNN GAN1 GAN2 Mocap

scor

e

1

2

3

4

5

motion quality from user study(walking)

RNN GAN1 GAN2 Mocap

scor

e

1

2

3

4

5

motion quality from user study(running)

Fig. 13. Comparisons of motion quality. We ask the users to give a score(1-5) of how realistic the synthesized motions are. This graph shows theresults of these scores, including the means and standard deviations formotions generated via our methods with and without contact handling(“GAN1” and “GAN2”), a baseline RNNs method (“RNN”), and high-quality motion capture data from ten Vicon cameras (“Mocap”).

to 5 (“most realistic”). We report the mean scores and standarddeviations for the motions generated by each method, see Fig13. The user studies show that the motions generated by ourmethod (“GAN2”) are highly realistic and comparable to thosefrom the mocap database (“Mocap”). The evaluation also showsthe advantage of combining RNNs and adversarial training forhuman motion synthesis, as both “GAN1” and “GAN2” producemore realistic results than “RNN”. In addition, the score differencebetween “GAN1” and “GAN2” shows the advantage of embed-ding contact information into the deep learning model for humanmotion modeling and synthesis.

7 CONCLUSION AND FUTURE WORK

We have introduced a generative deep learning model for humanmotion synthesis and control. Our model is appealing for humanmotion synthesis because it is generative and can generate aninfinite number of high-fidelity motion sequences to match variousforms of user-defined constraints. We have demonstrated thepower and effectiveness of our models by exploring a variety ofapplications, ranging from random motion synthesis to offline andrealtime motion control, and motion denoising.

We have shown that motions generated by our model arealways highly realistic. One reason is that our generative modelcan handle both the nonlinear dynamics and long term temporaldependence of human motions. Our model is compact because wedo not need to preserve the original training data once the modelis trained. Our model is also contact aware and embedded withcontact information, thereby removing unpleasant visual artifactsoften present in the motion generalization process.

We formulate the motion control problem in a maximum aposteriori (MAP) framework. The MAP framework provides aprincipled way to balance the tradeoff between user constraintsand motion priors. In our experiments, the input constraints frommotion control might be noisy and could even be unnatural.When this occurs, the system prefers to generate a “natural-looking” motion that “best” matches the input constraints ratherthan generating the “best” possible motion that “exactly” matchesthe user constraints.

13

We have tested the model on a walking and running dataset.In the future, we plan to test the model on aperiodic motions, suchas jumping and dancing. Currently, our model does not achievea good result for aperiodic motions. One possible solution iscombining our model with other systems that can handle complexmotions, e.g., [28]. We also plan to test the model on a het-erogeneous motion database, such as walking, running, jumpingand their transitions. Our system achieves realtime control (30fps) on a GTX970 card; however, it is still not fast enough formobile applications. Therefore, one direction for future work isto speed up the system via network structure simplification andmodel compression.

We show that our generative deep learning motion model canbe applied for motion generation, motion control and filtering. Inthe future, one possibility is to model heterogeneous motions withit. We believe that the model could also be leveraged for manyother applications in human motion analysis and processing, suchas video-based motion tracking, motion recognition, and motioncompletion. One of the immediate directions for future work is,therefore, to investigate the applications of the models to humanmotion analysis and processing.

ACKNOWLEDGEMENT

This work was supported by the National Natural Science Foun-dation of China No. 61772499 and Natural Science Foundation ofBeijing Municipality No. L182052.

APPENDIX

σ

X

concat

σ tanh σ

tanh

X

+

X

Ct-1

ft it otgtht-1

Ct

ht

xt

ht

Fig. 14. The structure of LSTM.

LSTM cells divide hidden states into two parts: h is sensitiveto short term memory, while C carries long term memory. Mathe-

matically, it is:

(i, f ,o,g)t =Wi, f ,o,g ·

Ct−1ht−1

xt

+bi, f ,o,g

it = σ(it)

ft = σ( ft)

ot = σ(ot)

gt = tanh(gt)

Ct = ft ·Ct−1 + it · gt

ht = ot · tanh(Ct)

(28)

where σ is the sigmoid function σ(x) = 11+e−x , and tanh(x) =

1−e−x

1+e−x . ft is called the forget gate. When ft is rather small, all thehistorical memory of the LSTM cell will be lost. This is usefulwhen we do not want to maintain the memory any more. it iscalled the input gate. New information is acquired by the longterm memory C through this gate. ot is called the output gate. Itinfluences the output of the LSTM cell.

REFERENCES

[1] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley,S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” inAdvances in neural information processing systems, 2014, pp. 2672–2680.

[2] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neuralcomputation, vol. 9, no. 8, pp. 1735–1780, 1997.

[3] A. Graves, “Sequence transduction with recurrent neural networks,”arXiv preprint arXiv:1211.3711, 2012.

[4] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition withdeep recurrent neural networks,” in Acoustics, speech and signalprocessing (icassp), 2013 ieee international conference on. IEEE, 2013,pp. 6645–6649.

[5] T. Mikolov, M. Karafiat, L. Burget, J. Cernocky, and S. Khudanpur,“Recurrent neural network based language model.” in Interspeech, vol. 2,2010, p. 3.

[6] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learningwith neural networks,” in Advances in neural information processingsystems, 2014, pp. 3104–3112.

[7] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation byjointly learning to align and translate,” arXiv preprint arXiv:1409.0473,2014.

[8] K. Gregor, I. Danihelka, A. Graves, D. J. Rezende, and D. Wierstra,“Draw: A recurrent neural network for image generation,” arXiv preprintarXiv:1502.04623, 2015.

[9] L. A. Gatys, A. S. Ecker, and M. Bethge, “A neural algorithm ofartistic style,” CoRR, vol. abs/1508.06576, 2015. [Online]. Available:http://arxiv.org/abs/1508.06576

[10] C. B. Choy, D. Xu, J. Gwak, K. Chen, and S. Savarese,“3d-r2n2: A unified approach for single and multi-view 3d objectreconstruction,” CoRR, vol. abs/1604.00449, 2016. [Online]. Available:http://arxiv.org/abs/1604.00449

[11] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. P. Aitken,A. Tejani, J. Totz, Z. Wang, and W. Shi, “Photo-realisticsingle image super-resolution using a generative adversarialnetwork,” CoRR, vol. abs/1609.04802, 2016. [Online]. Available:http://arxiv.org/abs/1609.04802

[12] P. Isola, J. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translationwith conditional adversarial networks,” computer vision and patternrecognition, pp. 5967–5976, 2017.

[13] M. Mirza and S. Osindero, “Conditional generative adversarial nets.”arXiv: Learning, 2014.

[14] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb,“Learning from simulated and unsupervised images through adversarialtraining,” arXiv preprint arXiv:1612.07828, 2016.

[15] D. Berthelot, T. Schumm, and L. Metz, “Began: Boundary equilibriumgenerative adversarial networks,” arXiv preprint arXiv:1703.10717, 2017.

[16] M. Brand and A. Hertzmann, “Style machines,” in Proceedings of ACMSIGGRAPH 2000, 2000, 183–192.

14

[17] Y. Li, T. Wang, and H.-Y. Shum, “Motion texture: A two-level statisticalmodel for character synthesis,” in ACM Transactions on Graphics, 2002,21(3):465–472.

[18] J. Chai and J. Hodgins, “Constraint-based motion optimization using astatistical dynamic model,” in ACM Transactions on Graphics, 2007,26(3):Article No.8.

[19] M. Lau, Z. Bar-Joseph, and J. Kuffner, “Modeling spatial and temporalvariation in motion data,” in ACM Transactions on Graphics, 2009, 28(5):Article No. 171.

[20] J. Min, Y.-L. Chen, and J. Chai, “Interactive Generation of HumanAnimation with Deformable Motion Models,” ACM Transactions onGraphics, 2009, 29(1): article No. 9.

[21] X. Wei, J. Min, and J. Chai, “Physically valid statisticalmodels for human motion generation,” ACM Trans. Graph.,vol. 30, pp. 19:1–19:10, May 2011. [Online]. Available:http://doi.acm.org/10.1145/1966394.1966398

[22] J. Min and J. Chai, “Motion graphs++: A compact generative modelfor semantic motion analysis and synthesis,” ACM Trans. Graph.,vol. 31, no. 6, pp. 153:1–153:12, Nov. 2012. [Online]. Available:http://doi.acm.org/10.1145/2366145.2366172

[23] K. Fragkiadaki, S. Levine, P. Felsen, and J. Malik, “Recurrent networkmodels for human dynamics,” in Proceedings of the IEEE InternationalConference on Computer Vision, 2015, pp. 4346–4354.

[24] A. Jain, A. R. Zamir, S. Savarese, and A. Saxena, “Structural-rnn: Deeplearning on spatio-temporal graphs,” CVPR, 2016.

[25] J. Martinez, M. J. Black, and J. Romero, “On human motion predictionusing recurrent neural networks,” CoRR, vol. abs/1705.02445, 2017.[Online]. Available: http://arxiv.org/abs/1705.02445

[26] I. Habibie, D. Holden, J. Schwarz, J. Yearsley, and T. Komura, ARecurrent Variational Autoencoder for Human Motion Synthesis, 7 2017.

[27] J. Butepage, M. J. Black, D. Kragic, and H. Kjellstrom,“Deep representation learning for human motion prediction andclassification,” CoRR, vol. abs/1702.07486, 2017. [Online]. Available:http://arxiv.org/abs/1702.07486

[28] Y. Zhou, Z. Li, S. Xiao, C. He, Z. Huang, and H. Li, “Auto-conditionedrecurrent networks for extended complex human motion synthesis,” inInternational Conference on Learning Representations, 2018. [Online].Available: https://openreview.net/forum?id=r11Q2SlRW

[29] D. Holden, J. Saito, and T. Komura, “A deep learning frameworkfor character motion synthesis and editing,” ACM Trans. Graph.,vol. 35, no. 4, pp. 138:1–138:11, Jul. 2016. [Online]. Available:http://doi.acm.org/10.1145/2897824.2925975

[30] D. Holden, T. Komura, and J. Saito, “Phase-functionedneural networks for character control,” ACM Trans. Graph.,vol. 36, no. 4, pp. 42:1–42:13, Jul. 2017. [Online]. Available:http://doi.acm.org/10.1145/3072959.3073663

[31] K. Lee, S. Lee, and J. Lee, “Interactive character animation by learningmulti-objective control,” in SIGGRAPH Asia 2018 Technical Papers.ACM, 2018, p. 180.

[32] A. Jain, A. R. Zamir, S. Savarese, and A. Saxena, “Structural-rnn:Deep learning on spatio-temporal graphs,” in Proceedings of the IEEEConference on Computer Vision and Pattern Recognition, 2016, pp.5308–5317.

[33] J. Martinez, M. J. Black, and J. Romero, “On human motion predictionusing recurrent neural networks,” arXiv preprint arXiv:1705.02445, 2017.

[34] P. J. Werbos, “Generalization of backpropagation with application to arecurrent gas market model,” Neural networks, vol. 1, no. 4, pp. 339–356,1988.

[35] A. Graves, “Generating sequences with recurrent neural networks,” arXivpreprint arXiv:1308.0850, 2013.

[36] T. Tieleman and G. Hinton, “Lecture 6.5-rmsprop: Divide the gradientby a running average of its recent magnitude,” COURSERA: Neuralnetworks for machine learning, vol. 4, no. 2, pp. 26–31, 2012.

[37] G. W. Taylor and G. E. Hinton, “Factored conditional restricted boltz-mann machines for modeling motion style,” in Proceedings of the 26thannual international conference on machine learning. ACM, 2009, pp.1025–1032.

[38] M. Schuster and K. K. Paliwal, “Bidirectional recurrent neural networks,”IEEE Transactions on Signal Processing, vol. 45, no. 11, pp. 2673–2681,1997.

[39] J. Kennedy, “Particle swarm optimization,” in Encyclopedia of machinelearning. Springer, 2011, pp. 760–766.

[40] D. C. Liu and J. Nocedal, “On the limited memory bfgs method forlarge scale optimization,” Mathematical programming, vol. 45, no. 1, pp.503–528, 1989.

[41] S. Xia, C. Wang, J. Chai, and J. Hodgins, “Realtime style transferfor unlabeled heterogeneous human motion,” ACM Trans. Graph.,

vol. 34, no. 4, pp. 119:1–119:10, Jul. 2015. [Online]. Available:http://doi.acm.org/10.1145/2766999

[42] Vicon, “http://www.vicon.com,” 2015.

Zhiyong Wang Zhiyong Wang received a BScdegree in automation from Tsinghua University(THU), China, in 2011. He is currently workingtoward a Ph.D. degree in computer science atthe University of Chinese Academy of Science,supervised by Prof. Shihong Xia.

Jinxiang Chai Jinxiang Chai received a Ph.D.degree in computer science from Carnegie Mel-lon University (CMU). He is currently an asso-ciate professor in the Department of ComputerScience and Engineering at Texas A&M Uni-versity. His primary research is in the area ofcomputer graphics and vision, with broad appli-cations in other disciplines such as virtual andaugmented reality, robotics, human computerinteraction, and biomechanics. He received anNSF CAREER award for his work on theory and

practice of Bayesian motion synthesis.

Shihong Xia Shihong Xia received a Ph.D. de-gree in computer science from the University ofChinese Academy of Sciences. He is currently aprofessor of the Institute of Computing Technol-ogy, Chinese Academy of Sciences (ICT, CAS),and the director of the human motion laboratory.His primary research is in the area of computergraphics, virtual reality and artificial intelligence.

Documents

1 Combining Recurrent Neural Networks and Adversarial Training …humanmotion.ict.ac.cn/papers/2019P3_RNN/paper.pdf · 2019-11-18 · 1 Combining Recurrent Neural Networks and Adversarial