
Learning Sampling Distributions for Robot Motion Planning

Brian Ichter∗,1, James Harrison∗,2, Marco Pavone1

Abstract— A defining feature of sampling-based motion planning is the reliance on an implicit representation of the state space, which is enabled by a set of probing samples. Traditionally, these samples are drawn either probabilistically or deterministically to uniformly cover the state space. Yet, the motion of many robotic systems is often restricted to "small" regions of the state space, due to, e.g., differential constraints or collision-avoidance constraints. To accelerate the planning process, it is thus desirable to devise non-uniform sampling strategies that favor sampling in those regions where an optimal solution might lie. This paper proposes a methodology for non-uniform sampling, whereby a sampling distribution is learned from demonstrations, and then used to bias sampling. The sampling distribution is computed through a conditional variational autoencoder, allowing sample generation from the latent space conditioned on the specific planning problem. This methodology is general, can be used in combination with any sampling-based planner, and can effectively exploit the underlying structure of a planning problem while maintaining the theoretical guarantees of sampling-based approaches. Specifically, on several planning problems, the proposed methodology is shown to effectively learn representations for the relevant regions of the state space, resulting in an order of magnitude improvement in terms of success rate and convergence to the optimal cost.

I. INTRODUCTION

Sampling-based motion planning (SBMP) has emerged as a successful algorithmic paradigm for solving high-dimensional, complex, and dynamically-constrained motion planning problems. A defining feature of SBMP is the reliance on an implicit representation of the state space, achieved through sampling the feasible space and probing local connections through a black-box collision checking module. Traditionally, these samples are drawn either probabilistically or deterministically to uniformly cover the state space. Such a sampling approach allows arbitrarily accurate representations (in the limit of the number of samples approaching infinity), and thus allows theoretical guarantees on completeness and asymptotic optimality. However, many robotic systems only operate in small subsets of the state space, which may be very complex and only implicitly defined. This might be due to the environment (e.g., initial and goal conditions, narrow passageways), the system's dynamics (e.g., a bipedal walking robot's preference towards stable, upright conditions), or implicit constraints (e.g., loop closures, multirobot systems remaining out of collision). The performance of SBMP is thus tied to the placement of samples in these promising regions, a result uniform sampling is only able to achieve through sheer exhaustion. Therein lies a fundamental challenge of SBMP algorithms: their implicit representation of the state space, while beneficial due to its generality and minimal set of assumptions, limits their ability to use known information about the workspace or robotic system, or to leverage knowledge gained from previous planning problems to accelerate solutions.

∗ Brian Ichter and James Harrison contributed equally to this work.
1 Department of Aeronautics and Astronautics, Stanford University, Stanford, CA 94305. {ichter, pavone}@stanford.edu
2 Department of Mechanical Engineering, Stanford University, Stanford, CA 94305. [email protected]

Fig. 1: A fast marching tree (FMT∗) generated with learned samples for a double integrator, conditioned on the initial state, goal state, and workspace obstacles. Note the significantly higher density of samples along the solution plan.


In this work we approach this challenge through biasing the sampling of the state space towards these promising regions via learned sample distributions (see Fig. 1). At the core of this methodology is a conditional variational autoencoder (CVAE) [1], which is capable of learning complex manifolds and the regions around them, and is trained from demonstrations of successful motion plans and previous robot experience, rather than expert-crafted heuristics. The latent space of the CVAE can then be sampled and projected into a representation of the operational regions of the robotic system, conditioned on information specific to a given planning problem, for example, initial state, goal region, and obstacles (as defined in the workspace). In this way one can balance the theoretical and exploration benefits of sampling-based motion planning with the generality, extensibility, and practical performance associated with hyperparametric learning approaches. Remarkably, this methodology is extensible to any system and available problem-specific information.

Related Work: Several previous works have presented non-uniform sampling strategies for SBMP, often resulting in significant performance gains [2]. Recent work by Gammell et al. [3] adaptively places batches of samples based on solution information gained while planning. Other works leverage models of the workspace to bias samples via decomposition techniques [4], [5] or via approximation of the medial axis [6]. While these sampling heuristics are effective in the problems for which they were designed, it is often unclear how they perform in environments that are outside of their expected operating conditions [7]. This is particularly the case for planning problems that must sample the full state space (e.g., that include velocity). These methods further require significant expert intuition, whereas our method is able to compute efficient non-uniform distributions from demonstration alone.

Several approaches to adaptive sampling and biased sampling have used online learning to improve generalization. Burns and Brock [8] aim to optimally sample the configuration space based on maximizing a given utility function for each new sample, related to the quality of the configuration space representation. Zucker et al. [9] use a reinforcement learning approach to learn where to sample in a discretized workspace. Berenson et al. [10] cache paths, and use a learned heuristic to store new paths as well as recall and repair old paths. Coleman et al. [11] use a related experience-based approach to store experiences in a graph rather than as individual paths. Li and Bekris [12] learn cost-to-go metrics for kinodynamic systems, which often do not have an inexpensive-to-compute metric, especially for nonholonomic systems. Baldwin and Newman [13] learn distributions that leverage semantic information. Ye and Alterovitz [14] present a learning-based approach that leverages human demonstrations and SBMP to generate plans. Each of these approaches either fails to generalize to arbitrary planning problems or cannot leverage both the full state (e.g., velocity information) and external (e.g., obstacle information) factors. In this work we propose the use of recent advances in representation learning, due to their ability to represent general, high-dimensional, complex conditional distributions and thus include all available state and problem information.

Representation learning (also known as feature learning or manifold learning, and generally considered a subset of deep learning or probabilistic graphical models [15]) has seen increasing use in robotics in the last decade [16]. This field primarily aims to extract features from unstructured data, to either achieve a lower-dimensional representation (often referred to as encoding) or learn features for supervised learning or reinforcement learning. In the context of motion planning, Havoutis and Ramamoorthy [17] present an algorithm that identifies manifolds used to focus sampling, based on Locally Smooth Manifold Learning [18]. This approach relies on a relatively expensive gradient descent procedure, and it is unclear how it may be combined with problem-specific information. Chen et al. [19] use a time-dependent variational autoencoder to learn dynamic motion primitives for humanoids which, though not addressed in that work, could be used toward sampling for the motion planning problem. In this work, we aim to develop a general methodology to incorporate environmental knowledge (which changes over the operation of the robot) with learned representations of the robotic system in question, to improve the quality of samples used to generate motion plans.

Statement of Contributions: The contributions of this paper are threefold. First, we present a methodology to learn sampling distributions for sampling-based motion planning. Such learned sampling distributions can be used to focus sampling on those regions where an optimal solution may lie, as dictated by both system-specific and problem-specific constraints. This methodology is general to arbitrary systems and problems, scales well to high dimensions, and maintains the typical theoretical results of SBMP (chiefly, completeness and asymptotic optimality). Second, we demonstrate the learned sampling distribution methodology on a wide variety of problems, from low-dimensional to high-dimensional, and from a geometric setting to problems with complex dynamics. Third, we compare our methodology to uniform sampling, as well as to state-of-the-art approaches for sample biasing, finding approximately an order of magnitude improvement in success rates and path costs. Our findings show benefits analogous to those seen recently in the computer vision community, where feature learning-based approaches have resulted in substantially improved performance over human intuition and handcrafted features.

Organization: The remainder of this document is organized as follows. Section II reviews the problem addressed herein. Section III outlines the learned sampling distribution methodology. Section IV demonstrates the performance of the method through numerical experiments. Section V summarizes the results and provides directions for future work.

II. PROBLEM STATEMENT

The goal of this work is to generate sample distributions to aid sampling-based motion planning algorithms in solving the optimal motion planning problem. Informally, solving this problem entails finding the shortest free path from an initial state to a goal region, if one exists. The simplest version of the problem, referred to herein as the geometric motion planning problem, is defined as follows. Let X = [0, 1]^d be the state space, with d ∈ N, d ≥ 2. Let Xobs denote the obstacle space, Xfree = X \ Xobs the free state space, xinit ∈ Xfree the initial condition, and Xgoal ⊂ Xfree the goal region. A path is defined by a continuous function, σ : [0, 1] → R^d. We refer to a path as collision-free if σ(τ) ∈ Xfree for all τ ∈ [0, 1], and feasible if it is collision-free, σ(0) = xinit, and σ(1) ∈ Xgoal. We thus wish to solve,

Problem 1 (Optimal motion planning). Given a motion planning problem (Xfree, xinit, Xgoal) and a cost measure c, find a feasible path σ∗ such that c(σ∗) = min{c(σ) : σ is feasible}. If no such path exists, report failure.
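To make these definitions concrete, the following minimal sketch (ours, not part of the paper's released code; all helper names are hypothetical) checks a discretized path against the definitions above, with Xobs given as axis-aligned boxes in [0, 1]^d:

```python
import numpy as np

def is_collision_free(path, obstacles):
    """Collision-free: sigma(tau) in Xfree for all tau, checked on a
    discretized path (T, d); obstacles are (lo, hi) axis-aligned boxes.
    A full implementation would also check the segments between waypoints."""
    for x in path:
        for lo, hi in obstacles:
            if np.all(x >= lo) and np.all(x <= hi):
                return False  # waypoint lies inside an obstacle box
    return True

def is_feasible(path, obstacles, x_init, goal_lo, goal_hi, tol=1e-6):
    """Feasible: collision-free, sigma(0) = x_init, and sigma(1) in Xgoal
    (here Xgoal is itself taken to be an axis-aligned box)."""
    return (is_collision_free(path, obstacles)
            and np.linalg.norm(path[0] - x_init) < tol
            and np.all(path[-1] >= goal_lo) and np.all(path[-1] <= goal_hi))
```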

Generally, there exist formulations of this problem for systems with kinematic, differential, or more complex constraints, for which we refer the reader to [20], [21]. The general form of the motion planning problem is known to be PSPACE-complete [21], and thus one often turns to approximate methods to solve the problem efficiently. Sampling-based motion planning has achieved particular success in solving complex, high-dimensional planning problems. These algorithms (e.g., PRM∗, RRT∗ [22], FMT∗ [23]) approach the complexity of the motion planning problem by only implicitly representing the state space with a set of probing samples in Xfree and making local, free connections to neighbor samples (sampled states within an rn > 0 cost radius, where n is the number of samples). As more samples are added, the implicit representation is able to model the true state space arbitrarily well, allowing theoretical guarantees of both probabilistic completeness (the probability that a solution will be found, if one exists, converges to one as the number of samples approaches infinity) and asymptotic optimality (the cost of the found path converges almost surely to the optimum) [22].
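For reference, the shrinking connection radius used by the asymptotically optimal planners of [22] takes the form rn = γ(log n / n)^{1/d}. Below is a brief sketch of neighbor lookup under such a radius; the constant γ is an arbitrary placeholder (the theory requires it to exceed a dimension-dependent threshold):

```python
import numpy as np

def connection_radius(n, d, gamma=2.0):
    """Shrinking radius r_n = gamma * (log(n)/n)^(1/d), as in [22].
    gamma is a placeholder, not a tuned or theoretically minimal value."""
    return gamma * (np.log(n) / n) ** (1.0 / d)

def neighbors(samples, i, r):
    """Indices of samples within Euclidean cost radius r of sample i."""
    dists = np.linalg.norm(samples - samples[i], axis=1)
    return np.where((dists <= r) & (dists > 0))[0]

samples = np.random.rand(1000, 2)          # 1000 samples in [0, 1]^2
r = connection_radius(len(samples), d=2)   # radius shrinks as n grows
print(neighbors(samples, 0, r))
```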

At the core of these algorithms is a sampling strategy from which the implicit representation is constructed. These samples are drawn from a sample source (random or deterministic) and then distributed over the state space according to a probability distribution [2]. Sampling sources, while traditionally independently and identically distributed random points, can be drawn through low-dispersion, deterministic sampling strategies, allowing for improved theoretical guarantees and practical performance [24]. However, compared to the sampling distribution, the impact of the sample source is limited [2]. This work focuses primarily on computing sampling distributions to allocate samples to regions more likely to be in an optimal motion plan.
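To illustrate the distinction between sample source and sampling distribution, the sketch below (ours) contrasts an i.i.d. random source with a deterministic low-dispersion Halton source, one of the deterministic sources analyzed in [24]; either source can subsequently be reshaped by a non-uniform distribution:

```python
import numpy as np

def radical_inverse(n, base):
    """n-th element of the van der Corput sequence in the given base."""
    q, denom = 0.0, 1.0
    while n > 0:
        n, rem = divmod(n, base)
        denom *= base
        q += rem / denom
    return q

def halton(n_samples, d, primes=(2, 3, 5, 7, 11, 13)):
    """First n_samples points of the d-dimensional Halton sequence in [0, 1]^d."""
    return np.array([[radical_inverse(i + 1, primes[j]) for j in range(d)]
                     for i in range(n_samples)])

iid_source = np.random.rand(100, 2)  # random source, uniform distribution
halton_source = halton(100, 2)       # deterministic low-dispersion source
```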

III. LEARNING-BASED SAMPLE DISTRIBUTIONS

The goal of this work is to develop a methodology capable of identifying subspaces of the state space containing optimal trajectories with high probability, and generating samples from this subspace to improve the performance of sampling-based motion planning (SBMP) algorithms. These regions may be arbitrary and potentially complex, defined by internal or external factors. Specifically, we refer to intrinsic properties of the robotic system, independent of the individual planning problem (e.g., the system dynamics), as internal factors, and we refer to properties specific to the planning problem itself (e.g., the obstacles, the environment, the initial state, and the goal state) as external factors. At the core of our methodology is a Conditional Variational Autoencoder (CVAE), as it is expressive enough to represent very complex, high-dimensional regions and general enough to admit arbitrary problem inputs. The CVAE is an extension of the standard variational autoencoder, a class of generative models that has seen widespread application in recent years, particularly in the image and video processing communities [25]. This extension allows conditional data generation by sampling from the latent space of the model [1]; in the motion planning context, the conditioning variables represent external factors. Lastly, these samples are used, along with uniform samples that ensure state space coverage, as the sampling distribution for SBMP algorithms. The method thus leverages previous robotic experience (motion plans and demonstrations) to inform planning algorithms. This combination of learning and SBMP allows both the generality of learning and the exhaustive exploration ability and theoretical guarantees of SBMP.

We now examine the methodology in detail, following along with the outline below. It begins with an offline phase which trains the CVAE, to be later sampled from. Line 1 initializes this phase with the required demonstration data.

Learning Sample Distribution Methodology Outline

Offline:
1  Input: Data (successful motion plans, robot in action, human demonstration, etc.)
2  Construct conditioning variables y
3  Train CVAE, as in Fig. 2a

Online:
4  Input: Motion planning problem (Xfree, xinit, Xgoal), learned sample fraction λ
5  Construct conditioning variable y
6  Generate λN free samples from the CVAE latent space conditioned on y, as in Fig. 2b
7  Generate (1 − λ)N free samples from an auxiliary (uniform) sampler
8  Run sampling-based planner (e.g., PRM∗, FMT∗, RRT∗)

This data (states and any additional planning problem information) may be from successful motion plans, previous trajectories in the state space, human demonstration, or other sources that provide insight into how the system operates. In this work we use each of these data sources (Section IV), though, when available, optimal solutions to previous motion planning problems are preferred, since these intuitively provide the most insight into the optimal motion planning problem. In order to generate the required breadth of data (in this work, on the order of one hundred thousand sample points), we leverage GPU-accelerated, approximate motion planning algorithms to generate plans quickly [26]. The data is then processed into the state of the robot and the conditioning variables. In particular, these conditioning variables (Line 2) contain information about the state of the problem, such as workspace information (e.g., obstacles) or the initial state and goal region. The CVAE is then trained in Line 3 according to the following process, with the goal of learning the internal representation of the system conditioned on external properties of the problem, which may inform where in the state space the system will operate for a given problem.
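As an illustration of Lines 1-2, here is a minimal sketch (ours) of this preprocessing, assuming the data consists of successful plans in a 2D workspace with axis-aligned box obstacles; the particular encoding of y (initial state, goal state, and a flattened occupancy grid) is one choice among many:

```python
import numpy as np

def occupancy_grid(obstacles, resolution=10):
    """Rasterize (lo, hi) box obstacles into a flattened 2D occupancy grid."""
    grid = np.zeros((resolution, resolution))
    cells = (np.arange(resolution) + 0.5) / resolution  # cell centers in [0, 1]
    for lo, hi in obstacles:
        for i, cx in enumerate(cells):
            for j, cy in enumerate(cells):
                if lo[0] <= cx <= hi[0] and lo[1] <= cy <= hi[1]:
                    grid[i, j] = 1.0
    return grid.ravel()

def build_training_pairs(plans):
    """Line 2: pair each state x along a successful plan with the
    conditioning variable y of its planning problem."""
    xs, ys = [], []
    for plan in plans:  # dict with 'states', 'x_init', 'x_goal', 'obstacles'
        y = np.concatenate([plan['x_init'], plan['x_goal'],
                            occupancy_grid(plan['obstacles'])])
        for x in plan['states']:
            xs.append(x)
            ys.append(y)
    return np.array(xs), np.array(ys)
```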

Let x denote a continuous random variable, z denote the latent variable, and y denote the conditioning variable, such that z is drawn from the prior pθ∗(z | y) and x is drawn from the conditional distribution pθ∗(x | z, y). In this case, these distributions are parametric, and θ∗ denotes the true set of parameters. Generally, we write pθ to denote parameterized distributions. In our work, x is a sampled state and y is a set of environmental or problem-specific factors. We aim to construct the distribution pθ∗(x | y). The training process is outlined in Fig. 2a. We fix the prior distribution pθ(z | y) to be the isotropic normal distribution N(0, I), and seek a deterministic function f that maps this distribution to pθ(x | z, y) (typically called a decoder). To learn the decoder, we rely on an encoder of the form qφ(z | x, y) (where φ denotes distribution parameters). In practice, this is a deterministic function that maps samples to a mean (µ) and variance (Σ) in the latent space.

Fig. 2: Conditional Variational Autoencoder (CVAE) setup [27]: (a) training, (b) sampling. In the context of this work, x represents training states, y the conditioning variable (possibly initial state, goal state, and workspace information), and z the latent state. The decoder (b) is used online to project conditioned latent samples into our distribution.

The estimation of the distribution parameters θ through maximum likelihood methods is generally intractable, and so the variational lower bound (also referred to as the Evidence Lower Bound, or ELBO),

\[
\log p_\theta(x \mid y) \geq -D_{KL}\big(q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid y)\big) + \mathbb{E}_{q_\phi(z \mid x, y)}\big[\log p_\theta(x \mid z, y)\big],
\]

is optimized instead, where DKL(·) denotes the KL divergence. This lower bound can then be approximated with Monte Carlo methods, giving the empirical lower bound,

\[
\log p_\theta(x \mid y) \geq -D_{KL}\big(q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid y)\big) + \frac{1}{L} \sum_{l=1}^{L} \log p_\theta\big(x \mid z^{(l)}, y\big), \quad z^{(l)} \sim q_\phi(z \mid x, y). \tag{1}
\]

Using the "reparameterization trick" (covered in detail in [25]), this estimator can be maximized with stochastic optimization techniques standard in machine learning.

The standard construction of the CVAE consists of neural networks for the encoder qφ(z | x, y) and the decoder pθ(x | z, y). The encoder and decoder are trained jointly using the reconstruction error (see [27]). Once trained, the decoder allows us to approximately generate samples from pθ∗(x | y) by simply sampling from the normal distribution of the latent variable (see Fig. 2b). While one iteration of this offline phase is often sufficient, for problems that are expensive to solve, and thus expensive to acquire data for, the entire methodology may be performed iteratively: a partially trained CVAE may generate samples that result in better planning performance, yielding more high-quality data and allowing the CVAE to be further trained.
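A compact sketch of this construction in TensorFlow (ours; the architecture and hyperparameters of the released code may differ): the encoder outputs (µ, log Σ), the reparameterization trick draws z = µ + σ ⊙ ε with ε ∼ N(0, I), and the loss is the negative of the empirical lower bound (1), with a Gaussian reconstruction term and the closed-form KL divergence to the fixed prior N(0, I):

```python
import tensorflow as tf

state_dim, latent_dim = 6, 3  # illustrative sizes; y is a (batch, cond_dim) tensor

def mlp(out_dim):
    """Two-hidden-layer fully connected network."""
    return tf.keras.Sequential([
        tf.keras.layers.Dense(512, activation='relu'),
        tf.keras.layers.Dense(512, activation='relu'),
        tf.keras.layers.Dense(out_dim)])

encoder = mlp(2 * latent_dim)  # q_phi(z | x, y): outputs (mu, log-variance)
decoder = mlp(state_dim)       # p_theta(x | z, y): outputs a reconstructed state
optimizer = tf.keras.optimizers.Adam(1e-4)

def train_step(x, y):
    """One stochastic step maximizing the empirical lower bound (1)."""
    with tf.GradientTape() as tape:
        mu, logvar = tf.split(encoder(tf.concat([x, y], axis=1)), 2, axis=1)
        # reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)
        eps = tf.random.normal(tf.shape(mu))
        z = mu + tf.exp(0.5 * logvar) * eps
        x_hat = decoder(tf.concat([z, y], axis=1))
        # Gaussian reconstruction term of (1), up to additive constants
        recon = tf.reduce_sum(tf.square(x - x_hat), axis=1)
        # closed-form KL(q_phi(z | x, y) || N(0, I)) against the fixed prior
        kl = -0.5 * tf.reduce_sum(1.0 + logvar - tf.square(mu) - tf.exp(logvar),
                                  axis=1)
        loss = tf.reduce_mean(recon + kl)  # negative lower bound
    params = encoder.trainable_variables + decoder.trainable_variables
    optimizer.apply_gradients(zip(tape.gradient(loss, params), params))
    return loss
```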

The online phase of the methodology begins with a planning problem (Line 4), defined by the tuple (Xfree, xinit, Xgoal), which is formed into a conditioning variable y in Line 5; this variable may include, for example, the initial state, the goal state, or workspace obstacles encoded in an occupancy grid. With this in hand, we now generate samples by sampling the latent space as N(0, I), conditioning on y, and projecting these samples to the state space through the decoder network (Line 6). In order to maintain the ability of SBMP algorithms to represent the true state space with arbitrarily high fidelity, and thus maintain their theoretical guarantees (see Remark 1), we also sample from an auxiliary sampler, in our case a uniform sampler. We denote the fraction of learned samples as λ, i.e., we generate λN samples from the learned sampler and (1 − λ)N from the auxiliary sampler. We have found through experimentation that λ = 0.5 represents the best balance between leveraging the learned sample regions and ensuring full coverage of the state space. Finally, in Line 8, we use these samples to seed a SBMP algorithm, such as PRM∗, FMT∗, or RRT∗, and solve the planning problem. This methodology is applied to a variety of problems with varying state space dimensionality, constraints, and training data-generation approaches in the following section.
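With a trained decoder in hand, the online phase (Lines 4-8) reduces to a few lines. In this sketch (ours), decode stands for the trained decoder network from the sketch above and in_free for the black-box collision checker, both assumed given:

```python
import numpy as np

lam, N, latent_dim, state_dim = 0.5, 1000, 3, 6  # lam: learned sample fraction

def generate_samples(y, decode, in_free):
    """Lines 6-7: lam*N learned samples plus (1 - lam)*N uniform samples,
    rejecting states in Xobs; the surviving set seeds the planner (Line 8)."""
    z = np.random.randn(int(lam * N), latent_dim)    # latent samples ~ N(0, I)
    learned = decode(z, y)                           # conditioned on problem y
    uniform = np.random.rand(N - int(lam * N), state_dim)
    samples = np.vstack([learned, uniform])
    return samples[np.array([in_free(x) for x in samples])]
```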

Remark 1 (Probabilistic Completeness and Asymptotic Optimality). Note that the theoretical guarantees of probabilistic completeness and asymptotic optimality from [23], [24], and [22] hold for this method by adjusting any references to n (the number of samples) to (1 − λ)N (the number of uniform samples in our methodology). This result is detailed in Appendix D of [23] and Section 5.3 of [24], which show that adding samples can only improve the solution or lower the dispersion (on which the theoretical results are based), respectively.

IV. NUMERICAL EXPERIMENTS

To demonstrate the performance and generality of learned sample distributions, this section presents several numerical experiments with a variety of robotic systems. The results in Section IV-A were implemented in MATLAB with the Fast Marching Tree (FMT∗) and Batch Informed Trees (BIT∗) algorithms [23], [3], while the remainder of the results were implemented in CUDA C with a GPU version of the Probabilistic Roadmap (PRM∗) algorithm (for convergence plots) and the Group Marching Tree (GMT∗) algorithm (to generate training data) [22], [26]. The CVAE was implemented in TensorFlow. The simulations were run on a Unix system with a 3.4 GHz CPU and an NVIDIA GeForce GTX 1080 Ti GPU. Example code and the network architecture may be found at https://github.com/StanfordASL/LearnedSamplingDistributions. We begin with a simple geometric planning problem in which we show our method performs as well as or better than state-of-the-art approaches. We also note that these state-of-the-art approaches are less general than the method we present in this paper, and are tuned well towards these geometric problems. We then demonstrate the benefits of learned distributions for a high-dimensional spacecraft system and a dynamical system conditioned on workspace obstacle information. These results are examined conceptually as well as quantitatively in terms of convergence, finding an order of magnitude improvement in success rate and cost. Finally, we apply the method to two very complex systems, a humanoid walker and a multirobot system (trained from human demonstration), and conceptually discuss their resulting distributions. Note that sample generation time is included in runtime, but generally accounts for only a fraction of the total runtime; generating thousands of samples takes only a few milliseconds.

A. Geometric Planning Comparisons

While this methodology is very general and can be applied to complex systems, we first show that it performs well for a simple, geometric problem.


Fig. 3: (3a-3c) Solutions to the geometric planning problem with different sample distributions (uniform, hybrid, and learned FMT∗) and (3d-3e) convergence results for sample distributions with FMT∗ and BIT∗ (results averaged over 100 runs, standard deviation shaded, and λ = 0.5).

All problems are created with randomly generated initial states, goal states, and 10 cube obstacles, as shown in Fig. 3. The learned sample distributions were conditioned on all of the problem information (initial and goal states, and obstacles) and trained on successful motion plans.

For this problem, we make comparisons to a non-uniform sampling strategy and combine our methodology with an exploration-guided non-uniform sampling algorithm. Specifically, we consider the hybrid sampling strategy proposed in [28], and BIT∗ [3]. The hybrid sampling approach uses uniform samples, Gaussian samples, and bridge samples to create a distribution favoring narrow passageways and regions near obstacles. BIT∗ uses successive batches of samples to iteratively refine a tree, leveraging solutions from previous batches to selectively sample only states that can improve the solution.
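For intuition, here is a minimal sketch (our paraphrase of the tests used in [28]; σ values are illustrative) of the two non-uniform components: the Gaussian sampler keeps a free state whose Gaussian-perturbed partner is in collision, biasing samples toward obstacle boundaries, while the bridge test keeps the free midpoint of two nearby colliding states, biasing samples toward narrow passages.

```python
import numpy as np

def gaussian_sample(in_free, d, sigma=0.05):
    """Keep a free state whose nearby perturbed partner is in collision."""
    while True:
        x = np.random.rand(d)
        x2 = x + sigma * np.random.randn(d)
        if in_free(x) and not in_free(x2):
            return x

def bridge_sample(in_free, d, sigma=0.1):
    """Keep the free midpoint of two nearby states that are both in collision."""
    while True:
        x = np.random.rand(d)
        x2 = x + sigma * np.random.randn(d)
        mid = 0.5 * (x + x2)
        if (not in_free(x)) and (not in_free(x2)) and in_free(mid):
            return mid
```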

The results of these comparisons are shown in Fig. 3d. The first comparison shows each strategy with FMT∗ [23]. We find that each strategy performs nearly equivalently in terms of finding a solution, but the learned strategy finds significantly better solutions in the same amount of time. In fact, the learned sampling strategy finds a solution within 5% of the best cost almost immediately, instead of gradually converging to it. The results show less of a performance gap with BIT∗, though the learned strategy continues to perform at least as well as the others. The delayed start of the hybrid convergence curve is due only to the time required to generate samples, which for BIT∗ was implemented here by rejection sampling.

B. Spacecraft Debris Recovery

The second numerical experiment considered is a simplified spacecraft debris recovery problem, whereby a cube-shaped spacecraft with 3D double integrator dynamics (ẍ = u, no rotations) and a pair of 3-DoF kinematic arms (assumed to be much less massive than the spacecraft body), for a total of 12 dimensions, must maneuver from an initial state, through a cluttered asteroid field, to recover debris between its end effectors.

Fig. 4: An example spacecraft debris recovery problem, whereby the spacecraft must maneuver from an initial state to recover debris between its end effectors. The spacecraft is modeled as a double integrator with a pair of 3-DoF kinematic arms.

The cost is set as a mixed time/quadratic control effort penalty, with an additional term for joint angle movement in the kinematic arms. The initial and goal states were randomly generated, as were the asteroids (i.e., obstacles). Fig. 4 shows an example problem and the spacecraft setup. The CVAE was conditioned on the initial state and debris location, and trained with successful motion plans.

The resulting learned distributions are shown in Fig. 5. Fig. 5a shows the learned distribution of the x and y positions for two problems. The distribution resembles an ellipsoid connecting the initial and goal states, with some spread in the minor axis to account for potential obstacles in the trajectory and a slight skew in the direction of the initial velocity; we note the similarity to the sample distributions found after exploration in BIT∗ [3]. Fig. 5b shows the learned distribution of x and ẋ, i.e., a phase portrait of the x dimension. The green distribution favors velocities such that any sample flows from the initial to the goal state, first accelerating near the maximum sampled velocity (ẋ = 1), and then maintaining that velocity until near the goal. The red distribution, whose initial position is much closer to the goal region, has a much larger spread, favoring samples that move towards the goal in all directions. Finally, the learned distributions for a single arm are shown in Figs. 5c-5d. The angle distributions demonstrate that arm movement should be kept to a minimum, holding one dimension fixed to only a few values. In the problem setup, the arms have significantly less impact on obstacle avoidance, but can incur a large cost for movement, which is reflected in the distributions.

Fig. 6 shows the convergence of the methods in time. The learned sampling distribution outperforms the uniform one by approximately an order of magnitude in finding solutions when they exist. The cost convergence curve is even more extreme; the learned samples converge to within a few percent of optimal almost immediately, while even after 10,000 samples, the uniform distribution is still more than 60% from optimal. This immediate convergence is similar to what was observed in the geometric planning problem and the narrow passage problem (in the following section). We also note the variance is smaller for the learned distributions.

C. Workspace Learning

Our next problem, shown in Figs. 1 and 7 and loosely inspired by the narrow passage problems in [9], demonstrates


Fig. 5: Spacecraft learned distributions for various dimensions: (a) x vs. y, (b) x vs. ẋ, (c) α vs. β1, (d) β1 vs. β2. The learned distributions are shown in red and green, the goal state is shown in blue, and the initial training distributions are shown in black (when displayed). (5a) shows the distribution favoring an ellipse connecting the initial and goal states. (5b) shows a phase portrait of the x dimension, where the samples favor velocities towards the goal region. (5c-5d) show distributions for the kinematic arm angles, which have converged to a few fixed values, thus reducing movement cost (α and β1 denote rotation angles at the arm-spacecraft hub and β2 denotes the joint angle in the arm).

Fig. 6: Convergence results for the spacecraft planning problem. The learned distributions (50% and 90% learned) significantly outperform a uniform distribution (results averaged over 100 runs and the standard deviation shaded).

the ability of the methodology to learn distributions conditioned on workspace information. The problem features a 3D double integrator (6-dimensional state space) operating in an environment with 3 narrow passages. The initial state, goal state, and gap locations are all randomly generated and used to condition the CVAE for each problem (obstacles were passed through an occupancy grid). Fig. 7 shows several problems and their learned distributions; clearly, the CVAE has been able to capture both initial and goal state biasing, some sense of dynamics, and the obstacle set. The velocity distributions too show the samples effectively favoring movement from the initial to the goal state. The convergence results, shown in Fig. 8, demonstrate that learned distribution solutions can be found with approximately an order of magnitude fewer samples and converge in cost almost immediately, while the uniform sampling results show the classic convergence curve we may expect.

This method's ability to learn both dynamics and obstacles demonstrates that learning is capable of almost entirely solving some problems.

Fig. 7: Example learned distributions for the narrow passage problem, conditioned on the initial state (red), goal state (blue), and the obstacles (through an occupancy grid).

Fig. 8: Convergence results for the narrow passage problem, demonstrating that the learned sample distributions (50% and 90% learned) achieve approximately an order of magnitude better performance in terms of success rate, and are able to converge with few samples (results averaged over 100 runs and the standard deviation shaded).

While this would be quite efficient, we also found the learned distributions susceptible to failure modes (e.g., cutting corners), which result in infeasible solution trajectories. In our methodology, this is easily handled through the uniform sampling and the guarantees of SBMP. This methodology may thus be thought of as attempting to solve the problem through learning, while accounting for possible errors with a theoretically sound algorithm.

D. Learning Dependent Sample Sets

To showcase the generality of the learned distribution methodology and its ability to capture arbitrary and complex distributions, we applied the same methodology to sets of samples (in this case, solutions that contained three or more samples). We thus learn not only the distribution for one sample, but for three samples with dependencies between them (in particular, they should be well dispersed). Fig. 9 shows the resulting distributions and the success rate of this method compared to learning only a single sample. As expected, the distributions learn to be well-distributed along the solution trajectory, resulting in higher success rates (e.g., at 100 samples, the success rate of the 90% learned distribution increased from 82% to 93%). This result corroborates the findings of [24], which found the primary benefit of low-dispersion sampling is in finding solutions with fewer samples.
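Concretely, the only change relative to the single-sample pipeline is in the training data: each example stacks k waypoints of one solution into a single vector, so the CVAE learns a joint distribution over the set. Below is a minimal sketch (ours; the even spacing of the selected waypoints is an illustrative choice, not necessarily that of our experiments):

```python
import numpy as np

def stack_waypoints(plans, k=3):
    """Stack k waypoints of each solution into one training vector of
    dimension k*d, so the CVAE captures dependencies (e.g., dispersion)
    between the k samples that will later be drawn together."""
    xs = []
    for plan in plans:  # plan['states']: (T, d) solution states, T >= k
        T = len(plan['states'])
        idx = np.linspace(0, T - 1, k).astype(int)
        xs.append(np.concatenate([plan['states'][i] for i in idx]))
    return np.array(xs)

# At sampling time, each decoded vector is reshaped into k dependent samples:
# dependent_set = decoded.reshape(k, -1)
```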

E. Humanoid Walker

We demonstrated our methodology on samples from the HumanoidFlagrunHarder-v1 task of the Roboschool learning environment [29]. In this task, a humanoid figure is given a goal location in the workspace every 1000 frames, and receives a reward relative to its distance from this goal. The humanoid policy was taken from an example script, in which Proximal Policy Optimization [29] was used to train a feedforward neural network policy parameterization.


Fig. 9: (9a-9c) Learned distributions of multiple, dependent samples drawn together (each in red, blue, and green), effectively enforcing some dispersion between them. (9d) Success rates for single-sample distributions and multi-sample distributions (lighter colors denote the multi-sample counterparts).

The state space of this system is 44-dimensional, corresponding to joint angles, the position of the desired goal, and binary variables indicating whether each foot is in contact with the ground. Fig. 10 shows four samples from our CVAE (left) and a sample with randomly sampled joint angles for comparison (right). One may note that in all of our samples, the humanoid's arms are raised forward and in front of its body. In the learning environment, small blocks are intermittently thrown at the humanoid so that the learned policy is more robust to disturbances. In response, the learned policy keeps its arms extended to provide stability. The samples are in the full state space, so joint velocities are also sampled, but these are not visualized here.

Note, we only examine our methodology qualitatively in this section due to the difficulty of making convergence comparisons to uniform sampling; uniform sampling would require orders of magnitude more samples and computation time than considered in previous problems and would not be an approach considered in practice. Specialized techniques do exist for selecting samples for humanoid robotic systems; however, these have required expert tuning to specific problems, are more computationally intensive, and are generally more conservative [30], [19], [31]. Many of these approaches are based on sampling configurations that are statically balanced, with little consideration for sampling joint velocities. Our approach generates samples from a humanoid robot successfully running, implying that these states are dynamically stable, since the system was trained with added disturbances. The generated samples are dynamic, and thus can yield substantially lower costs, as they may be fluidly connected.

F. Multirobot and Human Demonstration

Finally, we demonstrate the methodology on a multirobot problem trained through human demonstration. The problem is set up as two cars completing a lane change maneuver (Fig. 11a), with a total of 22 states, a constraint that the two cars remain out of collision, and a preference towards the cars remaining in their lanes when not changing lanes.

Fig. 10: Four samples from the learned distribution for a 44-dimensional humanoid model (left) and one drawn from a uniform random distribution (right). Note the learned samples are in robust (arms up), upright poses, thus allowing momentum to be maintained.

Fig. 11: (11a) Setup for the lane change problem, generated from human demonstration, where the red and blue cars must switch lanes and make their respective exits. (11b-11c) The training dataset is visualized in black, overlaid with learned distributions for the blue and red initial car states. Displayed are the x and y positions of each car at several samples. Initially, the red car is behind the blue car, but at a higher velocity, and thus it eventually passes the blue car in most samples (the red car is leading in the samples visualized in (11b) and tailing in (11c)). Note each sample is not in self-collision.

The data was collected from human demonstration on a two-person driving simulator. The resulting learned distribution is visualized for a single initial state in Fig. 11b, overlaid on the total dataset. Again, we only examine this problem qualitatively due to the computational challenges of solving the same problem with uniform samples. The learned distribution effectively encapsulates the necessary factors for a successful motion plan: the collision constraint, the preference towards the center of lanes, the preference to maintain forward velocity, and the choice of states applicable to a lane change maneuver.

V. CONCLUSIONS

In this paper we have presented a methodology to bias samples in the state space for sampling-based motion planning algorithms. In particular, we have used a conditional variational autoencoder to learn subspaces of valid or desirable states, conditioned on problem details such as obstacles. We have compared our methodology to several state-of-the-art methods for sample biasing, and have demonstrated it on multiple systems, showing approximately an order of magnitude improvement in cost and success rate over uniform sampling. This learning-based approach is promising due to its generality and extensibility, as it can be applied to any system and can leverage any problem information available. Its ability to automatically discover useful representations for motion planning (as seen by the near immediate convergence) is similar to recent results from deep learning in the computer vision community, which require less human intuition and handcrafting, and exhibit superior performance.

Future Work: There are many possible avenues open for future research. One promising extension is the incorporation of semantic workspace information through the conditioning variable. These semantic mappings show promise towards allowing mobile robots to better understand task specifications and interact with humans [32]. Another promising extension builds upon recent work demonstrating the favorable theoretical properties and improved performance of non-independent samples [24]. An approach to reducing the independence of the samples was presented in Section IV-D, but extensions beyond this exist, including generating large sample sets (e.g., > 1000). Lastly, for systems with constraints that force valid configurations to lie on zero-measure manifolds, recent work has focused on projective methods [33]. Our methodology, while not capable of sampling directly on such a manifold, can easily learn to sample near it, and may therefore dramatically improve the performance of these methods.

Application in Practice: Note again, the goal of this work is to compute a distribution representing promising regions (i.e., regions where optimal motion plans are likely to be found) through a learned latent representation of the system conditioned on the planning problem. During the offline CVAE training phase, we recommend training from optimal motion plans to best demonstrate promising regions. We generally train on the order of one hundred thousand motion plans, the generation of which can be accelerated with approximately optimal, GPU-based planning algorithms [26]. To form the conditioning variable, we recommend including all available problem information, particularly the initial state and goal state. If available, workspace obstacles can easily be included through occupancy grids; we note, however, that obstacles can be included in any form, as the neural network representation within the CVAE has the potential to arbitrarily project these obstacles as necessary. Multiple dependent samples can also be generated at once (Section IV-D); we found that even as few as three resulted in marked increases in success rate. Finally, in the online phase of the algorithm, we draw a combination of learned and uniform samples to ensure good state space coverage; we found a 50-50 split to be most effective.

ACKNOWLEDGMENT

This work was supported by a Qualcomm Innovation Fellowship and by NASA under the Space Technology Research Grants Program, Grant NNX12AQ43G. James Harrison was supported in part by the Stanford Graduate Fellowship and the Natural Sciences and Engineering Research Council (NSERC). The authors also wish to thank Edward Schmerling for helpful discussions.

REFERENCES

[1] K. Sohn, H. Lee, and X. Yan, "Learning structured output representation using deep conditional generative models," in NIPS, 2015.
[2] D. Hsu, J.-C. Latombe, and H. Kurniawati, "On the probabilistic foundations of probabilistic roadmap planning," IJRR, 2006.
[3] J. D. Gammell, S. S. Srinivasa, and T. D. Barfoot, "Batch Informed Trees (BIT*): Sampling-based optimal planning via the heuristically guided search of implicit random geometric graphs," in IEEE ICRA, 2015.
[4] H. Kurniawati and D. Hsu, "Workspace importance sampling for probabilistic roadmap planning," in IEEE IROS, 2004.
[5] J. P. Van den Berg and M. H. Overmars, "Using workspace information as a guide to non-uniform sampling in probabilistic roadmap planners," IJRR, 2005.
[6] Y. Yang and O. Brock, "Adapting the sampling distribution in PRM planners based on an approximated medial axis," in IEEE ICRA, 2004.
[7] R. Geraerts and M. H. Overmars, "Sampling techniques for probabilistic roadmap planners," IAS, 2004.
[8] B. Burns and O. Brock, "Toward optimal configuration space sampling," in RSS, 2005.
[9] M. Zucker, J. Kuffner, and J. A. Bagnell, "Adaptive workspace biasing for sampling-based planners," in IEEE ICRA, 2008.
[10] D. Berenson, P. Abbeel, and K. Goldberg, "A robot path planning framework that learns from experience," in IEEE ICRA, 2012.
[11] D. Coleman, I. A. Sucan, M. Moll, K. Okada, and N. Correll, "Experience-based planning with sparse roadmap spanners," in IEEE ICRA, 2015.
[12] Y. Li and K. E. Bekris, "Learning approximate cost-to-go metrics to improve sampling-based motion planning," in IEEE ICRA, 2011.
[13] I. Baldwin and P. Newman, "Non-parametric learning for natural plan generation," in IEEE IROS, 2010.
[14] G. Ye and R. Alterovitz, "Demonstration-guided motion planning," in ISRR, 2015.
[15] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE TPAMI, 2013.
[16] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end training of deep visuomotor policies," JMLR, 2016.
[17] I. Havoutis and S. Ramamoorthy, "Motion synthesis through randomized exploration on submanifolds of configuration space," in Robot Soccer World Cup, 2009.
[18] P. Dollar, V. Rabaud, and S. Belongie, "Non-isometric manifold learning: Analysis and an algorithm," in ICML, 2007.
[19] N. Chen, M. Karl, and P. van der Smagt, "Dynamic movement primitives in latent space of time-dependent variational autoencoders," in IEEE-RAS HUMANOIDS, 2016.
[20] E. Schmerling, L. Janson, and M. Pavone, "Optimal sampling-based motion planning under differential constraints: the driftless case," in IEEE ICRA, 2015.
[21] S. LaValle, Planning Algorithms. Cambridge University Press, 2006.
[22] S. Karaman and E. Frazzoli, "Sampling-based algorithms for optimal motion planning," IJRR, 2011.
[23] L. Janson, E. Schmerling, A. Clark, and M. Pavone, "Fast Marching Tree: A fast marching sampling-based method for optimal motion planning in many dimensions," IJRR, 2015.
[24] L. Janson, B. Ichter, and M. Pavone, "Deterministic sampling-based motion planning: Optimality, complexity, and performance," IJRR, 2017.
[25] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," arXiv preprint arXiv:1312.6114, 2013.
[26] B. Ichter, E. Schmerling, and M. Pavone, "Group Marching Tree: Sampling-based approximately optimal motion planning on GPUs," in IEEE IRC, 2017.
[27] C. Doersch, "Tutorial on variational autoencoders," arXiv preprint arXiv:1606.05908, 2016.
[28] D. Hsu, G. Sanchez-Ante, and Z. Sun, "Hybrid PRM sampling with a cost-sensitive adaptive strategy," in IEEE ICRA, 2005.
[29] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[30] S. Dalibard, A. El Khoury, F. Lamiraux, A. Nakhaei, M. Taïx, and J.-P. Laumond, "Dynamic walking and whole-body motion planning for humanoid robots: an integrated approach," IJRR, 2013.
[31] K. Hauser, T. Bretl, K. Harada, and J.-C. Latombe, "Using motion primitives in probabilistic sample-based planning for humanoid robots," in WAFR, 2008.
[32] I. Kostavelis and A. Gasteratos, "Semantic mapping for mobile robotics tasks: A survey," Robotics and Autonomous Systems, 2015.
[33] L. Jaillet and J. M. Porta, "Path planning under kinematic constraints by rapidly exploring manifolds," IEEE TRO, 2013.