Variational Autoencoder for End-to-End Control of ... · Variational Autoencoder for End-to-End Control of Autonomous Driving with Novelty Detection and Training De-biasing Alexander

Variational Autoencoder for End-to-End Control of Autonomous Drivingwith Novelty Detection and Training De-biasing

Alexander Amini1, Wilko Schwarting1, Guy Rosman2, Brandon Araki1, Sertac Karaman3, Daniela Rus1

Abstract— This paper introduces a new method for end-to-end training of deep neural networks (DNNs) and evaluatesit in the context of autonomous driving. DNN training hasbeen shown to result in high accuracy for perception to actionlearning given sufficient training data. However, the trainedmodels may fail without warning in situations with insufficientor biased training data. In this paper, we propose and evaluatea novel architecture for self-supervised learning of latentvariables to detect the insufficiently trained situations. Ourmethod also addresses training data imbalance, by learning aset of underlying latent variables that characterize the trainingdata and evaluate potential biases. We show how these latentdistributions can be leveraged to adapt and accelerate thetraining pipeline by training on only a fraction of the totaldataset. We evaluate our approach on a challenging datasetfor driving. The data is collected from a full-scale autonomousvehicle. Our method provides qualitative explanation for thelatent variables learned in the model. Finally, we show howour model can be additionally trained as an end-to-end con-troller, directly outputting a steering control command for anautonomous vehicle.

I. INTRODUCTION

Robots operating in human-centered environments haveto perform reliably in unanticipated situations. While deepneural networks (DNNs) offer great promise in enablingrobots to learn from humans and their environments (asopposed to hand-coding rules), substantial challenges re-main [1]. For example, previous work in autonomous drivinghas demonstrated the ability to train end-to-end a DNNcapable of generating vehicle steering commands directlyfrom car-mounted video camera data with high accuracyso long as sufficient training data is provided [2]. But trueautonomous systems should also gracefully handle scenarioswith insufficient training data. Existing DNNs will likelyproduce incorrect decisions without a reliable measure ofconfidence when placed in environments for which they wereinsufficiently trained.

A society where robots are safely and reliably integratedinto daily life demands agents that are aware of scenarios forwhich they are insufficiently trained. Furthermore, subsys-tems of these agents must effectively convey the confidenceassociated their decisions. Finally, robust performance ofthese systems necessitates an unbiased, balanced trainingdataset. To date, many such systems have been trained with

Toyota Research Institute (TRI) provided funds to assist the authors withtheir research but this article solely reflects the opinions and conclusionsof its authors and not TRI or any other Toyota entity. We gratefullyacknowledge the support of NVIDIA Corporation with the donation of theVolta V100 GPU used for this research.

1 Computer Science and Artificial Intelligence Lab, Massachusetts Insti-tute of Technology {amini,wilkos,araki,rus}@mit.edu

2 Toyota Research Institute (TRI) {guy.rosman}@tri.global3 Laboratory for Information and Decision Systems, Massachusetts

Institute of Technology {sertac}@mit.edu

Dataset Debiasing

Sample EfficientAccelerated Training

Uncertainty Estimation& Novelty Detection

Resampled data distribution

Encoder

Road Surface

SteeringControl

CommandWeather(Snow)

Unsupervised Latent Variables

Propagate Uncertainty

AdjacentVehicles

Fig. 1: Semi-supervised end-to-end control. An encoderneural network is trained to learn a supervised controlcommand as well as various other unsupervised outputsthat qualitatively describe the image. This enables two keycontributions of novelty detection and dataset debiasing.

datasets that are either biased or contain class imbalances,due to the lack of labeled data. This negatively impacts boththe speed and accuracy of training.

In this paper, we address these limitations by developingan end-to-end control method capable of novelty detectionand automated debiasing through self-supervised learning oflatent variable representations. In addition to learning a finalcontrol output directly from raw perception data, we alsolearn a number of underlying latent variables that qualita-tively capture the underlying features of the data cf. Fig. 1.These latent variables, as well as their associated uncertain-ties, are learned through self-supervision of a network trainedto reconstruct its own input. By estimating the distributionof latent factors, we can estimate when the network is likelyto fail (thus increasing the robustness of the controller,) andadapt the training pipeline to cater to the distribution of theseunderlying factors, thereby improving training accuracy. Ourapproach makes two key contributions:

1) Detection of novel events which the network has beeninsufficiently trained for and not trusted to producereliable outputs; and

2) Automated debiasing of a neural network trainingpipeline, leading to faster training convergence andincreased accuracy.

Our solution uses a Variational Autoencoder (VAE) net-work architecture comprised of two parts, an encoder and adecoder. The encoder is responsible for learning a mappingfrom raw sensor data directly to a low dimensional latent

Latent Space DecoderEncoder

Domain Knowledge1. Recover from o�-center and o�-orientation 2. Downsample straight roads

Data CollectionData collected from human drivers

Conv Deconv FC Mean Variance Rand Sample

k-1

0

1

0

1

k-1

0

1

k-1

Fig. 2: Novel VAE architecture for end-to-end control. Image features are extracted through convolutional layers toencode the input image into the variational latent space with one of the latent variables explicitly supervised to learnsteering control. The resulting latent variables are self-supervised by feeding the entire encoding into a decoder network thatlearns to reconstruct the original input image. Uncertainty is modeled through the variance of each latent variable (σ2

k).

space encoding that maximally explains as much of thedata as possible. The decoder is responsible for learning theinverse mapping that takes as input a single sample fromthe aforementioned latent space and reconstructs the originalinput. As opposed to a standard VAE model, which self-supervises the entire latent space, we also explicitly supervisea single dimension of the latent space to represent the roboticcontrol output.

We use end-to-end autonomous driving as the robotic con-trol use case. Here a steering control command is predictedfrom only a single input image. As a safety-critical task, au-tonomous driving is particularly well-suited for our approach.Control systems for autonomous vehicles, when deployed inthe real world, face enormous amounts of uncertainty andpossibly even environments that they have never encounteredbefore. Additionally, autonomous driving is a safety criticalapplication of robotics; such control systems must possessreliable ways of assessing their own confidence.

We evaluate our VAE-based approach on a challengingreal driving dataset of time synchronized image, control datasamples collected with a full scale autonomous vehicle anddemonstrate both novelty detection and automated debiasingof the dataset. Our algorithm includes a VAE-based indicatorto detect novel images that were not contained in the trainingdistribution. We demonstrate our algorithm’s ability to detectover 97% of novel image frames that were trained duringa different time of day and detect 100% of camera sensormalfunctions in our dataset.

We address training set data imbalance by introducingunsupervised latent variables into the VAE model. Theselatent variables qualitatively represent many of the high levelsemantic features of the scene. Moreover, we show that wecan estimate the distributions of the latent variables overthe training data, enabling automated debiasing of newlycollected datasets against the learned latent variables. Thisresults in accelerated and sample efficient training.

The remainder of the paper is structured as follows: wesummarize the related work in Sec. II, formulate the modelin Sec. III, describe our experimental setup and dataset inSec. IV, and provide an overview of our results in Sec. V.

II. RELATED WORKS

Traditional methods for autonomous driving first decom-pose the problem into smaller components, with an individualalgorithm applied to each component. These submodules canrange from mapping and localization [3], [4], to perception[5]–[7], planning [8], [9], and control [10], [11]. On theother hand, end-to-end systems attempt to collapse the entireproblem (from raw sensory data to control outputs) into asingle learned model. The ALVINN system [12] originallydemonstrated this idea by training a multilayer perceptron tolearn the direction a vehicle travel from pixel image inputs.Advancements in deep learning have caused a revival of end-to-end methods for self-driving cars [2], [13]–[15]. Thesesystems have shown enormous promise by outputting a singlesteering command and controlling a full-scale autonomousvehicle through residential and highway streets [2]. Thesystem in [13] learns a discrete probability distribution overpossible control commands while [15] applies the outputof their convolutional feature extractor to a long short-termmemory (LSTM) recurrent neural network for generatinga driving policy. However, all of these these models arestill largely trained as black-boxes and lack a measureof associated confidence in their output and method forinterpreting the learned features.

Understanding and estimating confidence of the output ofmachine learning models has been investigated in differentways: One can formulate training of DNNs as a maximumlikelihood problem using a softmax activation layer on theoutput to estimate probabilities of discrete classes [16] aswell as discrete probability distributions [17]. Introspectivecapacity has been used to evaluate model performance for avariety of commonly used classification and detection tasks[18] by estimating an algorithm’s uncertainty by the distancefrom train to test distribution in feature space. Bayesian deeplearning has even been applied to end-to-end autonomousvehicle control pipelines [19] and offers an additional wayto estimate confidence through Monte Carlo dropout sam-pling of weights in recurrent [20] and convolutional neuralnetworks [21]. However, Monte Carlo dropout sampling isnot always feasible in real-time on many hardware platforms

due to its repetitive inference calls. Additionally, there is noexplanation or interpretation of the resulting control actionsavailable, as well as no confidence metric of whether themodel was trained on similar data as the current input data.

Variational Autoencoders (VAEs) [22], [23] are gainingimportance as an unsupervised method to learn hidden rep-resentations. Such latent representations have been shownto qualitatively provide semantic structure underlying rawimage data [24]. Previous works have leveraged VAEs fornovelty detection [25], but did not directly use the pre-dicted distributions of the sensor for a model fit criterion.Conversely, our work presents an indicator to detect novelimages that were not contained in the training distribution byweighting the reconstructed image by the latent uncertaintypropagated through the network. High loss indicates that themodel has not been trained on that type of image and thus re-flects lower confidence in the network’s ability to generalizeto that scenario. Operating over input distributions divergingfrom the training distribution is potentially dangerous, sincethe model cannot sufficiently reason about the input data.

Additionally, learning in the presence of class imbalanceis one of the key challenges underlying all machine learningsystems. When the classes are explicitly defined and labeledit is possible to resample the dataset [26], [27] or evenreweight the loss function [28], [29] to avoid training bias.However, these techniques are not immediately applicableif the underlying class bias is not explicitly labeled (as isthe case in many real world training problems). While [30]demonstrates how K-means can be used to find clusterswithin the data before training to provide a loss reweightingscheme, this method does not adapt to the model duringtraining. Additionally, running K-means on high dimensionalimage data can be extremely computational and destroysall spatial information between nearby pixels. This paperprovides an algorithm for adaptive dataset debiasing duringtraining by learning the latent space distribution and sam-pling from the most representative regions of this space.

III. MODEL

In this section, we start by formulating the end-to-endcontrol problem and then describe our model architecturefor estimating steering control of an autonomous vehicle (inan end-to-end manner). We explain how our model allows usto self-supervise on a number of other latent variables whichqualitatively describe the driving dataset. All models in thispaper were trained on a NVIDIA Volta V100 GPU.

A. End-to-End Model FormulationWe start from the end-to-end model framework. In

this framework we observe n training images, X ={x1, . . . , xn}, which are collections of raw pixels from afront-facing video camera. We aim to build a model that candirectly map our input images, X , to output steering com-mands based on the curvature of the road Y = {y1, . . . , yn}.To train a single image model we optimized the meansquared error (MSE) or L2 loss function between the humanactuated control, yi, and the predicted control, yi:

Ly(y, y) =1

n

n∑i=1

(yi − yi)2 (1)

Note that only a single image is used as input at everytime instant. This follows from original observations wheremodels that were trained end-to-end with a temporal infor-mation (CNN+LSTM) are unable to decouple the underlyingspatial information from the temporal control aspect. Whilethese models perform well on test datasets, they face controlfeedback issues when placed on a physical vehicle andconsistently drift off the road.

B. VAE Architecture

We extend the end-to-end control model from Sec. III-A into a variational autoencoding (VAE) model. Unlike theclassical VAE model [22], [23], one particular latent vari-able is explicitly supervised to predict steering control, andcombined with the remaining latent variables to reconstructthe input image. Our model is shown in Fig. 2. The modelaccepts as input a 66× 200× 3 RGB image in mini-batchesof size n = 50. We use a convolutional network encoder,comprised of convolutional (Conv) and fully connected (FC)layers, to compute Q(z|X), the distribution of the latentvariables given a data point. The encoder has a similararchitecture as a standard end-to-end regression model andis detailed in Table I, with 5 convolutional layers andtwo fully connected layers with dropout. The latent spacesection of Fig. 2 shows the encoder outputting 2k activationscorresponding to µ ∈ Rk,Σ = Diag[σ] � 0 used to definethe distribution of z. Next, there is a decoder network thatmirrors the encoder (two FC layers with dropout and then 5de-convolutional layers) to reconstruct the image back fromthe latent space sample z.

In order to differentiate the outputs through the samplingphase, VAEs use a reparameterization “trick”, sampling ε ∼N (0, I) and compute z = µ(X) + Σ1/2(X) × ε. Thisallows us to train the encoder and decoder, end-to-end, usingbackpropagation.

In our VAE model we additionally supervise one of thelatent variables to take on the value of the curvature ofthe vehicle’s path. We represent this modified variable asz0 = zy = y. It is visible in Fig. 2 as the steering wheelgraphic. The loss function of our VAE has three parts –a reconstruction loss, a latent loss, and a supervised latentloss. The reconstruction loss is the L1 norm between theinput image and the output image, and serves the purpose ofTABLE I: Architecture of the encoder neural network.“Conv” and “FC” refer to convolutional and fully connectedlayers, while k denotes the number of latent variables.

Layer Output Activation Kernel Stride1. Input 66x200x3 Identity - -2. Conv1 31x98x24 ReLU 5x5 2x23. Conv2 14x47x36 ReLU 5x5 2x24. Conv3 5x22x48 ReLU 5x5 1x15. Conv4 3x20x64 ReLU 3x3 1x16. Conv5 1x18x64 ReLU+Flatten 3x3 1x17. FC1 1000 ReLU - -8. FC2 100 ReLU - -9. FC3 2 k Identity - -10. Latent Sample k Identity - -

training the decoder. We define this loss function as follows:

Lx(x, x) =1

n

n∑i=1

|xi − xi| (2)

The latent loss is the Kullback-Liebler divergence betweenthe latent variables and a unit Gaussian, providing regulariza-tion for the latent space [22], [23]. For Gaussian distributions,the KL divergence has the closed form

LKL(µ, σ) = −1

2

k−1∑j=0

(1− µ2

j − σ2j + log (σj + ρ)2

)(3)

where ρ = 10−8 is added for numerical stability.Lastly, the supervised latent loss is a new addition to the

loss function, and it is defined as the MSE between thepredicted and actual curvature of the vehicle’s path. It isthe same as Eq. 1 from Sec. III-A.

These three losses are summed to compute the total loss:

LTOTAL(·) = c1Ly(y, y)+ c2Lx(x, x)+ c3LKL(µ, σ) (4)

where c1, c2, and c3 are used weight the importance ofeach loss function. We found that c1 = 0.033, c2 = 0.1,c3 = 0.001 yielded a nice trade off in importance betweensteering control, reconstruction, and KL loss, wherein noone individual loss component overpowers the others duringtraining.

C. Novelty Detection

A crucial aspect underlying many deep learning systemsis their ability to reliably compute uncertainty measurementsand thus determine when they have low confidence in thegiven environment. Many standard end-to-end networks areseen as black-boxes, producing only a single control outputnumber at the final layer. In this section, we explore anovel method for using the architecture to detect drivingenvironments with insufficient training that cannot be trustedto produce a reliable decision.

We note the VAE architecture provides uncertainty es-timates for every variable in the latent space via theirparameters {µi, σi}k−1

i=0 . However, what we really desireare uncertainty estimates in the input image space sincethese observations are available at test time. We thereforepropagate the uncertainties through the remainder of thenetwork by sampling in the latent layer space and computingempirical uncertainty estimates in any of the future layers(including the final reconstructed image space).

This can be explained by treating the VAE encoder net-work as a posterior model inference of the parameters θ,z as samples from the posterior distribution inferred fromθ. Propagation of z into x results in a posterior predictivesample. A common model fit measure for θ would be

logP (x|θ); θ = {σi, µi}k−1i=0 (5)

Using a common pixel-wise independence approximationallows us to get a model rejection criteria based on theimages, using a weighted L2 error. As commonly done inimage processing, we use instead the L1 norm due to its

robustness. Averaging over image pixels yields the errorterm:

D(x, x) =

⟨∣∣x(p) − E[x(p)|θ

]∣∣√Var[x(p)|θ

]⟩

(6)

where E[x(p)|θ

],Var

[x(p)|θ

]are computed by sampling

values of z and computing statistics for the resulting decodedimages x. Additionally, x(p) denotes the value of pixel p inimage x. The approach for pixelwise uncertainty estimatesis similar to the unscented transform [31], and is capturedin Algorithm 1. Using these statistics indicates whetheran observed image is well captured by our trained model.In practice, we implemented a real-time version of thisalgorithm using 20 samples on each time iteration.

Algorithm 1 Compute pixel-wise expectation and variance

Require: Input image x, Encoder NN, & Decoder NN1: {σi, µi}ki=1 ← Encoder(x)2: θ ← {σi, µi}ki=1

3:4: for j = 1 to T do5: for i = 1 to k do6: Sample zi ∼ N (µi, σ

2i )

7: end for8: xj = Decoder(z)9: end for

10:11: E[x(p)|θ] = 1

T

∑Tj=1 x

(p)j

12: Var[x(p)|θ] = 1T

∑Tj=1

(x(p)j − E[x(p)|θ]

)213:14: return E[x|θ], Var[x|θ]

D. Increased Training on Rare Events

A majority of driving data consists of straight road driving,while turns and curves are vastly underrepresented. End-to-end networks training often [2] handles this by resamplingthe training set to place more emphasis on the rarer events(i.e., turns). We generalize this notion to the latent spaceof our VAE model, better exploring the space of bothcontrol events and nuisance factors, for not only the steeringcommand but all other underlying features. By feeding theoriginal training data through the learned model we estimatethe training data distribution Q(z|X) in the latent space.Subsequently, it is possible to increase the proportion of rarerdatapoints by dropping overrepresented regions of the latentspace. We approximate Q(z|X) as a histogram Q(z|X),where z is the output of the Encoder NN corresponding to theinput images x ∈ X . To avoid data-hungry high-dimensional(in our case 25 dimensions) histograms, we further simplifyand utilize independent histograms Qi(zi|X) for each la-tent variable zi and approximate Q(z|X) ∝

∏i Qi(zi|X).

Naturally, we would like to train on a higher number ofunlikely training examples and drop many samples over-represented in the dataset. We therefore train on a subsam-pled training set Xsub including datapoints x with probabilityW (z(x)|X) ∝

∏i 1/(Qi(zi(x)|X) + α). For small α the

subsampled training distribution is close to uniform over z,whereas large α keep the subsampled training distributioncloser to the original training distribution. At every epoch allimages x from the original dataset X are propagated throughthe (learned) model to evaluate the corresponding latentvariables z(x). The histograms Qi(zi(x)|X) are updatedaccordingly. A new subsampled training set Xsub is drawn bykeeping images x from the original dataset X with likelihoodW (z(x)|X). Training on the subsampled data Xsub nowforces the classifier into a choice of parameters that workbetter in rare cases without strong deterioration of perfor-mance for common training examples. Most importantly, thedebiasing is not manually specified beforehand but based onlearned latent variables.

IV. DATASET

The vehicle base platform employed to collect the datasetis a Toyota Prius 2015 V used in collaboration with the MIT-Toyota partnership. A forward facing Leopard Imaging LI-AR0231-GMSL camera, which has a field-of-view of 120degrees and captures 1080p RGB images at approximately30Hz, is the vision data source for this study. Sensors alsocollected the human actuated steering wheel angle (rad) aswell as vehicle speed (m/sec). A Xsens MTi 100-seriesInertial Measurement Unit (IMU) was used to collect ac-celeration, rotation, and orientation data from the rigid bodyframe and thus compute the curvature of the vehicle’s path.Specifically, given a yaw rate γi (rad/sec), and the speed ofthe vehicle, vi (m/sec), we compute the curvature (or inverseradius) of the path as yi = γi

vi. Note that we can model a

simple relationship between steering wheel angle and roadcurvature given the vehicle slip angle by approximating thevehicle according to the Bicycle Model. While we employcurvature (yi) to model our networks, for the remainderof this paper we will use the terms “steering control” and“curvature” interchangeably, since we can compute one giventhe other by estimating slip angle directly from IMU data.

All communication and data logging was done directlyon an NVIDIA PX2, which is an in-car supercomputerspecifically developed for autonomous driving. As part ofdata collection for this project we setup the PX2 to con-nect, communicate, and control a full-scale Toyota Priuswith drive-by-wire. Additionally, we developed the softwareinfrastructure to read the video stream and synchronize withthe other data streams (inertial and steering data) on the PX2.

We drove the vehicle in the Boston metropolitan area andcollected data for approximately 4 hours (which was split3/1 into training and test sets). In the following subsection,we outline the data processing and augmentation techniquesthat were performed prior to training our models.

A. Processing and AugmentationA number of preprocessing steps were taken to clean and

prepare the data for training. First, we annotated each frameof collected video according to the time of day, road type,weather, and maneuver (lane-stable, turn, lane change, junk).This labeling process allowed us to segment out the piecesof our data which we wanted to train with. Note the labelswere not used as part of the training process, but only fordata filtering and preprocessing. Following previous work [2]

0

10

20

110

120

130

0.20.40.60.8

11.2

120

130

140

Iterations x104 x1040 1 20 1 2Iterations

Training Set Validation Set

Fig. 3: Loss evolution. Convergence of our four lossfunctions (defined in Eqs. [1]-[4]) on the train (blue) andvalidation (red) data.

we used only “lane-stable” in our dataset, which correspondsto sections of driving where the vehicle is stable in its lane.

To inject domain knowledge into our network we aug-mented the dataset with images collected from camerasplaced approximately 2 feet to the left and right of the maincenter camera. We correspondingly changed the supervisedcontrol value to teach the model how to recover from off-center positions. For example, an image collected from theright camera we perturb the steering control with a negativenumber to steer slightly left, and vice versa. We add imagesfrom all three cameras (center, left, and right) to our datasetfor training.

B. Optimization

We trained our models with the Adam optimizer [32]with α = 10−4, β1 = 0.9, β2 = 0.999, and ε = 10−8.We considered the number of latent variables, k, to be ahyperparameter and trained models with 400, 100, 50, 25,and 15 latent variables. By analyzing the validation errorupon convergence, we were able to identify that the modelwith 25 latent variables provided the best fit of our datasetwhile providing realistic reconstructions. Therefore, we usek = 25 for the rest of our analysis. Fig. 3 shows the evolutionof all four losses (as defined in Eqs. 1-4) for the training andvalidation sets over 2.6× 104 steps. The MSE steering lossconverges to nearly 0 for training but decreases more slowlyfor the validation set. Meanwhile, the KL divergence lossof Eq. 3 increases rapidly before plateauing at a relativelylow value for both training and validation data. This is to beexpected, since the latent variable distributions are initializedas N (0, 1) and are then perturbed to find approximationsof the Gaussian that allow the distributions to best reducethe overall loss. Since we use 25 latent variables to modelan image with 66 × 200 × 3 = 39, 600 dimensions, it isexpected that the decoder cannot exactly reproduce the inputimage (i.e. Lx(x, x) > 0).

V. RESULTS

A. Steering Control Accuracy

To evaluate the performance of our model for the task ofend-to-end autonomous vehicle control of steering we startby training a standard regression network which takes asinput a single image and outputs steering curvature [2]. The

Worst Steering Prediction Best Steering Prediction

Mean Uncertainty: Mean Uncertainty:

Fig. 4: Bad apple analysis. Test images with the best (right)and worst (left) steering error.

architecture of this model is almost identical to the first ninelayers of our encoder neural network with the exceptionof the final layer being only one neuron (as opposed to2k neurons). We found that the training loss of both theregression and our VAE network were almost exactly thesame. The steering validation loss for the VAE was roughly4.4, versus a value of 3.8 for the regression model loss.Therefore the loss is 14% higher, corresponding to a meanerror in steering wheel angle of only 1.4 degrees. Therefore,we use the VAE model to predict both a steering controland uncertainty measure with roughly equal accuracy as thebaseline regression model but simultaneously gain all of theadditional advantages from the learned latent encodings.

Fig. 4 shows the images associated with the best andworst steering predictions. The mean uncertainty of the bestpredictions was 9.3 × 10−4 vs 1.7 × 10−2 for the worstpredictions, indicating that our model can indeed predictwhen its estimated steering angle is uncertain. The imagesassociated with the worst steering position are mostly fromwhen the car is at a high angle of incidence towards thecenter of the road, representing a small portion of our trainingdataset. On the other hand, the images with the best steeringprediction are mostly straight roads with strongly definedlanes, probably because our dataset has many examples ofstraight roads (despite debiasing) and because lane markersare a key feature for predicting steering angle.

Fig. 5 corroborates these results. The first chart showsthat there is a strong correlation between the true steeringand the estimated steering. Interestingly, the values of theestimated steering fan out at the extreme values, showing that1) our dataset is sparse at extreme steering values and 2) thatthe uncertainty of the prediction should increase at extremesteering values. The second chart shows the relationshipbetween true steering and the estimated uncertainty, and itindeed shows that uncertainty increases as the absolute valueof the steering increases. Although this shows a weak pointof a dataset – that it is sparse in images associated with largesteering angle – it shows a strong point of the VAE model,which is that it is able to predict that there is high uncertaintyfor large values of the steering angle. Additionally, suchuncertainty at greater turns makes intuitive sense since largerroad curvatures imply less future road is visible in the image.For example, on extreme turns it is common for less than10m to be visible in image, whereas straight road imagespresent can present visible far more into the future closerto 100m. The fact that we can see less of the future roadsupports the increased uncertainty in such scenarios.

0

4

8

-0.1 0 +0.1-0.1 0 +0.1

Est.

Stee

ring

Est.

Unce

rtaint

y

-0.1

0

+0.1

True Steering True Steering

Fig. 5: Precision and uncertainty estimation. Plots of truevs. estimated steering (left) and true steering vs. estimateduncertainty (right). Both show that the model variance tendsto increase on larger turns (i.e., greater steering magnitude).

B. Latent Variables

In this subsection, we analyze the resulting latent spacethat our encoder learned. We start by gauging the underlyingmeaning of each of the latent variables. We synthesizeimages using our decoder, starting with a vector in thelatent space, and perturb a single latent variable as wereconstruct output images. By varying the perturbation wecan understand how that specific latent variable effects theimage. The results for an examplary set of latent variablesis shown in Fig. 6. We observe that the network is ableto generate intuitive representations for lane markings andadditional environmental structure such as other surroundingvehicles and weather without ever actively being told todo so. By identifying latent variable representations we canimmediately observe what the network sees and explain howthe corresponding steering control is derived.

Intuitively, we know that the network will be penalizedfor having redundant latent variables due to the fact that thereconstruction loss penalizes reconstructed images that donot resemble the input image. This means that the encodershould learn a latent representation of the input such that asmuch distinct information is explained away as possible. Thiscauses the variables learned to be the most important under-lying variables in the dataset so the decoder can reconstructthe image from such a low dimensional space.

C. Detecting out of sample environments

Next, we examined ways to interpret if our networkis confident in its end-to-end control prediction. Figure 7shows sample pixel-wise uncertainties (red) obtained fromthis method overlaid on top of the respective input images.As one might expect, details around adjacent vehicles, thedistant horizon and the presence of snow next to the roadhighlight the greatest uncertainty. This make sense since themodel is constrained to only 25 latent variables and doesnothave capacity to hold any of these less meaningful details.

We can now plot the distribution of this distance overdatasets on which the network received sufficient and insuf-ficient training data. To be very explicit, we train a networkon daytime driving and detect novel nighttime driving. Sincethe network has not been trained with night data, we shouldnot trust the output and therefore want to understand, asbest as possible, when nighttime data is being fed into thenetwork. We set a simple threshold γ95 at the 95th percentileof D(x, x) for every x in the entire training set. For any

Fig. 6: Latent variable perturbation. A selection of six learned latent variables with associated interpretable descriptors(left labels). Images along the x-axis (from left to right) were generated by linearly perturbing the latent vector encodingalong that single latent dimension. While steering command (top) was a supervised latent variable, all others (bottom five)were entirely unsupervised and learned by the model from the dataset.

Fig. 7: Propagating uncertainty into pixel-space. Sampleimages from the dataset along with pixel-wise uncertaintyestimates. Uncertain pixels are highlighted red.

subsequent datasample xtest, we classify it as novel if

D(xtest, xtest) > γ95 (7)

Figure 8 illustrates the training set distribution (day-time driving) in blue as well as the extracted thresholdγ95. When image frames collected at night are fed throughnetwork we observe the orange distribution and are able tocorrectly classify 97% of these frames as novel. Furthermore,we experiment with frames collected during image sensormalfunctions. These malfunctions are a realistic and commonfailure mode for the AR0231 sensor caused from incorrectwhite balance estimation. Feeding such images through thenetwork can be catastrophic and yield unpredictable controlresponses. Using our metric D(x, x) we see that the distri-bution of these images (green) fall far from the training setand we can successfully detect 100% of these frames.

D. Debiasing the modelTo evaluate the effect of debiasing during training we train

two models, one without debiasing the dataset for inherent la-tent imbalances and once again now subsampling our datasetto reduce these over-represented (i.e., biased) samples. Onevery epoch we sample only 50% of the dataset for trainingwhile the remaining data is discarded. Figure 9 illustrates theloss evolution throughout both training schemes. Note thatthe loss is computed on the original data distribution (and

Sensor Malfunction

Novel(Night)

Training(Day)

Histo

gram

Co

unt

0

300

60

120

180

240

10987654321x107

Fig. 8: Novelty Detection. Feeding training distribution data(day time) through the network as well as a novel datadistribution (night time). A third peak forms when the camerasensor malfunctions from poor illumination.

OriginalDebiased

Iterations0 2 4 6 8 10 12

x103

0

50

100

Fig. 9: Accelerated training with debiasing. Comparisonof training loss evolution with/without automated debiasing.

not the subsampled distribution), since we ultimately onlycare about our performance on the original data. Debiasingthe training pipeline allows the model to focus on events thatare typically more rare (and inherently more difficult sincethey occur less frequently). This results in training that ismore data efficient (using only 50% of the data), and also

faster than standard training. Figure 9 shows a minimum lossof 20 achieved after roughly half as many training iterationscompared to training on the original data distribution.

VI. CONCLUSION

This paper presents a novel deep learning-based algo-rithm for end-to-end autonomous driving that also capturesuncertainty through an intermediate latent representation.Specifically, we built a learning pipeline that computes asteering control command directly from raw pixel inputs ofa front facing camera. Compared to previous research onend-to-end driving, our model also captures uncertainty ofthe final control prediction. We treat our input image data asbeing modeled by a set of underlying latent variables (one ofwhich is the steering command taken by a human driver) witha VAE architecture. Additionally, we propose novel methodfor detecting novel inputs which have not been sufficientlytrained for by propogating the VAE’s latent uncertaintythrough the decoder. Finally, we provide an algorithm fordebiasing against learned biases based on the unsupervisedlatent space. By adaptively subsampling half of the datasetthroughout training, we remove the over-represented latentregions and empirically observe 2× training speedups.

While the steering command latent variable is explicitlysupervised by ground truth human data and can be used tocontrol the vehicle, we also learn many other latent variablesin an unsupervised manner. We show that these unsupervisedlatent variables represent concrete and interpretable featuresin the driving scene, such as presence of different lanemarkers, surrounding vehicles, and even the weather.

Our approach is scalable to massive driving datasets sinceit does not require any manual data-labeling of the supervisedsignals. While previous work in end-to-end learning presentsa form of reactionary control, lane following, and objectavoidance, this technique encodes a much richer set ofinformation in a probability distribution. Furthermore, sinceautonomous driving is an extremely safety critical applicationof AI, the uncertainty measurements that we provide areabsolutely crucial for end-to-end techniques to be deployedonto real-world roads.

REFERENCES

[1] W. Schwarting, J. Alonso-Mora, and D. Rus, “Planning and decision-making for autonomous vehicles,” Annual Review of Control, Robotics,and Autonomous Systems, vol. 1, pp. 187–210, 2018.

[2] M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp,P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, et al., “End toend learning for self-driving cars,” arXiv preprint arXiv:1604.07316,2016.

[3] J. Levinson and S. Thrun, “Robust vehicle localization in urbanenvironments using probabilistic maps,” in Robotics and Automation(ICRA), 2010 IEEE International Conference on, 2010.

[4] S. Thrun, D. Fox, W. Burgard, and F. Dellaert, “Robust monte carlolocalization for mobile robots,” Artificial intelligence, vol. 128, no.1-2, pp. 99–141, 2001.

[5] A. Amini, B. Horn, and A. Edelman, “Accelerated Convolutions forEfficient Multi-Scale Time to Contact Computation in Julia,” arXivpreprint arXiv:1612.08825, 2016.

[6] A. A. Assidiq, O. O. Khalifa, M. R. Islam, and S. Khan, “Real timelane detection for autonomous vehicles,” in Computer and Communi-cation Engineering, 2008. ICCCE 2008. International Conference on.IEEE, 2008, pp. 82–88.

[7] Z. Kim, “Robust lane detection and tracking in challenging scenarios,”IEEE Transactions on Intelligent Transportation Systems, vol. 9, no. 1,pp. 16–26, 2008.

[8] U. Lee, S. Yoon, H. Shim, P. Vasseur, and C. Demonceaux, “Localpath planning in a complex environment for self-driving car,” in CyberTechnology in Automation, Control, and Intelligent Systems (CYBER),2014 IEEE 4th Annual International Conference on, 2014.

[9] C. Richter, W. Vega-Brown, and N. Roy, “Bayesian learning for safehigh-speed navigation in unknown environments,” in Proceedings ofthe International Symposium on Robotics Research (ISRR 2015), SestriLevante, Italy, 2015.

[10] W. Schwarting, J. Alonso-Mora, L. Paull, S. Karaman, and D. Rus,“Safe nonlinear trajectory generation for parallel autonomy with adynamic vehicle model,” IEEE Transactions on Intelligent Transporta-tion Systems, no. 99, pp. 1–15, 2017.

[11] P. Falcone, F. Borrelli, J. Asgari, H. E. Tseng, and D. Hrovat,“Predictive active steering control for autonomous vehicle systems,”IEEE Transactions on control systems technology, 2007.

[12] D. A. Pomerleau, “ALVINN, an autonomous land vehicle in a neuralnetwork,” Carnegie Mellon University, Computer Science Department,Tech. Rep., 1989.

[13] A. Amini, L. Paull, T. Balch, S. Karaman, and D. Rus, “Learningsteering bounds for parallel autonomous systems,” in 2018 IEEEInternational Conference on Robotics and Automation (ICRA), 2018.

[14] H. Xu, Y. Gao, F. Yu, and T. Darrell, “End-to-end learning ofdriving models from large-scale video datasets,” arXiv preprintarXiv:1612.01079, 2016.

[15] J. Zhang and K. Cho, “Query-efficient imitation learning for end-to-end autonomous driving,” CoRR, vol. abs/1605.06450, 2016.

[16] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classificationwith deep convolutional neural networks,” in Advances in neuralinformation processing systems, 2012, pp. 1097–1105.

[17] J. Zhang, J. T. Springenberg, J. Boedecker, and W. Burgard, “Deepreinforcement learning with successor features for navigation acrosssimilar environments,” CoRR, vol. abs/1612.05533, 2016.

[18] H. Grimmett, R. Triebel, R. Paul, and I. Posner, “Introspective classi-fication for robot perception,” The International Journal of RoboticsResearch, vol. 35, no. 7, pp. 743–762, 2016.

[19] A. Amini, A. Soleimany, S. Karaman, and D. Rus, “Spatial UncertaintySampling for End-to-End control,” in Neural Information ProcessingSystems (NIPS); Bayesian Deep Learning Workshop, 2017.

[20] Y. Gal and Z. Ghahramani, “A theoretically grounded applicationof dropout in recurrent neural networks,” in Advances in neuralinformation processing systems, 2016, pp. 1019–1027.

[21] Y. Gal and Z. Ghahramani, “Bayesian convolutional neural networkswith bernoulli approximate variational inference,” arXiv preprintarXiv:1506.02158, 2015.

[22] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,”arXiv preprint arXiv:1312.6114, 2013.

[23] D. J. Rezende, S. Mohamed, and D. Wierstra, “Stochastic backprop-agation and approximate inference in deep generative models,” arXivpreprint arXiv:1401.4082, 2014.

[24] Y. Pu, Z. Gan, R. Henao, X. Yuan, C. Li, A. Stevens, and L. Carin,“Variational autoencoder for deep learning of images, labels andcaptions,” in Advances in neural information processing systems, 2016,pp. 2352–2360.

[25] C. Richter and N. Roy, “Safe visual navigation via deep learningand novelty detection,” in Proc. of the Robotics: Science and SystemsConference, 2017.

[26] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer,“Smote: synthetic minority over-sampling technique,” Journal of arti-ficial intelligence research, vol. 16, pp. 321–357, 2002.

[27] A. More, “Survey of resampling techniques for improving clas-sification performance in unbalanced datasets,” arXiv preprintarXiv:1608.06048, 2016.

[28] Z.-H. Zhou and X.-Y. Liu, “Training cost-sensitive neural networkswith methods addressing the class imbalance problem,” IEEE Trans-actions on Knowledge and Data Engineering, vol. 18, no. 1, pp. 63–77,2006.

[29] S. Suresh, N. Sundararajan, and P. Saratchandran, “Risk-sensitive lossfunctions for sparse multi-category classification problems,” Informa-tion Sciences, vol. 178, no. 12, pp. 2621–2638, 2008.

[30] G. H. Nguyen, A. Bouzerdoum, and S. L. Phung, “A supervisedlearning approach for imbalanced data sets,” in Pattern Recognition,2008. ICPR 2008. 19th International Conference on. IEEE, 2008,pp. 1–4.

[31] E. A. Wan and R. Van Der Merwe, “The unscented kalman filterfor nonlinear estimation,” in Adaptive Systems for Signal Processing,Communications, and Control Symposium 2000. AS-SPCC. The IEEE2000. Ieee, 2000, pp. 153–158.

[32] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,”arXiv preprint arXiv:1412.6980, 2014.

Documents

Variational Autoencoder for End-to-End Control of ... · Variational Autoencoder for End-to-End Control of Autonomous Driving with Novelty Detection and Training De-biasing Alexander