Proceedings of the 5th International Symposium on Future Active Safety Technology toward Zero Accidents (FAST-zero '19)
September 9-11, 2019, Blacksburg, VA, USA

End-to-End Driving using Point Cloud Features
Shunya Seiya, Alexander Carballo, Eijiro Takeuchi and Kazuya Takeda

Nagoya University

Furo-cho, Chikusa, Nagoya, Aichi, 464-8601, Japan
Phone: +81-52-789-3647

Fax: +81-52-789-3172
E-mail: [email protected]

June 28, 2019

Keyword(s): Autonomous Driving, End-to-End Driving, Deep learning, End-to-End learning, Point Cloud

Abstract

With the goal of making the driving experience safer and more comfortable, the field of autonomous driving has become the focus of much research and investment. Several automotive manufacturers have developed model-based autonomous navigation systems, but optimizing the many parameters of the behavior planning, global path planning, local path planning and motion control modules has proven difficult, and adapting each module to various driving situations has also been a problem. Deep learning-based systems, on the other hand, have a simpler framework, so the optimization of many parameters, as well as the associated costs, can be avoided. Thus, in this study we explore a method of training a deep learning model capable of generating the necessary control signals to steer a vehicle autonomously, using a large amount of data about the surrounding environment obtained from various sensors installed on the vehicle.

In related research, Bojarski et al. used a convolutional neural network (CNN) to generate steering angles directly from frontal camera images, and demonstrated the effectiveness of this method by keeping an autonomous vehicle within a traffic lane. Other studies in the area of end-to-end driving have mostly used regular cameras, fisheye cameras or depth cameras. These camera-based approaches have drawbacks, however, such as limited fields of view, difficulty operating under low-contrast conditions and an inability to determine precise distances within the surrounding outdoor environment. LiDAR sensors, which use reflected laser pulses to scan the area around a vehicle, can overcome such limitations. LiDAR scanner data is used to create a 360 degree point cloud, which solves the limited field of view problem experienced in camera-based systems, and LiDAR data is more robust to changes in weather and to illumination issues in indoor and outdoor environments.

Our proposed deep learning method uses the features of LiDAR point cloud data as input for an end-to-end driving model. Grid maps and images from a frontal camera are also used as input for our model, which contains convolutional layers for each feature and a concatenation layer which merges these features. An output layer then generates the steering signal used to control the vehicle. We conducted an experiment to evaluate the ability of our proposed navigation method to follow trajectories, using a robot in an outdoor environment. Our results showed that our proposed model could follow a training trajectory more accurately than a camera-only model. Our proposed model also achieved higher motion stability when compared to our previously proposed, image-only, camera-based model, which suffered from large lateral oscillations.

I. Introduction

With the ultimate goal of making the driving experience safer and more comfortable, the field of autonomous driving has become the focus of much research and investment. Several automotive manufacturers have developed autonomous driving systems, and many research groups are conducting testing and demonstrations in real environments.

Most of these systems are model-based systems, in which the autonomous driving task is divided into perception, decision and planning modules, and each module is developed separately. These modules are then integrated to realize an autonomous driving system. Various approaches for accomplishing the tasks assigned to each of these modules have been researched for many years, for example localization using SLAM [1], obstacle detection using deep learning [2], and so on. Many companies have developed autonomous driving systems using this framework, but optimizing the many parameters of each module is difficult, as is redesigning each module to adapt to different driving situations.

Using a different approach, some researchers have developed learning-based autonomous driving systems, known as end-to-end driving systems. End-to-end models learn the relationships between sensing information (inputs) and actuation information (outputs), using human driving data. For example, an image collected at a certain point in time is fed into the model, and the model outputs the steering angle associated with that point in time. This framework is simple and avoids the cost of optimizing many parameters, thus the goal of our research is the realization of such a framework.

In previous research, end-to-end driving methods have been demonstrated in real-world driving scenarios. Bojarski et al. [3] used a convolutional neural network (CNN [4]) to generate steering angles directly from frontal camera images, and demonstrated the effectiveness of their method by keeping an autonomous vehicle within its lane. The Wayve.ai team [5] has also achieved autonomous navigation using an end-to-end navigation model in real-world driving situations [6]. The user inputs their destination using a smartphone, and the autonomous driving system controls their car using image and satellite navigation information.

Previous studies in the area of end-to-end driving have mostly used regular cameras, fisheye cameras or depth cameras. These camera-based approaches have drawbacks, such as a limited field of view, difficulty functioning in low-contrast environments and difficulty obtaining precise distances to objects in the surrounding outdoor environment. LiDAR sensors, which use reflected laser pulses to scan the area around the vehicle, can overcome such limitations. LiDAR scanner data is used to create a 360 degree point cloud, which solves the limited field of view problem experienced in camera-based systems, and LiDAR data is robust to changes in weather and to illumination issues in indoor and outdoor environments. For these reasons, the deep learning-based method proposed in this study is primarily a LiDAR-based end-to-end driving system, since such systems are more robust in outdoor driving scenarios.

In this paper, we introduce the use of point cloud features in end-to-end driving systems. Fig. 1 provides an overview of our proposed method. A LiDAR sensor (Velodyne VLP-16) provides 16 channels of point cloud data, and the point clouds from each channel are then converted into a grid map in which occupied cells = 1 and unoccupied cells = 0. Since grid maps provide low-resolution representations of point cloud features, our network can process this data at a lower cost than PointNet and 3D-CNN based models. These grid maps, as well as images from a frontal camera, are used as inputs for our model, which has convolutional layers for each feature, a concatenation layer that merges these features and an output layer which generates the steering signal that controls the vehicle.

II. Related works

We will now discuss in more detail some previous studies that have explored topics related to end-to-end driving. Pomerleau et al. [7] used a frontal camera on a vehicle and a fully connected (FC) neural network to generate steering angles, achieving over 400 meters of autonomous navigation. LeCun et al. [8] used a frontal camera on a mobile robot and a CNN to generate steering angles for autonomous obstacle avoidance.

Bojarski et al. [3] also used a CNN for autonomous driving on diverse types of roads, and achieved over 10 miles of lane keeping. For training data, they collected actual steering data and images from three cameras installed on the front of the vehicle (left, center and right), under different weather and road conditions. They tested their approach both on a simulator and with an actual vehicle, using a human driver to control braking and acceleration. Their study did not consider trajectories at intersections, however.

Shun et al. [9] and Shunya et al. [10] have also investigated end-to-end navigation methods using CNNs. Shun analyzed the influence of image features on the performance of controllers trained using a CNN, resulting in guidelines for feature selection which could be used to reduce computational costs. Shunya tested the effectiveness of data augmentation methods when using a three-camera data collection method and viewpoint conversion. We make use of the insights of these previous studies in this paper.

Shunya et al. [11] also proposed a method of end-to-end navigation which could be used along routes that include branching, verifying that it is possible to follow learned routes that include right and left turns using a CNN with image input only. It was also demonstrated that it is possible to follow unknown routes by inputting a target direction vector, and a proposed L2-norm method succeeded in reducing the redundancy of the target direction vector input. These results show that a CNN with image input only can learn to turn at a specific intersection, and that models can learn to turn at intersections in general through command inputs. Hubschneider et al. [12] and Codevilla et al. [13] have also proposed methods similar to our approach for intersection tasks.


Figure 1: Our proposed method.

Wayve.ai [5] has also achieved autonomous navigation using end-to-end driving in real-world driving scenarios [6]. The user inputs their destination using a smartphone, and the autonomous end-to-end driving system then directs their car using image and satellite navigation information. Possible driving scenarios are limited to single-lane driving only, without traffic lights; however, their system can handle other scenarios, including turning at intersections along the route.

Bansal et al. [14] proposed ChauffeurNet as a mid-to-mid driving method, i.e., inputs and outputs are pre- and post-processed in relation to the deep learning model, respectively, as opposed to end-to-end methods which use raw data as the input and output vehicle control signals. Bird's-eye view feature maps, including HD maps, the results of obstacle tracking, and traffic light information, are all input into the ChauffeurNet model to estimate the position of the ego vehicle 10 seconds in the future. As a result, this method can control real vehicles in complex scenarios that are difficult to navigate using an end-to-end method. ChauffeurNet uses bird's-eye view feature maps built from 3D information. We have made use of this insight in our proposed method as well.

Some researchers have used different types of cameras. Toromanoff et al. [15] proposed an end-to-end driving method which used a fisheye camera for data augmentation, in order to create images from additional viewpoints. As a result, this method can also be used to control vehicles along routes which include obstacles. Maqueda et al. [16] proposed an event camera based method, which obtains only the changed pixels. As a result, their method can estimate the steering angle better than methods based on black-and-white images.

Regarding deep learning models which use point cloud data as input, Wu et al. [17] proposed 3D ShapeNets. This method requires lengthy processing, however, because it is based on 3D convolution, so it would not be a good method to apply to end-to-end driving. Qi et al. [18] proposed PointNet, a method in which point cloud data is fed directly into a deep learning model. This method can be used for clustering and recognition of point clouds; however, the number of points in the point cloud dataset used in that study was 10 times smaller than the LiDAR point clouds used for end-to-end driving.

III. End-to-end driving using point cloud features

One problem which needs to be resolved if point cloud data is to be used for deep learning is how to input this data into the learning model. Qi et al. [18] proposed a neural network for point cloud data called PointNet, which uses point cloud data directly to estimate which class a point cloud belongs to, or to estimate a segmentation mask. However, the default PointNet input size is 10 times smaller than the size of the LiDAR-generated point clouds used for autonomous driving. Jing et al. proposed a method called 3D CNN for masking and labeling objects in point clouds. This method uses point cloud data as the input and then estimates segmentation masks for the detected objects. However, a lengthy amount of processing time is required for 3D CNN (O(n), where n is the size of the point cloud).

Figure 2: Data augmentation methods.

Our proposed method introduces the use of point cloud features for end-to-end driving. Fig. 1 shows an overview of our proposed method. A LiDAR sensor (Velodyne VLP-16) provides 16 channels of point cloud data. We convert the point clouds from each channel into a grid map, where occupied cells = 1 and unoccupied cells = 0. Since grid maps can represent point cloud features using a small amount of data, our network can process the data at a lower cost compared to PointNet and 3D-CNN. These grid maps and the images from the frontal camera are used as the inputs for our model, which has convolutional layers for each feature, and a concatenation layer which merges the grid map and image features. The output layer then generates the steering signals which control the vehicle. Our model has 5 × 2 convolutional layers and 4 fully connected layers. A rectified linear unit (ReLU) is used as the activation function.
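As an illustration of this grid map conversion, the following sketch (Python with NumPy) rasterizes one channel's points into a binary 100 × 100 map matching the 20 m × 20 m, 20 cm-cell configuration used in our experiment; the function and variable names are illustrative and are not taken from our actual implementation.

    import numpy as np

    def points_to_grid(points_xy, grid_range=20.0, cell_size=0.2):
        """Rasterize 2D points (meters, sensor frame) into a binary grid map.

        points_xy  : (N, 2) array of x, y coordinates from one LiDAR channel.
        grid_range : side length of the square map in meters (20 m here).
        cell_size  : edge length of one cell in meters (0.2 m here).
        Returns a (100, 100) array with occupied cells = 1, unoccupied cells = 0.
        """
        n_cells = int(grid_range / cell_size)              # 100 cells per side
        grid = np.zeros((n_cells, n_cells), dtype=np.uint8)
        # Shift coordinates so the sensor sits at the center of the map.
        idx = np.floor((points_xy + grid_range / 2.0) / cell_size).astype(int)
        # Keep only points that fall inside the mapped area.
        inside = np.all((idx >= 0) & (idx < n_cells), axis=1)
        grid[idx[inside, 0], idx[inside, 1]] = 1
        return grid

Applying this per channel yields 16 binary maps that can be stacked (or merged) as the grid map input of the convolutional branch.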

During the training of the CNN, we augment the camera images using viewpoint transformation, based on the method used by Shunya et al. We also used the camera calibration method proposed by Zhang et al. [19] to extract the intrinsic matrices of the camera, which are necessary for viewpoint transformation. We then created different images by rotating the viewpoint, and also created grid maps using each point rotation. As shown in Fig. 2, we used the Pure Pursuit algorithm [20] on each modified image to generate the necessary return path from the camera's viewpoint angle back to the robot's trajectory, and used the curvature of these calculated paths during supervised learning. Adding this data during training allowed us to include the control signals necessary for the robot to return to the center of its trajectory when it deviated from the route.
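To make the labeling step concrete, the curvature supervision for a rotated viewpoint can be obtained from the standard pure pursuit relation, curvature = 2y / (x^2 + y^2), where (x, y) is the look-ahead point expressed in the rotated frame. The sketch below illustrates this under those assumptions; it is not our exact implementation, and the sign convention is illustrative.

    import numpy as np

    def pure_pursuit_curvature(lookahead_xy, view_rotation_deg):
        """Curvature that steers a rotated viewpoint back onto the trajectory.

        lookahead_xy      : look-ahead point (x forward, y left) in the original frame,
                            e.g. the trajectory point a fixed number of frames ahead
                            (20 frames in Table 1).
        view_rotation_deg : rotation applied to the camera / grid map viewpoint
                            (e.g. -20 to +20 degrees in 5 degree steps).
        """
        theta = np.radians(view_rotation_deg)
        # Express the look-ahead point in the rotated (virtual) sensor frame.
        c, s = np.cos(-theta), np.sin(-theta)
        x = c * lookahead_xy[0] - s * lookahead_xy[1]
        y = s * lookahead_xy[0] + c * lookahead_xy[1]
        # Pure pursuit: curvature of the arc through the origin and the look-ahead point.
        return 2.0 * y / (x ** 2 + y ** 2)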

IV. Experiment

In our experiment, we evaluated the trajectory-following ability of our proposed method using a robot in an outdoor environment, as shown in Fig. 3 (left). The target trajectory is represented by the red lines, while the many branching points are shown in yellow. As mentioned previously, the robot was equipped with a frontal camera and a Velodyne VLP-16 LiDAR scanner. First, we collected a dataset which included images, point cloud data and the target trajectory. We collected data six times along the same trajectory, for a total of 27,437 frames of data. Next, we augmented this dataset using viewpoint conversion, creating images, grid maps and steering signals at -20, -15, -10, -5, 0, 5, 10, 15, and 20 degrees relative to the originals. The range of each grid map is 20 m × 20 m, and the size of each grid cell is 20 cm × 20 cm. We then divided this entire augmented dataset into a training dataset and a validation dataset and trained our proposed model. Three trajectories were used as the training dataset and the remaining three trajectories were used as the validation dataset. Finally, we evaluated our model's ability to follow the training trajectory using the robot. Table 1 shows the details of the experimental conditions.

During our evaluation, the robot completed the entire course three times. If the vehicle departed from its lane, it was restarted in the center of the lane. We used two indicators to evaluate our model's performance: the maximum distance traveled without the vehicle departing from its lane, and the average number of lane departures from the starting point to the goal. A longer maximum distance without a lane departure is better, and a smaller average number of departures is also better. Since our proposed system only outputs the steering angle, we set the speed to a constant value of 0.5 m/s.
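As a clarification of how these two indicators are computed, the sketch below derives both from the per-trial distances driven between lane departures (a hypothetical log format; the exact bookkeeping in our evaluation may differ).

    def evaluate_trial(in_lane_segments_m):
        """Indicators for one trial.

        in_lane_segments_m: distances (m) driven between consecutive lane departures,
        from the start to the goal; a trial with no departure has a single segment.
        """
        max_in_lane_distance = max(in_lane_segments_m)   # longer is better
        num_departures = len(in_lane_segments_m) - 1     # fewer is better
        return max_in_lane_distance, num_departures

    # Average number of departures over the three runs of one model (illustrative values).
    trials = [[120.0, 80.0], [150.0, 60.0], [200.0]]
    avg_departures = sum(evaluate_trial(t)[1] for t in trials) / len(trials)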

Fig. 3 (right) shows the results for one of our test runs, demonstrating that, when using our model, the robot could follow the training trajectory. Table 2 shows the evaluation results for our proposed model and for the image-only model. Comparing the two models, ours achieved a longer maximum distance without a lane departure and a smaller average number of departures than the image-only model. These results mean that our proposed model achieved higher motion stability compared to our previous image-only model, which suffered from large lateral oscillations.

Figure 3: Training environment and target course (left). Trajectory results of a test run (right).

Table 1: Conditions for experiment

  Training data size                      13,718 images × 9 directions
  Validation data size                    13,719 images × 9 directions
  CNN input                               RGB images & grid map from point cloud
  Image size                              200 pixels (horizontal) × 66 pixels (vertical)
  Grid map size                           100 × 100 cells
  CNN output                              Curvature
  CNN structure                           1 Norm. layer + 5 Conv. layers + 5 Conv. layers
                                          + Concatenation layer + 4 FC layers
  Data augmentation angle                 9 directions at 5 deg. intervals
  Look-ahead frame of data augmentation   20 frames
  Activation function                     ReLU
  Training epochs                         100
  Mini-batch size                         1
  Learning optimization                   Adam
  Learning rate                           0.00001
  Speed                                   0.5 m/s

Table 2: Experimental results

  Input                             Max in-lane distance [m]   Avg. # times off course per trial
  Image only                        106.9                      4.33
  Image and point cloud grid map    235.0                      0.67
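The structure summarized in Table 1 corresponds roughly to the following PyTorch sketch. The number of layers, the input sizes, the activation function and the optimizer settings follow the table; the kernel sizes, strides and layer widths are assumptions, since they are not specified above.

    import torch
    import torch.nn as nn

    class PointCloudDrivingNet(nn.Module):
        """Two-branch CNN: camera image + LiDAR grid map -> curvature."""
        def __init__(self):
            super().__init__()
            def branch(in_ch):
                # 5 convolutional layers with ReLU activations (widths are assumptions).
                return nn.Sequential(
                    nn.Conv2d(in_ch, 24, 5, stride=2), nn.ReLU(),
                    nn.Conv2d(24, 36, 5, stride=2), nn.ReLU(),
                    nn.Conv2d(36, 48, 5, stride=2), nn.ReLU(),
                    nn.Conv2d(48, 64, 3), nn.ReLU(),
                    nn.Conv2d(64, 64, 3), nn.ReLU(),
                    nn.Flatten(),
                )
            self.norm = nn.BatchNorm2d(3)      # normalization layer (placement is an assumption)
            self.image_branch = branch(3)      # 200 x 66 RGB image
            self.grid_branch = branch(1)       # 100 x 100 grid map
            self.head = nn.Sequential(         # 4 fully connected layers after concatenation
                nn.LazyLinear(100), nn.ReLU(),
                nn.Linear(100, 50), nn.ReLU(),
                nn.Linear(50, 10), nn.ReLU(),
                nn.Linear(10, 1),              # output: path curvature
            )

        def forward(self, image, grid_map):
            f_img = self.image_branch(self.norm(image))
            f_grid = self.grid_branch(grid_map)
            return self.head(torch.cat([f_img, f_grid], dim=1))

    # Training setup as in Table 1 (Adam, learning rate 1e-5, mini-batch size 1).
    model = PointCloudDrivingNet()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)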

V. Discussion

In this experiment, the grid map consisted of a 100 × 100 grid, in which each cell of the grid represented an area of 20 cm × 20 cm. In this setting, our system was able to stay in its lane and follow the target route; however, the robot had difficulty avoiding obstacles because, to do so, the model would need to learn to make large changes in behavior in situations where there are only a few changes in the grid map. If the proposed method is applied to obstacle avoidance, the grid map used in this study may need to be changed to an occupancy grid map. Also, if the point cloud data could be input directly, the model would be able to use all of the points in the point cloud for training, so it may be useful to apply PointNet, or to use a range image, which is the feature map available before the data is converted into a point cloud.
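For reference, the range-image representation mentioned above can be obtained by projecting each LiDAR return into a 2D grid indexed by laser ring and azimuth, which preserves every point without voxelization. A minimal sketch follows; the horizontal resolution and the array layout are illustrative assumptions.

    import numpy as np

    def point_cloud_to_range_image(points, rings, n_rings=16, h_res_deg=0.2):
        """Project LiDAR points into a range image (rows = rings, cols = azimuth bins).

        points : (N, 3) array of x, y, z in the sensor frame.
        rings  : (N,) array of the laser channel index (0 .. n_rings - 1) for each point.
        """
        n_cols = int(round(360.0 / h_res_deg))
        image = np.zeros((n_rings, n_cols), dtype=np.float32)
        azimuth = np.arctan2(points[:, 1], points[:, 0])            # -pi .. pi
        cols = ((azimuth + np.pi) / (2 * np.pi) * n_cols).astype(int) % n_cols
        ranges = np.linalg.norm(points, axis=1)
        image[rings, cols] = ranges                                  # one range value per cell
        return image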

VI. Conclusion

In this paper we proposed an end-to-end driving method using point cloud features. We converted LiDAR point clouds into grid map features, which provided a 360 degree field of view that is robust to weather and lighting conditions. We evaluated our model experimentally using a robot for a trajectory-following task in an outdoor environment. The results of our experiment demonstrated that the trajectory-following ability of our model is more robust than that of models using only camera images as input.

In future work, we will evaluate the efficiency of representing point cloud data using an occupancy grid map, and will also introduce PointNet into our end-to-end driving method.



References

[1] E. Takeuchi and T. Tsubouchi, "A 3-D scan matching using improved 3-D normal distributions transform for mobile robotic mapping," in Proc. of IEEE International Conference on Intelligent Robots and Systems (IROS 2006), pp. 3068–3073, 2006.

[2] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. Berg, "SSD: Single shot multibox detector," in Proc. of European Conference on Computer Vision (ECCV 2016), pp. 21–37, 2016.

[3] M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba, "End-to-end learning for self-driving cars," arXiv preprint arXiv:1604.07316, 2016.

[4] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Backpropagation applied to handwritten zip code recognition," Neural Computation, vol. 1, pp. 541–551, 1989.

[5] "Wayve.ai." https://wayve.ai, [Accessed: 2019/6/28].

[6] "Learning to drive like a human." https://wayve.ai/blog/driving-like-human, [Accessed: 2019/6/28].

[7] D. A. Pomerleau, "ALVINN: An autonomous land vehicle in a neural network," Technical Report CMU-CS-89-107, Carnegie Mellon University, 1989.

[8] Y. LeCun, U. Muller, J. Ben, E. Cosatto, and B. Flepp, "Off-road obstacle avoidance through end-to-end learning," in Proc. of Advances in Neural Information Processing Systems (NIPS 2005), 2005.

[9] S. Yang, W. Wang, C. Liu, K. Deng, and J. K. Hedrick, "Feature analysis and selection for training an end-to-end autonomous vehicle controller using the deep learning approach," in Proc. of Intelligent Vehicles Symposium (IV 2017), 2017.

[10] S. Seiya, D. Hayashi, E. Takeuchi, C. Miyajima, and K. Takeda, "Evaluation of deep learning-based driving signal generation methods for vehicle control," in Proc. of 4th Fast-ZERO International Symposium (Fast-ZERO 2017), 2017.

[11] S. Seiya, A. Carballo, E. Takeuchi, C. Miyajima, and K. Takeda, "End-to-end navigation with branch turning support using convolutional neural network," in Proc. of IEEE International Conference on Robotics and Biomimetics (ROBIO 2018), 2018.


[12] C. Hubschneider, A. Bauer, M. Weber, and J. M. Zollner, "Adding navigation to the equation: Turning decisions for end-to-end vehicle control," in Proc. of IEEE International Conference on Intelligent Transportation Systems (ITSC Workshop 2017), 2017.

[13] F. Codevilla, M. Muller, A. Lopez, V. Koltun, and A. Dosovitskiy, "End-to-end driving via conditional imitation learning," in Proc. of IEEE International Conference on Robotics and Automation (ICRA 2018), 2018.

[14] M. Bansal, A. Krizhevsky, and A. Ogale, "ChauffeurNet: Learning to drive by imitating the best and synthesizing the worst," arXiv preprint arXiv:1812.03079, 2018.

[15] M. Toromanoff, E. Wirbel, F. Wilhelm, C. Vejarano, X. Perrotton, and F. Moutarde, "End to end vehicle lateral control using a single fisheye camera," in Proc. of IEEE International Conference on Intelligent Robots and Systems (IROS 2018), 2018.

[16] A. Maqueda, A. Loquercio, G. Gallego, N. Garcia, and D. Scaramuzza, "Event-based vision meets deep learning on steering prediction for self-driving cars," in Proc. of IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2018), 2018.

[17] Z. Wu, S. Song, A. Khosla, F. Yu, L. Zhang, X. Tang, and J. Xiao, "3D ShapeNets: A deep representation for volumetric shapes," in Proc. of IEEE Computer Vision and Pattern Recognition (CVPR 2015), 2015.

[18] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, "PointNet: Deep learning on point sets for 3D classification and segmentation," in Proc. of IEEE Computer Vision and Pattern Recognition (CVPR 2017), 2017.

[19] Z. Zhang, "A flexible new technique for camera calibration," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 11, 2000.

[20] R. C. Coulter, "Implementation of the pure pursuit path tracking algorithm," Carnegie Mellon University Technical Report CMU-RI-TR-92-01, 1992.
