Computer Vision and Machine Learning for Autonomous Vehicles
by
Zhilu Chen
A Dissertation
Submitted to the Faculty
of the
WORCESTER POLYTECHNIC INSTITUTE
in partial fulfillment of the requirements for the
Degree of Doctor of Philosophy
in
Electrical and Computer Engineering
August 2017
APPROVED:
Prof. Xinming Huang, Major Advisor
Prof. Lifeng Lai
Prof. Haibo He
Abstract
The autonomous vehicle is an engineering technology that can improve transporta-
tion safety, alleviate traffic congestion and reduce carbon emissions. Research on
autonomous vehicles can be categorized by functionality, for example, object detec-
tion or recognition, path planning, navigation, lane keeping, speed control and driver
status monitoring. The research topics can also be categorized by the equipment or
techniques used, for example, image processing, computer vision, machine learning,
and localization. This dissertation primarily reports on computer vision and machine
learning algorithms and their implementations for autonomous vehicles. The vision-
based system can effectively detect and accurately recognize multiple objects on the
road, such as traffic signs, traffic lights, and pedestrians. In addition, an autonomous
lane keeping system has been proposed using end-to-end learning. In this disserta-
tion, a road simulator is built from collected real-world data with augmentation, which
can be used for training and evaluating autonomous driving algorithms.
The Graphics Processing Unit (GPU) based traffic sign detection and recogni-
tion system can detect and recognize 48 traffic signs. The implementation has three
stages: pre-processing, feature extraction, and classification. A highly optimized and
parallelized version of Histogram of Oriented Gradients (HOG) and Support Vector
Machine (SVM) is used. The system can process 27.9 frames per second at a
resolution of 1,628 × 1,236 active pixels, with minimal loss of accuracy.
In an evaluation using the BelgiumTS dataset, the experimental results indicate that
the detection rate is about 91.69% with false positives per window of 3.39 × 10⁻⁵, and
the recognition rate is about 93.77%.
We report on two traffic light detection and recognition systems. The first sys-
tem detects and recognizes red circular lights only, using image processing and SVM.
Its performance is better than that of traditional detectors, achieving 96.97%
precision and 99.43% recall. The second system is more
complicated. It detects and classifies different types of traffic lights, including green
and red lights in both circular and arrow forms. In addition, it employs image process-
ing techniques, such as color extraction and blob detection, to locate the candidates.
Subsequently, a pre-trained PCA network is used as a multi-class classifier for obtain-
ing frame-by-frame results. Furthermore, an online multi-object tracking technique is
applied to overcome occasional misses and a forecasting method is used to filter out
false positives. Several additional optimization techniques are employed to improve
the detector performance and to handle the traffic light transitions.
A multi-spectral data collection system is implemented for pedestrian detection,
which includes a thermal camera and a pair of stereo color cameras. The three cameras
are first aligned using trifocal tensor, and the aligned data are processed by using
computer vision and machine learning techniques. Convolutional channel features
(CCF) and the traditional HOG+SVM approach are evaluated over the data captured
from the three cameras. Through the use of trifocal tensor and CCF, training becomes
more efficient. The proposed system achieves only a 9% log-average miss rate on our
dataset.
The autonomous lane keeping system employs an end-to-end learning approach for
obtaining the proper steering angle for maintaining a car in a lane. The convolutional
neural network (CNN) model uses raw image frames as input, and it outputs the
steering angles corresponding to the input frames. Unlike the traditional approach,
which manually decomposes the problem into several parts, such as lane detection,
path planning, and steering control, the model learns to extract useful features on
its own and learns to steer from human behavior. Moreover, we find that having
a simulator for data augmentation and evaluation is essential. We then
build the simulator using image projection, vehicle dynamics, and vehicle trajectory
tracking. The test results reveal that the model trained with augmented data using
the simulator has better performance and achieves about a 98% autonomous driving
time on our dataset.
Furthermore, a vehicle data collection system is developed for building our own
datasets from recorded videos. These datasets are used in the above studies and
have been released to the public for autonomous vehicle research. The experimental
datasets are available at http://computing.wpi.edu/Dataset.html.
Acknowledgements
I would like to express my gratitude to my advisor, Professor Xinming Huang, for
the opportunity to do research at WPI and his guidance in my research.
Thanks to Professors Haibo He, Lifeng Lai, and many other professors for their
help. I have learned a lot from them.
Thanks to my family and my friends for giving me courage and confidence.
Contents
Abstract i
Acknowledgements iv
Contents ix
List of Tables x
List of Figures xv
List of Abbreviations xvii
1 Introduction 1
1.1 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Background 10
2.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Object detection and recognition . . . . . . . . . . . . . . . . . . . . 13
2.2.1 Traffic sign . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.2 Traffic light . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.3 Pedestrian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 Lane keeping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3 A GPU-Based Real-Time Traffic Sign Detection and Recognition
System 21
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Traffic Sign Detection and Recognition System . . . . . . . . . . . . . 23
3.2.1 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2.2 Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.3 Traffic Sign Detection . . . . . . . . . . . . . . . . . . . . . . 26
3.2.4 Traffic Sign Recognition . . . . . . . . . . . . . . . . . . . . . 29
3.3 Parallelism on GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4 Automatic Detection of Traffic Lights Using Support Vector Ma-
chine 36
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2 Proposed Method for Traffic Light Detection . . . . . . . . . . . . . . 38
4.2.1 Locating candidates based on color extraction . . . . . . . . . 38
4.2.2 Traffic light detection using template matching . . . . . . . . . 38
4.2.3 An improved method using SVM . . . . . . . . . . . . . . . . 40
4.3 Data Collection and Performance Evaluation . . . . . . . . . . . . . . 43
4.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5 Accurate and Reliable Detection of Traffic Lights Using Multi-Class
Learning and Multi-Object Tracking 48
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.2 Data Collection and Experimental Setup . . . . . . . . . . . . . . . . 51
5.2.1 Training data . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.2.2 Test data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.3 Proposed Method of Traffic Light Detection and Recognition . . . . . 58
5.3.1 Locating candidates based on color extraction . . . . . . . . . 60
5.3.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.3.2.1 PCANet . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.3.2.2 Recognizing green traffic lights using PCANet . . . 66
5.3.2.3 Recognizing red traffic lights using PCANet . . . . . 69
5.3.3 Stabilizing the detection and recognition output . . . . . . . 69
5.3.3.1 The problem of frame-by-frame detection . . . . . . 69
5.3.3.2 Tracking and data association . . . . . . . . . . . . . 71
5.3.3.3 Forecasting . . . . . . . . . . . . . . . . . . . . . . . 72
5.3.3.4 Minimizing delays . . . . . . . . . . . . . . . . . . . 74
5.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.4.1 Detection and recognition . . . . . . . . . . . . . . . . . . . . 76
5.4.2 False positives evaluation . . . . . . . . . . . . . . . . . . . . . 78
5.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.5.1 Comparison with related work . . . . . . . . . . . . . . . . . 79
5.5.2 Limitation and plausibility . . . . . . . . . . . . . . . . . . . . 80
5.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6 Pedestrian Detection for Autonomous Vehicle Using Multi-spectral
Cameras 84
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.2 Data Collection and Experimental Setup . . . . . . . . . . . . . . . . 87
6.2.1 Data Collection Equipment . . . . . . . . . . . . . . . . . . . 87
6.2.2 Data Collection and Experimental Setup . . . . . . . . . . . . 90
6.3 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.3.2 Trifocal tensor . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.3.3 Sliding windows vs. region of interest . . . . . . . . . . . . . 93
6.3.4 Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.3.5 Information fusion . . . . . . . . . . . . . . . . . . . . . . . . 99
6.3.6 Additional constraints . . . . . . . . . . . . . . . . . . . . . . 100
6.3.6.1 Disparity-size . . . . . . . . . . . . . . . . . . . . . . 100
6.3.6.2 Road horizon . . . . . . . . . . . . . . . . . . . . . . 100
6.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7 End-to-End Learning for Lane Keeping of Self-Driving Cars 109
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.2 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.2.1 Data pre-processing . . . . . . . . . . . . . . . . . . . . . . . . 111
7.2.2 CNN implementation details . . . . . . . . . . . . . . . . . . . 113
7.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
7.4.1 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
7.4.2 Data augmentation . . . . . . . . . . . . . . . . . . . . . . . . 122
7.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
8 Building an Autonomous Lane Keeping Simulator Using Real-World
Data and End-to-End Learning 124
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
8.2 Building a Simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
8.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
8.2.2 Image projection . . . . . . . . . . . . . . . . . . . . . . . . . 130
8.2.3 Vehicle dynamics and vehicle trajectory tracking . . . . . . . . 134
8.2.4 CNN implementation . . . . . . . . . . . . . . . . . . . . . . . 142
8.3 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
8.3.1 Data collection . . . . . . . . . . . . . . . . . . . . . . . . . . 143
8.3.2 Data augmentation . . . . . . . . . . . . . . . . . . . . . . . . 147
8.3.3 Evaluation using simulator . . . . . . . . . . . . . . . . . . . . 147
8.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
8.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
9 Conclusions 154
Bibliography 157
List of Tables
3.1 HOG parameters in our system . . . . . . . . . . . . . . . . . . . . . 28
4.1 Evaluation result based on Rin/Rout for different p values . . . . . . . 45
4.2 Evaluation result: precision and recall . . . . . . . . . . . . . . . . . 46
5.1 Number of training samples of Green ROI-n and Red ROI-n . . . . . 58
5.2 Information of 23 test sequences . . . . . . . . . . . . . . . . . . . . . 59
5.3 Test result of 17 sequences that contain traffic lights . . . . . . . . . . 78
5.4 Number of false positives in traffic-light-free sequences . . . . . . . . 79
5.5 Results of several recent works on traffic lights detection . . . . . . . 81
8.1 Evaluation result using the simulator, with and without augmented data.149
List of Figures
2.1 Performance results from the Caltech Pedestrian Detection Benchmark. 17
3.1 Three stages in our proposed system. . . . . . . . . . . . . . . . . . . 24
3.2 48 classes of traffic signs can be detected and recognized in our system. 25
3.3 An example of color enhancement. . . . . . . . . . . . . . . . . . . . . 26
3.4 Selecting ROI from the original image. . . . . . . . . . . . . . . . . . 27
3.5 Grouping detected windows. . . . . . . . . . . . . . . . . . . . . . . . 29
3.6 Normal CUDA kernel launches. . . . . . . . . . . . . . . . . . . . . . 30
3.7 CUDA kernel launches using CUDA streams. . . . . . . . . . . . . . . 30
3.8 HOG computing time on CPU and GPU. . . . . . . . . . . . . . . . . 32
3.9 The total processing time when HOG is computed using OpenCV on
GPU. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.10 The total processing time when using our optimized GPU code. . . . 34
4.1 Applying traffic light detector on a candidate. . . . . . . . . . . . . . 40
4.2 Are they traffic lights or not? Dark background on the top and bright
background at the bottom. . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3 The left traffic light has bright background and the right traffic light
has dark background. . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.4 Rin/Rout values for true positive candidates (left) and true negative
candidates (right). Y-axis is from 0 to 2000. . . . . . . . . . . . . . . 44
4.5 Rin/Rout values for true positive candidates (left) and true negative
candidates (right). Y-axis is from 0 to 20. . . . . . . . . . . . . . . . 45
4.6 Both traffic lights are detected. . . . . . . . . . . . . . . . . . . . . . 47
5.1 Examples of 5 classes of Green ROI-1. . . . . . . . . . . . . . . . . . 53
5.2 Examples of 5 classes of Green ROI-3. . . . . . . . . . . . . . . . . . 54
5.3 Examples of 5 classes of Green ROI-4. . . . . . . . . . . . . . . . . . 55
5.4 Examples of 3 classes of Red ROI-1. . . . . . . . . . . . . . . . . . . . 56
5.5 Examples of 3 classes of Red ROI-3. . . . . . . . . . . . . . . . . . . . 57
5.6 Flowchart of the proposed method of traffic light detection and recog-
nition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.7 Color extraction, blob detection and closing operation. . . . . . . . . 62
5.8 A sample frame from our traffic light dataset. . . . . . . . . . . . . . 64
5.9 The structure of two-stage PCANet. . . . . . . . . . . . . . . . . . . 66
5.10 An arrow light in three consecutive frames. The middle one is vague
and looks similar to a circular light. A detector often fails on such a
vague frame. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.11 All traffic lights are detected and recognized correctly in the frame. . 77
6.1 Instrumentation setup with both thermal and stereo cameras mounted
on the roof of a vehicle. . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.2 Framework of the proposed pedestrian detection method. . . . . . . . 92
6.3 Proper alignment of color and thermal images using trifocal tensor. . 94
6.4 Examples of pedestrians in color and thermal images. . . . . . . . . . 96
6.5 The relationship between the mean disparity and the height of an object.101
6.6 Performance of different input data combinations, all using HOG fea-
tures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.7 Performance improvement by adding disparity-size and road horizon
constraints. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.8 Performance of different input data combinations, all using CCF. . . 105
6.9 A pedestrian is embedded in the shadow of a color image. . . . . . . . 106
6.10 An example thermal image with two pedestrians. . . . . . . . . . . . 107
7.1 Comparison between the traditional approach and end-to-end learning. 111
7.2 An example of image frame from the dataset. . . . . . . . . . . . . . 112
7.3 Histogram of steering angles in training data. . . . . . . . . . . . . . 114
7.4 The proposed CNN architecture for deep learning. . . . . . . . . . . . 115
7.5 Histogram of error of predicted steering angles during test. . . . . . . 117
7.6 An example frame with the ground truth angle, predicted angle and
their respective projected path . . . . . . . . . . . . . . . . . . . . . . 118
7.7 Visualization of the results from first two convolutional layers. . . . . 119
7.8 An example of the disadvantage of frame by frame evaluation with 5
consecutive frames: the error in the middle frame is false . . . . . . . 121
8.1 Comparison between the traditional framework and end-to-end learning.126
8.2 The flowchart of test phase. . . . . . . . . . . . . . . . . . . . . . . . 131
8.3 The flowchart of training phase, using original data and augmented data.132
8.4 Example of original image and generated images given arbitrary camera
poses. (a) Original image. A checkerboard pattern on a flat surface.
(b) Generated image as if the camera is shifted left by 50 mm. (c)
Generated image as if the camera is rotated right by 15.25 degrees.
(d) Generated image as if the camera is shifted left by 50 mm and
rotated right by 15.25 degrees. . . . . . . . . . . . . . . . . . . . . . . 135
8.5 Camera calibration and ground surface estimation. (a) Selected points
in the image taken by the center camera. (b) Cameras and selected
points in the world coordinates. . . . . . . . . . . . . . . . . . . . . . 136
8.6 A virtual bicycle vehicle dynamics. . . . . . . . . . . . . . . . . . . . 138
8.7 Correction of vehicle’s position and orientation using vehicle trajectory
tracking. (a) Ground truth and predicted trajectory. (b) Ground truth
and predicted orientation. (c) Ground truth and predicted steering
wheel angle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
8.8 An example of cropped image frame from the dataset. . . . . . . . . . 143
8.9 The CNN structure used, slightly modified from NVIDIA’s PilotNet. 144
8.10 Our data collection system, including three forward facing cameras, a
USB hub, a laptop and access to OBD-II port. . . . . . . . . . . . . . 145
8.11 Example frames under different weather or lighting conditions. (a)
Cloudy. (b) Shadowed. (c) Foggy. (d) Sunny. . . . . . . . . . . . . . 146
8.12 Example of original image and augmented images given arbitrary vehi-
cle poses. (a) Original image. (b) Augmented image as if the vehicle is
shifted right by 0.5 m. (c) Augmented image as if the vehicle is rotated
left by 7 degrees. (d) Augmented image as if the vehicle is shifted right
by 0.5 m and rotated left by 7 degrees. . . . . . . . . . . . . . . . . . 148
8.13 An example of the simulation result, produced by the CNN trained
with data augmentation. (a) Overview of the trajectory in a test se-
quence. (b) Trajectory zoomed-in in the black rectangle in (a). (c)
Trajectory zoomed-in in the black rectangle in (b). . . . . . . . . . . 150
8.14 An example of failure. The vehicle is going out of lane to the right be-
cause another vehicle is changing lane, and lane markings are partially
blocked. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
8.15 An example of failure. The vehicle is going out of lane to the right
because of unclear lane markings. . . . . . . . . . . . . . . . . . . . . 152
List of Abbreviations
ACF Aggregated Channel Features
CCF Convolutional Channel Features
CNN Convolutional Neural Network
FN False Negatives
FOV Field of View
FP False Positives
FPPI False Positives Per Image
FPPW False Positives Per Window
FPS Frames per Second
GPU Graphics Processing Unit
HOG Histograms of Oriented Gradients
LKAS Lane Keeping Assist System
LQR Linear Quadratic Regulator
MOT Multi-object Tracking
MR Miss Rate
ODE Ordinary Differential Equation
PCA Principal Component Analysis
PCANet PCA network
RBF Radial Basis Function
ROI Region of Interest
SLAM Simultaneous Localization and Mapping
SMA Simple Moving Average
SVM Support Vector Machine
TP True Positives
Chapter 1
Introduction
In this chapter, we first introduce the background and discuss the motivations of
our work in Section 1.1. The major contributions of our work are summarized in
Section 1.2. Finally, the organization of this dissertation is presented in Section 1.3.
1.1 Motivations
Road safety is an important topic. Data from the Insurance Institute for
Highway Safety (IIHS) revealed that in 2012, red-light-running crashes
caused around 133,000 injuries and 683 deaths on US roads [1]. These injuries and
deaths may be reduced or avoided with the introduction of more advanced technolo-
gies. Many researchers are dedicated to the area of autonomous vehicles. Therefore,
we believe that this topic is meaningful and important.
Related to the topic of autonomous vehicles are cameras, which are common in
our daily lives and are much cheaper than some sensors, such as LiDAR. Moreover,
vision-based systems are intuitive, as humans use their eyes to understand
the surrounding environment. In addition, humans can easily interpret the informa-
tion obtained from images or videos, which makes building manually labeled datasets
easier. Therefore, we believe that the vision-based approach is reasonable. In addi-
tion to some public datasets, we design and deploy our own data collection system to
build our own datasets, especially when the public datasets are limited or not ideal.
Object detection and recognition are important for understanding a road scene.
Traffic signs, traffic lights, pedestrians, and many other objects on the road need to
be detected and recognized to guide drivers or autonomous driving systems. Our
projects witness the evolution of object detection and recognition in computer vision.
Initially, hand-crafted features (e.g., HOG) proved their effectiveness in detecting
objects with certain shapes or patterns. A classifier, such as SVM or AdaBoost,
is often used upon the extracted features. Image processing is often used as a pre-
processing or post-processing step, and certain assumptions are often made to improve
the detector’s performance. Later, researchers found a more generic way of detecting
objects, without using hand-crafted features. It is called two-stage training. The
first stage performs unsupervised training on all of the training data to determine
the best method for extracting features, and the second stage performs supervised
training to train the classifiers based on these features. After the two-stage training
approach, the one-stage approach became popular again, but with an end-to-end
learning Convolutional Neural Network (CNN) instead of hand-crafted features. The
CNN takes raw images as input and outputs the classified labels. As the CNN is
trained, it learns how to extract information from the raw images and how to classify
them. The training is single-stage and supervised, with no clear boundary between
the feature extractor and the classifier in the model. CNNs currently deliver
state-of-the-art performance in object detection and recognition.
Besides object detection and recognition, we are also motivated to look at the
lane keeping problem, which is an essential part of autonomous cars. A CNN is
again used: it takes raw image frames as input and outputs the steering
angles corresponding to the input frames, keeping the vehicle within the lane. This
is a regression problem instead of a classification problem. A simulator is then built
to provide augmented training data and a proper evaluation metric. Knowledge
of 3D geometry in computer vision, vehicle dynamics, and vehicle trajectory tracking
is also used in the simulator.
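To make the input-output relationship concrete, below is a minimal sketch of such a steering-angle regression CNN in Python with PyTorch. The layer sizes and the class name are illustrative assumptions, not the architecture used in later chapters.

```python
import torch
import torch.nn as nn

# A sketch of the end-to-end idea: a CNN maps a raw camera frame directly to
# one steering-angle value (regression). Layer sizes are assumed for
# illustration only.
class SteeringNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 24, 5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 36, 5, stride=2), nn.ReLU(),
            nn.Conv2d(36, 48, 5, stride=2), nn.ReLU())
        self.head = nn.Sequential(
            nn.Flatten(), nn.LazyLinear(100), nn.ReLU(), nn.Linear(100, 1))

    def forward(self, x):                     # x: (N, 3, H, W) camera frames
        return self.head(self.features(x))    # (N, 1) predicted steering angles

# Trained with a regression loss, e.g. nn.MSELoss(), against human steering.
```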
1.2 Summary of Contributions
We design and implement a group of systems for autonomous vehicles. Our contri-
butions are listed as follows:
• Design and implement a traffic sign detection and recognition system.
Traffic sign detection and recognition are important functions for autonomous
vehicles. The detection process identifies the existence of traffic signs and their
locations in an image, and the recognition process identifies the types of the
detected signs. Our GPU-based traffic sign detection and recognition system is
able to detect and recognize 48 traffic signs. The implementation features three
stages: pre-processing, feature extraction and classification. A highly optimized
and parallelized version of HOG+SVM was used. The system can process 27.9
frames per second at a resolution of 1,628 × 1,236 active pixels, with minimal
loss of accuracy. In an evaluation using the BelgiumTS dataset, the experimental
results indicate that the detection rate is about 91.69% with false positives per
window of 3.39 × 10⁻⁵ and the recognition rate is about 93.77%.
We emphasize our contributions in the following aspects:
– Our system is able to detect and recognize 48 traffic signs, with a good
detection rate and recognition rate.
– We optimized and parallelized the computation of HOG on GPU, as well
as some pre-processing steps and the deployed SVM classifier.
– Our system achieves real-time performance on high-resolution images.
• Design and implement two traffic light detection and recognition systems.
Two traffic light detection and recognition systems are presented. The first
system detects and recognizes red circular lights only, using image processing
and SVM. Its performance is better than that of traditional detectors.
The second system detects and classifies different types of traffic lights, including
green and red lights in both circular and arrow forms. It combines computer
vision and machine learning techniques. Color extraction and blob detection
are used to locate the candidates, followed by the PCA network (PCANet)
classifiers. The PCANet classifier consists of a PCANet and a linear SVM. Our
experimental results suggest that the proposed method is highly effective for
detecting both green and red traffic lights.
We emphasize our contributions in the following aspects:
– For the first system, we demonstrate that detection using a fixed threshold
ratio is not very effective and the SVM-based classification has much better
performance.
– For the first system, we empirically add more parameters of a candidate
to the SVM input, and this can achieve better performance.
– For the first system, we build a traffic light dataset from the original videos
captured while driving on the streets.
– For the second system, we demonstrate that combining image processing
and PCANet can help with detecting and recognizing various types of traffic
lights, including green and red lights in both circular and arrow forms.
– For the second system, an online multi-object tracking technique is applied
to overcome occasional misses, and a forecasting method is used to filter
out false positives.
– For the second system, several additional optimization techniques are em-
ployed to improve the detector performance and to handle the traffic light
transitions.
– For the second system, we build our own dataset of traffic lights from
recorded driving videos, including circular lights and arrow lights in various
directions.
• Design and implement a pedestrian detection system.
Pedestrian detection is a critical feature for self-driving cars or advanced driver
assistance systems. Our system consists of a thermal camera and a color stereo
camera. Data received from multiple cameras are aligned using trifocal tensor
based on pre-calibrated parameters. In addition, candidates are generated us-
ing sliding windows at multiple scales. A reconfigurable detector framework is
proposed, in which feature extraction and classification are two separate stages.
The input to the detector can be the color image, disparity map, thermal data,
or any combination of these. When convolutional channel features are used,
feature extraction uses the first three convolutional layers of a pre-trained con-
volutional neural network cascaded with an AdaBoost classifier. The evaluation
results indicate that it significantly outperforms the traditional histogram of ori-
ented gradients features. When combining the color and thermal images, the
proposed detector can achieve a 9% log-average miss rate.
We emphasize our contributions in the following aspects:
– We design and assemble a multi-spectral camera system mounted on a
vehicle to collect data for pedestrian detection.
– We build a dataset for multi-spectral pedestrian detection from on-road
driving data. These data contain many complex scenarios that are chal-
lenging for detection and classification.
– We propose a machine learning based algorithm for pedestrian detection
by combining stereo vision and thermal images. The evaluation results
show satisfactory performance.
– An experimental dataset is built by labeling the data collected when driv-
ing on the city roads.
• Design and implement a lane keeping system.
We present an end-to-end learning approach for obtaining the proper steering
angle to maintain the car in the lane. The CNN model uses raw image frames
as input and outputs the steering angles accordingly. The model is trained
and evaluated using the comma.ai dataset, which contains the front view image
frames and the steering angle data captured when driving on the road. Unlike
the traditional approach, which manually decomposes the autonomous driving
problem into technical components such as lane detection, path planning and
steering control, the end-to-end model can directly steer the vehicle from the
front view camera data after training. It learns how to keep the car in the lane
from human driving data. Further discussion of this end-to-end approach and
its limitation are also provided.
We emphasize our contributions in the following aspects:
– We present a working system for lane keeping using the end-to-end learning
approach.
– We provide the evaluation results and discussion of this system. The need
for building a simulator is discussed.
• Design and implement a simulator for the lane keeping system.
In addition to the state-of-the-art end-to-end learning method that predicts
the steering wheel angle for the purpose of staying in the lane, a simulator is
built using image projection, vehicle dynamics and vehicle trajectory tracking,
which can be helpful in both training and evaluation. The simulation results
demonstrate the effectiveness and accuracy of the end-to-end learning method
and the benefits of using the simulator.
We emphasize our contributions in the following aspects:
– We describe the implementation details of building a simulator for vision-
based autonomous lane keeping. Although many recent works exist on lane
keeping algorithms, comparing and evaluating them are difficult. Built on
real-world data, this simulator employs image projection, vehicle dynam-
ics modeling, and vehicle trajectory tracking to predict vehicle movement
and its corresponding camera views. The simulator can be used for both
training and the evaluation of lane keeping algorithms.
– The end-to-end learning approach produces the proper steering angle
from camera image data, aimed at maintaining the self-driving vehicle in
a lane. A highly effective end-to-end learning system is demonstrated
using the aforementioned simulator for both training and evaluation. The
CNN model trained with augmented data from the simulator performs
significantly better than the model trained with recorded data only.
– We build a dataset for autonomous vehicle research. The dataset contains
recorded video frames from three forward facing cameras (left, center, and
right) as well as a steering wheel angle and vehicle speed information.
1.3 Outline
This dissertation is organized as follows.
Chapter 2 summarizes the background of autonomous vehicles, especially com-
puter vision and machine learning techniques related to this dissertation.
Chapter 3 presents a GPU-based system for real-time traffic sign detection and
recognition that can classify 48 traffic signs included in the library.
Chapter 4 presents a method for the automatic detection of circular red traffic
lights that integrates both image processing and support vector machine techniques.
Chapter 5 presents a novel approach that combines computer vision and machine
learning techniques for the accurate detection and classification of different types of
traffic lights, including green and red lights in both circular and arrow forms.
Chapter 6 presents a novel instrument for pedestrian detection by combining a
thermal camera with a color stereo camera.
Chapter 7 presents an end-to-end learning approach for obtaining the proper steer-
8
ing angle to maintain the car in the lane.
Chapter 8 presents the implementation of a simulator for the lane keeping system,
using image projection, vehicle dynamics and vehicle trajectory tracking, which can
be helpful for both training and evaluation.
Chapter 9 draws the conclusions.
Chapter 2
Background
Carnegie Mellon University completed the first project involving autonomous vehicles
in the US in 1995, which included autonomous driving from Pittsburgh, PA, to San
Diego, CA. The vehicle was equipped with a computer, a camera, and a GPS. In
2004, the U.S. Defense Advanced Research Projects Agency (DARPA) started a
competition for autonomous vehicles, but none of the teams completed the 150-mile
course. In 2005, five teams completed the DARPA challenge, and Stanford Univer-
sity’s autonomous car called Stanley took first place. In 2007, the DARPA challenge
involved a 60-mile course in an urban environment, and Carnegie Mellon Univer-
sity's autonomous car called Boss took first place. In 2016, Stanford University's au-
tonomous car called Shelley ran on the track at a speed of nearly 120 mph. Nowadays,
many vehicle manufacturers are developing their own autonomous vehicles, including
Ford, Mercedes Benz, Volkswagen, Audi, and BMW. In addition, many IT companies
have also joined this area, including Google, Uber, NVIDIA, and Tesla. For example,
Google started a self-driving car project in 2009, which is now called Waymo.
Waymo claims to drive more than 25,000 autonomous miles each week, mostly
on complex city streets. In short, autonomous vehicles are being developed
rapidly, including both their hardware and their software.
This dissertation focuses on computer vision and machine learning techniques
used in this field, such as the detection and recognition of traffic sign, traffic light and
pedestrian, as well as lane keeping for self-driving cars. Many other topics not covered
in this dissertation are also important, such as pixel level segmentation, 3D recon-
struction, motion estimation, and Simultaneous Localization and Mapping (SLAM).
2.1 Datasets
Machine learning techniques rely heavily on data. Datasets are often built using real-
world data, with manually labeled ground truth. For example, the KITTI dataset
[2–5] uses the autonomous driving platform of Annieway to capture data from the real
world. The sensors mounted on the car are cameras, a 360° Velodyne laser scanner, and
a GPS. The data are manually processed and divided into several subsets, such
as stereo, flow, object, tracking, and road. Furthermore, many datasets are built
for specific tasks. For example, the Belgium Traffic Sign Dataset [6]
and German Traffic Sign Benchmark [7] aim for detecting and recognizing a group
of European traffic signs in images. The Traffic Lights Recognition (TLR) public
benchmarks [8] are for the detection of green or red circular traffic lights in images.
The INRIA person dataset [9] and the Caltech Pedestrian Detection Benchmark [10]
are for the detection of upright persons in images. The comma.ai dataset contains
images captured from a forward facing camera, as well as the vehicle status such as
the speed, gear, and steering. It is used for end-to-end learning for the functionality
of lane keeping.
The datasets built from real-world data are extremely useful for researchers. How-
ever, collecting and labeling these data is tedious and time consuming, and the in-
formation obtained is limited to the types of sensors used. Therefore, real-world
datasets often have limited amounts of data and focus on certain functionalities. On
the other hand, some datasets are built using simulators or game engines, and they
can provide much more information with little human effort. For example, a dataset
generated from a computer game has been proposed for road scene segmentation [11].
The researchers claim that generating the annotation takes seven seconds per im-
age on average, whereas a human annotator takes 90 minutes per image. In such
datasets, the rich information of the 3D scene and object movements is helpful to
researchers, and these data can be generated easily. However, whether the models
trained on virtual data can be applied in the real world is questionable, as the im-
ages from game engines and the real world have inherent differences. Nevertheless,
these virtual datasets provide solid alternatives for researchers to try out their new
algorithms.
An increasing number of datasets are becoming available as researchers keep col-
lecting data and building their own datasets. Using the existing datasets reduces the
time and effort needed to verify an algorithm, as collecting and labeling data are
very time consuming. It also makes it easier to compare one’s work with the existing
work of other researchers who use the same dataset [12, 13], because works done on
different datasets cannot be compared directly. However, sometimes researchers must
collect their own data, if the existing datasets are not ideal or are not available. In
addition, the newly built datasets can benefit other researchers.
2.2 Object detection and recognition
Object detection and recognition are important aspects of autonomous vehicles. This
dissertation focuses on the detection and recognition of traffic signs, traffic lights and
pedestrians. In addition, many other objects not covered in this dissertation can also
be detected and recognized to guide drivers or autonomous driving systems, such as
vehicles, road markings, and traffic cones.
2.2.1 Traffic sign
There are several existing works focused on detecting and recognizing a particular
class of traffic signs, such as stop signs or speed limit signs [14, 15]. The designs were
optimized and can be highly efficient for detecting and recognizing a specific class
of signs, but they were hardly useful for other types of signs. Other research papers
attempted to detect and recognize multiple signs and used common features such
as shapes and colors [6,16,17]. Advanced image processing algorithms were proposed
and analyzed thoroughly in order to obtain accurate results. However, the previous
works primarily focused on the algorithms, and computing time was less of a concern,
which prevents those designs from being practically useful. Some other works
investigate the trade-off between accuracy and computing time [18–20].
Many of them claimed to achieve real-time performance at a high accuracy, but the
datasets that they used varied. Without using the same dataset, it is unfair
to compare the accuracy among different designs. It is also worth mentioning that
the image resolution is another important factor that can affect the processing time
as well as accuracy. A higher-resolution image can reveal small objects. As a
result, traffic signs can be detected and recognized even when they are far away,
thus leaving more time for drivers to respond.
2.2.2 Traffic light
Spot light detection [21, 22] is a method based on the fact that a traffic light is
much brighter than the lamp holder, which is usually black. A morphological top-hat
operator was used to extract the bright areas from grayscale images, followed by a
number of filtering and validating steps. In [23], an interactive multiple-model filter
was used in conjunction with the spot light detection. More information was used
to improve its performance, such as status switching probability, estimated position
and size. The fast radial symmetry transform is a fast variation of the circular Hough
transform, which can be used to detect circular traffic lights as demonstrated in [24].
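As an illustration of the top-hat idea, a minimal sketch using OpenCV in Python follows; the kernel size, threshold, and function name are assumed values, and a real detector would add the filtering and validation steps described above.

```python
import cv2

def spotlight_candidates(gray, ksize=15, thresh=40):
    """Bright-spot extraction via a morphological top-hat (illustrative only).

    gray: grayscale frame; ksize and thresh are assumed, untuned values.
    """
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (ksize, ksize))
    tophat = cv2.morphologyEx(gray, cv2.MORPH_TOPHAT, kernel)  # image - opening
    _, mask = cv2.threshold(tophat, thresh, 255, cv2.THRESH_BINARY)
    return mask  # binary map of bright spots, e.g. lit traffic lamps
```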
Several other methods also combined the vehicle GPS information. A geometry-
based filtering method was proposed to detect traffic lights using mobile devices at
low computational cost [25]. The GPS coordinates of all traffic lights were presumably
available, and a camera projection model was used. Mapping traffic light locations
was introduced in [26] by using tracking, back-projection and triangulation. Google
also presented a mapping and detection method in [27] which was capable of recog-
nizing different types of traffic lights. It predicted when traffic lights should become
visible with the help of GPS data, followed by classifying possible candidates. Geo-
metric constraints and temporal filtering were then applied during the detection. The
inter-frame information was also helpful for detecting traffic lights. A method that
used a Hidden Markov Model to improve the accuracy and stability of the results was
demonstrated in [28]. The state transition probability of traffic lights was considered,
and information from several previous frames was used. Reference [29] introduced a
traffic light detector based on template matching. The assumption was that the two
off lamps in the traffic light holder are similar to each other and neither of them looks
similar to the surrounding background.
Deep learning [30, 31] is a class of machine learning algorithms that has many
layers to extract hidden features. Unlike hand-crafted features such as Histograms of
Oriented Gradients (HOG) features [9], it learns features from training data. PCANet
is a simple yet effective deep learning network proposed in [32]. Principal Component
Analysis (PCA) is employed to learn the filter banks. It can be used to extract features
of faces, hand written digits and object images. It has been tested on several datasets
and delivers surprisingly good results [32]. Using PCANet in traffic light detection or
other similar applications has not been researched thus far.
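To make the PCA filter-bank idea concrete, here is a minimal sketch of learning one PCANet stage from vectorized image patches with NumPy; the patch size, filter count, and function name are assumptions for illustration.

```python
import numpy as np

def learn_pca_filters(patches, num_filters=8, k=7):
    """Learn one PCANet stage's filter bank (a sketch of the idea in [32]).

    patches: (N, k*k) array of vectorized k x k patches from training images.
    Returns num_filters filters of shape (k, k).
    """
    X = patches - patches.mean(axis=1, keepdims=True)  # remove each patch's mean
    cov = X.T @ X / X.shape[0]                         # patch covariance matrix
    _, eigvecs = np.linalg.eigh(cov)                   # eigenvalues ascending
    top = eigvecs[:, ::-1][:, :num_filters]            # leading principal components
    return top.T.reshape(num_filters, k, k)            # one k x k filter each
```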
Integration of detection and tracking has been used in a few works related to
autonomous vehicles. The trajectory of the traffic light was used to validate the
theoretical result in [23]. A Kalman filter was employed to predict the traffic sign
positions. It was claimed that the tracking algorithm improved the overall system
reliability [33, 34].
Utilizing accumulated classifier decisions from a tracked speed limit sign, a majority
voting scheme was proven to be very robust against accidental mis-classifications [14].
2.2.3 Pedestrian
The Caltech Pedestrian Detection Benchmark [10] has been widely used by re-
searchers. It contains frames from a single vision camera with pedestrians annotated.
Based on the CVPR2015 snapshot of the results on the Caltech-USA pedestrian
benchmark, it was stated in [35] that at ~95% recall, the state-of-the-art detectors
made ten times more errors than the human-eye baseline, which is still a huge gap that
calls for research attention. Figure 2.1(a) shows some top quality detection meth-
ods presented in [36]. Overall, the detector performance has been improved as new
methods were introduced in recent years. Traditional methods such as Viola–Jones
(VJ) [37] and Histogram of Oriented Gradients (HOG) [9] were often included as the
baseline. A total of 44 methods were listed in [38] for the Caltech-USA dataset, and 30 of
them made use of HOG or HOG-like features. Channel features [39] and Convo-
lutional Neural Networks [40–42] also achieved impressive performance on pedestrian
detection. The Convolutional Channel Features (CCF) [43], which combines a boost-
ing forest model and low-level features from a CNN, is one of the top performers listed in
the Caltech Pedestrian Detection Benchmark, as shown in Figure 2.1(b). Despite the
progressive improvement of detection results on the datasets, color cameras still have
many limitations. For instance, color cameras are sensitive to lighting conditions.
Most of these detection methods may fail if the image quality is impaired under poor
lighting conditions.
Thermal cameras can be employed to overcome some limitations of color cameras,
because they are not affected by lighting conditions. Several research works using ther-
mal data for pedestrian detection and tracking were summarized in [44]. Background
subtraction was applied in [45] for people detection, since the camera was static. HOG
features and Support Vector Machine (SVM) were employed for classification [46]. A
two-layered representation was described in [47], where the still background layer and
the moving foreground layer were separated. The shape cue and appearance cue were
used to detect and locate pedestrians. In [48], a window based screening procedure
was proposed for potential candidate selections. The Contour Saliency Map (CSM)
was used to represent the edges of a pedestrian, followed by AdaBoost classification
with adaptive filters. Assuming the region occupied by a pedestrian has a hot spot,
candidates were selected based on thermal intensity values [49] and then classified
by an SVM. In addition, both Kalman filter prediction and mean shift tracking were
(a) Benchmark results of different methods as reported in [36].
(b) Benchmark results of different methods as of May 2016.
Figure 2.1: Performance results from the Caltech Pedestrian Detection Benchmark.
incorporated for further improvement. A new contrast invariant descriptor [50] was
introduced for far infrared images, which outperformed HOG features by 7% at 10−4
FPPW for people detection. The Shape Context Descriptor (SCD) was also used for
pedestrian detection in [51], followed by AdaBoost classifier. The HOG features were
considered not suitable for this task because of the small size of the target, variations
of pixel intensities and lack of texture information. Probabilistic models for pedes-
trian detection in far infrared images were presented in [52]. The method in [53] found
the head regions at the initial stage, then confirmed the detection of a pedestrian by
the histograms of Sobel edges in the region.
For ADAS applications, several pedestrian detection research works were summa-
rized in [54], including the use of color cameras and thermal cameras, as well as sensor
fusion such as radar and stereo vision cameras. A benchmark for multispectral pedes-
trian detection was presented in [55] and several methods were analyzed. However,
the color-thermal pairs were manually annotated and it is unclear if any automatic
point registration algorithms were used. The combination of stereo vision cameras
and a thermal camera was used in [56]. Trifocal tensor was used to align the thermal
image with color and disparity images. Candidates were selected based on disparity,
and HOG features were extracted from color, thermal and disparity images. Con-
catenated HOG features were then fed to a radial basis function (RBF) SVM classifier
to obtain the final decision. Furthermore, more sophisticated applications or systems
can be built upon pedestrian detection, such as pedestrian tracking across multiple
driving recorders [57] and crowd movement analysis [58].
2.3 Lane keeping
Maintaining the vehicle within the lane is important for driving safety. The lane keeping
assist system (LKAS) has been studied by many researchers. Lane
keeping assist systems [59–62] are able to provide torque to maintain the vehicle
within the lane, and often alert the driver with warning messages or sound. Cameras
are usually used in the system, and lane markings must be recognized. In addition,
the systems also distinguish intended and unintended lane departure, by utilizing
more information such as blinker state, braking or steering angle.
The LKAS needs to be accurate and robust for autonomous cars. Although
industrial companies have achieved a lot in this area, they seldom publicize their
technologies. It is necessary for researchers to study the theories, algorithms and
implementations of the LKAS. Deep reinforcement learning [63] was used in several
research works on autonomous driving [64–66]. The systems learned the optimal pol-
icy function given the feedback of the reward. These systems went beyond the basic
lane keeping feature, and were able to direct the vehicle to stay on a path and avoid
collisions. The vehicle did not necessarily have to stay in a lane, and other vehicles
on the road were often involved. The learning and evaluation were often done in a virtual simula-
tor, because the learning requires rich ground truth information and needs to interact
with the environment. Inverse reinforcement learning [67], on the other hand, was
used to estimate the reward from the expert demonstrations.
For real world systems, sensors and algorithms are employed to interpret the
surrounding environment, without having the rich ground truth information in the
simulator. The vision-based approaches use cameras because they are cost effective.
An early research work demonstrated an autonomous vehicle, ALVINN [68], using a neu-
ral network to find the proper direction. The input data came from a camera and
a laser range finder, and the input resolution was very small. For large resolution
color images, an end-to-end learning approach using convolutional neural network
was demonstrated in [69]. The system was designed for off-road mobile robots, not
for autonomous vehicles on the road. An end-to-end learning approach using a convolu-
tional neural network for self-driving cars was demonstrated in [70], and the network
was trained and evaluated with the help of a simulator. The idea of building the
simulator using image projection and vehicle dynamics was described, but few
technical details were provided. The network was later named PilotNet, and its effective-
ness was validated and visualized in [71, 72]. Our previous work [73] followed this
approach using a different dataset and network, and demonstrated the necessity of
building the simulator in both the training and evaluation stages.
Building the simulator requires the knowledge of computer vision, vehicle dynam-
ics and vehicle trajectory tracking. Most autonomous vehicle driving frameworks
present a consistent decoupling between low-level control and path planning, while
constraining the dynamics of the system to satisfy the vehicle’s motion. Typically
nominal path is obtained by optimization-based methods [74], sampling-based ap-
proaches [75] and notable searching algorithms [76]. In terms of the system dynamics
and control, Rami et al. [77] proposed a linear system dynamics and the control for
high speed drifting. Galceran et al. [78] adopted proportional-derivative (PD) feed-
back controller for torque-based steering. Approximating the non-linearity of the
vehicle dynamics, DeSantis et al. [79] Jacobian linearized the vehicle dynamics for
designing a path-tracking controller, but this approximation ignored the high order
of the polynomial of the system dynamics, which led to a potential problem in the
controlling a vehicle when the error is large.
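As a concrete example of the vehicle dynamics involved, the sketch below advances a kinematic bicycle model by one Euler step; the wheelbase and time step are assumed values, and Chapter 8 describes the model actually used in our simulator.

```python
import numpy as np

def bicycle_step(x, y, theta, v, delta, L=2.7, dt=0.05):
    """One Euler step of a kinematic bicycle model (illustrative values).

    (x, y, theta): vehicle pose; v: speed (m/s); delta: front steering angle
    (rad); L: wheelbase in meters (assumed); dt: time step in seconds.
    """
    x += v * np.cos(theta) * dt            # position update along heading
    y += v * np.sin(theta) * dt
    theta += v / L * np.tan(delta) * dt    # heading update from steering angle
    return x, y, theta
```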
Chapter 3
A GPU-Based Real-Time Traffic
Sign Detection and Recognition
System
This chapter presents a GPU-based system for real-time traffic sign detection and
recognition which can classify 48 different traffic signs included in the library. The
proposed design implementation has three stages: pre-processing, feature extraction
and classification. For high-speed processing, we propose a window-based histogram
of gradient algorithm that is highly optimized for parallel processing on a GPU. For
detecting signs of various sizes, the processing is applied at 32 scale levels. For
more accurate recognition, multiple levels of support vector machines are employed
to classify the traffic signs. The proposed system can process video at 27.9 frames per
second with active pixels of 1,628 × 1,236 resolution. In an evaluation using the BelgiumTS
dataset, the experimental results show that the detection rate is about 91.69% with false
positives per window of 3.39 × 10⁻⁵ and the recognition rate is about 93.77%.
3.1 Introduction
Traffic sign detection and recognition are important functions in an Advanced Driver
Assistance System (ADAS). The detection process determines two things: the
existence of traffic signs in an image and their locations. Accurately detecting the
signs also improves the recognition rate by filtering out redundant information while
retaining the useful information on an image. Recognition identifies the signs from
the detection result. In the real world, knowing the content of the sign is much more
important than simply knowing the existence of a sign. Many existing works have
been carried out to improve the accuracy of detection and recognition. In practice,
processing time and hardware efficiency also need to be considered.
A traffic sign detection and recognition system often contains three stages: pre-
processing, detection and recognition. The pre-processing stage is optional, but it is
usually included in a real-time system. It identifies and selects the regions of interest
in the original image frame, which often contains a large number of pixels. Effectively, it
reduces the computational tasks and improves the efficiency of the subsequent stages.
The second stage detects and locates traffic signs in the selected regions produced
by the pre-processing stage. In some systems, the detection stage also identifies the
categories of the signs based on shapes, such as round, rectangle, triangle, etc. These
categories are called super-classes. The final stage recognizes the detected signs and
sends the processing results (i.e., the types of signs and their locations) to the display
and control units of an ADAS system.
Typically, feature extraction and pattern classification algorithms are computa-
tionally intensive. Much research has been done to optimize the algorithms themselves
to improve the accuracy, but very little research has been focused on the implemen-
tation to improve the efficiency. In this chapter, we propose to utilize the many-core
architecture in a GPU to accelerate the traffic sign detection and recognition algo-
rithms through massive parallel processing. The objective is to reduce the computing
time considerably such that the GPU implementation can detect and recognize traffic
signs in real-time.
3.2 Traffic Sign Detection and Recognition System
3.2.1 System Overview
The proposed system contains three main stages: pre-processing, detection and recog-
nition, as shown in Fig. 3.1. First, we perform red and blue color extraction
and select the regions of interest (ROI). Next, Histograms of Oriented
Gradients (HOG) [80] features are extracted on the grayscale image and a sliding
window searches the image exhaustively to find the candidates using a linear Support
Vector Machine (SVM). Color-based HOG detectors are then performed on these
candidates to eliminate false positives, followed by a rectangle grouping operation
to locate the detected traffic signs. Finally, the detected signs are delivered to a
cascade classifier which contains several linear SVMs. The recognized traffic sign is
highlighted with a green rectangle on the image. Furthermore, a standard image
of the identified class of traffic sign, scaled to the same size, is placed next to the
rectangle, which is used to indicate the actual position and class of the sign. For the
proposed system, the BelgiumTS dataset is employed for both training and testing.
Our system is able to detect and recognize 48 classes of traffic signs selected from the
BelgiumTS Dataset [81], as shown in Fig. 3.2. These signs have aspect ratio 1:1 with
red or blue colors on them.
Figure 3.1: Three stages in our proposed system.
Although HOG and SVM have been commonly used in detecting and recognizing
objects, it is still challenging to find a good balance between accuracy and efficiency.
In order to reduce the computing latency, we employ linear kernel SVMs in our
implementation. In order to obtain better accuracy, we use multiple HOG features
and SVMs in our system, as shown in Fig. 3.1.
3.2.2 Pre-processing
Color and shape information are commonly used as features of traffic signs. Although
road images often contain objects whose color and shape are similar to those of
traffic signs, using such information to identify the ROI is still a simple yet
effective approach. We perform color extraction using an adaptive threshold method
proposed by [82].

Figure 3.2: 48 classes of traffic signs can be detected and recognized in our system.

By using red color enhancement, we obtain an image whose pixel value fR is computed as
fR = max(0, min(xR − xG, xR − xB) / s)    (3.1)

s = xR + xG + xB    (3.2)
where xR, xG and xB are the pixel values of the red, green and blue channels, respectively. The
global threshold is then set to µ + 4 · σ, where µ is the mean and σ is the standard
deviation of the red values of the original image pixels. Applying this threshold to
the image results in a binary image IR which is used in the following processing steps.
We also perform blue color enhancement and thresholding using the same method
and obtain a blue color enhanced binary image IB. Fig. 3.3 shows an example of
blue color enhancement.
Figure 3.3: An example of color enhancement.
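As a concrete illustration of this pre-processing step, the following is a minimal sketch of the red enhancement and thresholding in (3.1) and (3.2), assuming OpenCV; the function name is illustrative, and computing the threshold statistics over the enhanced values (rather than the raw red channel) is our reading of the method.

```cpp
// Minimal sketch of red color enhancement (Eqs. 3.1, 3.2) followed by the
// adaptive global threshold mu + 4*sigma; OpenCV assumed, names illustrative.
#include <opencv2/opencv.hpp>

cv::Mat redEnhanceAndThreshold(const cv::Mat& bgr) {
    cv::Mat ch[3];
    cv::split(bgr, ch);                         // OpenCV channel order: B, G, R
    cv::Mat xB, xG, xR;
    ch[0].convertTo(xB, CV_32F);
    ch[1].convertTo(xG, CV_32F);
    ch[2].convertTo(xR, CV_32F);

    cv::Mat s = xR + xG + xB + 1e-6f;           // Eq. (3.2), guard divide-by-zero
    cv::Mat d1 = xR - xG, d2 = xR - xB;
    cv::Mat num = cv::min(d1, d2);
    cv::Mat ratio = num / s;
    cv::Mat fR = cv::max(ratio, 0.0);           // Eq. (3.1)

    cv::Scalar mu, sigma;
    cv::meanStdDev(fR, mu, sigma);              // statistics of the enhanced values
    double thresh = mu[0] + 4.0 * sigma[0];     // global threshold mu + 4*sigma

    cv::Mat IR;
    cv::threshold(fR, IR, thresh, 255, cv::THRESH_BINARY);
    IR.convertTo(IR, CV_8U);                    // binary image I_R
    return IR;
}
```

The same routine applies to the blue channel by swapping the roles of xR and xB.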
Next, we find contours in the binary images using the algorithm in [83] and then
place a bounding box for the contours of each object. Small rectangles whose width
or height is less than 32 pixels are ignored to minimize the interference of small
objects and color fragments in the image. Bounding boxes that have similar sizes and
locations are combined to avoid overlapping. Fig. 3.4 shows an ROI selected from
the original image after pre-processing.
3.2.3 Traffic Sign Detection
In many cases, the selected ROI from the pre-processing stage contains no traffic
sign. In order to provide valid inputs to the classification stage, traffic signs must
first be detected accurately. False positives need to be eliminated as much as possible.
Applying the HOG method, we compute the HOG features on ROI at different scales
and then use a sliding window to search the entire ROI to find traffic signs.
Figure 3.4: Selecting ROI from the original image.
The HOG features can be computed from an RGB image or a grayscale image. For
an RGB image, horizontal and vertical gradients are computed in the three channels for
red, green and blue respectively. Only those that have the maximal magnitude compared
with the other two channels are selected for HOG processing. Thus the computational
workload is three times that of a grayscale image. We first convert the original RGB
image to a grayscale image IGRAY and use it to compute the HOG features that are
fed to a linear SVM to determine if there are traffic signs in the image. Although most
of the existing work also applied HOG to a grayscale image for traffic sign detection,
this approach has a very high false positive rate. In order to reduce the false positive
rate, our system also extracts the HOG features from the red image IR and the blue
image IB, but only on the frames where the detection on IGRAY is positive. Two
more SVMs are trained for the red and blue images respectively to eliminate some false
positive frames. In addition, these SVMs also classify the detected traffic signs into
several super-classes, such as red circle, red triangle, blue circle, etc.

Table 3.1: HOG parameters in our system

Parameter         Value
Window size       32 by 32 pixels
Block size        8 by 8 pixels
Cell size         8 by 8 pixels
Window stride     8 by 8 pixels
Block stride      8 by 8 pixels
Scaling factor    1.1
Levels            32
The HOG parameters in our system are shown in Table 3.1. The window size is
fixed, but the size of the traffic sign in an image is unknown. Thus the original image
has to be scaled to many different levels, and HOG feature extraction and classification
are then performed at each level. The size of the image at each level, Sl, is computed as
Sl = S0/fl (3.3)
where S0 is the original image size, l is the level number and fl is the level scaling
factor defined as
fl = 1.1^(l−1)    (3.4)
In our design, 32 scaling levels are applied. Thus our system is able to detect traffic
signs sized from 32 by 32 pixels up to 614 by 614 pixels.
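To make the pyramid arithmetic in (3.3) and (3.4) concrete, the short self-contained snippet below prints the sign size in the original image that the fixed 32 by 32 window covers at each level; at level 32 this works out to roughly 614 pixels.

```cpp
// Self-contained illustration of Eqs. (3.3) and (3.4): the effective sign size
// covered by the fixed 32 x 32 window at each of the 32 pyramid levels.
#include <cstdio>
#include <cmath>

int main() {
    const double f = 1.1;                          // level scaling factor
    const int levels = 32;                         // number of pyramid levels
    const int win = 32;                            // window size in pixels
    for (int l = 1; l <= levels; ++l) {
        double fl = std::pow(f, l - 1);            // f_l = 1.1^(l-1)
        // A 32 x 32 window at level l corresponds to win * f_l pixels in the
        // original image; at l = 32 this is about 614 pixels.
        std::printf("level %2d: covers %.0f x %.0f pixels\n", l, win * fl, win * fl);
    }
    return 0;
}
```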
As shown in the central figure of Fig. 3.5, the same traffic sign is detected by
multiple windows at different positions and also at different scale levels. To avoid
overlapping, we perform a grouping operation that combines these detected traffic
signs at the same location into a single box as shown in Fig. 3.5.
Figure 3.5: Grouping detected windows.
3.2.4 Traffic Sign Recognition
The final step of our design is traffic sign recognition. The SVM method is applied
to classify the detected traffic signs according to the 48 classes listed in Fig. 3.2.
Each of the final detected windows is classified by the SVMs mentioned in 3.2.3 to
confirm its category. Once its category is determined, it is classified
by a multi-class SVM in that category. SVMs are trained using k-fold cross-validation
to improve the accuracy. It is also worth mentioning that we use the BelgiumTSC
dataset to train the SVMs to classify different classes of traffic signs in each category.
3.3 Parallelism on GPU
Since pre-processing and HOG algorithms are complex and require extensive compu-
tations, in this section we describe the GPU-based acceleration. Pre-processing
is a typical point operation, which is well suited for GPU implementation. The HOG
computation is more complicated and we develop several special techniques to handle
it.
There exists a GPU version of HOG in the OpenCV library, which accelerates the
computation significantly compared to the CPU version. However, we find that
there is still room to improve its efficiency. As mentioned in 3.2.3, the HOG
features need to be computed in many different scaling levels of the original image,
and gaps between levels can be reduced or eliminated. Once the input data of each
level is prepared, there is no data dependency during HOG computation between
different levels. In the OpenCV implementation, each level stalls until the computation
of the previous level is done to ensure data synchronization between kernels, as shown in
Fig. 3.6. Such stalls are unnecessary and can be avoided by using CUDA streams.
As illustrated in Fig. 3.7, kernels can run in multiple CUDA streams at the same
time and can be synchronized in a certain stream without affecting others. By using
CUDA streams, we reduce the gaps between levels significantly and thus improve the
efficiency of HOG computation.
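The idea can be sketched with a hypothetical per-level kernel computeHogLevel() and a small pool of streams; this is only an illustration of the launch pattern, not the actual OpenCV kernel code, and the device buffer setup is assumed to be done elsewhere.

```cpp
// A sketch of launching per-level HOG kernels on round-robin CUDA streams so
// that independent levels can overlap; computeHogLevel() is a hypothetical
// kernel standing in for the per-level block-histogram computation.
#include <cuda_runtime.h>

__global__ void computeHogLevel(const float* img, float* hist, int w, int h) {
    // ... per-level gradient and block histogram computation (omitted) ...
}

void launchAllLevels(float** dImgs, float** dHists, const int* ws, const int* hs,
                     int levels) {
    const int kStreams = 4;                        // small stream pool
    cudaStream_t streams[kStreams];
    for (int i = 0; i < kStreams; ++i) cudaStreamCreate(&streams[i]);

    for (int l = 0; l < levels; ++l) {
        cudaStream_t s = streams[l % kStreams];    // round-robin assignment
        dim3 block(16, 16);
        dim3 grid((ws[l] + 15) / 16, (hs[l] + 15) / 16);
        // No data dependency between levels, so kernels queued on different
        // streams may execute concurrently instead of stalling level by level.
        computeHogLevel<<<grid, block, 0, s>>>(dImgs[l], dHists[l], ws[l], hs[l]);
    }
    for (int i = 0; i < kStreams; ++i) {
        cudaStreamSynchronize(streams[i]);         // per-stream synchronization
        cudaStreamDestroy(streams[i]);
    }
}
```

Synchronizing per stream rather than per level is precisely what removes the gaps shown in Fig. 3.6.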
Figure 3.6: Normal CUDA kernel launches.
Figure 3.7: CUDA kernel launches using CUDA streams.
For better performance, the GPU version of HOG in OpenCV is highly optimized
for data reuse. The image is divided into many blocks and the block histograms
are computed only once, though a block can belong to multiple windows. When
extracting the HOG feature of a window, we need to find the already computed block
histograms and line them up. However, after we adjust the detected windows in our
system, the locations and sizes of those windows are changed and their HOG features
need to be recomputed. Moreover, those windows can be anywhere in the image, so
it is impossible to reuse the block histograms. Computing the HOG features of those
windows is inefficient even with the previous GPU design, because there are gaps
between windows and the windows cannot be massively parallelized.
In order to solve this problem, we propose a window-based HOG solution on GPU.
All windows are extracted and put together to form an image whose width is the same
as the window width, and whose height is the window height multiplied by the number
of windows. Then the newly constructed image is sent to GPU for block histogram
computation. As a result, HOG computation for multiple windows is now running
in parallel on GPU threads. Furthermore, we optimize this method by filtering out
blocks crossing two windows since these blocks are not useful. In our parameter
settings, we have 9 blocks in a window and there are 3 blocks that cross two windows.
By filtering out these cross-window blocks, the total computation is reduced by 25%.
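A host-side sketch of the window-stacking idea is shown below, assuming OpenCV; in our implementation the block histograms are then computed on the GPU over this single tall image, and the function name and layout here are illustrative.

```cpp
// Sketch of the window-based HOG idea: stack all adjusted detection windows
// into one tall image so their block histograms can be computed in a single
// massively parallel pass. OpenCV assumed; names illustrative.
#include <opencv2/opencv.hpp>
#include <vector>

cv::Mat stackWindows(const cv::Mat& gray, const std::vector<cv::Rect>& wins,
                     cv::Size winSize) {
    cv::Mat stacked(winSize.height * (int)wins.size(), winSize.width, gray.type());
    for (size_t i = 0; i < wins.size(); ++i) {
        cv::Mat dst = stacked.rowRange((int)(i * winSize.height),
                                       (int)((i + 1) * winSize.height));
        cv::resize(gray(wins[i]), dst, dst.size());  // normalize each window
    }
    // Blocks whose rows straddle two adjacent windows carry no useful
    // information and are skipped downstream, cutting computation by ~25%.
    return stacked;
}
```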
3.4 Experimental Results
The proposed traffic sign detection and recognition algorithms are evaluated on a
Tesla K20 GPU platform. The pre-processing stage on GPU takes about 13–17 ms.
The detection and recognition stages account for most of the processing time. At
first, we compare the HOG computing time on CPU and GPU at each scaling level.
As shown in Fig. 3.8, the speedup of GPU acceleration is significant when the scaling
level is small. The original size of the test image is 1,628 by 1,236 pixels. The parameter
settings are as listed in Table 3.1. The OpenCV library is employed for comparing the
HOG computing time on CPU and GPU.
Figure 3.8: HOG computing time on CPU and GPU.
Secondly, we test our optimized GPU implementation using 2000 images in the
BelgiumTS dataset. Each image is 1,628 by 1,236 pixels in size. The total
execution time for all three stages is compared between the original OpenCV HOG GPU
version and our optimized version. Initialization time, such as reading images
and SVMs, is excluded, as is post-processing time such as recording and displaying
results. Fig. 3.9 shows the total execution time for each frame by using the OpenCV
GPU code for HOG computation. Fig. 3.10 shows the execution time of our opti-
mized GPU code. We can see that the overall computing time is reduced and some
peaks are suppressed. The average frame rate of the OpenCV version on GPU is 21.3
fps. Our optimized GPU code can achieve the average frame rate of 27.9 fps which
is about 31% faster than the OpenCV version.
Figure 3.9: The total processing time when HOG is computed using OpenCV on GPU.
Finally, we evaluate the detection rate and classification rate of our proposed sys-
tem, using the BelgiumTS dataset [6]. Each test image is 1,628 by 1,236 pixels
in size. We test 1918 images and the detection rate is 91.69%. We also measure the
false positive rate by using background images provided by the BelgiumTS dataset.
Figure 3.10: The total processing time when using our optimized GPU code.
Based on our HOG parameters described in Table 3.1, we extract over 20 million
windows from those images in different scaling levels. The number of false positives
is 684. Thus the False Positives Per Window (FPPW) is 3.39 × 10−5. Similarly,
we use the BelgiumTSC dataset to evaluate the classification rate. Each image in the
BelgiumTSC dataset contains one traffic sign with some background. We resize each
image to our window size of 32 by 32 pixels before computing HOG and performing
SVM classification. We use 4,492 images for training and 2,520 images for testing.
All training and test images are from the BelgiumTSC dataset and the classification rate
is 93.77%.
3.5 Conclusions
This chapter presents a real-time traffic sign detection and recognition system on
the GPU. It is capable of detecting and recognizing 48 classes of traffic signs of various
sizes in each image frame. The detection rate is about 91.69% and the recognition
rate is about 93.77%. The system can process 27.9 fps video with active pixels of
1,628 × 1,236 resolution. Since each frame is processed individually, no information
from previous frames is required. As part of our future work, information from
previous frames will be considered for tracking traffic signs, which is expected to
further improve the detection accuracy.
Chapter 4
Automatic Detection of Traffic
Lights Using Support Vector
Machine
Many traffic accidents at intersections are caused by drivers who miss or
ignore the traffic signals. In this chapter, we present a new method for automatic
detection of traffic lights that integrates both image processing and support vec-
tor machine techniques. An experimental dataset with 21299 samples is built from
the captured original videos while driving on the streets. When compared to the
traditional object detection and existing methods, the proposed system provides sig-
nificantly better performance with 96.97% precision and 99.43% recall. The system
framework is extensible, so users can introduce additional parameters to further
improve the detection performance.
4.1 Introduction
Automatic detection of traffic lights should be an essential feature of advanced driver
assistance systems and future self-driving vehicles. It remains an important road
safety issue that many traffic accidents at intersections are caused by drivers
running red lights. Recent data from the Insurance Institute for Highway Safety
(IIHS) show that in 2012 on US roads, red-light-running crashes caused
(IIHS) show that in the year of 2012 on US roads, red-light-running crashes caused
about 133,000 injuries and 683 deaths [1]. The introduction of automatic traffic light
detection, especially red light detection, has important social and economic impacts.
Because road images often contain complex backgrounds and many objects,
it is a challenge to develop an algorithm that can detect the traffic lights precisely.
Most of the existing algorithms are based on color, shape and gradient information,
but the detections are not very reliable. Since the traffic lights themselves do not have
sufficient features, traditional feature-based object detection algorithms also do not
work well. In this chapter, we propose a new method that combines computer vision
and machine learning techniques in conjunction with inter-frame information. While
driving on the road, data were collected by recording video using a camera
mounted behind the front windshield. The data sets were then labeled for training and
evaluation of the proposed algorithm. Our experimental results suggest the proposed
method is highly effective for detecting red traffic lights.
The rest of the chapter is organized as follows. In Section 4.2, we propose an
improved method that combines the computer vision and machine learning techniques
for traffic light detection. Data collection and performance evaluation are presented
in Section 4.3, followed by conclusions in Section 4.4.
4.2 Proposed Method for Traffic Light Detection
4.2.1 Locating candidates based on color extraction
In this chapter, we focus on the detection of red circular traffic lights only. Green
or yellow lights can be detected by applying similar techniques. At first, we apply
color extraction to locate the candidates of traffic lights. The images are first converted
to the hue, saturation, and value (HSV) color space, and the red color is
extracted based on the hue values. A flood-fill method is applied for region labeling
and blob extraction.
The blobs can be considered as the potential candidates. In many previous works,
a variety of morphological filtering techniques were applied to eliminate some candi-
dates for the purpose of reducing false positives. However, any filtering has a possi-
bility of missing the true traffic lights, because the traffic lights are not always clear
due to their size in images and the obscure background. Thus we simply perform an
aspect ratio check and keep all blobs that pass the check as candidates of potential
traffic lights. The objective of eliminating false positives is considered in the latter
part of the proposed method.
4.2.2 Traffic light detection using template matching
Once the candidates are located, we apply a template matching method to detect the
traffic lights [29]. Here we consider the traditional and the most popular design of a
traffic light in which red, yellow and green lights are in round shapes and vertically
positioned in that order. For horizontally positioned traffic lights, we can apply the
same method with a few modifications. Typically only one of the three lights is
turned on at a time. In the previous step, we have located potential candidates of
the red lights on the image. When the red light is on, the yellow and green lights are
off. These two off lights are very similar, so we use the yellow light area ROIref as
the template, which is the yellow rectangular area in Fig. 4.1. Similarly, the green light
area is highlighted as the green rectangular area. We can perform template matching in
the green rectangular area with ROIref. In fact, we purposely make the green rectangular
area slightly larger than the yellow one, which provides more accurate results
for template matching. The minimal value among the template matching results is
recorded as Rin and the corresponding area is recorded as ROIin.
For the three vertical traffic lights, the assumption is that these two off lights are
almost identical and there should not be any similar objects in the neighboring area.
The background areas around the traffic light bounding box are highlighted as blue
rectangular areas as in Fig. 4.1. Using the same reference ROIref as the template,
we perform template matching in the blue rectangular area. The smallest value of
the template matching results is Rout and its corresponding area is ROIout. Since the
yellow and green lights are both off, they appear almost identical and the Rin value
is small. In contrast, the Rout value is often very large. We can set a threshold value
p. If the ratio Rin/Rout < p, a traffic light is detected; otherwise it is not.
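A minimal sketch of this test, assuming OpenCV, is given below; the squared-difference matching score is an assumption (the method only requires that smaller values mean better matches), and the three rectangles are supplied by the candidate geometry.

```cpp
// Minimal sketch of the Rin/Rout template-matching test; OpenCV assumed.
#include <opencv2/opencv.hpp>

bool isTrafficLight(const cv::Mat& frame, const cv::Rect& refRect,
                    const cv::Rect& inRect, const cv::Rect& outRect, double p) {
    cv::Mat tmpl = frame(refRect);                 // yellow-lamp area ROI_ref
    cv::Mat resIn, resOut;
    // Squared-difference matching: smaller value means a better match
    cv::matchTemplate(frame(inRect), tmpl, resIn, cv::TM_SQDIFF_NORMED);
    cv::matchTemplate(frame(outRect), tmpl, resOut, cv::TM_SQDIFF_NORMED);
    double rIn, rOut;
    cv::minMaxLoc(resIn, &rIn);                    // R_in: best match in holder
    cv::minMaxLoc(resOut, &rOut);                  // R_out: best background match
    return rIn / (rOut + 1e-9) < p;                // detected if ratio below p
}
```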
This template matching method does not require high resolution images. It works
well even if the candidates are small in size when the traffic lights are a long distance
away. In addition, it is effective to eliminate some false positives. As an improvement
to the detection method, additional constraints were considered in [29], such as the
mean and variance of the pixel values at the position of the two off lights should be
smaller than a certain threshold because those regions should be dark.
Figure 4.1: Applying the traffic light detector on a candidate.

However, the assumption that Rout is much larger than Rin is not always true. For
example, when the traffic lights are small or not so clear in the image, the off lights
are seen as dark regions. If the background is also dark, such as trees or buildings, the
template matching to the background Rout could also be very small. Then Rin/Rout
is likely above the given threshold p. As a result, the true traffic light is missed.
Additional constraints on the mean and variance of the pixel values do not solve this
problem either. In addition, it is difficult to choose a universal value for threshold
p. Thus, we propose an improved method that is integrated with machine learning
algorithms.
4.2.3 An improved method using SVM
Due to various backgrounds and object sizes in the image, it is difficult to manually
set a threshold for the Rin/Rout ratio obtained from template matching. So we propose
to build a support vector machine (SVM) that can automatically find the optimal
settings for the parameters (or features) extracted from the image through machine
learning. It requires a large dataset, with both positive and negative samples, for training
the SVM. For each candidate, we use the Rin and Rout values in conjunction with the pixel
values mref, min and mout to form a vector, where mref, min and mout are the mean pixel
values of the areas ROIref, ROIin and ROIout, respectively.
Each vector becomes a sample S1 for the SVM.
S1 = {Rin, Rout, mref , min, mout} (4.1)
The SVM is able to automatically adjust its parameters through the training process.
As demonstrated in Section 4.3, using the SVM to find parameters makes a huge leap
in terms of detection accuracy when compared to manually setting the threshold p.
However, we discover that the bounding box of a candidate by itself is not sufficient to
determine whether it is a traffic light or not. If we cut the candidates out from the
original image, sometimes even a human can hardly tell. Fig. 4.2 shows some examples
of candidates extracted from road images. The candidates in the first row have dark
backgrounds and those in the second row have bright backgrounds. As we can see, it is
difficult to determine a traffic light when the background is dark, while it is easier to spot
a traffic light against a bright background. Fig. 4.3 gives an example with both scenarios.
The left traffic light has a bright background while the right one has a dark background.
We also find that the brake lights of black vehicles, which are usually red, are a major
contributor to false positives.
Figure 4.2: Are they traffic lights or not? Dark background on the top and bright background at the bottom.
In order to improve the detection performance, we propose to add the location
information of the candidate bounding box as additional inputs to the SVM. The
Figure 4.3: The left traffic light has a bright background and the right traffic light has a dark background.
idea is that the size and ratio of a traffic light, as well as its location, should be
consistent among all training samples. For instance, a traffic light cannot be located as
low as the vehicle brake lights shown in Fig. 4.2. Each bounding box B has four
parameters,
B = {x, y, w, h} (4.2)
where (x, y) are coordinates of the upper-left corner (or origin) of a bounding box,
w and h are its width and height respectively. Intuitively, it is impossible for traffic
lights to appear on the road surface, therefore y should be within a certain range, and so
should x. There are implicit relationships between the size and position of a traffic light in
an image. Again, it is difficult to explore these relationships explicitly through image
processing. We propose to introduce the additional information of the bounding box
B by including it in the SVM input sample. Thus we form a new vector S2 for
each candidate, where
S2 = S1 ∪B = {Rin, Rout, mref , min, mout, x, y, w, h} (4.3)
As demonstrated later in Section 4.3, the expanded SVM vector shows significant
improvement on the detection performance. It is worth noting that the proposed
method can be expanded further by including more parameters and features in the
SVM. The proposed method is a framework that utilizes SVM as a machine learning
tool to automatically find optimal parameter settings for traffic light detection.
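A sketch of how a sample S2 from Eq. (4.3) might be assembled and used to train the classifier, assuming OpenCV's ml module; the helper names and the plain train() call are illustrative (OpenCV's trainAuto() can also be used for cross-validated parameter search).

```cpp
// Sketch of forming the expanded SVM sample S2 (Eq. 4.3) and training an
// RBF-kernel SVM on it; OpenCV's ml module assumed, names illustrative.
#include <opencv2/opencv.hpp>
#include <opencv2/ml.hpp>

cv::Mat makeSampleS2(double Rin, double Rout, double mRef, double mIn,
                     double mOut, const cv::Rect& box) {
    // One row of 9 features: {Rin, Rout, mref, min, mout, x, y, w, h}
    return (cv::Mat_<float>(1, 9) << Rin, Rout, mRef, mIn, mOut,
            box.x, box.y, box.width, box.height);
}

cv::Ptr<cv::ml::SVM> trainDetector(const cv::Mat& samples, const cv::Mat& labels) {
    auto svm = cv::ml::SVM::create();
    svm->setType(cv::ml::SVM::C_SVC);
    svm->setKernel(cv::ml::SVM::RBF);              // RBF kernel, as in Section 4.3
    svm->train(samples, cv::ml::ROW_SAMPLE, labels);
    return svm;
}
```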
4.3 Data Collection and Performance Evaluation
As an experimental setup, we mount a camera behind the front windshield and
record videos when driving on the road. We extract traffic light candidates using the
process discussed in 4.2.1. We obtain a data set with 21299 candidates from 2706
images. These images are extracted from actual videos and contain four independent
instances of circular traffic lights. Each image has a resolution of 1920-by-1080
pixels. We compare these candidates with the manually labeled ground truth
and find that there are 4526 true traffic lights and 16773 negative candidates.
This newly constructed dataset is used to evaluate the proposed detection method.
In order to compare the performance among different methods, we use two standard
metrics, precision and recall, where

precision = true positives / (true positives + false positives)    (4.4)

recall = true positives / (true positives + false negatives)    (4.5)

Figure 4.4: Rin/Rout values for true positive candidates (left) and true negative candidates (right). Y-axis is from 0 to 2000.
The dataset with 21299 candidates is shuffled randomly. When applied to the proposed
SVM, half of them are used as training data and the remaining are used for testing.
There is no overlap between training and test data.
We first evaluate the ratio Rin/Rout and its feasibility to detect the traffic lights on
the image. Fig. 4.4 shows that the ratio values for true negative candidates are generally
larger than those of true positive candidates. But if we zoom into the Rin/Rout values
with Y axis from 0 to 20 as in Fig. 4.5, we can see that many true negative candidates
also have small Rin/Rout values. Therefore, choosing a fixed threshold p is not an
effective method to separate the positive or negative candidates, because some true
positive candidates could be classified as negative and vice versa.
Table 4.1 lists the evaluation results based on Rin/Rout for different p values.
TP, FP, TN and FN stand for True Positives, False Positives, True Negatives and
False Negatives, respectively. The results show that it is difficult to balance between
precision and recall with a fixed threshold value. Thus we opt to use SVM for
classification based on the Rin and Rout values.

Figure 4.5: Rin/Rout values for true positive candidates (left) and true negative candidates (right). Y-axis is from 0 to 20.

Table 4.1: Evaluation result based on Rin/Rout for different p values
Threshold Precision Recall TP FP TN FN
p = 1.5 47.52% 96.51% 4368 4824 11949 158
p = 1.0 60.90% 89.79% 4064 2609 14164 462
p = 0.5 78.48% 76.14% 3446 945 15828 1080
p = 0.2 95.95% 45.01% 2037 86 16687 2489
Table 4.2 shows the performance of different detection methods. We use the classic
object detection method with Haar-like features and the AdaBoost algorithm as a baseline,
which provides the results of 76.89% precision and 73.40% recall. If we set the
threshold p = 0.5 for the Rin/Rout ratio, the detection performance is only slightly
better than the baseline.
Next, an SVM with a radial basis function (RBF) kernel is trained with {Rin, Rout}
as input for traffic light classification. Table 4.2 shows that the SVM improves
recall by 15.14% but precision by only 2.28%, compared with using a fixed
threshold.
As proposed in Section 4.2, when we add the pixel values mref, min and mout in addition
to the Rin and Rout values to form the SVM input vector S1, the detection performance is
improved significantly, with 89.09% precision and 96.60% recall. Furthermore, the
origin and geometry information {x, y, w, h} of the bounding box are added to form
S2. The improved method achieves the performance of 96.97% precision and
99.43% recall, which is reasonably accurate and reliable.
Table 4.2: Evaluation result: precision and recall
Detection method Precision Recall
Haar, AdaBoost 76.89% 73.40%
Rin/Rout, p = 0.5 78.48% 76.14%
{Rin, Rout}, SVM 80.76% 91.28%
S1, SVM 89.09% 96.60%
S2, SVM 96.97% 99.43%
Fig. 4.6 shows an example of detected traffic lights in an image. Although their
backgrounds are drastically different, both traffic lights are detected and marked in
the image. Our system is implemented in C++ and executed on the Intel i5-3570K
processor at 3.4 GHz. The processing time for each image frame is approximately 60
ms to 90 ms. For real-time implementation, we are currently migrating the design to
an FPGA platform.
4.4 Conclusions
In this chapter, we propose a new method that can detect traffic lights accurately and
reliably. Color extraction is applied to locate the candidates. A template matching
technique is applied to provide quantitative information about the traffic lights and their
surrounding areas. We also demonstrate that detection using a fixed threshold ratio is
not very effective and the SVM-based classification has much better performance. In
Figure 4.6: Both traffic lights are detected.
addition, we empirically add more parameters of a candidate to the SVM input and
it can achieve the best performance of 96.97% precision and 99.43% recall. As an
additional contribution, we build a traffic light dataset with 21299 samples from the
original videos captured while driving on the streets. This dataset can be used by
others for computer vision and machine learning research.
Chapter 5
Accurate and Reliable Detection of
Traffic Lights Using Multi-Class
Learning and Multi-Object
Tracking
Automatic detection of traffic lights has great importance to road safety. This chap-
ter presents a novel approach that combines computer vision and machine learning
techniques for accurate detection and classification of different types of traffic lights,
including green and red lights both in circular and arrow forms. Initially, color ex-
traction and blob detection are employed to locate the candidates. Subsequently, a
pre-trained PCA network is used as a multi-class classifier to obtain frame-by-frame
results. Furthermore, an online multi-object tracking technique is applied to over-
come occasional misses and a forecasting method is used to filter out false positives.
Several additional optimization techniques are employed to improve the detector performance
and handle the traffic light transitions. When evaluated using the test
video sequences, the proposed system can successfully detect the traffic lights on the
scene with high accuracy and stable results. With hardware acceleration, the
proposed technique is ready to be integrated into advanced driver assistance systems
or self-driving vehicles. We build our own dataset of traffic lights from recorded
driving videos, including circular lights and arrow lights in different directions. Our
experimental dataset is available at http://computing.wpi.edu/Dataset.html.
5.1 Introduction
Automatic detection of traffic lights is an essential feature of an advanced driver
assistance system or self-driving vehicle. It remains a critically important road safety
issue that many traffic accidents at intersections are caused by drivers running
red lights. Recent data from the Insurance Institute for Highway Safety (IIHS) show
that in 2012 on US roads, red-light-running crashes caused about 133,000
injuries and 683 deaths [1]. The introduction of an automatic traffic light detection system,
especially red light detection, has important social and economic impacts.
In addition to detecting traffic lights, it is also important to recognize whether the lights
appear in circular form or as directional arrow lights. For example, a red left arrow
light and a green circular light can appear at the same time. Without recognition, the
detection systems can get confused because valuable information has been lost. There
are few papers in the literature that combine detection and recognition of traffic lights
together.
Based on our survey, there are very few datasets available for traffic lights. The
Traffic Lights Recognition (TLR) public benchmarks [8] contain image sequences with
traffic lights and ground truth. However, the images in the dataset do not have high
resolution, and the number of physical traffic lights is limited due to the fact that
the image sequences are converted from a short video. In addition, this dataset
only contains circular traffic lights, which is not always the case in real applications.
Therefore, we opt to build our own dataset for traffic light detection, including circular
lights and arrow lights in all three directions. Our dataset of traffic lights can be used
by many other researchers in computer vision and machine learning.
In this chapter, we propose a new method that combines computer vision and
machine learning techniques. Color extraction and blob detection are used to locate
the candidates, followed by the PCA network (PCANet) [32] classifiers. The PCANet
classifier consists of a PCANet and a linear Support Vector Machine (SVM). Our
experimental results suggest the proposed method is highly effective for detecting
both green and red traffic lights of many types.
Despite the effectiveness of PCANet and many outstanding achievements made
by computer vision researchers, object detection in a single image still makes frequent
errors, which may cause serious problems in real-world safety-critical applications such as
Advanced Driver Assistance Systems (ADAS). Traditional frame-by-frame detection
methods ignore the inter-frame information in the video. Since the objects in a video
are normally in continuous motion, their identities and trajectories are valuable in-
formation that can improve the frame-based detection results. Unlike a pure tracking
problem that tracks a marked object from the first frame, tracking-by-detection algo-
rithms involve frame-by-frame detection, inter-frame tracking and data association.
In addition, multi-object tracking (MOT) algorithms can be employed to distinguish
different objects and keep track of their identities and trajectories. When it becomes
a multi-class problem such as recognizing different types of traffic lights, additional
procedures such as a voting scheme are often applied. In addition, the method needs to
address the situation in which the traffic light status changes suddenly during the detection
process.
The rest of the chapter is organized as follows. Section 5.2 describes our data
collection and experimental setup. In Section 5.3, we propose a method that combines
computer vision and machine learning techniques for traffic light detection using
PCANet. In Section 5.3.3, we propose a MOT-based method that stabilizes the
detection and improves the recognition results. Performance evaluation is presented
in Section 5.4, followed by some discussion in Section 5.5 and conclusions in Section
5.6.
5.2 Data Collection and Experimental Setup
In this chapter, we focus on the detection of red and green traffic lights, and the
recognition of their types. The amber lights can be detected using similar techniques,
but we do not consider amber lights here due to lack of data. The recognition of arrow
lights requires high-resolution input frames; otherwise all
lights are just colored dots or balls in the frame, and it is impossible to recognize
them.
We mount a smartphone behind the front windshield and record videos when
driving on the road. Several hours of videos are recorded around the city of Worcester,
Massachusetts, USA, during both summer and winter seasons. Subsequently, we select
a subset of video frames to build the dataset since most of the frames do not contain
traffic lights. In addition, passing an intersection only takes a few seconds in case of
the green lights. At red lights, the frames are almost identical as the vehicle is stopped.
Thus the length of the selected video for each intersection is very short. Several minutes
of traffic-light-free frames are retained in our dataset for assessment of false positives.
Each image has a resolution of 1920×1080 pixels. To validate the proposed approach
and to avoid overlapping of training and test data, the data collected in the summer
is used for training and the data collected in the winter is used for testing. Our traffic
light dataset is made available online at http://computing.wpi.edu/Dataset.html.
5.2.1 Training data
All the training samples are taken from the data collected during the summer. Input
data to the classifier are obtained from the candidate selection procedure described in
5.3.1, and the classifier output goes to the tracking algorithm for further processing.
Thus the evaluation of the classifier is independent of the candidate selection and the post-
processing (tracking). The classifier is trained to distinguish true and false traffic
lights, and to recognize the types of the traffic lights. OpenCV [84] is used for
SVM training, which chooses the optimal parameters by performing 10-fold cross-
validation.
The positive samples, which contain the traffic lights, are manually labeled and
extracted from the dataset images. The negative samples, such as segments of trees
and vehicle tail lights, are obtained by applying the candidate selection procedure over
the traffic-light-free images. The green lights and red lights are classified separately.
For green lights, there are three types based on their aspect ratios. The first type is
called Green ROI-1, which contains one green light in each image and its aspect ratio
is approximately 1:1. The second type is called Green ROI-3. It contains the traffic
light holder area which has one green light and two off lights, and its aspect ratio is
approximately 1:3. The third type is called Green ROI-4. It contains the traffic light
Figure 5.1: Examples of 5 classes of Green ROI-1.
holder area which has one green round light, one green arrow light, and two off lights,
and its aspect ratio is approximately 1:4.
Each type of sample image has several classes. The Green ROI-1 and Green
ROI-3 both have five classes including negative samples, as shown in Fig. 5.1 and Fig.
5.2. These 5 classes from top to bottom are Green Negative (GN-1; GN-3), Green
Arrow Left (GAL-1; GAL-3), Green Arrow Right (GAR-1; GAR-3), Green Arrow
Forward (GAF-1; GAF-3) and Green Circular (GC-1; GC-3).
The Green ROI-4 also has five classes including negative samples as shown in Fig.
5.3. The five classes from top to bottom are Green Negative (GN-4), Green Circular
and Green Arrow Left (GCGAL-4), Green Circular and Green Arrow Right (GCGAR-
4), Green Arrow Forward and Left (GAFL-4) and Green Arrow Forward and Right
(GAFR-4). The Green Negative samples are obtained from traffic-lights-free videos
by using the color extraction method discussed in Section 5.3.1.
For red lights, there are two types of sample images based on their aspect ratios.
The first type is called Red ROI-1, as shown in Fig. 5.4. It contains one red light in
each image and its aspect ratio is approximately 1:1. The other type is called Red
Figure 5.2: Examples of 5 classes of Green ROI-3.
Figure 5.3: Examples of 5 classes of Green ROI-4.
Figure 5.4: Examples of 3 classes of Red ROI-1.
ROI-3, as shown in Fig. 5.5. It contains the traffic light holder, which holds one
red light and two off lights, and its aspect ratio is approximately 1:3. Each type of
sample image has three classes: Red Negative (RN-1; RN-3), Red Arrow Left (RAL-
1; RAL-3) and Red Circular (RC-1; RC-3). The Red Negative samples are obtained
from traffic-lights-free videos by using the color extraction method mentioned in 5.3.1.
The red lights do not have ROI-4 data because the red light is on top, followed by an
amber light and one or two green lights at the bottom. If the red light is on, the
amber and green lights beneath must be off. These three lights form the vertical ROI-3
setting, regardless of the status of the 4th light at the very bottom.
Table 5.1 shows the number of training samples of Green ROI-n and Red ROI-n,
where n is 1, 3 or 4.
Features of a traffic light itself may not be as rich as other objects such as a
human or a car. For example, a circular light is just a colored blob that looks similar
to other objects in the same color. Therefore, it is difficult to distinguish the true
traffic lights from other false candidates solely based on color analysis. The ROI-3 and
ROI-4 samples are images of the holders, which provide additional information for
detection and classification. The approach of combining all this information together
is explained in 5.3.2.2.
Figure 5.5: Examples of 3 classes of Red ROI-3.
Table 5.1: Number of training samples of Green ROI-n and Red ROI-n
Class n = 1 n = 3 n = 4
GN-n 13218 13218 13213
GAL-n 1485 835 -
GAR-n 1717 617 -
GAF-n 2489 1018 -
GC-n 3909 3662 -
GCGAL-n - - 369
GCGAR-n - - 281
GAFL-n - - 749
GAFR-n - - 1005
RN-n 7788 7619 -
RAL-n 1214 1235 -
RC-n 4768 5035 -
5.2.2 Test data
All test images are taken from the dataset that we collected in the winter. The ground
truths are manually labeled and are used for validating the results. In our proposed
method, a tracking technique is used to further improve the performance. However,
traffic lights can move out of the image or change states during the tracking process.
Therefore the test sequences need to cover many possible scenarios for all types of
lights. Detailed information of the test sequences is shown in Table 5.2.
5.3 Proposed Method of Traffic Light Detection
and Recognition
Fig. 5.6 shows the flowchart of our proposed method of traffic light detection and
recognition, which consists of three stages. Firstly, color extraction and candidate
Table 5.2: Information of 23 test sequences
Seq ID Frames Traffic lights Types of traffic lights Description
1 91 182 Green circular×2. Lights in all frames.
2 90 180 Green circular×2. Lights in all frames.
3 61 147 Green arrow left×3. Lights in all frames.
4 48 144 Green circular×3. Lights in all frames.
5 156 312 Red circular×2. Lights in all frames.
6 156 211 Green circular×2. Lights at start, then move out.
7 214 428 Green circular×2. Lights in all frames.
8 76 152 Red circular×2. Lights in all frames.
9 245 305 Green circular×2. Lights at start, then move out.
10 174 177 Green circular×2. Lights at start, then move out.
11 91 348 Red circular×3; green arrow left; green arrow right; green arrow forward; green circular. Red lights at start, then green lights.
12 56 280 Red arrow left; green arrow right×2; green arrow forward×2. Lights in all frames.
13 82 70 Green circular×2. Lights at start, then move out.
14 259 518 Green circular×2. Lights in all frames.
15 65 325 Red arrow left; green arrow right×2; green arrow forward×2. Lights in all frames.
16 185 242 Green circular×2. Lights at start, then move out.
17 93 186 Red circular×2. Lights in all frames.
18 630 0 None. No traffic lights.
19 580 0 None. No traffic lights.
20 416 0 None. No traffic lights.
21 550 0 None. No traffic lights.
22 759 0 None. No traffic lights.
23 3035 0 None. No traffic lights.
Total 8112 4207 - -
selection are performed over the input image. Secondly, to determine whether the
selected candidates are traffic lights and what types of lights, they are processed
by PCANet and SVM. Finally, tracking and forecasting techniques are applied to
improve the performance and stabilize the final output.
Figure 5.6: Flowchart of the proposed method of traffic light detection and recognition.
5.3.1 Locating candidates based on color extraction
Color extraction is applied to locate the Regions of Interest (ROI), i.e., the candidates.
The images are converted to the hue, saturation, and value
(HSV) color space. Compared with the RGB color space, HSV is more robust
against illumination variation and more suitable for segmentation [85]. The desired
color is extracted from an image mainly based on the hue values, which results in a
binary image. Suppose the HSV value of the ith pixel in an image is
HSVi = {hi, si, vi} (5.1)
In order to extract green pixels, we set the color thresholds based on the empirical
data:
40 ≤ hi ≤ 90 (5.2)
60 ≤ si ≤ 255 (5.3)
110 ≤ vi ≤ 255 (5.4)
In order to extract red pixels, besides (5.3) and (5.4), one of the following condi-
tions must hold:
165 ≤ hi ≤ 180 (5.5)
0 ≤ hi ≤ 20 (5.6)
These values are adjustable and similar settings can be found in [28]. Note that
the threshold values that we choose work well in OpenCV [84] and may need proper
conversion in order to work with other libraries.
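These thresholds translate directly into mask operations; a minimal sketch assuming OpenCV (whose hue channel spans 0–180, matching the values above) is:

```cpp
// Minimal sketch of the HSV color extraction with the thresholds in
// Eqs. (5.2)-(5.6); OpenCV assumed, function name illustrative.
#include <opencv2/opencv.hpp>

void extractColors(const cv::Mat& bgr, cv::Mat& greenMask, cv::Mat& redMask) {
    cv::Mat hsv;
    cv::cvtColor(bgr, hsv, cv::COLOR_BGR2HSV);
    // Green: 40 <= h <= 90, 60 <= s <= 255, 110 <= v <= 255
    cv::inRange(hsv, cv::Scalar(40, 60, 110), cv::Scalar(90, 255, 255), greenMask);
    // Red wraps around hue 0: take the union of the two hue intervals
    cv::Mat redLow, redHigh;
    cv::inRange(hsv, cv::Scalar(0, 60, 110), cv::Scalar(20, 255, 255), redLow);
    cv::inRange(hsv, cv::Scalar(165, 60, 110), cv::Scalar(180, 255, 255), redHigh);
    cv::bitwise_or(redLow, redHigh, redMask);
}
```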
Blob detection can be implemented using flood-fill or contour following. The blobs
can be considered as the potential candidates. However, it is possible that an arrow
light may be labeled as two different regions, because the head and tail of an arrow
are sometimes separated by a gap between them. When the traffic lights are closer
to the camera, it is more likely that the gaps can be clearly seen and thus affect the
result of blob extraction. To solve this problem, the closing operation is performed
on the binary image obtained from color extraction. Closing operation is a typical
morphological operation in image processing. It applies a dilation followed by an
erosion, which eliminates gaps and holes on the binary image. Therefore, the arrow
light can be detected as a whole, and the candidates after closing are more reliable than
the original candidates. Fig. 5.7 shows the original result of color extraction and blob
detection (top right), and the result with the closing operation (bottom right).

Figure 5.7: Color extraction, blob detection and closing operation.
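A sketch of this closing step, assuming OpenCV; the 5 × 5 elliptical structuring element is an assumption, since the chapter does not fix a kernel.

```cpp
// Sketch of the morphological closing used to merge the head and tail of
// arrow lights before blob detection; kernel shape/size are assumptions.
#include <opencv2/opencv.hpp>

cv::Mat closeMask(const cv::Mat& mask) {
    cv::Mat kernel = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(5, 5));
    cv::Mat closed;
    // Dilation followed by erosion: fills small gaps and holes in the blobs
    cv::morphologyEx(mask, closed, cv::MORPH_CLOSE, kernel);
    return closed;
}
```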
The side-effect of the closing operation is that it might connect a green light
with other green objects in the background such as trees. When the traffic lights
are far away from camera, this problem is more likely to occur because the black
borders of traffic light holders are thin. However, when the traffic lights are far away,
the gaps are more likely to be filled by the halo of the lights, or become invisible
due to the limitation of image resolution. Therefore, the original candidates are more
reliable than those after closing. It is difficult to determine whether the morphological
closing operation should be applied. Therefore, we choose to keep both the original
candidates and the candidates after the closing operation. If overlapping candidates
are identified through the classification, the candidate with the aspect ratio closest to
one is selected.
The objective of eliminating false positives is considered in the latter part of the
proposed method. Fig. 5.8 shows an example of the road images. In this image,
there are four green traffic lights, but 895 green candidates can be extracted using
the method mentioned above. This requires the classifier to be very strong, filtering out
the negative candidates while retaining the positive ones. However, even if the classifier is
able to filter out 99% of the negative candidates, there are still about 9 false positives
remaining in this image, which is an unacceptable result. Therefore, pre-filtering and
post-validation steps are necessary in addition to the classifier itself. For red traffic
lights, the number of candidates is much smaller than that of the green traffic lights.
For example, there are 19 red candidates in Fig. 5.8 from the color extraction.
In many previous works [21, 22, 25, 29], a variety of morphological filtering tech-
niques were applied to eliminate some candidates for the purpose of reducing false
positives. However, any filtering has a possibility of missing the true traffic lights,
because the traffic lights are not always clear due to their size and the obscure back-
ground in an image. Thus only an aspect ratio check is performed in the proposed
method, and all blobs that pass the check are kept as candidates. The aspect ratio ar
is defined as

ar = w/h    (5.7)

Figure 5.8: A sample frame from our traffic light dataset.
where w is the width and h is the height of the candidate. In order to pass the aspect
ratio check, the following inequality must hold:
2/3 ≤ ar ≤ 3/2 (5.8)
The aspect ratio check reduces the number of candidates. In Fig. 5.8, the number
of green candidates is reduced to 51 and the number of red candidates is reduced to
9 after the aspect ratio check.
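Combining blob detection with the checks in (5.7) and (5.8), a minimal candidate-extraction sketch (assuming OpenCV; the function name is illustrative) is:

```cpp
// Candidate extraction: find blobs in a binary mask and keep those whose
// bounding box satisfies 2/3 <= w/h <= 3/2 (Eqs. 5.7 and 5.8).
#include <opencv2/opencv.hpp>
#include <vector>

std::vector<cv::Rect> findCandidates(const cv::Mat& mask) {
    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(mask, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
    std::vector<cv::Rect> candidates;
    for (const auto& c : contours) {
        cv::Rect box = cv::boundingRect(c);
        double ar = (double)box.width / box.height;   // Eq. (5.7)
        if (ar >= 2.0 / 3.0 && ar <= 3.0 / 2.0)       // Eq. (5.8)
            candidates.push_back(box);
    }
    return candidates;
}
```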
5.3.2 Classification
5.3.2.1 PCANet
The PCANet classifier is applied to determine whether a candidate is a traffic light
or not. The PCANet classifier consists of a PCA network and a multi-class SVM. The
structure of PCANet is simple, comprising a number of PCA stages followed by an
output stage. The number of PCA stages can vary, but the typical value
is 2, making it the so-called two-stage PCANet. As shown in [32], a two-stage PCANet
outperforms the single-stage PCANet in most cases, but further increasing the number
of stages does not necessarily provide better performance, according to the authors'
empirical experience. Therefore, a two-stage PCANet is used in our proposed method.
The structure of PCANet emulates that of a traditional convolutional neural
network [86]. The convolution filter bank is chosen to be PCA filters. The nonlinear
layer is the binary hashing (quantization). The pooling layer is the block-wise his-
togram of binary vectors. There are two parts in each PCA stage: patch mean removal
and PCA filter convolution. For each pixel of the input image, there is a patch of
pixels of the same size as the filter. The mean is removed from each patch,
followed by PCA filter convolution. The PCA filters are obtained by unsupervised
learning during the training process. The number of PCA filters can vary; the
impact of the number of PCA filters is discussed in [32]. Generally speaking, more
PCA filters lead to better performance. In this chapter, we choose 8 filters for both
PCA stages and find this sufficient to deliver good performance.
The output stage consists of binary hashing and block-wise histograms. The outputs
of the PCA stages are converted to binary values, with positive values mapped to one
and all others to zero. Thus a binary vector is obtained for each patch, and the length of this vector
Figure 5.9: The structure of two-stage PCANet.
is fixed. This binary vector is then converted to a decimal value. The block-wise
histogram of these decimal values forms the output features. The SVM is then fed
with the features from PCANet. Fig. 5.9 shows the structure of a two-stage PCANet.
The number of filters in stage 1 is m and in stage 2 is n.
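The binary hashing and decimal packing can be sketched as follows, assuming the stage-2 filter responses of one image are held in OpenCV matrices; the block-wise histogram step that follows is omitted for brevity.

```cpp
// Sketch of the PCANet output stage: binarize the stage-2 filter responses
// (positive -> 1, else 0) and pack them into per-pixel decimal codes.
#include <opencv2/opencv.hpp>
#include <vector>

cv::Mat binaryHash(const std::vector<cv::Mat>& responses) {
    CV_Assert(!responses.empty());
    cv::Mat code = cv::Mat::zeros(responses[0].size(), CV_32S);
    for (size_t k = 0; k < responses.size(); ++k) {
        cv::Mat bit = responses[k] > 0;            // Heaviside step: 0 or 255
        bit.convertTo(bit, CV_32S, 1.0 / 255.0);   // mask 0/255 -> binary 0/1
        code += bit * (1 << k);                    // weight bit k by 2^k
    }
    return code;  // per-pixel decimal values, ready for block-wise histograms
}
```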
5.3.2.2 Recognizing green traffic lights using PCANet
As mentioned in 5.3.1, due to the large number of green objects in an image, such
as trees, street signs, and green vehicles, the classifier must be strong enough to
eliminate potential false positives while maintaining a high detection rate. Using
the green areas as candidates is not sufficient. For example, a fragment of tree leaves
may occasionally look similar to the green lights in some frames, which causes false
positive “flashing” in the video of detection results.
To solve this problem, a validation step is applied to the system. It is assumed
that the traffic lights always appear in a holder. The traffic light holder contains three
or four lamps that are vertically aligned in our collected data. Note that horizontal
traffic lights are also often used and can be processed using the same method if the
dataset is available. In addition, these lamps have certain combinations. The traffic
light holder area thus contains important information that can help us detect the
traffic lights. In a vertical traffic light holder, the bottom one is always a green lamp.
Therefore, the position of potential traffic light holder can be located according to
the green area. The aspect ratio of the green area is approximately 1:1, and the green
area is called ROI-1. The traffic holder area with three lamps is called ROI-3
and the traffic holder area with four lamps is called ROI-4. Suppose the rectangular
bounding box of ROI-1 is RROI−1 where
RROI−1 = {xROI−1, yROI−1, wROI−1, hROI−1} (5.9)
Similarly there are bounding boxes RROI−3 for ROI-3 and RROI−4 for ROI-4 where
RROI−3 = {xROI−3, yROI−3, wROI−3, hROI−3} (5.10)
RROI−4 = {xROI−4, yROI−4, wROI−4, hROI−4} (5.11)
The variables xROI−i, yROI−i are the coordinates of the top-left corner of the bounding
box RROI−i , wROI−i is its width and hROI−i is its height. The RROI−3 can be obtained
based on RROI−1 as follows, where the coefficients are determined empirically based
on the assumption that the lights are vertically aligned and the green light is the
lowest light:
xROI−3 = xROI−1 − 0.1× wROI−1 (5.12)
yROI−3 = yROI−1 − 2.5× hROI−1 (5.13)
wROI−3 = 1.2× wROI−1 (5.14)
hROI−3 = 3.6× hROI−1 (5.15)
In the case of horizontally aligned lights, these coefficients should be changed accord-
ingly. Similarly, the RROI−4 can be obtained based on RROI−1 as follows:
xROI−4 = xROI−1 − 0.1× wROI−1 (5.16)
yROI−4 = yROI−1 − 3.9× hROI−1 (5.17)
wROI−4 = 1.2× wROI−1 (5.18)
hROI−4 = 5.1× hROI−1 (5.19)
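These holder-box derivations reduce to a few lines; the sketch below mirrors Eqs. (5.12)–(5.19) using OpenCV's cv::Rect, and the clipping remark is a practical note rather than part of the equations.

```cpp
// Deriving the ROI-3 and ROI-4 holder boxes from a green ROI-1 box using
// the empirical coefficients in Eqs. (5.12)-(5.19); OpenCV assumed.
#include <opencv2/opencv.hpp>

cv::Rect greenRoi3(const cv::Rect& r1) {
    return cv::Rect(cvRound(r1.x - 0.1 * r1.width),    // Eq. (5.12)
                    cvRound(r1.y - 2.5 * r1.height),   // Eq. (5.13)
                    cvRound(1.2 * r1.width),           // Eq. (5.14)
                    cvRound(3.6 * r1.height));         // Eq. (5.15)
}

cv::Rect greenRoi4(const cv::Rect& r1) {
    return cv::Rect(cvRound(r1.x - 0.1 * r1.width),    // Eq. (5.16)
                    cvRound(r1.y - 3.9 * r1.height),   // Eq. (5.17)
                    cvRound(1.2 * r1.width),           // Eq. (5.18)
                    cvRound(5.1 * r1.height));         // Eq. (5.19)
}
// In practice the resulting rectangles should be intersected with the image
// bounds before cropping, e.g. roi & cv::Rect(0, 0, img.cols, img.rows).
```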
All samples of ROI-1 are resized to 10 × 10 pixels, all samples of ROI-3 to 10 × 33
pixels and all samples of ROI-4 to 10×43 pixels. Three PCANet classifiers are trained
separately for ROI-1, ROI-3 and ROI-4. Each classifier is able to perform multi-
class classification, such as distinguishing left arrows, right arrows, circular lights and
negative samples.
In order to combine the results of these three classifiers, several methods are
evaluated using the test dataset. An intuitive solution is the voting strategy. The
results of ROI-1, ROI-3 and ROI-4 are voted to several classes and the class that has
the most votes is selected as the final result. However, this method is not accurate.
The ROI-3 may contain only a partial area of a traffic light holder if it is actually a four-
light holder. The ROI-4 may contain background if it is actually a three-light holder.
Therefore, the positive results of ROI-3 and ROI-4 are both considered as possible
regions. If any positive results of ROI-1 overlap with these regions, it is considered
a true positive green light. This is a more plausible approach because the two cases
mentioned above do contain the traffic light holders that are the possible regions.
Although the class types determined by ROI-3 and ROI-4 may be inaccurate, the
ROI-1 is capable of providing an accurate result.
5.3.2.3 Recognizing red traffic lights using PCANet
Red traffic lights are recognized in a similar way to green lights. The bounding
boxes of Red ROI-1 and ROI-3 are expressed the same way as those of the green lights
shown in (5.9) and (5.10). Assuming the lights are vertically aligned and the red
light is the top light, the RROI−3 can be obtained based on RROI−1 using Equations
(5.12), (5.14), (5.15) and

yROI−3 = yROI−1 − 0.1× hROI−1    (5.20)
5.3.3 Stabilizing the detection and recognition output
5.3.3.1 The problem of frame-by-frame detection
Frame-by-frame detection is important, but not sufficient to render stable output.
The reasons are twofold. One aspect is that no detector can perform perfectly under
all possible scenarios. Another is that the input data are sometimes not of good
quality. For example, vehicle vibrations may cause the camera to lose focus, making the
frames blurry. A red arrow traffic light in such a situation may look identical to a
circular red light and can hardly be recognized even by human eyes, which is shown
in the image in the center of Fig. 5.10. However, the arrow light is clear in other
Figure 5.10: An arrow light in three consecutive frames. The middle one is blurry and looks similar to a circular light. A detector often fails on such a frame.
frames. If the detector recognizes this arrow light in previous frames and keeps track
of it, a correct estimation can be provided for the blurry frame even if the detector
gives an incorrect result. In addition, there may be multiple lights in a frame, so
multiple lights need to be distinguished and not confused with each other.
The goal of multi-object tracking is to recover the complete tracks of multiple ob-
jects and to give estimation of their current states. There are two categories of multi-
object tracking methods: batch methods and online methods. The batch methods
require the detection results of the entire sequence before analyzing the identity and
constructing the trajectory of each object, which makes it impractical for real-time
applications. The online methods are based on information that is available up to
the current frame, which can provide results in real-time. Traffic light detection is a
time-critical application that needs to give immediate feedback to the driver or con-
troller, therefore multi-object tracking must be done using the online method. The
online methods track objects from previous frames, and associate the tracking result
with detection result of the current frame.
5.3.3.2 Tracking and data association
Here we propose an intuitive approach which is optimized for the traffic light detection
application. For video camera at 30 frames per second (FPS), the motion of the
lights between the adjacent frames are of small values. Therefore, an object in the
next frame should be found near its location in the previous frame. Since color is an
important feature of traffic lights, mean shift method is employed to locate the traffic
light based on its previous position. Given a traffic light in the previous frame, the
mean shift procedure calculates the histogram in the hue channel of the HSV color
space, and then calculates histogram back-projection in the current frame in order to
locate the light.
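A sketch of this hue-histogram mean shift step, assuming OpenCV; the 30-bin histogram and the termination criteria are illustrative assumptions.

```cpp
// Hue-histogram mean shift: build a hue histogram of the light in the
// previous frame, back-project it into the current frame, and shift the box.
#include <opencv2/opencv.hpp>

cv::Rect trackLight(const cv::Mat& prevFrame, const cv::Mat& currFrame,
                    cv::Rect box) {
    cv::Mat prevHsv, currHsv;
    cv::cvtColor(prevFrame, prevHsv, cv::COLOR_BGR2HSV);
    cv::cvtColor(currFrame, currHsv, cv::COLOR_BGR2HSV);

    int histSize = 30;                             // bins over hue 0-180
    float hrange[] = {0, 180};
    const float* ranges[] = {hrange};
    int channels[] = {0};                          // hue channel only
    cv::Mat roi = prevHsv(box), hist;
    cv::calcHist(&roi, 1, channels, cv::Mat(), hist, 1, &histSize, ranges);
    cv::normalize(hist, hist, 0, 255, cv::NORM_MINMAX);

    cv::Mat backProj;
    cv::calcBackProject(&currHsv, 1, channels, hist, backProj, ranges);
    cv::meanShift(backProj, box,
                  cv::TermCriteria(cv::TermCriteria::EPS | cv::TermCriteria::COUNT,
                                   10, 1));
    return box;                                    // shifted location estimate
}
```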
There are other tracking methods such as the particle filter, which has been proven to
work for multiple people tracking [87]. We do not adopt it for two reasons. One is that
traffic lights are small objects in a high resolution image which has 1920 × 1080
pixels. This makes it difficult for the particles to locate the traffic lights accurately
and may need a large number of particles, which is computationally expensive. The
other reason is that the weights of each particle cannot be evaluated effectively. The
assumption that the detection confidence of each particle is higher when it gets closer
to the actual position of the light is not true. The lights are so small in the image
and a small deviation may lose the target completely. In addition, our detector is
trained based on images of complete traffic lights, thus it cannot distinguish partial
lights from backgrounds nor give higher confidence values for them.
For data association, [87] employs greedy data association and observes results
similar to those of the Hungarian algorithm [88]. In our approach, the tracking
result is simply associated with the detection result when they overlap. The reason is
that the traffic lights are nearly motionless between adjacent frames and mean shift performs well in
locating them. In addition, unlike people detection, traffic lights do not intersect with
each other, so there is no need to consider the identity switch problem, which
makes it easier to associate the tracking and detection results. Once the association
is established, the detected regions, rather than the regions found by mean shift itself,
are used to seed mean shift tracking in the next frame. This solves the scale problem of
mean shift, as the detected regions are considered more accurate than the tracking
result.
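The association rule itself is simple enough to sketch in a few lines. Boxes are assumed to be (x, y, w, h) tuples; the greedy loop and the zero-overlap test mirror the description above, but the helper names are hypothetical.

```python
def overlap(a, b):
    """Intersection area of two boxes given as (x, y, w, h)."""
    ix = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    return ix * iy

def associate(tracks, detections):
    """Greedy association: each track adopts the first detection it overlaps.
    The detected box then seeds mean shift tracking in the next frame."""
    matches, used = {}, set()
    for t_id, t_box in tracks.items():
        for d_idx, d_box in enumerate(detections):
            if d_idx not in used and overlap(t_box, d_box) > 0:
                matches[t_id] = d_idx
                used.add(d_idx)
                break
    return matches
```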
Building trajectories of the objects can overcome occasional misses, but still can-
not filter out false positives. For example, if a rear light of a car is misclassified
as a red traffic light in several frames, its trajectory is very likely to be built by
multi-object tracking algorithms. However, the time series data for each object can
be obtained from online multi-object tracking. Since the time series data consist of
classification results over time, they can be used to generate the final output using
forecasting and time series analysis.
5.3.3.3 Forecasting
Given the previous detection or recognition results of a target, the estimate of its
current state is the final output. Such a process is called forecasting and time series
analysis. Multi-object tracking algorithms focus on building the trajectories and
pay little attention to filtering out false positives. The idea here is that the accumulated
classification results of a false object often show different patterns from those
of a true object, which can be used to filter out false positives. This is based on the
assumption that the detector can distinguish true positives from
false positives to some extent, at least better than random guessing. Otherwise, it is
impossible to filter out the false positives. Several methods can be used to address the
false positive problem. In [87], a tracker is only initialized in certain regions of the
image, and is deactivated or terminated when there is no associated detection for a
certain number of frames. Tracklet confidence is introduced in [89], which is influenced
by factors such as length, occlusion, and the affinity between tracking and detection.
In this chapter, we employ a simple forecasting technique after online multi-object
tracking, aiming at stabilizing the imperfect output of traffic light detection and
recognition. For each object, there is a binary time series where 1 denotes that the
detection result is true and 0 otherwise. The simple moving average (SMA) of the
time series is then calculated. Let $n$ be the window size of the SMA, $b_i$ the value
of the time series in the $i$-th frame, and $S_m$ the SMA value in frame $m$; then
$$S_m = \frac{b_{m-(n-1)} + b_{m-(n-2)} + \cdots + b_{m-1} + b_m}{n} \quad (5.21)$$
or alternatively
$$S_m = S_{m-1} - \frac{b_{m-n}}{n} + \frac{b_m}{n} \quad (5.22)$$
This can be interpreted as $S_m$ being propagated from $S_{m-1}$ while the oldest
value in the sliding window is replaced with the newest one. $S_m$ is used to determine
whether the object is considered positive, and a threshold $t$ determines the
final output $b_m$ as
$$b_m = \begin{cases} 1 & S_m \ge t \\ 0 & S_m < t \end{cases} \quad (5.23)$$
When $b_m$ is positive, a majority voting scheme is used to determine the type of the
traffic light. The history labels of this particular light are cast as votes into the
corresponding bins, and the bin with the most votes gives the type of the traffic light.
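A compact sketch of this stabilization step is given below. It combines the moving average of Eq. (5.21) (computed over however many frames are available, which coincides with the modified average of Eq. (5.24) before the window fills), the threshold test of Eq. (5.23), and the majority vote; the window size and threshold values are illustrative assumptions.

```python
from collections import Counter, deque

class LightStabilizer:
    """Sliding-window stabilizer following Eqs. (5.21)-(5.23); n = 5 and
    t = 0.5 are assumed example values, not the dissertation's settings."""
    def __init__(self, n=5, t=0.5):
        self.window = deque(maxlen=n)   # last n binary results b_i
        self.labels = deque(maxlen=n)   # recent class labels of this light
        self.t = t

    def update(self, detected, label=None):
        self.window.append(1 if detected else 0)
        if detected and label is not None:
            self.labels.append(label)
        s_m = sum(self.window) / len(self.window)  # SMA S_m, Eq. (5.21)
        if s_m >= self.t and self.labels:          # threshold test, Eq. (5.23)
            # Majority vote over the history labels decides the light type
            return Counter(self.labels).most_common(1)[0][0]
        return None                                # suppressed / negative
```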
5.3.3.4 Minimizing delays
Forecasting and time series analysis usually introduce delays. As the window size $n$
grows, the delays become more severe. The delay at the head of a trajectory helps
avoid picking up false positives, because false positives are expected to be occasional
and inconsistent; however, picking up true positives slowly produces misses or false
negatives. On the other hand, the delay at the tail of a trajectory helps avoid
dropping true positives, because true positives are expected to be consistent with
only minimal and temporary errors; however, dropping false positives slowly produces
erroneous output and increases the total number of false positives in the sequence. The
delays must be balanced so that their side effects are minimized while their useful
functions are not compromised.
At the head of a trajectory, a dynamic threshold and a modified moving average are
employed. In frame $m$, the moving average $S_m$ is modified as
$$S_m = \begin{cases} \dfrac{b_{m-(n-1)} + b_{m-(n-2)} + \cdots + b_{m-1} + b_m}{n} & m \ge n \\[6pt] \dfrac{b_1 + b_2 + \cdots + b_{m-1} + b_m}{m} & m < n \end{cases} \quad (5.24)$$
and the threshold $t_m$ is set with a positive constant $\alpha$ as

$$t_m = \begin{cases} t & m \ge n \\ t + \alpha\left(1 - \dfrac{m}{n}\right) & m < n \end{cases} \quad (5.25)$$
At the beginning, the threshold is high, and it drops gradually as more frames become
available. The output from the first $n$ frames is suppressed because there is insufficient
information to make a reliable decision. In a video at 30 FPS, 5 frames correspond
to about 167 ms. According to [90], the reaction time of a human driver is over a second,
so such delays are acceptable. As a result, a true object detected with high confidence is picked
up quickly, while false positives can still be filtered out.
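The threshold schedule of Eq. (5.25) is a one-liner; the value of α below is an assumed example, included only to show how the threshold starts at t + α and decays linearly to t by frame n.

```python
def dynamic_threshold(m, n, t, alpha=0.3):
    """Threshold schedule of Eq. (5.25); alpha = 0.3 is an assumed value."""
    return t if m >= n else t + alpha * (1.0 - m / n)
```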
At the tail of a trajectory, an object that no longer exists needs to be dropped
quickly. Traffic lights may change their states or move out of the image during the
tracking process. A state transition is sudden: usually at most one frame
shows both lights on or both off, indicating that the transition is taking place.
In many cases this particular frame does not even exist, so it is unreliable for telling when
the transition occurs. However, traffic lights are nearly motionless between adjacent frames, so
the last valid position of a currently off light is still useful. When a transition happens,
it can be determined whether a newly detected light of a different color belongs to the
same traffic light holder; the transition is thereby identified and the expired
information is dropped. On the other hand, when positive detections of a light near
the edge of the image are lost for a few consecutive frames, the object is dropped to
avoid erroneous output. Occlusion is not considered in this chapter, because it is not
safe to predict the state of a light without actually seeing it completely.
5.4 Performance Evaluation
5.4.1 Detection and recognition
Fig. 5.11 shows an example frame with detected traffic lights. Here two metrics
named precision and recall are used, where
$$\text{precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} \quad (5.26)$$

$$\text{recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \quad (5.27)$$
The true positives (TP) are samples that belong to this class and are correctly
recognized as this class. The false positives (FP) are samples that do not belong to this class
but are incorrectly recognized as this class. The false negatives (FN) are samples
that belong to this class but are erroneously recognized as other classes.
Figure 5.11: All traffic lights are detected and recognized correctly in the frame.
The true positives here must be both detected and recognized correctly. A detected
but misclassified light does not provide the correct identity of the actual light, which is a
false negative. Meanwhile, it provides a false identity of another type of light, which
is a false positive. Therefore, a detected but misclassified light is counted as both a
false positive and a false negative. For example, if a red left arrow light is detected
but recognized as a red circular light, then the number of false negatives and the
number of false positives are both incremented by 1. Table 5.3 shows the results
of the test sequences with different configurations, such as using HOG or PCANet,
with or without tracking. It is clear that the PCANet outperforms HOG and tracking
Table 5.3: Test results of 17 sequences that contain traffic lights

| Seq. ID | HOG (TP/FN/FP, Precision, Recall) | HOG + Tracking (TP/FN/FP, Precision, Recall) | PCANet (TP/FN/FP, Precision, Recall) | PCANet + Tracking (TP/FN/FP, Precision, Recall) |
|---|---|---|---|---|
| 1 | 182/0/9, 95.3%, 100% | 162/12/13, 92.6%, 93.1% | 182/0/6, 96.8%, 100% | 162/12/6, 96.4%, 93.1% |
| 2 | 179/1/13, 93.2%, 99.4% | 171/1/4, 97.7%, 99.4% | 180/0/13, 93.3%, 100% | 172/0/15, 92.0%, 100% |
| 3 | 143/4/48, 74.9%, 97.3% | 135/4/8, 94.4%, 97.1% | 145/2/3, 98.0%, 98.6% | 135/4/0, 100%, 97.1% |
| 4 | 140/4/10, 93.3%, 97.2% | 132/0/0, 100%, 100% | 139/5/3, 97.9%, 96.5% | 132/0/0, 100%, 100% |
| 5 | 102/210/0, 100%, 32.7% | 154/150/0, 100%, 50.7% | 298/14/0, 100%, 95.5% | 304/0/0, 100%, 100% |
| 6 | 211/0/51, 80.5%, 100% | 186/17/41, 81.9%, 91.6% | 211/0/42, 83.4%, 100% | 186/17/32, 85.3%, 91.6% |
| 7 | 411/17/15, 96.5%, 96.0% | 420/0/11, 97.4%, 100% | 428/0/6, 98.6%, 100% | 420/0/0, 100%, 100% |
| 8 | 136/16/6, 95.8%, 89.5% | 420/0/11, 97.4%, 100% | 428/0/6, 98.6%, 100% | 144/0/0, 100%, 100% |
| 9 | 302/3/374, 44.7%, 99.0% | 297/0/128, 69.9%, 100% | 303/2/99, 75.4%, 99.3% | 297/0/37, 88.9%, 100% |
| 10 | 168/9/14, 92.3%, 94.9% | 169/0/10, 94.4%, 100% | 140/37/6, 95.9%, 79.1% | 160/9/5, 97.0%, 94.7% |
| 11 | 325/23/18, 94.8%, 93.4% | 306/30/22, 93.3%, 91.1% | 329/19/2, 99.4%, 94.5% | 314/22/3, 99.1%, 93.5% |
| 12 | 218/62/33, 86.9%, 77.9% | 232/28/11, 95.5%, 89.2% | 211/69/33, 86.5%, 75.4% | 201/59/29, 87.4%, 77.3% |
| 13 | 67/3/5, 93.1%, 95.7% | 54/8/17, 76.1%, 87.1% | 66/4/1, 98.5%, 94.3% | 54/8/17, 76.1%, 87.1% |
| 14 | 485/33/83, 85.4%, 93.6% | 510/0/144, 78.0%, 100% | 493/25/34, 93.5%, 95.2% | 510/0/13, 97.5%, 100% |
| 15 | 282/43/21, 93.1%, 86.8% | 295/10/7, 97.7%, 96.7% | 280/45/0, 100%, 86.2% | 271/34/0, 100%, 88.9% |
| 16 | 231/11/44, 84.0%, 95.4% | 230/4/35, 86.8%, 98.3% | 201/41/19, 91.4%, 83.1% | 220/14/16, 93.2%, 94.0% |
| 17 | 186/0/144, 56.4%, 100% | 178/0/110, 61.8%, 100% | 186/0/12, 93.9%, 100% | 178/0/1, 99.4%, 100% |
| Total | 3586/439/879, 80.3%, 89.1% | 3612/253/548, 86.8%, 93.45% | 3752/273/276, 93.1%, 93.2% | 3698/167/168, 95.7%, 95.7% |
technique improves the performance. The results are not perfect due to the limited
amount of training data and occasional quality issues in the captured video, as shown in
Fig. 5.10.
5.4.2 False positives evaluation
The number of false positives is evaluated over several traffic-light-free sequences,
as shown in Table 5.4. Again, PCANet outperforms HOG, and the tracking technique
improves the performance. The number of false positives increases rapidly if there
are mis-recognized objects: at a video frame rate of 30 FPS, a single mis-recognized
object produces 30 false positives per second.
The false positives are not eliminated completely in our proposed method because
of the trade-off between precision and recall. Eliminating more false positives
may cause more false negatives, raising precision but lowering recall, or
vice versa. Reference [27] argues that false-positive green lights are dangerous and
Table 5.4: Number of false positives in traffic-light-free sequences

| Seq. ID | HOG (No. / per frame) | HOG + Tracking (No. / per frame) | PCANet (No. / per frame) | PCANet + Tracking (No. / per frame) |
|---|---|---|---|---|
| 18 | 150 / 0.2381 | 12 / 0.0190 | 39 / 0.0619 | 0 / 0 |
| 19 | 45 / 0.0776 | 35 / 0.0603 | 56 / 0.0966 | 26 / 0.0448 |
| 20 | 11 / 0.0264 | 0 / 0 | 18 / 0.0433 | 12 / 0.0288 |
| 21 | 127 / 0.2309 | 23 / 0.0418 | 37 / 0.0673 | 9 / 0.0164 |
| 22 | 280 / 0.3689 | 125 / 0.1647 | 40 / 0.0527 | 6 / 0.0079 |
| 23 | 179 / 0.0590 | 85 / 0.0280 | 105 / 0.0346 | 80 / 0.0264 |
| Total | 792 / 0.1327 | 280 / 0.0469 | 295 / 0.0494 | 133 / 0.0223 |
should be eliminated as much as possible, yielding 99% precision and 62% recall.
While this argument is reasonable for practical applications, we do not perform
such adjustments in this chapter. Instead, we demonstrate highly accurate and well-balanced
precision and recall results to validate our proposed approach, as well as the
performance improvements brought by the introduction of PCANet and tracking.
5.5 Discussion
5.5.1 Comparison with related work
Table 5.5 compares several recent papers on traffic light detection and recognition.
However, it is difficult to compare them directly, because different testing data and
different evaluation metrics were used. There are benchmarks for object detection
and image classification like ImageNet [91], but no benchmark has yet been created
for multi-class traffic light detection and classification. Researchers use their own
collected data in their respective papers. Some papers [25–27] utilize the information
other than images, such as GPS data and prior knowledge of traffic light locations.
Some focus on a specific type of traffic lights, while others try to solve multiple colors
and types at the same time. These factors make it difficult for us to compare their
performance appropriately.
On the other hand, the efficiency is also hard to compare, since the image sizes in
these papers vary. A higher resolution camera can provide clear images of traffic
lights when they are farther away, whereas a faraway traffic light may appear as only
a few pixels in a lower resolution image. With higher resolution, the system may
detect a traffic light slightly earlier, giving the driver additional time to respond;
however, a larger image size leads to higher computational cost and longer
processing time. Another factor is that different hardware platforms were used in
their implementations, such as desktop computers and on-board systems. Additional
hardware modules may also be involved such as GPS and inertial measurement unit
(IMU) [26].
5.5.2 Limitation and plausibility
This chapter presents a prototype system that can effectively detect several common
types of traffic lights in a vertically aligned setting. We would like to emphasize that
the proposed system is extensible. The ROI selection can be modified for other
types of traffic lights, such as horizontally aligned lights, and the multi-class classifier
can be retrained if sufficient data are provided. We are confident that, with some
modification, the proposed system can be extended to detect all types of traffic lights
and even serve other pattern recognition tasks.
Varying lighting conditions, color distortion, motion blur, and scene variance may
compromise the system performance in the real world. Thus the robustness of the
Table 5.5: Results of several recent works on traffic light detection

| Paper | Year | Method | Light types | Image size | Timing | Performance |
|---|---|---|---|---|---|---|
| Our approach | 2016 | PCANet; multi-object tracking | Green circular; red circular; green arrow; red arrow | 1920×1080 | 3 Hz | Precision 95.7%; recall 95.7% |
| [23] | 2014 | Spotlight detection; adaptive template matching; multiple model filter; single object tracking | Green circular; red circular; amber circular | - | - | Average accuracy 97.6%; false alarms ignored in detection |
| [28] | 2014 | Image processing; hidden Markov models | Green circular; red circular; amber circular | 648×488 | 25 frames per second | Overall detection rate 98.33% and 91.34% in different scenarios |
| [24] | 2014 | Fast radial symmetry transform | Red circular; amber circular | 240×320 | Most time-consuming part ~1.82 s | Precision 84.93%; recall 87.32% |
| [25] | 2013 | Filtering scheme with GPS information | Green circular; red circular | 720×480 | 15.7 ms per frame | Precision 88.2%; recall 81.5% |
| [26] | 2011 | Traffic light mapping and localization using GPS information; several probabilistic stages | Green circular; red circular; amber circular | 1.3 megapixel | Real-time; 15 Hz frame input | Accuracy: 91.7% |
| [27] | 2011 | Traffic light mapping and localization using GPS information; onboard perception system | Green circular; red circular; amber circular; green arrow; red arrow; amber arrow | 2040×1080 | 4 Hz | Precision 99%; recall 62% |
trained model is a key factor in addition to detection accuracy. The robustness of
our trained models can be improved by training with more data collected under all
kinds of conditions using different cameras. Researchers in machine learning often
focus on investigating better algorithms, but sometimes getting more data
beats a clever algorithm [92]. However, detecting traffic lights in severe weather or
at night may require different algorithms or even additional sensors, and little
research has been done on such topics. This will be part of our future work as more
data become available.
The processing time depends on the image size as well as the number of candidates
in an image. The image size in our dataset is 1920×1080, which is considerably larger
than in most of the other papers in Section 5.5.1. Our implementation is currently a
single-threaded version running at approximately 3 Hz on a CPU. It can be
accelerated by using multiple CPU threads, GPUs, or FPGA hardware. Previously,
we successfully employed GPUs to accelerate a traffic sign detection system
in [93] and a fast deep learning system in [94]. The most time-consuming part is the
PCANet classification, which has been accelerated on an FPGA in our latest work [95].
Since the proposed system is based on a camera sensor, its reliability is directly
affected by the video quality. Many factors can affect the output
images, such as the camera sensor, its configuration, and post-processing procedures;
an example of the data quality problem is shown in Fig. 5.10. In addition, the
proposed method is not expected to work at night. Traffic lights at night appear
in different ways depending on the camera and its configuration: there may be a halo
effect around the lights, or the lights may appear white at the center with only thin
colored rings at the edges. A solution for one camera may not be suitable for another.
Therefore, we decided not to investigate the nighttime problem in this work.
5.6 Conclusions
In this chapter, we propose a system that can detect multiple types of green and red
traffic lights accurately and reliably. Color extraction and blob detection are applied
to locate the candidates with proper optimization. A classification and validation
method using PCANet is then used for frame-by-frame detection. A multi-object tracking
method and a forecasting technique are employed to improve accuracy and produce
stable results. As an additional contribution, we build a traffic light dataset from
videos captured by a camera mounted behind the windshield. This dataset has
been released to the public for computer vision and machine learning research and is
available online at http://computing.wpi.edu/Dataset.html.
Chapter 6
Pedestrian Detection for
Autonomous Vehicle Using
Multi-spectral Cameras
Pedestrian detection is a critical feature of autonomous vehicles and advanced driver
assistance systems. This chapter presents a novel instrument for pedestrian detection
that combines stereo vision cameras with a thermal camera. A new dataset for vehicle
applications is built from data recorded by the test vehicle while driving on city roads. Data
received from multiple cameras are aligned using trifocal tensor with pre-calibrated
parameters. Candidates are generated from each image frame using sliding windows
across multiple scales. A reconfigurable detector framework is proposed, in which fea-
ture extraction and classification are two separate stages. The input to the detector
can be the color image, disparity map, thermal data, or any of their combinations.
When applying to convolutional channel features, feature extraction utilizes the first
three convolutional layers of a pre-trained convolutional neural network cascaded with
an AdaBoost classifier. The evaluation results show that it significantly outperforms
the traditional histogram of oriented gradients features. The proposed pedestrian
detector with multi-spectral cameras can achieve 9% log-average miss rate. The ex-
perimental dataset is made available at http://computing.wpi.edu/dataset.html.
6.1 Introduction
Automatic and reliable detection of pedestrians is an important function of an au-
tonomous vehicle or advanced driver assistance system (ADAS). Research works on
pedestrian detection are heavily depended on data, as different data and methods may
yield different evaluation results. The most commonly used sensor in data collection is
a regular color camera, and many datasets have been built such as the INRIA person
dataset [9] and the Caltech Pedestrian Detection Benchmark [10]. Thermal cameras
have also been considered lately, and different methods of pedestrian detection were
developed based on the thermal data [44]. It is worth investigating whether the meth-
ods developed from one type of sensor data are applicable to other types of sensors.
A method may no longer work once the nature of the data has changed; e.g., finding
hot objects by thresholding intensity values in a thermal image is not applicable
to a regular color image. Some methods, such as gradient and shape based feature
extraction, may still be applicable since an object has similar silhouettes in both color
and thermal images. In addition, data from different sensors may contain complementary
information, and combining them may result in better performance. Multiple
cameras can form stereo vision, which provides additional disparity and depth infor-
mation. An example of combining stereo vision color cameras and a thermal camera
for pedestrian detection can be found in [56].
The data collection environment is also very important. Unlike static cameras in
surveillance applications, cameras mounted on a moving vehicle may observe much
more complex backgrounds and pedestrians at varying distances. This calls for
pedestrian detection algorithms different from those used in surveillance camera applications.
To use multiple sensors on a vehicle, a cooperative multi-sensor system needs to be
designed, and new algorithms that can coherently process multi-sensor data need to
be investigated. The contributions of this chapter are listed as follows:
1. A multi-spectral camera instrument is designed and assembled on a moving
vehicle to collect data for pedestrian detection.
2. A new dataset for multi-spectral pedestrian detection is built from on-road
driving data. These data contain many complex scenarios that are challenging
for detection and classification.
3. We propose a machine learning based algorithm for pedestrian detection by
combining stereo vision and thermal images. Evaluation results show satisfac-
tory performance.
The rest of the chapter is organized as follows. Section 6.2 describes our instrumental
setup for data collection. In Section 6.3, we propose a framework that combines stereo
vision color cameras and a thermal camera for pedestrian detection using different
feature extraction methods and classifiers. Performance evaluations are presented in
Section 6.4, followed by further discussion in Section 6.5 and conclusions in Section
6.6.
6.2 Data Collection and Experimental Setup
6.2.1 Data Collection Equipment
To collect on-road data for pedestrian detection, we design and assemble a custom
test equipment rig. This design enables the data collection system to be mobile on
the test vehicle while maintaining calibration between data collection runs. The
completed system can be seen in Figure 6.1.
The ZED stereo vision camera from Stereolabs is chosen to provide color
images as well as disparity information. The ZED camera captures high resolution
side-by-side video containing synchronized left and right video streams, and can
create a disparity map of the environment in real time using the graphics processing
unit (GPU) of the host computer. Furthermore, an easy-to-use SDK is provided,
which allows for camera control and output configuration. In addition, the on-board
cameras are pre-calibrated and come with known intrinsic parameters, which makes
image rectification and disparity map generation easier.
The thermal camera is a FLIR Vue Pro, a long-wavelength infrared
(LWIR) camera. It is an uncooled vanadium-oxide microbolometer
with a 640 × 512 resolution at a full 30 Hz, paired with a 13 mm germanium
lens providing a 45° × 35° field of view (FOV). This IR camera has a wide −20 °C to
+50 °C operating range, which allows for rugged outdoor use. The thermal camera also
provides Bluetooth wireless control and video recording via its on-board microSD
card, as well as an analog video output.
Both the stereo vision and thermal cameras must remain fixed relative to each other
for consistency of data collection. A threaded rod is custom cut to length and each
end is threaded into the respective camera's tripod mounting hole. This provides
Figure 6.1: Instrumentation setup with both thermal and stereo cameras mounted on the roof of a vehicle.
a rigid connection between the color and thermal cameras. An electrical junction
box is utilized as an appropriately sized, waterproof enclosure that provides high impact
resistance. The top lid is replaced with an impact-resistant clear acrylic sheet so
that the stereo vision cameras can be situated safely behind it. A circular hole is
cut into the top lid so the thermal camera lens can fit through, mounted via the
lens barrel. This is essential, as even clear acrylic would block most, if not all, of the
IR spectrum used by the thermal camera.
The mounting system is designed, modeled, and built utilizing aluminum extru-
sions. The entire structure is completely portable and can be mounted to any vehicle
with a ski rack. The aluminum extrusions can sit between the front and back ski
rack hold-downs. Cable management is also crucial in our design, as
long cables are needed for communication between the laptop inside the vehicle and
the cameras on the roof. To avoid interference and safety issues, the cables must run
down the back of the vehicle, through the trunk, and into the vehicle cabin, which
requires approximately 20 feet of cable. This creates an issue for the ZED stereo vision
camera, as it operates on the high-speed USB 3.0 protocol, which allows a maximum
length of only 10 feet due to signal degradation and loss. To resolve this issue, an active
USB extension cable is used. The four cables terminating at the camera
setup are wrapped together with braided cable sleeves to prevent tangling and ensure
robustness.
An analog frame grabber is employed to capture the real-time analog output of
the IR camera instead of recording directly to the on-board microSD card. This
ensures proper synchronization between the thermal camera and the stereo vision cameras.
With the analog frame grabber, we are able to capture precisely at 30 FPS. AVI files are
generated using software provided with the frame grabber, and these AVI files are
then converted into image sequences.
6.2.2 Data Collection and Experimental Setup
Our dataset is made available online at http://computing.wpi.edu/dataset.html.
The data are collected while driving on city roads. Highway driving data are not
collected since pedestrians are hardly seen on highways. A total number of 58 data
sequences are extracted from approximately three hours of driving on city roads
across multiple days and lighting conditions. There are 4330 frames in total, in which
a person or multiple people are in clear view and un-occluded, similar to the Caltech-
USA reasonable set [36]. However, unlike the Caltech-USA reasonable set, we do not
discard small samples. In fact, more than half of the pedestrian samples in our dataset
are no more than 50 pixels in height due to image resolution and their distances to
cameras, which make our dataset more challenging. Each frame contains the stereo
color images, thermal image and disparity map. Since cameras have different angle
of view and field of view, the 58 usable sequences are rather short, ensuring the
pedestrians are within the view of all cameras. Furthermore, video frames without
any pedestrians are not included in our dataset.
6.3 Proposed Method
6.3.1 Overview
Figure 6.2 shows the flowchart of our proposed pedestrian detection method. Dispar-
ity data are generated from stereo color data. Thermal data are obtained from the
thermal cameras and reconstructed according to the point registration using trifocal
tensor. Instead of concatenating the features of different data sources and training a
single classifier, feature extraction and classification are performed independently for
each data source before the decision fusion stage. The decision fusion stage uses the
confidence scores of the classifiers, along with some additional constraints to make
the final decision. The proposed detector system can be reconfigured using differ-
ent feature extraction and classification methods, such as HOG with SVM or CCF
with AdaBoost. The decision fusion stage can utilize information from one or mul-
tiple classifiers. The performance of different configurations can be evaluated and
compared.
6.3.2 Trifocal tensor
The three cameras have different angles of view and fields of view, making point
registration (pixel-level alignment) essential for windowed detection across
multi-spectral images. A simple overlay with fixed pixel offsets does not work because
every object has its own offset values depending on its distance to the camera. Therefore,
the trifocal tensor [56, 96] is used for pixel-level alignment over the color and thermal
images. The trifocal tensor $\mathcal{T}$ is a set of three $3 \times 3$ matrices, denoted
$\{\mathbf{T}_1, \mathbf{T}_2, \mathbf{T}_3\}$ in matrix notation, or $T_i^{jk}$ in tensor notation [96], with two contravariant
and one covariant indices. The idea of the trifocal tensor is that, given a point
correspondence across the three views $\mathbf{x} \leftrightarrow \mathbf{x}' \leftrightarrow \mathbf{x}''$, there is a relation
$$[\mathbf{x}']_{\times} \left( \sum_i x^i \mathbf{T}_i \right) [\mathbf{x}'']_{\times} = 0_{3 \times 3}. \quad (6.1)$$
One method to compute the trifocal tensor T is by using the normalized linear
Figure 6.2: Framework of the proposed pedestrian detection method.
algorithm. Given a point–point–point correspondence $\mathbf{x} \leftrightarrow \mathbf{x}' \leftrightarrow \mathbf{x}''$, there is a relation

$$x^i x'^j x''^k \varepsilon_{jqs} \varepsilon_{krt} T_i^{qr} = 0_{st}$$

where 4 out of the 9 equations are linearly independent over all choices of $s$ and $t$. Therefore,
at least 7 point–point–point correspondences are needed to compute the 27 elements
of the trifocal tensor. The trifocal tensor can be computed from a set of equations
of the form $A\mathbf{t} = 0$, using the algorithm for the least-squares solution of a homogeneous
system of linear equations.
Given a correct correspondence $\mathbf{x} \leftrightarrow \mathbf{x}'$, it is possible to determine the corresponding
point $\mathbf{x}''$ in the third view without reference to image content. It can be
written as $x''^k = x^i l'_j T_i^{jk}$ and obtained using the trifocal tensor and the fundamental
matrix $F_{21}$, where the line $\mathbf{l}'$ goes through $\mathbf{x}'$ and is perpendicular to $\mathbf{l}'_e = F_{21}\mathbf{x}$.
Both the trifocal tensor and the fundamental matrix $F_{21}$ can be pre-computed, and only
need to be computed once as long as the placement of the cameras remains unchanged.
An alternative method is epipolar transfer, $\mathbf{x}'' = (F_{31}\mathbf{x}) \times (F_{32}\mathbf{x}')$; however,
this method has the serious problem that it fails for all points lying on the trifocal
plane. Therefore, the trifocal tensor is a practical solution for point registration. In our
experiment, the cameras are calibrated using a checkerboard. The pattern is made of
different materials, making it visible to both the color and thermal cameras. Figure 6.3
shows the use of the trifocal tensor in aligning color and thermal images.
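The point-transfer step can be sketched in NumPy as follows, assuming the tensor is stored as a 3×3×3 array with T[i] holding $\mathbf{T}_i$; the perpendicular-line construction follows the standard formulation in [96], and all names are illustrative.

```python
import numpy as np

def transfer_point(x1, x2, T, F21):
    """Transfer a correspondence (x1 <-> x2, given as (u, v) pixels) into the
    third view using the trifocal tensor T (3x3x3, T[i][j][k] = T_i^{jk})
    and the fundamental matrix F21 between views 1 and 2."""
    x1h = np.array([x1[0], x1[1], 1.0])
    # Epipolar line of x1 in the second view: l'_e = F21 x
    le = F21 @ x1h
    # Line l' through x2 perpendicular to l'_e
    lp = np.array([le[1], -le[0], -x2[0] * le[1] + x2[1] * le[0]])
    # x''^k = x^i l'_j T_i^{jk}
    x3h = np.einsum('i,j,ijk->k', x1h, lp, T)
    return x3h[:2] / x3h[2]  # back to inhomogeneous coordinates
```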
6.3.3 Sliding windows vs. region of interest
There are two main methods for locating a pedestrian: sliding window detection and
Region of Interest (ROI) extraction. In sliding window detection, a small
(a) Color image. (b) Thermal image.
(c) Reconstructed thermal image using trifocal tensor and disparity information.
(d) Red-cyan anaglyph of color and reconstructed thermal images.
Figure 6.3: Proper alignment of color and thermal images using trifocal tensor.
sliding window is applied over the entire image, often at different scales, to perform
an exhaustive search. Each window is classified, followed by some post-processing such
as bounding box grouping. ROI extraction first finds potential candidates
using pre-processing techniques based on, e.g., color or pixel intensity, and then
filters out negatives from these candidates using a classifier or other constraints. It is
often more efficient, as the number of candidates is much smaller than the number of
sliding windows.
For pedestrian detection, both ROI extraction and sliding window detection have
been employed in the literature. The sliding window detection method is a universal
approach but is computationally expensive. On the other hand, ROI extraction is
often used for thermal images, because pedestrians are often hotter than the surrounding
environment, and the ROIs are segmented based on pixel intensity values.
However, we find that ROI extraction on thermal images does not always work
well. The assumption that pedestrians are hotter is not always true, for various
reasons. For instance, a pedestrian wearing heavy layers of clothing does not appear
with distinctively high pixel intensity values in a thermal image, and thus cannot
be located by simple morphological operations. As another example,
a road surface exposed to intense sunlight can have a higher temperature
than a human body. Although false positives introduced by hot objects such
as vehicle engines can be filtered out in later steps, the loss of true positives becomes a
serious problem. As a result, we feel the sliding window detection method is more
reliable in these complex scenarios. The classifier can analyze the windowed
samples thoroughly and make an accurate decision. Figure 6.4 shows some examples
of our pedestrian samples in color images and corresponding thermal images, where
rows 1 and 3 are color samples corresponding to the thermal samples in rows 2 and 4,
Figure 6.4: Examples of pedestrians in color and thermal images.
respectively.
However, the sliding window detection method also has its drawbacks besides
the much higher computational cost. The total number of windows in an image often
reaches $10^5$ or more, so even a fair classifier with a False Positives Per Window (FPPW)
rate of $10^{-4}$ would still result in 10 False Positives Per Image (FPPI). Since 2009, the evaluation
metric has been changed from FPPW to FPPI [38]. To address this problem, many state-of-the-art
CNN-based classifiers have been proposed in recent years. An alternative
approach is to combine information from additional sensors. Our proposed approach
of multi-spectral cameras is along this line.
6.3.4 Detection
In this chapter, we only compare the HOG and CCF methods for the task of pedestrian
detection, for the following reasons:
1. The HOG method has always been included as a baseline in the Caltech-USA dataset.
Among the 44 methods reported on the Caltech-USA dataset [38], 30 of them
employed HOG or HOG-like features.

2. CCF is one of the best-performing methods reported on the Caltech-USA
dataset as of May 2016. The idea of combining low-level CNN features with
a boosting forest model is promising.

3. The goal of this chapter is to investigate the combination of multi-spectral
cameras and its improvement on pedestrian detection. We release our dataset publicly,
so other researchers can continue this study and discover better solutions
in the future.
The HOG features have been widely used in object detection. A windowed sample is
divided into overlapping blocks, with cells within each block. The histograms of unsigned
gradients over several directions are computed in all blocks and concatenated
as features. The HOG features are often combined with an SVM and the sliding
window method for detection at different scale levels.
At the training stage, the positive samples are manually labeled. The initial
negative samples are randomly selected from the training images as long as they do not
overlap with the positive samples. All samples are scaled to a standard window size
of 20 × 40 for training; the smallest sample in our data is 11 × 22. After
the initial training, the detector is tested on the training set and the false positives
are added back to the negative sample set. These false positives are often called hard
negatives, and this procedure is known as hard negative mining. It
can be repeated a few times until the performance improvement becomes marginal.
Once the detector is trained, it is ready to perform detection on the test dataset
and give a decision score for each window. Each frame with original size of 640× 480
is scaled into different sizes. The detector with a fixed size of 20× 40 is then applied
to the scaled images to find pedestrians of various sizes at different locations in a
frame.
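A minimal multi-scale sliding-window sketch is shown below, assuming scikit-image and a scikit-learn LinearSVC trained as described above; the scale set, stride, and score threshold are illustrative assumptions rather than the exact values used in our experiments.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog
from skimage.transform import rescale

WIN_W, WIN_H = 20, 40  # the standard window size used in the text

def hog_feat(window):
    # 5x10 cells of 4x4 px, 2x2-cell blocks, 9 orientations (assumed layout)
    return hog(window, orientations=9, pixels_per_cell=(4, 4),
               cells_per_block=(2, 2))

def detect(frame_rgb, svm, scales=(1.0, 0.75, 0.5), stride=4, thresh=0.0):
    """Slide a fixed 20x40 detector over rescaled copies of the frame."""
    gray = rgb2gray(frame_rgb)
    hits = []
    for s in scales:
        img = rescale(gray, s)
        for y in range(0, img.shape[0] - WIN_H + 1, stride):
            for x in range(0, img.shape[1] - WIN_W + 1, stride):
                feat = hog_feat(img[y:y + WIN_H, x:x + WIN_W])
                score = svm.decision_function([feat])[0]
                if score > thresh:
                    # Map the hit back to original-image coordinates
                    hits.append((int(x / s), int(y / s),
                                 int(WIN_W / s), int(WIN_H / s), score))
    return hits  # bounding-box grouping would follow
```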
CCF uses low-level features from a pre-trained CNN model, cascaded with a
boosting forest model such as Real AdaBoost [97] as the classifier. The lower-level
CNN features are considered generic object descriptors that contain richer
information than channel features, while the boosting forest model replaces
the remaining parts of the CNN. Thus, we avoid training a complete end-to-end CNN
model for a specific object detection application, which would require substantial
computation, storage, and time. In our experiment, we apply settings similar to those
described in [43], except for the scale parameters and the number of octaves, in
order to detect faraway pedestrians as small as 20×40 pixels. The conv3-3
layer of the VGG-16 model is used for feature extraction. The windowed sample size
in [43] is 128×64 instead of 20×40; the feature dimension of our 20×40 sample is 1296. The
training samples for CCF come from the training stage of HOG, similar to the method
described in [43], which uses aggregated channel features (ACF) [98] to select training
samples for CCF. Caffe [99] is used for CCF feature extraction on a GPU-based
computer platform. At the test stage, the CCF method running on the GPU platform is
considerably faster than the HOG method, but it requires more memory and disk
space for data storage.
6.3.5 Information fusion
The idea of combining the information from the color image, disparity map, and thermal
data for decision making is referred to as information fusion. One approach is to
concatenate the features together [56]. A single classifier can be trained on the concatenated
features, and the final decisions for the test instances can be obtained from
that classifier. This approach has the disadvantage that classifier training becomes
challenging as the feature dimension increases. Furthermore, if a new type of
feature needs to be added or an existing feature needs to be removed, the classifier
needs to be retrained, which is time consuming.
An alternative approach to information fusion is to employ multiple classifiers;
an example can be found in [100]. Each classifier makes a decision based on a certain
type or subset of features, and the final result is obtained by a decision fusion
technique such as majority voting or the sum rule [101]. This approach has the advantage
that the structure of the system is reconfigurable: without retraining the classifiers,
adding or removing different types of features becomes very convenient. Therefore,
we choose the latter approach to make our system reconfigurable, so that various
settings and methods can be evaluated. Specifically, an SVM is used at the decision fusion
stage, and its inputs are the confidence scores from the classifiers in the previous stage, which
is more appropriate than commonly used statistical decision fusion methods in the
case of multi-source data [102, 103]. The data from different sources are often not
equally reliable, and neither are the classifiers; the confidence scores must therefore be
weighted when obtaining the final decision from information fusion.
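The fusion stage then reduces to training a second-stage SVM on stacked confidence scores. The sketch below is one possible configuration with three sources; scikit-learn is an assumption, as the dissertation does not name a library.

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_fusion(scores_color, scores_disp, scores_thermal, labels):
    """Train the decision-fusion SVM on per-source classifier scores.
    Each argument is a 1-D array with one score per training candidate."""
    X = np.column_stack([scores_color, scores_disp, scores_thermal])
    return LinearSVC(C=1.0).fit(X, labels)

def fuse(fusion, s_color, s_disp, s_thermal):
    """Weighted combination of the three scores; positive => pedestrian."""
    x = np.array([[s_color, s_disp, s_thermal]])
    return fusion.decision_function(x)[0]
```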
6.3.6 Additional constraints
6.3.6.1 Disparity-size
Besides the extracted features from an image frame, additional constraints can be
incorporated into the decision fusion stage to further improve the detector perfor-
mance. An example is the disparity-size relationship. Figure 6.5 shows the disparity
and height relationship of the positive samples in the form of a linear regression line
$d = \begin{bmatrix} h & 1 \end{bmatrix} B$, where $d$ is the mean disparity, $h$ is the height of the sample, and $B$ is
a $2 \times 1$ coefficient matrix. Given the mean disparity $d$ and height $h$ of a sample,
the residual $r = \left| d - \begin{bmatrix} h & 1 \end{bmatrix} B \right|$ can be used to estimate whether the sample is
possibly a pedestrian or not.
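Both the fit and the residual test are straightforward least-squares operations; the sketch below assumes NumPy, and the residual tolerance r_max is an assumed example value.

```python
import numpy as np

def fit_disparity_size(heights, mean_disparities):
    """Least-squares fit of d = [h 1] B over the positive samples."""
    A = np.column_stack([heights, np.ones(len(heights))])
    B, *_ = np.linalg.lstsq(A, mean_disparities, rcond=None)
    return B  # 2x1 coefficient vector

def plausible(d, h, B, r_max=2.0):
    """Accept a candidate whose disparity-height residual is small enough."""
    r = abs(d - (h * B[0] + B[1]))
    return r <= r_max
```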
From Figure 6.5, we can see that a number of samples have very small mean disparity
and lie far below the regression line. This is because the disparity information is not
accurate when an object is far away from the camera. In fact, the stereo vision camera
we use automatically clamps the disparity value at a certain distance. Objects beyond
that distance yield zero disparity, which makes the estimation for small samples
inaccurate.
6.3.6.2 Road horizon
During detection, a few reasonable assumptions can be made to filter out more false
positives while retaining the true positives. The assumptions vary depending on
the application and may involve color, shape, position, etc. One assumption here is that
pedestrians stand on the road, i.e., the lower bound of a pedestrian must be below the
road horizon, which can be automatically detected in an image. Such a
simple constraint may or may not improve the detector performance, and
Figure 6.5: The relationship between the mean disparity and the height of an object.
experiments should be carried out to determine its effectiveness.
6.4 Performance Evaluation
There are a total of 58 labeled video sequences in our dataset. We use 39 of them for
training and the remaining 19 for testing. Figure 6.6 shows the performance of different
settings, including disparity map, color image, thermal data, and their combinations,
all based on HOG features. Generally, the more types of information are used, the
better the performance. The disparity-only setup performs the worst. Color
image only is better, followed by the combination of color and disparity. Note
that the thermal-only setup outperforms the combination of color and disparity; the
heat signature of pedestrians seems more recognizable in thermal images. The combination
of color, thermal, and disparity information achieves the best performance,
with about 36% log-average miss rate (MR).
Figure 6.7 shows the performance of the HOG features augmented with the disparity-size
information and the road horizon constraint. The road horizon constraint improves the log-average
MR by about 5%. Although adding the disparity-size information alone provides
little improvement, the combination of both provides nearly 7% improvement in
log-average MR.
Figure 6.8 shows the performance of different settings using CCF. The performance
of disparity only is the worst, while the thermal image performs very well. Interestingly,
disparity does not provide any improvement when combined
with color or thermal; in fact, combining with disparity lowers the performance.
This is because the CCF implementation accepts 8-bit images as input, so
precision of the disparity values is lost. In comparison, CCF outperforms HOG
Figure 6.6: Performance of different input data combinations, all using HOG features.
Figure 6.7: Performance improvement by adding disparity-size and road horizon constraints.
Figure 6.8: Performance of different input data combinations, all using CCF.
in almost all settings except for disparity. The best performance comes from CCF
with the combination of color and thermal, which achieves 9% log-average MR. We
also attempted to add the disparity-size information and road horizon constraint
to the CCF method, but the performance changes are negligible.
6.5 Discussion
Although the combination of multi-spectral cameras can improve the performance in
pedestrian detection, the performance is still highly dependent on the instrument.
Our thermal camera has a resolution of 640 × 480, which is relatively low.
Figure 6.9: A pedestrian is embedded in the shadow of a color image.
To accommodate the resolution and FOV of the thermal camera, the color cameras have to be
set to the same resolution. In addition, color cameras are sensitive to lighting
conditions, so the image quality sometimes cannot be guaranteed. Figure
6.9 shows an example, with bounding boxes drawn on the detected pedestrian in both
the color and thermal images. The thermal image obviously provides much better
information about the presence of the pedestrian, who is hardly identifiable in the
color image due to the shadow.
Although thermal images seem dominant in our experiment, their reliability
still needs improvement. Figure 6.10 shows a thermal image taken on a hot sunny
day. The two circled pedestrians are not brighter than their surroundings,
contradicting the assumption of distinct thermal intensity made in many existing
research works. In this case, methods or operations based on pixel intensity values,
such as intensity thresholding or head recognition using hot spots, become unreliable.
On the contrary, some shape or gradient based methods, such as the HOG and CCF
described in this chapter, may still perform well.
Figure 6.10: An example thermal image with two pedestrians.
6.6 Conclusions
In this chapter, a novel pedestrian detection instrument is designed using both
thermal and RGB-D stereo cameras. Data are collected from on-road driving, and an
experimental dataset is built with pedestrians labeled as ground truth. A reconfigurable
multi-stage detector framework is proposed. Both HOG and CCF based detection
methods are evaluated using the multi-spectral dataset with various combinations of
thermal, color, and disparity information. The experimental results show that CCF
significantly outperforms the HOG features. The combination of color and thermal
images using the CCF method yields the best performance of 9% log-average miss rate.
For future work, other advanced feature extraction and classification methods will
be considered to further improve the pedestrian detector performance.
Chapter 7
End-to-End Learning for Lane
Keeping of Self-Driving Cars
Lane keeping is an important feature for self-driving cars. This chapter presents an
end-to-end learning approach to obtain the proper steering angle to maintain the
car in the lane. The convolutional neural network (CNN) model takes raw image
frames as input and outputs the steering angles accordingly. The model is trained
and evaluated using the comma.ai dataset, which contains the front view image frames
and the steering angle data captured when driving on the road. Unlike the traditional
approach that manually decomposes the autonomous driving problem into technical
components such as lane detection, path planning and steering control, the end-to-end
model can directly steer the vehicle from the front view camera data after training.
It learns how to keep in lane from human driving data. Further discussion of this
end-to-end approach and its limitations is also provided.
7.1 Introduction
Lane keeping is a fundamental feature for self-driving cars. Despite the many sensors
installed on autonomous cars, such as radar, LiDAR, ultrasonic sensors, and infrared
cameras, ordinary color cameras are still very important for their low cost and
ability to capture rich information. Given an image captured by a camera, one of the
most important tasks for a self-driving car is to find the proper vehicle control input
to maintain it in lane. The traditional approach divides the task into several parts,
such as lane detection [104, 105], path planning [106, 107], and control logic [108, 109],
and these parts are often researched separately. The lane markings are usually detected by
image processing techniques such as color enhancement, Hough transform, edge
detection, etc. Path planning and control logic are then performed based on the lane
markings detected in the first stage. The performance of this approach relies heavily
on the feature extraction and interpretation of the image data. Often, the manually
defined features and rules are not optimal, and errors can accumulate from one
processing stage to the next, leaving the final result inaccurate. On the other
hand, an end-to-end learning approach for self-driving cars has been demonstrated
in [70] using convolutional neural networks (CNNs). End-to-end learning takes
the raw image as input and outputs the control signal automatically. The model
is self-optimized based on the training data, and there are no manually defined rules.
These are the two major advantages of end-to-end learning: better performance
and less manual effort. Because the model is self-optimized on the data to
give maximum overall performance, the intermediate parameters are self-adjusted to
be optimal. Moreover, there is no need to detect and recognize certain categories of
pre-defined objects, to label those objects during training, or to design control logic
Figure 7.1: Comparison between the traditional approach and end-to-end learning.
based on observations of these objects. As a result, less manual effort is required.
Figure 7.1 compares the traditional approach with the end-to-end learning approach.
This chapter presents the end-to-end learning approach to produce the proper
steering angle from camera image data aimed at maintaining the self-driving car in
lane. The model is trained and evaluated using comma.ai dataset, which contains
image frames and the steering angle data captured when driving. The rest of the
chapter is organized as follows. Section 7.2 provides the details of our implementa-
tion, including data pre-processing and CNN architecture. The evaluation results are
presented in Section 7.3, followed by discussions in Section 7.4 and conclusions in
Section 7.5.
7.2 Implementation Details
7.2.1 Data pre-processing
The data used in this chapter are from the comma.ai driving dataset. The dataset
contains 7.25 hours of driving data, including 11 video clips recorded at 20 Hz and some
other measurements such as steering angle, speed, GPS data, etc. The image frames
Figure 7.2: An example of image frame from the dataset.
are of size 320 × 160 pixels and are cropped from the original video frames, which are
not provided by the dataset. An example frame from the dataset is
shown in Figure 7.2. For lane keeping, only the image frames and the steering angle
data are used. The steering angle data are recorded at 100 Hz and are aligned
with the image frames using the alignment stamps provided by the dataset. In case
multiple steering angle instances correspond to the same image frame, their
average is used to form a one-to-one mapping between each image frame and its
corresponding steering angle.
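The averaging step can be sketched as follows, where the dataset's alignment stamps are simplified to an index array `cam_ptr` that maps each 100 Hz steering sample to its frame — an assumption about the data layout made purely for illustration.

```python
import numpy as np

def align_angles(cam_ptr, angles, n_frames):
    """One-to-one mapping between frames and steering angles by averaging
    all 100 Hz samples assigned to the same frame. cam_ptr[k] is the frame
    index of the k-th steering sample (simplified layout)."""
    sums = np.zeros(n_frames)
    counts = np.zeros(n_frames)
    np.add.at(sums, cam_ptr, angles)
    np.add.at(counts, cam_ptr, 1)
    counts[counts == 0] = np.nan  # frames with no sample stay undefined
    return sums / counts
```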
Before training the CNN model, the data need to be further processed. First of
all, to simplify the problem, driving at night is not considered in this chapter, so
all four clips recorded at night are excluded. Second, the data contain many
scenarios, such as driving forward, changing lanes, making turns, driving on straight or
curved roads, and driving at normal speed or moving slowly in a traffic jam. To train
a lane keeping model, the data that meet the following criteria are selected: driving
at normal speed, no lane changes or turns, and both straight and curved roads. After
data selection, the remaining data come from 7 video clips with a total of about 2.5
hours. Finally, five video clips containing 152K frames are used for training and two
video clips containing 25K frames are used for testing.
During the training stage, one important issue needs to be addressed: the training
data are highly unbalanced, as shown in Figure 7.3. Since highway roads
tend to be mostly straight and curved roads make up only a small percentage,
a model trained on these unbalanced data may tend to drive straight while
still achieving a low loss. To remove this bias, the curved-road data, defined as
frames where the absolute steering angle is larger than five degrees, are up-sampled
by a factor of five. The data are then randomly shuffled before training.
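A sketch of this balancing step (NumPy arrays assumed; the factor of five and the five-degree threshold come from the text, while the seed is arbitrary):

```python
import numpy as np

def balance(frames, angles, factor=5, deg_thresh=5.0, seed=0):
    """Up-sample curved-road frames (|angle| > 5 deg) by `factor`, then
    shuffle. Repeating each curved index (factor - 1) extra times yields
    factor copies in total."""
    curved = np.abs(angles) > deg_thresh
    idx = np.concatenate([np.arange(len(angles))] +
                         [np.where(curved)[0]] * (factor - 1))
    rng = np.random.default_rng(seed)
    rng.shuffle(idx)
    return frames[idx], angles[idx]
```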
7.2.2 CNN implementation details
The CNN architecture that we propose is shown in Figure 7.4; it is similar to
those in [70] and [110] but much simpler. The loss layer used during training is
the Euclidean loss, which computes the sum of squared differences between the predicted
and ground truth steering angles: $\frac{1}{2N}\sum_{i=1}^{N} \left\| x_i^1 - x_i^2 \right\|_2^2$. The CNN model
is trained using Caffe [99].
The CNN model consists of three convolutional layers and two fully connected
layers. The input is a raw RGB image, and the output is the predicted steering
angle for that image. The first convolutional layer uses a 9×9 kernel with a 4×4
stride, and the following two convolutional layers use a 5×5 kernel with a 2×2 stride. The
convolutional layers serve mainly for feature extraction and the fully connected layers
mainly for steering angle prediction, but there is no clear boundary between
them since the model is trained end-to-end. Dropout layers are used to prevent
Figure 7.3: Histogram of steering angles in training data.
Figure 7.4: The proposed CNN architecture for deep learning.
over-fitting. There are no pooling layers because the feature maps are small.
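For reference, the described architecture can be sketched in PyTorch as below. The kernel sizes and strides follow the text, while the channel widths, the fully connected width, and the dropout rate are assumptions, since the dissertation trains the equivalent model in Caffe.

```python
import torch
import torch.nn as nn

class LaneKeepNet(nn.Module):
    """Three convolutional layers + two fully connected layers (a sketch)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=9, stride=4), nn.ReLU(),   # 9x9 kernel, 4x4 stride
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),  # 5x5 kernel, 2x2 stride
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),  # 5x5 kernel, 2x2 stride
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),
            nn.LazyLinear(512), nn.ReLU(),  # first fully connected layer
            nn.Dropout(0.5),
            nn.Linear(512, 1),              # predicted steering angle
        )

    def forward(self, x):  # x: (batch, 3, 160, 320) raw RGB frames
        return self.head(self.features(x))

def euclidean_loss(pred, target):
    """Euclidean loss from the text: (1/2N) * sum of squared differences."""
    return 0.5 * ((pred - target) ** 2).mean()
```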
The CNN architecture, as well as the hyper-parameters used, can be further tuned
through more experiments. Overall, the CNN architecture is not the major concern
of this work, for two reasons. First, we feel the dataset is too small.
Although the training and testing data contain more than 170K frames, equal to
about 2.5 hours of driving, this is actually insufficient to train a generic lane keeping model
that uses raw images as input. The appearance of roads can be very complex due
to different curves, road markings, lighting conditions, etc. In fact, the proportion
of data for curved roads is relatively small, with only about 20 minutes of driving.
For training a model that outputs a continuous steering angle, this
amount of data is not sufficient. The other reason is that tuning a model requires
a proper evaluation metric, which is also limited by the current dataset. The details of
the evaluation method will be discussed in Section 7.4.
7.3 Evaluation
The trained model is evaluated using two test video clips containing 25K frames. For
each frame, the predicted steering angle is compared with the ground truth value. The
histogram of the error is shown in Figure 7.5. The standard deviation of the error is
3.26 degrees and the mean absolute error is 2.42 degrees. To better understand
the errors, the predicted angle and ground truth angle are compared in each frame
and the results can be visualized.
Figure 7.6 shows an example frame along with the ground truth angle and pre-
dicted angle. The projected paths for both angles are plotted using the same approx-
imation as in [110]. The path using ground truth angle is in blue and the path using
Figure 7.5: Histogram of error of predicted steering angles during test.
Figure 7.6: An example frame with the ground truth angle, predicted angle, and their respective projected paths.
predicted angle is in green. The simulated steering wheels for both angles are also
drawn for better visualization.
Figure 7.7 also visualizes the feature maps from the first two convolutional layers.
The top-right 4 × 4 cells are results from the first convolutional layer, and the
bottom 4 × 8 cells are results from the second convolutional layer. As expected, the
convolutional layers automatically learned to extract the lane markings as features
during training. The model does not use any manually defined or hand-crafted
features, since it learns useful features from the data automatically.
Figure 7.7: Visualization of the results from first two convolutional layers.
7.4 Discussion
7.4.1 Evaluation
As an evaluation metric, computing the difference between the ground truth angle and the
predicted angle is actually questionable. Firstly, the ground truth provided by the human
driver is not globally optimal: the human driver cannot keep the vehicle in the
center of the lane all the time. As long as the vehicle stays in lane, the predicted
angles are fine and do not have to be exactly the same as the human driver's. Secondly,
both the vehicle movement and the steering control are continuous, so frame-by-frame
evaluation is not appropriate. Consider two scenarios on a straight road.
In the first scenario, the steering angle turns to the left a bit, then quickly
turns to the right a bit to keep the vehicle in the lane, and this process
repeats. In the second scenario, the steering angle turns to the left a bit and stays at
119
that angle for a period of time, then it turns to the right a bit and stays for a while. In
the second scenario, the vehicle actually would drive out of the lane most of the time.
In these two scenarios, the histogram of the errors, mean absolute error, standard de-
viation of the error are the same. However, the first scenario is fine while the second
one is completely unacceptable. Figure 7.8 shows an example of the disadvantage of
this type of frame by frame evaluation. The frames and their predicted angles are
from the test dataset. These 5 frames are put in chronological order. We can see that
the middle frame has a huge error of 10 degrees. However, the recorded ground truth
does not seem correct in this frame. By looking at the previous and following frames,
we find out that the ground truth in this frame is transitioning from left to right.
This example shows that evaluating the error frame by frame is not appropriate.
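To make the argument concrete, the following is a toy numerical sketch (not from the dissertation) of the two scenarios: both error sequences have exactly the same histogram, mean, and standard deviation, yet integrating the heading error over time yields very different lateral drift.

import numpy as np

# Two steering-error sequences (degrees) with identical statistics.
dt, v = 0.1, 20.0                      # assumed: 10 Hz frames, 20 m/s speed
err_a = np.tile([+2.0, -2.0], 50)      # scenario 1: alternates every frame
err_b = np.repeat([+2.0, -2.0], 50)    # scenario 2: holds each error for 5 s

for name, err in [("oscillating", err_a), ("holding", err_b)]:
    # Approximate lateral drift by integrating v * sin(heading error).
    drift = np.cumsum(v * dt * np.sin(np.deg2rad(err)))
    print(name, err.mean(), err.std(), np.abs(drift).max())
# Both report mean 0.0 and std 2.0, but the holding scenario drifts about
# 3.5 m out of the lane while the oscillating one stays within 0.07 m.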
To solve this problem, a simulator is needed to provide feedback based on the predicted angle. The simulator should be able to generate the frames and simulate the vehicle movement realistically. The frames should be generated according to the vehicle position and orientation. One way to do so is using a virtual game engine, as described in [11, 111]. The advantage of using a virtual engine is that there are built-in physics simulation and 3D rendering mechanisms. The vehicle movement simulation and frame generation can be done realistically. Besides, the ground truth information is very rich in the virtual world. Information such as vehicle position, orientation and velocity can be easily obtained, as can information about other objects. The disadvantage is that the frames are computer-generated graphics, not real images captured from driving in the real world. Although they look very realistic with state-of-the-art game engines, the details and variations they provide still cannot match data from the real world.
Figure 7.8: An example of the disadvantage of frame by frame evaluation with 5 consecutive frames: the error in the middle frame is false.

Alternatively, we can generate the next frames according to control inputs using
recorded frames, i.e., data captured in the real world. This can be achieved by either a learning approach [112] or a 3D image projection approach [70]. The learning approach trains auto-encoders to embed road frames, and learns a transition model in the embedded space. The next few frames can be generated based on the current frame image and the current control inputs. On the other hand, the 3D image projection approach assumes the ground is a flat surface, and solves the 3D geometry [113] to generate the next frame based on the actual recorded frame, through the predicted camera shift and rotation. The camera shift and rotation can be obtained from vehicle movement simulation, which can be computed using vehicle kinematic or dynamic models [108, 109].
7.4.2 Data augmentation
Since we are not supposed to drive off the lane when recording, the data obtained from human driving lack an error correction process. The human driver is able to maintain the vehicle within the lane, but a model trained on such data is not robust to errors and the vehicle may slowly drift away. To train a model that can correct small errors such as vehicle shifts and rotations, error correction data must be provided during training. One solution is to perform data augmentation by randomly creating some shifts and rotations, which generates corresponding frames based on the 3D geometry described above. The corrective control input can be computed again using the vehicle kinematic or dynamic models. The comma.ai dataset does not contain the original sized frames or camera calibration parameters. Therefore the simulator and data augmentation are not included in this chapter. Our current work collects real-world data using multiple cameras. All the aforementioned techniques will be incorporated in future work.
7.5 Conclusions
This chapter presents the end-to-end learning approach to lane keeping for self-driving
cars that can automatically produce proper steering angles from image frames cap-
tured by the front-view camera. The CNN model is trained and evaluated using
comma.ai dataset, which contains image frames and the steering angle data captured
from road driving. The test results show that the model can produce relatively accurate steering of the vehicle. Further discussions on evaluation and data augmentation are also presented for future improvement.
Chapter 8
Building an Autonomous Lane
Keeping Simulator Using
Real-World Data and End-to-End
Learning
Autonomous lane keeping is an important safety feature for intelligent vehicles. This
chapter presents a state-of-the-art end-to-end learning method using a convolutional neural network (CNN) that takes front-view camera data as input and produces the
proper steering wheel angle to keep the vehicle in lane. A novel method of data augmentation is proposed using a vehicle dynamic model and vehicle trajectory tracking, which can create additional training data as if the vehicle drives off-lane at random displacements and orientations. Real-world driving data are recorded from three front-view cameras on the left, center, and right. A lane keeping simulator is built using the recorded data in conjunction with image projection and vehicle dynamics estimation. Experimental results demonstrate that the end-to-end learning method with
augmented data can achieve high accuracy for autonomous lane keeping and very low
failure rate. The simulator can serve as a platform for both training and evaluation
of vision-based autonomous driving algorithms. The experimental dataset is made
available at http://computing.wpi.edu/dataset.html.
8.1 Introduction
Lane keeping is a fundamental feature for intelligent and autonomous vehicles. Despite the many sensors installed on autonomous cars, such as radar, LiDAR, ultrasonic sensors and infrared cameras, ordinary color cameras are still very popular owing to their low cost and ability to capture rich information. Given the video images from the front-view camera, a vision-based lane keeping system can automatically output the proper steering angles to maintain the vehicle in lane. A traditional framework divides the task into several stages including lane detection [104, 105], path planning [106, 107] and control logic [108, 109]. Applying image processing techniques such as color enhancement, Hough transform and edge detection, the lane detection stage identifies the lane markings on the road. Path planning and control logic are then employed to provide the proper steering angle adjustment for the vehicle. In this approach, the performance of lane detection heavily relies on the feature extraction and interpretation of image data. Errors can also accumulate from one processing stage to the next, leaving the final control output less accurate.
In contrast, an end-to-end learning method has the advantages of better perfor-
mance and less manual effort. End-to-end learning for self-driving cars has been
successfully demonstrated in [70] using convolutional neural networks (CNNs), which
Figure 8.1: Comparison between the traditional framework and end-to-end learning.
takes the images from cameras as input and produces the vehicle control output automatically. The model is self-optimized based on the training data and does not need manually defined features. Users do not need to label detected objects and their categories during the training process. Figure 8.1 is a comparison between the traditional framework and the end-to-end learning approach for vision-based automatic lane keeping.
Although the approach of end-to-end learning for lane keeping is not new, the existing work has several deficiencies. For instance, the error difference between the recorded “ground truth” and the predicted steering angle is not the best evaluation metric. Since it is hardly possible for a human driver to maintain the vehicle perfectly in the center of the lane at all times, the recorded angles are not optimal. Thus, the predicted angles do not have to be exactly the same as the ground truth angles recorded from the human driving experience. It is more important to predict the position and orientation of the vehicle in the very next time step given the current vehicle speed and steering angle control. As long as the vehicle stays in lane, the steering angle is acceptable. By using a simulator, the effects of the control input can be simulated and monitored, therefore providing a more reliable evaluation metric.
Furthermore, we need to provide data to train the deep neural network to take appropriate steering angle actions when the vehicle drifts away from the center of the lane. However, the recorded driving data lack this type of action since it is unsafe to drive off the lanes during data collection. To solve this dilemma, we propose a data augmentation method based on a vehicle dynamic model and vehicle trajectory tracking. Given any displacement and orientation, the model can generate a projected trajectory and a sequence of steering angle controls. Correspondingly, we can also create the augmented front views using image projection based on the shifted location and orientation. Therefore, the system becomes a simulator that can not only generate augmented data for training the convolutional neural network but also be used as a platform to evaluate the performance of other vision-based lane keeping algorithms.
The main contributions of this chapter are listed as follows:
1. This chapter presents a simulator for vision-based autonomous lane keeping.
Although there are many recent works on lane keeping algorithms, it is hard to
compare and evaluate them. Built on the recorded driving data, this simulator
employs image projection, vehicle dynamics modeling, and vehicle trajectory
tracking to predict vehicle movement and its corresponding camera views. The
simulator can be used for both training and evaluation of lane keeping algo-
rithms.
2. An end-to-end learning method is proposed that can generate proper steering angles from front-view camera data to maintain the vehicle in lane. A highly effective end-to-end learning system is demonstrated using the aforementioned simulator. The CNN model trained with augmented data from
the simulator performs significantly better than the model trained with recorded
data only.
3. A completely new dataset for autonomous lane keeping is developed and made available at http://computing.wpi.edu/dataset.html. The dataset contains recorded video frames from three forward facing cameras (left, center, and right) as well as steering wheel angles and vehicle speed information.
The rest of the chapter is organized as follows. Section 8.2 provides the implemen-
tation details of our simulator, including image projection, vehicle dynamics, vehicle
trajectory tracking as well as the CNN architecture. The experiment and evalua-
tion results are presented in Section 8.3, followed by discussions in Section 8.4 and
conclusions in Section 8.5.
8.2 Building a Simulator
8.2.1 Overview
For evaluation of vision-based lane keeping algorithms, a simulator is needed to provide feedback based on the predicted angle. The simulator can generate image frames according to the vehicle position and orientation, and it can also simulate the vehicle movement given a steering angle input. Therefore, a simulator for self-driving cars has two important components: a graphic engine and a physics engine. The graphic engine utilizes the information of the surrounding environment, as well as the pose of the camera, to generate images. The physics engine simulates vehicle movement based on the input control actions. A virtual game engine usually contains both graphic and physics engines, and some autonomous driving simulators were built upon them [11, 111]. Vehicle
movement simulation and frame generation can be integrated into the game engine. Besides, the ground truth information is very rich in the virtual world. Information such as vehicle position, orientation and velocity can be easily obtained, as can information about other objects. Despite these advantages, a significant drawback of these virtual simulators is that the generated images are still quite different from real-world data. Although they look very realistic with advanced graphic techniques, the details and variations of virtual images still cannot match data from the real world. It is risky to train a model using virtual game engines and then deploy the model for real-world driving. It would be better to build a simulator from the real-world data.
Different camera views can be generated from recorded video frames by a learning approach [112] or a 3D image projection approach [70]. The learning approach trains auto-encoders to embed road frames, and learns a transition model in the embedded space. The next few frames can be generated based on the current frame image and the current control inputs. On the other hand, the 3D image projection approach assumes the ground is a flat surface, and solves the 3D geometry [113] to generate the next frame based on the actual recorded frame. The camera shift and rotation can be obtained from vehicle movement simulation, which can be estimated using vehicle kinematic or dynamic models [108, 109].
In our simulator, the image projection approach is employed for rendering the images. The CNN takes the image as input and the vehicle dynamics is used to simulate vehicle movement given the control action. Figure 8.2 shows the detailed operations of the simulator when testing the CNN-based lane keeping algorithm. The predicted position is constantly validated against the ground truth position. A failure is recorded if the error exceeds a threshold value. More importantly, the simulator can be very useful when training the neural network by providing a large amount of additional training data through augmentation. When using the simulator for training, the vehicle trajectory tracking replaces the CNN controller to provide the control actions that can gradually correct the initial position shift and/or orientation rotation. Practically, assuming an arbitrary shift and rotation of the vehicle from the ground truth, the vehicle trajectory tracking block can produce the proper steering angle control actions. Combined with the generated camera view from the image projection process, augmented data can be generated. Figure 8.3 shows the operation flow of the simulator in the training phase, during which many augmented data can be generated from each ground truth image by arbitrary shifts and rotations of the vehicle.
8.2.2 Image projection
Rendering the image according to the vehicle position and orientation is required by the simulator, in order to provide more instances for machine learning and a better evaluation metric. However, without using a gaming engine, data collected in the real world are sparse, often along a single trajectory as the car goes. These data themselves are far from enough to cover all possible positions and orientations. Therefore, these data must be transformed for an arbitrary position and orientation, using image projection based on 3D geometry. Given a point in world coordinates X_w = (x_w, y_w, z_w) and the corresponding point in image coordinates p = (p_1, p_2), the relations are
p^h = X_w^h M_{ex} M_{in}    (8.1)

X_w^h = c(x_w, y_w, z_w, 1), \qquad p^h = d(p_1, p_2, 1)

where p^h and X_w^h are 1 × 3 and 1 × 4 homogeneous coordinates, c and d are arbitrary nonzero constants, M_{ex} is the 4 × 3 extrinsic matrix and M_{in} is the 3 × 3 intrinsic matrix. The extrinsic matrix contains a rotation matrix and a translation vector, which define the camera's position and orientation in the world coordinates. Therefore the extrinsic matrix changes if the camera is shifted or rotated. The intrinsic matrix defines the transformation from camera coordinates to image coordinates, including parameters such as focal length, aspect ratio, and the location of the principal point. The intrinsic matrix stays the same even if the camera is shifted or rotated. Both matrices can be obtained through a calibration procedure.

Figure 8.2: The flowchart of the test phase.

Figure 8.3: The flowchart of the training phase, using original data and augmented data.
Given an image taken in the real world with known calibration parameters M_{ex}, M_{in} and its pixel coordinates p, the new pixel coordinates p' need to be found with a new extrinsic matrix M'_{ex} when the camera is shifted and rotated. The physical dimensions of the 3D scene are required in order to find the projection parameters. In the case of highway lane keeping simulation, we made the assumption that the ground surface is flat, i.e., z_w = 0. According to formula 8.1, the mapping of p to p' can then be obtained as follows:

X_w^h = p^h M_{in}^{-1} M_{ex}^{-1}    (8.2)

p'^h = X_w^h M'_{ex} M_{in}    (8.3)
Note that the lens distortion, if any, needs to be corrected before performing such image projection. Figure 8.4 shows some examples of transforming an original image according to the camera's virtual position and orientation. The additive black area on the generated image is usually not an issue for vehicle simulation, since the captured images from front-view cameras are often cropped to retain only the middle section as the region of interest.
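As a concrete illustration, the following is a minimal Python sketch (not the dissertation's implementation) of the ground-plane mapping in equations 8.2 and 8.3, assuming hypothetical calibration matrices in the row-vector convention of equation 8.1. With z_w = 0, the projection reduces to a 3 × 3 homography.

import numpy as np

def ground_homography(M_ex, M_in):
    # With z_w = 0, the third row of the 4x3 extrinsic matrix (which
    # multiplies z_w in the row-vector convention) drops out, leaving a
    # 3x3 homography from (x_w, y_w, 1) to homogeneous pixel coordinates.
    return M_ex[[0, 1, 3], :] @ M_in

def reproject(pixels, M_ex, M_in, M_ex_new):
    # Map recorded pixels to a shifted/rotated virtual camera pose,
    # following equation 8.2 (back-projection to the ground plane)
    # and equation 8.3 (re-projection with the new extrinsic matrix).
    H_old = ground_homography(M_ex, M_in)
    H_new = ground_homography(M_ex_new, M_in)
    p_h = np.hstack([pixels, np.ones((len(pixels), 1))])   # (N, 3)
    X_h = p_h @ np.linalg.inv(H_old)     # ground-plane points (eq. 8.2)
    p_new = X_h @ H_new                  # new pixel coordinates (eq. 8.3)
    return p_new[:, :2] / p_new[:, 2:3]  # de-homogenize

In practice, the composed 3 × 3 mapping can be handed to a standard image warping routine so that the whole frame is rendered at once rather than pixel by pixel.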
Another challenging task is ground surface estimation during calibration. To estimate the calibration parameters, especially M_{ex} in formula 8.1 with the assumption z_w = 0 for the ground surface, the three cameras used in our system need to be deployed on the vehicle and the world coordinates need to be established properly. When calibrating cameras in the lab, a checkerboard pattern is usually used, as shown in Figure 8.4. However, estimating the ground surface needs a very large pattern, which is hard to craft and deploy. In our experiment, a flat parking lot with existing markings is used for ground surface estimation. Physical dimensions of the markings are measured manually while the corresponding images are captured by the cameras installed on the vehicle. Figure 8.5 shows the selected points in the image taken by the center camera during the calibration. The physical locations of the cameras and the selected points in the world coordinates are also shown in Figure 8.5. Three cameras are installed on the left, center and right of the vehicle, all facing forward, because they provide a better field of view than a single camera. In fact, the camera nearest to the vehicle's virtual position is selected as the source in equations 8.2 and 8.3. Therefore, the generated images have better quality and less additive black area after projection.
Figure 8.4: Example of original image and generated images given arbitrary camera poses. (a) Original image. A checkerboard pattern on a flat surface. (b) Generated image as if the camera is shifted left by 50 mm. (c) Generated image as if the camera is rotated right by 15.25 degrees. (d) Generated image as if the camera is shifted left by 50 mm and rotated right by 15.25 degrees.

Figure 8.5: Camera calibration and ground surface estimation. (a) Selected points in the image taken by the center camera. (b) Cameras and selected points in the world coordinates.

8.2.3 Vehicle dynamics and vehicle trajectory tracking

According to [108], the bicycle vehicle dynamics shown in Figure 8.6 is captured by the following equations:
\dot{x} = v\cos\theta, \qquad \dot{y} = v\sin\theta, \qquad \dot{\theta} = \omega

\theta = \psi + \beta, \qquad \dot{\psi} = \frac{v}{l_r}\sin\beta, \qquad \dot{v} = a

\beta = \arctan\left(\frac{l_r}{l_f + l_r}\tan(\sigma_f)\right)

where P = (x, y, θ) ∈ R² × S¹ is the state of position and orientation, v and ω are the linear velocity and angular velocity, respectively, which are also the control inputs, a is the acceleration, and σ_f is the turning angle. l_f and l_r are the distances from the vehicle's mass center to the front and rear axles. In our test vehicle, we use the estimated values l_f = 1 m and l_r = 1.7 m.
Figure 8.6: A virtual bicycle vehicle dynamics.

The dynamics in Figure 8.6 are feedback linearized by introducing a nonlinear mapping from the current nonlinear system to a new linear system with a new state variable z = [x, y, \dot{x}, \dot{y}]^\top:

\dot{z} = Az + Bu, \qquad \frac{d}{dt}\begin{bmatrix} x \\ y \\ \dot{x} \\ \dot{y} \end{bmatrix} = A \begin{bmatrix} x \\ y \\ \dot{x} \\ \dot{y} \end{bmatrix} + Bu
where the state matrix and the input matrix are

A = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}, \qquad B = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 1 & 0 \\ 0 & 1 \end{bmatrix},

and the input vector in the new linear system is u = [\ddot{x}, \ddot{y}]^\top. After the feedback linearization, the whole problem is transformed into searching for the proper gain K for the linear system. To solve this optimal control problem, a Linear Quadratic Regulator (LQR) is used to acquire the optimal gain K. The quadratic cost is defined as the following:
J = \int_0^\infty (x^\top Q x + u^\top R u)\, dt    (8.4)

where Q and R are 4 × 4 and 2 × 2 identity matrices, and x and u are the state and the control effort, respectively. Practically, Q and R do not have to be identity matrices, only positive definite, and their entries can be tuned to achieve the required performance accordingly. Once the gain K is computed, the feedback control law and the ordinary differential equation (ODE) of the new linear system are described as follows:
differential equation (ODE) of the new linear system are described as follows:
e = z − zd
u = −Ke+ ud
e = (A−BK)e
z = zd + e
where e is the error between the true state and the desired state, K is the gain computed based on the cost defined in equation 8.4 with the A and B matrices, u ∈ R² is the input vector, u_d = (\ddot{x}_d, \ddot{y}_d) is the reference input given by the ground truth, and \dot{e}, \dot{z}, \dot{z}_d are the time derivatives of the error, state, and desired state, respectively. A is the 4 × 4 state matrix, and B is the 4 × 2 input matrix.

v = \dot{x}\cos\theta + \dot{y}\sin\theta    (8.5)

\omega = \frac{1}{v}(\dot{y}\cos\theta - \dot{x}\sin\theta)    (8.6)

The control input for the nonlinear system can then be calculated by remapping the new input variables of the linear system back to the original inputs of the nonlinear system, namely the linear velocity v and the angular velocity ω, as shown in equations 8.5 and 8.6.
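For concreteness, the following Python sketch (an illustration under the A, B, Q, R stated above, not the dissertation's code) computes the LQR gain for the feedback-linearized system and remaps the tracked state back to v and ω via equations 8.5 and 8.6.

import numpy as np
from scipy.linalg import solve_continuous_are

# State matrix A (4x4) and input matrix B (4x2) of the linearized system.
A = np.block([[np.zeros((2, 2)), np.eye(2)],
              [np.zeros((2, 2)), np.zeros((2, 2))]])
B = np.vstack([np.zeros((2, 2)), np.eye(2)])
Q, R = np.eye(4), np.eye(2)          # identity weights, as in equation 8.4

# Solve the continuous-time algebraic Riccati equation; K = R^-1 B^T P.
P = solve_continuous_are(A, B, Q, R)
K = np.linalg.inv(R) @ B.T @ P

def control(z, z_d, u_d):
    # Feedback control law u = -K e + u_d with e = z - z_d.
    return -K @ (z - z_d) + u_d

def remap_to_vehicle(z, theta):
    # Equations 8.5 and 8.6: recover v and omega from z = [x, y, xdot, ydot].
    x_dot, y_dot = z[2], z[3]
    v = x_dot * np.cos(theta) + y_dot * np.sin(theta)
    omega = (y_dot * np.cos(theta) - x_dot * np.sin(theta)) / v
    return v, omega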
The results in Figure 8.7 demonstrate the effectiveness and correctness of the vehicle trajectory tracking controller design. A vehicle with the feedback control law has the capability of converging to and following the desired trajectory, even when there is an initial error. At the beginning, owing to some errors between the predicted and actual orientations, the steering angle is positive and large, which helps the vehicle correct its orientation in a short time. After 2 seconds, the predicted orientation and the ground truth converge. The vehicle orientation does not change rapidly for the next few seconds, which matches the fact that the steering angle of the vehicle remains in a small range near zero.
Figure 8.7: Correction of vehicle's position and orientation using vehicle trajectory tracking. (a) Ground truth and predicted trajectory. (b) Ground truth and predicted orientation. (c) Ground truth and predicted steering wheel angle.
8.2.4 CNN implementation
Convolutional neural networks (CNNs) [40–42] have achieved impressive performance in image classification. In this chapter, learning the human driver's control is not a classification problem but a regression problem, therefore the loss layer during training is the Euclidean loss, which computes the sum of squares of differences between the predicted steering angle and the ground truth steering angle: \frac{1}{2N}\sum_{i=1}^{N} \lVert x_i^1 - x_i^2 \rVert_2^2, where N is the number of instances, x_i^1 is the i-th predicted value and x_i^2 is the i-th ground truth value. The CNN is used as a steering angle predictor given the input image. It does not take the entire image frame as input since only the center section is the region of interest for lane keeping. The images are cropped before being fed to the CNN, as shown in Figure 8.8. The proposed CNN architecture is shown in Figure 8.9, and it is based on the PilotNet [70, 72]. It has 5 convolutional layers and 3 fully-connected layers. There are no pooling layers because the feature maps are small. The convolutional layers are mainly for feature extraction and the fully connected layers are mainly for steering angle prediction, but there is no clear boundary between them since the model is trained end-to-end. Unlike the PilotNet, our input image size is 400 × 150 instead of 200 × 66, and the first convolutional layer uses a 4 × 4 stride and a 9 × 9 kernel instead of a 2 × 2 stride and a 5 × 5 kernel. The PilotNet system uses the vehicle's turning radius r as the steering command, and outputs the inverse turning radius 1/r to avoid infinite values when driving straight. Our CNN uses the steering wheel angle as the output, which is more intuitive. The proposed CNN model is trained using our own dataset on the Caffe [99] and Matlab software platforms.
Figure 8.8: An example of cropped image frame from the dataset.
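As an illustration of the architecture just described, below is a hedged PyTorch sketch (the dissertation's model was built on Caffe and Matlab; the layer widths here are assumptions in the spirit of PilotNet, not the exact configuration): 5 convolutional layers with a 9 × 9 stride-4 first layer, no pooling, 3 fully-connected layers with dropout, and a single regression output trained with the Euclidean (mean squared error) loss.

import torch
import torch.nn as nn

class LaneKeepingCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # 5 convolutional layers; the 9x9 stride-4 first layer matches the
        # modification described above, the rest are assumed widths.
        self.features = nn.Sequential(
            nn.Conv2d(3, 24, kernel_size=9, stride=4), nn.ReLU(),
            nn.Conv2d(24, 36, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(36, 48, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(48, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        # 3 fully-connected layers regressing one steering wheel angle.
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(100), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(100, 50), nn.ReLU(),
            nn.Linear(50, 1),
        )

    def forward(self, x):            # x: (N, 3, 150, 400) cropped frames
        return self.regressor(self.features(x))

model = LaneKeepingCNN()
loss_fn = nn.MSELoss()               # Euclidean loss up to the 1/2 factor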
8.3 Experiment
8.3.1 Data collection
To capture images, three forward facing cameras are mounted on the dashboard of the car, from left to right. Because the cameras are not water-proof, installing them on top of the vehicle would be inappropriate. To avoid re-calibration each time, the cameras remain stationary once installed. Multi-thread programming and software triggers are used to synchronize the three cameras to capture images at 10 Hz. The shutter time is set to auto with an upper-bound value to avoid extremely low frame rates when the lighting condition is too dark. The image resolution is set to 1288 × 968, and captured images are stored as color image sequences. Meanwhile, the steering angle and speed information are recorded by accessing the CAN bus via the OBD-II port. The data from the OBD-II port are decoded by our customized program and then saved with time stamps, in order to synchronize with the image data. The steering wheel
angle decoded from the OBD-II port has a precision of 0.07 degree and the speed data has a precision of 1 km/h, or approximately 0.28 m/s. The steering wheel angle s needs to be converted to the vehicle's turning angle σ_f in Figure 8.6 by dividing by the steering ratio k, i.e., σ_f = s/k, where k has an estimated value of 17.8 in our experiment.

Figure 8.9: The CNN structure used, slightly modified from NVIDIA's PilotNet.
Figure 8.10 shows our data collection system on a vehicle, including three forward facing cameras, a USB hub, a laptop computer and an interface to the OBD-II port.
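A minimal sketch of the software-trigger synchronization described above is shown below (illustrative only; capture() is a hypothetical frame-grab call and the dissertation's implementation may differ). One thread per camera blocks on a shared barrier, and the main thread releases all three at each 10 Hz tick.

import threading, time

NUM_CAMS, PERIOD = 3, 0.1
barrier = threading.Barrier(NUM_CAMS + 1)
stop = threading.Event()

def camera_worker(cam_id):
    while True:
        barrier.wait()               # block until the common trigger fires
        if stop.is_set():
            break
        # capture(cam_id)            # grab one frame, store with timestamp

workers = [threading.Thread(target=camera_worker, args=(i,))
           for i in range(NUM_CAMS)]
for w in workers:
    w.start()
for _ in range(10):                  # one second of capture at 10 Hz
    tick = time.time()
    barrier.wait()                   # release all camera threads together
    time.sleep(max(0.0, PERIOD - (time.time() - tick)))
stop.set()
barrier.wait()                       # final release so workers can exit
for w in workers:
    w.join()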
The experimental data were collected on 7 occasions over 6 different days, approximately 1 hour each. Different lighting and weather conditions are included, such as sunny, cloudy and foggy, as shown in Figure 8.11. Night time driving is not included in our data. The collected data are then refined for the task of lane keeping. Recorded data that meet any of the following criteria are discarded: non-highway driving, speed lower than 40 mph, lane changes, extreme lighting conditions, equipment failures, and sequences shorter than 1 minute. After refinement, about 3 hours of driving data are valid. Among the 7 groups of collected data, 4 groups were used for training and the other 3 groups for testing. This is to prevent overlaps between training and test data. Overall, the training data contain 68,082 frames, nearly 2 hours at 10 Hz. The test data contain 32,053 frames, nearly 1 hour at 10 Hz. The training data sequences are randomly shuffled before being applied to the CNN model.

Figure 8.11: Example frames under different weather or lighting conditions. (a) Cloudy. (b) Shadowed. (c) Foggy. (d) Sunny.
8.3.2 Data augmentation
Ideally, the training dataset should contain some error correction scenarios such that the trained CNN model is capable of handling errors, so that the vehicle stays in the lane instead of drifting away. Such error correction data introduce initial errors into the vehicle's position and/or orientation, and then provide the proper control action to correct such errors and guide the vehicle back into the lane. The original data collected from highway driving lack such error correction data, because of the safety concerns of performing such dangerous maneuvers on the highway. Therefore, we propose to apply a data augmentation technique that can generate this type of error correction data virtually. This is one of the important benefits of building a simulator. Once the data are collected and the world coordinates established, it is possible to obtain the ground truth of the vehicle's position and orientation at any given time. For each frame, errors can be added manually to the vehicle's position and orientation. By using image projection based on 3D geometry, the augmented images can be generated accordingly. At the same time, the correct control action is provided by the vehicle trajectory tracking algorithm. Therefore, the augmented data can be used as part of the training data to improve the model's robustness. In our experiment, each frame is randomly augmented 10 times by shifting the vehicle position and changing its orientation. Figure 8.3 shows the entire process of data augmentation. Figure 8.12 shows examples of augmented images.
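Sketched below is the augmentation loop just described, with illustrative perturbation ranges taken from the examples in Figure 8.12. The names render and corrective_angle stand in for the image projection and trajectory tracking components above; all names here are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)

def augment_frame(frame, pose, render, corrective_angle,
                  n=10, max_shift=0.5, max_rot=7.0):
    # Perturb one recorded frame n times with random lateral shifts (m)
    # and rotations (degrees); pair each rendered view with the steering
    # label produced by the trajectory tracking controller.
    samples = []
    for _ in range(n):
        d = rng.uniform(-max_shift, max_shift)
        r = rng.uniform(-max_rot, max_rot)
        image = render(frame, shift=d, rotation=r)           # projection
        angle = corrective_angle(pose, shift=d, rotation=r)  # LQR tracker
        samples.append((image, angle))
    return samples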
8.3.3 Evaluation using simulator
In our previous work [73], it was shown that the difference between the ground truth angle and the predicted angle is not an effective metric for evaluating the performance of lane keeping systems. Here we propose a new metric that measures the percentage of driving time during which the vehicle is in lane. Our simulator can be employed as an evaluation platform for autonomous lane keeping. The process flow of using the simulator for evaluation is illustrated in Figure 8.2.

Figure 8.12: Example of original image and augmented images given arbitrary vehicle poses. (a) Original image. (b) Augmented image as if the vehicle is shifted right by 0.5 m. (c) Augmented image as if the vehicle is rotated left by 7 degrees. (d) Augmented image as if the vehicle is shifted right by 0.5 m and rotated left by 7 degrees.
Given the initial steering angle provided by the CNN model, the vehicle position and orientation are updated by the vehicle dynamics. Subsequently, a front-view camera image is generated through image projection according to the current vehicle position and orientation. The new image is then fed to the CNN model and it produces the steering angle for the next time step. The same process repeats for all frames in a test sequence. At each time step, the position difference from the ground truth is calculated. For simplicity, the longitudinal difference is fixed to zero, and the horizontal shift is compared with a threshold value. If the horizontal shift is larger than the threshold, it is considered a lane keeping failure. The threshold is set to 1 meter in our experiment. For each failure occurrence, the next 60 frames are automatically marked as a manual driving period. All other frames without failure are considered autonomous driving. The final criterion is the percentage of autonomous driving time (autonomy):

A = \frac{t_a}{t_a + t_m}    (8.7)

where t_a and t_m represent the autonomous time and the manually controlled time, respectively. Figure 8.13 shows an example of the simulation results when comparing the vehicle positions with the ground truth. The steering angles are produced by the CNN model trained with data augmentation.
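The autonomy computation in equation 8.7 amounts to a few lines; a small sketch (assuming 10 Hz frames and the 60-frame manual-driving penalty per failure described above) is:

def autonomy(num_frames, failure_frames, penalty=60):
    # Mark the 60 frames after each failure as manually driven (eq. 8.7).
    manual = set()
    for f in failure_frames:
        manual.update(range(f, min(f + penalty, num_frames)))
    t_m = len(manual)              # manually controlled frames
    t_a = num_frames - t_m         # autonomous frames
    return t_a / (t_a + t_m)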
In our experiment, the CNNs trained with and without augmented data are both evaluated using the simulator, and the results are shown in Table 8.1. The error of position is only evaluated when the vehicle is in autonomous driving mode; the data during manually controlled time in simulation are not evaluated. The percentage of autonomous driving time using the model trained with augmented data is 98.32% with 9 failures, which is significantly better than the 82.09% and 98 failures without augmented data.

Table 8.1: Evaluation result using the simulator, with and without augmented data.

Augmented Data | Autonomy | No. of Failures | Error of Position (Meters)
               |          |                 | Mean   | Standard Deviation
Yes            | 98.32%   | 9               | 0.2179 | 0.1813
No             | 82.09%   | 98              | 0.2670 | 0.2071
In addition, the simulation results also show that the error of the steering wheel angle is not an effective metric for performance evaluation.
Figure 8.13: An example of the simulation result, produced by the CNN trained with data augmentation. (a) Overview of the trajectory in a test sequence. (b) Trajectory zoomed in on the black rectangle in (a). (c) Trajectory zoomed in on the black rectangle in (b).
The model trained with augmented data has a mean error of 0.3042 degrees and a standard deviation of 1.6029 degrees. The model trained without augmented data has a mean error of 0.3118 degrees and a standard deviation of 1.2043 degrees. One can hardly tell which model is better from the mean error and standard deviation of the steering angles.
The deployed simulator with the CNN predictor runs at approximately 13 frames per second (FPS). Considering the input data rate of 10 Hz, the end-to-end lane keeping system is able to run in real time. The hardware platform is a desktop computer with an Intel i5 3570K processor running at 3.4 GHz, 32 GB of DDR3 RAM and one NVIDIA GTX 1080 GPU.
8.4 Discussion
It is worth investigating the causes of some failures during evaluation. For example, a failure case is shown in Figure 8.14. The vehicle is moving out of the lane to the right because the front vehicle is changing lanes and the lane markings are partially blocked. Another case is shown in Figure 8.15, with a cast shadow on the road. In most cases, we believe the quality of the input data plays a role in those failures, which can be attributed to factors such as shadows on the road, extreme lighting conditions, camera exposure settings, etc. Because of the complicated scenarios in the real world, the robustness of a model needs to be fully examined prior to deployment. Therefore, a simulator built on real-world data becomes very useful.
Figure 8.14: An example of failure. The vehicle is going out of the lane to the right because another vehicle is changing lanes, and lane markings are partially blocked.

Figure 8.15: An example of failure. The vehicle is going out of the lane to the right because of unclear lane markings.
8.5 Conclusions
This chapter presents an autonomous driving simulator that is built on real-world data with recordings from three front-view cameras, steering wheel angles and vehicle speed information. A vehicle dynamic model and trajectory tracking are incorporated in the simulator to predict the vehicle movement. With proper calibration, the 3D image projection technique can be applied to generate updated front-view images at the current vehicle position and orientation. The simulator can be used for both training and evaluation of vision-based lane keeping algorithms. Moreover, an end-to-end learning lane keeping system is proposed using a CNN model to predict the steering angle from the front-view camera input. The CNN model trained with augmented data results in significantly better performance than using only the original recorded data, when measured by the percentage of autonomous driving time. This new real-world driving dataset is shared online and can benefit research and education in autonomous vehicle technology.
Chapter 9
Conclusions
This dissertation presents the design and implementation of a group of systems for
autonomous vehicles.
The real-time GPU-based traffic sign detection and recognition system is capable of detecting and recognizing 48 classes of traffic signs of any size in each image frame. The detection rate is about 91.69% and the recognition rate is about 93.77%. The system can process 27.9 fps video with the active pixels of a 1,628 × 1,236 resolution. Because each frame is processed individually, no information from previous frames is required. As part of our future work, information from previous frames will be considered for tracking traffic signs, which is expected to further improve the detection accuracy.
Two traffic light detection and recognition systems are presented. The first system
detects and recognizes red circular lights only, using image processing and SVM. The
performance is better than that of traditional detectors. The second system is more
complicated. It detects and classifies different types of traffic lights, including green
and red lights in both circular and arrow forms. Color extraction and blob detection
are applied to locate the candidates with proper optimization. A classification and
validation method using PCANet is then used for frame-by-frame detection. The
multi-object tracking method and forecasting technique are employed to improve the
accuracy and produce stable results. As an additional contribution, we build a traffic
light dataset from the videos captured via a camera mounted behind the windshield.
A novel pedestrian detection instrumentation is designed using both thermal and
RGB-D stereo cameras. Data are collected from on-road driving and an experimental
dataset is built with the bounding box labeling of pedestrians as the ground truth.
A reconfigurable multi-stage detector framework is proposed. Both HOG and CCF based
detection methods are evaluated using data from multi-spectral cameras and their
various combinations. The experimental result indicates that the approach using
CCF outperforms that involving HOG features. The combination of color and ther-
mal images using the CCF method can achieve the best performance of about 9%
log-average miss rate. For future work, other advanced feature extraction and classi-
fication methods will be considered to further improve the detector performance.
The lane keeping system employs an end-to-end learning approach to obtain the
proper steering angle for maintaining the car in the lane. The CNN model is trained
and evaluated using comma.ai dataset, which contains image frames and the steering
angle data captured from road driving. The test results show that the model can
produce relatively accurate steering of the vehicle. Further discussions on evaluation and data augmentation are also presented for future improvement.
A simulator for the lane keeping system is built using image projection, vehicle
dynamics and vehicle trajectory tracking. This is important for data augmentation
and evaluation. The test results show that the model trained with augmented data
using the simulator has better performance.
Our on-vehicle data collection systems are also implemented and deployed, and
our own datasets are built from recorded driving videos. These datasets are used in
most of our projects and can benefit other researchers in the future. Our experimental
datasets are available at http://computing.wpi.edu/Dataset.html.
Bibliography
[1] “Red light running,” Insurance Institute of Highway Safety. [Online]. Available:
http://www.iihs.org/iihs/topics/t/red-light-running/topicoverview
[2] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the
kitti vision benchmark suite,” in Conference on Computer Vision and Pattern
Recognition (CVPR), 2012.
[3] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The
kitti dataset,” International Journal of Robotics Research (IJRR), 2013.
[4] J. Fritsch, T. Kuehnl, and A. Geiger, “A new performance measure and eval-
uation benchmark for road detection algorithms,” in International Conference
on Intelligent Transportation Systems (ITSC), 2013.
[5] M. Menze and A. Geiger, “Object scene flow for autonomous vehicles,” in Con-
ference on Computer Vision and Pattern Recognition (CVPR), 2015.
[6] M. Mathias, R. Timofte, R. Benenson, and L. V. Gool, “Traffic sign recognition
- how far are we from the solution?” in Proceedings of IEEE International Joint
Conference on Neural Networks (IJCNN 2013), August 2013.
[7] S. Houben, J. Stallkamp, J. Salmen, M. Schlipsing, and C. Igel, “Detection of
traffic signs in real-world images: The German Traffic Sign Detection Bench-
mark,” in International Joint Conference on Neural Networks, no. 1288, 2013.
[8] “Traffic Lights Recognition public benchmarks.” [Online]. Available: http:
//www.lara.prd.fr/benchmarks/trafficlightsrecognition
[9] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detec-
tion,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE
Computer Society Conference on, vol. 1, June 2005, pp. 886–893 vol. 1.
[10] P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: A bench-
mark,” in CVPR, June 2009.
[11] S. R. Richter, V. Vineet, S. Roth, and V. Koltun, “Playing for data: Ground
truth from computer games,” in European Conference on Computer Vision
(ECCV), ser. LNCS, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds., vol.
9906. Springer International Publishing, 2016, pp. 102–118.
[12] J. Greenhalgh and M. Mirmehdi, “Real-time detection and recognition of
road traffic signs,” Intelligent Transportation Systems, IEEE Transactions on,
vol. 13, no. 4, pp. 1498–1506, 2012.
[13] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel, “Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition,” Neural Networks, 2012. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0893608012000457
[14] C. Keller, C. Sprunk, C. Bahlmann, J. Giebel, and G. Baratoff, “Real-time
recognition of u.s. speed signs,” in Intelligent Vehicles Symposium, 2008 IEEE,
June 2008, pp. 518–523.
[15] W. Liu, Y. Wu, J. Lv, H. Yuan, and H. Zhao, “U.s. speed limit sign detection
and recognition from image sequences,” in Control Automation Robotics Vision
(ICARCV), 2012 12th International Conference on, Dec 2012, pp. 1437–1442.
[16] F. Zaklouta, B. Stanciulescu, and O. Hamdoun, “Traffic sign classification us-
ing k-d trees and random forests,” in Neural Networks (IJCNN), The 2011
International Joint Conference on, July 2011, pp. 2151–2155.
[17] P. Sermanet and Y. LeCun, “Traffic sign recognition with multi-scale convolu-
tional networks,” in Neural Networks (IJCNN), The 2011 International Joint
Conference on, July 2011, pp. 2809–2813.
[18] E. Herbschleb and P. H. N. de With, “Real-time traffic sign detection and recognition,” pp. 72570A-1–72570A-12, 2009. [Online]. Available: http://dx.doi.org/10.1117/12.806171
[19] A. D. L. Escalera, L. E. Moreno, M. A. Salichs, and J. M. Armingol, “Road
traffic sign detection and classification,” IEEE Transactions on Industrial Elec-
tronics, vol. 44, pp. 848–859, 1997.
[20] K. Par and O. Tosun, “Real-time traffic sign recognition with map fusion on
multicore/many-core architectures,” Acta Polytechnica Hungarica, vol. 9, no. 2,
2012.
[21] R. de Charette and F. Nashashibi, “Real time visual traffic lights recognition
based on spot light detection and adaptive traffic lights templates,” in Intelligent
Vehicles Symposium, 2009 IEEE, June 2009, pp. 358–363.
[22] ——, “Traffic light recognition using image processing compared to learning
processes,” in Intelligent Robots and Systems, 2009. IROS 2009. IEEE/RSJ
International Conference on, Oct 2009, pp. 333–338.
[23] G. Trehard, E. Pollard, B. Bradai, and F. Nashashibi, “Tracking both pose and
status of a traffic light via an interacting multiple model filter,” in Information
Fusion (FUSION), 2014 17th International Conference on, July 2014, pp. 1–7.
[24] S. Sooksatra and T. Kondo, “Red traffic light detection using fast radial symme-
try transform,” in Electrical Engineering/Electronics, Computer, Telecommu-
nications and Information Technology (ECTI-CON), 2014 11th International
Conference on, May 2014, pp. 1–6.
[25] T.-P. Sung and H.-M. Tsai, “Real-time traffic light recognition on mobile devices
with geometry-based filtering,” in Distributed Smart Cameras (ICDSC), 2013
Seventh International Conference on, Oct 2013, pp. 1–7.
[26] J. Levinson, J. Askeland, J. Dolson, and S. Thrun, “Traffic light mapping, local-
ization, and state detection for autonomous vehicles,” in Robotics and Automa-
tion (ICRA), 2011 IEEE International Conference on, May 2011, pp. 5784–
5791.
[27] N. Fairfield and C. Urmson, “Traffic light mapping and detection,” in Robotics
and Automation (ICRA), 2011 IEEE International Conference on, May 2011,
pp. 5421–5426.
[28] A. Gomez, F. Alencar, P. Prado, F. Osorio, and D. Wolf, “Traffic lights detec-
tion and state estimation using hidden markov models,” in Intelligent Vehicles
Symposium Proceedings, 2014 IEEE, June 2014, pp. 750–755.
[29] S. Salti, A. Petrelli, F. Tombari, N. Fioraio, and L. Di Stefano, “Traffic sign
detection via interest region extraction,” Pattern Recognition, vol. 48(4), pp.
1039–1049, 2015.
[30] G. Hinton, S. Osindero, and Y. Teh, “A fast learning algorithm for deep belief
nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, July 2006.
[31] I. Arel, D. Rose, and T. Karnowski, “Deep machine learning - a new frontier
in artificial intelligence research [research frontier],” Computational Intelligence
Magazine, IEEE, vol. 5, no. 4, pp. 13–18, Nov 2010.
[32] T.-H. Chan, K. Jia, S. Gao, J. Lu, Z. Zeng, and Y. Ma, “Pcanet: A simple deep
learning baseline for image classification?” arXiv preprint arXiv:1404.3606,
2014.
[33] S. Lafuente-Arroyo, S. Maldonado-Bascon, P. Gil-Jimenez, H. Gomez-Moreno,
and F. Lopez-Ferreras, “Road sign tracking with a predictive filter solution,” in
IEEE Industrial Electronics, IECON 2006 - 32nd Annual Conference on, Nov
2006, pp. 3314–3319.
[34] S. Lafuente-Arroyo, S. Maldonado-Bascon, P. Gil-Jimenez, J. Acevedo-
Rodriguez, and R. Lopez-Sastre, “A tracking system for automated inventory
of road signs,” in Intelligent Vehicles Symposium, 2007 IEEE, June 2007, pp.
166–171.
[35] S. Zhang, R. Benenson, M. Omran, J. H. Hosang, and B. Schiele, “How far
are we from solving pedestrian detection?” CoRR, vol. abs/1602.01237, 2016.
[Online]. Available: http://arxiv.org/abs/1602.01237
[36] P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: An
evaluation of the state of the art,” PAMI, vol. 34, 2012.
[37] P. Viola and M. J. Jones, “Robust real-time face detection,” International
Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004. [Online].
Available: http://dx.doi.org/10.1023/B:VISI.0000013087.49260.fb
[38] R. Benenson, M. Omran, J. H. Hosang, and B. Schiele, “Ten years of
pedestrian detection, what have we learned?” CoRR, vol. abs/1411.4304, 2014.
[Online]. Available: http://arxiv.org/abs/1411.4304
[39] P. Dollar, Z. Tu, P. Perona, and S. Belongie, “Integral channel features,” pp.
91.1–91.11, 2009, doi:10.5244/C.23.91.
[40] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classifica-
tion with deep convolutional neural networks,” in Advances in Neural
Information Processing Systems 25, F. Pereira, C. J. C. Burges,
L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc.,
2012, pp. 1097–1105. [Online]. Available: http://papers.nips.cc/paper/
4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
[41] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014. [Online].
Available: http://arxiv.org/abs/1409.1556
[42] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in The
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June
2015.
[43] B. Yang, J. Yan, Z. Lei, and S. Z. Li, “Convolutional channel features
for pedestrian, face and edge detection,” CoRR, vol. abs/1504.07339, 2015.
[Online]. Available: http://arxiv.org/abs/1504.07339
[44] R. Gade and T. B. Moeslund, “Thermal cameras and applications: a survey,”
Machine Vision and Applications, vol. 25, no. 1, pp. 245–262, 2014. [Online].
Available: http://dx.doi.org/10.1007/s00138-013-0570-5
[45] W. Li, D. Zheng, T. Zhao, and M. Yang, “An effective approach to pedestrian
detection in thermal imagery,” in Natural Computation (ICNC), 2012 Eighth
International Conference on, May 2012, pp. 325–329.
[46] F. Suard, A. Rakotomamonjy, A. Bensrhair, and A. Broggi, “Pedestrian detec-
tion using infrared images and histograms of oriented gradients,” in 2006 IEEE
Intelligent Vehicles Symposium, 2006, pp. 206–212.
[47] C. Dai, Y. Zheng, and X. Li, “Pedestrian detection and tracking in
infrared imagery using shape and appearance,” Computer Vision and
Image Understanding, vol. 106, no. 2-3, pp. 288 – 299, 2007, special
issue on Advances in Vision Algorithms and Systems beyond the Visible
Spectrum. [Online]. Available: http://www.sciencedirect.com/science/article/
pii/S1077314206001925
[48] J. W. Davis and M. A. Keck, “A two-stage template approach to person de-
tection in thermal imagery,” Applications of Computer Vision and the IEEE
Workshop on Motion and Video Computing, IEEE Workshop on, vol. 1, pp.
364–369, 2005.
[49] F. Xu, X. Liu, and K. Fujimura, “Pedestrian detection and tracking with night
vision,” IEEE Transactions on Intelligent Transportation Systems, vol. 6, no. 1,
pp. 63–71, March 2005.
[50] D. Olmeda, A. de la Escalera, and J. M. Armingol, “Contrast invariant features
for human detection in far infrared images,” in Intelligent Vehicles Symposium
(IV), 2012 IEEE, June 2012, pp. 117–122.
[51] W. Wang, J. Zhang, and C. Shen, “Improved human detection and classifi-
cation in thermal images,” in 2010 IEEE International Conference on Image
Processing, Sept 2010, pp. 2313–2316.
[52] M. Bertozzi, A. Broggi, C. H. Gomez, R. I. Fedriga, G. Vezzoni, and M. DelRose,
“Pedestrian detection in far infrared images based on the use of probabilistic
templates,” in 2007 IEEE Intelligent Vehicles Symposium, June 2007, pp. 327–
332.
[53] T. T. Zin, H. Takahashi, and H. Hama, “Robust person detection using far
infrared camera for image fusion,” in Innovative Computing, Information and
Control, 2007. ICICIC ’07. Second International Conference on, Sept 2007, pp.
310–310.
[54] D. Geronimo, A. M. Lopez, A. D. Sappa, and T. Graf, “Survey of pedestrian de-
tection for advanced driver assistance systems,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 32, no. 7, pp. 1239–1258, July 2010.
[55] S. Hwang, J. Park, N. Kim, Y. Choi, and I. S. Kweon, “Multispectral pedes-
trian detection: Benchmark dataset and baseline,” in 2015 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 1037–1045.
[56] S. J. Krotosky and M. M. Trivedi, “On color-, infrared-, and multimodal-stereo
approaches to pedestrian detection,” IEEE Transactions on Intelligent Trans-
portation Systems, vol. 8, no. 4, pp. 619–629, Dec 2007.
[57] K. H. Lee and J. N. Hwang, “On-road pedestrian tracking across multiple driv-
ing recorders,” IEEE Transactions on Multimedia, vol. 17, no. 9, pp. 1429–1438,
Sept 2015.
[58] W. Liu, R. W. H. Lau, X. Wang, and D. Manocha, “Exemplar-amms: Recog-
nizing crowd movements from pedestrian trajectories,” IEEE Transactions on
Multimedia, vol. 18, no. 12, pp. 2398–2406, Dec 2016.
[59] R. Risack, N. Mohler, and W. Enkelmann, “A video-based lane keeping assis-
tant,” in Proceedings of the IEEE Intelligent Vehicles Symposium 2000 (Cat.
No.00TH8511), 2000, pp. 356–361.
[60] S. Ishida and J. E. Gayko, “Development, evaluation and introduction of a lane
keeping assistance system,” in IEEE Intelligent Vehicles Symposium, 2004, June
2004, pp. 943–944.
[61] J. F. Liu, J. H. Wu, and Y. F. Su, “Development of an interactive lane keep-
ing control system for vehicle,” in 2007 IEEE Vehicle Power and Propulsion
Conference, Sept 2007, pp. 702–706.
[62] A. H. Eichelberger and A. T. McCartt, “Toyota drivers’ experiences with
dynamic radar cruise control, pre-collision system, and lane-keeping assist,”
Journal of Safety Research, vol. 56, pp. 67 – 73, 2016. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0022437515001061
[63] Y. Li, “Deep reinforcement learning: An overview,” CoRR, vol. abs/1701.07274,
2017. [Online]. Available: http://arxiv.org/abs/1701.07274
[64] A. E. Sallab, M. Abdou, E. Perot, and S. Yogamani, “End-to-end deep
reinforcement learning for lane keeping assist,” CoRR, vol. abs/1612.04340,
2016. [Online]. Available: http://arxiv.org/abs/1612.04340
[65] S. Shalev-Shwartz, S. Shammah, and A. Shashua, “Safe, multi-agent,
reinforcement learning for autonomous driving,” CoRR, vol. abs/1610.03295,
2016. [Online]. Available: http://arxiv.org/abs/1610.03295
[66] A. E. Sallab, M. Abdou, E. Perot, and S. Yogamani, “Deep reinforcement
learning framework for autonomous driving,” CoRR, vol. abs/1704.02532,
2017. [Online]. Available: http://arxiv.org/abs/1704.02532
[67] S. Sharifzadeh, I. Chiotellis, R. Triebel, and D. Cremers, “Learning to
drive using inverse reinforcement learning and deep q-networks,” CoRR, vol.
abs/1612.03653, 2016. [Online]. Available: http://arxiv.org/abs/1612.03653
[68] D. A. Pomerleau, “Advances in neural information processing systems 1,” D. S.
Touretzky, Ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.,
1989, ch. ALVINN: An Autonomous Land Vehicle in a Neural Network, pp.
305–313. [Online]. Available: http://dl.acm.org/citation.cfm?id=89851.89891
[69] Y. LeCun, U. Muller, J. Ben, E. Cosatto, and B. Flepp, “Off-road
obstacle avoidance through end-to-end learning,” in Proceedings of the 18th
International Conference on Neural Information Processing Systems, ser.
NIPS’05. Cambridge, MA, USA: MIT Press, 2005, pp. 739–746. [Online].
Available: http://dl.acm.org/citation.cfm?id=2976248.2976341
[70] M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D.
Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba,
“End to end learning for self-driving cars,” CoRR, vol. abs/1604.07316, 2016.
[Online]. Available: http://arxiv.org/abs/1604.07316
[71] M. Bojarski, A. Choromanska, K. Choromanski, B. Firner, L. D.
Jackel, U. Muller, and K. Zieba, “Visualbackprop: visualizing cnns for
autonomous driving,” CoRR, vol. abs/1611.05418, 2016. [Online]. Available:
http://arxiv.org/abs/1611.05418
[72] M. Bojarski, P. Yeres, A. Choromanska, K. Choromanski, B. Firner, L. D.
Jackel, and U. Muller, “Explaining how a deep neural network trained with
end-to-end learning steers a car,” CoRR, vol. abs/1704.07911, 2017. [Online].
Available: http://arxiv.org/abs/1704.07911
[73] Z. Chen and X. Huang, “End-to-end learning for lane keeping of self-driving
cars,” in 2017 IEEE Intelligent Vehicles Symposium (IV), June 2017.
[74] J. Hardy and M. Campbell, “Contingency planning over probabilistic obstacle
predictions for autonomous road vehicles,” IEEE Transactions on Robotics,
vol. 29, no. 4, pp. 913–929, 2013.
[75] E. Frazzoli, M. A. Dahleh, and E. Feron, “Real-time motion planning for agile
autonomous vehicles,” in American Control Conference, 2001. Proceedings of
the 2001, vol. 1. IEEE, 2001, pp. 43–49.
[76] M. Likhachev and D. Ferguson, “Planning long dynamically feasible maneu-
vers for autonomous vehicles,” The International Journal of Robotics Research,
vol. 28, no. 8, pp. 933–945, 2009.
[77] R. Y. Hindiyeh, “Dynamics and control of drifting in automobiles,” Ph.D. dissertation, Stanford University, March 2013.
[78] E. Galceran, R. M. Eustice, and E. Olson, “Toward integrated motion planning
and control using potential fields and torque-based steering actuation for au-
tonomous driving,” in Proceedings of the IEEE Intelligent Vehicle Symposium,
Seoul, Korea, June 2015, pp. 304–309.
[79] R. DeSantis, “Path-tracking for articulated vehicles via exact and Jacobian
linearization,” IFAC Proceedings Volumes, vol. 31, no. 3, pp. 159–164, 1998.
[80] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detec-
tion,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE
Computer Society Conference on, vol. 1, June 2005, pp. 886–893.
[81] “BelgiumTS Dataset,” 2010. [Online]. Available: http://btsd.ethz.ch/
shareddata/
[82] F. Zaklouta and B. Stanciulescu, “Real-time traffic sign recognition
in three stages,” Robotics and Autonomous Systems, vol. 62, no. 1,
pp. 16–24, 2014 (New Boundaries of Robotics). [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0921889012001236
[83] S. Suzuki and K. Abe, “Topological structural analysis of digitized binary im-
ages by border following,” Computer Vision, Graphics, and Image Processing,
vol. 30, no. 1, pp. 32–46, 1985.
[84] G. Bradski, “The OpenCV Library,” Dr. Dobb’s Journal of Software Tools,
vol. 25, no. 11, pp. 120–126, 2000.
[85] H. Cheng, X. Jiang, Y. Sun, and J. Wang, “Color image segmentation: advances
and prospects,” Pattern Recognition, vol. 34, no. 12, pp. 2259–2281, 2001.
[86] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with
deep convolutional neural networks,” in Advances in Neural Information Pro-
cessing Systems 25, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, Eds.
Curran Associates, Inc., 2012, pp. 1097–1105.
[87] M. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool, “On-
line multiperson tracking-by-detection from a single, uncalibrated camera,” Pat-
tern Analysis and Machine Intelligence, IEEE Transactions on, vol. 33, no. 9,
pp. 1820–1833, Sept 2011.
[88] H. W. Kuhn, “The Hungarian method for the assignment problem,” in 50 Years
of Integer Programming 1958-2008. Springer, 2010, pp. 29–47.
[89] S.-H. Bae and K.-J. Yoon, “Robust online multi-object tracking based on track-
let confidence and online discriminative appearance learning,” in Computer Vi-
sion and Pattern Recognition (CVPR), 2014 IEEE Conference on, June 2014,
pp. 1218–1225.
[90] K. Basak, S. N. Hetu, Z. Li, C. L. Azevedo, H. Loganathan, T. Toledo,
R. Xu, Y. Xu, L.-S. Peh, and M. Ben-Akiva, “Modeling reaction time
within a traffic simulation model,” in 16th International IEEE Conference on
Intelligent Transportation Systems (ITSC 2013), Oct 2013, pp. 302–309.
[91] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang,
A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet
Large Scale Visual Recognition Challenge,” International Journal of Computer
Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
[92] P. Domingos, “A few useful things to know about machine learning,” Commun.
ACM, vol. 55, no. 10, pp. 78–87, Oct. 2012.
[93] Z. Chen, X. Huang, Z. Ni, and H. He, “A GPU-based real-time traffic sign
detection and recognition system,” in Computational Intelligence in Vehicles
and Transportation Systems (CIVTS), 2014 IEEE Symposium on, Dec 2014,
pp. 1–5.
[94] Z. Chen, J. Wang, H. He, and X. Huang, “A fast deep learning system using
GPU,” in 2014 IEEE International Symposium on Circuits and Systems (IS-
CAS), June 2014, pp. 1552–1555.
[95] Y. Zhou, W. Wang, and X. Huang, “FPGA design for PCANet deep learning
network,” in Field-Programmable Custom Computing Machines (FCCM), 2015
IEEE 23rd Annual International Symposium on, May 2015, pp. 232–232.
[96] R. Hartley and A. Zisserman, Multiple view geometry in computer vision. Cam-
bridge University Press, 2003.
[97] R. E. Schapire and Y. Singer, “Improved boosting algorithms using confidence-
rated predictions,” Machine Learning, vol. 37, no. 3, pp. 297–336, 1999.
[Online]. Available: http://dx.doi.org/10.1023/A:1007614523901
[98] P. Dollár, R. Appel, S. Belongie, and P. Perona, “Fast feature pyramids
for object detection,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 36, no. 8, pp. 1532–1545, Aug 2014.
[99] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadar-
rama, and T. Darrell, “Caffe: Convolutional architecture for fast feature em-
bedding,” arXiv preprint arXiv:1408.5093, 2014.
[100] M. Rohrbach, M. Enzweiler, and D. M. Gavrila, “High-level fusion of depth and
intensity for pedestrian classification,” in Joint Pattern Recognition Symposium.
Springer, 2009, pp. 101–110.
[101] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, “On combining classifiers,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 3,
pp. 226–239, Mar 1998.
[102] B. Waske and J. A. Benediktsson, “Fusion of support vector machines for clas-
sification of multisensor data,” IEEE Transactions on Geoscience and Remote
Sensing, vol. 45, no. 12, pp. 3858–3866, Dec 2007.
[103] R. Pouteau, B. Stoll, and S. Chabrier, “Support vector machine fusion of mul-
tisensor imagery in tropical ecosystems,” in Image Processing Theory Tools and
Applications (IPTA), 2010 2nd International Conference on, July 2010, pp.
325–329.
[104] J. Zhao, B. Xie, and X. Huang, “Real-time lane departure and front collision
warning system on an FPGA,” in 2014 IEEE High Performance Extreme Com-
puting Conference (HPEC), Sept 2014, pp. 1–5.
[105] A. J. Humaidi and M. A. Fadhel, “Performance comparison for lane detection
and tracking with two different techniques,” in 2016 Al-Sadeq International
Conference on Multidisciplinary in IT and Communication Science and Appli-
cations (AIC-MITCSA), May 2016, pp. 1–6.
[106] C. Li, J. Wang, X. Wang, and Y. Zhang, “A model based path planning algo-
rithm for self-driving cars in dynamic environment,” in 2015 Chinese Automa-
tion Congress (CAC), Nov 2015, pp. 1123–1128.
[107] S. Yoon, S. E. Yoon, U. Lee, and D. H. Shim, “Recursive path planning us-
ing reduced states for car-like vehicles on grid maps,” IEEE Transactions on
Intelligent Transportation Systems, vol. 16, no. 5, pp. 2797–2813, Oct 2015.
[108] J. Kong, M. Pfeiffer, G. Schildbach, and F. Borrelli, “Kinematic and dynamic
vehicle models for autonomous driving control design,” in 2015 IEEE Intelligent
Vehicles Symposium (IV), June 2015, pp. 1094–1099.
[109] D. Wang and F. Qi, “Trajectory planning for a four-wheel-steering vehicle,”
in Proceedings of the 2001 IEEE International Conference on Robotics and
Automation (ICRA), vol. 4, 2001, pp. 3320–3325.
[110] “The comma.ai driving dataset.” [Online]. Available: https://github.com/
commaai/research
[111] S. Minhas, A. Hernandez-Sabate, S. Ehsan, K. Díaz-Chito, A. Leonardis, A. M.
Lopez, and K. D. McDonald-Maier, LEE: A Photorealistic Virtual Environ-
ment for Assessing Driver-Vehicle Interactions in Self-driving Mode. Cham:
Springer International Publishing, 2016, pp. 894–900.
[112] E. Santana and G. Hotz, “Learning a driving simulator,” CoRR, vol.
abs/1608.01230, 2016. [Online]. Available: http://arxiv.org/abs/1608.01230
[113] R. Szeliski, Computer vision: algorithms and applications. Springer Science &
Business Media, 2010.