Computer Vision and Machine Learning for Autonomous Vehicles
by
Zhilu Chen
A Dissertation
Submitted to the Faculty
of the
WORCESTER POLYTECHNIC INSTITUTE
in partial fulfillment of the requirements for the
Degree of Doctor of Philosophy
in
Electrical and Computer Engineering
August 2017
APPROVED:
Prof. Xinming Huang, Major Advisor
Prof. Lifeng Lai
Prof. Haibo He
Abstract
The autonomous vehicle is an engineering technology that can improve transporta-
tion safety, alleviate traffic congestion and reduce carbon emissions. Research on
autonomous vehicles can be categorized by functionality, for example, object detec-
tion or recognition, path planning, navigation, lane keeping, speed control and driver
status monitoring. The research topics can also be categorized by the equipment or
techniques used, for example, image processing, computer vision, machine learning,
and localization. This dissertation primarily reports on computer vision and machine
learning algorithms and their implementations for autonomous vehicles. The vision-
based system can effectively detect and accurately recognize multiple objects on the
road, such as traffic signs, traffic lights, and pedestrians. In addition, an autonomous
lane keeping system has been proposed using end-to-end learning. In this disserta-
tion, a road simulator is built from collected real-world data with augmentation, which
can be used for training and evaluating autonomous driving algorithms.
The Graphics Processing Unit (GPU) based traffic sign detection and recogni-
tion system can detect and recognize 48 traffic signs. The implementation has three
stages: pre-processing, feature extraction, and classification. A highly optimized and
parallelized version of Histogram of Oriented Gradients (HOG) and Support Vector
Machine (SVM) is used. The system can process 27.9 frames per second at a
resolution of 1,628 × 1,236 active pixels, with minimal loss of accuracy.
In an evaluation using the BelgiumTS dataset, the experimental results indicate that
the detection rate is about 91.69% with false positives per window of 3.39 × 10⁻⁵, and
the recognition rate is about 93.77%.
We report on two traffic light detection and recognition systems. The first sys-
tem detects and recognizes red circular lights only, using image processing and SVM.
Its performance is better than that of traditional detectors, achieving 96.97%
precision and 99.43% recall. The second system is more
complicated. It detects and classifies different types of traffic lights, including green
and red lights in both circular and arrow forms. In addition, it employs image process-
ing techniques, such as color extraction and blob detection, to locate the candidates.
Subsequently, a pre-trained PCA network is used as a multi-class classifier for obtain-
ing frame-by-frame results. Furthermore, an online multi-object tracking technique is
applied to overcome occasional misses and a forecasting method is used to filter out
false positives. Several additional optimization techniques are employed to improve
the detector performance and to handle the traffic light transitions.
A multi-spectral data collection system is implemented for pedestrian detection,
which includes a thermal camera and a pair of stereo color cameras. The three cameras
are first aligned using trifocal tensor, and the aligned data are processed by using
computer vision and machine learning techniques. Convolutional channel features
(CCF) and the traditional HOG+SVM approach are evaluated over the data captured
from the three cameras. Through the use of trifocal tensor and CCF, training becomes
more efficient. The proposed system achieves only a 9% log-average miss rate on our
dataset.
The autonomous lane keeping system employs an end-to-end learning approach for
obtaining the proper steering angle for maintaining a car in a lane. The convolutional
neural network (CNN) model uses raw image frames as input, and it outputs the
steering angles corresponding to the input frames. Unlike the traditional approach,
which manually decomposes the problem into several parts, such as lane detection,
path planning, and steering control, the model learns to extract useful features on
its own and learns to steer from human behavior. Moreover, we find that having
a simulator for data augmentation and evaluation is essential. We then
build the simulator using image projection, vehicle dynamics, and vehicle trajectory
tracking. The test results reveal that the model trained with augmented data using
the simulator has better performance and achieves about a 98% autonomous driving
time on our dataset.
Furthermore, a vehicle data collection system is developed for building our own
datasets from recorded videos. These datasets are used in the above studies and
have been released to the public for autonomous vehicle research. The experimental
datasets are available at http://computing.wpi.edu/Dataset.html.
Acknowledgements
I would like to express my gratitude to my advisor, Professor Xinming Huang, for
the opportunity to do research at WPI and his guidance in my research.
Thanks to Professors Haibo He, Lifeng Lai, and many other professors for their
help. I have learned a lot from them.
Thanks to my family and my friends for giving me courage and confidence.
Contents
Abstract i
Acknowledgements iv
Contents ix
List of Tables x
List of Figures xv
List of Abbreviations xvii
1 Introduction 1
1.1 Motivations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2 Background 10
2.1 Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.2 Object detection and recognition . . . . . . . . . . . . . . . . . . . . 13
2.2.1 Traffic sign . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2.2 Traffic light . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.3 Pedestrian . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.3 Lane keeping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3 A GPU-Based Real-Time Traffic Sign Detection and Recognition
System 21
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Traffic Sign Detection and Recognition System . . . . . . . . . . . . . 23
3.2.1 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2.2 Pre-processing . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.3 Traffic Sign Detection . . . . . . . . . . . . . . . . . . . . . . 26
3.2.4 Traffic Sign Recognition . . . . . . . . . . . . . . . . . . . . . 29
3.3 Parallelism on GPU . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.4 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4 Automatic Detection of Traffic Lights Using Support Vector Ma-
chine 36
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2 Proposed Method for Traffic Light Detection . . . . . . . . . . . . . . 38
4.2.1 Locating candidates based on color extraction . . . . . . . . . 38
4.2.2 Traffic light detection using template matching . . . . . . . . . 38
4.2.3 An improved method using SVM . . . . . . . . . . . . . . . . 40
4.3 Data Collection and Performance Evaluation . . . . . . . . . . . . . . 43
4.4 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5 Accurate and Reliable Detection of Traffic Lights Using Multi-Class
Learning and Multi-Object Tracking 48
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.2 Data Collection and Experimental Setup . . . . . . . . . . . . . . . . 51
5.2.1 Training data . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
5.2.2 Test data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.3 Proposed Method of Traffic Light Detection and Recognition . . . . . 58
5.3.1 Locating candidates based on color extraction . . . . . . . . . 60
5.3.2 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.3.2.1 PCANet . . . . . . . . . . . . . . . . . . . . . . . . . 65
5.3.2.2 Recognizing green traffic lights using PCANet . . . 66
5.3.2.3 Recognizing red traffic lights using PCANet . . . . . 69
5.3.3 Stabilizing the detection and recognition output . . . . . . . 69
5.3.3.1 The problem of frame-by-frame detection . . . . . . 69
5.3.3.2 Tracking and data association . . . . . . . . . . . . . 71
5.3.3.3 Forecasting . . . . . . . . . . . . . . . . . . . . . . . 72
5.3.3.4 Minimizing delays . . . . . . . . . . . . . . . . . . . 74
5.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.4.1 Detection and recognition . . . . . . . . . . . . . . . . . . . . 76
5.4.2 False positives evaluation . . . . . . . . . . . . . . . . . . . . . 78
5.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.5.1 Comparison with related work . . . . . . . . . . . . . . . . . 79
5.5.2 Limitation and plausibility . . . . . . . . . . . . . . . . . . . . 80
5.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6 Pedestrian Detection for Autonomous Vehicle Using Multi-spectral
Cameras 84
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
6.2 Data Collection and Experimental Setup . . . . . . . . . . . . . . . . 87
6.2.1 Data Collection Equipment . . . . . . . . . . . . . . . . . . . 87
6.2.2 Data Collection and Experimental Setup . . . . . . . . . . . . 90
6.3 Proposed Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.3.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
6.3.2 Trifocal tensor . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6.3.3 Sliding windows vs. region of interest . . . . . . . . . . . . . 93
6.3.4 Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6.3.5 Information fusion . . . . . . . . . . . . . . . . . . . . . . . . 99
6.3.6 Additional constraints . . . . . . . . . . . . . . . . . . . . . . 100
6.3.6.1 Disparity-size . . . . . . . . . . . . . . . . . . . . . . 100
6.3.6.2 Road horizon . . . . . . . . . . . . . . . . . . . . . . 100
6.4 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
7 End-to-End Learning for Lane Keeping of Self-Driving Cars 109
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.2 Implementation Details . . . . . . . . . . . . . . . . . . . . . . . . . . 111
7.2.1 Data pre-processing . . . . . . . . . . . . . . . . . . . . . . . . 111
7.2.2 CNN implementation details . . . . . . . . . . . . . . . . . . . 113
7.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
7.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
7.4.1 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
7.4.2 Data augmentation . . . . . . . . . . . . . . . . . . . . . . . . 122
7.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
8 Building an Autonomous Lane Keeping Simulator Using Real-World
Data and End-to-End Learning 124
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
8.2 Building a Simulator . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
8.2.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
8.2.2 Image projection . . . . . . . . . . . . . . . . . . . . . . . . . 130
8.2.3 Vehicle dynamics and vehicle trajectory tracking . . . . . . . . 134
8.2.4 CNN implementation . . . . . . . . . . . . . . . . . . . . . . . 142
8.3 Experiment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
8.3.1 Data collection . . . . . . . . . . . . . . . . . . . . . . . . . . 143
8.3.2 Data augmentation . . . . . . . . . . . . . . . . . . . . . . . . 147
8.3.3 Evaluation using simulator . . . . . . . . . . . . . . . . . . . . 147
8.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
8.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
9 Conclusions 154
Bibliography 157
List of Tables
3.1 HOG parameters in our system . . . . . . . . . . . . . . . . . . . . . 28
4.1 Evaluation result based on Rin/Rout for different p values . . . . . . . 45
4.2 Evaluation result: precision and recall . . . . . . . . . . . . . . . . . 46
5.1 Number of training samples of Green ROI-n and Red ROI-n . . . . . 58
5.2 Information of 23 test sequences . . . . . . . . . . . . . . . . . . . . . 59
5.3 Test result of 17 sequences that contain traffic lights . . . . . . . . . . 78
5.4 Number of false positives in traffic-light-free sequences . . . . . . . . 79
5.5 Results of several recent works on traffic lights detection . . . . . . . 81
8.1 Evaluation result using the simulator, with and without augmented data.149
List of Figures
2.1 Performance results from the Caltech Pedestrian Detection Benchmark. 17
3.1 Three stages in our proposed system. . . . . . . . . . . . . . . . . . . 24
3.2 48 classes of traffic signs can be detected and recognized in our system. 25
3.3 An example of color enhancement. . . . . . . . . . . . . . . . . . . . . 26
3.4 Selecting ROI from the original image. . . . . . . . . . . . . . . . . . 27
3.5 Grouping detected windows. . . . . . . . . . . . . . . . . . . . . . . . 29
3.6 Normal CUDA kernel launches. . . . . . . . . . . . . . . . . . . . . . 30
3.7 CUDA kernel launches using CUDA streams. . . . . . . . . . . . . . . 30
3.8 HOG computing time on CPU and GPU. . . . . . . . . . . . . . . . . 32
3.9 The total processing time when HOG is computed using OpenCV on
GPU. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.10 The total processing time when using our optimized GPU code. . . . 34
4.1 Applying traffic light detector on a candidate. . . . . . . . . . . . . . 40
4.2 Are they traffic lights or not? Dark background on the top and bright
background at the bottom. . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3 The left traffic light has bright background and the right traffic light
has dark background. . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
4.4 Rin/Rout values for true positive candidates (left) and true negative
candidates (right). Y-axis is from 0 to 2000. . . . . . . . . . . . . . . 44
4.5 Rin/Rout values for true positive candidates (left) and true negative
candidates (right). Y-axis is from 0 to 20. . . . . . . . . . . . . . . . 45
4.6 Both traffic lights are detected. . . . . . . . . . . . . . . . . . . . . . 47
5.1 Examples of 5 classes of Green ROI-1. . . . . . . . . . . . . . . . . . 53
5.2 Examples of 5 classes of Green ROI-3. . . . . . . . . . . . . . . . . . 54
5.3 Examples of 5 classes of Green ROI-4. . . . . . . . . . . . . . . . . . 55
5.4 Examples of 3 classes of Red ROI-1. . . . . . . . . . . . . . . . . . . . 56
5.5 Examples of 3 classes of Red ROI-3. . . . . . . . . . . . . . . . . . . . 57
5.6 Flowchart of the proposed method of traffic light detection and recog-
nition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.7 Color extraction, blob detection and closing operation. . . . . . . . . 62
5.8 A sample frame from our traffic light dataset. . . . . . . . . . . . . . 64
5.9 The structure of two-stage PCANet. . . . . . . . . . . . . . . . . . . 66
5.10 An arrow light in three consecutive frames. The middle one is vague
and looks similar to a circular light. A detector often fails on such a
vague frame. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
5.11 All traffic lights are detected and recognized correctly in the frame. . 77
6.1 Instrumentation setup with both thermal and stereo cameras mounted
on the roof of a vehicle. . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.2 Framework of the proposed pedestrian detection method. . . . . . . . 92
6.3 Proper alignment of color and thermal images using trifocal tensor. . 94
6.4 Examples of pedestrians in color and thermal images. . . . . . . . . . 96
6.5 The relationship between the mean disparity and the height of an object.101
6.6 Performance of different input data combinations, all using HOG fea-
tures. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.7 Performance improvement by adding disparity-size and road horizon
constraints. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.8 Performance of different input data combinations, all using CCF. . . 105
6.9 A pedestrian is embedded in the shadow of a color image. . . . . . . . 106
6.10 An example thermal image with two pedestrians. . . . . . . . . . . . 107
7.1 Comparison between the traditional approach and end-to-end learning. 111
7.2 An example of image frame from the dataset. . . . . . . . . . . . . . 112
7.3 Histogram of steering angles in training data. . . . . . . . . . . . . . 114
7.4 The proposed CNN architecture for deep learning. . . . . . . . . . . . 115
7.5 Histogram of error of predicted steering angles during test. . . . . . . 117
7.6 An example frame with the ground truth angle, predicted angle and
their respective projected path . . . . . . . . . . . . . . . . . . . . . . 118
7.7 Visualization of the results from first two convolutional layers. . . . . 119
7.8 An example of the disadvantage of frame by frame evaluation with 5
consecutive frames: the error in the middle frame is false . . . . . . . 121
8.1 Comparison between the traditional framework and end-to-end learning.126
8.2 The flowchart of test phase. . . . . . . . . . . . . . . . . . . . . . . . 131
8.3 The flowchart of training phase, using original data and augmented data.132
8.4 Example of original image and generated images given arbitrary camera
poses. (a) Original image. A checkerboard pattern on a flat surface.
(b) Generated image as if the camera is shifted left by 50 mm. (c)
Generated image as if the camera is rotated right by 15.25 degrees.
(d) Generated image as if the camera is shifted left by 50 mm and
rotated right by 15.25 degrees. . . . . . . . . . . . . . . . . . . . . . . 135
8.5 Camera calibration and ground surface estimation. (a) Selected points
in the image taken by the center camera. (b) Cameras and selected
points in the world coordinates. . . . . . . . . . . . . . . . . . . . . . 136
8.6 A virtual bicycle vehicle dynamics. . . . . . . . . . . . . . . . . . . . 138
8.7 Correction of vehicle’s position and orientation using vehicle trajectory
tracking. (a) Ground truth and predicted trajectory. (b) Ground truth
and predicted orientation. (c) Ground truth and predicted steering
wheel angle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
8.8 An example of cropped image frame from the dataset. . . . . . . . . . 143
8.9 The CNN structure used, slightly modified from NVIDIA’s PilotNet. 144
8.10 Our data collection system, including three forward facing cameras, a
USB hub, a laptop and access to OBD-II port. . . . . . . . . . . . . . 145
8.11 Example frames under different weather or lighting conditions. (a)
Cloudy. (b) Shadowed. (c) Foggy. (d) Sunny. . . . . . . . . . . . . . 146
8.12 Example of original image and augmented images given arbitrary vehi-
cle poses. (a) Original image. (b) Augmented image as if the vehicle is
shifted right by 0.5 m. (c) Augmented image as if the vehicle is rotated
left by 7 degrees. (d) Augmented image as if the vehicle is shifted right
by 0.5 m and rotated left by 7 degrees. . . . . . . . . . . . . . . . . . 148
8.13 An example of the simulation result, produced by the CNN trained
with data augmentation. (a) Overview of the trajectory in a test se-
quence. (b) Trajectory zoomed-in in the black rectangle in (a). (c)
Trajectory zoomed-in in the black rectangle in (b). . . . . . . . . . . 150
8.14 An example of failure. The vehicle is going out of lane to the right be-
cause another vehicle is changing lane, and lane markings are partially
blocked. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
8.15 An example of failure. The vehicle is going out of lane to the right
because of unclear lane markings. . . . . . . . . . . . . . . . . . . . . 152
List of Abbreviations
ACF Aggregated Channel Features
CCF Convolutional Channel Features
CNN Convolutional Neural Network
FN False Negatives
FOV Field of View
FP False Positives
FPPI False Positives Per Image
FPPW False Positives Per Window
FPS Frames per Second
GPU Graphics Processing Unit
HOG Histograms of Oriented Gradients
LKAS Lane Keeping Assist System
LQR Linear Quadratic Regulator
MOT Multi-object Tracking
MR Miss Rate
ODE Ordinary Differential Equation
PCA Principal Component Analysis
PCANet PCA network
RBF Radial Basis Function
ROI Region of Interest
SLAM Simultaneous Localization and Mapping
SMA Simple Moving Average
SVM Support Vector Machine
TP True Positives
Chapter 1
Introduction
In this chapter, we first introduce the background and discuss the motivations of
our work in Section 1.1. The major contributions of our work are summarized in
Section 1.2. Finally, the organization of this dissertation is presented in Section 1.3.
1.1 Motivations
Road safety is an important topic. Data from the Insurance Institute for
Highway Safety (IIHS) revealed that in 2012, red-light-running crashes
caused around 133,000 injuries and 683 deaths on US roads [1]. These injuries and
deaths may be reduced or avoided with the introduction of more advanced technolo-
gies. Many researchers are dedicated to the area of autonomous vehicles. Therefore,
we believe that this topic is meaningful and important.
Related to the topic of autonomous vehicles are cameras, which are common in
our daily lives and are much cheaper than some sensors, such as LiDAR. Moreover,
vision-based systems are intuitive, as humans use their eyes to understand
the surrounding environment. In addition, humans can easily interpret the informa-
tion obtained from images or videos, which makes building manually labeled datasets
easier. Therefore, we believe that the vision-based approach is reasonable. In addi-
tion to some public datasets, we design and deploy our own data collection system to
build our own datasets, especially when the public datasets are limited or not ideal.
Object detection and recognition are important for understanding a road scene.
Traffic signs, traffic lights, pedestrians, and many other objects on the road need to
be detected and recognized to guide drivers or autonomous driving systems. Our
projects witness the evolution of object detection and recognition in computer vision.
Initially, hand-crafted features (e.g., HOG) proved their effectiveness in detecting
objects with certain shapes or patterns. A classifier, such as SVM or AdaBoost,
is often used upon the extracted features. Image processing is often used as a pre-
processing or post-processing step, and certain assumptions are often made to improve
the detector’s performance. Later, researchers found a more generic way of detecting
objects, without using hand-crafted features. It is called two-stage training. The
first stage performs unsupervised training on all of the training data to determine
the best method for extracting features, and the second stage performs supervised
training to train the classifiers based on these features. After the two-stage training
approach, the one-stage approach became popular again, but with an end-to-end
learning Convolutional Neural Network (CNN) instead of hand-crafted features. The
CNN takes raw images as input and outputs the classified labels. As the CNN is
trained, it learns how to extract information from the raw images and how to classify
them. The training is single-stage and supervised, with no clear boundary between
the feature extractor and the classifier in the model. CNNs currently deliver
state-of-the-art performance in object detection and recognition.
Besides object detection and recognition, we are also motivated to look at the
lane keeping problem, which is an essential part of autonomous cars. A CNN is
again used: it takes raw image frames as input and outputs the steering
angles corresponding to the input frames, keeping the vehicle within the lane. This
is a regression problem instead of a classification problem. A simulator is then built
to provide augmented training data and a proper evaluation metric. Knowledge
of 3D geometry in computer vision, vehicle dynamics, and vehicle trajectory tracking
is also used in the simulator.
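To make the input-output relationship concrete, below is a minimal sketch of such a steering-angle regression CNN in Python with PyTorch. The layer sizes and the class name are illustrative assumptions, not the architecture used in later chapters.

```python
import torch
import torch.nn as nn

# A sketch of the end-to-end idea: a CNN maps a raw camera frame directly to
# one steering-angle value (regression). Layer sizes are assumed for
# illustration only.
class SteeringNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 24, 5, stride=2), nn.ReLU(),
            nn.Conv2d(24, 36, 5, stride=2), nn.ReLU(),
            nn.Conv2d(36, 48, 5, stride=2), nn.ReLU())
        self.head = nn.Sequential(
            nn.Flatten(), nn.LazyLinear(100), nn.ReLU(), nn.Linear(100, 1))

    def forward(self, x):                     # x: (N, 3, H, W) camera frames
        return self.head(self.features(x))    # (N, 1) predicted steering angles

# Trained with a regression loss, e.g. nn.MSELoss(), against human steering.
```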
1.2 Summary of Contributions
We design and implement a group of systems for autonomous vehicles. Our contri-
butions are listed as follows:
• Design and implement a traffic sign detection and recognition system.
Traffic sign detection and recognition are important functions for autonomous
vehicles. The detection process identifies the existence of traffic signs and their
locations in an image, and the recognition process identifies the types of the
detected signs. Our GPU-based traffic sign detection and recognition system is
able to detect and recognize 48 traffic signs. The implementation features three
stages: pre-processing, feature extraction and classification. A highly optimized
and parallelized version of HOG+SVM was used. The system can process 27.9
frames per second at a resolution of 1,628 × 1,236 active pixels, with minimal
loss of accuracy. In an evaluation using the BelgiumTS dataset, the experimental
results indicate that the detection rate is about 91.69% with false positives per
window of 3.39 × 10⁻⁵ and the recognition rate is about 93.77%.
We emphasize our contributions in the following aspects:
– Our system is able to detect and recognize 48 traffic signs, with a good
detection rate and recognition rate.
– We optimized and parallelized the computation of HOG on GPU, as well
as some pre-processing steps and the deployed SVM classifier.
– Our system achieves real-time performance on high-resolution images.
• Design and implement two traffic light detection and recognition systems.
Two traffic light detection and recognition systems are presented. The first
system detects and recognizes red circular lights only, using image processing
and SVM. Its performance is better than that of traditional detectors.
The second system detects and classifies different types of traffic lights, including
green and red lights in both circular and arrow forms. It combines computer
vision and machine learning techniques. Color extraction and blob detection
are used to locate the candidates, followed by the PCA network (PCANet)
classifiers. The PCANet classifier consists of a PCANet and a linear SVM. Our
experimental results suggest that the proposed method is highly effective for
detecting both green and red traffic lights.
We emphasize our contributions in the following aspects:
– For the first system, we demonstrate that detection using a fixed threshold
ratio is not very effective and the SVM-based classification has much better
performance.
– For the first system, we empirically add more parameters of a candidate
to the SVM input, and this can achieve better performance.
– For the first system, we build a traffic light dataset from the original videos
captured while driving on the streets.
– For the second system, we demonstrate that combining image processing
and PCANet can help with detecting and recognizing various types of traffic
lights, including green and red lights in both circular and arrow forms.
– For the second system, an online multi-object tracking technique is applied
to overcome occasional misses, and a forecasting method is used to filter
out false positives.
– For the second system, several additional optimization techniques are em-
ployed to improve the detector performance and to handle the traffic light
transitions.
– For the second system, we build our own dataset of traffic lights from
recorded driving videos, including circular lights and arrow lights in various
directions.
• Design and implement a pedestrian detection system.
Pedestrian detection is a critical feature for self-driving cars or advanced driver
assistance systems. Our system consists of a thermal camera and a color stereo
camera. Data received from multiple cameras are aligned using trifocal tensor
based on pre-calibrated parameters. In addition, candidates are generated us-
ing sliding windows at multiple scales. A reconfigurable detector framework is
proposed, in which feature extraction and classification are two separate stages.
The input to the detector can be the color image, disparity map, thermal data,
or any combination of these. When convolutional channel features are used,
feature extraction uses the first three convolutional layers of a pre-trained con-
volutional neural network cascaded with an AdaBoost classifier. The evaluation
results indicate that it significantly outperforms the traditional histogram of ori-
ented gradients features. When combining the color and thermal images, the
proposed detector can achieve a 9% log-average miss rate.
We emphasize our contributions in the following aspects:
– We design and assemble a multi-spectral camera system mounted on a
vehicle to collect data for pedestrian detection.
– We build a dataset for multi-spectral pedestrian detection from on-road
driving data. These data contain many complex scenarios that are chal-
lenging for detection and classification.
– We propose a machine learning based algorithm for pedestrian detection
by combining stereo vision and thermal images. The evaluation results
show satisfactory performance.
– An experimental dataset is built by labeling the data collected when driv-
ing on the city roads.
• Design and implement a lane keeping system.
We present an end-to-end learning approach for obtaining the proper steering
angle to maintain the car in the lane. The CNN model uses raw image frames
as input and outputs the steering angles accordingly. The model is trained
and evaluated using the comma.ai dataset, which contains the front view image
frames and the steering angle data captured when driving on the road. Unlike
the traditional approach, which manually decomposes the autonomous driving
problem into technical components such as lane detection, path planning and
steering control, the end-to-end model can directly steer the vehicle from the
front view camera data after training. It learns how to keep the car in the lane
from human driving data. Further discussion of this end-to-end approach and
its limitation are also provided.
We emphasize our contributions in the following aspects:
– We present a working system for lane keeping using the end-to-end learning
approach.
– We provide the evaluation results and discussion of this system. The need
for building a simulator is discussed.
• Design and implement a simulator for the lane keeping system.
In addition to the state-of-the-art end-to-end learning method that predicts
the steering wheel angle for the purpose of staying in the lane, a simulator is
built using image projection, vehicle dynamics and vehicle trajectory tracking,
which can be helpful in both training and evaluation. The simulation results
demonstrate the effectiveness and accuracy of the end-to-end learning method
and the benefits of using the simulator.
We emphasize our contributions in the following aspects:
– We describe the implementation details of building a simulator for vision-
based autonomous lane keeping. Although many recent works exist on lane
keeping algorithms, comparing and evaluating them are difficult. Built on
real-world data, this simulator employs image projection, vehicle dynam-
ics modeling, and vehicle trajectory tracking to predict vehicle movement
and its corresponding camera views. The simulator can be used for both
training and the evaluation of lane keeping algorithms.
– The end-to-end learning approach produces the proper steering angle
from camera image data, aimed at maintaining the self-driving vehicle in
a lane. A highly effective end-to-end learning system is demonstrated
using the aforementioned simulator for both training and evaluation. The
CNN model trained with augmented data from the simulator performs
significantly better than the model trained with recorded data only.
– We build a dataset for autonomous vehicle research. The dataset contains
recorded video frames from three forward facing cameras (left, center, and
right) as well as a steering wheel angle and vehicle speed information.
1.3 Outline
This dissertation is organized as follows.
Chapter 2 summarizes the background of autonomous vehicles, especially com-
puter vision and machine learning techniques related to this dissertation.
Chapter 3 presents a GPU-based system for real-time traffic sign detection and
recognition that can classify 48 traffic signs included in the library.
Chapter 4 presents a method for the automatic detection of circular red traffic
lights that integrates both image processing and support vector machine techniques.
Chapter 5 presents a novel approach that combines computer vision and machine
learning techniques for the accurate detection and classification of different types of
traffic lights, including green and red lights in both circular and arrow forms.
Chapter 6 presents a novel instrument for pedestrian detection by combining a
thermal camera with a color stereo camera.
Chapter 7 presents an end-to-end learning approach for obtaining the proper steer-
8
ing angle to maintain the car in the lane.
Chapter 8 presents the implementation of a simulator for the lane keeping system,
using image projection, vehicle dynamics and vehicle trajectory tracking, which can
be helpful for both training and evaluation.
Chapter 9 draws the conclusions.
Chapter 2
Background
Carnegie Mellon University completed the first project involving autonomous vehicles
in the US in 1995, which included autonomous driving from Pittsburgh, PA, to San
Diego, CA. The vehicle was equipped with a computer, a camera, and a GPS. In
2004, the U.S. Defense Advanced Research Projects Agency (DARPA) started a
competition for autonomous vehicles, but none of the teams completed the 150-mile
course. In 2005, five teams completed the DARPA challenge, and Stanford Univer-
sity’s autonomous car called Stanley took first place. In 2007, the DARPA challenge
involved a 60-mile course in an urban environment, and Carnegie Mellon Univer-
sity's autonomous car called Boss took first place. In 2016, Stanford University's au-
tonomous car called Shelley ran on the track at a speed of nearly 120 mph. Nowadays,
many vehicle manufacturers are developing their own autonomous vehicles, including
Ford, Mercedes Benz, Volkswagen, Audi, and BMW. In addition, many IT companies
have also joined this area, including Google, Uber, NVIDIA, and Tesla. For example,
Google started a self-driving car project in 2009, which is now called Waymo.
Waymo claims to drive more than 25,000 autonomous miles each week, mostly
on complex city streets. In short, autonomous vehicles are being developed
rapidly, including both their hardware and their software.
This dissertation focuses on computer vision and machine learning techniques
used in this field, such as the detection and recognition of traffic sign, traffic light and
pedestrian, as well as lane keeping for self-driving cars. Many other topics not covered
in this dissertation are also important, such as pixel level segmentation, 3D recon-
struction, motion estimation, and Simultaneous Localization and Mapping (SLAM).
2.1 Datasets
Machine learning techniques rely heavily on data. Datasets are often built using real-
world data, with manually labeled ground truth. For example, the KITTI dataset
[2–5] uses the autonomous driving platform of Annieway to capture data from the real
world. The sensors mounted on the car are cameras, a 360° Velodyne laser scanner, and
a GPS. The data are manually processed and divided into several subsets, such
as stereo, flow, object, tracking, and road. Furthermore, many datasets are built
for specific tasks. For example, the Belgium Traffic Sign Dataset [6]
and German Traffic Sign Benchmark [7] aim for detecting and recognizing a group
of European traffic signs in images. The Traffic Lights Recognition (TLR) public
benchmarks [8] are for the detection of green or red circular traffic lights in images.
The INRIA person dataset [9] and the Caltech Pedestrian Detection Benchmark [10]
are for the detection of upright persons in images. The comma.ai dataset contains
images captured from a forward facing camera, as well as the vehicle status such as
the speed, gear, and steering. It is used for end-to-end learning for the functionality
of lane keeping.
The datasets built from real-world data are extremely useful for researchers. How-
ever, collecting and labeling these data is tedious and time consuming, and the in-
formation obtained is limited to the types of sensors used. Therefore, real-world
datasets often have limited amounts of data and focus on certain functionalities. On
the other hand, some datasets are built using simulators or game engines, and they
can provide much more information with little human effort. For example, a dataset
generated from a computer game has been proposed for road scene segmentation [11].
The researchers claim that generating the annotation takes seven seconds per im-
age on average, whereas a human annotator takes 90 minutes per image. In such
datasets, the rich information of the 3D scene and object movements is helpful to
researchers, and these data can be generated easily. However, whether the models
trained on virtual data can be applied in the real world is questionable, as the im-
ages from game engines and the real world have inherent differences. Nevertheless,
these virtual datasets provide solid alternatives for researchers to try out their new
algorithms.
An increasing number of datasets are becoming available as researchers keep col-
lecting data and building their own datasets. Using the existing datasets reduces the
time and effort needed to verify an algorithm, as collecting and labeling data are
very time consuming. It also makes it easier to compare one’s work with the existing
work of other researchers who use the same dataset [12, 13], because works done on
different datasets cannot be compared directly. However, sometimes researchers must
collect their own data, if the existing datasets are not ideal or are not available. In
addition, the newly built datasets can benefit other researchers.
2.2 Object detection and recognition
Object detection and recognition are important aspects of autonomous vehicles. This
dissertation focuses on the detection and recognition of traffic signs, traffic lights and
pedestrians. In addition, many other objects not covered in this dissertation can also
be detected and recognized to guide drivers or autonomous driving systems, such as
vehicles, road markings, and traffic cones.
2.2.1 Traffic sign
There are several existing works focused on detecting and recognizing a particular
class of traffic signs, such as stop signs or speed limit signs [14, 15]. The designs were
optimized and can be highly efficient for detecting and recognizing a specific class
of signs, but they were hardly useful for other types of signs. Other research papers
attempted to detect and recognize multiple signs and used common features such
as shapes and colors [6,16,17]. Advanced image processing algorithms were proposed
and analyzed thoroughly in order to obtain accurate results. However, the previous
works primarily focused on the algorithms, and computing time was less of a concern,
which prevents those designs from being practically useful. Some other works
investigate the trade-off between accuracy and computing time [18–20].
Many of them claimed to achieve real-time performance at a high accuracy, but the
datasets that they used varied. Without using the same dataset, it is unfair
to compare the accuracy among different designs. It is also worth mentioning that
the image resolution is another important factor that can affect the processing time
as well as accuracy. A higher-resolution image can reveal small objects. As a
result, traffic signs can be detected and recognized even when they are far away,
thus leaving more time for drivers to respond.
2.2.2 Traffic light
Spot light detection [21, 22] is a method based on the fact that a traffic light is
much brighter than the lamp holder, which is usually black. A morphological top-hat
operator was used to extract the bright areas from grayscale images, followed by a
number of filtering and validating steps. In [23], an interactive multiple-model filter
was used in conjunction with the spot light detection. More information was used
to improve its performance, such as status switching probability, estimated position
and size. The fast radial symmetry transform is a fast variation of the circular Hough
transform, which can be used to detect circular traffic lights as demonstrated in [24].
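As an illustration of the top-hat idea, a minimal sketch using OpenCV in Python follows; the kernel size, threshold, and function name are assumed values, and a real detector would add the filtering and validation steps described above.

```python
import cv2

def spotlight_candidates(gray, ksize=15, thresh=40):
    """Bright-spot extraction via a morphological top-hat (illustrative only).

    gray: grayscale frame; ksize and thresh are assumed, untuned values.
    """
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (ksize, ksize))
    tophat = cv2.morphologyEx(gray, cv2.MORPH_TOPHAT, kernel)  # image - opening
    _, mask = cv2.threshold(tophat, thresh, 255, cv2.THRESH_BINARY)
    return mask  # binary map of bright spots, e.g. lit traffic lamps
```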
Several other methods also combined the vehicle GPS information. A geometry-
based filtering method was proposed to detect traffic lights using mobile devices at
low computational cost [25]. The GPS coordinates of all traffic lights were presumably
available, and a camera projection model was used. Mapping traffic light locations
was introduced in [26] by using tracking, back-projection and triangulation. Google
also presented a mapping and detection method in [27] which was capable of recog-
nizing different types of traffic lights. It predicted when traffic lights should become
visible with the help of GPS data, followed by classifying possible candidates. Geo-
metric constraints and temporal filtering were then applied during the detection. The
inter-frame information was also helpful for detecting traffic lights. A method that
used a Hidden Markov Model to improve the accuracy and stability of the results was
demonstrated in [28]. The state transition probability of traffic lights was considered,
and information from several previous frames was used. Reference [29] introduced a
traffic light detector based on template matching. The assumption was that the two
off lamps in the traffic light holder are similar to each other and neither of them looks
similar to the surrounding background.
Deep learning [30, 31] is a class of machine learning algorithms that has many
layers to extract hidden features. Unlike hand-crafted features such as Histograms of
Oriented Gradients (HOG) features [9], it learns features from training data. PCANet
is a simple yet effective deep learning network proposed in [32]. Principal Component
Analysis (PCA) is employed to learn the filter banks. It can be used to extract features
of faces, hand written digits and object images. It has been tested on several datasets
and delivers surprisingly good results [32]. Using PCANet in traffic light detection or
other similar applications has not been researched thus far.
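To make the PCA filter-bank idea concrete, here is a minimal sketch of learning one PCANet stage from vectorized image patches with NumPy; the patch size, filter count, and function name are assumptions for illustration.

```python
import numpy as np

def learn_pca_filters(patches, num_filters=8, k=7):
    """Learn one PCANet stage's filter bank (a sketch of the idea in [32]).

    patches: (N, k*k) array of vectorized k x k patches from training images.
    Returns num_filters filters of shape (k, k).
    """
    X = patches - patches.mean(axis=1, keepdims=True)  # remove each patch's mean
    cov = X.T @ X / X.shape[0]                         # patch covariance matrix
    _, eigvecs = np.linalg.eigh(cov)                   # eigenvalues ascending
    top = eigvecs[:, ::-1][:, :num_filters]            # leading principal components
    return top.T.reshape(num_filters, k, k)            # one k x k filter each
```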
Integration of detection and tracking has been used in a few works related to
autonomous vehicles. The trajectory of the traffic light was used to validate the
theoretical result in [23]. A Kalman filter was employed to predict the traffic sign
positions. It was claimed that the tracking algorithm improved the overall system
reliability [33, 34].
Utilizing accumulated classifier decisions from a tracked speed limit sign, a majority
voting scheme was proven to be very robust against accidental mis-classifications [14].
2.2.3 Pedestrian
The Caltech Pedestrian Detection Benchmark [10] has been widely used by re-
searchers. It contains frames from a single vision camera with pedestrians annotated.
Based on the CVPR2015 snapshot of the results on the Caltech-USA pedestrian
benchmark, it was stated in [35] that at ~95% recall, the state-of-the-art detectors
made ten times more errors than the human-eye baseline, which is still a huge gap that
calls for research attention. Figure 2.1(a) shows some top quality detection meth-
ods presented in [36]. Overall, the detector performance has been improved as new
methods were introduced in recent years. Traditional methods such as Viola–Jones
(VJ) [37] and Histogram of Oriented Gradients (HOG) [9] were often included as the
baseline. A total of 44 methods were listed in [38] for the Caltech-USA dataset, and 30 of
them made use of HOG or HOG-like features. Channel features [39] and Convo-
lutional Neural Networks [40–42] also achieved impressive performance on pedestrian
detection. The Convolutional Channel Features (CCF) [43], which combines a boost-
ing forest model and low-level features from a CNN, is one of the top performers listed in
the Caltech Pedestrian Detection Benchmark, as shown in Figure 2.1(b). Despite the
progressive improvement of detection results on the datasets, color cameras still have
many limitations. For instance, color cameras are sensitive to lighting conditions.
Most of these detection methods may fail if the image quality is impaired under poor
lighting conditions.
Thermal cameras can be employed to overcome some limitations of color cameras,
because they are not affected by lighting conditions. Several research works using ther-
mal data for pedestrian detection and tracking were summarized in [44]. Background
subtraction was applied in [45] for people detection, since the camera was static. HOG
features and Support Vector Machine (SVM) were employed for classification [46]. A
two-layered representation was described in [47], where the still background layer and
the moving foreground layer were separated. The shape cue and appearance cue were
used to detect and locate pedestrians. In [48], a window based screening procedure
was proposed for potential candidate selections. The Contour Saliency Map (CSM)
was used to represent the edges of a pedestrian, followed by AdaBoost classification
with adaptive filters. Assuming the region occupied by a pedestrian has a hot spot,
candidates were selected based on thermal intensity values [49] and then classified
by an SVM. In addition, both Kalman filter prediction and mean shift tracking were
(a) Benchmark results of different methods as reported in [36].
(b) Benchmark results of different methods as of May 2016.
Figure 2.1: Performance results from the Caltech Pedestrian Detection Benchmark.
incorporated for further improvement. A new contrast invariant descriptor [50] was
introduced for far infrared images, which outperformed HOG features by 7% at 10−4
FPPW for people detection. The Shape Context Descriptor (SCD) was also used for
pedestrian detection in [51], followed by AdaBoost classifier. The HOG features were
considered not suitable for this task because of the small size of the target, variations
of pixel intensities and lack of texture information. Probabilistic models for pedes-
trian detection in far infrared images were presented in [52]. The method in [53] found
the head regions at the initial stage, then confirmed the detection of a pedestrian by
the histograms of Sobel edges in the region.
For ADAS applications, several pedestrian detection research works were summa-
rized in [54], including the use of color cameras and thermal cameras, as well as sensor
fusion such as radar and stereo vision cameras. A benchmark for multispectral pedes-
trian detection was presented in [55] and several methods were analyzed. However,
the color-thermal pairs were manually annotated and it is unclear if any automatic
point registration algorithms were used. The combination of stereo vision cameras
and a thermal camera was used in [56]. Trifocal tensor was used to align the thermal
image with color and disparity images. Candidates were selected based on disparity,
and HOG features were extracted from color, thermal and disparity images. Con-
catenated HOG features were then fed to a radial basis function (RBF) SVM classifier
to obtain the final decision. Furthermore, more sophisticated applications or systems
can be built upon pedestrian detection, such as pedestrian tracking across multiple
driving recorders [57] and crowd movement analysis [58].
2.3 Lane keeping
Maintaining the vehicle within the lane is important for driving safety. The lane keeping
assist system (LKAS) has been studied by many researchers. Lane
keeping assist systems [59–62] are able to provide torque to maintain the vehicle
within the lane, and often alert the driver with warning messages or sound. Cameras
are usually used in the system, and lane markings must be recognized. In addition,
the systems also distinguish intended and unintended lane departure, by utilizing
more information such as blinker state, braking or steering angle.
The LKAS needs to be accurate and robust for autonomous cars. Although
industrial companies have achieved a lot in this area, they seldom publicize their
technologies. It is necessary for researchers to study the theories, algorithms and
implementations of the LKAS. Deep reinforcement learning [63] was used in several
research works on autonomous driving [64–66]. The systems learned the optimal pol-
icy function given the feedback of the reward. These systems went beyond the basic
lane keeping feature, and were able to direct the vehicle to stay on a path and avoid
collisions. The vehicle did not necessarily have to stay in a lane, and other vehicles
on the road were often involved. The learning and evaluation were often done in a virtual simula-
tor, because the learning requires rich ground truth information and needs to interact
with the environment. Inverse reinforcement learning [67], on the other hand, was
used to estimate the reward from the expert demonstrations.
For real world systems, sensors and algorithms are employed to interpret the
surrounding environment, without having the rich ground truth information in the
simulator. The vision-based approaches use cameras because they are cost effective.
An early research work demonstrated an autonomous vehicle, ALVINN [68], using a neu-
ral network to find the proper direction. The input data came from a camera and
a laser range finder, and the input resolution was very small. For large resolution
color images, an end-to-end learning approach using convolutional neural network
was demonstrated in [69]. The system was designed for off-road mobile robots, not
for autonomous vehicles on the road. An end-to-end learning approach using a convolu-
tional neural network for self-driving cars was demonstrated in [70], and the network
was trained and evaluated with the help of a simulator. The idea of building the
simulator using image projection and vehicle dynamics was described, but few
technical details were provided. The network was later named PilotNet, and its effective-
ness was validated and visualized in [71, 72]. Our previous work [73] followed this
approach using a different dataset and network, and demonstrated the necessity of
building the simulator in both the training and evaluation stages.
Building the simulator requires the knowledge of computer vision, vehicle dynam-
ics and vehicle trajectory tracking. Most autonomous vehicle driving frameworks
present a consistent decoupling between low-level control and path planning, while
constraining the dynamics of the system to satisfy the vehicle’s motion. Typically
nominal path is obtained by optimization-based methods [74], sampling-based ap-
proaches [75] and notable searching algorithms [76]. In terms of the system dynamics
and control, Rami et al. [77] proposed a linear system dynamics and the control for
high speed drifting. Galceran et al. [78] adopted proportional-derivative (PD) feed-
back controller for torque-based steering. Approximating the non-linearity of the
vehicle dynamics, DeSantis et al. [79] Jacobian linearized the vehicle dynamics for
designing a path-tracking controller, but this approximation ignored the high order
of the polynomial of the system dynamics, which led to a potential problem in the
controlling a vehicle when the error is large.
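As a concrete example of the vehicle dynamics involved, the sketch below advances a kinematic bicycle model by one Euler step; the wheelbase and time step are assumed values, and Chapter 8 describes the model actually used in our simulator.

```python
import numpy as np

def bicycle_step(x, y, theta, v, delta, L=2.7, dt=0.05):
    """One Euler step of a kinematic bicycle model (illustrative values).

    (x, y, theta): vehicle pose; v: speed (m/s); delta: front steering angle
    (rad); L: wheelbase in meters (assumed); dt: time step in seconds.
    """
    x += v * np.cos(theta) * dt            # position update along heading
    y += v * np.sin(theta) * dt
    theta += v / L * np.tan(delta) * dt    # heading update from steering angle
    return x, y, theta
```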
Chapter 3
A GPU-Based Real-Time Traffic
Sign Detection and Recognition
System
This chapter presents a GPU-based system for real-time traffic sign detection and
recognition which can classify 48 different traffic signs included in the library. The
proposed design implementation has three stages: pre-processing, feature extraction
and classification. For high-speed processing, we propose a window-based histogram
of gradient algorithm that is highly optimized for parallel processing on a GPU. For
detecting signs of various sizes, the processing is applied at 32 scale levels. For
more accurate recognition, multiple levels of support vector machines are employed
to classify the traffic signs. The proposed system can process video at 27.9 frames per
second with active pixels of 1,628 × 1,236 resolution. In an evaluation using the BelgiumTS
dataset, the experimental results show that the detection rate is about 91.69% with false
positives per window of 3.39 × 10⁻⁵ and the recognition rate is about 93.77%.
3.1 Introduction
Traffic sign detection and recognition are important functions in an Advanced Driver
Assistance System (ADAS). The detection process determines two things: the
existence of traffic signs in an image and their locations. Accurately detecting the
signs also improves the recognition rate by filtering out redundant information while
retaining the useful information on an image. Recognition identifies the signs from
the detection result. In the real world, knowing the content of the sign is much more
important than simply knowing the existence of a sign. Many existing works have
been carried out to improve the accuracy of detection and recognition. In practice,
processing time and hardware efficiency also need to be considered.
A traffic sign detection and recognition system often contains three stages: pre-
processing, detection and recognition. The pre-processing stage is optional, but it is
usually included in a real-time system. It identifies and selects the regions of interest
in the original image frame, which often contains a large number of pixels. Effectively, it
reduces the computational tasks and improves the efficiency of the subsequent stages.
The second stage detects and locates traffic signs in the selected regions produced
by the pre-processing stage. In some systems, the detection stage also identifies the
categories of the signs based on shapes, such as round, rectangle, triangle, etc. These
categories are called super-classes. The final stage recognizes the detected signs and
sends the processing results (i.e., the types of signs and their locations) to the display
and control units of an ADAS system.
Typically, feature extraction and pattern classification algorithms are computa-
tionally intensive. Much research has been done to optimize the algorithms themselves
to improve the accuracy, but very little research has been focused on the implemen-
tation to improve the efficiency. In this chapter, we propose to utilize the many-core
architecture in a GPU to accelerate the traffic sign detection and recognition algo-
rithms through massive parallel processing. The objective is to reduce the computing
time considerably such that the GPU implementation can detect and recognize traffic
signs in real-time.
3.2 Traffic Sign Detection and Recognition System
3.2.1 System Overview
The proposed system contains three main stages: pre-processing, detection and recog-
nition, as shown in Fig. 3.1. First, we perform red and blue color extraction
and select the regions of interest (ROI). Next, Histograms of Oriented
Gradients (HOG) [80] features are extracted on the grayscale image and a sliding
window searches the image exhaustively to find the candidates using a linear Support
Vector Machine (SVM). Color-based HOG detectors are then performed on these
candidates to eliminate false positives, followed by a rectangle grouping operation
to locate the detected traffic signs. Finally, the detected signs are delivered to a
cascade classifier which contains several linear SVMs. The recognized traffic sign is
highlighted with a green rectangle on the image. Furthermore, a standard image
of the identified class of traffic sign, scaled to the same size, is placed next to the
rectangle, which is used to indicate the actual position and class of the sign. For the
proposed system, the BelgiumTS dataset is employed for both training and testing.
Our system is able to detect and recognize 48 classes of traffic signs selected from the
BelgiumTS Dataset [81], as shown in Fig. 3.2. These signs have aspect ratio 1:1 with
red or blue colors on them.
Figure 3.1: Three stages in our proposed system.
Although HOG and SVM have been commonly used in detecting and recognizing
objects, it is still challenging to find a good balance between accuracy and efficiency.
In order to reduce the computing latency, we employ linear kernel SVMs in our
implementation. In order to obtain better accuracy, we use multiple HOG features
and SVMs in our system, as shown in Fig. 3.1.
3.2.2 Pre-processing
Color and shape information are commonly used as features of traffic signs. Although
road images often contain objects whose color and shape are similar to those of
traffic signs, using such information to identify the ROI is still a simple yet
effective approach. We perform color extraction using an adaptive threshold method
proposed by [82].

Figure 3.2: 48 classes of traffic signs can be detected and recognized in our system.

By using red color enhancement, we obtain an image whose pixel value fR is computed as
fR = max(0, min(xR − xG, xR − xB) / s)    (3.1)

s = xR + xG + xB    (3.2)
where xR, xG and xB are the pixel values of the red, green and blue channels, respectively. The
global threshold is then set to µ + 4 · σ, where µ is the mean and σ is the standard
deviation of the red values of the original image pixels. Applying this threshold to
the image results in a binary image IR which is used in the following processing steps.
We also perform blue color enhancement and thresholding using the same method
and obtain a blue color enhanced binary image IB. Fig. 3.3 shows an example of
blue color enhancement.
Figure 3.3: An example of color enhancement.
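As a concrete illustration of this pre-processing step, the following is a minimal sketch of the red enhancement and thresholding in (3.1) and (3.2), assuming OpenCV; the function name is illustrative, and computing the threshold statistics over the enhanced values (rather than the raw red channel) is our reading of the method.

```cpp
// Minimal sketch of red color enhancement (Eqs. 3.1, 3.2) followed by the
// adaptive global threshold mu + 4*sigma; OpenCV assumed, names illustrative.
#include <opencv2/opencv.hpp>

cv::Mat redEnhanceAndThreshold(const cv::Mat& bgr) {
    cv::Mat ch[3];
    cv::split(bgr, ch);                         // OpenCV channel order: B, G, R
    cv::Mat xB, xG, xR;
    ch[0].convertTo(xB, CV_32F);
    ch[1].convertTo(xG, CV_32F);
    ch[2].convertTo(xR, CV_32F);

    cv::Mat s = xR + xG + xB + 1e-6f;           // Eq. (3.2), guard divide-by-zero
    cv::Mat d1 = xR - xG, d2 = xR - xB;
    cv::Mat num = cv::min(d1, d2);
    cv::Mat ratio = num / s;
    cv::Mat fR = cv::max(ratio, 0.0);           // Eq. (3.1)

    cv::Scalar mu, sigma;
    cv::meanStdDev(fR, mu, sigma);              // statistics of the enhanced values
    double thresh = mu[0] + 4.0 * sigma[0];     // global threshold mu + 4*sigma

    cv::Mat IR;
    cv::threshold(fR, IR, thresh, 255, cv::THRESH_BINARY);
    IR.convertTo(IR, CV_8U);                    // binary image I_R
    return IR;
}
```

The same routine applies to the blue channel by swapping the roles of xR and xB.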
Next, we find contours in the binary images using the algorithm in [83] and then
place a bounding box for the contours of each object. Small rectangles whose width
or height is less than 32 pixels are ignored to minimize the interference of small
objects and color fragments in the image. Bounding boxes that have similar sizes and
locations are combined to avoid overlapping. Fig. 3.4 shows an ROI selected from
the original image after pre-processing.
3.2.3 Traffic Sign Detection
In many cases, the selected ROI from the pre-processing stage contains no traffic
sign. In order to provide valid inputs to the classification stage, traffic signs must
first be detected accurately. False positives need to be eliminated as much as possible.
Applying the HOG method, we compute the HOG features on ROI at different scales
and then use a sliding window to search the entire ROI to find traffic signs.
Figure 3.4: Selecting ROI from the original image.
The HOG features can be computed from an RGB image or a grayscale image. For
an RGB image, horizontal and vertical gradients are computed in the three channels for
red, green and blue respectively. Only those that have the maximal magnitude compared
with the other two channels are selected for HOG processing. Thus the computational
workload is three times that of a grayscale image. We first convert the original RGB
image to a grayscale image IGRAY and use it to compute the HOG features that are
fed to a linear SVM to determine if there are traffic signs in the image. Although most
of the existing work also applied HOG to a grayscale image for traffic sign detection,
this approach has a very high false positive rate. In order to reduce the false positive
rate, our system also extracts the HOG features from the red image IR and the blue
image IB, but only on the frames where the detection on IGRAY is positive. Two
more SVMs are trained for the red and blue images respectively to eliminate some false
positive frames. In addition, these SVMs also classify the detected traffic signs into
several super-classes, such as red circle, red triangle, blue circle, etc.

Table 3.1: HOG parameters in our system

Parameter         Value
Window size       32 by 32 pixels
Block size        8 by 8 pixels
Cell size         8 by 8 pixels
Window stride     8 by 8 pixels
Block stride      8 by 8 pixels
Scaling factor    1.1
Levels            32
The HOG parameters in our system are shown in Table 3.1. The window size is
fixed, but the size of the traffic sign in an image is unknown. Thus the original image
has to be scaled to many different levels, and HOG feature extraction and classification
are then performed at each level. The size of the image at each level, Sl, is computed as
Sl = S0/fl (3.3)
where S0 is the original image size, l is the level number and fl is the level scaling
factor defined as
fl = 1.1^(l−1)    (3.4)
In our design, 32 scaling levels are applied. Thus our system is able to detect traffic
signs sized from 32 by 32 pixels up to 614 by 614 pixels.
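To make the pyramid arithmetic in (3.3) and (3.4) concrete, the short self-contained snippet below prints the sign size in the original image that the fixed 32 by 32 window covers at each level; at level 32 this works out to roughly 614 pixels.

```cpp
// Self-contained illustration of Eqs. (3.3) and (3.4): the effective sign size
// covered by the fixed 32 x 32 window at each of the 32 pyramid levels.
#include <cstdio>
#include <cmath>

int main() {
    const double f = 1.1;                          // level scaling factor
    const int levels = 32;                         // number of pyramid levels
    const int win = 32;                            // window size in pixels
    for (int l = 1; l <= levels; ++l) {
        double fl = std::pow(f, l - 1);            // f_l = 1.1^(l-1)
        // A 32 x 32 window at level l corresponds to win * f_l pixels in the
        // original image; at l = 32 this is about 614 pixels.
        std::printf("level %2d: covers %.0f x %.0f pixels\n", l, win * fl, win * fl);
    }
    return 0;
}
```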
As shown in the central figure of Fig. 3.5, the same traffic sign is detected by
multiple windows at different positions and also at different scale levels. To avoid
overlapping, we perform a grouping operation that combines these detected traffic
signs at the same location into a single box as shown in Fig. 3.5.
Figure 3.5: Grouping detected windows.
3.2.4 Traffic Sign Recognition
The final step of our design is traffic sign recognition. The SVM method is applied
to classify the detected traffic signs according to the 48 classes listed in Fig. 3.2.
Each of the final detected windows is classified by the SVMs mentioned in 3.2.3 to
confirm its category. Once its category is determined, it is classified
by a multi-class SVM in that category. SVMs are trained using k-fold cross-validation
to improve the accuracy. It is also worth mentioning that we use the BelgiumTSC
dataset to train the SVMs to classify different classes of traffic signs in each category.
3.3 Parallelism on GPU
Since pre-processing and HOG algorithms are complex and require extensive compu-
tations, in this section we describe the GPU-based acceleration. Pre-processing
is a typical point operation, which is well suited for GPU implementation. The HOG
computation is more complicated and we develop several special techniques to handle
it.
There exists a GPU version of HOG in the OpenCV library, which accelerates the
computation significantly compared to the CPU version. However, we find that
there is still room to improve its efficiency. As mentioned in 3.2.3, the HOG
features need to be computed in many different scaling levels of the original image,
and gaps between levels can be reduced or eliminated. Once the input data of each
level is prepared, there is no data dependency during HOG computation between
different levels. In the OpenCV implementation, each level stalls until the computation
of the previous level is done to ensure data synchronization between kernels, as shown in
Fig. 3.6. Such stalls are unnecessary and can be avoided by using CUDA streams.
As illustrated in Fig. 3.7, kernels can run in multiple CUDA streams at the same
time and can be synchronized in a certain stream without affecting others. By using
CUDA streams, we reduce the gaps between levels significantly and thus improve the
efficiency of HOG computation.
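The idea can be sketched with a hypothetical per-level kernel computeHogLevel() and a small pool of streams; this is only an illustration of the launch pattern, not the actual OpenCV kernel code, and the device buffer setup is assumed to be done elsewhere.

```cpp
// A sketch of launching per-level HOG kernels on round-robin CUDA streams so
// that independent levels can overlap; computeHogLevel() is a hypothetical
// kernel standing in for the per-level block-histogram computation.
#include <cuda_runtime.h>

__global__ void computeHogLevel(const float* img, float* hist, int w, int h) {
    // ... per-level gradient and block histogram computation (omitted) ...
}

void launchAllLevels(float** dImgs, float** dHists, const int* ws, const int* hs,
                     int levels) {
    const int kStreams = 4;                        // small stream pool
    cudaStream_t streams[kStreams];
    for (int i = 0; i < kStreams; ++i) cudaStreamCreate(&streams[i]);

    for (int l = 0; l < levels; ++l) {
        cudaStream_t s = streams[l % kStreams];    // round-robin assignment
        dim3 block(16, 16);
        dim3 grid((ws[l] + 15) / 16, (hs[l] + 15) / 16);
        // No data dependency between levels, so kernels queued on different
        // streams may execute concurrently instead of stalling level by level.
        computeHogLevel<<<grid, block, 0, s>>>(dImgs[l], dHists[l], ws[l], hs[l]);
    }
    for (int i = 0; i < kStreams; ++i) {
        cudaStreamSynchronize(streams[i]);         // per-stream synchronization
        cudaStreamDestroy(streams[i]);
    }
}
```

Synchronizing per stream rather than per level is precisely what removes the gaps shown in Fig. 3.6.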
Figure 3.6: Normal CUDA kernel launches.
Figure 3.7: CUDA kernel launches using CUDA streams.
For better performance, the GPU version of HOG in OpenCV is highly optimized
for data reuse. The image is divided into many blocks and the block histograms
are computed only once, though a block can belong to multiple windows. When
extracting the HOG feature of a window, we need to find the already computed block
histograms and line them up. However, after we adjust the detected windows in our
system, the locations and sizes of those windows are changed and their HOG features
need to be recomputed. Moreover, those windows can be anywhere in the image, so
it is impossible to reuse the block histograms. Computing the HOG features of those
windows is inefficient even with the previous GPU design, because there are gaps
between windows and the windows cannot be massively parallelized.
In order to solve this problem, we propose a window-based HOG solution on GPU.
All windows are extracted and put together to form an image whose width is the same
as the window width, and whose height is the window height multiplied by the number
of windows. Then the newly constructed image is sent to GPU for block histogram
computation. As a result, HOG computation for multiple windows is now running
in parallel on GPU threads. Furthermore, we optimize this method by filtering out
blocks crossing two windows since these blocks are not useful. In our parameter
settings, we have 9 blocks in a window and there are 3 blocks that cross two windows.
By filtering out these cross-window blocks, the total computation is reduced by 25%.
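A host-side sketch of the window-stacking idea is shown below, assuming OpenCV; in our implementation the block histograms are then computed on the GPU over this single tall image, and the function name and layout here are illustrative.

```cpp
// Sketch of the window-based HOG idea: stack all adjusted detection windows
// into one tall image so their block histograms can be computed in a single
// massively parallel pass. OpenCV assumed; names illustrative.
#include <opencv2/opencv.hpp>
#include <vector>

cv::Mat stackWindows(const cv::Mat& gray, const std::vector<cv::Rect>& wins,
                     cv::Size winSize) {
    cv::Mat stacked(winSize.height * (int)wins.size(), winSize.width, gray.type());
    for (size_t i = 0; i < wins.size(); ++i) {
        cv::Mat dst = stacked.rowRange((int)(i * winSize.height),
                                       (int)((i + 1) * winSize.height));
        cv::resize(gray(wins[i]), dst, dst.size());  // normalize each window
    }
    // Blocks whose rows straddle two adjacent windows carry no useful
    // information and are skipped downstream, cutting computation by ~25%.
    return stacked;
}
```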
3.4 Experimental Results
The proposed traffic sign detection and recognition algorithms are evaluated on a
Tesla K20 GPU platform. The pre-processing stage on GPU takes about 13–17 ms.
The detection and recognition stages account for most of the processing time. At
first, we compare the HOG computing time on CPU and GPU at each scaling level.
As shown in Fig. 3.8, the speedup of GPU acceleration is significant when the scaling
level is small. The original size of the test image is 1,628 by 1,236 pixels. The parameter
settings are as listed in Table 3.1. The OpenCV library is employed for comparing the
HOG computing time on CPU and GPU.
Figure 3.8: HOG computing time on CPU and GPU.
Secondly, we test our optimized GPU implementation using 2000 images in the
BelgiumTS dataset. Each image is 1,628 by 1,236 pixels in size. The total
execution time for all three stages is compared between the original OpenCV HOG GPU
version and our optimized version. Initialization time, such as reading images
and SVMs, is excluded, as is post-processing time such as recording and displaying
results. Fig. 3.9 shows the total execution time for each frame by using the OpenCV
GPU code for HOG computation. Fig. 3.10 shows the execution time of our opti-
mized GPU code. We can see that the overall computing time is reduced and some
peaks are suppressed. The average frame rate of the OpenCV version on GPU is 21.3
fps. Our optimized GPU code can achieve the average frame rate of 27.9 fps which
is about 31% faster than the OpenCV version.
Figure 3.9: The total processing time when HOG is computed using OpenCV on GPU.
Finally, we evaluate the detection rate and classification rate of our proposed sys-
tem, using the BelgiumTS dataset [6]. Each test image is 1,628 by 1,236 pixels
in size. We test 1918 images and the detection rate is 91.69%. We also measure the
false positive rate by using background images provided by the BelgiumTS dataset.
Figure 3.10: The total processing time when using our optimized GPU code.
Based on our HOG parameters described in Table 3.1, we extract over 20 million
windows from those images in different scaling levels. The number of false positives
is 684. Thus the False Positives Per Window (FPPW) is 3.39 × 10−5. Similarly,
we use the BelgiumTSC dataset to evaluate the classification rate. Each image in the
BelgiumTSC dataset contains one traffic sign with some background. We resize each
image to our window size of 32 by 32 pixels before computing HOG and performing
SVM classification. We use 4,492 images for training and 2,520 images for testing.
All training and test images are from the BelgiumTSC dataset and the classification rate
is 93.77%.
3.5 Conclusions
This chapter presents a real-time traffic sign detection and recognition system on
the GPU. It is capable of detecting and recognizing 48 classes of traffic signs of various
sizes in each image frame. The detection rate is about 91.69% and the recognition
rate is about 93.77%. The system can process 27.9 fps video with active pixels of
1,628 × 1,236 resolution. Since each frame is processed individually, no information
from previous frames is required. As part of our future work, information from
previous frames will be considered for tracking traffic signs, which is expected to
further improve the detection accuracy.
Chapter 4
Automatic Detection of Traffic
Lights Using Support Vector
Machine
Many traffic accidents at intersections are caused by drivers who miss or
ignore the traffic signals. In this chapter, we present a new method for automatic
detection of traffic lights that integrates both image processing and support vec-
tor machine techniques. An experimental dataset with 21299 samples is built from
the captured original videos while driving on the streets. When compared to the
traditional object detection and existing methods, the proposed system provides sig-
nificantly better performance with 96.97% precision and 99.43% recall. The system
framework is extensible, so users can introduce additional parameters to further
improve the detection performance.
4.1 Introduction
Automatic detection of traffic lights should be an essential feature of advanced driver
assistance systems and future self-driving vehicles. It remains an important road
safety issue that many traffic accidents at intersections are caused by drivers
running red lights. Recent data from the Insurance Institute for Highway Safety
(IIHS) show that in 2012 on US roads, red-light-running crashes caused
(IIHS) show that in the year of 2012 on US roads, red-light-running crashes caused
about 133,000 injuries and 683 deaths [1]. The introduction of automatic traffic light
detection, especially red light detection, has important social and economic impacts.
Because road images often contain complex backgrounds and many objects,
it is a challenge to develop an algorithm that can detect the traffic lights precisely.
Most of the existing algorithms are based on color, shape and gradient information,
but the detections are not very reliable. Since the traffic lights themselves do not have
sufficient features, traditional feature-based object detection algorithms also do not
work well. In this chapter, we propose a new method that combines computer vision
and machine learning techniques in conjunction with inter-frame information. While
driving on the road, data were collected by recording video using a camera
mounted behind the front windshield. The data sets were then labeled for training and
evaluation of the proposed algorithm. Our experimental results suggest the proposed
method is highly effective for detecting red traffic lights.
The rest of the chapter is organized as follows. In Section 4.2, we propose an
improved method that combines the computer vision and machine learning techniques
for traffic light detection. Data collection and performance evaluation are presented
in Section 4.3, followed by conclusions in Section 4.4.
4.2 Proposed Method for Traffic Light Detection
4.2.1 Locating candidates based on color extraction
In this chapter, we focus on the detection of red circular traffic lights only. Green
or yellow lights can be detected by applying similar techniques. At first, we apply
color extraction to locate the candidates of traffic lights. The images are first converted
to the hue, saturation, and value (HSV) color space, and the red color is
extracted based on the hue values. A flood-fill method is applied for region labeling
and blob extraction.
The blobs can be considered as the potential candidates. In many previous works,
a variety of morphological filtering techniques were applied to eliminate some candi-
dates for the purpose of reducing false positives. However, any filtering has a possi-
bility of missing the true traffic lights, because the traffic lights are not always clear
due to their size in images and the obscure background. Thus we simply perform an
aspect ratio check and keep all blobs that pass the check as candidates of potential
traffic lights. The objective of eliminating false positives is considered in the latter
part of the proposed method.
4.2.2 Traffic light detection using template matching
Once the candidates are located, we apply a template matching method to detect the
traffic lights [29]. Here we consider the traditional and the most popular design of a
traffic light in which red, yellow and green lights are in round shapes and vertically
positioned in that order. For horizontally positioned traffic lights, we can apply the
same method with a few modifications. Typically only one of the three lights is
turned on at a time. In the previous step, we have located potential candidates of
the red lights on the image. When the red light is on, the yellow and green lights are
off. These two off lights are very similar, so we use the yellow light area ROIref as
the template, which is the yellow rectangular area in Fig. 4.1. Similarly, the green light
area is highlighted as the green rectangular area. We can perform template matching in
the green rectangular area with ROIref. In fact, we purposely make the green rectangular
area slightly larger than the yellow one, which provides more accurate results
for template matching. The minimal value among the template matching results is
recorded as Rin and the corresponding area is recorded as ROIin.
For the three vertical traffic lights, the assumption is that these two off lights are
almost identical and there should not be any similar objects in the neighboring area.
The background areas around the traffic light bounding box are highlighted as blue
rectangular areas as in Fig. 4.1. Using the same reference ROIref as the template,
we perform template matching in the blue rectangular area. The smallest value of
the template matching results is Rout and its corresponding area is ROIout. Since the
yellow and green lights are both off, they appear almost identical and the Rin value
is small. In contrast, the Rout value is often very large. We can set a threshold value
p. If the ratio Rin/Rout < p, a traffic light is detected; otherwise it is not.
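A minimal sketch of this test, assuming OpenCV, is given below; the squared-difference matching score is an assumption (the method only requires that smaller values mean better matches), and the three rectangles are supplied by the candidate geometry.

```cpp
// Minimal sketch of the Rin/Rout template-matching test; OpenCV assumed.
#include <opencv2/opencv.hpp>

bool isTrafficLight(const cv::Mat& frame, const cv::Rect& refRect,
                    const cv::Rect& inRect, const cv::Rect& outRect, double p) {
    cv::Mat tmpl = frame(refRect);                 // yellow-lamp area ROI_ref
    cv::Mat resIn, resOut;
    // Squared-difference matching: smaller value means a better match
    cv::matchTemplate(frame(inRect), tmpl, resIn, cv::TM_SQDIFF_NORMED);
    cv::matchTemplate(frame(outRect), tmpl, resOut, cv::TM_SQDIFF_NORMED);
    double rIn, rOut;
    cv::minMaxLoc(resIn, &rIn);                    // R_in: best match in holder
    cv::minMaxLoc(resOut, &rOut);                  // R_out: best background match
    return rIn / (rOut + 1e-9) < p;                // detected if ratio below p
}
```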
This template matching method does not require high resolution images. It works
well even if the candidates are small in size when the traffic lights are a long distance
away. In addition, it is effective to eliminate some false positives. As an improvement
to the detection method, additional constraints were considered in [29], such as the
mean and variance of the pixel values at the position of the two off lights should be
smaller than a certain threshold because those regions should be dark.
Figure 4.1: Applying the traffic light detector on a candidate.

However, the assumption that Rout is much larger than Rin is not always true. For
example, when the traffic lights are small or not so clear in the image, the off lights
are seen as dark regions. If the background is also dark, such as trees or buildings, the
template matching to the background Rout could also be very small. Then Rin/Rout
is likely above the given threshold p. As a result, the true traffic light is missed.
Additional constraints on the mean and variance of the pixel values do not solve this
problem either. In addition, it is difficult to choose a universal value for threshold
p. Thus, we propose an improved method that is integrated with machine learning
algorithms.
4.2.3 An improved method using SVM
Due to various backgrounds and object sizes in the image, it is difficult to manually
set a threshold for the Rin/Rout ratio obtained from template matching. So we propose
to build a support vector machine (SVM) that can automatically find the optimal
settings for the parameters (or features) extracted from the image through machine
learning. It requires a large dataset, with both positive and negative samples, for training
the SVM. For each candidate, we use the Rin and Rout values in conjunction with the pixel
values mref, min and mout to form a vector, where mref, min and mout are the mean pixel
values of the areas ROIref, ROIin and ROIout, respectively.
Each vector becomes a sample S1 for the SVM.
S1 = {Rin, Rout, mref , min, mout} (4.1)
The SVM is able to automatically adjust its parameters through the training process.
As demonstrated in Section 4.3, using the SVM to find parameters makes a huge leap
in terms of detection accuracy when compared to manually setting the threshold p.
However, we discover that the bounding box of a candidate by itself is not sufficient to
determine whether it is a traffic light or not. If we cut the candidates out from the
original image, sometimes even a human can hardly tell. Fig. 4.2 shows some examples
of candidates extracted from road images. The candidates in the first row have dark
backgrounds and those in the second row have bright backgrounds. As we can see, it is
difficult to determine a traffic light when the background is dark, while it is easier to spot
a traffic light against a bright background. Fig. 4.3 gives an example with both scenarios.
The left traffic light has a bright background while the right one has a dark background.
We also find that the brake lights of black vehicles, which are usually red, are a major
contributor to false positives.
Figure 4.2: Are they traffic lights or not? Dark background on the top and bright background at the bottom.
In order to improve the detection performance, we propose to add the location
information of the candidate bounding box as additional inputs to the SVM. The
Figure 4.3: The left traffic light has a bright background and the right traffic light has a dark background.
idea is that the size and ratio of a traffic light, as well as its location, should be
consistent among all training samples. For instance, a traffic light cannot be located as
low as the vehicle brake lights shown in Fig. 4.2. Each bounding box B has four
parameters,
B = {x, y, w, h} (4.2)
where (x, y) are coordinates of the upper-left corner (or origin) of a bounding box,
w and h are its width and height respectively. Intuitively, it is impossible for traffic
lights to appear on the road surface, therefore y should be within a certain range, and so
should x. There are implicit relationships between the size and position of a traffic light in
an image. Again, it is difficult to explore these relationships explicitly through image
processing. We propose to introduce the additional information of the bounding box
B by including it in the SVM input sample. Thus we form a new vector S2 for
each candidate, where
S2 = S1 ∪B = {Rin, Rout, mref , min, mout, x, y, w, h} (4.3)
As demonstrated later in Section 4.3, the expanded SVM vector shows significant
improvement on the detection performance. It is worth noting that the proposed
method can be expanded further by including more parameters and features in the
SVM. The proposed method is a framework that utilizes SVM as a machine learning
tool to automatically find optimal parameter settings for traffic light detection.
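A sketch of how a sample S2 from Eq. (4.3) might be assembled and used to train the classifier, assuming OpenCV's ml module; the helper names and the plain train() call are illustrative (OpenCV's trainAuto() can also be used for cross-validated parameter search).

```cpp
// Sketch of forming the expanded SVM sample S2 (Eq. 4.3) and training an
// RBF-kernel SVM on it; OpenCV's ml module assumed, names illustrative.
#include <opencv2/opencv.hpp>
#include <opencv2/ml.hpp>

cv::Mat makeSampleS2(double Rin, double Rout, double mRef, double mIn,
                     double mOut, const cv::Rect& box) {
    // One row of 9 features: {Rin, Rout, mref, min, mout, x, y, w, h}
    return (cv::Mat_<float>(1, 9) << Rin, Rout, mRef, mIn, mOut,
            box.x, box.y, box.width, box.height);
}

cv::Ptr<cv::ml::SVM> trainDetector(const cv::Mat& samples, const cv::Mat& labels) {
    auto svm = cv::ml::SVM::create();
    svm->setType(cv::ml::SVM::C_SVC);
    svm->setKernel(cv::ml::SVM::RBF);              // RBF kernel, as in Section 4.3
    svm->train(samples, cv::ml::ROW_SAMPLE, labels);
    return svm;
}
```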
4.3 Data Collection and Performance Evaluation
As an experimental setup, we mount a camera behind the front windshield and
record videos when driving on the road. We extract traffic light candidates using the
process discussed in 4.2.1. We obtain a data set with 21299 candidates from 2706
images. These images are extracted from actual videos and contain four independent
instances of circular traffic lights. Each image has a resolution of 1920-by-1080
pixels. We compare these candidates with the manually labeled ground truth
and find that there are 4526 true traffic lights and 16773 negative candidates.
This newly constructed dataset is used to evaluate the proposed detection method.
In order to compare the performance among different methods, we use two standard
metrics, precision and recall, where

precision = true positives / (true positives + false positives)    (4.4)

recall = true positives / (true positives + false negatives)    (4.5)

Figure 4.4: Rin/Rout values for true positive candidates (left) and true negative candidates (right). Y-axis is from 0 to 2000.
The dataset with 21299 candidates is shuffled randomly. When applied to the proposed
SVM, half of them are used as training data and the remaining are used for testing.
There is no overlap between training and test data.
We first evaluate the ratio Rin/Rout and its feasibility to detect the traffic lights on
the image. Fig. 4.4 shows that the ratio values for true negative candidates are generally
larger than those of true positive candidates. But if we zoom into the Rin/Rout values
with Y axis from 0 to 20 as in Fig. 4.5, we can see that many true negative candidates
also have small Rin/Rout values. Therefore, choosing a fixed threshold p is not an
effective method to separate the positive or negative candidates, because some true
positive candidates could be classified as negative and vice versa.
Table 4.1 lists the evaluation results based on Rin/Rout for different p values.
TP, FP, TN and FN stand for True Positives, False Positives, True Negatives and
False Negatives, respectively. The results show that it is difficult to balance between
precision and recall with a fixed threshold value. Thus we opt to use SVM for
classification based on the Rin and Rout values.

Figure 4.5: Rin/Rout values for true positive candidates (left) and true negative candidates (right). Y-axis is from 0 to 20.

Table 4.1: Evaluation result based on Rin/Rout for different p values
Threshold Precision Recall TP FP TN FN
p = 1.5 47.52% 96.51% 4368 4824 11949 158
p = 1.0 60.90% 89.79% 4064 2609 14164 462
p = 0.5 78.48% 76.14% 3446 945 15828 1080
p = 0.2 95.95% 45.01% 2037 86 16687 2489
Table 4.2 shows the performance of different detection methods. We use the classic
object detection method with Haar-like features and the AdaBoost algorithm as a baseline,
which provides the results of 76.89% precision and 73.40% recall. If we set the
threshold p = 0.5 for the Rin/Rout ratio, the detection performance is only slightly
better than the baseline.
Next, an SVM with a radial basis function (RBF) kernel is trained with {Rin, Rout}
as input for traffic light classification. Table 4.2 shows that the SVM improves
recall by 15.14% but precision by only 2.28%, compared with using a fixed
threshold.
As proposed in Section 4.2, when we add the pixel values mref, min and mout in addition
to the Rin and Rout values to form the SVM input vector S1, the detection performance is
improved significantly, with 89.09% precision and 96.60% recall. Furthermore, the
origin and geometry information {x, y, w, h} of the bounding box are added to form
S2. The improved method achieves the performance of 96.97% precision and
99.43% recall, which is reasonably accurate and reliable.
Table 4.2: Evaluation result: precision and recall
Detection method Precision Recall
Haar, AdaBoost 76.89% 73.40%
Rin/Rout, p = 0.5 78.48% 76.14%
{Rin, Rout}, SVM 80.76% 91.28%
S1, SVM 89.09% 96.60%
S2, SVM 96.97% 99.43%
Fig. 4.6 shows an example of detected traffic lights in an image. Although their
backgrounds are drastically different, both traffic lights are detected and marked in
the image. Our system is implemented in C++ and executed on the Intel i5-3570K
processor at 3.4 GHz. The processing time for each image frame is approximately 60
ms to 90 ms. For real-time implementation, we are currently migrating the design to
an FPGA platform.
4.4 Conclusions
In this chapter, we propose a new method that can detect traffic lights accurately and
reliably. Color extraction is applied to locate the candidates. A template matching
technique is applied to provide quantitative information about the traffic lights and their
surrounding areas. We also demonstrate that detection using a fixed threshold ratio is
not very effective and the SVM-based classification has much better performance. In
Figure 4.6: Both traffic lights are detected.
addition, we empirically add more parameters of a candidate to the SVM input and
it can achieve the best performance of 96.97% precision and 99.43% recall. As an
additional contribution, we build a traffic light dataset with 21299 samples from the
original videos captured while driving on the streets. This dataset can be used by
others for computer vision and machine learning research.
Chapter 5
Accurate and Reliable Detection of
Traffic Lights Using Multi-Class
Learning and Multi-Object
Tracking
Automatic detection of traffic lights has great importance to road safety. This chap-
ter presents a novel approach that combines computer vision and machine learning
techniques for accurate detection and classification of different types of traffic lights,
including green and red lights both in circular and arrow forms. Initially, color ex-
traction and blob detection are employed to locate the candidates. Subsequently, a
pre-trained PCA network is used as a multi-class classifier to obtain frame-by-frame
results. Furthermore, an online multi-object tracking technique is applied to over-
come occasional misses and a forecasting method is used to filter out false positives.
Several additional optimization techniques are employed to improve the detector performance
and handle the traffic light transitions. When evaluated using the test
video sequences, the proposed system can successfully detect the traffic lights on the
scene with high accuracy and stable results. With hardware acceleration, the
proposed technique is ready to be integrated into advanced driver assistance systems
or self-driving vehicles. We build our own dataset of traffic lights from recorded
driving videos, including circular lights and arrow lights in different directions. Our
experimental dataset is available at http://computing.wpi.edu/Dataset.html.
5.1 Introduction
Automatic detection of traffic lights is an essential feature of an advanced driver
assistance system or self-driving vehicle. It remains a critically important road safety
issue that many traffic accidents at intersections are caused by drivers running
red lights. Recent data from the Insurance Institute for Highway Safety (IIHS) show
that in 2012 on US roads, red-light-running crashes caused about 133,000
injuries and 683 deaths [1]. The introduction of an automatic traffic light detection system,
especially red light detection, has important social and economic impacts.
In addition to detecting traffic lights, it is also important to recognize whether the lights
appear in circular form or as directional arrow lights. For example, a red left arrow
light and a green circular light can appear at the same time. Without recognition, the
detection systems can get confused because valuable information has been lost. There
are few papers in the literature that combine detection and recognition of traffic lights
together.
Based on our survey, there are very few datasets available for traffic lights. The
Traffic Lights Recognition (TLR) public benchmarks [8] contain image sequences with
traffic lights and ground truth. However, the images in the dataset do not have high
resolution, and the number of physical traffic lights is limited due to the fact that
the image sequences are converted from a short video. In addition, this dataset
only contains circular traffic lights, which is not always the case in real applications.
Therefore, we opt to build our own dataset for traffic light detection, including circular
lights and arrow lights in all three directions. Our dataset of traffic lights can be used
by many other researchers in computer vision and machine learning.
In this chapter, we propose a new method that combines computer vision and
machine learning techniques. Color extraction and blob detection are used to locate
the candidates, followed by the PCA network (PCANet) [32] classifiers. The PCANet
classifier consists of a PCANet and a linear Support Vector Machine (SVM). Our
experimental results suggest the proposed method is highly effective for detecting
both green and red traffic lights of many types.
Despite the effectiveness of PCANet and many outstanding achievements made
by computer vision researchers, object detection in a single image still makes frequent
errors, which may cause serious problems in real-world safety-critical applications such as
Advanced Driver Assistance Systems (ADAS). Traditional frame-by-frame detection
methods ignore the inter-frame information in the video. Since the objects in a video
are normally in continuous motion, their identities and trajectories are valuable in-
formation that can improve the frame-based detection results. Unlike a pure tracking
problem that tracks a marked object from the first frame, tracking-by-detection algo-
rithms involve frame-by-frame detection, inter-frame tracking and data association.
In addition, multi-object tracking (MOT) algorithms can be employed to distinguish
different objects and keep track of their identities and trajectories. When it becomes
a multi-class problem such as recognizing different types of traffic lights, additional
procedures such as a voting scheme are often applied. In addition, the method needs to
address the situation in which the traffic light status changes suddenly during the detection
process.
The rest of the chapter is organized as follows. Section 5.2 describes our data
collection and experimental setup. In Section 5.3, we propose a method that combines
computer vision and machine learning techniques for traffic light detection using
PCANet. In Section 5.3.3, we propose a MOT-based method that stabilizes the
detection and improves the recognition results. Performance evaluation is presented
in Section 5.4, followed by some discussion in Section 5.5 and conclusions in Section
5.6.
5.2 Data Collection and Experimental Setup
In this chapter, we focus on the detection of red and green traffic lights, and the
recognition of their types. The amber lights can be detected using similar techniques,
but we do not consider amber lights here due to lack of data. The recognition of arrow
lights requires high-resolution input frames; otherwise all
lights are just colored dots or balls in the frame, and it is impossible to recognize
them.
We mount a smartphone behind the front windshield and record videos when
driving on the road. Several hours of videos are recorded around the city of Worcester,
Massachusetts, USA, during both summer and winter seasons. Subsequently, we select
a subset of video frames to build the dataset since most of the frames do not contain
traffic lights. In addition, passing an intersection only takes a few seconds in case of
the green lights. At red lights, the frames are almost identical as the vehicle is stopped.
Thus the length of the selected video for each intersection is very short. Several minutes
of traffic-light-free frames are retained in our dataset for assessment of false positives.
Each image has a resolution of 1920×1080 pixels. To validate the proposed approach
and to avoid overlapping of training and test data, the data collected in the summer
is used for training and the data collected in the winter is used for testing. Our traffic
light dataset is made available online at http://computing.wpi.edu/Dataset.html.
5.2.1 Training data
All the training samples are taken from the data collected during the summer. Input
data to the classifier are obtained from the candidate selection procedure described in
5.3.1, and the classifier output goes to the tracking algorithm for further processing.
Thus the evaluation of the classifier is independent of the candidate selection and the post-
processing (tracking). The classifier is trained to distinguish true and false traffic
lights, and to recognize the types of the traffic lights. OpenCV [84] is used for
SVM training, which chooses the optimal parameters by performing 10-fold cross-
validation.
The positive samples, which contain the traffic lights, are manually labeled and
extracted from the dataset images. The negative samples, such as segments of trees
and vehicle tail lights, are obtained by applying the candidate selection procedure over
the traffic-light-free images. The green lights and red lights are classified separately.
For green lights, there are three types based on their aspect ratios. The first type is
called Green ROI-1, which contains one green light in each image and its aspect ratio
is approximately 1:1. The second type is called Green ROI-3. It contains the traffic
light holder area which has one green light and two off lights, and its aspect ratio is
approximately 1:3. The third type is called Green ROI-4. It contains the traffic light
Figure 5.1: Examples of 5 classes of Green ROI-1.
holder area which has one green round light, one green arrow light, and two off lights,
and its aspect ratio is approximately 1:4.
Each type of sample image has several classes. The Green ROI-1 and Green
ROI-3 both have five classes including negative samples, as shown in Fig. 5.1 and Fig.
5.2. These 5 classes from top to bottom are Green Negative (GN-1; GN-3), Green
Arrow Left (GAL-1; GAL-3), Green Arrow Right (GAR-1; GAR-3), Green Arrow
Forward (GAF-1; GAF-3) and Green Circular (GC-1; GC-3).
The Green ROI-4 also has five classes including negative samples as shown in Fig.
5.3. The five classes from top to bottom are Green Negative (GN-4), Green Circular
and Green Arrow Left (GCGAL-4), Green Circular and Green Arrow Right (GCGAR-
4), Green Arrow Forward and Left (GAFL-4) and Green Arrow Forward and Right
(GAFR-4). The Green Negative samples are obtained from traffic-lights-free videos
by using the color extraction method discussed in Section 5.3.1.
For red lights, there are two types of sample images based on their aspect ratios.
The first type is called Red ROI-1, as shown in Fig. 5.4. It contains one red light in
each image and its aspect ratio is approximately 1:1. The other type is called Red
Figure 5.2: Examples of 5 classes of Green ROI-3.
Figure 5.3: Examples of 5 classes of Green ROI-4.
Figure 5.4: Examples of 3 classes of Red ROI-1.
ROI-3, as shown in Fig. 5.5. It contains the traffic light holder, which holds one
red light and two off lights, and its aspect ratio is approximately 1:3. Each type of
sample image has three classes: Red Negative (RN-1; RN-3), Red Arrow Left (RAL-
1; RAL-3) and Red Circular (RC-1; RC-3). The Red Negative samples are obtained
from traffic-lights-free videos by using the color extraction method mentioned in 5.3.1.
The red lights do not have ROI-4 data because the red light is on top, followed by an
amber light and one or two green lights at the bottom. If the red light is on, the
amber and green lights beneath must be off. These three lights form the vertical ROI-3
setting, regardless of the status of the 4th light at the very bottom.
Table 5.1 shows the number of training samples of Green ROI-n and Red ROI-n,
where n is 1, 3 or 4.
Features of a traffic light itself may not be as rich as other objects such as a
human or a car. For example, a circular light is just a colored blob that looks similar
to other objects in the same color. Therefore, it is difficult to distinguish the true
traffic lights from other false candidates solely based on color analysis. The ROI-3 and
ROI-4 samples are images of the holders, which provide additional information for
detection and classification. The approach of combining all this information together
is explained in 5.3.2.2.
Figure 5.5: Examples of 3 classes of Red ROI-3.
Table 5.1: Number of training samples of Green ROI-n and Red ROI-n
Class n = 1 n = 3 n = 4
GN-n 13218 13218 13213
GAL-n 1485 835 -
GAR-n 1717 617 -
GAF-n 2489 1018 -
GC-n 3909 3662 -
GCGAL-n - - 369
GCGAR-n - - 281
GAFL-n - - 749
GAFR-n - - 1005
RN-n 7788 7619 -
RAL-n 1214 1235 -
RC-n 4768 5035 -
5.2.2 Test data
All test images are taken from the dataset that we collected in the winter. The ground
truths are manually labeled and are used for validating the results. In our proposed
method, a tracking technique is used to further improve the performance. However,
traffic lights can move out of the image or change states during the tracking process.
Therefore the test sequences need to cover many possible scenarios for all types of
lights. Detailed information of the test sequences is shown in Table 5.2.
5.3 Proposed Method of Traffic Light Detection
and Recognition
Fig. 5.6 shows the flowchart of our proposed method of traffic light detection and
recognition, which consists of three stages. Firstly, color extraction and candidate
Table 5.2: Information of 23 test sequences
Seq ID Frames Traffic lights Types of traffic lights Description
1 91 182 Green circular×2. Lights in all frames.
2 90 180 Green circular×2. Lights in all frames.
3 61 147 Green arrow left×3. Lights in all frames.
4 48 144 Green circular×3. Lights in all frames.
5 156 312 Red circular×2. Lights in all frames.
6 156 211 Green circular×2. Lights at start, then move out.
7 214 428 Green circular×2. Lights in all frames.
8 76 152 Red circular×2. Lights in all frames.
9 245 305 Green circular×2. Lights at start, then move out.
10 174 177 Green circular×2. Lights at start, then move out.
11 91 348 Red circular×3; green arrow left; green arrow right; green arrow forward; green circular. Red lights at start, then green lights.
12 56 280 Red arrow left; green arrow right×2; green arrow forward×2. Lights in all frames.
13 82 70 Green circular×2. Lights at start, then move out.
14 259 518 Green circular×2. Lights in all frames.
15 65 325 Red arrow left; green arrow right×2; green arrow forward×2. Lights in all frames.
16 185 242 Green circular×2. Lights at start, then move out.
17 93 186 Red circular×2. Lights in all frames.
18 630 0 None. No traffic lights.
19 580 0 None. No traffic lights.
20 416 0 None. No traffic lights.
21 550 0 None. No traffic lights.
22 759 0 None. No traffic lights.
23 3035 0 None. No traffic lights.
Total 8112 4207 - -
selection are performed over the input image. Secondly, to determine whether the
selected candidates are traffic lights and what types of lights, they are processed
by PCANet and SVM. Finally, tracking and forecasting techniques are applied to
improve the performance and stabilize the final output.
Figure 5.6: Flowchart of the proposed method of traffic light detection and recognition.
5.3.1 Locating candidates based on color extraction
Color extraction is applied to locate the Regions of Interest (ROI), i.e., the candidates.
The images are converted to the hue, saturation, and value
(HSV) color space. Compared with the RGB color space, HSV is more robust
against illumination variation and more suitable for segmentation [85]. The desired
color is extracted from an image mainly based on the hue values, which results in a
binary image. Suppose the HSV value of the ith pixel in an image is
HSVi = {hi, si, vi} (5.1)
In order to extract green pixels, we set the color thresholds based on the empirical
data:
40 ≤ hi ≤ 90 (5.2)
60 ≤ si ≤ 255 (5.3)
110 ≤ vi ≤ 255 (5.4)
In order to extract red pixels, besides (5.3) and (5.4), one of the following condi-
tions must hold:
165 ≤ hi ≤ 180 (5.5)
0 ≤ hi ≤ 20 (5.6)
These values are adjustable and similar settings can be found in [28]. Note that
the threshold values that we choose work well in OpenCV [84] and may need proper
conversion in order to work with other libraries.
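These thresholds translate directly into mask operations; a minimal sketch assuming OpenCV (whose hue channel spans 0–180, matching the values above) is:

```cpp
// Minimal sketch of the HSV color extraction with the thresholds in
// Eqs. (5.2)-(5.6); OpenCV assumed, function name illustrative.
#include <opencv2/opencv.hpp>

void extractColors(const cv::Mat& bgr, cv::Mat& greenMask, cv::Mat& redMask) {
    cv::Mat hsv;
    cv::cvtColor(bgr, hsv, cv::COLOR_BGR2HSV);
    // Green: 40 <= h <= 90, 60 <= s <= 255, 110 <= v <= 255
    cv::inRange(hsv, cv::Scalar(40, 60, 110), cv::Scalar(90, 255, 255), greenMask);
    // Red wraps around hue 0: take the union of the two hue intervals
    cv::Mat redLow, redHigh;
    cv::inRange(hsv, cv::Scalar(0, 60, 110), cv::Scalar(20, 255, 255), redLow);
    cv::inRange(hsv, cv::Scalar(165, 60, 110), cv::Scalar(180, 255, 255), redHigh);
    cv::bitwise_or(redLow, redHigh, redMask);
}
```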
Blob detection can be implemented using flood-fill or contour following. The blobs
can be considered as the potential candidates. However, it is possible that an arrow
light may be labeled as two different regions, because the head and tail of an arrow
are sometimes separated by a gap between them. When the traffic lights are closer
to the camera, it is more likely that the gaps can be clearly seen and thus affect the
result of blob extraction. To solve this problem, the closing operation is performed
on the binary image obtained from color extraction. Closing operation is a typical
morphological operation in image processing. It applies a dilation followed by an
erosion, which eliminates gaps and holes on the binary image. Therefore, the arrow
light can be detected as a whole, and the candidates after closing are more reliable than
the original candidates. Fig. 5.7 shows the original result of color extraction and blob
detection (top right), and the result with the closing operation (bottom right).

Figure 5.7: Color extraction, blob detection and closing operation.
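A sketch of this closing step, assuming OpenCV; the 5 × 5 elliptical structuring element is an assumption, since the chapter does not fix a kernel.

```cpp
// Sketch of the morphological closing used to merge the head and tail of
// arrow lights before blob detection; kernel shape/size are assumptions.
#include <opencv2/opencv.hpp>

cv::Mat closeMask(const cv::Mat& mask) {
    cv::Mat kernel = cv::getStructuringElement(cv::MORPH_ELLIPSE, cv::Size(5, 5));
    cv::Mat closed;
    // Dilation followed by erosion: fills small gaps and holes in the blobs
    cv::morphologyEx(mask, closed, cv::MORPH_CLOSE, kernel);
    return closed;
}
```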
The side-effect of the closing operation is that it might connect a green light
with other green objects in the background such as trees. When the traffic lights
are far away from camera, this problem is more likely to occur because the black
borders of traffic light holders are thin. However, when the traffic lights are far away,
the gaps are more likely to be filled by the halo of the lights, or become invisible
due to the limitation of image resolution. Therefore, the original candidates are more
reliable than those after closing. It is difficult to determine whether the morphological
closing operation should be applied. Therefore, we choose to keep both the original
candidates and the candidates after the closing operation. If overlapping candidates
are identified through the classification, the candidate with the aspect ratio closest to
one is selected.
The objective of eliminating false positives is considered in the latter part of the
proposed method. Fig. 5.8 shows an example of the road images. In this image,
there are four green traffic lights, but 895 green candidates can be extracted using
the method mentioned above. This requires the classifier to be very strong, filtering out
the negative candidates while retaining the positive ones. However, even if the classifier is
able to filter out 99% of the negative candidates, there are still about 9 false positives
remaining in this image, which is an unacceptable result. Therefore, pre-filtering and
post-validation steps are necessary in addition to the classifier itself. For red traffic
lights, the number of candidates is much smaller than that of the green traffic lights.
For example, there are 19 red candidates in Fig. 5.8 from the color extraction.
In many previous works [21, 22, 25, 29], a variety of morphological filtering tech-
niques were applied to eliminate some candidates for the purpose of reducing false
positives. However, any filtering has a possibility of missing the true traffic lights,
because the traffic lights are not always clear due to their size and the obscure back-
ground in an image. Thus only an aspect ratio check is performed in the proposed
method, and all blobs that pass the check are kept as candidates. The aspect ratio ar
is defined as

ar = w/h    (5.7)

Figure 5.8: A sample frame from our traffic light dataset.
where w is the width and h is the height of the candidate. In order to pass the aspect
ratio check, the following inequality must hold:
2/3 ≤ ar ≤ 3/2 (5.8)
The aspect ratio check reduces the number of candidates. In Fig. 5.8, the number
of green candidates is reduced to 51 and the number of red candidates is reduced to
9 after the aspect ratio check.
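Combining blob detection with the checks in (5.7) and (5.8), a minimal candidate-extraction sketch (assuming OpenCV; the function name is illustrative) is:

```cpp
// Candidate extraction: find blobs in a binary mask and keep those whose
// bounding box satisfies 2/3 <= w/h <= 3/2 (Eqs. 5.7 and 5.8).
#include <opencv2/opencv.hpp>
#include <vector>

std::vector<cv::Rect> findCandidates(const cv::Mat& mask) {
    std::vector<std::vector<cv::Point>> contours;
    cv::findContours(mask, contours, cv::RETR_EXTERNAL, cv::CHAIN_APPROX_SIMPLE);
    std::vector<cv::Rect> candidates;
    for (const auto& c : contours) {
        cv::Rect box = cv::boundingRect(c);
        double ar = (double)box.width / box.height;   // Eq. (5.7)
        if (ar >= 2.0 / 3.0 && ar <= 3.0 / 2.0)       // Eq. (5.8)
            candidates.push_back(box);
    }
    return candidates;
}
```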
5.3.2 Classification
5.3.2.1 PCANet
The PCANet classifier is applied to determine whether a candidate is a traffic light
or not. The PCANet classifier consists of a PCA network and a multi-class SVM. The
structure of PCANet is simple, comprising a number of PCA stages followed by an
output stage. The number of PCA stages can vary, but the typical value
is 2, making it the so-called two-stage PCANet. As shown in [32], a two-stage PCANet
outperforms the single-stage PCANet in most cases, but further increasing the number
of stages does not necessarily provide better performance, according to the authors'
empirical experience. Therefore, a two-stage PCANet is used in our proposed method.
The structure of PCANet emulates that of a traditional convolutional neural
network [86]. The convolution filter bank is chosen to be PCA filters. The nonlinear
layer is the binary hashing (quantization). The pooling layer is the block-wise his-
togram of binary vectors. There are two parts in each PCA stage: patch mean removal
and PCA filter convolution. For each pixel of the input image, there is a patch of
pixels of the same size as the filter. The mean is removed from each patch,
followed by PCA filter convolution. The PCA filters are obtained by unsupervised
learning during the training process. The number of PCA filters can vary; the
impact of the number of PCA filters is discussed in [32]. Generally speaking, more
PCA filters lead to better performance. In this chapter, we choose 8 filters for both
PCA stages and find this sufficient to deliver good performance.
The output stage consists of binary hashing and block-wise histograms. The outputs
of the PCA stages are converted to binary values, with positive values mapped to one
and all others to zero. Thus a binary vector is obtained for each patch, and the length of this vector
Figure 5.9: The structure of two-stage PCANet.
is fixed. This binary vector is then converted to a decimal value. The block-wise
histogram of these decimal values forms the output features. The SVM is then fed
with the features from PCANet. Fig. 5.9 shows the structure of a two-stage PCANet.
The number of filters in stage 1 is m and in stage 2 is n.
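The binary hashing and decimal packing can be sketched as follows, assuming the stage-2 filter responses of one image are held in OpenCV matrices; the block-wise histogram step that follows is omitted for brevity.

```cpp
// Sketch of the PCANet output stage: binarize the stage-2 filter responses
// (positive -> 1, else 0) and pack them into per-pixel decimal codes.
#include <opencv2/opencv.hpp>
#include <vector>

cv::Mat binaryHash(const std::vector<cv::Mat>& responses) {
    CV_Assert(!responses.empty());
    cv::Mat code = cv::Mat::zeros(responses[0].size(), CV_32S);
    for (size_t k = 0; k < responses.size(); ++k) {
        cv::Mat bit = responses[k] > 0;            // Heaviside step: 0 or 255
        bit.convertTo(bit, CV_32S, 1.0 / 255.0);   // mask 0/255 -> binary 0/1
        code += bit * (1 << k);                    // weight bit k by 2^k
    }
    return code;  // per-pixel decimal values, ready for block-wise histograms
}
```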
5.3.2.2 Recognizing green traffic lights using PCANet
As mentioned in 5.3.1, due to the large number of green objects in an image, such
as trees, street signs, and green vehicles, the classifier must be strong enough to
eliminate potential false positives while maintaining a high detection rate. Using
the green areas as candidates is not sufficient. For example, a fragment of tree leaves
may occasionally look similar to the green lights in some frames, which causes false
positive “flashing” in the video of detection results.
To solve this problem, a validation step is applied to the system. It is assumed
that the traffic lights always appear in a holder. The traffic light holder contains three
or four lamps that are vertically aligned in our collected data. Note that horizontal
traffic lights are also often used and can be processed using the same method if the
dataset is available. In addition, these lamps have certain combinations. The traffic
light holder area thus contains important information that can help us detect the
traffic lights. In a vertical traffic light holder, the bottom one is always a green lamp.
Therefore, the position of potential traffic light holder can be located according to
the green area. The aspect ratio of the green area is approximately 1:1, and the green
area is called ROI-1. The traffic holder area with three lamps is called ROI-3
and the traffic holder area with four lamps is called ROI-4. Suppose the rectangular
bounding box of ROI-1 is RROI−1 where
RROI−1 = {xROI−1, yROI−1, wROI−1, hROI−1} (5.9)
Similarly there are bounding boxes RROI−3 for ROI-3 and RROI−4 for ROI-4 where
RROI−3 = {xROI−3, yROI−3, wROI−3, hROI−3} (5.10)
RROI−4 = {xROI−4, yROI−4, wROI−4, hROI−4} (5.11)
The variables xROI−i, yROI−i are the coordinates of the top-left corner of the bounding
box RROI−i , wROI−i is its width and hROI−i is its height. The RROI−3 can be obtained
based on RROI−1 as follows, where the coefficients are determined empirically based
on the assumption that the lights are vertically aligned and the green light is the
lowest light:
xROI−3 = xROI−1 − 0.1× wROI−1 (5.12)
yROI−3 = yROI−1 − 2.5× hROI−1 (5.13)
wROI−3 = 1.2× wROI−1 (5.14)
hROI−3 = 3.6× hROI−1 (5.15)
In the case of horizontally aligned lights, these coefficients should be changed accord-
ingly. Similarly, the RROI−4 can be obtained based on RROI−1 as follows:
xROI−4 = xROI−1 − 0.1× wROI−1 (5.16)
yROI−4 = yROI−1 − 3.9× hROI−1 (5.17)
wROI−4 = 1.2× wROI−1 (5.18)
hROI−4 = 5.1× hROI−1 (5.19)
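These holder-box derivations reduce to a few lines; the sketch below mirrors Eqs. (5.12)–(5.19) using OpenCV's cv::Rect, and the clipping remark is a practical note rather than part of the equations.

```cpp
// Deriving the ROI-3 and ROI-4 holder boxes from a green ROI-1 box using
// the empirical coefficients in Eqs. (5.12)-(5.19); OpenCV assumed.
#include <opencv2/opencv.hpp>

cv::Rect greenRoi3(const cv::Rect& r1) {
    return cv::Rect(cvRound(r1.x - 0.1 * r1.width),    // Eq. (5.12)
                    cvRound(r1.y - 2.5 * r1.height),   // Eq. (5.13)
                    cvRound(1.2 * r1.width),           // Eq. (5.14)
                    cvRound(3.6 * r1.height));         // Eq. (5.15)
}

cv::Rect greenRoi4(const cv::Rect& r1) {
    return cv::Rect(cvRound(r1.x - 0.1 * r1.width),    // Eq. (5.16)
                    cvRound(r1.y - 3.9 * r1.height),   // Eq. (5.17)
                    cvRound(1.2 * r1.width),           // Eq. (5.18)
                    cvRound(5.1 * r1.height));         // Eq. (5.19)
}
// In practice the resulting rectangles should be intersected with the image
// bounds before cropping, e.g. roi & cv::Rect(0, 0, img.cols, img.rows).
```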
All samples of ROI-1 are resized to 10 × 10 pixels, all samples of ROI-3 to 10 × 33
pixels and all samples of ROI-4 to 10×43 pixels. Three PCANet classifiers are trained
separately for ROI-1, ROI-3 and ROI-4. Each classifier is able to perform multi-
class classification, such as distinguishing left arrows, right arrows, circular lights and
negative samples.
In order to combine the results of these three classifiers, several methods are
evaluated using the test dataset. An intuitive solution is the voting strategy. The
results of ROI-1, ROI-3 and ROI-4 are voted to several classes and the class that has
the most votes is selected as the final result. However, this method is not accurate.
The ROI-3 may contain only a partial area of a traffic light holder if it is actually a four-
light holder. The ROI-4 may contain background if it is actually a three-light holder.
Therefore, the positive results of ROI-3 and ROI-4 are both considered as possible
regions. If any positive results of ROI-1 overlap with these regions, it is considered
a true positive green light. This is a more plausible approach because the two cases
mentioned above do contain the traffic light holders that are the possible regions.
Although the class types determined by ROI-3 and ROI-4 may be inaccurate, the
ROI-1 is capable of providing an accurate result.
5.3.2.3 Recognizing red traffic lights using PCANet
Red traffic lights are recognized in a similar way to green lights. The bounding
boxes of Red ROI-1 and ROI-3 are expressed the same way as those of the green lights
shown in (5.9) and (5.10). Assuming the lights are vertically aligned and the red
light is the top light, the RROI−3 can be obtained based on RROI−1 using Equations
(5.12), (5.14), (5.15) and

yROI−3 = yROI−1 − 0.1× hROI−1    (5.20)
5.3.3 Stabilizing the detection and recognition output
5.3.3.1 The problem of frame-by-frame detection
Frame-by-frame detection is important, but not sufficient to render stable output.
The reasons are twofold. One aspect is that no detector can perform perfectly under
all possible scenarios. Another is that the input data are sometimes not of good
quality. For example, vehicle vibrations may cause the camera to lose focus, making the
frames blurry. A red arrow traffic light in such a situation may look identical to a
circular red light and can hardly be recognized even by human eyes, which is shown
in the image in the center of Fig. 5.10. However, the arrow light is clear in other
Figure 5.10: An arrow light in three consecutive frames. The middle one is blurry and looks similar to a circular light. A detector often fails on such a frame.
frames. If the detector recognizes this arrow light in previous frames and keeps track
of it, a correct estimation can be provided for the blurry frame even if the detector
gives an incorrect result. In addition, there may be multiple lights in a frame, so
multiple lights need to be distinguished and not confused with each other.
The goal of multi-object tracking is to recover the complete tracks of multiple ob-
jects and to give estimation of their current states. There are two categories of multi-
object tracking methods: batch methods and online methods. The batch methods
require the detection results of the entire sequence before analyzing the identity and
constructing the trajectory of each object, which makes it impractical for real-time
applications. The online methods are based on information that is available up to
the current frame, which can provide results in real-time. Traffic light detection is a
time-critical application that needs to give immediate feedback to the driver or con-
troller, therefore multi-object tracking must be done using the online method. The
online methods track objects from previous frames, and associate the tracking result
with detection result of the current frame.
5.3.3.2 Tracking and data association
Here we propose an intuitive approach which is optimized for the traffic light detection
application. For video camera at 30 frames per second (FPS), the motion of the
lights between the adjacent frames are of small values. Therefore, an object in the
next frame should be found near its location in the previous frame. Since color is an
important feature of traffic lights, mean shift method is employed to locate the traffic
light based on its previous position. Given a traffic light in the previous frame, the
mean shift procedure calculates the histogram in the hue channel of the HSV color
space, and then calculates histogram back-projection in the current frame in order to
locate the light.
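A sketch of this hue-histogram mean shift step, assuming OpenCV; the 30-bin histogram and the termination criteria are illustrative assumptions.

```cpp
// Hue-histogram mean shift: build a hue histogram of the light in the
// previous frame, back-project it into the current frame, and shift the box.
#include <opencv2/opencv.hpp>

cv::Rect trackLight(const cv::Mat& prevFrame, const cv::Mat& currFrame,
                    cv::Rect box) {
    cv::Mat prevHsv, currHsv;
    cv::cvtColor(prevFrame, prevHsv, cv::COLOR_BGR2HSV);
    cv::cvtColor(currFrame, currHsv, cv::COLOR_BGR2HSV);

    int histSize = 30;                             // bins over hue 0-180
    float hrange[] = {0, 180};
    const float* ranges[] = {hrange};
    int channels[] = {0};                          // hue channel only
    cv::Mat roi = prevHsv(box), hist;
    cv::calcHist(&roi, 1, channels, cv::Mat(), hist, 1, &histSize, ranges);
    cv::normalize(hist, hist, 0, 255, cv::NORM_MINMAX);

    cv::Mat backProj;
    cv::calcBackProject(&currHsv, 1, channels, hist, backProj, ranges);
    cv::meanShift(backProj, box,
                  cv::TermCriteria(cv::TermCriteria::EPS | cv::TermCriteria::COUNT,
                                   10, 1));
    return box;                                    // shifted location estimate
}
```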
There are other tracking methods such as the particle filter, which has been proven to
work for multiple people tracking [87]. We do not adopt it for two reasons. One is that
traffic lights are small objects in a high resolution image which has 1920 × 1080
pixels. This makes it difficult for the particles to locate the traffic lights accurately
and may need a large number of particles, which is computationally expensive. The
other reason is that the weights of each particle cannot be evaluated effectively. The
assumption that the detection confidence of each particle is higher when it gets closer
to the actual position of the light is not true. The lights are so small in the image
and a small deviation may lose the target completely. In addition, our detector is
trained based on images of complete traffic lights, thus it cannot distinguish partial
lights from backgrounds nor give higher confidence values for them.
For data association, [87] employs greedy data association and observes results
similar to those of the Hungarian algorithm [88]. In our approach, the tracking
result is simply associated with the detection result when they overlap. The reason is
that the traffic lights are nearly motionless between adjacent frames and mean shift performs well in
locating them. In addition, unlike people detection, traffic lights do not intersect with
each other, so there is no need to consider the identity switch problem, which
makes it easier to associate the tracking and detection results. Once the association
is established, the detected regions, rather than the regions found by mean shift itself,
are used to seed mean shift tracking in the next frame. This solves the scale problem of
mean shift, as the detected regions are considered more accurate than the tracking
result.
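The association rule itself is simple enough to sketch in a few lines. Boxes are assumed to be (x, y, w, h) tuples; the greedy loop and the zero-overlap test mirror the description above, but the helper names are hypothetical.

```python
def overlap(a, b):
    """Intersection area of two boxes given as (x, y, w, h)."""
    ix = max(0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    return ix * iy

def associate(tracks, detections):
    """Greedy association: each track adopts the first detection it overlaps.
    The detected box then seeds mean shift tracking in the next frame."""
    matches, used = {}, set()
    for t_id, t_box in tracks.items():
        for d_idx, d_box in enumerate(detections):
            if d_idx not in used and overlap(t_box, d_box) > 0:
                matches[t_id] = d_idx
                used.add(d_idx)
                break
    return matches
```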
Building trajectories of the objects can overcome occasional misses, but still can-
not filter out false positives. For example, if a rear light of a car is misclassified
as a red traffic light in several frames, its trajectory is very likely to be built by
multi-object tracking algorithms. However, the time series data for each object can
be obtained from online multi-object tracking. Since the time series data consist of
classification results over time, they can be used to generate the final output using
forecasting and time series analysis.
5.3.3.3 Forecasting
Given the previous detection or recognition results of a target, the estimate of its
current state is the final output. Such a process is called forecasting and time series
analysis. Multi-object tracking algorithms focus on building the trajectories and
pay little attention to filtering out false positives. The idea here is that the accumulated
classification results of a false object often show different patterns from those
of a true object, which can be used to filter out false positives. This is based on the
assumption that the detector can distinguish true positives from
false positives to some extent, at least better than random guessing. Otherwise, it is
impossible to filter out the false positives. Several methods can be used to address the
false positive problem. In [87], a tracker is only initialized in certain regions of the
image, and is deactivated or terminated when there is no associated detection for a
certain number of frames. Tracklet confidence is introduced in [89], which is influenced
by factors such as length, occlusion, and the affinity between tracking and detection.
In this chapter, we employ a simple forecasting technique after online multi-object
tracking, aiming at stabilizing the imperfect output of traffic light detection and
recognition. For each object, there is a binary time series where 1 denotes that the
detection result is true and 0 otherwise. The simple moving average (SMA) of the
time series is then calculated. Let $n$ be the window size of the SMA, $b_i$ the value
of the time series in the $i$-th frame, and $S_m$ the SMA value in frame $m$; then
$$S_m = \frac{b_{m-(n-1)} + b_{m-(n-2)} + \cdots + b_{m-1} + b_m}{n} \quad (5.21)$$
or alternatively
$$S_m = S_{m-1} - \frac{b_{m-n}}{n} + \frac{b_m}{n} \quad (5.22)$$
This can be interpreted as $S_m$ being propagated from $S_{m-1}$ while the oldest
value in the sliding window is replaced with the newest one. $S_m$ is used to determine
whether the object is considered positive, and a threshold $t$ determines the
final output $b_m$ as
$$b_m = \begin{cases} 1 & S_m \ge t \\ 0 & S_m < t \end{cases} \quad (5.23)$$
When $b_m$ is positive, a majority voting scheme is used to determine the type of the
traffic light. The history labels of this particular light are cast as votes into the
corresponding bins, and the bin with the most votes gives the type of the traffic light.
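A compact sketch of this stabilization step is given below. It combines the moving average of Eq. (5.21) (computed over however many frames are available, which coincides with the modified average of Eq. (5.24) before the window fills), the threshold test of Eq. (5.23), and the majority vote; the window size and threshold values are illustrative assumptions.

```python
from collections import Counter, deque

class LightStabilizer:
    """Sliding-window stabilizer following Eqs. (5.21)-(5.23); n = 5 and
    t = 0.5 are assumed example values, not the dissertation's settings."""
    def __init__(self, n=5, t=0.5):
        self.window = deque(maxlen=n)   # last n binary results b_i
        self.labels = deque(maxlen=n)   # recent class labels of this light
        self.t = t

    def update(self, detected, label=None):
        self.window.append(1 if detected else 0)
        if detected and label is not None:
            self.labels.append(label)
        s_m = sum(self.window) / len(self.window)  # SMA S_m, Eq. (5.21)
        if s_m >= self.t and self.labels:          # threshold test, Eq. (5.23)
            # Majority vote over the history labels decides the light type
            return Counter(self.labels).most_common(1)[0][0]
        return None                                # suppressed / negative
```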
5.3.3.4 Minimizing delays
Forecasting and time series analysis usually introduce delays. As the window size $n$
grows, the delays become more severe. The delay at the head of a trajectory helps
avoid picking up false positives, because false positives are expected to be occasional
and inconsistent; however, picking up true positives slowly produces misses or false
negatives. On the other hand, the delay at the tail of a trajectory helps avoid
dropping true positives, because true positives are expected to be consistent with
only minimal and temporary errors; however, dropping false positives slowly produces
erroneous output and increases the total number of false positives in the sequence. The
delays must be balanced so that their side effects are minimized while their useful
functions are not compromised.
At the head of a trajectory, a dynamic threshold and a modified moving average are
employed. In frame $m$, the moving average $S_m$ is modified as
$$S_m = \begin{cases} \dfrac{b_{m-(n-1)} + b_{m-(n-2)} + \cdots + b_{m-1} + b_m}{n} & m \ge n \\[6pt] \dfrac{b_1 + b_2 + \cdots + b_{m-1} + b_m}{m} & m < n \end{cases} \quad (5.24)$$
and the threshold $t_m$ is set with a positive constant $\alpha$ as

$$t_m = \begin{cases} t & m \ge n \\ t + \alpha\left(1 - \dfrac{m}{n}\right) & m < n \end{cases} \quad (5.25)$$
At the beginning, the threshold is high, and it drops gradually as more frames become
available. The output from the first $n$ frames is suppressed because there is insufficient
information to make a reliable decision. In a video at 30 FPS, 5 frames correspond
to about 167 ms. According to [90], the reaction time of a human driver is over a second,
so such delays are acceptable. As a result, a true object detected with high confidence is picked
up quickly, while false positives can still be filtered out.
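The threshold schedule of Eq. (5.25) is a one-liner; the value of α below is an assumed example, included only to show how the threshold starts at t + α and decays linearly to t by frame n.

```python
def dynamic_threshold(m, n, t, alpha=0.3):
    """Threshold schedule of Eq. (5.25); alpha = 0.3 is an assumed value."""
    return t if m >= n else t + alpha * (1.0 - m / n)
```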
At the tail of a trajectory, an object that no longer exists needs to be dropped
quickly. Traffic lights may change their states or move out of the image during the
tracking process. A state transition is sudden: usually at most one frame
shows both lights on or both off, indicating that the transition is taking place.
In many cases this particular frame does not even exist, so it is unreliable for telling when
the transition occurs. However, traffic lights are nearly motionless between adjacent frames, so
the last valid position of a currently off light is still useful. When a transition happens,
it can be determined whether a newly detected light of a different color belongs to the
same traffic light holder; the transition is thereby identified and the expired
information is dropped. On the other hand, when positive detections of a light near
the edge of the image are lost for a few consecutive frames, the object is dropped to
avoid erroneous output. Occlusion is not considered in this chapter, because it is not
safe to predict the state of a light without actually seeing it completely.
5.4 Performance Evaluation
5.4.1 Detection and recognition
Fig. 5.11 shows an example frame with detected traffic lights. Here two metrics
named precision and recall are used, where
$$\text{precision} = \frac{\text{true positives}}{\text{true positives} + \text{false positives}} \quad (5.26)$$

$$\text{recall} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}} \quad (5.27)$$
The true positives (TP) are samples that belong to this class and are correctly
recognized as this class. The false positives (FP) are samples that do not belong to this class
but are incorrectly recognized as this class. The false negatives (FN) are samples
that belong to this class but are erroneously recognized as other classes.
Figure 5.11: All traffic lights are detected and recognized correctly in the frame.
The true positives here must be both detected and recognized correctly. A detected
but misclassified light does not provide the correct identity of the actual light, which is a
false negative. Meanwhile, it provides a false identity of another type of light, which
is a false positive. Therefore, a detected but misclassified light is counted as both a
false positive and a false negative. For example, if a red left arrow light is detected
but recognized as a red circular light, then the number of false negatives and the
number of false positives are both incremented by 1. Table 5.3 shows the results
of the test sequences with different configurations, such as using HOG or PCANet,
with or without tracking. It is clear that the PCANet outperforms HOG and tracking
Table 5.3: Test results of 17 sequences that contain traffic lights

| Seq. ID | HOG (TP/FN/FP, Precision, Recall) | HOG + Tracking (TP/FN/FP, Precision, Recall) | PCANet (TP/FN/FP, Precision, Recall) | PCANet + Tracking (TP/FN/FP, Precision, Recall) |
|---|---|---|---|---|
| 1 | 182/0/9, 95.3%, 100% | 162/12/13, 92.6%, 93.1% | 182/0/6, 96.8%, 100% | 162/12/6, 96.4%, 93.1% |
| 2 | 179/1/13, 93.2%, 99.4% | 171/1/4, 97.7%, 99.4% | 180/0/13, 93.3%, 100% | 172/0/15, 92.0%, 100% |
| 3 | 143/4/48, 74.9%, 97.3% | 135/4/8, 94.4%, 97.1% | 145/2/3, 98.0%, 98.6% | 135/4/0, 100%, 97.1% |
| 4 | 140/4/10, 93.3%, 97.2% | 132/0/0, 100%, 100% | 139/5/3, 97.9%, 96.5% | 132/0/0, 100%, 100% |
| 5 | 102/210/0, 100%, 32.7% | 154/150/0, 100%, 50.7% | 298/14/0, 100%, 95.5% | 304/0/0, 100%, 100% |
| 6 | 211/0/51, 80.5%, 100% | 186/17/41, 81.9%, 91.6% | 211/0/42, 83.4%, 100% | 186/17/32, 85.3%, 91.6% |
| 7 | 411/17/15, 96.5%, 96.0% | 420/0/11, 97.4%, 100% | 428/0/6, 98.6%, 100% | 420/0/0, 100%, 100% |
| 8 | 136/16/6, 95.8%, 89.5% | 420/0/11, 97.4%, 100% | 428/0/6, 98.6%, 100% | 144/0/0, 100%, 100% |
| 9 | 302/3/374, 44.7%, 99.0% | 297/0/128, 69.9%, 100% | 303/2/99, 75.4%, 99.3% | 297/0/37, 88.9%, 100% |
| 10 | 168/9/14, 92.3%, 94.9% | 169/0/10, 94.4%, 100% | 140/37/6, 95.9%, 79.1% | 160/9/5, 97.0%, 94.7% |
| 11 | 325/23/18, 94.8%, 93.4% | 306/30/22, 93.3%, 91.1% | 329/19/2, 99.4%, 94.5% | 314/22/3, 99.1%, 93.5% |
| 12 | 218/62/33, 86.9%, 77.9% | 232/28/11, 95.5%, 89.2% | 211/69/33, 86.5%, 75.4% | 201/59/29, 87.4%, 77.3% |
| 13 | 67/3/5, 93.1%, 95.7% | 54/8/17, 76.1%, 87.1% | 66/4/1, 98.5%, 94.3% | 54/8/17, 76.1%, 87.1% |
| 14 | 485/33/83, 85.4%, 93.6% | 510/0/144, 78.0%, 100% | 493/25/34, 93.5%, 95.2% | 510/0/13, 97.5%, 100% |
| 15 | 282/43/21, 93.1%, 86.8% | 295/10/7, 97.7%, 96.7% | 280/45/0, 100%, 86.2% | 271/34/0, 100%, 88.9% |
| 16 | 231/11/44, 84.0%, 95.4% | 230/4/35, 86.8%, 98.3% | 201/41/19, 91.4%, 83.1% | 220/14/16, 93.2%, 94.0% |
| 17 | 186/0/144, 56.4%, 100% | 178/0/110, 61.8%, 100% | 186/0/12, 93.9%, 100% | 178/0/1, 99.4%, 100% |
| Total | 3586/439/879, 80.3%, 89.1% | 3612/253/548, 86.8%, 93.45% | 3752/273/276, 93.1%, 93.2% | 3698/167/168, 95.7%, 95.7% |
technique improves the performance. The results are not perfect due to the limited
amount of training data and occasional quality issues in the captured video, as shown in
Fig. 5.10.
5.4.2 False positives evaluation
The number of false positives is evaluated over several traffic-light-free sequences,
as shown in Table 5.4. Again, PCANet outperforms HOG, and the tracking technique
improves the performance. The number of false positives increases rapidly if there
are mis-recognized objects: at a video frame rate of 30 FPS, a single mis-recognized
object produces 30 false positives per second.
The false positives are not eliminated completely in our proposed method because
of the trade-off between precision and recall. Eliminating more false positives
may cause more false negatives, raising precision but lowering recall, or
vice versa. Reference [27] argues that false-positive green lights are dangerous and
Table 5.4: Number of false positives in traffic-light-free sequences

| Seq. ID | HOG (No. / per frame) | HOG + Tracking (No. / per frame) | PCANet (No. / per frame) | PCANet + Tracking (No. / per frame) |
|---|---|---|---|---|
| 18 | 150 / 0.2381 | 12 / 0.0190 | 39 / 0.0619 | 0 / 0 |
| 19 | 45 / 0.0776 | 35 / 0.0603 | 56 / 0.0966 | 26 / 0.0448 |
| 20 | 11 / 0.0264 | 0 / 0 | 18 / 0.0433 | 12 / 0.0288 |
| 21 | 127 / 0.2309 | 23 / 0.0418 | 37 / 0.0673 | 9 / 0.0164 |
| 22 | 280 / 0.3689 | 125 / 0.1647 | 40 / 0.0527 | 6 / 0.0079 |
| 23 | 179 / 0.0590 | 85 / 0.0280 | 105 / 0.0346 | 80 / 0.0264 |
| Total | 792 / 0.1327 | 280 / 0.0469 | 295 / 0.0494 | 133 / 0.0223 |
should be eliminated as much as possible, yielding 99% precision and 62% recall.
While this argument is reasonable for practical applications, we do not perform
such adjustments in this chapter. Instead, we demonstrate highly accurate and well-balanced
precision and recall results to validate our proposed approach, as well as the
performance improvements brought by the introduction of PCANet and tracking.
5.5 Discussion
5.5.1 Comparison with related work
Table 5.5 compares several recent papers on traffic light detection and recognition.
However, it is difficult to compare them directly, because different testing data and
different evaluation metrics were used. There are benchmarks for object detection
and image classification like ImageNet [91], but no benchmark has yet been created
for multi-class traffic light detection and classification. Researchers use their own
collected data in their respective papers. Some papers [25–27] utilize the information
other than images, such as GPS data and prior knowledge of traffic light locations.
Some focus on a specific type of traffic lights, while others try to solve multiple colors
and types at the same time. These factors make it difficult for us to compare their
performance appropriately.
On the other hand, the efficiency is also hard to compare, since the image sizes in
these papers vary. A higher resolution camera can provide clear images of traffic
lights when they are farther away, whereas a faraway traffic light may appear as only
a few pixels in a lower resolution image. With higher resolution, the system may
detect a traffic light slightly earlier, giving the driver additional time to respond;
however, a larger image size leads to higher computational cost and longer
processing time. Another factor is that different hardware platforms were used in
their implementations, such as desktop computers and on-board systems. Additional
hardware modules may also be involved such as GPS and inertial measurement unit
(IMU) [26].
5.5.2 Limitation and plausibility
This chapter presents a prototype system that can effectively detect several common
types of traffic lights in a vertically aligned setting. We would like to emphasize that
the proposed system is extensible. The ROI selection can be modified for other
types of traffic lights, such as horizontally aligned lights, and the multi-class classifier
can be retrained if sufficient data are provided. We are confident that, with some
modification, the proposed system can be extended to detect all types of traffic lights
and even serve other pattern recognition tasks.
Varying lighting conditions, color distortion, motion blur, and scene variance may
compromise the system performance in the real world. Thus the robustness of the
Table 5.5: Results of several recent works on traffic light detection

| Paper | Year | Method | Light types | Image size | Timing | Performance |
|---|---|---|---|---|---|---|
| Our approach | 2016 | PCANet; multi-object tracking | Green circular; red circular; green arrow; red arrow | 1920×1080 | 3 Hz | Precision 95.7%; recall 95.7% |
| [23] | 2014 | Spotlight detection; adaptive template matching; multiple model filter; single object tracking | Green circular; red circular; amber circular | - | - | Average accuracy 97.6%; false alarms ignored in detection |
| [28] | 2014 | Image processing; hidden Markov models | Green circular; red circular; amber circular | 648×488 | 25 frames per second | Overall detection rate 98.33% and 91.34% in different scenarios |
| [24] | 2014 | Fast radial symmetry transform | Red circular; amber circular | 240×320 | Most time-consuming part ~1.82 s | Precision 84.93%; recall 87.32% |
| [25] | 2013 | Filtering scheme with GPS information | Green circular; red circular | 720×480 | 15.7 ms per frame | Precision 88.2%; recall 81.5% |
| [26] | 2011 | Traffic light mapping and localization using GPS information; several probabilistic stages | Green circular; red circular; amber circular | 1.3 megapixel | Real-time; 15 Hz frame input | Accuracy: 91.7% |
| [27] | 2011 | Traffic light mapping and localization using GPS information; onboard perception system | Green circular; red circular; amber circular; green arrow; red arrow; amber arrow | 2040×1080 | 4 Hz | Precision 99%; recall 62% |
trained model is a key factor in addition to detection accuracy. The robustness of
our trained models can be improved by training with more data collected under all
kinds of conditions using different cameras. Researchers in machine learning often
focus on investigating better algorithms, but sometimes getting more data
beats a clever algorithm [92]. However, detecting traffic lights in severe weather or
at night may require different algorithms or even additional sensors, and little
research has been done on such topics. This will be part of our future work as more
data become available.
The processing time depends on the image size as well as the number of candidates
in an image. The image size in our dataset is 1920×1080, which is considerably larger
than in most of the other papers in Section 5.5.1. Our implementation is currently a
single-threaded version running at approximately 3 Hz on a CPU. It can be
accelerated by using multiple CPU threads, GPUs, or FPGA hardware. Previously,
we successfully employed GPUs to accelerate a traffic sign detection system
in [93] and a fast deep learning system in [94]. The most time-consuming part is the
PCANet classification, which has been accelerated on an FPGA in our latest work [95].
Since the proposed system is based on a camera sensor, its reliability is directly
affected by the video quality. Many factors can affect the output
images, such as the camera sensor, its configuration, and post-processing procedures;
an example of the data quality problem is shown in Fig. 5.10. In addition, the
proposed method is not expected to work at night. Traffic lights at night appear
in different ways depending on the camera and its configuration: there may be a halo
effect around the lights, or the lights may appear white at the center with only thin
colored rings at the edges. A solution for one camera may not be suitable for another.
Therefore, we decided not to investigate the nighttime problem in this work.
5.6 Conclusions
In this chapter, we propose a system that can detect multiple types of green and red
traffic lights accurately and reliably. Color extraction and blob detection are applied
to locate the candidates with proper optimization. A classification and validation
method using PCANet is then used for frame-by-frame detection. A multi-object tracking
method and a forecasting technique are employed to improve accuracy and produce
stable results. As an additional contribution, we build a traffic light dataset from
videos captured by a camera mounted behind the windshield. This dataset has
been released to the public for computer vision and machine learning research and is
available online at http://computing.wpi.edu/Dataset.html.
Chapter 6
Pedestrian Detection for
Autonomous Vehicle Using
Multi-spectral Cameras
Pedestrian detection is a critical feature of autonomous vehicles and advanced driver
assistance systems. This chapter presents a novel instrument for pedestrian detection
that combines stereo vision cameras with a thermal camera. A new dataset for vehicle
applications is built from data recorded by the test vehicle while driving on city roads. Data
received from multiple cameras are aligned using trifocal tensor with pre-calibrated
parameters. Candidates are generated from each image frame using sliding windows
across multiple scales. A reconfigurable detector framework is proposed, in which fea-
ture extraction and classification are two separate stages. The input to the detector
can be the color image, disparity map, thermal data, or any of their combinations.
When applying to convolutional channel features, feature extraction utilizes the first
three convolutional layers of a pre-trained convolutional neural network cascaded with
an AdaBoost classifier. The evaluation results show that it significantly outperforms
the traditional histogram of oriented gradients features. The proposed pedestrian
detector with multi-spectral cameras can achieve 9% log-average miss rate. The ex-
perimental dataset is made available at http://computing.wpi.edu/dataset.html.
6.1 Introduction
Automatic and reliable detection of pedestrians is an important function of an au-
tonomous vehicle or advanced driver assistance system (ADAS). Research works on
pedestrian detection are heavily depended on data, as different data and methods may
yield different evaluation results. The most commonly used sensor in data collection is
a regular color camera, and many datasets have been built such as the INRIA person
dataset [9] and the Caltech Pedestrian Detection Benchmark [10]. Thermal cameras
have also been considered lately, and different methods of pedestrian detection were
developed based on the thermal data [44]. It is worth investigating whether the meth-
ods developed from one type of sensor data are applicable to other types of sensors.
A method may no longer work once the nature of the data has changed; e.g., finding
hot objects by thresholding intensity values in a thermal image is not applicable
to a regular color image. Some methods, such as gradient and shape based feature
extraction, may still be applicable since an object has similar silhouettes in both color
and thermal images. In addition, data from different sensors may contain complementary
information, and combining them may result in better performance. Multiple
cameras can form stereo vision, which provides additional disparity and depth infor-
mation. An example of combining stereo vision color cameras and a thermal camera
for pedestrian detection can be found in [56].
The data collection environment is also very important. Unlike static cameras in
surveillance applications, cameras mounted on a moving vehicle may observe much
more complex backgrounds and pedestrians at varying distances. This calls for
pedestrian detection algorithms different from those used in surveillance camera applications.
To use multiple sensors on a vehicle, a cooperative multi-sensor system needs to be
designed, and new algorithms that can coherently process multi-sensor data need to
be investigated. The contributions of this chapter are listed as follows:
1. A multi-spectral camera instrument is designed and assembled on a moving
vehicle to collect data for pedestrian detection.
2. A new dataset for multi-spectral pedestrian detection is built from on-road
driving data. These data contain many complex scenarios that are challenging
for detection and classification.
3. We propose a machine learning based algorithm for pedestrian detection by
combining stereo vision and thermal images. Evaluation results show satisfac-
tory performance.
The rest of the chapter is organized as follows. Section 6.2 describes our instrumental
setup for data collection. In Section 6.3, we propose a framework that combines stereo
vision color cameras and a thermal camera for pedestrian detection using different
feature extraction methods and classifiers. Performance evaluations are presented in
Section 6.4, followed by further discussion in Section 6.5 and conclusions in Section
6.6.
6.2 Data Collection and Experimental Setup
6.2.1 Data Collection Equipment
To collect on-road data for pedestrian detection, we design and assemble a custom
test equipment rig. This design enables the data collection system to be mobile on
the test vehicle while maintaining calibration between data collection runs. The
completed system can be seen in Figure 6.1.
The ZED stereo vision camera from Stereolabs is chosen to provide color
images as well as disparity information. The ZED camera captures high resolution
side-by-side video containing synchronized left and right video streams, and can
create a disparity map of the environment in real time using the graphics processing
unit (GPU) of the host computer. Furthermore, an easy-to-use SDK is provided,
which allows for camera control and output configuration. In addition, the on-board
cameras are pre-calibrated and come with known intrinsic parameters, which makes
image rectification and disparity map generation easier.
The thermal camera is a FLIR Vue Pro, a long-wavelength infrared
(LWIR) camera. It is an uncooled vanadium-oxide microbolometer
with a 640 × 512 resolution at a full 30 Hz, paired with a 13 mm germanium
lens providing a 45° × 35° field of view (FOV). This IR camera has a wide −20 °C to
+50 °C operating range, which allows for rugged outdoor use. The thermal camera also
provides Bluetooth wireless control and video recording via its on-board microSD
card, as well as an analog video output.
Both the stereo vision and thermal cameras must remain fixed relative to each other
for consistency of data collection. A threaded rod is custom cut to length and each
end is threaded into the respective camera's tripod mounting hole. This provides
Figure 6.1: Instrumentation setup with both thermal and stereo cameras mounted on the roof of a vehicle.
a rigid connection between the color and thermal cameras. An electrical junction
box is utilized as an appropriately sized, waterproof enclosure that provides high impact
resistance. The top lid is replaced with an impact-resistant clear acrylic sheet so
that the stereo vision cameras can be situated safely behind it. A circular hole is
cut into the top lid so the thermal camera lens can fit through, mounted via the
lens barrel. This is essential, as even clear acrylic would block most, if not all, of the
IR spectrum used by the thermal camera.
The mounting system is designed, modeled, and built utilizing aluminum extru-
sions. The entire structure is completely portable and can be mounted to any vehicle
with a ski rack. The aluminum extrusions can sit between the front and back ski
rack hold-downs. Cable management is also crucial in our design, as
long cables are needed for communication between the laptop inside the vehicle and
the cameras on the roof. To avoid interference and safety issues, the cables must run
down the back of the vehicle, through the trunk, and into the vehicle cabin, which
requires approximately 20 feet of cable. This creates an issue for the ZED stereo vision
camera, as it operates on the high-speed USB 3.0 protocol, which allows a maximum
length of only 10 feet due to signal degradation and loss. To resolve this issue, an active
USB extension cable is used. The four cables terminating at the camera
setup are wrapped together with braided cable sleeves to prevent tangling and ensure
robustness.
An analog frame grabber is employed to capture the real-time analog output of
the IR camera instead of recording directly to the on-board microSD card. This
ensures proper synchronization between the thermal camera and the stereo vision cameras.
With the analog frame grabber, we are able to capture precisely at 30 FPS. AVI files are
generated using software provided with the frame grabber, and these AVI files are
then converted into image sequences.
6.2.2 Data Collection and Experimental Setup
Our dataset is made available online at http://computing.wpi.edu/dataset.html.
The data are collected while driving on city roads. Highway driving data are not
collected since pedestrians are hardly seen on highways. A total number of 58 data
sequences are extracted from approximately three hours of driving on city roads
across multiple days and lighting conditions. There are 4330 frames in total, in which
a person or multiple people are in clear view and un-occluded, similar to the Caltech-
USA reasonable set [36]. However, unlike the Caltech-USA reasonable set, we do not
discard small samples. In fact, more than half of the pedestrian samples in our dataset
are no more than 50 pixels in height due to image resolution and their distances to
cameras, which make our dataset more challenging. Each frame contains the stereo
color images, thermal image and disparity map. Since cameras have different angle
of view and field of view, the 58 usable sequences are rather short, ensuring the
pedestrians are within the view of all cameras. Furthermore, video frames without
any pedestrians are not included in our dataset.
6.3 Proposed Method
6.3.1 Overview
Figure 6.2 shows the flowchart of our proposed pedestrian detection method. Dispar-
ity data are generated from stereo color data. Thermal data are obtained from the
thermal cameras and reconstructed according to the point registration using trifocal
tensor. Instead of concatenating the features of different data sources and training a
single classifier, feature extraction and classification are performed independently for
each data source before the decision fusion stage. The decision fusion stage uses the
confidence scores of the classifiers, along with some additional constraints to make
the final decision. The proposed detector system can be reconfigured using differ-
ent feature extraction and classification methods, such as HOG with SVM or CCF
with AdaBoost. The decision fusion stage can utilize information from one or mul-
tiple classifiers. The performance of different configurations can be evaluated and
compared.
6.3.2 Trifocal tensor
The three cameras have different angles of view and fields of view, making point
registration (pixel-level alignment) essential for windowed detection across
multi-spectral images. A simple overlay with fixed pixel offsets does not work because
every object has its own offset values depending on its distance to the camera. Therefore,
the trifocal tensor [56, 96] is used for pixel-level alignment over the color and thermal
images. The trifocal tensor $\mathcal{T}$ is a set of three $3 \times 3$ matrices, denoted
$\{\mathbf{T}_1, \mathbf{T}_2, \mathbf{T}_3\}$ in matrix notation, or $T_i^{jk}$ in tensor notation [96], with two contravariant
and one covariant indices. The idea of the trifocal tensor is that, given a point
correspondence across the three views $\mathbf{x} \leftrightarrow \mathbf{x}' \leftrightarrow \mathbf{x}''$, there is a relation
$$[\mathbf{x}']_{\times} \left( \sum_i x^i \mathbf{T}_i \right) [\mathbf{x}'']_{\times} = 0_{3 \times 3}. \quad (6.1)$$
One method to compute the trifocal tensor T is by using the normalized linear
Figure 6.2: Framework of the proposed pedestrian detection method.
algorithm. Given a point–point–point correspondence $\mathbf{x} \leftrightarrow \mathbf{x}' \leftrightarrow \mathbf{x}''$, there is a relation

$$x^i x'^j x''^k \varepsilon_{jqs} \varepsilon_{krt} T_i^{qr} = 0_{st}$$

where 4 out of the 9 equations are linearly independent over all choices of $s$ and $t$. Therefore,
at least 7 point–point–point correspondences are needed to compute the 27 elements
of the trifocal tensor. The trifocal tensor can be computed from a set of equations
of the form $A\mathbf{t} = 0$, using the algorithm for the least-squares solution of a homogeneous
system of linear equations.
Given a correct correspondence $\mathbf{x} \leftrightarrow \mathbf{x}'$, it is possible to determine the corresponding
point $\mathbf{x}''$ in the third view without reference to image content. It can be
written as $x''^k = x^i l'_j T_i^{jk}$ and obtained using the trifocal tensor and the fundamental
matrix $F_{21}$, where the line $\mathbf{l}'$ goes through $\mathbf{x}'$ and is perpendicular to $\mathbf{l}'_e = F_{21}\mathbf{x}$.
Both the trifocal tensor and the fundamental matrix $F_{21}$ can be pre-computed, and only
need to be computed once as long as the placement of the cameras remains unchanged.
An alternative method is epipolar transfer, $\mathbf{x}'' = (F_{31}\mathbf{x}) \times (F_{32}\mathbf{x}')$; however,
this method has the serious problem that it fails for all points lying on the trifocal
plane. Therefore, the trifocal tensor is a practical solution for point registration. In our
experiment, the cameras are calibrated using a checkerboard. The pattern is made of
different materials, making it visible to both the color and thermal cameras. Figure 6.3
shows the use of the trifocal tensor in aligning color and thermal images.
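The point-transfer step can be sketched in NumPy as follows, assuming the tensor is stored as a 3×3×3 array with T[i] holding $\mathbf{T}_i$; the perpendicular-line construction follows the standard formulation in [96], and all names are illustrative.

```python
import numpy as np

def transfer_point(x1, x2, T, F21):
    """Transfer a correspondence (x1 <-> x2, given as (u, v) pixels) into the
    third view using the trifocal tensor T (3x3x3, T[i][j][k] = T_i^{jk})
    and the fundamental matrix F21 between views 1 and 2."""
    x1h = np.array([x1[0], x1[1], 1.0])
    # Epipolar line of x1 in the second view: l'_e = F21 x
    le = F21 @ x1h
    # Line l' through x2 perpendicular to l'_e
    lp = np.array([le[1], -le[0], -x2[0] * le[1] + x2[1] * le[0]])
    # x''^k = x^i l'_j T_i^{jk}
    x3h = np.einsum('i,j,ijk->k', x1h, lp, T)
    return x3h[:2] / x3h[2]  # back to inhomogeneous coordinates
```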
6.3.3 Sliding windows vs. region of interest
There are two main methods for locating a pedestrian: sliding window detection and
Region of Interest (ROI) extraction. In sliding window detection, a small
(a) Color image. (b) Thermal image.
(c) Reconstructed thermal image using trifocal tensor and disparity information.
(d) Red-cyan anaglyph of color and reconstructed thermal images.
Figure 6.3: Proper alignment of color and thermal images using trifocal tensor.
sliding window is applied over the entire image, often at different scales, to perform
an exhaustive search. Each window is classified, followed by some post-processing such
as bounding box grouping. ROI extraction first finds potential candidates
using pre-processing techniques based on, e.g., color or pixel intensity, and then
filters out negatives from these candidates using a classifier or other constraints. It is
often more efficient, as the number of candidates is much smaller than the number of
sliding windows.
For pedestrian detection, both ROI extraction and sliding window detection have
been employed in the literature. The sliding window detection method is a universal
approach but is computationally expensive. On the other hand, ROI extraction is
often used for thermal images, because pedestrians are often hotter than the surrounding
environment, and the ROIs are segmented based on pixel intensity values.
However, we find that ROI extraction on thermal images does not always work
well. The assumption that pedestrians are hotter is not always true, for various
reasons. For instance, a pedestrian wearing heavy layers of clothing does not appear
with distinctively high pixel intensity values in a thermal image, and thus cannot
be located by simple morphological operations. As another example,
a road surface exposed to intense sunlight can have a higher temperature
than a human body. Although false positives introduced by hot objects such
as vehicle engines can be filtered out in later steps, the loss of true positives becomes a
serious problem. As a result, we feel the sliding window detection method is more
reliable in these complex scenarios. The classifier can analyze the windowed
samples thoroughly and make an accurate decision. Figure 6.4 shows some examples
of our pedestrian samples in color images and corresponding thermal images, where
rows 1 and 3 are color samples corresponding to the thermal samples in rows 2 and 4,
Figure 6.4: Examples of pedestrians in color and thermal images.
respectively.
However, the sliding window detection method also has its drawbacks besides
the much higher computational cost. The total number of windows in an image often
reaches $10^5$ or more, so even a fair classifier with a False Positives Per Window (FPPW)
rate of $10^{-4}$ would still result in 10 False Positives Per Image (FPPI). Since 2009, the evaluation
metric has been changed from FPPW to FPPI [38]. To address this problem, many state-of-the-art
CNN-based classifiers have been proposed in recent years. An alternative
approach is to combine information from additional sensors. Our proposed approach
of multi-spectral cameras is along this line.
6.3.4 Detection
In this chapter, we only compare the HOG and CCF methods for the task of pedestrian
detection, for the following reasons:
1. The HOG method has always been included as a baseline in the Caltech-USA dataset.
Among the 44 methods reported on the Caltech-USA dataset [38], 30 of them
employed HOG or HOG-like features.

2. CCF is one of the best-performing methods reported on the Caltech-USA
dataset as of May 2016. The idea of combining low-level CNN features with
a boosting forest model is promising.

3. The goal of this chapter is to investigate the combination of multi-spectral
cameras and its improvement on pedestrian detection. We release our dataset publicly,
so other researchers can continue this study and discover better solutions
in the future.
The HOG features have been widely used in object detection. A windowed sample is
divided into overlapping blocks, with cells within each block. The histograms of unsigned
gradients over several directions are computed in all blocks and concatenated
as features. The HOG features are often combined with an SVM and the sliding
window method for detection at different scale levels.
At the training stage, the positive samples are manually labeled. The initial
negative samples are randomly selected from the training images as long as they do not
overlap with the positive samples. All samples are scaled to a standard window size
of 20 × 40 for training; the smallest sample in our data is 11 × 22. After
the initial training, the detector is tested on the training set and the false positives
are added back to the negative sample set. These false positives are often called hard
negatives, and this procedure is known as hard negative mining. It
can be repeated a few times until the performance improvement becomes marginal.
Once the detector is trained, it is ready to perform detection on the test dataset
and give a decision score for each window. Each frame with original size of 640× 480
is scaled into different sizes. The detector with a fixed size of 20× 40 is then applied
to the scaled images to find pedestrians of various sizes at different locations in a
frame.
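A minimal multi-scale sliding-window sketch is shown below, assuming scikit-image and a scikit-learn LinearSVC trained as described above; the scale set, stride, and score threshold are illustrative assumptions rather than the exact values used in our experiments.

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.feature import hog
from skimage.transform import rescale

WIN_W, WIN_H = 20, 40  # the standard window size used in the text

def hog_feat(window):
    # 5x10 cells of 4x4 px, 2x2-cell blocks, 9 orientations (assumed layout)
    return hog(window, orientations=9, pixels_per_cell=(4, 4),
               cells_per_block=(2, 2))

def detect(frame_rgb, svm, scales=(1.0, 0.75, 0.5), stride=4, thresh=0.0):
    """Slide a fixed 20x40 detector over rescaled copies of the frame."""
    gray = rgb2gray(frame_rgb)
    hits = []
    for s in scales:
        img = rescale(gray, s)
        for y in range(0, img.shape[0] - WIN_H + 1, stride):
            for x in range(0, img.shape[1] - WIN_W + 1, stride):
                feat = hog_feat(img[y:y + WIN_H, x:x + WIN_W])
                score = svm.decision_function([feat])[0]
                if score > thresh:
                    # Map the hit back to original-image coordinates
                    hits.append((int(x / s), int(y / s),
                                 int(WIN_W / s), int(WIN_H / s), score))
    return hits  # bounding-box grouping would follow
```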
CCF uses low-level features from a pre-trained CNN model, cascaded with a
boosting forest model such as Real AdaBoost [97] as the classifier. The lower-level
CNN features are considered generic object descriptors that contain richer
information than channel features, while the boosting forest model replaces
the remaining parts of the CNN. Thus, we avoid training a complete end-to-end CNN
model for a specific object detection application, which would require substantial
computation, storage, and time. In our experiment, we apply settings similar to those
described in [43], except for the scale parameters and the number of octaves, in
order to detect faraway pedestrians as small as 20×40 pixels. The conv3-3
layer of the VGG-16 model is used for feature extraction. The windowed sample size
in [43] is 128×64 instead of 20×40; the feature dimension of our 20×40 sample is 1296. The
training samples for CCF come from the training stage of HOG, similar to the method
described in [43], which uses aggregated channel features (ACF) [98] to select training
samples for CCF. Caffe [99] is used for CCF feature extraction on a GPU-based
computer platform. At the test stage, the CCF method running on the GPU platform is
considerably faster than the HOG method, but it requires more memory and disk
space for data storage.
6.3.5 Information fusion
The idea of combining the information from the color image, disparity map, and thermal
data for decision making is referred to as information fusion. One approach is to
concatenate the features together [56]. A single classifier can be trained on the concatenated
features, and the final decisions for the test instances can be obtained from
that classifier. This approach has the disadvantage that classifier training becomes
challenging as the feature dimension increases. Furthermore, if a new type of
feature needs to be added or an existing feature needs to be removed, the classifier
needs to be retrained, which is time consuming.
An alternative approach to information fusion is to employ multiple classifiers;
an example can be found in [100]. Each classifier makes a decision based on a certain
type or subset of features, and the final result is obtained by a decision fusion
technique such as majority voting or the sum rule [101]. This approach has the advantage
that the structure of the system is reconfigurable: without retraining the classifiers,
adding or removing different types of features becomes very convenient. Therefore,
we choose the latter approach to make our system reconfigurable, so that various
settings and methods can be evaluated. Specifically, an SVM is used at the decision fusion
stage, and its inputs are the confidence scores from the classifiers in the previous stage, which
is more appropriate than commonly used statistical decision fusion methods in the
case of multi-source data [102, 103]. The data from different sources are often not
equally reliable, and neither are the classifiers; the confidence scores must therefore be
weighted when obtaining the final decision from information fusion.
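The fusion stage then reduces to training a second-stage SVM on stacked confidence scores. The sketch below is one possible configuration with three sources; scikit-learn is an assumption, as the dissertation does not name a library.

```python
import numpy as np
from sklearn.svm import LinearSVC

def fit_fusion(scores_color, scores_disp, scores_thermal, labels):
    """Train the decision-fusion SVM on per-source classifier scores.
    Each argument is a 1-D array with one score per training candidate."""
    X = np.column_stack([scores_color, scores_disp, scores_thermal])
    return LinearSVC(C=1.0).fit(X, labels)

def fuse(fusion, s_color, s_disp, s_thermal):
    """Weighted combination of the three scores; positive => pedestrian."""
    x = np.array([[s_color, s_disp, s_thermal]])
    return fusion.decision_function(x)[0]
```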
6.3.6 Additional constraints
6.3.6.1 Disparity-size
Besides the extracted features from an image frame, additional constraints can be
incorporated into the decision fusion stage to further improve the detector perfor-
mance. An example is the disparity-size relationship. Figure 6.5 shows the disparity
and height relationship of the positive samples in the form of a linear regression line
$d = \begin{bmatrix} h & 1 \end{bmatrix} B$, where $d$ is the mean disparity, $h$ is the height of the sample, and $B$ is
a $2 \times 1$ coefficient matrix. Given the mean disparity $d$ and height $h$ of a sample,
the residual $r = \left| d - \begin{bmatrix} h & 1 \end{bmatrix} B \right|$ can be used to estimate whether the sample is
possibly a pedestrian or not.
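Both the fit and the residual test are straightforward least-squares operations; the sketch below assumes NumPy, and the residual tolerance r_max is an assumed example value.

```python
import numpy as np

def fit_disparity_size(heights, mean_disparities):
    """Least-squares fit of d = [h 1] B over the positive samples."""
    A = np.column_stack([heights, np.ones(len(heights))])
    B, *_ = np.linalg.lstsq(A, mean_disparities, rcond=None)
    return B  # 2x1 coefficient vector

def plausible(d, h, B, r_max=2.0):
    """Accept a candidate whose disparity-height residual is small enough."""
    r = abs(d - (h * B[0] + B[1]))
    return r <= r_max
```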
From Figure 6.5, we can see that a number of samples have very small mean disparity
and lie far below the regression line. This is because the disparity information is not
accurate when an object is far away from the camera. In fact, the stereo vision camera
we use automatically clamps the disparity value at a certain distance. Objects beyond
that distance yield zero disparity, which makes the estimation for small samples
inaccurate.
6.3.6.2 Road horizon
During detection, a few reasonable assumptions can be made to filter out more false
positives while retaining the true positives. The assumptions vary depending on
the application and may involve color, shape, position, etc. One assumption here is that
pedestrians stand on the road, i.e., the lower bound of a pedestrian must be below the
road horizon, which can be automatically detected in an image. Such a
simple constraint may or may not improve the detector performance, and
Figure 6.5: The relationship between the mean disparity and the height of an object.
experiments should be carried out to determine its effectiveness.
6.4 Performance Evaluation
There are a total of 58 labeled video sequences in our dataset. We use 39 of them for
training and the remaining 19 for testing. Figure 6.6 shows the performance of different
settings, including disparity map, color image, thermal data, and their combinations,
all based on HOG features. Generally, the more types of information are used, the
better the performance. The disparity-only setup performs the worst. Color
image only is better, followed by the combination of color and disparity. Note
that the thermal-only setup outperforms the combination of color and disparity; the
heat signature of pedestrians seems more recognizable in thermal images. The combination
of color, thermal, and disparity information achieves the best performance,
with about 36% log-average miss rate (MR).
Figure 6.7 shows the performance of the HOG features augmented with the disparity-size
information and the road horizon constraint. The road horizon constraint improves the log-average
MR by about 5%. Although adding the disparity-size information alone provides
little improvement, the combination of both provides nearly 7% improvement in
log-average MR.
Figure 6.8 shows the performance of different settings using CCF. The performance
of disparity only is the worst, while the thermal image performs very well. Interestingly,
disparity does not provide any improvement when combined
with color or thermal; in fact, combining with disparity lowers the performance.
This is because the CCF implementation accepts 8-bit images as input, so
precision of the disparity values is lost. In comparison, CCF outperforms HOG
Figure 6.6: Performance of different input data combinations, all using HOG features.
Figure 6.7: Performance improvement by adding disparity-size and road horizon constraints.
Figure 6.8: Performance of different input data combinations, all using CCF.
in almost all settings except for disparity. The best performance comes from CCF
with the combination of color and thermal, which achieves 9% log-average MR. We
also attempted to add the disparity-size information and road horizon constraint
to the CCF method, but the performance changes are negligible.
6.5 Discussion
Although the combination of multi-spectral cameras can improve the performance in
pedestrian detection, the performance is still highly dependent on the instrument.
Our thermal camera has a resolution of 640 × 480, which is relatively low.
Figure 6.9: A pedestrian is embedded in the shadow of a color image.
To accommodate the resolution and FOV of the thermal camera, the color cameras have to be
set to the same resolution. In addition, color cameras are sensitive to lighting
conditions, so the image quality sometimes cannot be guaranteed. Figure
6.9 shows an example, with bounding boxes drawn on the detected pedestrian in both
the color and thermal images. The thermal image obviously provides much better
information about the presence of the pedestrian, who is hardly identifiable in the
color image due to the shadow.
Although thermal images seem dominant in our experiment, their reliability
still needs improvement. Figure 6.10 shows a thermal image taken on a hot sunny
day. The two circled pedestrians are not brighter than their surroundings,
contradicting the assumption of distinct thermal intensity made in many existing
research works. In this case, methods or operations based on pixel intensity values,
such as intensity thresholding or head recognition using hot spots, become unreliable.
On the contrary, some shape or gradient based methods, such as the HOG and CCF
described in this chapter, may still perform well.
Figure 6.10: An example thermal image with two pedestrians.
6.6 Conclusions
In this chapter, a novel pedestrian detection instrument is designed using both
thermal and RGB-D stereo cameras. Data are collected from on-road driving, and an
experimental dataset is built with pedestrians labeled as ground truth. A reconfigurable
multi-stage detector framework is proposed. Both HOG and CCF based detection
methods are evaluated using the multi-spectral dataset with various combinations of
thermal, color, and disparity information. The experimental results show that CCF
significantly outperforms the HOG features. The combination of color and thermal
images using the CCF method yields the best performance of 9% log-average miss rate.
For future work, other advanced feature extraction and classification methods will
be considered to further improve the pedestrian detector performance.
Chapter 7
End-to-End Learning for Lane
Keeping of Self-Driving Cars
Lane keeping is an important feature for self-driving cars. This chapter presents an
end-to-end learning approach to obtain the proper steering angle to maintain the
car in the lane. The convolutional neural network (CNN) model takes raw image
frames as input and outputs the steering angles accordingly. The model is trained
and evaluated using the comma.ai dataset, which contains the front view image frames
and the steering angle data captured when driving on the road. Unlike the traditional
approach that manually decomposes the autonomous driving problem into technical
components such as lane detection, path planning and steering control, the end-to-end
model can directly steer the vehicle from the front view camera data after training.
It learns how to keep in lane from human driving data. Further discussion of this
end-to-end approach and its limitations is also provided.
7.1 Introduction
Lane keeping is a fundamental feature for self-driving cars. Despite the many sensors
installed on autonomous cars, such as radar, LiDAR, ultrasonic sensors, and infrared
cameras, ordinary color cameras are still very important for their low cost and
ability to capture rich information. Given an image captured by a camera, one of the
most important tasks for a self-driving car is to find the proper vehicle control input
to maintain it in lane. The traditional approach divides the task into several parts,
such as lane detection [104, 105], path planning [106, 107], and control logic [108, 109],
and these parts are often researched separately. The lane markings are usually detected by
image processing techniques such as color enhancement, Hough transform, edge
detection, etc. Path planning and control logic are then performed based on the lane
markings detected in the first stage. The performance of this approach relies heavily
on the feature extraction and interpretation of the image data. Often, the manually
defined features and rules are not optimal, and errors can accumulate from one
processing stage to the next, leaving the final result inaccurate. On the other
hand, an end-to-end learning approach for self-driving cars has been demonstrated
in [70] using convolutional neural networks (CNNs). End-to-end learning takes
the raw image as input and outputs the control signal automatically. The model
is self-optimized based on the training data, and there are no manually defined rules.
These are the two major advantages of end-to-end learning: better performance
and less manual effort. Because the model is self-optimized on the data to
give maximum overall performance, the intermediate parameters are self-adjusted to
be optimal. Moreover, there is no need to detect and recognize certain categories of
pre-defined objects, to label those objects during training, or to design control logic
Figure 7.1: Comparison between the traditional approach and end-to-end learning.
based on observations of these objects. As a result, less manual effort is required.
Figure 7.1 compares the traditional approach with the end-to-end learning approach.
This chapter presents the end-to-end learning approach to produce the proper
steering angle from camera image data aimed at maintaining the self-driving car in
lane. The model is trained and evaluated using comma.ai dataset, which contains
image frames and the steering angle data captured when driving. The rest of the
chapter is organized as follows. Section 7.2 provides the details of our implementa-
tion, including data pre-processing and CNN architecture. The evaluation results are
presented in Section 7.3, followed by discussions in Section 7.4 and conclusions in
Section 7.5.
7.2 Implementation Details
7.2.1 Data pre-processing
The data used in this chapter are from the comma.ai driving dataset. The dataset
contains 7.25 hours of driving data, including 11 video clips recorded at 20 Hz and some
other measurements such as steering angle, speed, GPS data, etc. The image frames
Figure 7.2: An example of image frame from the dataset.
are of size 320 × 160 pixels and are cropped from the original video frames, which are
not provided by the dataset. An example frame from the dataset is
shown in Figure 7.2. For lane keeping, only the image frames and the steering angle
data are used. The steering angle data are recorded at 100 Hz and are aligned
with the image frames using the alignment stamps provided by the dataset. In case
multiple steering angle instances correspond to the same image frame, their
average is used to form a one-to-one mapping between each image frame and its
corresponding steering angle.
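The averaging step can be sketched as follows, where the dataset's alignment stamps are simplified to an index array `cam_ptr` that maps each 100 Hz steering sample to its frame — an assumption about the data layout made purely for illustration.

```python
import numpy as np

def align_angles(cam_ptr, angles, n_frames):
    """One-to-one mapping between frames and steering angles by averaging
    all 100 Hz samples assigned to the same frame. cam_ptr[k] is the frame
    index of the k-th steering sample (simplified layout)."""
    sums = np.zeros(n_frames)
    counts = np.zeros(n_frames)
    np.add.at(sums, cam_ptr, angles)
    np.add.at(counts, cam_ptr, 1)
    counts[counts == 0] = np.nan  # frames with no sample stay undefined
    return sums / counts
```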
Before training the CNN model, the data need to be further processed. First of
all, to simplify the problem, driving at night is not considered in this chapter, so
all four clips recorded at night are excluded. Second, the data contain many
scenarios, such as driving forward, changing lanes, making turns, driving on straight or
curved roads, and driving at normal speed or moving slowly in a traffic jam. To train
a lane keeping model, the data that meet the following criteria are selected: driving
at normal speed, no lane changes or turns, and both straight and curved roads. After
data selection, the remaining data come from 7 video clips with a total of about 2.5
hours. Finally, five video clips containing 152K frames are used for training and two
video clips containing 25K frames are used for testing.
During the training stage, one important issue needs to be addressed: the training
data are highly unbalanced, as shown in Figure 7.3. Since highway roads
tend to be mostly straight and curved roads make up only a small percentage,
a model trained on these unbalanced data may tend to drive straight while
still achieving a low loss. To remove this bias, the curved-road data, defined as
frames where the absolute steering angle is larger than five degrees, are up-sampled
by a factor of five. The data are then randomly shuffled before training.
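A sketch of this balancing step (NumPy arrays assumed; the factor of five and the five-degree threshold come from the text, while the seed is arbitrary):

```python
import numpy as np

def balance(frames, angles, factor=5, deg_thresh=5.0, seed=0):
    """Up-sample curved-road frames (|angle| > 5 deg) by `factor`, then
    shuffle. Repeating each curved index (factor - 1) extra times yields
    factor copies in total."""
    curved = np.abs(angles) > deg_thresh
    idx = np.concatenate([np.arange(len(angles))] +
                         [np.where(curved)[0]] * (factor - 1))
    rng = np.random.default_rng(seed)
    rng.shuffle(idx)
    return frames[idx], angles[idx]
```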
7.2.2 CNN implementation details
The CNN architecture that we propose is shown in Figure 7.4; it is similar to
those in [70] and [110] but much simpler. The loss layer used during training is
the Euclidean loss, which computes the sum of squared differences between the predicted
and ground truth steering angles: $\frac{1}{2N}\sum_{i=1}^{N} \left\| x_i^1 - x_i^2 \right\|_2^2$. The CNN model
is trained using Caffe [99].
The CNN model consists of three convolutional layers and two fully connected
layers. The input is a raw RGB image, and the output is the predicted steering
angle for that image. The first convolutional layer uses a 9×9 kernel with a 4×4
stride, and the following two convolutional layers use a 5×5 kernel with a 2×2 stride. The
convolutional layers serve mainly for feature extraction and the fully connected layers
mainly for steering angle prediction, but there is no clear boundary between
them since the model is trained end-to-end. Dropout layers are used to prevent
Figure 7.3: Histogram of steering angles in training data.
Figure 7.4: The proposed CNN architecture for deep learning.
over-fitting. There are no pooling layers because the feature maps are small.
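For reference, the described architecture can be sketched in PyTorch as below. The kernel sizes and strides follow the text, while the channel widths, the fully connected width, and the dropout rate are assumptions, since the dissertation trains the equivalent model in Caffe.

```python
import torch
import torch.nn as nn

class LaneKeepNet(nn.Module):
    """Three convolutional layers + two fully connected layers (a sketch)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=9, stride=4), nn.ReLU(),   # 9x9 kernel, 4x4 stride
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),  # 5x5 kernel, 2x2 stride
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),  # 5x5 kernel, 2x2 stride
        )
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Dropout(0.5),
            nn.LazyLinear(512), nn.ReLU(),  # first fully connected layer
            nn.Dropout(0.5),
            nn.Linear(512, 1),              # predicted steering angle
        )

    def forward(self, x):  # x: (batch, 3, 160, 320) raw RGB frames
        return self.head(self.features(x))

def euclidean_loss(pred, target):
    """Euclidean loss from the text: (1/2N) * sum of squared differences."""
    return 0.5 * ((pred - target) ** 2).mean()
```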
The CNN architecture, as well as the hyper-parameters used, can be further tuned
through more experiments. Overall, the CNN architecture is not the major concern
of this work, for two reasons. First, we feel the dataset is too small.
Although the training and testing data contain more than 170K frames, equal to
about 2.5 hours of driving, this is actually insufficient to train a generic lane keeping model
that uses raw images as input. The appearance of roads can be very complex due
to different curves, road markings, lighting conditions, etc. In fact, the proportion
of data for curved roads is relatively small, with only about 20 minutes of driving.
For training a model that outputs a continuous steering angle, this
amount of data is not sufficient. The other reason is that tuning a model requires
a proper evaluation metric, which is also limited by the current dataset. The details of
the evaluation method will be discussed in Section 7.4.
7.3 Evaluation
The trained model is evaluated using two test video clips containing 25K frames. For
each frame, the predicted steering angle is compared with the ground truth value. The
histogram of the error is shown in Figure 7.5. The standard deviation of the error is
3.26 degrees and the mean absolute error is 2.42 degrees. To better understand
the errors, the predicted angle and ground truth angle are compared in each frame
and the results can be visualized.
Figure 7.6 shows an example frame along with the ground truth angle and pre-
dicted angle. The projected paths for both angles are plotted using the same approx-
imation as in [110]. The path using ground truth angle is in blue and the path using
Figure 7.5: Histogram of error of predicted steering angles during test.
Figure 7.6: An example frame with the ground truth angle, predicted angle, and their respective projected paths.
predicted angle is in green. The simulated steering wheels for both angles are also
drawn for better visualization.
Figure 7.7 also visualizes the feature maps from the first two convolutional layers.
The top-right 4 × 4 cells are results from the first convolutional layer, and the
bottom 4 × 8 cells are results from the second convolutional layer. As expected, the
convolutional layers automatically learned to extract the lane markings as features
during training. The model does not use any manually defined or hand-crafted
features, since it learns useful features from the data automatically.
Figure 7.7: Visualization of the results from first two convolutional layers.
7.4 Discussion
7.4.1 Evaluation
As an evaluation metric, computing the difference between the ground truth angle and the
predicted angle is actually questionable. Firstly, the ground truth provided by the human
driver is not globally optimal: the human driver cannot keep the vehicle in the
center of the lane all the time. As long as the vehicle stays in lane, the predicted
angles are fine and do not have to be exactly the same as the human driver's. Secondly,
both the vehicle movement and the steering control are continuous, so frame-by-frame
evaluation is not appropriate. Consider two scenarios on a straight road.
In the first scenario, the steering angle turns to the left a bit, then quickly
turns to the right a bit to keep the vehicle in the lane, and this process
repeats. In the second scenario, the steering angle turns to the left a bit and stays at
119
that angle for a period of time, then it turns to the right a bit and stays for a while. In
the second scenario, the vehicle actually would drive out of the lane most of the time.
In these two scenarios, the histogram of the errors, mean absolute error, standard de-
viation of the error are the same. However, the first scenario is fine while the second
one is completely unacceptable. Figure 7.8 shows an example of the disadvantage of
this type of frame by frame evaluation. The frames and their predicted angles are
from the test dataset. These 5 frames are put in chronological order. We can see that
the middle frame has a huge error of 10 degrees. However, the recorded ground truth
does not seem correct in this frame. By looking at the previous and following frames,
we find out that the ground truth in this frame is transitioning from left to right.
This example shows that evaluating the error frame by frame is not appropriate.
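To make the argument concrete, the following is a toy numerical sketch (not from the dissertation) of the two scenarios: both error sequences have exactly the same histogram, mean, and standard deviation, yet integrating the heading error over time yields very different lateral drift.

import numpy as np

# Two steering-error sequences (degrees) with identical statistics.
dt, v = 0.1, 20.0                      # assumed: 10 Hz frames, 20 m/s speed
err_a = np.tile([+2.0, -2.0], 50)      # scenario 1: alternates every frame
err_b = np.repeat([+2.0, -2.0], 50)    # scenario 2: holds each error for 5 s

for name, err in [("oscillating", err_a), ("holding", err_b)]:
    # Approximate lateral drift by integrating v * sin(heading error).
    drift = np.cumsum(v * dt * np.sin(np.deg2rad(err)))
    print(name, err.mean(), err.std(), np.abs(drift).max())
# Both report mean 0.0 and std 2.0, but the holding scenario drifts about
# 3.5 m out of the lane while the oscillating one stays within 0.07 m.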
To solve this problem, a simulator is needed to provide feedback based on the predicted angle. The simulator should be able to generate the frames and simulate the vehicle movement realistically. The frames should be generated according to the vehicle position and orientation. One way to do so is using a virtual game engine, as described in [11, 111]. The advantage of using a virtual engine is that there are built-in physics simulation and 3D rendering mechanisms. The vehicle movement simulation and frame generation can be done realistically. Besides, the ground truth information is very rich in the virtual world. Information such as vehicle position, orientation and velocity can be easily obtained, as can information about other objects. The disadvantage is that the frames are computer-generated graphics, not real images captured from driving in the real world. Although they look very realistic with state-of-the-art game engines, the details and variations they provide still cannot match data from the real world.
Figure 7.8: An example of the disadvantage of frame by frame evaluation with 5 consecutive frames: the error in the middle frame is false.

Alternatively, we can generate the next frames according to control inputs using
recorded frames, i.e., data captured in the real world. This can be achieved by either a learning approach [112] or a 3D image projection approach [70]. The learning approach trains auto-encoders to embed road frames, and learns a transition model in the embedded space. The next few frames can be generated based on the current frame image and the current control inputs. On the other hand, the 3D image projection approach assumes the ground is a flat surface, and solves the 3D geometry [113] to generate the next frame based on the actual recorded frame, through the predicted camera shift and rotation. The camera shift and rotation can be obtained from vehicle movement simulation, which can be computed using vehicle kinematic or dynamic models [108, 109].
7.4.2 Data augmentation
Since we are not supposed to drive off the lane when recording, the data obtained from human driving lack an error correction process. The human driver is able to maintain the vehicle within the lane, but a model trained on such data is not robust to errors and the vehicle may slowly drift away. To train a model that can correct small errors such as vehicle shifts and rotations, error correction data must be provided during training. One solution is to perform data augmentation by randomly creating some shifts and rotations, which generates corresponding frames based on the 3D geometry described above. The corrective control input can be computed again using the vehicle kinematic or dynamic models. The comma.ai dataset does not contain the original sized frames or camera calibration parameters. Therefore the simulator and data augmentation are not included in this chapter. Our current work collects real-world data using multiple cameras. All the aforementioned techniques will be incorporated in future work.
7.5 Conclusions
This chapter presents the end-to-end learning approach to lane keeping for self-driving
cars that can automatically produce proper steering angles from image frames cap-
tured by the front-view camera. The CNN model is trained and evaluated using
comma.ai dataset, which contains image frames and the steering angle data captured
from road driving. The test results show that the model can produce relatively accurate steering of the vehicle. Further discussions on evaluation and data augmentation are also presented for future improvement.
Chapter 8
Building an Autonomous Lane
Keeping Simulator Using
Real-World Data and End-to-End
Learning
Autonomous lane keeping is an important safety feature for intelligent vehicles. This
chapter presents a state-of-the-art end-to-end learning method using a convolutional neural network (CNN) that takes front-view camera data as input and produces the
proper steering wheel angle to keep the vehicle in lane. A novel method of data augmentation is proposed using a vehicle dynamic model and vehicle trajectory tracking, which can create additional training data as if the vehicle drives off-lane at random displacements and orientations. Real-world driving data are recorded from three front-view cameras on the left, center, and right. A lane keeping simulator is built using the recorded data in conjunction with image projection and vehicle dynamics estimation. Experimental results demonstrate that the end-to-end learning method with
augmented data can achieve high accuracy for autonomous lane keeping and very low
failure rate. The simulator can serve as a platform for both training and evaluation
of vision-based autonomous driving algorithms. The experimental dataset is made
available at http://computing.wpi.edu/dataset.html.
8.1 Introduction
Lane keeping is a fundamental feature for intelligent and autonomous vehicles. Despite the many sensors installed on autonomous cars, such as radar, LiDAR, ultrasonic sensors and infrared cameras, ordinary color cameras are still very popular owing to their low cost and ability to capture rich information. Given the video images from the front-view camera, a vision-based lane keeping system can automatically output the proper steering angles to maintain the vehicle in lane. A traditional framework divides the task into several stages including lane detection [104, 105], path planning [106, 107] and control logic [108, 109]. Applying image processing techniques such as color enhancement, Hough transform and edge detection, the lane detection stage identifies the lane markings on the road. Path planning and control logic are then employed to provide the proper steering angle adjustment for the vehicle. In this approach, the performance of lane detection heavily relies on the feature extraction and interpretation of image data. Errors can also accumulate from one processing stage to the next, leaving the final control output less accurate.
In contrast, an end-to-end learning method has the advantages of better perfor-
mance and less manual effort. End-to-end learning for self-driving cars has been
successfully demonstrated in [70] using convolutional neural networks (CNNs), which
Figure 8.1: Comparison between the traditional framework and end-to-end learning.
takes the images from cameras as input and produces the vehicle control output automatically. The model is self-optimized based on the training data and does not need manually defined features. Users do not need to label detected objects and their categories during the training process. Figure 8.1 is a comparison between the traditional framework and the end-to-end learning approach for vision-based automatic lane keeping.
Although the approach of end-to-end learning for lane keeping is not new, the existing work has several deficiencies. For instance, the error difference between the recorded “ground truth” and the predicted steering angle is not the best evaluation metric. Since it is hardly possible for a human driver to maintain the vehicle perfectly in the center of the lane at all times, the recorded angles are not optimal. Thus, the predicted angles do not have to be exactly the same as the ground truth angles recorded from the human driving experience. It is more important to predict the position and orientation of the vehicle in the very next time step given the current vehicle speed and steering angle control. As long as the vehicle stays in lane, the steering angle is acceptable. By using a simulator, the effects of the control input can be simulated and monitored, therefore providing a more reliable evaluation metric.
Furthermore, we need to provide data to train the deep neural network to take appropriate steering angle actions when the vehicle drifts away from the center of the lane. However, the recorded driving data lack this type of action since it is unsafe to drive off the lanes during data collection. To solve this dilemma, we propose a data augmentation method based on a vehicle dynamic model and vehicle trajectory tracking. Given any displacement and orientation, the model can generate a projected trajectory and a sequence of steering angle controls. Correspondingly, we can also create the augmented front views using image projection based on the shifted location and orientation. Therefore, the system becomes a simulator that can not only generate augmented data for training the convolutional neural network but also be used as a platform to evaluate the performance of other vision-based lane keeping algorithms.
The main contributions of this chapter are listed as follows:
1. This chapter presents a simulator for vision-based autonomous lane keeping.
Although there are many recent works on lane keeping algorithms, it is hard to
compare and evaluate them. Built on the recorded driving data, this simulator
employs image projection, vehicle dynamics modeling, and vehicle trajectory
tracking to predict vehicle movement and its corresponding camera views. The
simulator can be used for both training and evaluation of lane keeping algo-
rithms.
2. An end-to-end learning method is proposed that can generate proper steering angles from front-view camera data to maintain the vehicle in lane. A highly effective end-to-end learning system is demonstrated using the aforementioned simulator. The CNN model trained with augmented data from
the simulator performs significantly better than the model trained with recorded
data only.
3. A completely new dataset for autonomous lane keeping is developed and made available at http://computing.wpi.edu/dataset.html. The dataset contains recorded video frames from three forward facing cameras (left, center, and right) as well as steering wheel angles and vehicle speed information.
The rest of the chapter is organized as follows. Section 8.2 provides the implemen-
tation details of our simulator, including image projection, vehicle dynamics, vehicle
trajectory tracking as well as the CNN architecture. The experiment and evalua-
tion results are presented in Section 8.3, followed by discussions in Section 8.4 and
conclusions in Section 8.5.
8.2 Building a Simulator
8.2.1 Overview
For evaluation of vision-based lane keeping algorithms, a simulator is needed to provide feedback based on the predicted angle. The simulator can generate image frames according to the vehicle position and orientation, and it can also simulate the vehicle movement given a steering angle input. Therefore, a simulator for self-driving cars has two important components: a graphic engine and a physics engine. The graphic engine utilizes the information of the surrounding environment, as well as the pose of the camera, to generate images. The physics engine simulates vehicle movement based on the input control actions. A virtual game engine usually contains both graphic and physics engines, and some autonomous driving simulators were built upon them [11, 111]. Vehicle
movement simulation and frame generation can be integrated into the game engine. Besides, the ground truth information is very rich in the virtual world. Information such as vehicle position, orientation and velocity can be easily obtained, as can information about other objects. Despite these advantages, a significant drawback of these virtual simulators is that the generated images are still quite different from real-world data. Although they look very realistic with advanced graphic techniques, the details and variations of virtual images still cannot match data from the real world. It is risky to train a model using virtual game engines and then deploy the model for real-world driving. It would be better to build a simulator from the real-world data.
Different camera views can be generated from recorded video frames by a learning approach [112] or a 3D image projection approach [70]. The learning approach trains auto-encoders to embed road frames, and learns a transition model in the embedded space. The next few frames can be generated based on the current frame image and the current control inputs. On the other hand, the 3D image projection approach assumes the ground is a flat surface, and solves the 3D geometry [113] to generate the next frame based on the actual recorded frame. The camera shift and rotation can be obtained from vehicle movement simulation, which can be estimated using vehicle kinematic or dynamic models [108, 109].
In our simulator, the image projection approach is employed for rendering the images. The CNN takes the image as input and the vehicle dynamics is used to simulate vehicle movement given the control action. Figure 8.2 shows the detailed operations of the simulator when testing the CNN-based lane keeping algorithm. The predicted position is constantly validated against the ground truth position. A failure is recorded if the error exceeds a threshold value. More importantly, the simulator can be very useful when training the neural network by providing a large amount of additional training data through augmentation. When using the simulator for training, the vehicle trajectory tracking replaces the CNN controller to provide the control actions that can gradually correct the initial position shift and/or orientation rotation. Practically, assuming an arbitrary shift and rotation of the vehicle from the ground truth, the vehicle trajectory tracking block can produce the proper steering angle control actions. Combined with the generated camera view from the image projection process, augmented data can be generated. Figure 8.3 shows the operation flow of the simulator in the training phase, during which many augmented data can be generated from each ground truth image by arbitrary shifts and rotations of the vehicle.
8.2.2 Image projection
Rendering the image according to the vehicle position and orientation is required by the simulator, in order to provide more instances for machine learning and a better evaluation metric. However, without using a gaming engine, data collected in the real world are sparse, often along a single trajectory as the car goes. These data themselves are far from enough to cover all possible positions and orientations. Therefore, these data must be transformed for an arbitrary position and orientation, using image projection based on 3D geometry. Given a point in world coordinates X_w = (x_w, y_w, z_w) and the corresponding point in image coordinates p = (p_1, p_2), the relations are
p^h = X_w^h M_{ex} M_{in}    (8.1)

X_w^h = c(x_w, y_w, z_w, 1), \qquad p^h = d(p_1, p_2, 1)

where p^h and X_w^h are 1 × 3 and 1 × 4 homogeneous coordinates, c and d are arbitrary nonzero constants, M_{ex} is the 4 × 3 extrinsic matrix and M_{in} is the 3 × 3 intrinsic matrix. The extrinsic matrix contains a rotation matrix and a translation vector, which define the camera's position and orientation in the world coordinates. Therefore the extrinsic matrix changes if the camera is shifted or rotated. The intrinsic matrix defines the transformation from camera coordinates to image coordinates, including parameters such as focal length, aspect ratio, and the location of the principal point. The intrinsic matrix stays the same even if the camera is shifted or rotated. Both matrices can be obtained through a calibration procedure.

Figure 8.2: The flowchart of the test phase.

Figure 8.3: The flowchart of the training phase, using original data and augmented data.
Given an image taken in the real world with known calibration parameters M_{ex}, M_{in} and its pixel coordinates p, the new pixel coordinates p' need to be found with a new extrinsic matrix M'_{ex} when the camera is shifted and rotated. The physical dimensions of the 3D scene are required in order to find the projection parameters. In the case of highway lane keeping simulation, we made the assumption that the ground surface is flat, i.e., z_w = 0. According to formula 8.1, the mapping of p to p' can then be obtained as follows:

X_w^h = p^h M_{in}^{-1} M_{ex}^{-1}    (8.2)

p'^h = X_w^h M'_{ex} M_{in}    (8.3)
Note that the lens distortion, if any, needs to be corrected before performing such image projection. Figure 8.4 shows some examples of transforming an original image according to the camera's virtual position and orientation. The additive black area on the generated image is usually not an issue for vehicle simulation, since the captured images from front-view cameras are often cropped to retain only the middle section as the region of interest.
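As a concrete illustration, the following is a minimal Python sketch (not the dissertation's implementation) of the ground-plane mapping in equations 8.2 and 8.3, assuming hypothetical calibration matrices in the row-vector convention of equation 8.1. With z_w = 0, the projection reduces to a 3 × 3 homography.

import numpy as np

def ground_homography(M_ex, M_in):
    # With z_w = 0, the third row of the 4x3 extrinsic matrix (which
    # multiplies z_w in the row-vector convention) drops out, leaving a
    # 3x3 homography from (x_w, y_w, 1) to homogeneous pixel coordinates.
    return M_ex[[0, 1, 3], :] @ M_in

def reproject(pixels, M_ex, M_in, M_ex_new):
    # Map recorded pixels to a shifted/rotated virtual camera pose,
    # following equation 8.2 (back-projection to the ground plane)
    # and equation 8.3 (re-projection with the new extrinsic matrix).
    H_old = ground_homography(M_ex, M_in)
    H_new = ground_homography(M_ex_new, M_in)
    p_h = np.hstack([pixels, np.ones((len(pixels), 1))])   # (N, 3)
    X_h = p_h @ np.linalg.inv(H_old)     # ground-plane points (eq. 8.2)
    p_new = X_h @ H_new                  # new pixel coordinates (eq. 8.3)
    return p_new[:, :2] / p_new[:, 2:3]  # de-homogenize

In practice, the composed 3 × 3 mapping can be handed to a standard image warping routine so that the whole frame is rendered at once rather than pixel by pixel.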
Another challenging task is ground surface estimation during calibration. To estimate the calibration parameters, especially M_{ex} in formula 8.1 with the assumption z_w = 0 for the ground surface, the three cameras used in our system need to be deployed on the vehicle and the world coordinates need to be established properly. When calibrating cameras in the lab, a checkerboard pattern is usually used, as shown in Figure 8.4. However, estimating the ground surface needs a very large pattern, which is hard to craft and deploy. In our experiment, a flat parking lot with existing markings is used for ground surface estimation. Physical dimensions of the markings are measured manually while the corresponding images are captured by the cameras installed on the vehicle. Figure 8.5 shows the selected points in the image taken by the center camera during the calibration. The physical locations of the cameras and the selected points in the world coordinates are also shown in Figure 8.5. Three cameras are installed on the left, center and right of the vehicle, all facing forward, because they provide a better field of view than a single camera. In fact, the camera nearest to the vehicle's virtual position is selected as the source in equations 8.2 and 8.3. Therefore, the generated images have better quality and less additive black area after projection.
Figure 8.4: Example of original image and generated images given arbitrary camera poses. (a) Original image. A checkerboard pattern on a flat surface. (b) Generated image as if the camera is shifted left by 50 mm. (c) Generated image as if the camera is rotated right by 15.25 degrees. (d) Generated image as if the camera is shifted left by 50 mm and rotated right by 15.25 degrees.

Figure 8.5: Camera calibration and ground surface estimation. (a) Selected points in the image taken by the center camera. (b) Cameras and selected points in the world coordinates.

8.2.3 Vehicle dynamics and vehicle trajectory tracking

According to [108], the bicycle vehicle dynamics shown in Figure 8.6 is captured by the following equations:
\dot{x} = v\cos\theta, \qquad \dot{y} = v\sin\theta, \qquad \dot{\theta} = \omega

\theta = \psi + \beta, \qquad \dot{\psi} = \frac{v}{l_r}\sin\beta, \qquad \dot{v} = a

\beta = \arctan\left(\frac{l_r}{l_f + l_r}\tan(\sigma_f)\right)

where P = (x, y, θ) ∈ R² × S¹ is the state of position and orientation, v and ω are the linear velocity and angular velocity, respectively, which are also the control inputs, a is the acceleration, and σ_f is the turning angle. l_f and l_r are the distances from the vehicle's mass center to the front and rear axles. In our test vehicle, we use the estimated values l_f = 1 m and l_r = 1.7 m.
Figure 8.6: A virtual bicycle vehicle dynamics.

The dynamics in Figure 8.6 are feedback linearized by introducing a nonlinear mapping from the current nonlinear system to a new linear system with a new state variable z = [x, y, \dot{x}, \dot{y}]^\top:

\dot{z} = Az + Bu, \qquad \frac{d}{dt}\begin{bmatrix} x \\ y \\ \dot{x} \\ \dot{y} \end{bmatrix} = A \begin{bmatrix} x \\ y \\ \dot{x} \\ \dot{y} \end{bmatrix} + Bu
where the state matrix and the input matrix are

A = \begin{bmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \end{bmatrix}, \qquad B = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 1 & 0 \\ 0 & 1 \end{bmatrix},

and the input vector in the new linear system is u = [\ddot{x}, \ddot{y}]^\top. After the feedback linearization, the whole problem is transformed into searching for the proper gain K for the linear system. To solve this optimal control problem, a Linear Quadratic Regulator (LQR) is used to acquire the optimal gain K. The quadratic cost is defined as the following:
J = \int_0^\infty (x^\top Q x + u^\top R u)\, dt    (8.4)

where Q and R are 4 × 4 and 2 × 2 identity matrices, and x and u are the state and the control effort, respectively. Practically, Q and R do not have to be identity matrices, only positive definite, and their entries can be tuned to achieve the required performance accordingly. Once the gain K is computed, the feedback control law and the ordinary differential equation (ODE) of the new linear system are described as follows:
differential equation (ODE) of the new linear system are described as follows:
e = z − zd
u = −Ke+ ud
e = (A−BK)e
z = zd + e
where e is the error between the true state and the desired state, K is the gain computed based on the cost defined in equation 8.4 with the A and B matrices, u ∈ R² is the input vector, u_d = (\ddot{x}_d, \ddot{y}_d) is the reference input given by the ground truth, and \dot{e}, \dot{z}, \dot{z}_d are the time derivatives of the error, state, and desired state, respectively. A is the 4 × 4 state matrix, and B is the 4 × 2 input matrix.

v = \dot{x}\cos\theta + \dot{y}\sin\theta    (8.5)

\omega = \frac{1}{v}(\dot{y}\cos\theta - \dot{x}\sin\theta)    (8.6)

The control input for the nonlinear system can then be calculated by remapping the new input variables of the linear system back to the original inputs of the nonlinear system, namely the linear velocity v and the angular velocity ω, as shown in equations 8.5 and 8.6.
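For concreteness, the following Python sketch (an illustration under the A, B, Q, R stated above, not the dissertation's code) computes the LQR gain for the feedback-linearized system and remaps the tracked state back to v and ω via equations 8.5 and 8.6.

import numpy as np
from scipy.linalg import solve_continuous_are

# State matrix A (4x4) and input matrix B (4x2) of the linearized system.
A = np.block([[np.zeros((2, 2)), np.eye(2)],
              [np.zeros((2, 2)), np.zeros((2, 2))]])
B = np.vstack([np.zeros((2, 2)), np.eye(2)])
Q, R = np.eye(4), np.eye(2)          # identity weights, as in equation 8.4

# Solve the continuous-time algebraic Riccati equation; K = R^-1 B^T P.
P = solve_continuous_are(A, B, Q, R)
K = np.linalg.inv(R) @ B.T @ P

def control(z, z_d, u_d):
    # Feedback control law u = -K e + u_d with e = z - z_d.
    return -K @ (z - z_d) + u_d

def remap_to_vehicle(z, theta):
    # Equations 8.5 and 8.6: recover v and omega from z = [x, y, xdot, ydot].
    x_dot, y_dot = z[2], z[3]
    v = x_dot * np.cos(theta) + y_dot * np.sin(theta)
    omega = (y_dot * np.cos(theta) - x_dot * np.sin(theta)) / v
    return v, omega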
The results in Figure 8.7 demonstrate the effectiveness and correctness of the vehicle trajectory tracking controller design. A vehicle with the feedback control law has the capability of converging to and following the desired trajectory, even when there is an initial error. At the beginning, owing to some errors between the predicted and actual orientations, the steering angle is positive and large, which helps the vehicle correct its orientation in a short time. After 2 seconds, the predicted orientation and the ground truth converge. The vehicle orientation does not change rapidly for the next few seconds, which matches the fact that the steering angle of the vehicle remains in a small range near zero.
Figure 8.7: Correction of vehicle's position and orientation using vehicle trajectory tracking. (a) Ground truth and predicted trajectory. (b) Ground truth and predicted orientation. (c) Ground truth and predicted steering wheel angle.
8.2.4 CNN implementation
Convolutional neural networks (CNNs) [40–42] have achieved impressive performance in image classification. In this chapter, learning the human driver's control is not a classification problem but a regression problem, therefore the loss layer during training is the Euclidean loss, which computes the sum of squares of differences between the predicted steering angle and the ground truth steering angle: \frac{1}{2N}\sum_{i=1}^{N} \lVert x_i^1 - x_i^2 \rVert_2^2, where N is the number of instances, x_i^1 is the i-th predicted value and x_i^2 is the i-th ground truth value. The CNN is used as a steering angle predictor given the input image. It does not take the entire image frame as input since only the center section is the region of interest for lane keeping. The images are cropped before being fed to the CNN, as shown in Figure 8.8. The proposed CNN architecture is shown in Figure 8.9, and it is based on the PilotNet [70, 72]. It has 5 convolutional layers and 3 fully-connected layers. There are no pooling layers because the feature maps are small. The convolutional layers are mainly for feature extraction and the fully connected layers are mainly for steering angle prediction, but there is no clear boundary between them since the model is trained end-to-end. Unlike the PilotNet, our input image size is 400 × 150 instead of 200 × 66, and the first convolutional layer uses a 4 × 4 stride and a 9 × 9 kernel instead of a 2 × 2 stride and a 5 × 5 kernel. The PilotNet system uses the vehicle's turning radius r as the steering command, and outputs the inverse turning radius 1/r to avoid infinite values when driving straight. Our CNN uses the steering wheel angle as the output, which is more intuitive. The proposed CNN model is trained using our own dataset on the Caffe [99] and Matlab software platforms.
Figure 8.8: An example of cropped image frame from the dataset.
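As an illustration of the architecture just described, below is a hedged PyTorch sketch (the dissertation's model was built on Caffe and Matlab; the layer widths here are assumptions in the spirit of PilotNet, not the exact configuration): 5 convolutional layers with a 9 × 9 stride-4 first layer, no pooling, 3 fully-connected layers with dropout, and a single regression output trained with the Euclidean (mean squared error) loss.

import torch
import torch.nn as nn

class LaneKeepingCNN(nn.Module):
    def __init__(self):
        super().__init__()
        # 5 convolutional layers; the 9x9 stride-4 first layer matches the
        # modification described above, the rest are assumed widths.
        self.features = nn.Sequential(
            nn.Conv2d(3, 24, kernel_size=9, stride=4), nn.ReLU(),
            nn.Conv2d(24, 36, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(36, 48, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(48, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        )
        # 3 fully-connected layers regressing one steering wheel angle.
        self.regressor = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(100), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(100, 50), nn.ReLU(),
            nn.Linear(50, 1),
        )

    def forward(self, x):            # x: (N, 3, 150, 400) cropped frames
        return self.regressor(self.features(x))

model = LaneKeepingCNN()
loss_fn = nn.MSELoss()               # Euclidean loss up to the 1/2 factor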
8.3 Experiment
8.3.1 Data collection
To capture images, three forward facing cameras are mounted on the dashboard of the car, from left to right. Because the cameras are not water-proof, installing them on top of the vehicle would be inappropriate. To avoid re-calibration each time, the cameras remain stationary once installed. Multi-thread programming and software triggers are used to synchronize the three cameras to capture images at 10 Hz. The shutter time is set to auto with an upper-bound value to avoid extremely low frame rates when the lighting condition is too dark. The image resolution is set to 1288 × 968, and captured images are stored as color image sequences. Meanwhile, the steering angle and speed information are recorded by accessing the CAN bus via the OBD-II port. The data from the OBD-II port are decoded by our customized program and then saved with time stamps, in order to synchronize with the image data. The steering wheel
angle decoded from the OBD-II port has a precision of 0.07 degree and the speed data has a precision of 1 km/h, or approximately 0.28 m/s. The steering wheel angle s needs to be converted to the vehicle's turning angle σ_f in Figure 8.6 by dividing by the steering ratio k, i.e., σ_f = s/k, where k has an estimated value of 17.8 in our experiment.

Figure 8.9: The CNN structure used, slightly modified from NVIDIA's PilotNet.
Figure 8.10 shows our data collection system on a vehicle, including three forward facing cameras, a USB hub, a laptop computer and an interface to the OBD-II port.
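A minimal sketch of the software-trigger synchronization described above is shown below (illustrative only; capture() is a hypothetical frame-grab call and the dissertation's implementation may differ). One thread per camera blocks on a shared barrier, and the main thread releases all three at each 10 Hz tick.

import threading, time

NUM_CAMS, PERIOD = 3, 0.1
barrier = threading.Barrier(NUM_CAMS + 1)
stop = threading.Event()

def camera_worker(cam_id):
    while True:
        barrier.wait()               # block until the common trigger fires
        if stop.is_set():
            break
        # capture(cam_id)            # grab one frame, store with timestamp

workers = [threading.Thread(target=camera_worker, args=(i,))
           for i in range(NUM_CAMS)]
for w in workers:
    w.start()
for _ in range(10):                  # one second of capture at 10 Hz
    tick = time.time()
    barrier.wait()                   # release all camera threads together
    time.sleep(max(0.0, PERIOD - (time.time() - tick)))
stop.set()
barrier.wait()                       # final release so workers can exit
for w in workers:
    w.join()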
The experimental data were collected on 7 occasions over 6 different days, approximately 1 hour each. Different lighting and weather conditions are included, such as sunny, cloudy and foggy, as shown in Figure 8.11. Night time driving is not included in our data. The collected data are then refined for the task of lane keeping. Recorded data that meet any of the following criteria are discarded: non-highway driving, speed lower than 40 mph, lane changes, extreme lighting conditions, equipment failures, and sequences shorter than 1 minute. After refinement, about 3 hours of driving data are valid. Among the 7 groups of collected data, 4 groups were used for training and the other 3 groups for testing. This is to prevent overlaps between training and test data. Overall, the training data contain 68,082 frames, nearly 2 hours at 10 Hz. The test data contain 32,053 frames, nearly 1 hour at 10 Hz. The training data sequences are randomly shuffled before being applied to the CNN model.

Figure 8.11: Example frames under different weather or lighting conditions. (a) Cloudy. (b) Shadowed. (c) Foggy. (d) Sunny.
8.3.2 Data augmentation
Ideally, the training dataset should contain some error correction scenarios such that the trained CNN model is capable of handling errors, so that the vehicle stays in the lane instead of drifting away. Such error correction data introduce initial errors into the vehicle's position and/or orientation, and then provide the proper control action to correct such errors and guide the vehicle back into the lane. The original data collected from highway driving lack such error correction data, because of the safety concerns of performing such dangerous maneuvers on the highway. Therefore, we propose to apply a data augmentation technique that can generate this type of error correction data virtually. This is one of the important benefits of building a simulator. Once the data are collected and the world coordinates established, it is possible to obtain the ground truth of the vehicle's position and orientation at any given time. For each frame, errors can be added manually to the vehicle's position and orientation. By using image projection based on 3D geometry, the augmented images can be generated accordingly. At the same time, the correct control action is provided by the vehicle trajectory tracking algorithm. Therefore, the augmented data can be used as part of the training data to improve the model's robustness. In our experiment, each frame is randomly augmented 10 times by shifting the vehicle position and changing its orientation. Figure 8.3 shows the entire process of data augmentation. Figure 8.12 shows examples of augmented images.
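Sketched below is the augmentation loop just described, with illustrative perturbation ranges taken from the examples in Figure 8.12. The names render and corrective_angle stand in for the image projection and trajectory tracking components above; all names here are assumptions for illustration.

import numpy as np

rng = np.random.default_rng(0)

def augment_frame(frame, pose, render, corrective_angle,
                  n=10, max_shift=0.5, max_rot=7.0):
    # Perturb one recorded frame n times with random lateral shifts (m)
    # and rotations (degrees); pair each rendered view with the steering
    # label produced by the trajectory tracking controller.
    samples = []
    for _ in range(n):
        d = rng.uniform(-max_shift, max_shift)
        r = rng.uniform(-max_rot, max_rot)
        image = render(frame, shift=d, rotation=r)           # projection
        angle = corrective_angle(pose, shift=d, rotation=r)  # LQR tracker
        samples.append((image, angle))
    return samples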
8.3.3 Evaluation using simulator
In our previous work [73], it was shown that the difference between the ground truth angle and the predicted angle is not an effective metric for evaluating the performance of lane keeping systems. Here we propose a new metric that measures the percentage of driving time during which the vehicle is in lane. Our simulator can be employed as an evaluation platform for autonomous lane keeping. The process flow of using the simulator for evaluation is illustrated in Figure 8.2.

Figure 8.12: Example of original image and augmented images given arbitrary vehicle poses. (a) Original image. (b) Augmented image as if the vehicle is shifted right by 0.5 m. (c) Augmented image as if the vehicle is rotated left by 7 degrees. (d) Augmented image as if the vehicle is shifted right by 0.5 m and rotated left by 7 degrees.
Given the initial steering angle provided by the CNN model, the vehicle position and orientation are updated by the vehicle dynamics. Subsequently, a front-view camera image is generated through image projection according to the current vehicle position and orientation. The new image is then fed to the CNN model and it produces the steering angle for the next time step. The same process repeats for all frames in a test sequence. At each time step, the position difference from the ground truth is calculated. For simplicity, the longitudinal difference is fixed to zero, and the horizontal shift is compared with a threshold value. If the horizontal shift is larger than the threshold, it is considered a lane keeping failure. The threshold is set to 1 meter in our experiment. For each failure occurrence, the next 60 frames are automatically marked as a manual driving period. All other frames without failure are considered autonomous driving. The final criterion is the percentage of autonomous driving time (autonomy):

A = \frac{t_a}{t_a + t_m}    (8.7)

where t_a and t_m represent the autonomous time and the manually controlled time, respectively. Figure 8.13 shows an example of the simulation results when comparing the vehicle positions with the ground truth. The steering angles are produced by the CNN model trained with data augmentation.
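The autonomy computation in equation 8.7 amounts to a few lines; a small sketch (assuming 10 Hz frames and the 60-frame manual-driving penalty per failure described above) is:

def autonomy(num_frames, failure_frames, penalty=60):
    # Mark the 60 frames after each failure as manually driven (eq. 8.7).
    manual = set()
    for f in failure_frames:
        manual.update(range(f, min(f + penalty, num_frames)))
    t_m = len(manual)              # manually controlled frames
    t_a = num_frames - t_m         # autonomous frames
    return t_a / (t_a + t_m)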
In our experiment, the CNNs trained with and without augmented data are both evaluated using the simulator, and the results are shown in Table 8.1. The error of position is only evaluated when the vehicle is in autonomous driving mode; the data during manually controlled time in simulation are not evaluated. The percentage of autonomous driving time using the model trained with augmented data is 98.32% with 9 failures, which is significantly better than the 82.09% and 98 failures without augmented data.

Table 8.1: Evaluation result using the simulator, with and without augmented data.

Augmented Data | Autonomy | No. of Failures | Error of Position (Meters)
               |          |                 | Mean   | Standard Deviation
Yes            | 98.32%   | 9               | 0.2179 | 0.1813
No             | 82.09%   | 98              | 0.2670 | 0.2071
In addition, the simulation results also show that the error of the steering wheel angle is not an effective metric for performance evaluation.
Figure 8.13: An example of the simulation result, produced by the CNN trained with data augmentation. (a) Overview of the trajectory in a test sequence. (b) Trajectory zoomed in on the black rectangle in (a). (c) Trajectory zoomed in on the black rectangle in (b).
The model trained with augmented data has a mean error of 0.3042 degrees and a standard deviation of 1.6029 degrees. The model trained without augmented data has a mean error of 0.3118 degrees and a standard deviation of 1.2043 degrees. One can hardly tell which model is better from the mean error and standard deviation of the steering angles.
The deployed simulator with the CNN predictor runs at approximately 13 frames per second (FPS). Considering the input data rate of 10 Hz, the end-to-end lane keeping system is able to run in real time. The hardware platform is a desktop computer with an Intel i5 3570K processor running at 3.4 GHz, 32 GB of DDR3 RAM and one NVIDIA GTX 1080 GPU.
8.4 Discussion
It is worth investigating the causes of some failures during evaluation. For example, a failure case is shown in Figure 8.14. The vehicle is moving out of the lane to the right because the front vehicle is changing lanes and the lane markings are partially blocked. Another case is shown in Figure 8.15, with a cast shadow on the road. In most cases, we believe the quality of the input data plays a role in those failures, which can be attributed to factors such as shadows on the road, extreme lighting conditions, camera exposure settings, etc. Because of the complicated scenarios in the real world, the robustness of a model needs to be fully examined prior to deployment. Therefore, a simulator built on real-world data becomes very useful.
Figure 8.14: An example of failure. The vehicle is going out of the lane to the right because another vehicle is changing lanes, and lane markings are partially blocked.

Figure 8.15: An example of failure. The vehicle is going out of the lane to the right because of unclear lane markings.
8.5 Conclusions
This chapter presents an autonomous driving simulator that is built on real-world data with recordings from three front-view cameras, steering wheel angles and vehicle speed information. A vehicle dynamic model and trajectory tracking are incorporated in the simulator to predict the vehicle movement. With proper calibration, the 3D image projection technique can be applied to generate updated front-view images at the current vehicle position and orientation. The simulator can be used for both training and evaluation of vision-based lane keeping algorithms. Moreover, an end-to-end learning lane keeping system is proposed using a CNN model to predict the steering angle from the front-view camera input. The CNN model trained with augmented data results in significantly better performance than using only the original recorded data, when measured by the percentage of autonomous driving time. This new real-world driving dataset is shared online and can benefit research and education in autonomous vehicle technology.
Chapter 9
Conclusions
This dissertation presents the design and implementation of a group of systems for
autonomous vehicles.
The real-time GPU-based traffic sign detection and recognition system is capable of detecting and recognizing 48 classes of traffic signs of any size in each image frame. The detection rate is about 91.69% and the recognition rate is about 93.77%. The system can process 27.9 fps video with the active pixels of a 1,628 × 1,236 resolution. Because each frame is processed individually, no information from previous frames is required. As part of our future work, information from previous frames will be considered for tracking traffic signs, which is expected to further improve the detection accuracy.
Two traffic light detection and recognition systems are presented. The first system
detects and recognizes red circular lights only, using image processing and SVM. The
performance is better than that of traditional detectors. The second system is more
complicated. It detects and classifies different types of traffic lights, including green
and red lights in both circular and arrow forms. Color extraction and blob detection
are applied to locate the candidates with proper optimization. A classification and
validation method using PCANet is then used for frame-by-frame detection. The
multi-object tracking method and forecasting technique are employed to improve the
accuracy and produce stable results. As an additional contribution, we build a traffic
light dataset from the videos captured via a camera mounted behind the windshield.
A novel pedestrian detection instrumentation is designed using both thermal and
RGB-D stereo cameras. Data are collected from on-road driving and an experimental
dataset is built with the bounding box labeling of pedestrians as the ground truth.
A reconfigurable multi-stage detector framework is proposed. Both HOG and CCF based
detection methods are evaluated using data from multi-spectral cameras and their
various combinations. The experimental result indicates that the approach using
CCF outperforms that involving HOG features. The combination of color and ther-
mal images using the CCF method can achieve the best performance of about 9%
log-average miss rate. For future work, other advanced feature extraction and classi-
fication methods will be considered to further improve the detector performance.
The lane keeping system employs an end-to-end learning approach to obtain the
proper steering angle for maintaining the car in the lane. The CNN model is trained
and evaluated using comma.ai dataset, which contains image frames and the steering
angle data captured from road driving. The test results show that the model can
produce relatively accurate steering of the vehicle. Further discussions on evaluation and data augmentation are also presented for future improvement.
A simulator for the lane keeping system is built using image projection, vehicle
dynamics and vehicle trajectory tracking. This is important for data augmentation
and evaluation. The test results show that the model trained with augmented data
using the simulator has better performance.
Our on-vehicle data collection systems are also implemented and deployed, and
our own datasets are built from recorded driving videos. These datasets are used in
most of our projects and can benefit other researchers in the future. Our experimental
datasets are available at http://computing.wpi.edu/Dataset.html.
Bibliography
[1] “Red light running,” Insurance Institute of Highway Safety. [Online]. Available:
http://www.iihs.org/iihs/topics/t/red-light-running/topicoverview
[2] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the
kitti vision benchmark suite,” in Conference on Computer Vision and Pattern
Recognition (CVPR), 2012.
[3] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The
kitti dataset,” International Journal of Robotics Research (IJRR), 2013.
[4] J. Fritsch, T. Kuehnl, and A. Geiger, “A new performance measure and eval-
uation benchmark for road detection algorithms,” in International Conference
on Intelligent Transportation Systems (ITSC), 2013.
[5] M. Menze and A. Geiger, “Object scene flow for autonomous vehicles,” in Con-
ference on Computer Vision and Pattern Recognition (CVPR), 2015.
[6] M. Mathias, R. Timofte, R. Benenson, and L. V. Gool, “Traffic sign recognition
- how far are we from the solution?” in Proceedings of IEEE International Joint
Conference on Neural Networks (IJCNN 2013), August 2013.
[7] S. Houben, J. Stallkamp, J. Salmen, M. Schlipsing, and C. Igel, “Detection of
traffic signs in real-world images: The German Traffic Sign Detection Bench-
mark,” in International Joint Conference on Neural Networks, no. 1288, 2013.
[8] “Traffic Lights Recognition public benchmarks.” [Online]. Available: http:
//www.lara.prd.fr/benchmarks/trafficlightsrecognition
[9] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detec-
tion,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE
Computer Society Conference on, vol. 1, June 2005, pp. 886–893 vol. 1.
[10] P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: A bench-
mark,” in CVPR, June 2009.
[11] S. R. Richter, V. Vineet, S. Roth, and V. Koltun, “Playing for data: Ground
truth from computer games,” in European Conference on Computer Vision
(ECCV), ser. LNCS, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds., vol.
9906. Springer International Publishing, 2016, pp. 102–118.
[12] J. Greenhalgh and M. Mirmehdi, “Real-time detection and recognition of
road traffic signs,” Intelligent Transportation Systems, IEEE Transactions on,
vol. 13, no. 4, pp. 1498–1506, 2012.
[13] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel, “Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition,” Neural Networks, 2012. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0893608012000457
[14] C. Keller, C. Sprunk, C. Bahlmann, J. Giebel, and G. Baratoff, “Real-time
recognition of u.s. speed signs,” in Intelligent Vehicles Symposium, 2008 IEEE,
June 2008, pp. 518–523.
[15] W. Liu, Y. Wu, J. Lv, H. Yuan, and H. Zhao, “U.s. speed limit sign detection
and recognition from image sequences,” in Control Automation Robotics Vision
(ICARCV), 2012 12th International Conference on, Dec 2012, pp. 1437–1442.
[16] F. Zaklouta, B. Stanciulescu, and O. Hamdoun, “Traffic sign classification us-
ing k-d trees and random forests,” in Neural Networks (IJCNN), The 2011
International Joint Conference on, July 2011, pp. 2151–2155.
[17] P. Sermanet and Y. LeCun, “Traffic sign recognition with multi-scale convolu-
tional networks,” in Neural Networks (IJCNN), The 2011 International Joint
Conference on, July 2011, pp. 2809–2813.
[18] E. Herbschleb and P. H. N. de With, “Real-time traffic sign detection and recognition,” pp. 72570A-1–72570A-12, 2009. [Online]. Available: http://dx.doi.org/10.1117/12.806171
[19] A. D. L. Escalera, L. E. Moreno, M. A. Salichs, and J. M. Armingol, “Road
traffic sign detection and classification,” IEEE Transactions on Industrial Elec-
tronics, vol. 44, pp. 848–859, 1997.
[20] K. Par and O. Tosun, “Real-time traffic sign recognition with map fusion on
multicore/many-core architectures,” Acta Polytechnica Hungarica, vol. 9, no. 2,
2012.
[21] R. de Charette and F. Nashashibi, “Real time visual traffic lights recognition
based on spot light detection and adaptive traffic lights templates,” in Intelligent
Vehicles Symposium, 2009 IEEE, June 2009, pp. 358–363.
[22] ——, “Traffic light recognition using image processing compared to learning
processes,” in Intelligent Robots and Systems, 2009. IROS 2009. IEEE/RSJ
International Conference on, Oct 2009, pp. 333–338.
[23] G. Trehard, E. Pollard, B. Bradai, and F. Nashashibi, “Tracking both pose and
status of a traffic light via an interacting multiple model filter,” in Information
Fusion (FUSION), 2014 17th International Conference on, July 2014, pp. 1–7.
[24] S. Sooksatra and T. Kondo, “Red traffic light detection using fast radial symme-
try transform,” in Electrical Engineering/Electronics, Computer, Telecommu-
nications and Information Technology (ECTI-CON), 2014 11th International
Conference on, May 2014, pp. 1–6.
[25] T.-P. Sung and H.-M. Tsai, “Real-time traffic light recognition on mobile devices
with geometry-based filtering,” in Distributed Smart Cameras (ICDSC), 2013
Seventh International Conference on, Oct 2013, pp. 1–7.
[26] J. Levinson, J. Askeland, J. Dolson, and S. Thrun, “Traffic light mapping, local-
ization, and state detection for autonomous vehicles,” in Robotics and Automa-
tion (ICRA), 2011 IEEE International Conference on, May 2011, pp. 5784–
5791.
[27] N. Fairfield and C. Urmson, “Traffic light mapping and detection,” in Robotics
and Automation (ICRA), 2011 IEEE International Conference on, May 2011,
pp. 5421–5426.
[28] A. Gomez, F. Alencar, P. Prado, F. Osorio, and D. Wolf, “Traffic lights detec-
tion and state estimation using hidden markov models,” in Intelligent Vehicles
Symposium Proceedings, 2014 IEEE, June 2014, pp. 750–755.
[29] S. Salti, A. Petrelli, F. Tombari, N. Fioraio, and L. Di Stefano, “Traffic sign
detection via interest region extraction,” Pattern Recognition, vol. 48(4), pp.
1039–1049, 2015.
[30] G. Hinton, S. Osindero, and Y. Teh, “A fast learning algorithm for deep belief
nets,” Neural Computation, vol. 18, no. 7, pp. 1527–1554, July 2006.
[31] I. Arel, D. Rose, and T. Karnowski, “Deep machine learning - a new frontier
in artificial intelligence research [research frontier],” Computational Intelligence
Magazine, IEEE, vol. 5, no. 4, pp. 13–18, Nov 2010.
[32] T.-H. Chan, K. Jia, S. Gao, J. Lu, Z. Zeng, and Y. Ma, “Pcanet: A simple deep
learning baseline for image classification?” arXiv preprint arXiv:1404.3606,
2014.
[33] S. Lafuente-Arroyo, S. Maldonado-Bascon, P. Gil-Jimenez, H. Gomez-Moreno,
and F. Lopez-Ferreras, “Road sign tracking with a predictive filter solution,” in
IEEE Industrial Electronics, IECON 2006 - 32nd Annual Conference on, Nov
2006, pp. 3314–3319.
[34] S. Lafuente-Arroyo, S. Maldonado-Bascon, P. Gil-Jimenez, J. Acevedo-
Rodriguez, and R. Lopez-Sastre, “A tracking system for automated inventory
of road signs,” in Intelligent Vehicles Symposium, 2007 IEEE, June 2007, pp.
166–171.
[35] S. Zhang, R. Benenson, M. Omran, J. H. Hosang, and B. Schiele, “How far
are we from solving pedestrian detection?” CoRR, vol. abs/1602.01237, 2016.
[Online]. Available: http://arxiv.org/abs/1602.01237
[36] P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian detection: An
evaluation of the state of the art,” PAMI, vol. 34, 2012.
[37] P. Viola and M. J. Jones, “Robust real-time face detection,” International
Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, 2004. [Online].
Available: http://dx.doi.org/10.1023/B:VISI.0000013087.49260.fb
[38] R. Benenson, M. Omran, J. H. Hosang, and B. Schiele, “Ten years of
pedestrian detection, what have we learned?” CoRR, vol. abs/1411.4304, 2014.
[Online]. Available: http://arxiv.org/abs/1411.4304
[39] P. Dollar, Z. Tu, P. Perona, and S. Belongie, “Integral channel features,” pp.
91.1–91.11, 2009, doi:10.5244/C.23.91.
[40] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classifica-
tion with deep convolutional neural networks,” in Advances in Neural
Information Processing Systems 25, F. Pereira, C. J. C. Burges,
L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc.,
2012, pp. 1097–1105. [Online]. Available: http://papers.nips.cc/paper/
4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
[41] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014. [Online].
Available: http://arxiv.org/abs/1409.1556
[42] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in The
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June
2015.
[43] B. Yang, J. Yan, Z. Lei, and S. Z. Li, “Convolutional channel features
for pedestrian, face and edge detection,” CoRR, vol. abs/1504.07339, 2015.
[Online]. Available: http://arxiv.org/abs/1504.07339
[44] R. Gade and T. B. Moeslund, “Thermal cameras and applications: a survey,”
Machine Vision and Applications, vol. 25, no. 1, pp. 245–262, 2014. [Online].
Available: http://dx.doi.org/10.1007/s00138-013-0570-5
[45] W. Li, D. Zheng, T. Zhao, and M. Yang, “An effective approach to pedestrian
detection in thermal imagery,” in Natural Computation (ICNC), 2012 Eighth
International Conference on, May 2012, pp. 325–329.
[46] F. Suard, A. Rakotomamonjy, A. Bensrhair, and A. Broggi, “Pedestrian detec-
tion using infrared images and histograms of oriented gradients,” in 2006 IEEE
Intelligent Vehicles Symposium, 2006, pp. 206–212.
[47] C. Dai, Y. Zheng, and X. Li, “Pedestrian detection and tracking in
infrared imagery using shape and appearance,” Computer Vision and
Image Understanding, vol. 106, no. 2-3, pp. 288 – 299, 2007, special
issue on Advances in Vision Algorithms and Systems beyond the Visible
Spectrum. [Online]. Available: http://www.sciencedirect.com/science/article/
pii/S1077314206001925
[48] J. W. Davis and M. A. Keck, “A two-stage template approach to person de-
tection in thermal imagery,” Applications of Computer Vision and the IEEE
Workshop on Motion and Video Computing, IEEE Workshop on, vol. 1, pp.
364–369, 2005.
[49] F. Xu, X. Liu, and K. Fujimura, “Pedestrian detection and tracking with night
vision,” IEEE Transactions on Intelligent Transportation Systems, vol. 6, no. 1,
pp. 63–71, March 2005.
[50] D. Olmeda, A. de la Escalera, and J. M. Armingol, “Contrast invariant features
for human detection in far infrared images,” in Intelligent Vehicles Symposium
(IV), 2012 IEEE, June 2012, pp. 117–122.
[51] W. Wang, J. Zhang, and C. Shen, “Improved human detection and classifi-
cation in thermal images,” in 2010 IEEE International Conference on Image
Processing, Sept 2010, pp. 2313–2316.
[52] M. Bertozzi, A. Broggi, C. H. Gomez, R. I. Fedriga, G. Vezzoni, and M. DelRose,
“Pedestrian detection in far infrared images based on the use of probabilistic
templates,” in 2007 IEEE Intelligent Vehicles Symposium, June 2007, pp. 327–
332.
[53] T. T. Zin, H. Takahashi, and H. Hama, “Robust person detection using far
infrared camera for image fusion,” in Innovative Computing, Information and
Control, 2007. ICICIC ’07. Second International Conference on, Sept 2007, pp.
310–310.
[54] D. Geronimo, A. M. Lopez, A. D. Sappa, and T. Graf, “Survey of pedestrian de-
tection for advanced driver assistance systems,” IEEE Transactions on Pattern
Analysis and Machine Intelligence, vol. 32, no. 7, pp. 1239–1258, July 2010.
[55] S. Hwang, J. Park, N. Kim, Y. Choi, and I. S. Kweon, “Multispectral pedes-
trian detection: Benchmark dataset and baseline,” in 2015 IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), June 2015, pp. 1037–1045.
[56] S. J. Krotosky and M. M. Trivedi, “On color-, infrared-, and multimodal-stereo
approaches to pedestrian detection,” IEEE Transactions on Intelligent Trans-
portation Systems, vol. 8, no. 4, pp. 619–629, Dec 2007.
[57] K. H. Lee and J. N. Hwang, “On-road pedestrian tracking across multiple driv-
ing recorders,” IEEE Transactions on Multimedia, vol. 17, no. 9, pp. 1429–1438,
Sept 2015.
[58] W. Liu, R. W. H. Lau, X. Wang, and D. Manocha, “Exemplar-amms: Recog-
nizing crowd movements from pedestrian trajectories,” IEEE Transactions on
Multimedia, vol. 18, no. 12, pp. 2398–2406, Dec 2016.
[59] R. Risack, N. Mohler, and W. Enkelmann, “A video-based lane keeping assis-
tant,” in Proceedings of the IEEE Intelligent Vehicles Symposium 2000 (Cat.
No.00TH8511), 2000, pp. 356–361.
[60] S. Ishida and J. E. Gayko, “Development, evaluation and introduction of a lane
keeping assistance system,” in IEEE Intelligent Vehicles Symposium, 2004, June
2004, pp. 943–944.
[61] J. F. Liu, J. H. Wu, and Y. F. Su, “Development of an interactive lane keep-
ing control system for vehicle,” in 2007 IEEE Vehicle Power and Propulsion
Conference, Sept 2007, pp. 702–706.
[62] A. H. Eichelberger and A. T. McCartt, “Toyota drivers’ experiences with
dynamic radar cruise control, pre-collision system, and lane-keeping assist,”
Journal of Safety Research, vol. 56, pp. 67 – 73, 2016. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0022437515001061
[63] Y. Li, “Deep reinforcement learning: An overview,” CoRR, vol. abs/1701.07274,
2017. [Online]. Available: http://arxiv.org/abs/1701.07274
[64] A. E. Sallab, M. Abdou, E. Perot, and S. Yogamani, “End-to-end deep
reinforcement learning for lane keeping assist,” CoRR, vol. abs/1612.04340,
2016. [Online]. Available: http://arxiv.org/abs/1612.04340
[65] S. Shalev-Shwartz, S. Shammah, and A. Shashua, “Safe, multi-agent,
reinforcement learning for autonomous driving,” CoRR, vol. abs/1610.03295,
2016. [Online]. Available: http://arxiv.org/abs/1610.03295
[66] A. E. Sallab, M. Abdou, E. Perot, and S. Yogamani, “Deep reinforcement
learning framework for autonomous driving,” CoRR, vol. abs/1704.02532,
2017. [Online]. Available: http://arxiv.org/abs/1704.02532
[67] S. Sharifzadeh, I. Chiotellis, R. Triebel, and D. Cremers, “Learning to
drive using inverse reinforcement learning and deep q-networks,” CoRR, vol.
abs/1612.03653, 2016. [Online]. Available: http://arxiv.org/abs/1612.03653
[68] D. A. Pomerleau, “Advances in neural information processing systems 1,” D. S.
Touretzky, Ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.,
1989, ch. ALVINN: An Autonomous Land Vehicle in a Neural Network, pp.
305–313. [Online]. Available: http://dl.acm.org/citation.cfm?id=89851.89891
[69] Y. LeCun, U. Muller, J. Ben, E. Cosatto, and B. Flepp, “Off-road
obstacle avoidance through end-to-end learning,” in Proceedings of the 18th
International Conference on Neural Information Processing Systems, ser.
NIPS’05. Cambridge, MA, USA: MIT Press, 2005, pp. 739–746. [Online].
Available: http://dl.acm.org/citation.cfm?id=2976248.2976341
[70] M. Bojarski, D. D. Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D.
Jackel, M. Monfort, U. Muller, J. Zhang, X. Zhang, J. Zhao, and K. Zieba,
“End to end learning for self-driving cars,” CoRR, vol. abs/1604.07316, 2016.
[Online]. Available: http://arxiv.org/abs/1604.07316
[71] M. Bojarski, A. Choromanska, K. Choromanski, B. Firner, L. D.
Jackel, U. Muller, and K. Zieba, “Visualbackprop: visualizing cnns for
autonomous driving,” CoRR, vol. abs/1611.05418, 2016. [Online]. Available:
http://arxiv.org/abs/1611.05418
[72] M. Bojarski, P. Yeres, A. Choromanska, K. Choromanski, B. Firner, L. D.
Jackel, and U. Muller, “Explaining how a deep neural network trained with
end-to-end learning steers a car,” CoRR, vol. abs/1704.07911, 2017. [Online].
Available: http://arxiv.org/abs/1704.07911
[73] Z. Chen and X. Huang, “End-to-end learning for lane keeping of self-driving
cars,” in 2017 IEEE Intelligent Vehicles Symposium (IV), June 2017.
[74] J. Hardy and M. Campbell, “Contingency planning over probabilistic obstacle
predictions for autonomous road vehicles,” IEEE Transactions on Robotics,
vol. 29, no. 4, pp. 913–929, 2013.
[75] E. Frazzoli, M. A. Dahleh, and E. Feron, “Real-time motion planning for agile
autonomous vehicles,” in American Control Conference, 2001. Proceedings of
the 2001, vol. 1. IEEE, 2001, pp. 43–49.
[76] M. Likhachev and D. Ferguson, “Planning long dynamically feasible maneu-
vers for autonomous vehicles,” The International Journal of Robotics Research,
vol. 28, no. 8, pp. 933–945, 2009.
[77] R. Y. Hindiyeh, “Dynamics and control of drifting in automobiles,” Ph.D. dissertation, Stanford University, March 2013.
[78] E. Galceran, R. M. Eustice, and E. Olson, “Toward integrated motion planning
and control using potential fields and torque-based steering actuation for au-
tonomous driving,” in Proceedings of the IEEE Intelligent Vehicle Symposium,
Seoul, Korea, June 2015, pp. 304–309.
[79] R. DeSantis, “Path-tracking for articulated vehicles via exact and Jacobian
linearization,” IFAC Proceedings Volumes, vol. 31, no. 3, pp. 159–164, 1998.
[80] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detec-
tion,” in Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE
Computer Society Conference on, vol. 1, June 2005, pp. 886–893.
[81] “BelgiumTS Dataset,” 2010. [Online]. Available: http://btsd.ethz.ch/
shareddata/
[82] F. Zaklouta and B. Stanciulescu, “Real-time traffic sign recognition
in three stages,” Robotics and Autonomous Systems, vol. 62, no. 1,
pp. 16–24, 2014 (New Boundaries of Robotics). [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0921889012001236
[83] S. Suzuki and K. Abe, “Topological structural analysis of digitized binary im-
ages by border following,” Computer Vision, Graphics, and Image Processing,
vol. 30, no. 1, pp. 32–46, 1985.
[84] G. Bradski, “The OpenCV Library,” Dr. Dobb’s Journal of Software Tools,
vol. 25, no. 11, pp. 120–126, 2000.
[85] H. Cheng, X. Jiang, Y. Sun, and J. Wang, “Color image segmentation: advances
and prospects,” Pattern Recognition, vol. 34, no. 12, pp. 2259–2281, 2001.
[86] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with
deep convolutional neural networks,” in Advances in Neural Information Pro-
cessing Systems 25, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, Eds.
Curran Associates, Inc., 2012, pp. 1097–1105.
[87] M. Breitenstein, F. Reichlin, B. Leibe, E. Koller-Meier, and L. Van Gool, “On-
line multiperson tracking-by-detection from a single, uncalibrated camera,” Pat-
tern Analysis and Machine Intelligence, IEEE Transactions on, vol. 33, no. 9,
pp. 1820–1833, Sept 2011.
[88] H. W. Kuhn, “The Hungarian method for the assignment problem,” in 50 Years
of Integer Programming 1958-2008. Springer, 2010, pp. 29–47.
[89] S.-H. Bae and K.-J. Yoon, “Robust online multi-object tracking based on track-
let confidence and online discriminative appearance learning,” in Computer Vi-
sion and Pattern Recognition (CVPR), 2014 IEEE Conference on, June 2014,
pp. 1218–1225.
[90] K. Basak, S. N. Hetu, Z. Li, C. L. Azevedo, H. Loganathan, T. Toledo,
R. Xu, Y. Xu, L.-S. Peh, and M. Ben-Akiva, “Modeling reaction time
within a traffic simulation model,” in 16th International IEEE Conference on
Intelligent Transportation Systems (ITSC 2013), Oct 2013, pp. 302–309.
[91] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang,
A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei, “ImageNet
Large Scale Visual Recognition Challenge,” International Journal of Computer
Vision (IJCV), vol. 115, no. 3, pp. 211–252, 2015.
[92] P. Domingos, “A few useful things to know about machine learning,” Commun.
ACM, vol. 55, no. 10, pp. 78–87, Oct. 2012.
[93] Z. Chen, X. Huang, Z. Ni, and H. He, “A GPU-based real-time traffic sign
detection and recognition system,” in Computational Intelligence in Vehicles
and Transportation Systems (CIVTS), 2014 IEEE Symposium on, Dec 2014,
pp. 1–5.
[94] Z. Chen, J. Wang, H. He, and X. Huang, “A fast deep learning system using
GPU,” in 2014 IEEE International Symposium on Circuits and Systems (IS-
CAS), June 2014, pp. 1552–1555.
[95] Y. Zhou, W. Wang, and X. Huang, “FPGA design for PCANet deep learning
network,” in Field-Programmable Custom Computing Machines (FCCM), 2015
IEEE 23rd Annual International Symposium on, May 2015, pp. 232–232.
[96] R. Hartley and A. Zisserman, Multiple view geometry in computer vision. Cam-
bridge University Press, 2003.
[97] R. E. Schapire and Y. Singer, “Improved boosting algorithms using confidence-
rated predictions,” Machine Learning, vol. 37, no. 3, pp. 297–336, 1999.
[Online]. Available: http://dx.doi.org/10.1023/A:1007614523901
[98] P. Dollár, R. Appel, S. Belongie, and P. Perona, “Fast feature pyramids
for object detection,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 36, no. 8, pp. 1532–1545, Aug 2014.
[99] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadar-
rama, and T. Darrell, “Caffe: Convolutional architecture for fast feature em-
bedding,” arXiv preprint arXiv:1408.5093, 2014.
[100] M. Rohrbach, M. Enzweiler, and D. M. Gavrila, “High-level fusion of depth and
intensity for pedestrian classification,” in Joint Pattern Recognition Symposium.
Springer, 2009, pp. 101–110.
[101] J. Kittler, M. Hatef, R. P. W. Duin, and J. Matas, “On combining classifiers,”
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 3,
pp. 226–239, Mar 1998.
[102] B. Waske and J. A. Benediktsson, “Fusion of support vector machines for clas-
sification of multisensor data,” IEEE Transactions on Geoscience and Remote
Sensing, vol. 45, no. 12, pp. 3858–3866, Dec 2007.
[103] R. Pouteau, B. Stoll, and S. Chabrier, “Support vector machine fusion of mul-
tisensor imagery in tropical ecosystems,” in Image Processing Theory Tools and
Applications (IPTA), 2010 2nd International Conference on, July 2010, pp.
325–329.
[104] J. Zhao, B. Xie, and X. Huang, “Real-time lane departure and front collision
warning system on an FPGA,” in 2014 IEEE High Performance Extreme Com-
puting Conference (HPEC), Sept 2014, pp. 1–5.
[105] A. J. Humaidi and M. A. Fadhel, “Performance comparison for lane detection
and tracking with two different techniques,” in 2016 Al-Sadeq International
Conference on Multidisciplinary in IT and Communication Science and Appli-
cations (AIC-MITCSA), May 2016, pp. 1–6.
[106] C. Li, J. Wang, X. Wang, and Y. Zhang, “A model based path planning algo-
rithm for self-driving cars in dynamic environment,” in 2015 Chinese Automa-
tion Congress (CAC), Nov 2015, pp. 1123–1128.
[107] S. Yoon, S. E. Yoon, U. Lee, and D. H. Shim, “Recursive path planning us-
ing reduced states for car-like vehicles on grid maps,” IEEE Transactions on
Intelligent Transportation Systems, vol. 16, no. 5, pp. 2797–2813, Oct 2015.
[108] J. Kong, M. Pfeiffer, G. Schildbach, and F. Borrelli, “Kinematic and dynamic
vehicle models for autonomous driving control design,” in 2015 IEEE Intelligent
Vehicles Symposium (IV), June 2015, pp. 1094–1099.
[109] D. Wang and F. Qi, “Trajectory planning for a four-wheel-steering vehicle,”
in Proceedings of the 2001 IEEE International Conference on Robotics and
Automation (ICRA), vol. 4, 2001, pp. 3320–3325.
[110] “The comma.ai driving dataset.” [Online]. Available: https://github.com/
commaai/research
[111] S. Minhas, A. Hernandez-Sabate, S. Ehsan, K. Díaz-Chito, A. Leonardis, A. M.
Lopez, and K. D. McDonald-Maier, LEE: A Photorealistic Virtual Environ-
ment for Assessing Driver-Vehicle Interactions in Self-driving Mode. Cham:
Springer International Publishing, 2016, pp. 894–900.
[112] E. Santana and G. Hotz, “Learning a driving simulator,” CoRR, vol.
abs/1608.01230, 2016. [Online]. Available: http://arxiv.org/abs/1608.01230
[113] R. Szeliski, Computer vision: algorithms and applications. Springer Science &
Business Media, 2010.