
(IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 12, No. 8, 2021

Real Time Vehicle Detection, Tracking, and Inter-vehicle Distance Estimation based on Stereovision and Deep Learning using YOLOv3

Omar BOURJA∗1, Hatim DERROUZ2, Hamd AIT ABDELALI3, Abdelilah MAACH4, Rachid OULAD HAJ THAMI5, François BOURZEIX6

RIME Department, Mohammadia School of Engineers, Mohammed V University in Rabat, 10100, Morocco 1,4

Embedded Systems and Artificial Intelligence Department, MAScIR, 10100, Morocco 1,2,3,6

IRDA team, ADMIR Laboratory, Rabat IT center, ENSIAS, Mohammed V University, Rabat 10100, Morocco 2,5

These authors contributed equally to this work 1,2,3

Abstract—In this paper, we propose a robust real-time vehicle tracking and inter-vehicle distance estimation algorithm based on stereovision. Traffic images are captured by a stereoscopic system installed on the road, and moving vehicles are detected with the YOLOv3 deep neural network. The real-time video then goes through a stereoscopy-based measurement algorithm to estimate the distance between detected vehicles. Detecting objects in real time has always been a challenging task because of occlusion, scale, illumination, and similar factors. Many convolutional neural network models for object detection have been developed in recent years, but they cannot be used for real-time analysis because of their slow recognition speed. The model that currently performs best is the unified object detection model You Only Look Once (YOLO). In our experiments, however, we found that despite its very good detection precision, YOLO still has some limitations: it processes every image separately, even in a continuous video, so much important identification can be lost across frames. After vehicle detection and tracking, inter-vehicle distance estimation is performed.

Keywords—Stereovision; stereo image; YOLOv3 deep neural network; convolutional neural network; vehicle detection; tracking; bounding boxes; distance estimation

I. INTRODUCTION

Today, there are millions of vehicles authorized on the roads, and their number is constantly increasing. Consequently, improving traffic efficiency and reducing congestion and the human and material damage related to accidents have become major challenges in cities. This has progressively improved over the last decade through Intelligent Transport Systems (ITS). The incorporation of new information and communication technologies into vehicle interiors and transportation infrastructure has significantly revolutionized the way we travel today. These tools improve traffic flow by reducing travel time and congestion, detect road violations, support drivers, reduce the risk of road accidents, and minimize the damage resulting from unavoidable accidents. These applications also impose demands, requiring credible dedicated hardware and reliable, timely communications. In addition, most traffic management systems rely on camera-based video surveillance because of its low cost, ease of maintenance, and ability to capture high-quality images of the traffic scene [1]–[4]. This allows useful information to be disseminated and collected between vehicles, and between transport infrastructure and vehicles, to help drivers travel safely and comfortably.

However, these systems are only reliable under normal conditions; they may not work very well in special circumstances, such as occlusion, bad weather, changes in lighting, and so on [1]–[3], [5]–[9].

In the present study, we are mainly interested in the calculation of inter-vehicle distance, an important traffic factor in intelligent transportation systems [2], [5], [8]. The aim of this paper is to present the possibility of using stereo cameras instead of LIDAR to estimate the inter-vehicle distance [1], [5]. This method is advantageous because the recordings made by the cameras can be adapted for many algorithms. In addition, cameras are a much cheaper and therefore more cost-effective solution compared to LIDAR [3]. The images are captured by a stereoscopic system using two cameras placed above the traffic lanes. Our system consists of two slightly displaced cameras that obtain two images, which then go through a measurement algorithm based on the principles of stereoscopy to estimate the distance to the detected vehicles. The proposed solution is notably based on YOLO deep learning, which allows us to detect and delimit vehicles in a stereo image [7], [8].

The rest of the document is organized as follows. Section 2 describes our stereovision system, discusses calibration and stereo synchronization, and presents our inter-vehicle distance estimation method together with the general structure of our computational algorithm. Section 3 describes the YOLOv3 deep neural network architecture. Section 4 gives a literature review of CNN models and the YOLO model, along with a brief background on fully connected neural networks. Section 5 explains the experimental flow, the methodology, and the overall system design, followed by the test validation results and the experimental results. Finally, a conclusion is drawn.

II. STEREOSCOPIC VISION SYSTEM

A stereovision system is characterized by the acquisition of two images of the object to be observed from two different angles. After acquiring the two images, the image coordinates of the points to be measured are determined on each of them. The matching of similar points is usually done automatically, and the result is a list of 3D coordinates. In fact, each of the two acquired images is processed by classical image processing tools to produce a list of 2D points characteristic of the objects. Each image having produced a list of points, a matching step is then necessary to determine which points of the left and right lists correspond to each other. This matching can rely on a priori knowledge of the observed scene, such as the preservation of the order of elements from one image to the other. This constraint most often requires two similar images or, in other words, cameras positioned very close to each other. Once the (x, y)left and (x, y)right pairs have been created, the calibration models can be used to calculate the corresponding (x, y, z) world points.

Therefore, the goal of stereovision is to calculate the spatial position of points from the coordinates of their images in two different views, in order to make measurements or to reconstruct the three-dimensional structure of the scene. The problem of stereovision thus revolves around two essential points: camera calibration and synchronization. Indeed, at the shooting level, it is obviously necessary to obtain images of the same scene [3]–[5].

To acquire stereo images, we designed a stereoscopic system with two similar cameras fixed and aligned on a stereo bar. The whole system was installed on a bridge over the highway, as shown in Fig. 1 [4], [5].

Fig. 1. Stereo System Installed on a Bridge over the Highway [1], [8]. Stereo System Hardware Platform.

The calibration of a single camera (monocular applications) is equivalent to estimating its intrinsic parameters and its position in relation to the world reference frame. Configuring a stereoscopic sensor means calibrating both cameras (intrinsic parameters of each camera) and the relative position and orientation of the two cameras. Thus, calibrating a camera means estimating the transfer function that transforms a 3D point of the scene into a 2D point of the image.

Therefore, calibration is a very important step before the acquisition of the stereoscopic image [6], [10].

It allows us to determine the intrinsic and extrinsic parameters of each camera. A bad camera calibration can degrade the quality of the distance estimation. Therefore, the cameras must be installed accurately, otherwise measurement errors may occur (Fig. 2).

Fig. 2. Well-Calibrated Cameras (a); (b, c): the Most Common Calibration Errors.

The accuracy of the system depends on its correct calibration. Generally, we have two types of calibration [3]–[5]:

• Internal calibration to adjust the internal parameters of the camera (focal length, lens aperture, etc.) in order to eliminate image distortion. The intrinsic parameters of the camera are the projection of the optical center in the image frame, the focal length, and the image distortion parameters.

• External calibration to adjust the position and orientation of the two cameras in order to make their optical axes parallel. The extrinsic parameters are the translation and rotation between the camera frame and the world frame. They position each camera in the same reference frame and prevent any miscalibration between the right and left images (Fig. 3); a minimal calibration sketch follows below.
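To make the two calibration stages concrete, here is a minimal sketch using OpenCV and a chessboard target; the pattern size, square size, and image paths are illustrative assumptions, not values from the paper.

```python
import glob
import cv2
import numpy as np

# Chessboard geometry: 9x6 inner corners, 25 mm squares (illustrative values).
PATTERN, SQUARE = (9, 6), 0.025
objp = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE

obj_pts, left_pts, right_pts, size = [], [], [], None
for lf, rf in zip(sorted(glob.glob("left/*.png")), sorted(glob.glob("right/*.png"))):
    gl = cv2.imread(lf, cv2.IMREAD_GRAYSCALE)
    gr = cv2.imread(rf, cv2.IMREAD_GRAYSCALE)
    okl, cl = cv2.findChessboardCorners(gl, PATTERN)
    okr, cr = cv2.findChessboardCorners(gr, PATTERN)
    if okl and okr:                      # keep only pairs where both views see the board
        obj_pts.append(objp); left_pts.append(cl); right_pts.append(cr)
        size = gl.shape[::-1]            # (width, height)

# Internal calibration: intrinsics and distortion of each camera separately.
_, K1, d1, _, _ = cv2.calibrateCamera(obj_pts, left_pts, size, None, None)
_, K2, d2, _, _ = cv2.calibrateCamera(obj_pts, right_pts, size, None, None)

# External calibration: rotation R and translation T between the two cameras.
ret, K1, d1, K2, d2, R, T, E, F = cv2.stereoCalibrate(
    obj_pts, left_pts, right_pts, K1, d1, K2, d2, size,
    flags=cv2.CALIB_FIX_INTRINSIC)
```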

Fig. 3. Non-Synchronized Pair of Stereo Images: the Right Image is Captured After the Left Image.

It should be noted that the use of stereo cameras in our system simplifies the calibration process, since it is performed once and for all in the laboratory, whereas a monocular camera requires calibration for each road scene [3].


Thus, after calibration, we used a trigger card to constantly generate an electrical signal that activates both cameras simultaneously so that they capture images at the same time [5]. Once a camera is activated, an image of the road scene is captured.

Note that the cameras must be correctly synchronized to obtain adequate results.

The following paragraphs detail a stereovision measurement method based on the sensor geometry described above. Fig. 4 illustrates the important parameters for stereoscopic measurements: SL and SR represent two cameras that are at distance B from each other.

Fig. 4. Parameters for Stereoscopy Measurements. System Placed Horizontally on the Road [6].

ϕ0 represents the field of view (FoV) of the cameras. The distance D to the object (in our case a vehicle) can be expressed by geometrical derivations leading to the following expression (equation 1) [6]:

D = B / (tan ϕ1 + tan ϕ2)    (1)

where ϕ1 and ϕ2 are the angles between the axis of the camera lens and the direction of the object. After further derivation, we arrive at the following expression (equation 2) [6]:

D = (B × X0) / (2 tan(ϕ0/2) × (X1 − X2))    (2)

X0 is the number of horizontal pixels of the images; X1 and X2 are the numbers of pixels between the midpoint of the horizontal edge of the object's bounding box and the left edge of the image (X1 for the left image and X2 for the right one).

Finally, we can estimate the distance to any object appearing in both images if we know the distance between the cameras (B), the number of horizontal pixels of the image (X0), the FoV of the cameras ϕ0, and the horizontal difference between the same object in both images (X1 − X2), also known as disparity. In fact, the disparity refers to the difference in image location of an object seen by the left and right cameras, resulting from the cameras' horizontal separation [5].
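As a worked illustration of equation (2), the following minimal sketch implements the formula directly; the numeric values in the usage comment are invented, not measurements from this system.

```python
import math

def stereo_distance(B, X0, fov_deg, x1, x2):
    """Eq. (2): D = (B * X0) / (2 * tan(phi0/2) * (X1 - X2)).
    B: camera baseline in meters; X0: horizontal image width in pixels;
    fov_deg: horizontal field of view phi0 in degrees;
    x1, x2: abscissas of the box midpoint in the left/right images (pixels)."""
    disparity = x1 - x2                  # (X1 - X2), the horizontal disparity
    return (B * X0) / (2.0 * math.tan(math.radians(fov_deg) / 2.0) * disparity)

# Illustrative call (invented values): a 0.5 m baseline, 1920 px wide images,
# a 60° FoV, and a 32 px disparity give roughly 26 m:
# stereo_distance(B=0.5, X0=1920, fov_deg=60, x1=1012, x2=980)  # ≈ 26.0
```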

To calculate the actual distance between the stereo system and the vehicle, we need to calculate the angle between the road and the orientation of the system (Fig. 5).

Fig. 5. Parameters for Stereoscopy Measurements. System Placed above the Road [2].

This distance is calculated as follows (equation 3):

D = D′ cos(α) (3)

where α is the angle between the orientation of the system and the road, and D′ is the distance between the object and the horizontal plane of the cameras. The proposed method for estimating inter-vehicle distance then involves three major steps. The first consists in preparing the two images generated by our stereovision system to detect the vehicles present in each image and to delimit them by bounding boxes [2], [8].

The second step consists in finding the objects that appear in both images in order to precisely determine the values of X1 and X2 (cf. Section 2) [6], [8]. Thus, if a vehicle is detected in the left image, the algorithm has to search for it in the right image, and this matching criterion must be strictly respected. This step is the most complicated of the whole algorithm.

The third step of the algorithm consists in calculating the distance between our stereovision system and each vehicle. At this stage, all the necessary parameters have already been obtained, and the distance is calculated from the second equation. Finally, we deduce the inter-vehicle distance from the third equation by subtracting the estimated distances between each vehicle and the two cameras.
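These three steps can be summarized in a short sketch that reuses stereo_distance from the previous block; the matching of boxes across the two images is assumed to be already done, and alpha_deg stands for the tilt angle α of equation (3).

```python
import math

def ground_distance(d_cam, alpha_deg):
    """Eq. (3): D = D' * cos(alpha), projecting onto the road plane."""
    return d_cam * math.cos(math.radians(alpha_deg))

def inter_vehicle_distances(matched_midpoints, B, X0, fov_deg, alpha_deg):
    """matched_midpoints: [(x1, x2), ...] for each vehicle matched across the
    left/right images, ordered along the lane. Returns the distances between
    consecutive vehicles, obtained by subtracting camera-to-vehicle distances."""
    d = [ground_distance(stereo_distance(B, X0, fov_deg, x1, x2), alpha_deg)
         for x1, x2 in matched_midpoints]
    return [abs(far - near) for near, far in zip(d, d[1:])]
```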

III. YOLO V3 DEEP NEURAL NETWORK ARCHITECTURE

The heart of our distance estimation algorithm is the vehicle detection and recognition block, which locates and delimits vehicles in a stereo image by drawing a bounding box around each vehicle. In this respect, we have opted for a variant of the YOLO deep neural network, YOLOv3, for vehicle detection and recognition [4], [5]. This neural network can recognize several objects in the same image, belonging to the same class or to different classes. In our case study, we are mainly interested in the third version of the YOLO model because it can run in real time on stereo images and video streams while keeping good predictive performance. This version was developed by Joseph Redmon and researchers at the University of Washington [7].


The YOLOv3 algorithm improves on YOLOv1 and YOLOv2, with high accuracy in detecting, recognizing, and locating objects as well as a high execution speed [7]; it has become a focal point of current research, although it still lags behind the most accurate object detection algorithms. The principle of the model is to scan the image only once, by passing it through a deep neural network, hence the name YOLO (You Only Look Once), unlike methods based on convolutional (CNN) region pipelines or recurrent (RNN) neural networks [7].

In addition, this version of the model increases the number of network layers and predicts bounding boxes at three scales to detect smaller objects. Such algorithms can produce overlapping bounding boxes for the same object, so the authors apply a method called Non-Max Suppression to keep only the most significant bounding boxes [7]. After implementing and running our model, we obtain as output the bounding box coordinates of all detected vehicles. This information is very useful for obtaining the parameters X1 and X2 used in the mathematical expression of distance estimation by stereovision [6]–[8].
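For illustration, here is a hedged sketch of YOLOv3 inference with OpenCV's DNN module; the file names, input size, and thresholds are assumptions rather than the paper's actual configuration.

```python
import cv2
import numpy as np

# Assumed local files: yolov3.cfg / yolov3.weights from the Darknet release.
net = cv2.dnn.readNetFromDarknet("yolov3.cfg", "yolov3.weights")

def detect_objects(frame, conf_thr=0.5, nms_thr=0.4):
    """Run YOLOv3 on one frame; return bounding boxes [x, y, w, h] after NMS."""
    h, w = frame.shape[:2]
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, (416, 416),
                                 swapRB=True, crop=False)
    net.setInput(blob)
    boxes, scores = [], []
    for out in net.forward(net.getUnconnectedOutLayersNames()):
        for det in out:                  # det = [cx, cy, bw, bh, obj, class scores...]
            conf = float(det[4] * det[5:].max())
            if conf < conf_thr:
                continue
            cx, cy, bw, bh = det[:4] * np.array([w, h, w, h])
            boxes.append([int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh)])
            scores.append(conf)
    keep = cv2.dnn.NMSBoxes(boxes, scores, conf_thr, nms_thr)
    return [boxes[i] for i in np.array(keep).flatten().astype(int)]
```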

A. Comparison of YOLO with Other Detection Algorithms

In comparison to recognition algorithms, a detection algorithm does not only predict class labels but detects the locations of objects as well. So, it not only classifies the image into a category, but it can also detect multiple objects within an image [11]. It is extremely fast and accurate. Moreover, one can easily trade off between speed and accuracy simply by changing the size of the model, with no retraining required [12]. This algorithm does not depend on multiple neural networks: it applies a single neural network to the full image. This network divides the image into regions and predicts bounding boxes and probabilities for each region. These bounding boxes are weighted by the predicted probabilities [11] (Fig. 6).

IV. LITERATURE REVIEW

A. Evolution of Image Recognition

Image processing has come a long way, particularly in recent years, with the evolution most visible in computer vision and software. Computer vision and the study of images started in the 1960s; before that, image analysis was done manually. The major improvements in deep learning techniques and image recognition technology took place around 2010; the field is now so advanced that we can write programs for supercomputers to train themselves [13].

In the early days, object detection followed a feature extraction and classification paradigm. People had to manually define the specific features to be extracted; after extraction, the features were represented as vectors, which were used to train a model and to detect objects when testing it. Detecting multiple objects was difficult, since one had to find a general feature common to multiple objects that could fit different objects during training. Choosing this general feature was a complex task, and detection accuracy was not great.

In 2012, compared to the models that already existed, CNNs gave satisfactory results and good accuracy. Although CNN models had been developed in the 1990s, their accuracy was low due to insufficient training examples and fragile hardware; CNN models became strong once GPUs became prevalent. The CNN model built in 2012 was trained on a dataset of 1.2 million images in 1000 categories, and the experiment conducted by Krizhevsky proved CNN's powerful ability in image classification [13]. CNN methods build feature filters during training, which cannot be done with traditional methods; compared to other models, CNN models are more flexible and have self-learning ability [14]. Because of all these advantages, CNNs became a major tool for image classification.

To enhance the performance of a CNN model, regression heads were attached to it. A regression head, trained separately, predicts four coordinates; a CNN can therefore carry both a classification head and a regression head. When testing the model, classification and regression work simultaneously: classification predicts the class score, and regression handles positioning. From 2012 to 2015, experiments successfully attached both classification and regression heads to the CNN models Overfeat-Net, VGG-Net, and ResNet, reducing the error rate from 34% to 9%.

Since multiple object detection failed in the experiments conducted in 2012, research from 2014 onwards aimed at multiple object detection, with more than five objects to be detected in a single image. This is only possible when the system figures out each object's class and location. Usually, a deep convolutional neural network works with a fixed image size (e.g., 520 x 520), so recognition accuracy may drop for images and sub-images of arbitrary size or scale [15]. To overcome this issue, Spatial Pyramid Pooling was introduced in 2014. A fixed-length representation regardless of image size or scale is achieved by a network structure called SPP-net; by removing the size restriction, the accuracy of the convolutional neural network can be preserved. In SPP-net, feature maps are computed only once, and fixed-length representations are generated by pooling features in sub-images to train the detectors, avoiding the repeated computation of convolutional features. This method was better than R-CNN and gave satisfactory accuracy.

Most of the ideas regarding the CNN approach and classification came out in 2014. The main idea was to perform classification on every region that could possibly contain objects. The region proposal and classification approaches achieved high accuracy and precision, but the region proposals take a very long time to process, which slows down the entire system. Because of this time-consuming limitation, region proposal approaches cannot be deployed in time-critical applications such as autonomous driving or surveillance systems.

Recently, YOLO (You Only Look Once), a unified object detection model, was proposed by Joseph Redmon [7]. Detection in YOLO is treated as a regression problem. It is a pretrained model that does not require a new dataset to train from scratch; it consists of weights, and detected objects are output as boxes. The input image is regressed directly by the model to a tensor that encodes each object's position and class score. Input images need not go through the YOLO network more than once, so image processing is faster in this model. Compared to other object detection models, YOLO has accomplished more than 50 times better accuracy, which currently makes it one of the best choices for real-time object detection [7].

Fig. 6. Two-Camera/YOLO Real Time System. Algorithm Architecture.

B. CNN-based Object Detection: Benchmarking

1) R-CNN: R-CNN stands for Region-based Convolutional Neural Network. It combines region proposals with convolutional neural networks (CNNs) and focuses object detection with a deep neural network. It trains a high-capacity model with small amounts of annotated detection data. A deep convolutional network is used to categorize the object proposals, which is how R-CNN attains outstanding accuracy for object detection. R-CNN scales well because numerous object classes can be handled without resorting to approximate techniques such as hashing [16]. The researchers proposed a multi-stage pipeline followed by classification, with classification done using the regions paradigm. The three main components of the developed system are feature vector extraction by a CNN, a Support Vector Machine (SVM) classifier, and a region proposal component [17]. Feature vectors extracted by the CNN are used to train the SVM classifier. Training is done on two datasets: the CNN is trained with supervision on one large dataset (ILSVRC) and one small dataset (PASCAL). At test time, the region proposal component used is Selective Search, which produces 2000 fixed-size, category-independent regions that may contain objects [14]. The SVM is used for domain-specific classification after the fully trained CNN extractor converts every potential region into a feature vector. The two main problems that may arise are intersection-over-union (IOU) overlaps and duplicate detections.

By IOU, a higher-scoring region will overlay others. These problems are eliminated by greedy non-maximum suppression and by refining the bounding box with a linear regression model at the end. R-CNN achieved satisfying detection accuracy compared to the other detection methods available in 2014, but it also has many drawbacks because of its complex multi-stage pipeline. The main role of the CNN is to act as a classifier; region prediction depends entirely on external region proposal methods, which slows down the whole system during both training and detection. Since R-CNN trains each component separately, optimization is very difficult. Besides, the CNN cannot be updated during the training of the SVM classifier [14].

2) SSD: The Single Shot Detector (SSD) uses a single deep neural network for detecting objects in images. SSD discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios, scaled to every feature map location. Predictions from multiple feature maps are combined to naturally handle objects of different sizes. Among the benefits of SSD: it completely removes proposal generation and the subsequent pixel or feature resampling stages, encapsulating every computation in a single network. Training SSD is easy compared to other models, and it is straightforward to integrate into systems that need a detection component. SSD accuracy can be increased by adding an additional method for object proposals. Since everything is combined in one model, training and inference are much faster.

3) R-FCN: R-FCN stands for Region-based Fully Convolutional Network. It is a simple framework for efficient and accurate object detection. Other region-based network detectors, like Fast R-CNN and Faster R-CNN [18], are based on a per-region sub-network, but R-FCN is entirely convolutional, with every computation shared across the whole image. There is a dilemma between image classification and object detection: image classification favors translation invariance, while object detection requires translation variance. To overcome this issue, position-sensitive score maps are proposed. A region-based fully convolutional network can thus adopt a fully convolutional image classifier, such as the latest residual networks, for object detection [14]. PASCAL VOC datasets are used to show the results, with a 101-layer ResNet. The results achieved by R-FCN are up to 20x faster than Faster R-CNN in both inference and training.

4) Faster R-CNN: The Faster Region-based Convolutional Neural Network is an object detection algorithm similar to R-CNN. Features are extracted from the input image through convolutional layers. Faster R-CNN uses a Region Proposal Network (RPN) that shares the convolutional features and, at each spatial location, performs objectness classification and bounding box regression. The Faster R-CNN network is more cost-effective than R-CNN: it predicts object boundaries and objectness scores for every position. High-quality region proposals are created and trained end-to-end, and this technique is then used by the Fast R-CNN method for object detection. Compared to other object detection methods, Faster R-CNN has a lower running time for object detection. Feature maps are sent into the RPN, and the projected region proposals are extracted from them; RoI pooling is then performed on the feature maps. The result of Faster R-CNN is a multiclass classification and a bounding box regressor for each RoI [18].

5) Fast R-CNN: Fast R-CNN stands for Fast Region-based Convolutional Network. It is a training algorithm for object detection. Fast R-CNN improves on R-CNN and SPP-net, as it resolves almost all their disadvantages and increases their speed and accuracy; compared to R-CNN and SPP-net, Fast R-CNN has higher detection quality (mAP). Training in Fast R-CNN is done in a single stage by means of a multi-task loss, and all the network layers can be updated during the training process. Fast R-CNN does not use disk storage for feature caching. The convolutional feature map comes from a deep convolutional network together with an RoI projection; the RoI pooling layer extracts an RoI feature vector from the convolutional feature map for each RoI, and the output is a softmax classifier and a bounding-box regressor [11]. The cited paper proposes the Fast Region-based Convolutional Network method (Fast R-CNN) for object detection [18]. Building on previously developed algorithms, Fast R-CNN uses a deep convolutional network to classify object proposals efficiently, which helps it achieve better detection accuracy and higher training and test speed. Fast R-CNN trains the deep VGG16 network 9x faster than R-CNN and is 213x faster at test time, achieving a good mAP on PASCAL VOC 2012. Compared with SPP-net, Fast R-CNN tests 10x faster and more accurately, and trains VGG16 3x faster. The detailed work carried out in this experiment provides new insights, and improved detector quality is achieved in the end. The main issue with earlier object detection algorithms was that they were too expensive in time.

6) SPP-net: CNN models work only with a fixed input image size, such as 520x520, which lowers recognition accuracy. To overcome this issue, Spatial Pyramid Pooling was introduced. A fixed-length representation, irrespective of image size or scale, can be generated by the Spatial Pyramid Pooling (SPP-net) network structure, which is also robust to object deformation. Compared with other CNN-based methods, Spatial Pyramid Pooling is an improved structure: feature maps for the whole image are computed only once, and pooled features in sub-images of fixed length are computed to train the detectors. The repeated computation of convolutional features in other methods is thus avoided. SPP-net is particularly valuable in object detection: it is 30-170x faster than R-CNN, and when both models were tested on PASCAL VOC 2007, SPP-net gave better accuracy.

C. Real Time Objects Detection and Tracking: Benchmarking

Moving object detection and tracking is presented in [16]. An intuitive graphical interface is achieved by means of a new algorithm for silhouette extraction. For fast detection, the following algorithms were combined: the frame difference method, background subtraction, a Laplace filter, and the Canny edge detector. A multivision dataset is used for testing on sequence images, and a better-performing object tracking algorithm is proposed. The detection algorithms and basic operation techniques are integrated, and a graphical user interface makes the process simple and straightforward. The world has been adapting to artificial intelligence in recent years under the influence of deep learning. Many object detection algorithms have been compared, such as Region-based Convolutional Neural Networks (R-CNN), Faster R-CNN, the Single Shot Detector (SSD), and You Only Look Once (YOLO); the result is that Faster R-CNN and SSD give better accuracy compared with YOLO. Efficient implementation and tracking are achieved by combining deep learning with SSD and MobileNets: SSD helps detect objects and track them in a video sequence. The authors achieved a good security utility for enterprises, and the resulting model can be deployed in drones, for attack detection, and in CCTV cameras in government offices, colleges, hospitals, etc. Distance estimation on real-time video is achieved in [10]. A combination of two deep learning models is developed to achieve object detection and tracking, and the algorithms are tested in both railway and road environments. The Monodepth algorithm is applied for object distance estimation; a stereo image dataset and monocular images are used to train the model, and both models are tested on two other datasets, Cityscapes and KITTI. Pedestrian and vehicle behavior tracking is performed by developing a new SSD-based method, built on the coordinates of the output bounding boxes of the SSD algorithm. The whole development is tested on real-time data, and the main objective is to monitor the tracks of pedestrians and vehicles to make sure they do not lead to any dangerous situations. Real-time video of the Rouen tramway is captured by embedded cameras.

D. YOLO Model

YOLO (You Only Look Once), a unified object detection model, was proposed in [7]. Detection in YOLO is treated as a regression problem. It is a pretrained model that does not require a new dataset to train from scratch; it consists of weights, and detected objects are output as boxes. The input image is regressed directly by the model to a tensor that encodes each object's position and class score. Input images need not go through the YOLO network more than once, so image processing is faster in this model. Compared to other object detection models, YOLO has accomplished more than 50 times better accuracy, which currently makes it one of the best choices for real-time object detection. The base YOLO model can process images in real time at up to 45 frames per second, whereas Fast YOLO, a smaller version of the network, can process nearly 155 frames per second. The model generalizes very well to natural images. According to recent studies, YOLO is one of the fastest detection models compared to other CNN object detection models [7] (Fig. 7).

Fig. 7. Comparison to Other Detectors [7].

E. Kalman Tracking

In recent decades, real-time object tracking has been applied in multiple areas such as human-computer interaction, security, surveillance, and video communication. Object tracking is the process of locating one or multiple moving objects in a scene over continuous time. Some of the challenges faced are initial moving-object segmentation, whose goal is to simplify or change the representation of the image into something more minimal and easier to analyze [12], and rapid appearance changes caused by image noise, illumination changes, non-rigid motion, and different poses. Tracking a moving target against a complex background is difficult: in the real world, various depths in the background can interfere with tracking. The Kalman filter, also called linear quadratic estimation, was therefore introduced; it is an algorithm that uses a series of measurements observed over time [13].

Kalman tracking consists of two main parts: prediction and correction. Prediction estimates the current state and projects the next state; any prediction error is handled in the correction step, where the Kalman gain is computed, the system state is updated, and the error covariance is updated. Correction in turn feeds back into prediction. The Kalman filter can predict the detection range in order to accurately track an object under occlusion, that is, against a complicated background [12].
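A minimal sketch of this predict/correct loop, using OpenCV's KalmanFilter to track a bounding-box centroid under a constant-velocity assumption (the noise levels are invented):

```python
import numpy as np
import cv2

# State: [x, y, vx, vy]; measurement: [x, y]. Noise covariances are assumptions.
kf = cv2.KalmanFilter(4, 2)
kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                [0, 1, 0, 1],
                                [0, 0, 1, 0],
                                [0, 0, 0, 1]], np.float32)
kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                 [0, 1, 0, 0]], np.float32)
kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2
kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1

def track_step(centroid):
    """One predict/correct cycle; centroid is (x, y) or None when occluded."""
    predicted = kf.predict()                   # a priori estimate of the next state
    if centroid is not None:                   # detection available: correct with it
        z = np.array([[centroid[0]], [centroid[1]]], np.float32)
        kf.correct(z)                          # updates state and error covariance
    return predicted[0, 0], predicted[1, 0]    # predicted (x, y) position
```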

F. Unified Detection Model: YOLO

Earlier detection models repurpose classifiers to achieve detection: the model is applied at many locations and scales in the image, and maximum-scoring regions count as detections. YOLO uses an entirely different technique: a single neural network is applied to the whole image.

In the YOLO model, the network splits the image into regions; after the splitting, bounding boxes and probabilities are predicted for each region, and the predicted probabilities weight the bounding boxes [15]. Compared to classifier-based methods, YOLO has several advantages. Predictions are informed by the global context of the image, because YOLO sees the whole image at test time. R-CNN requires thousands of evaluations for a single image, but in YOLO the predictions are made with a single neural network evaluation, which makes YOLO tremendously fast: more than 1000x faster than R-CNN and 100x faster than Fast R-CNN. Training of the YOLO model can be done in two ways: one can either train the model on one's own dataset or use the pretrained weights, which are publicly available.

Some of the dependencies required to build YOLO in TensorFlow:

• TensorFlow/Keras (GPU version preferred for deep learning)

• NumPy (for numeric computation)

• OpenCV (for image processing)

• IPython (for displaying images)

• Glob (for finding the pathnames of files)

Anaconda is suggested, as it contains many machine learning and deep learning libraries, and interaction with Spyder and Jupyter is easier.

V. METHODOLOGY

A. Object Detection

There are mainly two ways of doing object detection. The first is to take object images and train our own machine learning model. When we train a machine learning model, the main input is features; based on the features, the model learns and creates weights for that object. But this method has disadvantages. For example, when the object is a car, there are different types of cars with different shapes, and sometimes even a truck might look like a car in the video. To overcome this issue, the feature extraction must be very robust: the model should be trained on all aspects, such as size, dimension, and shape, which requires a large amount of data. Training therefore depends on our system; if the system's GPU is weak, we cannot train our model at all, or it may take a very long time. Basic models such as SVMs, shallow neural networks, or random forests, which need less data, do not work well with real-time data. So we use a YOLO model, which is built on the concept of a convolutional neural network. One difference between YOLO and other CNN models is that YOLO covers the image with a grid of regions: whenever a required object appears on the screen, YOLO highlights that object. With the weights already present in the model, it detects the object and recognizes it. For each object there exists a different weight in YOLO. There are different types of YOLO: some models may cover 150 different objects, and some might cover 80. There is an option to limit the number of objects to whatever is required, or one can use all the objects present in the model. In this research we used a model with 80 classes. Every time the code runs, we have to load the weights; since we do not want our model to detect all 80 classes, we limit detection to the first 10. This is how object detection takes place in YOLO, as sketched below.
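One plausible way to implement this class restriction is to filter detections by class ID after inference; the indices below follow the standard 80-class COCO ordering shipped with common YOLOv3 weights and are an assumption about the deployed model.

```python
# COCO (0-indexed) vehicle classes for standard YOLOv3 weights:
# 2 = car, 3 = motorbike, 5 = bus, 7 = truck.
VEHICLE_IDS = {2, 3, 5, 7}

def keep_vehicles(detections):
    """detections: iterable of (class_id, confidence, box) tuples.
    Drops every detection whose class is not a vehicle."""
    return [d for d in detections if d[0] in VEHICLE_IDS]
```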

B. Object Tracking

The TensorFlow model used can help detect the object, but it will not track it. To track the object, a bounding box is given to all the objects present in the video or on the screen. TensorFlow returns the detection tensor; out of the eight coordinates per detection, we extract the four box coordinates and multiply them by the frame width and height to fix the pixel dimensions of the detected object.
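A sketch of this coordinate extraction, assuming the normalized (ymin, xmin, ymax, xmax) layout produced by TensorFlow object-detection models:

```python
def to_pixel_box(norm_box, frame_w, frame_h):
    """Convert a normalized (ymin, xmin, ymax, xmax) detection to a pixel
    bounding box (x, y, w, h), scaling by the frame width and height."""
    ymin, xmin, ymax, xmax = norm_box
    return (int(xmin * frame_w), int(ymin * frame_h),
            int((xmax - xmin) * frame_w), int((ymax - ymin) * frame_h))
```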

C. Distance Calculation

Distance is calculated from the bounding boxes. In fact, YOLO detects the objects of interest and gives their regions. Next, the algorithm takes the left and right images and constructs a dense disparity map. Finally, it takes the different object regions, works out an average disparity value, and then works out a distance using the focal length and camera baseline of the stereo capture device.
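A hedged sketch of this region-averaged disparity-to-distance step, using the pinhole relation D = f·B/d with the focal length f in pixels and the baseline B in meters; the disparity map is assumed to be a NumPy array in which invalid pixels are ≤ 0.

```python
import numpy as np

def distance_from_disparity(disparity, box, f_px, baseline_m):
    """Average the dense disparity inside a detected bounding box and
    convert it to a distance with D = f * B / d."""
    x, y, w, h = box
    region = np.asarray(disparity, dtype=np.float32)[y:y + h, x:x + w]
    valid = region[region > 0]          # discard unmatched/invalid pixels
    if valid.size == 0:
        return None                     # no usable disparity in this region
    return f_px * baseline_m / float(valid.mean())
```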

In addition, as an object gets closer, its detected size becomes bigger; if only a small portion of the object is detected, we can assume that it is either far away or off to the side (left or right). The bigger the size, the closer the vehicle or object; the smaller it appears, the greater the distance. So a distance can also be predicted from the bounding box alone, using a calibration constant.

Then, we display on the screen the estimated inter-vehicle distances each time a vehicle crosses the delimited area.

D. Architecture of our System

Before any experimentation could occur, a baseline system needed to be created. We began by stripping the given yolo.py to its main functions and modularizing it for use on individual frames. We then used parts of the given stereo disparity.py to develop two functions for dense-disparity distance calculation. First, YOLO would detect the objects of interest and give their regions. Next, it would take the left and right images and construct a dense disparity map. Finally, it would take the different object regions and work out an average disparity value, then work out a distance using the focal length and camera baseline of the stereo capture device. Once we had this basic system working, we could begin to experiment with different optimization techniques.

The first thing we noticed was that many of the images had low contrast. To remedy this, we used a form of histogram equalization called Contrast Limited Adaptive Histogram Equalization (CLAHE), which works by taking small regions (tiles) and applying equalization on those rather than on the entire image. While the filter did not seem to have much effect on distance values, it had some success in helping YOLO detect objects in poor light conditions.
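The CLAHE step can be reproduced with OpenCV in a few lines; the clip limit and tile grid below are common defaults, not necessarily the values used in these experiments.

```python
import cv2

# Tile-wise histogram equalization: equalize 8x8 tiles with a clip limit of 2.0.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))

def enhance_contrast(gray_frame):
    """Apply CLAHE to a grayscale frame before detection/disparity."""
    return clahe.apply(gray_frame)
```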

The next experiment was to apply a filter to the disparity map. The filter we tried was the Weighted Least Squares (WLS) filter, which smooths the disparity map and makes it more uniform. This seems like it should help with the distance calculation though, in practice, not much change was seen, and in some cases the filter made things worse. This could be because smoothing causes the image to lose detail and thus valuable information that could have helped with distancing. We also apply a noise filter to the disparity map to lower the amount of noise, as this helps to provide a better distance average.
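A sketch of the WLS disparity filtering described above, using opencv-contrib's ximgproc module; the matcher parameters and the lambda/sigma settings are assumptions.

```python
import cv2

# Left/right semi-global block matchers plus a WLS post-filter.
left_matcher = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=5)
right_matcher = cv2.ximgproc.createRightMatcher(left_matcher)
wls = cv2.ximgproc.createDisparityWLSFilter(matcher_left=left_matcher)
wls.setLambda(8000.0)      # smoothness strength (assumed value)
wls.setSigmaColor(1.5)     # edge sensitivity (assumed value)

def filtered_disparity(gray_left, gray_right):
    """Compute both disparity maps, then smooth the left one with WLS."""
    disp_l = left_matcher.compute(gray_left, gray_right)
    disp_r = right_matcher.compute(gray_right, gray_left)
    return wls.filter(disp_l, gray_left, disparity_map_right=disp_r)
```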

In summary, we have experimented with various techniques to attempt to increase the robustness of the system. While not all of these have proved effective, they have all led to a solution that is a suitable prototype for object distance detection. YOLO has been able to find most of the objects in the scene (helped a little by CLAHE histogram equalization), and the disparity maps seem to have given enough information to get a reasonable distance estimate.

Moreover, the results are assumed to be accurate when the model detects the objects correctly.

During the validation of the objects detected by the model, we obtained an object accuracy of 84.89% at a processing time of 0.031 seconds per image (Fig. 8).

Fig. 8. Execution of the Algorithm.

VI. CONCLUSION

Analyzing the precision and recall results, it can be said that YOLO is one of the best models for vehicle detection. The YOLO model achieved 85% precision and 62% recall at a rate of 30 frames per second. We have also successfully estimated the distance between vehicles.

The YOLO model is in first place in object detection speed compared to other convolutional neural networks. The detection speed we achieved is 0.03 seconds per image, which is 10 times faster than existing object detection models; the YOLO model is the only model that has achieved this accuracy on real-time video streams. From the orientation estimation computation, we found that YOLO has good precision in predicting object orientation. All the experiments conducted show that YOLO performs well in both object detection and orientation precision. Since an object's orientation plays a major role in intelligent transport systems, the accuracy we obtained for orientation estimation lets us state that YOLO fits them best.

A. Implementation of our System in the Real World

Layer 1: Acquisition and pre-processing. The main function of this layer is to acquire images from the stereoscopic system. Image processing then improves the quality of the image by removing noise, camera vibrations, lighting changes, etc.

Layer 2: Attribute extraction and analysis. From the images obtained from Layer 1, this layer extracts the static and dynamic vehicle attributes necessary for road traffic management: vehicle detection, trajectory extraction, vehicle recognition (license plate, make, and color detection), and the measurement of inter-vehicle distance. The extracted attributes are then analyzed to understand traffic conditions and behaviors.

Layer 3: Detection of illegal activities and anomalies, and analysis of traffic flows. Based on the results of the previous layers, this layer provides services for efficient management and control of road traffic. It can detect traffic violations (such as stop violations, red light violations, speeding, illegal overtaking, fake license plates, unauthorized changes of direction, etc.) and anomalies on the road (accidents, obstacles blocking traffic, traffic light malfunctions, etc.).

B. Future Work

In this paper, we presented a method of inter-vehicle distance estimation based on stereoscopic vision. After detecting, locating, and delimiting vehicles using the YOLOv3 deep neural network, we estimate the distance separating each vehicle from the cameras based on stereovision principles and finally deduce the inter-vehicle distance. As a perspective, we plan to extend the technique to estimate inter-vehicle distance in urban areas, and we aim to satisfy the real-time constraint so that our system can be deployed in the real world.

REFERENCES

[1] O. Bourja, K. Kabbaj, H. Derrouz, A. El Bouziady, R. O. H. Thami, Y. Zennayi, and F. Bourzeix, "Movits: Moroccan video intelligent transport system," in 2018 IEEE 5th International Congress on Information Science and Technology (CiSt). IEEE, 2018, pp. 502–507.

[2] G. Kim and J.-S. Cho, "Vision-based vehicle detection and inter-vehicle distance estimation for driver alarm system," Optical Review, vol. 19, no. 6, pp. 388–393, 2012.

[3] A. El Bouziady, R. O. H. Thami, O. Bourja, F. Bourzeix, and S. El Fkihi, "Estimation du vitesse du trafic routier avec la stéréo-vision."

[4] A. El Bouziady, R. O. H. Thami, M. Ghogho, O. Bourja, and S. El Fkihi, "Vehicle speed estimation using extracted SURF features from stereo images," in 2018 International Conference on Intelligent Systems and Computer Vision (ISCV). IEEE, 2018, pp. 1–6.

[5] F. Bourzeix, O. Bourja, M. A. Boukhris, and N. Es-Sbai, "Speed estimation using stereoscopic effect," in 2014 Tenth International Conference on Signal-Image Technology and Internet-Based Systems. IEEE, 2014, pp. 147–151.

[6] P. Charbonnier, V. Muzet, P. Nicolle, N. Hautière, J.-P. Tarel, and D. Aubert, "La stéréovision appliquée à l'analyse de scènes routières," Bulletin des laboratoires des ponts et chaussées, no. 272, pp. 57–74, 2008.

[7] J. Redmon and A. Farhadi, "YOLOv3: An incremental improvement," arXiv preprint arXiv:1804.02767, 2018.

[8] B. Strbac, M. Gostovic, Z. Lukac, and D. Samardzija, "YOLO multi-camera object detection and distance estimation," in 2020 Zooming Innovation in Consumer Technologies Conference (ZINC). IEEE, 2020, pp. 26–30.

[9] H. Derrouz, A. E. Hassouny, R. Oulad Haj Thami, and H. Tairi, "Hybrid method for background modeling and subtracting," in 2017 Intelligent Systems and Computer Vision (ISCV), 2017, pp. 1–5.

[10] H. Derrouz, A. Elbouziady, H. Ait Abdelali, R. Oulad Haj Thami, S. El Fkihi, and F. Bourzeix, "Moroccan video intelligent transport system: Vehicle type classification based on three-dimensional and two-dimensional features," IEEE Access, vol. 7, pp. 72528–72537, 2019.

[11] Y. Li, Z. Zhao, Y. Luo, and Z. Qiu, "Real-time pattern-recognition of GPR images with YOLO v3 implemented by TensorFlow," Sensors, vol. 20, no. 22, p. 6476, 2020.

[12] J. Ciberlin, R. Grbic, N. Teslic, and M. Pilipovic, "Object detection and object tracking in front of the vehicle using front view camera," in 2019 Zooming Innovation in Consumer Technologies Conference (ZINC). IEEE, 2019, pp. 27–32.

[13] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, vol. 25, pp. 1097–1105, 2012.

[14] D. Jifeng, L. Yi, H. Kaiming, and S. Jian, "Object detection via region-based fully convolutional networks," Advances in Neural Information Processing Systems, pp. 379–387, 2016.

[15] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," arXiv preprint arXiv:1506.01497, 2015.

[16] S. K. Chadalawada, "Real time detection and recognition of construction vehicles: Using deep learning methods," 2020.

[17] J. B. Guapacha and S. C. A. Mantovanni, "Real time object detection and tracking using the Kalman filter embedded in single board in a robot," in 2017 CHILEAN Conference on Electrical, Electronics Engineering, Information and Communication Technologies (CHILECON). IEEE, 2017, pp. 1–6.

[18] Z. Chen, R. Khemmar, B. Decoux, A. Atahouet, and J.-Y. Ertaud, "Real time object detection, tracking, and distance and motion estimation based on deep learning: Application to smart mobility," in 2019 Eighth International Conference on Emerging Security Technologies (EST). IEEE, 2019, pp. 1–6.
