
Research Article

Vehicle Emission Detection in Data-Driven Methods

Zheng He,1,2 Gang Ye,1,2 Hui Jiang,3 and Youming Fu1,2

1 National Engineering Research Center for Multimedia Software, School of Computer Science, Wuhan University, Wuhan 430072, China
2 Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University, Wuhan 430072, China
3 School of Computer Science, Wuhan University, Wuhan 430072, China

Correspondence should be addressed to Gang Ye; yeg@whu.edu.cn

Received 26 May 2020; Revised 29 July 2020; Accepted 27 September 2020; Published 14 October 2020

Academic Editor: Jun-Jun Jiang

Copyright © 2020 Zheng He et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Environmental protection is a fundamental policy in many countries, where vehicle emission pollution stands out as a main component of the pollution tracked in environmental monitoring. Remote sensing technology has recently been widely used for vehicle emission detection, mainly because the detection data retrieved by remote sensing methods are fast to obtain, realistic, and large in scale. In the remote sensing process, the fuel type and registration time of new cars and nonlocally registered vehicles usually cannot be accessed, so vehicle pollution cannot be assessed directly by analyzing emission pollutants. To handle this problem, this paper adopts data mining methods to analyze the remote sensing data and predict fuel type and registration time. The paper makes full use of decision tree, random forest, AdaBoost, XgBoost, and their fusion models to predict these two essential pieces of information precisely and further applies them to an essential application: vehicle emission evaluation.

1. Introduction

The popularization of vehicles in daily life has continuously increased with the expansion of urbanization around the world. Gasoline-engine vehicles are the most popular and widely used type compared with new energy ones, and pollutant gases from vehicles, such as carbon dioxide, carbon monoxide, hydrocarbons, and nitrogen oxides, have become the main contaminants in urban atmospheric pollution [1]. Efficient vehicle pollution detection has therefore become an urgent task that attracts more and more attention. Exhaust emission detection methods have evolved from periodic detection at environmental monitoring stations to daily on-road detection with remote sensing technology. This paper studies vehicle emission detection in cities of China, which is one of the largest developing countries.

In the USA, the EPA (Environmental Protection Agency) proposed the MOVES algorithm [2] to calculate the vehicle emission ratio at fixed locations and periods of time. The Japanese government enforces a vehicle exhaust emission monitoring system, and the emission behaviour of each vehicle in Japan can be checked on the official Japanese national transportation website [3]. In order to capture emission detection results rapidly, a French transport agency collects emission pollution-related information from different places and combines it into a sharing network for vehicle emission detection [4]. Related research in this area started somewhat later in China. In 2011, Cheng et al. [5] made a systematic analysis of the harm caused by vehicle emissions, verifying the necessity of exhaust emission control. The next year, Wu [6] collected the values of CO2, HC, CO, and NO emitted by 1092 vehicles in Xianyang city using the simplified loaded mode. They established regression equations between the emission value and vehicle information and found that the average emission value was highly related to the vehicle acceptability and the age of the vehicle. Referring to the local standards, they further gave a systematic explanation for the rationalization of the local standard mean emission value based on their research. With the development of remote sensing technology, a large amount of practical exhaust emission data can be obtained by environmental protection agencies in China. This paper introduces data mining technology to these valuable data to explore efficient information in vehicle exhaust emission detection. This research has great potential to help environmental protection departments assess unqualified vehicles accurately and to provide a theoretical basis for policymakers.

Hindawi, Mathematical Problems in Engineering, Volume 2020, Article ID 4875310, 13 pages, https://doi.org/10.1155/2020/4875310

The first successful vehicle emissions demonstration system was probably the across-road vehicle emissions remote sensing system (VERSS) proposed by Gary Bishop and colleagues at the University of Denver in the late 1980s [7, 8]. The first instrument was a liquid-nitrogen-cooled nondispersive infrared (NDIR) device that could measure only CO and CO2. Over the next two decades, the team continuously refined the system: they added hydrocarbon, H2O, and NO channels to the NDIR system [9, 10], integrated an ultraviolet spectrophotometer and improved it to enhance NO measurement [11, 12], and removed the dependence on liquid nitrogen cooling [13]. The Denver group designed another commonly used remote sensing device, known as the fuel efficiency automobile test, which provided some of the earliest reports on across-road particulate measurement [14]. Many other sensing systems, typically based on multiple spectrometric approaches, have also been proposed for detecting the emissions of passing vehicles [15–17]. More recently, Hager Environmental and Atmospheric Technologies introduced an infrared laser-based VERSS named the Emission Detection and Reporting (EDAR) system, which incorporates several new functions that make it a particularly interesting system for vehicle emission detection.

Important information is buried in vehicle emission remote sensing data. This paper exploits data mining methods to process the data and obtain valuable knowledge from them. There are three main directions in data mining: improvements of classical data mining algorithms, ensemble learning algorithms, and data mining with deep learning. Improvements on classical algorithms are usually developed for and employed in multiple application scenarios, taking additional information into consideration. Ensemble learning is the integration of multiple learners within a certain structure, which completes learning tasks by constructing and combining different learners. Its general structure can be summarized as follows: first generate a set of individual learners and then combine them with some strategy. The combining strategies mainly include the averaging method, the voting method, and the learning method. Bagging and boosting [18] are the most commonly used ensemble learning algorithms, which improve the accuracy and robustness of prediction models. With its rapid development and popularization, deep learning plays a more and more important role in data learning, supported by big data and high-performance computing. Many traffic engineering-related studies focus on analyzing relevant data, such as traffic diversion [19], traffic safety monitoring [20], engine diagnosis [21], road safety [22], traffic accidents [23], and remote sensing image processing [24–35], extracting useful information and digging out valuable knowledge. A few works address vehicle emission evaluation with data mining, which is the key subject of this paper. Xu et al. [36] used XgBoost to develop prediction models for CO2eq and PM2.5 emissions at the trip level. In [37], Ferreira et al. applied online analytical processing (OLAP) and knowledge discovery (KD) techniques to deal with the high volume of their dataset, to determine the major factors that influence average fuel consumption, and then to classify the drivers involved according to their driving efficiency. Chen et al. [38] proposed a driving-events-based ecodriving behaviour evaluation model, which proved to be highly accurate (96.72%).

Relevant environmental policies have been introduced in China to define different limit standards based on the vehicle fuel type and registration time. The license plate number, plate color, speed, acceleration, VSP (vehicle specific power), etc. are captured by the surveillance system when vehicles pass the remote survey stations. The smoke plume generated by the gas emission is simultaneously analyzed by laser equipment at the stations, from which the exhaust emission value can be calculated. With the fuel type and registration time obtained from the vehicle plate number, the emission standard value can be looked up to judge whether the vehicle's emission is acceptable. However, the registration information of nonlocal vehicles and some local vehicles is not recorded in the official database owing to the limitations of environmental policies, so the fuel type and registration time cannot be provided for vehicle emission detection. According to the National Telemetry Standard in China, relevant departments treat vehicles with missing information as diesel vehicles; as a result, the applicable emission limits of some vehicles remain unknown and these vehicles cannot be evaluated. Therefore, precise information on the fuel type and registration time of vehicles is an essential prerequisite for finding pollution-exceeding vehicles. This paper adopts multiple data mining methods to learn the fuel type and registration information of vehicles from remote sensing data and further uses a cascaded classification framework to make accurate predictions of vehicle emission-related information, providing valuable reference standards for the evaluation of different vehicles.

2. Data Mining Models for Analysis

In this section, detailed descriptions of the models and dataset used in this study are given [39].

2.1. Data Mining Methods

2.1.1. Decision Tree Model. The decision tree model [40] is a commonly used data mining method for classification or prediction, based on information theory and a greedy algorithm-like framework. The model divides the whole dataset into branch-like parts to construct an inverted tree with a root node, internal nodes, and leaf nodes. The nonparametric design enhances the efficiency and generalization ability of the decision tree in processing a large and complex dataset.

Five core components make up the decision tree: (1) Nodes: root node, internal nodes, and leaf nodes are the three types, which represent different choice operations for data distribution. (2) Branches: they represent the splitting process of nodes in the decision tree, and each path from the root node to a leaf node represents a corresponding decision rule for classification. (3) Splitting: the procedure that generates child nodes from parent nodes, terminating when the predetermined homogeneity or stopping criterion is met. (4) Stopping: stopping rules are applied to prevent overfitting and inaccuracy. (5) Pruning: an alternative approach that first grows a large tree and then prunes it to an optimal structure by removing useless nodes. This paper uses decision trees to classify the vehicle fuel type; specifically, it calculates the information gain of the corresponding attributes to generate the rule model for fuel type prediction. The attributes that greatly affect the final results can thus be shown in a quite intuitive way. Figure 1 illustrates the decision tree model for fuel type prediction.
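As an illustration of the information gain used to choose splitting attributes, the following minimal sketch computes it for a categorical attribute. The function names and the toy plate color and fuel type values are hypothetical and not taken from the paper's data.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(values, labels):
    """Entropy reduction obtained by splitting `labels` on a categorical attribute."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy of the child nodes after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy records: plate color perfectly separates the two fuel types.
plate_color = ["blue", "blue", "yellow", "yellow"]
fuel_type = ["gasoline", "gasoline", "diesel", "diesel"]
gain = information_gain(plate_color, fuel_type)
```

In this toy case the split removes all uncertainty, so the gain equals the entropy of the labels before the split (1 bit). Real attributes would typically yield smaller gains, and continuous attributes would first need a threshold.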

2.1.2. Random Forest Model. The random forest model is another classic and efficient data mining method that belongs to bagging learning. In 2001, Leo Breiman combined ensemble learning theory [41] with the random subspace method [42], proposing the well-known machine learning methodology random forest. It is a data-driven nonparametric model that requires no a priori knowledge and has good tolerance to noise and abnormal values, as well as excellent extendibility and parallelism for high-dimensional data classification. The ensemble learning structure enables random forest to overcome, to some extent, the performance bottleneck and overfitting of single classifiers such as SVM.

Given a dataset $D = \{(X_i, Y_i) \mid X_i \in \mathbb{R}^K, Y_i \in \{1, 2, \ldots, C\}\}$, random forest is essentially an ensemble of classifiers made up of $M$ decision trees $\{g(D, \theta_m), m = 1, 2, \ldots, M\}$. The classification result is decided by the vote of every decision tree and depends strongly on two randomizations: sample bagging and the feature random subspace. Sample bagging randomly draws $M$ training sets with replacement, each of the same size as the original dataset, and constructs a corresponding decision tree from each. When a node in a decision tree is split, the model randomly selects a feature subset from the whole $K$ features (usually of size $\log_2 K$), from which an optimal splitting feature is selected. These features constitute the feature random subspace and contain more discriminative feature combinations for classification. Since, in the construction of each decision tree, the random selection of training data and feature subspace is independent and the constructions are procedurally identical, $\{\theta_m, m = 1, 2, \ldots, M\}$ is a sequence of independent and identically distributed random variables. This property makes random forest easy and efficient to implement in parallel and ensures its high extendibility. The structure and construction of random forest are shown in Figure 2.
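The two randomizations and the final vote can be sketched as follows. This is only an illustration under simplifying assumptions: the helper names are hypothetical, tree training itself is omitted, and the subspace size is taken as the integer part of $\log_2 K$.

```python
import math
import random
from collections import Counter

def bootstrap_sample(records, rng):
    """Draw len(records) records with replacement (sample bagging)."""
    return [rng.choice(records) for _ in records]

def random_subspace(num_features, rng):
    """Pick about log2(K) distinct feature indices as split candidates."""
    size = max(1, int(math.log2(num_features)))
    return sorted(rng.sample(range(num_features), size))

def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)
records = list(range(10))             # stand-ins for (X_i, Y_i) pairs
sample = bootstrap_sample(records, rng)
features = random_subspace(30, rng)   # K = 30 features gives 4 candidates
winner = majority_vote(["gasoline", "diesel", "gasoline"])
```

Because each tree only needs its own bootstrap sample and its own sequence of feature subsets, the trees can be built independently, which is what makes the parallel implementation mentioned above straightforward.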

2.1.3. AdaBoost Model. Like bagging, boosting is an ensemble learning method; it continuously enhances the classification/prediction accuracy of a base learner through an iterative update process. AdaBoost [43], which is based on boosting learning, was proposed by Freund and Schapire in 1995. The algorithm has been widely used in various classification/prediction fields owing to its outstanding performance. The central idea of AdaBoost is to continuously update the weights of wrongly judged samples. The weights of samples misclassified by the previous base classifier are increased, while the weights of correctly classified samples are decreased, and the reweighted samples are used to train the next base classifier. At the same time, a new weak classifier is added to the cascaded classifiers in each iteration, and the final strong classifier is not determined until a predetermined, sufficiently small error rate or the prespecified maximum number of iterations is reached. The concrete procedure is as follows. Given the training set $\{(x_1, y_1), \ldots, (x_n, y_n)\}$, $y_i \in \{-1, 1\}$, $i = 1, \ldots, n$, $x_i$ is the $i$-th training sample with label $y_i$, where $y_i = 1$ and $y_i = -1$ denote the positive and negative labels, respectively, and $w_i$ is the weight of the $i$-th sample in AdaBoost. The weights are first initialized with the uniform distribution:

$$D_1(i) = (w_1, \ldots, w_n) = \left(\frac{1}{n}, \ldots, \frac{1}{n}\right) \tag{1}$$

Then $m = 1, 2, \ldots, M$ iterations are performed. The base classifier $G_m(x)$ is trained under the cost distribution $D_m$, and the accumulated cost of the samples misclassified by $G_m(x_i)$ is represented by the classification error rate $E_m$, which is defined as

$$E_m = P(G_m(x_i) \neq y_i) = \sum_{i=1}^{n} w_i I(G_m(x_i) \neq y_i) \tag{2}$$

where $w_i = D_m(i)$ is the current weight of sample $i$.

The coefficient $\alpha_m$ of the base classifier $G_m(x)$ is calculated by the following formula:

$$\alpha_m = \frac{1}{2}\ln\left[\frac{1 - E_m}{E_m}\right] \tag{3}$$

Figure 1: Decision tree model for vehicle fuel type prediction (the vehicle fuel type dataset is split on vehicle attributes 1 to N through Y/N branches down to vehicle fuel type leaves).

In the next iteration, the cost distribution of samples $D_{m+1}$ is updated as

$$D_{m+1}(i) = \frac{D_m(i)\exp(-\alpha_m y_i G_m(x_i))}{Z_m} \tag{4}$$

where $Z_m$ is defined as

$$Z_m = \sum_{i=1}^{n} D_m(i)\exp(-\alpha_m y_i G_m(x_i)) \tag{5}$$

The final strong classifier $G(x)$ is the combination of the trained base classifiers, denoted as

$$G(x) = \operatorname{sign}(f(x)) = \operatorname{sign}\left(\sum_{m=1}^{M}\alpha_m G_m(x)\right) \tag{6}$$
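Equations (1) to (6) can be traced in a short sketch. It assumes, for illustration only, that the base classifiers are chosen from a fixed pool of one-dimensional threshold rules; `stump`, `adaboost`, `predict`, and the toy data are hypothetical and not the paper's implementation.

```python
import math

def stump(threshold, polarity):
    """Threshold rule: returns `polarity` below the threshold, `-polarity` otherwise."""
    return lambda x: polarity if x < threshold else -polarity

def adaboost(xs, ys, candidates, rounds):
    """Train AdaBoost on labels in {-1, +1}, picking base classifiers from `candidates`."""
    n = len(xs)
    weights = [1.0 / n] * n                       # Eq. (1): uniform D_1
    ensemble = []
    for _ in range(rounds):
        def error(g):                             # Eq. (2): weighted error rate
            return sum(w for w, x, y in zip(weights, xs, ys) if g(x) != y)
        g = min(candidates, key=error)
        err = error(g)
        if err >= 0.5:                            # no candidate beats chance; stop
            break
        if err == 0:                              # perfect rule; alpha would be infinite
            ensemble.append((1.0, g))
            break
        alpha = 0.5 * math.log((1 - err) / err)   # Eq. (3)
        # Eqs. (4)-(5): reweight the samples and normalise by Z_m.
        weights = [w * math.exp(-alpha * y * g(x)) for w, x, y in zip(weights, xs, ys)]
        z = sum(weights)
        weights = [w / z for w in weights]
        ensemble.append((alpha, g))
    return ensemble

def predict(ensemble, x):
    """Eq. (6): sign of the alpha-weighted vote of the base classifiers."""
    score = sum(alpha * g(x) for alpha, g in ensemble)
    return 1 if score >= 0 else -1

xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]
thresholds = [t + 0.5 for t in range(9)]
candidates = [stump(t, 1) for t in thresholds] + [stump(t, -1) for t in thresholds]
ensemble = adaboost(xs, ys, candidates, rounds=3)
predictions = [predict(ensemble, x) for x in xs]
```

No single threshold rule separates this toy data, but after three rounds the weighted combination classifies every training point correctly, which illustrates how reweighting misclassified samples steers later base classifiers toward the hard cases.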

2.1.4. XgBoost Model. XgBoost [44] is a modified algorithm based on GBDT (gradient boosting decision tree) [45]. Both share the same idea as the boosting methodology: in each iteration, the current decision trees are learned from the results of the previous iteration and move in the direction that diminishes the residual. When dealing with multiclassification problems, the logarithmic likelihood loss function is defined as

$$L(y, f(x)) = -\sum_{k=1}^{K} y_k \log p_k(x) \tag{7}$$

where $y$ is the label of the input data $x$, $k$ denotes the attribute value, and $y_k$ is an indicator function: if the predicted value is $k$, then $y_k = 1$. The prediction probability $p_k(x)$ is denoted as

$$p_k(x) = \frac{\exp(f_k(x))}{\sum_{l=1}^{K}\exp(f_l(x))} \tag{8}$$

At iteration $t$, when the label of sample $i$ is $l$, the current negative gradient can be calculated as

$$r_{til} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f_k(x) = f_{l,t-1}(x)} \tag{9}$$
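For the loss in equation (7) with the probabilities of equation (8), the negative gradient with respect to each score $f_k$ has the well-known closed form $y_k - p_k(x)$. The sketch below evaluates it for one sample; it assumes $y$ is the one-hot encoding of the true label, and the function names are illustrative.

```python
import math

def softmax(scores):
    """Eq. (8): turn per-class scores f_k(x) into probabilities p_k(x)."""
    peak = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def log_loss(one_hot, scores):
    """Eq. (7): multiclass logarithmic likelihood loss for one sample."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, softmax(scores)))

def negative_gradient(one_hot, scores):
    """Eq. (9) for this loss: -dL/df_k = y_k - p_k(x)."""
    return [y - p for y, p in zip(one_hot, softmax(scores))]

scores = [0.0, 0.0, 0.0]     # f_k(x) before any tree is fitted
one_hot = [1, 0, 0]          # true label is the first class
loss = log_loss(one_hot, scores)
residuals = negative_gradient(one_hot, scores)
```

With equal scores every class has probability 1/3, so the loss is $\ln 3$ and the residual pushes the score of the true class up while pushing the others down. The next tree in a boosting round is fitted to these residuals.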

Figure 2: The construction of random forest (training data, bootstrap sampling into training datasets 1 to n, feature selection for splitting, voting, result).


As a classic algorithm, GBDT has advantages such as high accuracy, robustness, and convenience. Yet it uses only the first-order partial derivative to calculate the negative gradient, which may cause a relatively large error. To overcome this shortcoming, XgBoost applies a second-order Taylor expansion to the loss function, and the results come closer to the ground truth when this expansion is used to calculate the leaf node weights. In addition, XgBoost first sorts the sample data and stores the records in the form of blocks, so XgBoost trains much faster than GBDT on the same training data.

2.2. Model Fusion. Although the single models described in the above subsection perform satisfactorily on data classification and prediction, model fusion methods that exploit the differences between the characteristics of individual models can further improve the accuracy and robustness of the final results. Voting is a common method for model fusion. This paper adopts two fusion methods to construct the combined model: hard voting and soft voting [46].

The hard voting classifier follows the simple idea that the minority obeys the majority. Given the classification/prediction results, each label assigned to the same sample is treated as a vote, and the label with the most votes is assigned to the sample. This process can be defined as

$$H_{\mathrm{vote}}(x) = \max\left\{\sum_i \operatorname{lab}(x, i, 1), \sum_i \operatorname{lab}(x, i, 2), \ldots, \sum_i \operatorname{lab}(x, i, k)\right\} \tag{10}$$

where $x$ denotes the sample, $H_{\mathrm{vote}}(x)$ is the result of hard voting, and $\operatorname{lab}(x, j, c)$ is an indicator function that shows whether the $j$-th classifier deems the label of $x$ to be $c$: $\operatorname{lab}(x, j, c) = 1$ when the probability $p(x, j, c)$ that $x$ belongs to label $c$, as calculated by the $j$-th classifier, exceeds some threshold value; otherwise $\operatorname{lab}(x, j, c) = 0$.

The soft voting classifier is another fusion strategy, which uses the average, over all classifiers, of the predicted probabilities for a given label as the criterion. The label with the highest average probability is the final result, and the voting can be expressed as

$$S_{\mathrm{vote}}(x) = \max\left\{\frac{\sum_i p(x, i, 1)}{n_{cf}}, \frac{\sum_i p(x, i, 2)}{n_{cf}}, \ldots, \frac{\sum_i p(x, i, k)}{n_{cf}}\right\} \tag{11}$$

where $n_{cf}$ is the total number of classifiers and $k$ is the number of labels.
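The two fusion rules can be compared on a small example. In this sketch each classifier is represented only by its vector of per-label probabilities; the 0.5 threshold, the function names, and the probability values are assumptions made for illustration.

```python
def hard_vote(probabilities, threshold=0.5):
    """Eq. (10): a classifier votes for each label whose probability exceeds `threshold`."""
    num_labels = len(probabilities[0])
    votes = [sum(1 for p in probabilities if p[c] > threshold) for c in range(num_labels)]
    return votes.index(max(votes))

def soft_vote(probabilities):
    """Eq. (11): pick the label with the highest mean probability over classifiers."""
    n_cf = len(probabilities)
    num_labels = len(probabilities[0])
    means = [sum(p[c] for p in probabilities) / n_cf for c in range(num_labels)]
    return means.index(max(means))

# Hypothetical outputs of three classifiers for the labels (gasoline, diesel).
probabilities = [[0.55, 0.45], [0.60, 0.40], [0.05, 0.95]]
hard_label = hard_vote(probabilities)
soft_label = soft_vote(probabilities)
```

Here two lukewarm classifiers outvote one confident classifier under hard voting (label 0), whereas soft voting follows the confident one (label 1). This difference is the reason for evaluating both fusion models.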

This paper combines the different models mentioned in Section 2 with both methods, denoted as the hard voting model and the soft voting model, respectively, to compare the final results.

3. Data Description

The data are collected by means of remote sensing. The vehicle remote sensing system is a complicated synthetic system made up of several subsystems, including exhaust (tail gas) analysis, environmental information monitoring, traffic condition monitoring, and vehicle identification. When vehicles pass the remote sensing devices, the equipment detects the smoke plume produced by the diffusion of vehicle exhaust. The specific process can be summarized as follows: the probe light from the remote sensing device passes through the air mass and then returns to the detection unit through the right-angle displacement unit, completing the detection of carbon dioxide, carbon monoxide, hydrocarbons, nitrogen oxides, etc. The intensity of the opaque smoke emitted by diesel engine vehicles can also be monitored thanks to the gas-diesel integration design. As parameters of the vehicles' running status, the instantaneous velocity and acceleration of vehicles are obtained synchronously by the remote sensing system. The exhaust and running data together form the final remote sensing results.

Vehicle exhaust includes water vapor, oxygen, hydrogen, nitrogen, carbon dioxide, carbon monoxide, hydrocarbons, nitrogen oxides, sulfur dioxide, and particulate matter. There are two main methods for on-road remote sensing detection in vehicle exhaust analysis: the nondispersive infrared (NDIR) analyzer, which measures carbon monoxide, carbon dioxide, and hydrocarbons, and the dispersive ultraviolet (DUV) method, which measures nitric oxide, smoke factors, and particulate matter (including opacity). The exhaust diffuses and dilutes in the air immediately after being discharged, and the variation of the diluted concentration is affected by factors such as air disturbance, wind direction, and wind speed. Direct measurement of the concentration of each pollutant in the exhaust plume may therefore not reflect the vehicle emissions accurately and efficiently, so carbon dioxide is adopted as the reference gas for measuring the various exhaust pollutants in vehicle remote sensing technology. A single exhaust remote sensing optical path (whether erected horizontally or vertically) cannot measure the exhaust of multiple vehicles at the same time. Vehicles must pass one by one, and the time gap between passing vehicles should be greater than one second so that the remote sensing device has enough time to measure the exhaust of the preceding vehicle. This also allows the exhaust of the front vehicle to disperse in time without affecting the remote sensing of the vehicles behind.

The remote sensing data used in the experiments come from real detection data in the database of the Environmental Protection Bureau of a certain city. Each record is derived from the remote sensing database and consists of three parts. The first part is the basic vehicle information $\vec{V}$, which contains the license plate number $m$, the license plate color $c$, and the passing time $t$; it is denoted as $\vec{V} = (m, c, t)$. The second part is the vehicle condition information $\vec{S}$, which includes the vehicle bodywork length $cl$, the speed $v$, and the acceleration $a$, denoted as $\vec{S} = (cl, v, a)$. The third part is the remote sensing result $\vec{C}$. It includes the detected values of carbon dioxide $C_{co2}$, carbon monoxide $C_{co}$, hydrocarbons $C_{hc}$, nitric oxide $C_{no}$, and smoke intensity $C_{op}$, together with the detected environmental values: wind speed $C_{ws}$, wind direction $C_{wd}$, temperature $C_{ot}$, humidity $C_{h}$, and atmospheric pressure $C_{ps}$. The remote sensing data are thus recorded as $\vec{C} = (C_{co2}, C_{co}, C_{hc}, C_{no}, C_{op}, C_{ws}, C_{wd}, C_{ot}, C_{h}, C_{ps})$. Each record used in the research is composed of these three parts, and the vehicle fuel type and the registration time range are the prediction targets.

4. Experiments

4.1. Data Preprocessing

4.1.1. Fuel Type Prediction. The data for fuel type prediction come from two tables: the vehicle information table and the remote sensing record table. The ID and fuel type fields are extracted from the former, and the latter contains (a) vehicle running conditions (speed, acceleration, passing time, etc.), (b) environmental and meteorological conditions (lane, wind direction, wind speed, temperature, humidity, and atmospheric pressure), and (c) remote sensing results (detected values of carbon dioxide, carbon monoxide, hydrocarbons, and nitrogen oxides and the intensity of smoke). This paper joins the two tables by vehicle ID; the specific process is described as follows.

(1) Data analysis. A preliminary statistical analysis of the number of vehicles with different fuel types is made, and the result (shown in Figure 3) demonstrates that vehicles with fuel types other than gasoline and diesel, such as mixed oil and natural gas, account for a very low proportion. The ratio between gasoline and diesel cars in the data is about 46 1, which means that this is an unbalanced dataset.

(2) Feature layering. For a dataset with an unbalanced distribution, all the data at the tail of the horizontal axis in the distribution graph are merged into one level, so that the data become more concentrated and the number of levels is reduced. A temporary feature is then generated for stratified sampling, dividing the dataset into different sections. This paper divides the vehicle passing time into three periods: morning, afternoon, and evening.

(3) Data cleaning. Handling the categorical (character) data of vehicles and handling missing values are the main components of the data cleaning procedure. Categorical vehicle features include the license plate number, the license plate color, and the test result, and they are usually converted to one-hot codes in machine learning. There are three common methods for dealing with missing values: ignoring the record with the missing value, ignoring the feature with missing values, and padding with the median or mean. If the license plate number, license plate color, or detection result is missing, the record is ignored directly; in other situations, missing continuous-valued parameters are filled with the mean value.
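The cleaning rules above can be sketched as follows. The field names and values are hypothetical, and the mean is computed over the records that survive the first filter, which is an assumption about the order of operations.

```python
def one_hot(value, categories):
    """Encode a categorical value as a one-hot list over the known categories."""
    return [1 if value == c else 0 for c in categories]

def clean(records, required, numeric):
    """Drop records missing a required field; mean-fill missing numeric fields."""
    kept = [dict(r) for r in records if all(r.get(k) is not None for k in required)]
    for field in numeric:
        present = [r[field] for r in kept if r.get(field) is not None]
        mean = sum(present) / len(present) if present else 0.0
        for r in kept:
            if r.get(field) is None:
                r[field] = mean
    return kept

# Hypothetical field names and values.
records = [
    {"plate": "A1", "plate_color": "blue", "speed": 30.0},
    {"plate": "B2", "plate_color": "yellow", "speed": None},
    {"plate": None, "plate_color": "blue", "speed": 50.0},
    {"plate": "C3", "plate_color": "blue", "speed": 40.0},
]
cleaned = clean(records, required=["plate", "plate_color"], numeric=["speed"])
encoded = one_hot(cleaned[1]["plate_color"], ["blue", "yellow"])
```

The record without a plate number is dropped, and the missing speed of the second record is replaced by the mean of the remaining speeds (35.0).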

(4) Feature selection. The main work is to find correlations and feature combinations and to generate a correlation matrix for the original data. The features in the original data most relevant to the fuel type are checked first: features that are positively correlated include license plate color and nitric oxide, while features that show an obvious negative correlation include validity, transit time, and carbon dioxide. Then the features most relevant to the registration date of diesel vehicles are checked: features that show a more obvious positive correlation include license plate color, etc., and attributes that show a more negative correlation include validity, detection line, and nitric oxide. In terms of feature combination, because the plume concentration is affected by the wind speed and the vehicle's own speed during remote sensing, the CO2 concentration is used as a reference when recording exhaust pollutant concentrations. The new combined feature is the ratio of each of the other pollutants to CO2. The amount of data in this study is trillions; after filtering, the number of features is about 30. The decision tree algorithm is used to calculate the feature importance for fuel type prediction, and the results are shown in Figure 4. The feature importance for the gasoline vehicle registration time period, calculated with random forest, is shown in Figure 5, and that for the diesel vehicle registration time period is shown in Figure 6.
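The combined feature, the ratio of each pollutant to CO2, can be sketched in a few lines. The record keys and the handling of a zero CO2 reading are assumptions for illustration, not details given in the paper.

```python
def co2_ratio_features(record, pollutants=("co", "hc", "no")):
    """Add pollutant-to-CO2 ratio features; the keys are hypothetical field names."""
    co2 = record["co2"]
    features = dict(record)
    for name in pollutants:
        # Leave the ratio undefined when the CO2 reference reading is zero.
        features[name + "_per_co2"] = record[name] / co2 if co2 else None
    return features

row = co2_ratio_features({"co2": 12.0, "co": 0.6, "hc": 0.03, "no": 0.012})
empty_plume = co2_ratio_features({"co2": 0.0, "co": 0.6, "hc": 0.03, "no": 0.012})
```

Dividing by the CO2 reading cancels much of the plume dilution caused by wind and vehicle speed, since both the pollutant and the reference gas are diluted together.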

Figure 3: The unbalanced data distribution of vehicles with different fuel types (legend: gasoline-powered vehicle, natural-gas-powered vehicle, diesel-powered vehicle, others; segment values 13.79%, 2.71%, 83.22%, and 0.28%).

(5) Feature scaling. The experiment found that the value ranges of different features differ greatly, which seriously hampers the decision part of the model during learning, since larger values usually cause greater changes in the model. Commonly used scaling methods such as normalization and standardization are introduced in the experiments. The standardization formula is $(x - \mathrm{mean})/\mathrm{num}(v)$, where $x$ denotes the value of a variable, mean is the average value of all the variables, and $\mathrm{num}(v)$ is the total number of variables. The normalization formula can be denoted as $(x - \min)/(\max - \min)$, where min and max are the minimum and maximum values among the variables, respectively. With normalization, all feature values are mapped into the interval [0, 1].
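Both scaling methods can be sketched as below. Note that `standardize` uses the conventional z-score, which divides by the population standard deviation; this is an assumption that differs from the divisor printed in the text. The function names and sample values are illustrative.

```python
def min_max_scale(values):
    """Map values linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Conventional z-score: subtract the mean, divide by the population standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

speeds = [10.0, 20.0, 30.0]
scaled = min_max_scale(speeds)
z_scores = standardize(speeds)
```

In practice the minimum, maximum, mean, and standard deviation would be estimated on the training split only and then reused for the validation split, so that no validation information leaks into training.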

(6) Dataset division. This paper randomly divides the source dataset into two parts: 85% of the data as the training dataset and 15% as the validation dataset.

(7) Model training. The processed data are fed into the three models for training and verification. The results of these models are compared in the experiment part.

4.1.2. Registration Time Prediction. According to the national and local standards for motor vehicles, the vehicle registration time is subdivided as follows. Relevant departments stipulate that gasoline-powered vehicles are divided into two categories, registered before 2001-10-1 and after 2001-10-1, recorded as

$$GY = \begin{cases} 00, & \text{if before 2001-10-1} \\ 01, & \text{if after 2001-10-1} \end{cases} \tag{12}$$

The registration time of diesel-powered vehicles is divided into three periods, i.e., before 2008-7-1, between 2008-7-1 and 2013-7-1, and after 2013-7-1, denoted as

$$FY = \begin{cases} 10, & \text{if before 2008-7-1} \\ 11, & \text{if between 2008-7-1 and 2013-7-1} \\ 01, & \text{if after 2013-7-1} \end{cases} \tag{13}$$

When the period division is done, the prediction of registration time can be treated as a multiclassification problem.
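Equations (12) and (13) translate directly into labelling functions. In this sketch a vehicle registered exactly on a boundary date is assigned to the later period, which is an assumption since the text does not say how boundary dates are handled.

```python
from datetime import date

def gasoline_period(registered):
    """Eq. (12): two classes split at 2001-10-1."""
    return "00" if registered < date(2001, 10, 1) else "01"

def diesel_period(registered):
    """Eq. (13): three classes split at 2008-7-1 and 2013-7-1."""
    if registered < date(2008, 7, 1):
        return "10"
    if registered < date(2013, 7, 1):
        return "11"
    return "01"

labels = [diesel_period(d) for d in (date(2007, 12, 31), date(2010, 1, 1), date(2014, 1, 1))]
```

The resulting codes serve as the class labels for the registration time classifiers.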

The remote sensing record table contains three parts: (a) remote sensing data on vehicle running conditions, including speed, acceleration, and passing time; (b) environmental and meteorological conditions, including lane condition, wind direction, wind speed, temperature, humidity, and atmospheric pressure; and (c) exhaust remote sensing results, which contain the detected values of carbon dioxide, carbon monoxide, hydrocarbons, and nitrogen oxides and the intensity of smoke. The classification by the national and local standards mentioned above introduces a large error, because vehicles registered shortly before and after a segmentation point cannot differ much. Therefore, this paper discards the records of vehicles registered within three months before or after a segmentation point.
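Discarding records near a segmentation point can be sketched as a simple date filter. Approximating three months as 90 days is an assumption, and the function name and example dates are hypothetical.

```python
from datetime import date

def near_boundary(registered, boundaries, margin_days=90):
    """True when the date lies within `margin_days` of any segmentation point."""
    return any(abs((registered - b).days) <= margin_days for b in boundaries)

DIESEL_BOUNDARIES = [date(2008, 7, 1), date(2013, 7, 1)]
dates = [date(2008, 6, 1), date(2010, 1, 1), date(2013, 8, 15)]
kept = [d for d in dates if not near_boundary(d, DIESEL_BOUNDARIES)]
```

Only the vehicle registered in 2010 survives the filter; the other two fall inside the margin around a boundary and would blur the class labels.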

4.2. Parameter Optimization

4.2.1. Decision Tree Tuning. There are mainly three parameters to adjust in the decision tree algorithm.

D_max_depth represents the maximum depth of a decision tree. By the principle of the decision tree, the deeper its layers, the more power a tree has to divide attributes thoroughly and mine the deep relationships in the data. This paper experimentally sets max_depth to range from 1 to 32. The F1 curve is shown in Figure 7.

Figure 4: Predicting the importance of fuel type attributes (horizontal axis: importance; features shown: plate color, accelerated speed, OpacityAvg, speed, vehicle length, real-measured CO2, NO, HC, and CO, OpacityPara, and calculated NO, CO2, HC, and CO).

Figure 5: Importance of features of gasoline vehicle registration time period (horizontal axis: importance; features shown: real-measured CO2, CO, HC, and NO, calculated HC, NO, CO2, and CO, OpacityAvg, Opacity, OpacityMax, OpacityPara, vehicle specific power, and accelerated speed).

Figure 6: Importance of features of diesel vehicle registration period (horizontal axis: importance; features shown: real-measured NO, HC, and CO2, calculated CO2, NO, and HC, vehicle specific power, OpacityAvg, OpacityMax, OpacityPara, and Opacity).


D_min_samples_split denotes the minimum number of samples required to split an internal node. A node needs at least two samples before it can be split. As this minimum increases, more samples must participate in each split of the tree, so the decision tree is subject to more constraints; the amount of data used in node splitting can also affect the execution speed of the model. This paper varies the minimum number of samples for internal nodes from 10 to 500. The F1 curve is shown in Figure 8.

D_min_samples_leaf is the minimum number of samples required at a leaf node. A leaf node is pruned if it contains fewer samples than this minimum. In the experiments, D_min_samples_leaf ranges from 1 to 100, and the F1 curve is plotted in Figure 9.

According to Figures 7–9, the optimal parameters of the decision tree algorithm obtained in the experiments of this paper are as follows: D_max_depth is 12, D_min_samples_split is 150, and D_min_samples_leaf is 10.
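The figure annotations suggest that the three parameters were tuned one after another, each sweep keeping the values already chosen. A minimal sketch of such a one-at-a-time sweep follows. The function name `sweep_one_at_a_time`, the step of 10 for min_samples_split, and the scoring function `toy_f1` are illustrative assumptions; `toy_f1` is a synthetic stand-in that merely peaks at the reported optimum and is not the paper's trained model or data.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def sweep_one_at_a_time(
    grids: List[Tuple[str, Iterable[int]]],
    defaults: Dict[str, int],
    evaluate: Callable[[Dict[str, int]], float],
) -> Dict[str, int]:
    """Tune parameters in order, fixing each at its best value before moving on."""
    best = dict(defaults)
    for name, values in grids:
        scored = []
        for value in values:
            candidate = dict(best)
            candidate[name] = value
            scored.append((evaluate(candidate), value))
        # Highest score wins; on ties the earliest value in the grid is kept.
        best[name] = max(scored, key=lambda sv: sv[0])[1]
    return best


def toy_f1(params: Dict[str, int]) -> float:
    """Synthetic score (assumption): stands in for training a tree and measuring F1."""
    return (
        0.9
        - 0.001 * abs(params["max_depth"] - 12)
        - 0.00001 * abs(params["min_samples_split"] - 150)
        - 0.0001 * abs(params["min_samples_leaf"] - 10)
    )


grids = [
    ("max_depth", range(1, 33)),               # 1 to 32, as in the text
    ("min_samples_split", range(10, 501, 10)),  # 10 to 500; the step is assumed
    ("min_samples_leaf", range(1, 101)),        # 1 to 100
]
defaults = {"max_depth": 32, "min_samples_split": 10, "min_samples_leaf": 1}
tuned = sweep_one_at_a_time(grids, defaults, toy_f1)
print(tuned)
```

In practice `evaluate` would train a classifier with the candidate parameters and return its F1 score on held-out data. A one-at-a-time sweep is much cheaper than a full grid search but can miss interactions between parameters.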

4.2.2. Random Forest Tuning. There are four main parameters to adjust in the random forest model:

n_estimators is the number of decision trees in the random forest and plays a significant role in the performance of the model. A small value of n_estimators means that fewer base classifiers participate in the decision process, which decreases prediction accuracy, while a large number of decision trees adds computational burden and running time. This paper varies n_estimators from 10 to 100 and plots the F1 curve, as shown in Figure 10.

max_features represents the maximum number of features that can be considered when a node of a decision tree is split. Each node considers all $N$ features when max_features = None/Auto and no more than $\log N$ and $\sqrt{N}$ features when max_features = log and max_features = sqrt, respectively, where $N$ is the total number of features. Because the samples in this experiment have few attributes, this paper sets max_features to Auto.

R_max_depth is the maximum depth of the decision trees in the random forest. This paper varies R_max_depth from 1 to 100, and the F1 curve is shown in Figure 11.

R_min_samples_leaf is the minimum number of samples in a leaf node. This paper varies it from 1 to 100, and the F1 curve is shown in Figure 12.

Figure 7: Effect of the maximum depth of a decision tree in target prediction (F1 score as a function of max_depth).

Figure 8: Effect of the minimum number of samples required to split the internal nodes for a decision tree in target prediction (F1 score as a function of min_samples_split, with max_depth = 12).

Figure 9: Effect of the minimum number of samples at a leaf node for a decision tree in target prediction (F1 score as a function of min_samples_leaf, with min_samples_split = 150 and max_depth = 12).

Figure 10: Effect of the number of decision trees in random forest in target prediction (F1 score as a function of n_estimators).
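To make the max_features settings described in this subsection concrete, the sketch below computes how many candidate features a node may examine for a given setting. The function name `features_per_split`, the base-2 logarithm, and the rounding rule (round down, at least one feature) are assumptions for illustration; the text only states the bounds log N and sqrt N.

```python
import math
from typing import Optional


def features_per_split(n_features: int, max_features: Optional[str]) -> int:
    """Number of candidate features examined at each split.

    None   -> all N features
    "sqrt" -> floor(sqrt(N)), at least 1   (rounding rule is an assumption)
    "log2" -> floor(log2(N)), at least 1   (base 2 is an assumption)
    """
    if n_features < 1:
        raise ValueError("n_features must be positive")
    if max_features is None:
        return n_features
    if max_features == "sqrt":
        return max(1, int(math.sqrt(n_features)))
    if max_features == "log2":
        return max(1, int(math.log2(n_features)))
    raise ValueError(f"unsupported max_features setting: {max_features!r}")


# Illustrative feature count only; N = 14 is not asserted to be the paper's value.
for setting in (None, "sqrt", "log2"):
    print(setting, features_per_split(14, setting))
```

With only a dozen or so attributes, the sqrt and log settings would restrict each split to three or four candidates, which is consistent with the stated reason for letting every node consider all features.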

According to the results in Figures 10–12, this paper sets n_estimators to 100, max_features to 5, R_max_depth to 24, and R_min_samples_leaf to 2.

4.2.3. AdaBoost Parameter Setting. AdaBoost is a classic ensemble learning algorithm with a boosting structure, and its default base classifier is a decision tree. The base classifier uses the optimal parameters of the decision tree model found above: max_depth = 12, min_samples_split = 150, and min_samples_leaf = 10. Two significant parameters of AdaBoost are the number of boosting rounds for the base classifiers, Ntune, and the learning rate, Lrate. The model tends to overfit if Ntune is too large and to underfit if it is too small. This paper sets Ntune = 50 and Lrate = 1, which are the default values.
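One boosting round of the standard binary AdaBoost scheme can be sketched as follows: compute the weighted error of a base classifier, derive its vote weight, and re-weight the samples. The function names and the four-sample toy data are illustrative assumptions, the learning rate is taken as 1, and the code is not the authors' implementation.

```python
import math
from typing import List, Sequence


def weighted_error(weights: Sequence[float], predictions: Sequence[int],
                   labels: Sequence[int]) -> float:
    """Sum of the weights of the misclassified samples."""
    return sum(w for w, p, y in zip(weights, predictions, labels) if p != y)


def classifier_weight(error: float) -> float:
    """Vote weight alpha = 0.5 * ln((1 - E) / E); requires 0 < E < 1."""
    return 0.5 * math.log((1.0 - error) / error)


def update_weights(weights: Sequence[float], predictions: Sequence[int],
                   labels: Sequence[int], alpha: float) -> List[float]:
    """Shrink weights of correct samples, grow weights of wrong ones, normalize."""
    raw = [w * math.exp(-alpha if p == y else alpha)
           for w, p, y in zip(weights, predictions, labels)]
    z = sum(raw)  # normalization factor so the new weights sum to 1
    return [r / z for r in raw]


# Toy round: four equally weighted samples, the last one misclassified.
labels = [1, 1, -1, -1]
predictions = [1, 1, -1, 1]
weights = [0.25, 0.25, 0.25, 0.25]

error = weighted_error(weights, predictions, labels)
alpha = classifier_weight(error)
new_weights = update_weights(weights, predictions, labels, alpha)
print(error, alpha, new_weights)
```

After the update the single misclassified sample carries half of the total weight, so the next base classifier is pushed to get it right. Running too many such rounds lets the ensemble chase noisy samples, which is the overfitting risk noted for a large Ntune.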

4.3. Experimental Results and Analysis

4.3.1. Fuel Type Prediction Model. This section uses the decision tree, random forest, and AdaBoost algorithms to predict fuel type. Vehicle fuel type prediction is a typical two-class classification problem. Five classification models are used in the experiment: decision tree, random forest, AdaBoost, a hard voting fusion model, and a soft voting fusion model. After the parameters of every model are optimized, the single classifiers are compared with the fusion models. The whole process is illustrated in Figure 13. More details are given in the following subsections.

In Section 4.2, this paper optimizes the parameters of each single model. Although the single models already perform very well, they differ from one another, and these differences make fusing them worthwhile.
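The difference between the two fusion rules can be illustrated with a short sketch: hard voting takes the majority of the predicted labels, while soft voting averages the predicted class probabilities and picks the largest. The function names, the class encoding (0 and 1), and the probability values are made up for illustration and do not come from the experiments.

```python
from collections import Counter
from typing import List, Sequence


def hard_vote(labels: Sequence[int]) -> int:
    """Majority label; on a tie the label seen first wins."""
    return Counter(labels).most_common(1)[0][0]


def soft_vote(probabilities: Sequence[Sequence[float]]) -> int:
    """Class with the highest mean predicted probability across classifiers."""
    n_classifiers = len(probabilities)
    n_classes = len(probabilities[0])
    mean = [sum(p[k] for p in probabilities) / n_classifiers
            for k in range(n_classes)]
    return max(range(n_classes), key=lambda k: mean[k])


# Hypothetical outputs of three classifiers for one vehicle (classes 0 and 1).
probabilities: List[List[float]] = [
    [0.55, 0.45],  # weakly prefers class 0
    [0.60, 0.40],  # weakly prefers class 0
    [0.05, 0.95],  # strongly prefers class 1
]
labels = [max(range(2), key=lambda k: p[k]) for p in probabilities]

hard_result = hard_vote(labels)
soft_result = soft_vote(probabilities)
print(labels, hard_result, soft_result)
```

Here two lukewarm votes outnumber one confident vote under hard voting, whereas soft voting lets the confident classifier prevail. This is why the two fusion models can give different results on the same base classifiers.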

It can be seen from Table 1 that random forest performs best among the single models. The fusion models obtained by voting also perform very well and achieve higher prediction accuracy than most single models. Overall, this paper obtains good results in predicting the vehicle's fuel type, with the random forest algorithm and the soft voting fusion model giving the best predictions: the F1 value of the random forest is 90.41%, and the F1 value of the soft fusion model is 90.30%. Because the random forest predicts faster and its model is easier to interpret, the random forest prediction model is selected as the final model for fuel type prediction.
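As a consistency check on the reported metrics, F1 can be recomputed from precision and recall as their harmonic mean, F1 = 2PR / (P + R). The sketch below does this for the values in Table 1. The variable names are illustrative, and small differences from the reported F1 are expected because the published precision and recall are already rounded.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (inputs and output in percent)."""
    return 2.0 * precision * recall / (precision + recall)


# (precision %, recall %, reported F1 %) as listed in Table 1.
table1 = {
    "Decision tree": (96.83, 84.01, 89.96),
    "Random forest": (97.32, 84.41, 90.41),
    "AdaBoost": (93.41, 83.21, 88.02),
    "Hard voting": (97.74, 83.58, 90.11),
    "Soft voting": (97.42, 84.16, 90.30),
}

recomputed = {name: f1_score(p, r) for name, (p, r, _) in table1.items()}
for name, (_, _, reported) in table1.items():
    print(f"{name}: recomputed {recomputed[name]:.2f}, reported {reported:.2f}")
```

The recomputed values agree with the reported ones to within rounding, and they make visible that recall (about 84%), not precision (about 97%), is what limits F1 for every model.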

Figure 11: Effect of the maximum depth of decision trees in random forest in target prediction (F1 score as a function of max_depth, with n_estimators = 40).

Figure 12: Effect of the minimum number of samples in a leaf node for target prediction using random forest (F1 score as a function of min_samples_leaf, with max_depth = 24 and n_estimators = 40).

4.3.2. Registration Time Prediction Model: Mixed Fuel Type. Vehicle registration time prediction is divided into registration time prediction for diesel vehicles and for gasoline vehicles. Statistical analysis of the data shows that diesel vehicles were mainly registered between 2009 and 2013 or after 2013; the proportion registered before 2009 is very low. Gasoline vehicles were mainly registered after 2008, accounting for about 90%. The purpose of predicting the registration date and fuel type is to find the limit value that applies to an unknown vehicle, in order to judge whether its emission is qualified. The earlier the registration, the higher the limit value. Predicting a car registered before 2008 as one registered after 2008 is equivalent to selecting a lower limit value, so a vehicle with a qualified emission is easily judged as unqualified, which wastes the resources of car owners and environmental inspection workstations. Based on the above analysis, random forest and XgBoost are used in this section to create a prediction model of registration time.

The classification periods are as follows: gasoline + after 2001-10-1, gasoline + before 2001-10-1, diesel + between 2008-7-1 and 2013-7-1, diesel + before 2008-7-1, and diesel + after 2013-7-1. Since this is a multiclass classification problem, this paper uses random forest and XgBoost to perform the prediction, and the results are shown in Table 2. To reduce the effect of randomness in the learning algorithms, the results report the mean and variance over 10 independent runs.
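The five classes listed above can be derived from a vehicle's fuel type and registration date, for example when labelling training records. The sketch below shows one way to do so. The function name `registration_class` and the handling of the cutoff dates themselves (a vehicle registered exactly on a cutoff date falls in the later period) are assumptions; the text does not say how boundary dates are treated.

```python
from datetime import date

GASOLINE_CUTOFF = date(2001, 10, 1)
DIESEL_CUTOFFS = (date(2008, 7, 1), date(2013, 7, 1))


def registration_class(fuel: str, registered: date) -> str:
    """Map (fuel type, registration date) to one of the five class labels.

    Assumption: a registration on a cutoff date belongs to the later period.
    """
    if fuel == "gasoline":
        if registered < GASOLINE_CUTOFF:
            return "gasoline + before 2001-10-1"
        return "gasoline + after 2001-10-1"
    if fuel == "diesel":
        first, second = DIESEL_CUTOFFS
        if registered < first:
            return "diesel + before 2008-7-1"
        if registered < second:
            return "diesel + between 2008-7-1 and 2013-7-1"
        return "diesel + after 2013-7-1"
    raise ValueError(f"unknown fuel type: {fuel!r}")


print(registration_class("gasoline", date(2010, 5, 20)))
print(registration_class("diesel", date(2011, 3, 2)))
```

Because the emission limit depends on the class, an error in this label translates directly into the wrong limit being applied to the vehicle.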

Figure 13: Architecture diagram of the classification model. The source dataset is preprocessed and divided into a training dataset and a test dataset (17 : 3). Decision tree, random forest, and AdaBoost classifiers are trained with feature selection and parameter tuning and then fused. The results are compared on running time, accuracy rate, precision rate, recall, and F1 value to obtain the final classification model.

Table 1: The performance of multiple models used in the experiment for fuel type prediction.

Model           Accuracy rate (%)   Precision rate (%)   Recall (%)   F1 (%)
Decision tree   97.34               96.83                84.01        89.96
Random forest   97.46               97.32                84.41        90.41
AdaBoost        96.78               93.41                83.21        88.02
Hard voting     97.39               97.74                83.58        90.11
Soft voting     97.43               97.42                84.16        90.30

This paper finds that when the data is divided into five categories, the verification results are unbalanced. Gasoline-powered vehicles registered after 2001 far outnumber the other classes, and the verification accuracy is much lower than the training accuracy. The random forest model is superior to the XgBoost model in terms of both training accuracy and verification accuracy. Its training accuracy reaches 99.0% and its verification accuracy is about 91.7%, which indicates that some overfitting occurs and lowers the verification accuracy.
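The degree of imbalance can be quantified from the five class sizes listed in Table 2. The sketch below computes each class's share and the accuracy a trivial classifier would reach by always predicting the largest class. The variable names are illustrative, and the mapping of the five numbers to specific classes is deliberately left open, since only their order in the table is known.

```python
# Class sizes in the order listed in Table 2 (class labels not asserted here).
class_sizes = [1776631, 3326, 223333, 157129, 6376]

total = sum(class_sizes)
shares = [n / total for n in class_sizes]

# Accuracy of always predicting the largest class.
majority_baseline = max(shares)
# Ratio between the largest and the smallest class.
imbalance_ratio = max(class_sizes) / min(class_sizes)

for n, s in zip(class_sizes, shares):
    print(f"{n:>8d}  {s:.4f}")
print(f"majority baseline accuracy: {majority_baseline:.3f}")
print(f"largest-to-smallest ratio: {imbalance_ratio:.0f}")
```

The largest class alone accounts for roughly 82% of the samples, so overall accuracy is a lenient measure here. Per-class precision and recall would show more directly how the small classes are handled.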

4.3.3. Registration Time Prediction Model: Gasoline Vehicle. The gasoline cars are classified as follows: gasoline + after 2001-10-1 and gasoline + before 2001-10-1. The results are shown in Table 3.

With the random forest model, both the prediction accuracy and the verification accuracy are more than 99%. This is mainly caused by the severely unbalanced data distribution and the insufficient amount of data.

5. Conclusions

Environmental protection has been a hot topic in the academic and industrial communities. This paper focuses on predicting the missing basic information of vehicles from telemetry data in order to monitor vehicle emissions. A variety of data mining methods are adopted to perform predictions based on the vehicle telemetry data provided by an environmental protection agency in a certain city, and they successfully make precise inferences on fuel type and on the registration time of gasoline-powered vehicles. In predicting the registration time of diesel vehicles, the accuracy only reaches about 70%, because the division of registration time is artificially defined and the condition of vehicles varies a lot between users. Further work will be carried out on the basis of more related data and improved algorithms to make more precise predictions of vehicle emission-related information.

Data Availability

The SQL data used to support the findings of this study have not been made available because the data provider is the Municipal Bureau of Ecology and Environment of a certain city and the data is not public.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China (NSFC) under Grants 61772386 and 61862015 and in part by the State Grid Hubei Electric Power Co., Ltd. under Grant SGHBDK00DWJS1800134.

References

[1] S. A. Yashnik, S. P. Denisov, N. M. Danchenko, and Z. R. Ismagilov, "Synergetic effect of Pd addition on catalytic behavior of monolithic platinum-manganese-alumina catalysts for diesel vehicle emission control," Applied Catalysis B: Environmental, vol. 185, no. 15, pp. 322–336, 2016.

[2] E. M. Fujita, D. E. Campbell, B. Zielinska et al., "Comparison of the MOVES2010a, MOBILE6.2, and EMFAC2007 mobile source emission models with on-road traffic tunnel and remote sensing measurements," Journal of the Air & Waste Management Association, vol. 62, no. 10, pp. 1134–1149, 2012.

[3] J. S. Fu, X. Dong, Y. Gao, D. C. Wong, and Y. F. Lam, "Sensitivity and linearity analysis of ozone in East Asia: the effects of domestic emission and intercontinental transport," Journal of the Air & Waste Management Association, vol. 62, no. 9, pp. 1102–1114, 2012.

[4] A. Kfoury, F. Ledoux, C. Roche, G. Delmaire, G. Roussel, and D. Courcot, "PM2.5 source apportionment in a French urban coastal site under steelworks emission influences using constrained non-negative matrix factorization receptor model," Journal of Environmental Sciences, vol. 40, pp. 114–128, 2016.

[5] Y. Cheng, Comparative Study on Sino-US Control Systems of Motor Vehicle Exhaust Pollution, China University of Geosciences, Beijing, China, 2011.

[6] X. Wu, Detection and Standard Revision of Exhaust Pollutants by Acceleration Simulation Mode for In-Use Gasoline Vehicle, Chang'an University, Xi'an, China, 2012.

[7] G. A. Bishop, J. R. Starkey, A. Ihlenfeldt, W. J. Williams, and D. H. Stedman, "IR long-path photometry: a remote sensing tool for automobile emissions," Analytical Chemistry, vol. 61, no. 10, pp. 671A–677A, 1989.

[8] R. D. Stephens and S. H. Cadle, "Remote sensing measurements of carbon monoxide emissions from on-road vehicles," Journal of the Air & Waste Management Association, vol. 41, no. 1, pp. 39–46, 1991.

[9] D. H. Stedman, G. Bishop, and S. McLaren, U.S. Patent No. 5401967, U.S. Patent and Trademark Office, Alexandria, VA, USA, 1995.

Table 2: Training and test accuracy of random forest and XgBoost for registration time prediction with five classes.

Model           Ratio of the five classes                   Training accuracy   Test accuracy
Random forest   1776631 : 3326 : 223333 : 157129 : 6376     0.990 ± 0.005       0.917 ± 0.006
XgBoost         1776631 : 3326 : 223333 : 157129 : 6376     0.983 ± 0.004       0.903 ± 0.005

Table 3: Training and test accuracy of random forest and XgBoost for registration time prediction with three classes.

Model           Ratio of the three classes    Training accuracy   Test accuracy
Random forest   223333 : 157129 : 6376        0.985 ± 0.009       0.679 ± 0.008
XgBoost         223333 : 157129 : 6376        0.987 ± 0.010       0.901 ± 0.006


[10] P. L. Guenther, D. H. Stedman, G. A. Bishop, S. P. Beaton, J. H. Bean, and R. W. Quine, "A hydrocarbon detector for the remote sensing of vehicle exhaust emissions," Review of Scientific Instruments, vol. 66, no. 4, pp. 3024–3029, 1995.

[11] Y. Zhang, D. H. Stedman, G. A. Bishop, P. L. Guenther, and S. P. Beaton, "Worldwide on-road vehicle exhaust emissions study by remote sensing," Environmental Science & Technology, vol. 29, no. 9, pp. 2286–2294, 1995.

[12] P. J. Popp, G. A. Bishop, and D. H. Stedman, "Development of a high-speed ultraviolet spectrometer for remote sensing of mobile source nitric oxide emissions," Journal of the Air & Waste Management Association, vol. 49, no. 12, pp. 1463–1468, 1999.

[13] D. A. Burgard, G. A. Bishop, R. S. Stadtmuller, T. R. Dalton, and D. H. Stedman, "Spectroscopy applied to on-road mobile source emissions," Applied Spectroscopy, vol. 60, no. 5, pp. 135A–148A, 2006.

[14] D. H. Stedman and G. A. Bishop, "Opacity enhancement of the on-road remote sensor for HC, CO, and NO," Final Report prepared for CRC-E56-2, University of Denver, Denver, CO, USA, 2002.

[15] H. Moosmuller, C. Mazzoleni, P. W. Barber, H. D. Kuhns, R. E. Keislar, and J. G. Watson, "On-road measurement of automotive particle emissions by ultraviolet lidar and transmissometer: instrument," Environmental Science & Technology, vol. 37, no. 21, pp. 4971–4978, 2003.

[16] C. Brekke and A. H. S. Solberg, "Oil spill detection by satellite remote sensing," Remote Sensing of Environment, vol. 95, no. 1, pp. 1–13, 2005.

[17] J. L. Jimenez, M. D. Koplow, D. D. Nelson, M. S. Zahniser, and S. E. Schmidt, "Characterization of on-road vehicle NO emissions by a TILDAS remote sensor," Journal of the Air & Waste Management Association, vol. 49, no. 4, pp. 463–470, 1999.

[18] S. Marsland, Machine Learning: An Algorithmic Perspective, CRC Press, Boca Raton, FL, USA, 2005.

[19] R. Barrett, A. Facey, W. Nxumalo et al., "Dynamic traffic diversion in SDN: testbed vs mininet," in Proceedings of the 2017 International Conference on Computing, Networking and Communications (ICNC), pp. 167–171, IEEE, Santa Clara, CA, USA, January 2017.

[20] K. Goniewicz, M. Goniewicz, W. Pawłowski, and P. Fiedor, "Road accident rates: strategies and programmes for improving road traffic safety," European Journal of Trauma and Emergency Surgery, vol. 42, no. 4, pp. 433–438, 2016.

[21] M. Yuan, Y. Wu, and L. Lin, "Fault diagnosis and remaining useful life estimation of aero engine using LSTM neural network," in Proceedings of the 2016 IEEE International Conference on Aircraft Utility Systems (AUS), pp. 135–140, IEEE, Beijing, China, October 2016.

[22] P. Cordellieri, F. Baralla, F. Ferlazzo et al., "Gender effects in young road users on road safety attitudes, behaviors and risk perception," Frontiers in Psychology, vol. 7, p. 1412, 2016.

[23] D. A. Wiegmann and S. A. Shappell, A Human Error Approach to Aviation Accident Analysis: The Human Factors Analysis and Classification System, Routledge, Abingdon, UK, 2017.

[24] J. Ma, J. Zhao, J. Jiang, H. Zhou, and X. Guo, "Locality preserving matching," International Journal of Computer Vision, vol. 127, no. 5, pp. 512–531, 2019.

[25] J. Ma, H. Xu, J. Jiang, X. Mei, and X.-P. Zhang, "DDcGAN: a dual-discriminator conditional generative adversarial network for multi-resolution image fusion," IEEE Transactions on Image Processing, vol. 29, pp. 4980–4995, 2020.

[26] J. Ma, X. Wang, and J. Jiang, "Image superresolution via dense discriminative network," IEEE Transactions on Industrial Electronics, vol. 67, no. 7, pp. 5687–5695, 2020.

[27] J. Ma, J. Wu, J. Zhao, J. Jiang, H. Zhou, and Q. Z. Sheng, "Nonrigid point set registration with robust transformation learning under manifold regularization," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 12, pp. 3584–3597, 2019.

[28] Z. Shao, J. Cai, P. Fu, L. Hu, and T. Liu, "Deep learning-based fusion of Landsat-8 and Sentinel-2 images for a harmonized surface reflectance product," Remote Sensing of Environment, vol. 235, p. 111425, 2019.

[29] Z. Shao, L. Wang, Z. Wang, W. Du, and W. Wu, "Saliency-aware convolution neural network for ship detection in surveillance video," IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 3, pp. 781–794, 2020.

[30] L. Zhou, Z. Wang, Y. Luo, and Z. Xiong, "Separability and compactness network for image recognition and superresolution," IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 11, pp. 3275–3286, 2019.

[31] Z. Wang, P. Yi, K. Jiang et al., "Multi-memory convolutional neural network for video super-resolution," IEEE Transactions on Image Processing, vol. 28, no. 5, pp. 2530–2544, 2019.

[32] P. Yi, Z. Wang, K. Jiang, Z. Shao, and J. Ma, "Multi-temporal ultra dense memory network for video super-resolution," IEEE Transactions on Circuits and Systems for Video Technology, vol. 30, no. 8, p. 1, 2020.

[33] J. Jiang, C. Chen, J. Ma, Z. Wang, Z. Wang, and R. Hu, "SRLSP: a face image super-resolution algorithm using smooth regression with local structure prior," IEEE Transactions on Multimedia, vol. 19, no. 1, pp. 27–40, 2017.

[34] J. Jiang, X. Ma, C. Chen, T. Lu, Z. Wang, and J. Ma, "Single image super-resolution via locally regularized anchored neighborhood regression and nonlocal means," IEEE Transactions on Multimedia, vol. 19, no. 1, pp. 15–26, 2017.

[35] K. Jiang, Z. Wang, P. Yi, G. Wang, T. Lu, and J. Jiang, "Edge-enhanced GAN for remote sensing image superresolution," IEEE Transactions on Geoscience and Remote Sensing, vol. 57, no. 8, pp. 5799–5812, 2019.

[36] J. Xu, M. Saleh, and M. Hatzopoulou, "A machine learning approach capturing the effects of driving behaviour and driver characteristics on trip-level emissions," Atmospheric Environment, vol. 224, p. 117311, 2020.

[37] J. C. Ferreira, J. de Almeida, and A. R. da Silva, "The impact of driving styles on fuel consumption: a data-warehouse-and-data-mining-based discovery process," IEEE Transactions on Intelligent Transportation Systems, vol. 16, no. 5, pp. 2653–2662, 2015.

[38] C. Chen, X. Zhao, Y. Yao et al., "Driver's eco-driving behavior evaluation modeling based on driving events," Journal of Advanced Transportation, vol. 2018, Article ID 9530470, 12 pages, 2018.

[39] J. V. Leme, W. Casaca, M. Colnago, and M. A. Dias, "Towards assessing the electricity demand in Brazil: data-driven analysis and ensemble learning models," Energies, vol. 13, no. 6, p. 1407, 2020.

[40] Y. Y. Song and L. U. Ying, "Decision tree methods: applications for classification and prediction," Shanghai Archives of Psychiatry, vol. 27, no. 2, p. 130, 2015.

[41] T. K. Ho, "The random subspace method for constructing decision forests," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 8, pp. 832–844, 1998.

[42] S. W. Kwok and C. Carter, "Multiple decision trees," Uncertainty in Artificial Intelligence, vol. 9, pp. 327–335, 1990.

[43] R. Kumar and R. Verma, "Classification algorithms for data mining: a survey," International Journal of Innovations in Engineering and Technology (IJIET), vol. 1, no. 2, pp. 7–14, 2012.

[44] T. Chen and C. Guestrin, "XGBoost: a scalable tree boosting system," in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794, San Francisco, CA, USA, August 2016.

[45] J. H. Friedman, "Greedy function approximation: a gradient boosting machine," The Annals of Statistics, vol. 29, no. 5, pp. 1189–1232, 2001.

[46] C. Macdonald and I. Ounis, "Voting for candidates: adapting data fusion techniques for an expert search task," in Proceedings of the 15th ACM International Conference on Information and Knowledge Management, pp. 387–396, Arlington, VA, USA, November 2006.


amount of practical exhaust emission data can be obtained by environmental protection agencies in China. This paper applies data mining technology to these valuable data to extract useful information for vehicle exhaust emission detection. This research has great potential to help environmental protection departments assess unqualified vehicles accurately and to provide a theoretical basis for policymakers.

The first successful vehicle emissions demonstration system was probably an across-road vehicle emissions remote sensing system (VERSS) proposed by Gary Bishop and colleagues at the University of Denver in the late 1980s [7, 8]. The first instrument was a liquid-nitrogen-cooled nondispersive infrared (NDIR) device that could only measure CO and CO2. Over the next two decades, their team continuously refined the system: they added hydrocarbon, H2O, and NO channels to the NDIR system [9, 10], integrated an ultraviolet spectrophotometer and improved it to enhance NO measurement [11, 12], and removed the dependence on liquid nitrogen cooling [13]. The Denver group also designed another commonly used remote sensing device, known as the fuel efficiency automobile test, which provided some of the earliest comments on across-road particulate measurement [14]. Many other sensing systems, typically based on multiple spectrometric approaches, have been proposed for detecting the emissions of passing vehicles [15–17]. More recently, Hager Environmental and Atmospheric Technologies introduced an infrared laser-based VERSS named the Emission Detection and Reporting (EDAR) system, which incorporates several new functions that make it a particularly interesting system for vehicle emission detection.

Important information is buried in the vehicle emission remote sensing data. This paper exploits data mining methods to process the data and obtain valuable knowledge from them. There are three main directions in data mining: improvements of classical data mining algorithms, ensemble learning algorithms, and data mining with deep learning. Improvements on classical algorithms are usually developed for specific application scenarios and take additional information into consideration. Ensemble learning integrates multiple learners within a certain structure and completes learning tasks by constructing and combining different learners. Its general structure can be summarized as follows: first, generate a set of individual learners, and then combine them with some strategy. The combining strategies mainly include the averaging method, the voting method, and the learning method. Bagging and boosting [18] are the most commonly used ensemble learning algorithms, and they improve the accuracy and robustness of prediction models. With the rapid development and popularization of deep learning, it plays an increasingly important role in data learning, supported by big data and high-performance computing. Many traffic engineering studies focus on analyzing relevant data, such as traffic diversion [19], traffic safety monitoring [20], engine diagnosis [21], road safety [22], traffic accidents [23], and remote sensing image processing [24–35], to extract useful information and valuable knowledge. A few works have addressed vehicle emission evaluation with data mining, which is the key subject of this paper. Xu et al. [36] used XgBoost to develop prediction models for CO2eq and PM2.5 emissions at a trip level. In [37], Ferreira et al. applied online analytical processing (OLAP) and knowledge discovery (KD) techniques to deal with the high volume of their dataset and to determine the major factors that influence average fuel consumption, and then classified the drivers involved according to their driving efficiency. Chen et al. [38] proposed a driving-events-based ecodriving behaviour evaluation model, which was shown to be highly accurate (96.72%).

Relevant environmental policies have been introduced in China to define different limit standards based on the vehicle fuel type and registration time. The vehicle license plate number, plate color, speed, acceleration, VSP (vehicle specific power), and so on are captured by the surveillance system when vehicles pass the remote survey stations. At the same time, laser equipment at the stations analyzes the smoke plume generated by the gas emission, from which the exhaust emission value can be calculated. With the fuel type and registration time looked up from the vehicle plate number, the applicable gas emission standard value can be obtained to judge whether the vehicle emission is eligible. However, the registration information of nonlocal vehicles and of some local vehicles is not recorded in the official database due to the limitation of environmental policies, so the fuel type and registration time cannot be provided for vehicle emission detection. According to the National Telemetry Standard in China, relevant departments treat vehicles with missing information as diesel vehicles, which leaves the limit criteria for the emission values of some vehicles unknown, so these vehicles cannot be evaluated. Therefore, precise information on the fuel type and registration time of vehicles is an essential prerequisite for finding vehicles that exceed pollution limits. This paper adopts multiple data mining methods to learn the fuel type and registration information of vehicles from remote sensing data and further uses a cascaded classification framework to make accurate predictions of vehicle emission-related information, providing valuable reference standards for the evaluation of different vehicles.

2. Data Mining Models for Analysis

In this section, detailed descriptions of the models and dataset used in this study are given [39].

2.1. Data Mining Methods

2.1.1. Decision Tree Model. The decision tree model [40] is a commonly used data mining method based on information theory and a greedy-algorithm-like framework, which is proposed for classification or prediction. The model divides the whole dataset into branch-like parts to construct an inverted tree with a root node, internal nodes, and leaf nodes. The nonparametric design enhances the efficiency









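[Equation (1): not recoverable from the extracted text.]

which is defined as

\[ E_m = P\bigl(G_m(x_i) \neq y_i\bigr) = \sum_{i=1}^{n} w_i \, I\bigl(G_m(x_i) \neq y_i\bigr) \tag{2} \]

[Figure 1: decision tree diagram. Internal nodes test vehicle attribute 1, vehicle attribute 2, ..., vehicle attribute N with Y/N branches; leaf nodes give the vehicle fuel type.]

\[ \alpha_m = \frac{1}{2} \ln\left[\frac{1 - E_m}{E_m}\right] \tag{3} \]

[Equations (4) and (5): only the normalization factor $Z_m$ and a sum over $n$ samples survive in the extracted text.]

\[ \cdots \left( \sum_{m=1} \alpha_m G_m(x) \right) \tag{6} \]

\[ \cdots \sum_{k=1} y_k \log p_k(x) \tag{7} \]

\[ p_k(x) = \frac{\exp\bigl(f_k(x)\bigr)}{\sum_{l=1}^{K} \exp\bigl(f_l(x)\bigr)} \tag{8} \]

\[ \left[ \frac{\cdots}{\partial f(x_i)} \right]_{f_k(x) = f_{l,t-1}(x)} \tag{9} \]

[Figure 2: bagging diagram. The training data are drawn by bootstrap sampling into training dataset 1, training dataset 2, ..., training dataset n, and the outputs are combined by voting into the result.]

\[ \cdots \left\{ \sum_{i} \mathrm{lab}(x, i, 1), \ \ldots, \ \sum_{i} \mathrm{lab}(x, i, k) \right\} \tag{10} \]

\[ \cdots \left\{ \frac{\cdots}{n_{cf}}, \ \frac{\sum_{i} p(x, i, 2)}{n_{cf}}, \ \ldots, \ \frac{\cdots}{n_{cf}} \right\} \tag{11} \]

3. Data Description

[Surviving fragments: "... is denoted as $\vec{V} = (m, c, t)$" and "... as $\vec{C}$".]

4. Experiments

[Unlabeled figure values: 1379271; 8322; 028.]

\[ \begin{cases} \cdots \\ 01 & \text{if after 2001-10-1} \end{cases} \tag{12} \]

\[ FY = \begin{cases} 10 & \text{if before 2008-7-1} \\ 11 & \text{if between 2008-7-1 and 2013-7-1} \\ 01 & \text{if after 2013-7-1} \end{cases} \tag{13} \]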






000 002001

Importance

004003 006005 007

Real-measured-NO

Real-measured-HC

Real-measured-CO2

Calculated-CO2


Calculated-NO



Opacity


00 01

Importance


OpacityAvgSpeed

Vehicle length

Real-measured-CO2

Real-measured-NO

Real-measured-HC



Calculated-HC

Calculated-CO

02 03 04 05


000 002

Importance

004 006 008

Real-measured-CO2


Real-measured-NO


OpacityAvg











0

0900

0875

0850

0825

0800

0775

0750

0725

07005 10 15 20

max_depths


f1_s

core

25 30

max_depths


0

089900

089875

089850

089825

089800

089775

089750

089725100 200 300 400

min_samples_split



f1_s

core

500


0

089900

089875

089850

089825

089800

089775

089950

089925



f1_s

core

100



0902

0900

0898

0896

0894



f1_s

core

n_estimators

80 100




N

radicwhen










09

08

07

06

05

200 40 60max_depths


f1_s

core

80 100



0902

0904

0900

0898

0896

0894



_sco

re

80 100








Trainingdataset

Testdataset

Sourcedataset

Dividedinto

17 3

Data preprocessing



AdaBoost algorithm



AdaBoost classifier



Recall

Indicators


Model fusion



comparison

Data preprocessing








5 Conclusions


Data Availability




Acknowledgments


References





























































1113874 1113875 (1)


which is defined as

Em P Gm xi( 1113857neyi( 1113857 1113944n

i1wiI Gm xi( 1113857neyi( 1113857 (2)



Vehicleattribute 1

Vehicleattribute 2

Vehiclefuel type

Vehiclefuel type

Branches

Vehicleattribute N

Y

Y N Y N

N



αm 12ln

1 minus Em( 1113857

Em

1113890 1113891 (3)



Zm

(4)


Zm 1113944n




m1αmGm(x)⎛⎝ ⎞⎠ (6)




k1yklogpk(x) (7)


Pk(x) exp fk(x)( 1113857

1113936Kl1 exp fl(x)( 1113857

(8)



zf xi( 11138571113890 1113891

fk(x)fltminus1(x)

(9)

Trainingdata

Trainingdataset 1



Voting

Result


Trainingdataset 2

Trainingdataset n

Booststrap sampling








lab(x i 1) 1113944i


lab(x i k)⎧⎨

⎩

⎫⎬

⎭ (10)





ncf

1113936ip(x i 2)

ncf


ncf

1113896 1113897

(11)



3 Data Description








rarris denoted as V

rarr (m c t)






as Crarr


4 Experiments









1379271

8322



028








01 if after 2001-10-11113896 (12)


FY

10 if before 2008-7-1

11 if between 2008-7-1 and 2013-7-1

01 if after 2013-7-1

⎧⎪⎪⎨

⎪⎪⎩(13)






000 002001

Importance

004003 006005 007

Real-measured-NO

Real-measured-HC

Real-measured-CO2

Calculated-CO2


Calculated-NO



Opacity


00 01

Importance


OpacityAvgSpeed

Vehicle length

Real-measured-CO2

Real-measured-NO

Real-measured-HC



Calculated-HC

Calculated-CO

02 03 04 05


000 002

Importance

004 006 008

Real-measured-CO2


Real-measured-NO


OpacityAvg











0

0900

0875

0850

0825

0800

0775

0750

0725

07005 10 15 20

max_depths


f1_s

core

25 30

max_depths


0

089900

089875

089850

089825

089800

089775

089750

089725100 200 300 400

min_samples_split



f1_s

core

500


0

089900

089875

089850

089825

089800

089775

089950

089925



f1_s

core

100



0902

0900

0898

0896

0894



f1_s

core

n_estimators

80 100




N

radicwhen










09

08

07

06

05

200 40 60max_depths


f1_s

core

80 100



0902

0904

0900

0898

0896

0894



_sco

re

80 100








Trainingdataset

Testdataset

Sourcedataset

Dividedinto

17 3

Data preprocessing



AdaBoost algorithm



AdaBoost classifier



Recall

Indicators


Model fusion



comparison

Data preprocessing








5 Conclusions


Data Availability




Acknowledgments


References






















































αm 12ln

1 minus Em( 1113857

Em

1113890 1113891 (3)



Zm

(4)


Zm 1113944n




m1αmGm(x)⎛⎝ ⎞⎠ (6)




k1yklogpk(x) (7)


Pk(x) exp fk(x)( 1113857

1113936Kl1 exp fl(x)( 1113857

(8)



zf xi( 11138571113890 1113891

fk(x)fltminus1(x)

(9)

Trainingdata

Trainingdataset 1



Voting

Result


Trainingdataset 2

Trainingdataset n

Booststrap sampling








lab(x i 1) 1113944i


lab(x i k)⎧⎨

⎩

⎫⎬

⎭ (10)





ncf

1113936ip(x i 2)

ncf


ncf

1113896 1113897

(11)



3 Data Description








rarris denoted as V

rarr (m c t)






as Crarr


4 Experiments









1379271

8322



028








01 if after 2001-10-11113896 (12)


FY

10 if before 2008-7-1

11 if between 2008-7-1 and 2013-7-1

01 if after 2013-7-1

⎧⎪⎪⎨

⎪⎪⎩(13)






[Figure: feature importance plots (three panels; horizontal axis: Importance). Feature labels shown: Real-measured-NO, Real-measured-HC, Real-measured-CO2, Calculated-CO2, Calculated-NO, Calculated-HC, Calculated-CO, Opacity, OpacityAvg, Speed, Vehicle length.]











[Figure: parameter-tuning curves. Vertical axis: f1_score; horizontal axes: max_depths, min_samples_split, and n_estimators.]




... $\sqrt{N}$ when ...










[Figure: parameter-tuning curves. Vertical axis: f1_score; horizontal axis of the first panel: max_depths.]








[Figure: experiment workflow. Node labels: Source dataset; Data preprocessing; Divided into 17:3; Training dataset; Test dataset; AdaBoost algorithm; AdaBoost classifier; Indicators (Recall); Model fusion; comparison.]
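The workflow labels indicate that the source dataset is divided 17:3 into a training dataset and a test dataset. The sketch below shows one way to make such a split in plain Python. It assumes that 17:3 means 85% training and 15% test and that the records are shuffled before splitting. Neither the shuffling nor the seed is stated in the source, and `split_17_to_3` is a hypothetical helper name.

```python
import random


def split_17_to_3(records, seed=0):
    """Shuffle the records and split them 17:3 into (training, test) lists."""
    shuffled = list(records)
    # A fixed seed keeps the split reproducible (an assumption, not from the paper).
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * 17 // 20  # 17 parts out of 17 + 3
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    training, test = split_17_to_3(range(100))
    print(len(training), len(test))
```

Holding out the test portion before any model fitting keeps the reported indicators from being inflated by records the classifier has already seen.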








5 Conclusions


Data Availability




Acknowledgments


References


























































