Neurocomputing 240 (2017) 183–190
Optimization enhanced genetic algorithm–support vector regression for the prediction of compound retention indices in gas chromatography

Jun Zhang a, Chun-Hou Zheng a, Yi Xia a, Bing Wang b, Peng Chen c,∗

a School of Electronic Engineering and Automation, Institute of Health Sciences, Anhui University, Hefei, Anhui 230601, China
b School of Electrical and Information Engineering, Anhui University of Technology, Ma'anshan 243032, China
c Institute of Health Sciences, Anhui University, Hefei, Anhui 230601, China
Article info

Article history:
Received 28 December 2015
Revised 15 October 2016
Accepted 22 November 2016
Available online 16 February 2017
Communicated by Prof. D.-S. Huang

Keywords:
Support vector regression
Quantitative structure–retention relationship
Retention indices prediction
Gas chromatography

Abstract
A new method using a genetic algorithm and support vector regression with parameter optimization (GA–SVR–PO) was developed for the prediction of compound retention indices (RI) in gas chromatography. The dataset used in this work consists of 252 compounds extracted from the Molecular Operating Environment (MOE) boiling point database. Molecular descriptors were calculated with the descriptor tools of the MOE software package; after removing redundant descriptors, 151 descriptors were obtained for each compound. A genetic algorithm (GA) was used to select the best subset of molecular descriptors and the best SVR parameters so as to optimize the prediction of compound retention indices. A 10-fold cross-validation method was used to evaluate the prediction performance. We compared the performance of our proposed model with three existing methods: GA coupled with multiple linear regression (GA–MLR), SVR trained on the subset selected by GA–MLR (GA–MLR–SVR), and GA on SVR (GA–SVR). The experimental results demonstrate that the proposed GA–SVR–PO model has better predictive performance than the other existing models, with R² > 0.967 and RMSE = 49.94. The prediction accuracy of the GA–SVR–PO model is 96% at a 10% prediction-variation threshold.
© 2017 Elsevier B.V. All rights reserved.
1. Introduction

Gas chromatography coupled with mass spectrometry (GC–MS) is a powerful analytical platform for the identification and quantification of small molecules in chemistry and biomedical research. A GC–MS system measures the retention time and mass spectrum of each molecule. Currently, the National Institute of Standards and Technology (NIST) MS database (NIST/EPA/NIH Mass Spectral Library) is widely used for molecular identification via the automated mass spectral deconvolution and identification system (AMDIS). AMDIS identifies molecules based on the spectral similarity between the experimental mass spectrum and the mass spectrum recorded in the NIST MS library [1].

Retention time is a measure of the interactions between a molecule and the stationary phase of the GC column. Therefore, the molecular retention time in GC is correlated with the molecular structure. Unfolding this inherent relation between molecular retention time and molecular structure will significantly benefit not only the understanding of gas-phase chemistry, but also molecular identification in metabolomics and other research fields. This is often done by converting the retention time into the retention index (RI). The RI of a molecule is its retention time normalized to the retention times of adjacently eluting n-alkanes, which can be achieved by either an internal or external calibration experiment. While retention times vary with the individual chromatographic system, the derived retention indices are quite independent of chromatographic parameters and allow comparison of values measured by different analytical laboratories under varying conditions. The Kovats RI is used for isothermal experiments [2] and the linear RI is designed for temperature-gradient experiments [3]. However, the current experimental RI data are very limited compared to the mass spectral data recorded in the NIST MS library: only 21,940 molecules have RI information, even though the NIST MS library contains mass spectra for 192,108 molecules. In order to employ the RI as a match factor for metabolite identification, it is necessary to theoretically predict the RI values for the molecules that lack experimental RI information.

∗ Corresponding author.
E-mail address: wwwzhangjun@163.com (P. Chen).
http://dx.doi.org/10.1016/j.neucom.2016.11.070
0925-2312/© 2017 Elsevier B.V. All rights reserved.

The quantitative structure–retention relationship (QSRR) model has been used to estimate molecular RI values according to the
molecular descriptors generated from the chemical structure [4–6]. The success of a QSRR model depends on the accuracy of the input RI data, the selection of appropriate molecular descriptors, and the statistical tools used for retention-index prediction. Most QSRR studies focus on the selection of suitable statistical tools. Methods developed for creating a QSRR model include multiple linear regression (MLR) [7,8], partial least squares (PLS) [9,10], artificial neural networks (ANN) [11–14], radial basis function (RBF) neural networks [15], random forests (RF) [16], and support vector regression (SVR) [17,18].

Little work has been done to investigate the impact of the method used to select molecular descriptors on the performance of retention-index prediction. Hancock et al. compared the prediction performance of multiple data mining techniques and found that GA [19] plus MLR achieved better performance than the others [20]. The optimal descriptors selected by GA–MLR have been employed to train an SVR for retention-index prediction (GA–MLR–SVR) [21]. However, the GA–MLR method only selects molecular descriptors that have a linear correlation with the retention indices; molecular descriptors with a non-linear relationship to the retention index are excluded. On the other hand, the use of SVR requires users to tune the SVR parameters, and determining the optimal SVR parameters usually relies on a time-consuming procedure such as grid search [22]. To address this problem, Ustun et al. used a GA and a simplex optimization to determine the optimal SVR parameters [23], but they did not use the optimization algorithm to select an optimal subset of molecular descriptors. Lin et al. used simulated annealing to select the optimal features and the parameters of a support vector machine (SVM) for classification problems [24], but that work was not developed for regression problems. To our knowledge, the GA has not yet been used to search for the optimal parameters of SVR and the optimal subset of molecular descriptors simultaneously for RI prediction.

To develop a QSRR model that predicts molecular retention indices in gas chromatography more accurately, we present an algorithm combining a genetic algorithm and support vector regression with parameter optimization (GA–SVR–PO). The dataset used in this work was extracted from the Molecular Operating Environment (MOE) boiling point database [25], and the true RI values of the molecules were extracted from the NIST RI08 library. We then analyzed the prediction performance of the proposed GA–SVR–PO method and compared it with three other existing methods: GA–MLR, GA–MLR–SVR, and GA–SVR. The experimental results confirm the effectiveness of the proposed approach.
2. Materials and method
2.1. Experimental RI data
A previous study demonstrated that there is a strong correlation between the boiling point (BP) and the RI of a molecule [26]. Therefore, 252 molecules with BP information in the Molecular Operating Environment (MOE) database are used in this work for QSRR model construction and testing [25]. We first extracted the experimental RI values of these compounds acquired on non-polar columns from the NIST08 RI library [27]. It should be noted that some compounds have multiple RI entries with a very large range of RI values in the current NIST RI library. In order to obtain an accurate experimental RI value for each molecule, two statistical methods were employed to remove outlier values: Grubbs's test [28] was used for molecules with more than 6 RI values in the NIST RI database, while the Q-test [29] was used for molecules with 3–6 RI values. After removing the outliers of each molecule, its mean RI value was used as the true RI value of that molecule. These true RIs were employed to create the prediction model.
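The Grubbs step can be illustrated with a small iterative routine. This is a sketch, not the authors' code: it uses commonly tabulated two-sided critical values at α = 0.05 (the paper does not state its significance level), covers only n = 3 to 10, and omits the Q-test branch.

```python
from statistics import mean, stdev

# Commonly tabulated two-sided Grubbs critical values at alpha = 0.05.
G_CRIT = {3: 1.155, 4: 1.481, 5: 1.715, 6: 1.887, 7: 2.020,
          8: 2.127, 9: 2.215, 10: 2.290}

def grubbs_filter(values):
    """Iteratively drop the most extreme RI entry while the Grubbs
    statistic G = max|x - mean| / s exceeds the tabulated critical value."""
    vals = list(values)
    while len(vals) in G_CRIT:
        m, s = mean(vals), stdev(vals)  # sample standard deviation
        if s == 0:
            break
        # Candidate outlier: the point farthest from the mean
        candidate = max(vals, key=lambda v: abs(v - m))
        if abs(candidate - m) / s > G_CRIT[len(vals)]:
            vals.remove(candidate)
        else:
            break
    return vals
```

For example, a set of seven RI entries clustered near 1120 with one entry at 1540 would have the 1540 entry removed and the mean of the remaining six used as the true RI.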
2.2. Molecular descriptors

In addition to BP, a set of 297 molecular descriptors was calculated using the MOE software. We removed descriptors with zero or nearly constant values. Descriptors with a pairwise correlation coefficient greater than 0.95 were considered redundant; only one descriptor from each redundant group was randomly retained while the rest were removed. Finally, a dataset consisting of 252 molecules, each with 151 molecular descriptors, was created.
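The redundancy filter can be sketched as follows. This illustrative version keeps the first descriptor of each correlated group (the paper selects one at random) and assumes constant descriptors have already been removed; all names are our own.

```python
from statistics import mean, pstdev

def pearson(a, b):
    """Pearson correlation between two equal-length value lists."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return cov / (pstdev(a) * pstdev(b) * len(a))

def drop_redundant(descriptors, threshold=0.95):
    """descriptors: dict of name -> per-molecule values.
    Keep a descriptor only if its |r| with every already-kept
    descriptor is at or below the threshold."""
    kept = []
    for name, col in descriptors.items():
        if all(abs(pearson(col, descriptors[k])) <= threshold for k in kept):
            kept.append(name)
    return kept
```

Applied to the full 297-descriptor set this kind of filter would leave one representative per correlated group, matching the reduction to 151 descriptors described above.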
2.3. Architecture of cross-validation

We employed a 10-fold cross-validation strategy for QSRR model construction and validation. The 252 molecules were randomly divided into 10 groups of almost equal size. In each cross-validation experiment, one group was used as the test set and the remaining 9 groups were used as the training set. This process was repeated 10 times so that every group was selected as the test set once. Because the GA was used to select the optimal subset of molecular descriptors, a validation set must be employed during the QSRR model training step: if all training data were used to train a single regression model, overfitting would be very likely. We therefore further split the training set into three subgroups of almost equal size. One subgroup was used as the GA validation set to measure the performance of the prediction model while the remaining two subgroups were used to train the regression model. Each of the three subgroups was chosen once as the validation set. The mean fitness, defined in Eq. (6), over the validation sets was used as the final fitness of the GA. This 3-fold validation scheme was designed to prevent the GA-selected molecular descriptors from being biased toward a particular group of molecules.
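The nested splitting scheme can be sketched as plain index bookkeeping. This is an illustration rather than the authors' MATLAB code; the seed and the interleaved group assignment are our own assumptions.

```python
import random

def ten_fold_with_inner_splits(n_molecules=252, seed=0):
    """10 outer folds; in each, the 9 training folds are re-split
    into 3 subgroups so each subgroup serves once as the GA
    validation set while the other two train the regression model."""
    rng = random.Random(seed)
    idx = list(range(n_molecules))
    rng.shuffle(idx)
    folds = [idx[i::10] for i in range(10)]  # 10 near-equal groups
    experiments = []
    for k in range(10):
        test = folds[k]
        train = [i for j, f in enumerate(folds) if j != k for i in f]
        subgroups = [train[i::3] for i in range(3)]  # 3 near-equal subgroups
        inner = [(sum((s for j, s in enumerate(subgroups) if j != v), []),
                  subgroups[v])  # (training indices, validation indices)
                 for v in range(3)]
        experiments.append({"test": test, "inner": inner})
    return experiments
```

Every molecule appears in exactly one outer test set, and within each experiment each inner subgroup is used exactly once for validation.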
3. Regression models

Several regression models have been proposed for RI prediction in previous works. Among them, the MLR model is the most popular; another important model is SVR.

3.1. Multiple linear regression model

The MLR model assumes a linear relationship between the RI and the molecular descriptors, using the following linear function:

RI_MLR = c_0 + Σ_{i=1}^{m} c_i x_i    (1)

where c_0 is an adjustable parameter, c_i is the regression coefficient of molecular descriptor x_i, and m is the number of selected molecular descriptors.
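For illustration, Eq. (1) with a single descriptor (m = 1) reduces to ordinary least squares with a closed-form solution; the function names here are our own, not the paper's.

```python
from statistics import mean

def fit_mlr_1d(x, y):
    """Least-squares fit of Eq. (1) with one descriptor:
    RI = c0 + c1 * x (closed-form ordinary least squares)."""
    mx, my = mean(x), mean(y)
    c1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    c0 = my - c1 * mx
    return c0, c1

def predict_mlr(c0, c1, x):
    """Apply the fitted linear function to new descriptor values."""
    return [c0 + c1 * xi for xi in x]
```

With several descriptors the same idea generalizes to solving the normal equations (or calling a least-squares routine) for the coefficient vector (c_0, c_1, ..., c_m).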
3.2. Support vector regression model

The SVR algorithm developed by Vapnik [30] is based on estimating a linear regression function:

f(x) = w · φ(x) + b    (2)

where w and b represent the slope and offset of the regression line, respectively, x is a high-dimensional input vector, and φ is a kernel
Fig. 1. Chromosome design in the GA-SVR-PO model.
function that maps the input space x to a higher- or infinite-dimensional space. f(x) is the linear regression function obtained by minimizing Eq. (3):

(1/2) wᵀw + (1/n) Σ_{i=1}^{n} c(f(x_i), y_i)    (3)

where (1/2) wᵀw is a term characterizing the model complexity, c(f(x_i), y_i) is a loss function, y is the target, and n is the number of samples. Details of the theoretical background of SVR can be found in Refs. [30–32]. In this study, we used the Spider machine learning toolbox to implement the SVR.
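The ingredients of Eqs. (2) and (3) can be sketched in code. This is an illustrative sketch, not the Spider toolbox implementation the paper used: it assumes the common RBF form exp(-γ‖u−v‖²), the ε-insensitive loss, and a direct evaluation of the objective in Eq. (3) for a given model.

```python
import math

def rbf_kernel(u, v, gamma):
    """RBF kernel K(u, v) = exp(-gamma * ||u - v||^2)."""
    d2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * d2)

def eps_insensitive(pred, target, eps):
    """Epsilon-insensitive loss c(f(x), y): zero inside the eps tube."""
    return max(0.0, abs(pred - target) - eps)

def svr_objective(w_norm_sq, preds, targets, eps):
    """Eq. (3): 0.5 * ||w||^2 + (1/n) * sum of per-sample losses."""
    n = len(targets)
    loss = sum(eps_insensitive(p, t, eps) for p, t in zip(preds, targets))
    return 0.5 * w_norm_sq + loss / n
```

Minimizing this objective over w and b (in the kernel-induced feature space) is what an SVR solver such as the Spider toolbox performs internally.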
3.3. Genetic algorithm encoding and parameter setup

In our GA–SVR–PO model, the chromosomes in the GA encode both the selection of molecular descriptors and the parameter set of the SVR. Each chromosome contains one bit per molecular descriptor, so a 151-bit binary string represents whether each molecular descriptor is selected, and the remaining bits represent the parameters of the SVR. The radial basis function (RBF) was chosen as the kernel function for the SVR, and three parameters were tuned: the regularization parameter C, the ε of the ε-insensitive loss function, and the width of the RBF. We used a 60-bit string to represent these 3 parameters. The search ranges of C, ε and the RBF width were [0, 2^20], [2^−1, 2^3] and [2^−1, 2^3], respectively. Each parameter is represented by a 20-bit string mapped to the real value of the corresponding parameter. Fig. 1 shows a sample chromosome of the GA for the GA–SVR–PO model.
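A minimal sketch of this chromosome decoding, assuming the 151 selection bits are followed by three 20-bit genes mapped linearly onto the stated search ranges (the paper does not specify the exact bit-to-real mapping, so the linear scaling here is an assumption):

```python
def bits_to_real(bits, lo, hi):
    """Map a bit list (MSB first) linearly onto the interval [lo, hi]."""
    value = int("".join(map(str, bits)), 2)
    return lo + (hi - lo) * value / (2 ** len(bits) - 1)

def decode_chromosome(chrom):
    """Decode a 211-bit chromosome: 151 descriptor-selection bits,
    then three 20-bit genes for C, epsilon and the RBF width, using
    the paper's ranges [0, 2^20], [2^-1, 2^3] and [2^-1, 2^3]."""
    assert len(chrom) == 151 + 3 * 20
    mask, genes = chrom[:151], chrom[151:]
    C     = bits_to_real(genes[0:20],  0.0,     2 ** 20)
    eps   = bits_to_real(genes[20:40], 2 ** -1, 2 ** 3)
    width = bits_to_real(genes[40:60], 2 ** -1, 2 ** 3)
    selected = [i for i, bit in enumerate(mask) if bit == 1]
    return selected, C, eps, width
```

An all-zero parameter gene decodes to the lower bound of its range and an all-one gene to the upper bound, with 2^20 evenly spaced values in between.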
To compare the performance of our proposed GA–SVR–PO model with the existing GA–MLR and GA–SVR models, the length of a chromosome in those two methods was also set to the number of molecular descriptors. An allele was encoded as one if the corresponding descriptor was included and as zero if it was excluded. The parameters of the SVR used in GA–MLR–SVR and GA–SVR were all optimized by grid search.

The GA toolbox developed by the University of Sheffield, written in MATLAB scripts, was used in this work [33]. The crossover probability of the GA was set to 0.9 and the mutation probability to 0.01. The population size was set to 200 and 200 generations were performed. To obtain a small set of descriptors, Mihaleva et al. proposed to bias the direction of mutation by setting 90% of mutations to flip from 1 to 0 and 10% to flip from 0 to 1 [21]. However, in that approach an important descriptor may never be selected again once its position flips from 1 to 0 early in the evolution. To overcome this problem, we modified the fitness function of the GA as follows:

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² )    (4)

factor = 1 if m = 15, |m − 15| otherwise    (5)

fitness = RMSE / factor    (6)

where y_i is the i-th target (observation) value, n is the size of the validation set, ŷ_i is the i-th predicted value of the regression model, and m is the number of selected molecular descriptors. It is well known that a regression model can easily overfit if too many molecular descriptors are provided [34]; therefore, techniques have been proposed to reduce the number of molecular descriptors [35]. In this work, we set the target number of selected molecular descriptors to 15.
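Eqs. (4)–(6) can be implemented directly. This sketch follows the printed formulas as given, including the division of the RMSE by the descriptor-count factor; the function names are our own.

```python
import math

def rmse(y_true, y_pred):
    """Eq. (4): root-mean-square error over the validation set."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def fitness(y_true, y_pred, m):
    """Eqs. (5)-(6): RMSE divided by the factor, which is 1 when
    exactly m = 15 descriptors are selected and |m - 15| otherwise."""
    factor = 1 if m == 15 else abs(m - 15)
    return rmse(y_true, y_pred) / factor
```

Per Section 2.3, this quantity is computed on each of the three GA validation subgroups and the mean over the three is used as the chromosome's final fitness.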
3.4. The evaluation criteria of regression performance

The coefficient of determination (R²), the correlation coefficient (Q²) and the RMSE on the test set were used as criteria to evaluate the predictive power of the proposed regression model. The Q² and R² on the results of the 10-fold cross-validation are also used to evaluate the final performance of the four models. The RMSE is defined in Eq. (4). Q² and R² are defined as follows:

Q² = 1 − Σ_{i=1}^{n} (ŷ_i − y_i)² / Σ_{i=1}^{n} (y_i − ȳ)²    (7)

R² = Σ_{i=1}^{n} (y_fit,i − ȳ)² / Σ_{i=1}^{n} (y_i − ȳ)²    (8)

where y_i, n and ŷ_i are defined as in Eq. (4), ȳ is the mean of the observed values, and y_fit,i is the fitted value of the i-th target. R² takes any value between 0 and 1, with a value closer to 1 indicating better regression performance. In contrast, Q² has no lower boundary: its value ranges between −∞ and 1. A small difference between Q² and R² indicates that the model has better performance.
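A direct implementation of Eqs. (7) and (8), written as plain functions (the variable names are our own):

```python
def q_squared(y_true, y_pred):
    """Eq. (7): 1 - sum (yhat_i - y_i)^2 / sum (y_i - ybar)^2."""
    ybar = sum(y_true) / len(y_true)
    press = sum((p - t) ** 2 for p, t in zip(y_pred, y_true))
    ss_tot = sum((t - ybar) ** 2 for t in y_true)
    return 1 - press / ss_tot

def r_squared(y_true, y_fit):
    """Eq. (8): sum (yfit_i - ybar)^2 / sum (y_i - ybar)^2."""
    ybar = sum(y_true) / len(y_true)
    ss_fit = sum((f - ybar) ** 2 for f in y_fit)
    ss_tot = sum((t - ybar) ** 2 for t in y_true)
    return ss_fit / ss_tot
```

A perfect prediction gives Q² = R² = 1, while a model whose squared prediction error exceeds the total variance drives Q² below zero, which is why Q² is unbounded from below.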
4. Results and discussion

In this study, we developed a GA–SVR–PO method for the prediction of molecular retention indices in gas chromatography. The performance of our GA–SVR–PO model was compared with that of three other existing models: GA–MLR, GA–MLR–SVR, and GA–SVR. In the GA–MLR model, the altered RMSE of the validation set based on MLR was used as the fitness of the GA to find the optimal subset of molecular descriptors. GA–MLR–SVR is a model that uses the optimal molecular descriptors found by the GA–MLR model to train an SVR. In the third model, GA–SVR, the altered RMSE of the validation set based on SVR was directly used as the fitness of the GA to find the best molecular descriptors. The parameters of the SVR used in the second and third models were all optimized by grid search. In our proposed model, GA–SVR–PO, the GA
Table 1
Predictive performance of each QSRR model.

                      GA–MLR          GA–MLR–SVR      GA–SVR          GA–SVR–PO
RMSE  Training set    53.82 ± 7.68    48.02 ± 9.38    38.27 ± 4.70    32.26 ± 6.88
      Validation set  54.30 ± 7.23    –               43.36 ± 4.42    38.45 ± 5.82
      Test set        65.07 ± 28.87   57.94 ± 27.76   47.31 ± 32.20   41.91 ± 29.50
R²    Training set    0.961 ± 0.012   0.968 ± 0.011   0.979 ± 0.0047  0.985 ± 0.005
      Validation set  0.962 ± 0.012   –               0.975 ± 0.0048  0.980 ± 0.005
      Test set        0.927 ± 0.070   0.936 ± 0.068   0.946 ± 0.080   0.955 ± 0.728
was used not only to optimize the subset of molecular descriptors, but also to optimize the SVR parameters. We employed a 10-fold cross-validation strategy to construct and validate each of the four models. Each experiment was conducted 10 times using different training and test sets, so every RI value was predicted once; the final predicted RI values of the 10 experiments were then combined to evaluate the predictive performance of each QSRR model.

Table 1 shows the performance of each QSRR model. As expected, the training set has the smallest RMSE and the largest R² values in all four models. Compared with the SVR-based models (GA–SVR and GA–SVR–PO), the MLR-based models (GA–MLR and GA–MLR–SVR) have larger RMSE values and smaller R² values. The mean RMSE values on the test set for the MLR-based models are 65.07 and 57.94, with R² values of 0.927 and 0.936, respectively, whereas the mean RMSE values for the SVR-based models are 47.31 and 41.91, with R² values of 0.946 and 0.955, respectively. Furthermore, there is no significant difference between the performance on the test and validation sets for the SVR-based models, while the MLR-based models perform relatively poorly on the test set compared to the validation set.

For all four QSRR models, the standard deviations are small, indicating that all four methods have stable predictive performance. The standard deviations of RMSE and R² for the MLR-based models on the training and validation sets range from 7.23 to 9.38 and 0.012 to 0.070, respectively, while those of the SVR-based methods range from 4.70 to 6.88 and 0.0047 to 0.0050, respectively. This indicates that the SVR-based models are more stable than the MLR-based models. Among all four tested models, the GA–SVR–PO model has the best performance, that is, a small RMSE value and a large R² with smaller standard deviations.

In order to study the correlation between the predicted retention indices and the true retention indices extracted from the NIST RI database, we merged all of the predicted retention indices of the 10 cross-validation experiments and display the results in Fig. 2. The R² of the GA–MLR model is 0.935. Compared to the GA–MLR model, the value of R² is improved by 0.009, 0.022 and 0.0326 for the GA–MLR–SVR, GA–SVR and GA–SVR–PO models, respectively. The absolute difference between the predicted retention indices and the true RI values also improved from 45.93 to 42.72, 18.77 and 13.76 for the four models, respectively.

Fig. 3 shows the residual case order plots of the predicted results from the 10-fold cross-validation experiments. Each data point in Fig. 3 represents a residual, that is, the absolute RI difference between the predicted RI and the true RI extracted from the NIST RI database. Each line represents a 95% confidence interval; a line intercepting the zero line means that the predicted RI value is within a 5% variation of the true RI value. It can be seen that the residuals of the four QSRR models are all randomly distributed on the two sides of the zero lines, which indicates that these models describe the data well. The diamond points highlighted in blue have prediction variations larger than 5%. There are 14 molecules with a prediction variation larger than 5% in the GA–MLR model, while there are 11, 5 and 6 such molecules in the GA–MLR–SVR, GA–SVR and GA–SVR–PO models, respectively. This indicates that the SVR-based models perform better than the MLR-based models.
Fig. 4 displays the box plot of the predictive performance of each model on the test set in the 10-fold cross-validation experiments. It can be seen that the RMSE value decreases and the R² value increases in the order GA–MLR, GA–MLR–SVR, GA–SVR and GA–SVR–PO. Because of the randomness of the GA, the predictive performance of each QSRR model shows some variation. The SVR-based models have smaller variations than the MLR-based models, which means that the GA–SVR-based models have more stable prediction performance than the MLR-based models. Between the two SVR-based models, the performance of the GA–SVR–PO model is more stable than that of the GA–SVR model.

Each cross in Fig. 4 indicates that the predicted RI values of the test set in the corresponding cross-validation experiment differ considerably from the true RI values of the molecules in the test set, so the results of that cross-validation experiment may be treated as outliers of the 10-fold cross-validation experiments. Two cross-validation experiments were detected as outliers in the GA–SVR–PO model. We manually examined the prediction results of these two cross-validation experiments and found that two compounds, acetal (CAS number 50-78-2) and methylaldehyde (CAS number 50-00-0), had a remarkable difference between the predicted RI value and the true RI value. The true RI values for these two compounds are 1285 and 260, respectively, and the absolute prediction errors for acetal and methylaldehyde are 565 and −288, respectively. Further examination of the true RI values of the entire dataset indicates that such large prediction errors were caused by the small size of the dataset we used: there are not enough molecules in the training set with RI values similar to those of these two compounds. After manually removing these two compounds, no cross-validation experiment was detected as an outlier. Therefore, we believe that the lack of other similar compounds in the training set leads to a large predictive difference for these two compounds. For the same reason, the standard deviations of RMSE and R² on the test set of the SVR-based models listed in Table 1 are larger than those of the MLR-based models.
The prediction variation threshold η_pred is a critical parameter for the evaluation of model performance. η_pred is defined as the relative deviation of the predicted RI value from the true RI value and can be calculated as follows:

η_pred = |y_pred − y| / y    (9)

where y_pred is the predicted retention index and y is the true RI value. Fig. 5 shows the overall prediction performance of the four models on the test set. It can be seen that the GA–SVR–PO model has better predictive performance than the other three models. With η_pred = 10%, the prediction accuracies of the four models GA–MLR, GA–MLR–SVR, GA–SVR and GA–SVR–PO are 80.16%, 88.10%, 93.65% and 96.03%, respectively.
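The accuracy-at-threshold quantity behind Fig. 5 can be computed from Eq. (9) as follows. This is an illustrative sketch with names of our own choosing:

```python
def prediction_accuracy(y_true, y_pred, eta=0.10):
    """Fraction of molecules whose relative deviation
    |y_pred - y| / y (Eq. (9)) is within the threshold eta."""
    hits = sum(1 for t, p in zip(y_true, y_pred)
               if abs(p - t) / t <= eta)
    return hits / len(y_true)
```

Sweeping eta from 0 upward and plotting the returned fraction reproduces the kind of curve shown in Fig. 5 for each model.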
Table 2 lists the combined predictive results of all four models. Overall, the Q² and R² values agree with each other. The large values of Q² and R² indicate that all four QSRR models have good
Fig. 2. Predictive plots of the four models.
Table 2
The final predictive results of Q² and R².

QSRR model    Q²      R²
GA–MLR        0.9318  0.9351
GA–MLR–SVR    0.9413  0.9441
GA–SVR        0.9562  0.9571
GA–SVR–PO     0.9671  0.9676
predictive performance. The SVR-based models perform better than the MLR-based models, and the GA–SVR–PO model performed the best, with Q² and R² values reaching 0.967 and 0.968, respectively.

Table 3 is a partial list of the molecular descriptors selected by the three QSRR models, based on 20 experiments. The GA–MLR–SVR model uses the molecular descriptors selected by the GA–MLR model, that is, these two models use the same set of molecular descriptors. In this table, the linear correlation refers to Pearson's correlation between a molecular descriptor and the true RI value. The frequency is the number of experiments in which a molecular descriptor was selected by the model during the 10-fold cross-validation experiments. BP has the highest linear correlation with the retention index (0.94), followed by molecular weight (0.902), the Wiener polarity number (0.88) and the number of carbon atoms (0.846). In most cases, the frequency with which a molecular descriptor is selected by the SVR-based models differs from that of the MLR-based models, and the SVR-based models show a strong capability for selecting crucial descriptors.

Despite the random nature of the GA, there is still some consistency between the three models. BP, vsurf_ID6 and opr_brigid are the descriptors selected the most times by the three models. The linear correlation coefficients of these three
Fig. 3. The residual case order plot of the four models.
Fig. 4. The box plot for 20 experiments.
Table 3
A partial list of molecular descriptors selected by each QSRR model and linear correlation coefficient.

Name          Description                                                    Correlation  Selection frequency (GA–MLR / GA–SVR / GA–SVR–PO)
BP            Boiling point                                                  0.94    10 / 10 / 10
vsurf_ID6     Hydrophobic integy moment                                      −0.096  8 / 7 / 9
opr_brigid    Number of rigid bonds                                          0.56    6 / 8 / 7
Weight        Molecular weight                                               0.902   5 / 2 / 3
a_acc         Number of hydrogen bond acceptor atoms                         0.103   4 / 1 / 3
PEOE_PC-      Total negative partial charge                                  −0.684  3 / 2 / 1
PEOE_RPC-     Relative negative partial charge                               −0.357  3 / 2 / 1
E_vdw         van der Waals component of the potential energy                0.796   3 / 2 / 1
SlogP_VSA9    Subdivided surface areas                                       −0.236  3 / 1 / 4
PC+           Total positive partial charge                                  0.494   3 / 1 / 1
Q_VSA_FPOS    Fractional positive van der Waals surface area                 −0.269  3 / 1 / 0
E_nb          Potential energy with all bonded terms disabled                0.625   3 / 1 / 0
PEOE_VSA_NEG  Total negative van der Waals surface area                      0.689   3 / 0 / 3
vsurf_DW13    Contact distances of vsurf_EWmin                               −0.09   3 / 0 / 1
KierA1        First alpha modified shape index                               0.613   2 / 7 / 6
vsurf_W2      Hydrophilic volume                                             0.509   2 / 4 / 0
chi1v_C       Carbon valence connectivity index                              0.733   2 / 0 / 3
Kier3         Third kappa shape index                                        0.259   1 / 3 / 2
b_1rotN       Number of rotatable single bonds                               0.50    1 / 3 / 1
vsurf_CW5     Capacity factor                                                −0.049  1 / 3 / 0
AM1_dipole    Dipole moment calculated using the AM1 Hamiltonian             −0.029  1 / 2 / 3
E_ele         Electrostatic component of the potential energy                −0.19   1 / 2 / 1
DASA          Absolute value of the difference between ASA+ and ASA−        0.053   1 / 2 / 1
weinerPol     Wiener polarity number                                         0.88    1 / 1 / 3
PEOE_VSA-1    Sum of vi where qi is in the range (−0.10, −0.05)              0.457   1 / 1 / 3
E_strain      Current energy minus the energy at a near local minimum        0.6     1 / 1 / 2
pmiY          y component of the principal moment of inertia                 0.591   1 / 1 / 2
vsurf_ID1     Hydrophobic integy moment                                      −0.074  1 / 1 / 2
MNDO_IP       Ionization potential (kcal/mol)                                −0.321  1 / 0 / 4
E_stb         Bond stretch-bend cross-term potential energy                  0.1     1 / 0 / 3
PC-           Total negative partial charge                                  −0.494  1 / 0 / 2
E_oop         Out-of-plane potential energy                                  0.126   1 / 0 / 2
a_nC          Number of carbon atoms                                         0.846   0 / 3 / 3
a_ICM         Atom information content (mean)                                0.087   0 / 3 / 2
Fig. 5. The fraction of molecules with correctly predicted retention indices vs. the threshold of prediction variation on the test set.
molecular descriptors with the true RI value are 0.94, −0.096 and 0.56, respectively. It should be noted that some molecular descriptors with a strong linear correlation with the true RI value do not have a high selection frequency. For example, the correlation coefficient between the Wiener polarity number and the true RI value is 0.88, but it was selected only 1, 1 and 3 times during the 10-fold cross-validation experiments in the GA–MLR, GA–SVR and GA–SVR–PO models, respectively. Another example is the number of carbon atoms, which was selected only 0, 3 and 3 times, respectively. This phenomenon is mainly caused by the random nature of the GA. Another reason is that all regression models select the molecular descriptors based on the overall prediction accuracy: some molecular descriptors with low linear correlation coefficients have a high selection frequency because their contribution to the RI prediction compensates for that of the other selected descriptors and therefore yields high prediction accuracy on the molecules present in the training and validation sets.
5. Conclusions

In this study, we developed a genetic algorithm and support vector regression with parameter optimization model (GA–SVR–PO) for the prediction of molecular retention indices in gas chromatography. The performance of our proposed GA–SVR–PO model was compared with the performance of three other existing models: GA–MLR, GA–MLR–SVR, and GA–SVR. Our analyses show that while the MLR-based models can achieve acceptable performance, the SVR-based models improve upon it. The SVR-based models also have more stable performance than the MLR-based models. Of all four models, our proposed GA–SVR–PO model achieved the best predictive performance, with R² > 0.96 and RMSE = 49.94. The prediction accuracy of the GA–SVR–PO model is 96% at a 10% prediction-variation threshold.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under grant nos. 61271098, 61672035, 61300058,
61472282 and 61032007, the Provincial Natural Science Research Program of Higher Education Institutions of Anhui Province under grant no. KJ2012A005, and the Anhui Provincial Natural Science Foundation under grant no. 1508085MF129.
Jun Zhang was born in Anhui Province, China, in 1971. He received the M.S. degree in Pattern Recognition and Intelligent Systems in 2004 from the Institute of Intelligent Machines, Chinese Academy of Sciences, and the Ph.D. degree from the University of Science and Technology of China, Hefei, China, in 2007. He is currently an associate professor in the School of Electrical Engineering and Automation, Anhui University, China. His research interests focus on deep learning, ensemble learning and cheminformatics.