Neurocomputing 240 (2017) 183–190
Contents lists available at ScienceDirect. Journal homepage: www.elsevier.com/locate/neucom

Optimization enhanced genetic algorithm–support vector regression for the prediction of compound retention indices in gas chromatography

Jun Zhang a, Chun-Hou Zheng a, Yi Xia a, Bing Wang b, Peng Chen c,*

a School of Electronic Engineering and Automation, Institute of Health Sciences, Anhui University, Hefei, Anhui 230601, China
b School of Electrical and Information Engineering, Anhui University of Technology, Ma'anshan 243032, China
c Institute of Health Sciences, Anhui University, Hefei, Anhui 230601, China

Article history: Received 28 December 2015; Revised 15 October 2016; Accepted 22 November 2016; Available online 16 February 2017. Communicated by Prof. D.-S. Huang.

Keywords: Support vector regression; Quantitative structure–retention relationship; Retention indices prediction; Gas chromatography

Abstract

A new method using a genetic algorithm and support vector regression with parameter optimization (GA–SVR–PO) was developed for the prediction of compound retention indices (RI) in gas chromatography. The dataset used in this work consists of 252 compounds extracted from the Molecular Operating Environment (MOE) boiling point database. Molecular descriptors were calculated with the descriptor tools of the MOE software package. After removing redundant descriptors, 151 descriptors were obtained for each compound. A genetic algorithm (GA) was used to select the best subset of molecular descriptors and the best parameters of SVR to optimize the prediction performance of compound retention indices. A 10-fold cross-validation method was used to evaluate the prediction performance. We compared the performance of our proposed model with three existing methods: GA coupled with multiple linear regression (GA–MLR), the subset selected by GA–MLR used to train SVR (GA–MLR–SVR), and GA on SVR (GA–SVR). The experimental results demonstrate that our proposed GA–SVR–PO model has better predictive performance than the other existing models, with R² > 0.967 and RMSE = 49.94. The prediction accuracy of the GA–SVR–PO model is 96% at 10% prediction variation.

© 2017 Elsevier B.V. All rights reserved.

1. Introduction

Gas chromatography coupled with mass spectrometry (GC–MS) is a powerful analytical platform for the identification and quantification of small molecules in chemistry and biomedical research. A GC–MS system measures the retention time and mass spectrum of each molecule. Currently, the National Institute of Standards and Technology (NIST) MS database (NIST/EPA/NIH Mass Spectral Library) is widely used for molecular identification with the automated mass spectral deconvolution and identification system (AMDIS). AMDIS identifies molecules based on the spectral similarity between the experimental mass spectrum and the mass spectrum recorded in the NIST MS library [1].

Retention time is a measure of the interactions between a molecule and the stationary phase of the GC column. Therefore, the molecular retention time in GC is correlated with the molecular structure. Unfolding this inherent relation between the molecular retention time and the molecular structure will significantly benefit not only the understanding of gas phase chemistry, but also molecular identification in metabolomics and other research fields. This is often done by converting the retention time into the retention index (RI). The RI of a molecule is its retention time normalized to the retention times of adjacently eluting n-alkanes, which can be obtained by either an internal or external calibration experiment. While retention times vary with the individual chromatographic system, the derived retention indices are largely independent of chromatographic parameters and allow comparison of values measured by different analytical laboratories under varying conditions. The Kovats RI is used for isothermal experiments [2] and the linear RI is designed for temperature gradient experiments [3]. However, the current experimental RI data are very limited compared to the mass spectral data recorded in the NIST MS library: only 21,940 molecules have RI information even though the NIST MS library contains mass spectra for 192,108 molecules. In order to employ RI as a match factor for metabolite identification, it is necessary to theoretically predict the molecular RI values for the molecules that do not have experimental RI information.

* Corresponding author. E-mail address: [email protected] (P. Chen).
http://dx.doi.org/10.1016/j.neucom.2016.11.070
0925-2312/© 2017 Elsevier B.V. All rights reserved.

A quantitative structure–retention relationship (QSRR) model has been used to estimate the molecular RI values according to the




molecular descriptors generated from the chemical structure [4–6]. The success of a QSRR model depends on the accuracy of input RI data, the selection of appropriate molecular descriptors, and the statistical tools for retention indices prediction. Most QSRR studies focus on the selection of suitable statistical tools. The developed methods for creating a QSRR model include multiple linear regression (MLR) [7,8], partial least squares (PLS) [9,10], artificial neural network (ANN) [11–14], radial basis function (RBF) neural network [15], random forest (RF) [16], and support vector regression (SVR) [17,18].

Less work has been done to investigate the impact of the selection method for the molecular descriptors on the performance of retention indices prediction. Hancock et al. compared the prediction performance of multiple data mining techniques and found that GA [19] plus MLR achieved better performance than the others [20]. The optimal descriptors selected by GA–MLR have been employed to train SVR for retention indices prediction (GA–MLR–SVR) [21]. However, the GA–MLR method only selects molecular descriptors that have a linear correlation with the retention indices; molecular descriptors having a non-linear relationship with the retention index are excluded. On the other hand, the use of SVR requires users to tune the SVR parameters, and determining the optimal SVR parameters usually relies on a time-consuming method such as grid search [22]. To address this problem, Ustun et al. used GA and a simplex optimization to determine the optimal SVR parameters [23], but they did not use the optimization algorithm to select the optimal subset of molecular descriptors. Lin et al. used the simulated annealing algorithm to select the optimal features and the parameters of a support vector machine (SVM) for classification problems [24], but that work was not developed for regression problems. To our knowledge, the GA has not yet been used to search for the optimal parameters of SVR and the optimal subset of molecular descriptors simultaneously for RI prediction.

To develop a QSRR model that can predict molecular retention indices in gas chromatography more accurately, we present an algorithm combining a genetic algorithm and support vector regression with a parameter optimization method (GA–SVR–PO). The dataset used in this work was extracted from the Molecular Operating Environment (MOE) boiling point database [25], and the true RI values of the molecules were extracted from the NIST RI08 library, followed by analysis of the prediction performance of the proposed GA–SVR–PO method. The performance of the GA–SVR–PO method was compared with three existing methods: GA–MLR, GA–MLR–SVR, and GA–SVR. The experimental results confirm the effectiveness of our proposed approach.

2. Materials and method

2.1. Experimental RI data

A previous study demonstrated that there is a strong correlation between the boiling point (BP) and the RI of a molecule [26]. Therefore, 252 molecules with BP information in the Molecular Operating Environment (MOE) database are used as our research subject for QSRR model construction and testing in this work [25]. We first extracted the experimental RI values of these compounds acquired on non-polar columns from the NIST08 RI library [27]. It should be noted that some compounds have multiple RI entries with a very large range of RI values in the current NIST RI library. In order to obtain an accurate experimental RI value for each molecule, two statistical methods were employed to remove the outlier values. Grubbs's test [28] was used to remove the outliers for molecules with more than 6 RI values in the NIST RI database, while the Q-test [29] was used to remove the outliers for molecules with 3–6 RI values. After removing the outliers of each molecule, the mean RI value of each molecule was used as the true RI value of that molecule. These true RIs are employed to create the prediction model.
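The Q-test step of this outlier-removal procedure can be sketched as follows. This is a minimal illustration using the standard tabulated 95%-confidence Dixon Q critical values, not code from the paper; the Grubbs's test used for molecules with more than 6 entries would be handled analogously.

```python
# Dixon's Q-test for one suspected outlier among 3-6 replicate RI values.
# Standard critical values of Q at the 95% confidence level.
Q_CRIT_95 = {3: 0.970, 4: 0.829, 5: 0.710, 6: 0.625}

def q_test_clean(values):
    """Return the sorted values, with the extreme point removed if its
    gap-to-range ratio Q exceeds the tabulated critical value."""
    vals = sorted(values)
    n = len(vals)
    spread = vals[-1] - vals[0]
    if n not in Q_CRIT_95 or spread == 0:
        return vals  # outside the 3-6 range (or all equal): leave unchanged
    gap_low = vals[1] - vals[0]    # gap isolating the smallest value
    gap_high = vals[-1] - vals[-2]  # gap isolating the largest value
    if gap_low >= gap_high:
        q, drop = gap_low / spread, 0
    else:
        q, drop = gap_high / spread, -1
    if q > Q_CRIT_95[n]:
        vals.pop(drop)
    return vals
```

For example, `q_test_clean([1000, 1005, 1010, 1400])` rejects 1400 (Q = 0.975 > 0.829), after which the mean of the remaining entries would serve as the true RI.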

2.2. Molecular descriptors

In addition to BP, a set of 297 molecular descriptors was calculated using the MOE software. We removed the descriptors with zero or nearly constant values. Descriptors with a correlation coefficient greater than 0.95 were considered redundant: only one descriptor of each redundant group was randomly selected, while the rest were removed. Finally, a dataset consisting of 252 molecules, each with 151 molecular descriptors, was created.
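The redundancy filter described above can be sketched in a few lines. This assumes near-constant descriptors have already been removed (a constant column makes the correlation undefined), and it keeps the first descriptor of each correlated group rather than a random one; the function name is illustrative.

```python
import numpy as np

def drop_redundant(X, names, threshold=0.95):
    """Keep one descriptor from each correlated group.

    X: (n_molecules, n_descriptors) array; names: descriptor names.
    A descriptor whose |Pearson r| with an already-kept descriptor
    exceeds the threshold is discarded, mirroring the 0.95 cut-off.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] <= threshold for k in kept):
            kept.append(j)
    return X[:, kept], [names[j] for j in kept]
```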

2.3. Architecture of cross-validation

We employed a 10-fold cross-validation strategy for QSRR model construction and validation. The 252 molecules were randomly divided into 10 groups of almost equal size. In each cross-validation experiment, one group was used as the test set and the remaining 9 groups were used as the training set. This process was repeated 10 times so that every group was selected as the test set once. Because GA was used to select the optimal subset of molecular descriptors, a validation set must be employed during the QSRR model training step: if all training data were used to train a single regression model, overfitting would be very likely. We therefore further split the training set into three subgroups of almost equal size. One subgroup was used as the GA validation set to measure the performance of the prediction model, while the remaining two subgroups were used to train the regression model. Each of the three subgroups was chosen once as the validation set, and the mean fitness, defined in Eq. (6), over the validation sets was used as the final fitness of GA. This 3-fold validation scheme was designed to prevent the GA-selected molecular descriptors from being biased toward a special group of molecules.
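The nested splitting scheme above can be sketched as follows; function and variable names are illustrative, and the paper's actual splits were random partitions of the 252 molecules.

```python
import random

def nested_cv_splits(n_molecules, outer_k=10, inner_k=3, seed=0):
    """Yield (train, val, test) index triples: 10 outer folds for
    testing; within each outer training set, 3 inner folds where one
    subgroup serves as the GA validation set (the GA fitness is then
    averaged over the three inner folds)."""
    rng = random.Random(seed)
    idx = list(range(n_molecules))
    rng.shuffle(idx)
    outer = [idx[i::outer_k] for i in range(outer_k)]  # ~equal-size groups
    for t in range(outer_k):
        test = outer[t]
        pool = [i for g in range(outer_k) if g != t for i in outer[g]]
        inner = [pool[j::inner_k] for j in range(inner_k)]
        for v in range(inner_k):
            val = inner[v]
            train = [i for g in range(inner_k) if g != v for i in inner[g]]
            yield train, val, test
```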

3. Regression models

Several regression models have been proposed for RI prediction in previous works. Among them, the MLR model is the most popular one. Another important model is SVR.

3.1. Multiple linear regression model

The linear relationship between RI and the molecular descriptors is modeled in the MLR model using the following linear function:

RI_MLR = c_0 + Σ_{i=1}^{m} c_i x_i    (1)

where c_0 is an adjustable parameter, c_i is the regression coefficient of molecular descriptor x_i, and m is the number of selected molecular descriptors.

3.2. Support vector regression model

The SVR algorithm developed by Vapnik [30] is based on estimating a linear regression function:

f(x) = w · ϕ(x) + b    (2)

where w and b represent the slope and offset of the regression line, respectively, x is a high-dimensional input space, and ϕ is a kernel


Fig. 1. Chromosome design in the GA-SVR-PO model.


function that can map the input space x to a higher- or infinite-dimensional space. f(x) is the linear regression function that can be calculated by minimizing Eq. (3):

(1/2) wᵀw + (1/n) Σ_{i=1}^{n} c(f(x_i), y_i)    (3)

where (1/2) wᵀw is a term characterizing the model complexity, c(f(x_i), y_i) is a loss function, y is the target, and n is the number of samples. The details of the theoretical background of SVR can be found in Refs. [30–32]. In this study, we used the spider machine learning toolbox to implement the SVR.
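As a sketch of the quantity being minimized, the following evaluates the primal objective of Eq. (3) for a linear kernel with the ε-insensitive loss. Weighting the loss term by the regularization constant C follows the usual SVR formulation and is an assumption here, since Eq. (3) writes the loss generically; the paper itself trains SVR with the MATLAB spider toolbox.

```python
import numpy as np

def svr_objective(w, b, X, y, C, eps):
    """Primal SVR objective for a linear kernel:
    0.5*w'w + C * mean(max(0, |f(x_i) - y_i| - eps)),
    i.e. Eq. (3) with the epsilon-insensitive loss plugged in."""
    resid = np.abs(X @ w + b - y)          # |f(x_i) - y_i|
    loss = np.maximum(0.0, resid - eps)    # zero inside the eps-tube
    return 0.5 * w @ w + C * loss.mean()
```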

3.3. Genetic algorithm encoding and parameters setup

In our GA–SVR–PO model, each chromosome in the GA encodes both the selection of molecular descriptors and the parameter set of SVR. The first part of the chromosome is a 151-bit binary string, one bit per molecular descriptor, indicating whether that descriptor is selected; the remaining bits represent the parameters of SVR. The radial basis function (RBF) was chosen as the kernel function for SVR, where three parameters were tuned: the regularization parameter C, the ɛ of the ɛ-insensitive loss function, and the width of the RBF. We used a 60-bit string to represent these 3 parameters. The search ranges of the parameter C, the ɛ of the ɛ-insensitive loss function, and the width of the RBF were [0, 2^20], [2^−1, 2^3] and [2^−1, 2^3], respectively. Each parameter is represented by a 20-bit string mapped to the real value of the corresponding parameter. Fig. 1 shows a sample chromosome of the GA diagram for the GA–SVR–PO model.

To compare the performance of our proposed GA–SVR–PO model with the existing models GA–MLR and GA–SVR, the length of a chromosome in binary form was also set to the number of molecular descriptors in these two methods. An allele was encoded as one if the corresponding descriptor was included and as zero if it was excluded. The parameters of SVR used in GA–MLR–SVR and GA–SVR were all optimized by grid search.

The GA toolbox developed by the University of Sheffield, written in MATLAB scripts, was used in this work [33]. The crossover probability of the GA was set to 0.9 and the mutation probability to 0.01. The population size was set to 200 and 200 generations were performed. To obtain a small set of descriptors, Mihaleva et al. proposed to alter the chance of a mutation direction by setting 90% of mutations to flip from 1 to 0 and 10% of mutations to flip from 0 to 1 [21]. However, in this approach an important descriptor may never be selected again once its position flips from 1 to 0 at the beginning of the evolution. To overcome this problem, we modified the fitness function of the GA as follows:

RMSE = sqrt( Σ_{i=1}^{n} (y_i − ŷ_i)² / n )    (4)

factor = 1 if m = 15; factor = |m − 15| otherwise    (5)

fitness = RMSE · factor    (6)

where y_i is the ith target (observation) value, n is the size of the validation set, ŷ_i is the ith prediction value of a regression model, and m is the number of selected molecular descriptors. It is well known that a regression model can easily overfit if too many molecular descriptors are provided [34]. Therefore, techniques have been proposed to reduce the number of molecular descriptors [35]. In this work, we set the target number of selected molecular descriptors to 15.

3.4. The evaluation criteria of regression performance

The coefficient of determination (R²), the correlation coefficient (Q²) and the RMSE on the test set were used as criteria to evaluate the predictive power of the proposed regression model. The Q² and R² on the results of the 10-fold cross-validation are also used to evaluate the final performance of the four models. The RMSE is defined in Eq. (4). Q² and R² are defined as follows:

Q² = 1 − Σ_{i=1}^{n} (ŷ_i − y_i)² / Σ_{i=1}^{n} (y_i − ȳ)²    (7)

R² = Σ_{i=1}^{n} (y_i^fit − ȳ)² / Σ_{i=1}^{n} (y_i − ȳ)²    (8)

where y_i, n and ŷ_i are defined as in Eq. (4), ȳ is the mean of the observation values, and y_i^fit is the fitted value of the ith target. R² takes any value between 0 and 1, with a value closer to 1 indicating that the regression model has better performance. In contrast, Q² has no lower boundary: its value ranges between −∞ and 1. A small difference between Q² and R² means that the model has better performance.
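Eqs. (7) and (8) can be sketched directly; here y_fit are in-sample fitted values and y_pred are cross-validated predictions, following the definitions above.

```python
import numpy as np

def q2_r2(y_true, y_pred, y_fit):
    """Q2 (Eq. 7) on cross-validated predictions and R2 (Eq. 8) on
    fitted values, both normalized by the total sum of squares."""
    y_true = np.asarray(y_true, float)
    ybar = y_true.mean()
    tss = ((y_true - ybar) ** 2).sum()
    q2 = 1.0 - ((np.asarray(y_pred, float) - y_true) ** 2).sum() / tss
    r2 = ((np.asarray(y_fit, float) - ybar) ** 2).sum() / tss
    return q2, r2
```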

4. Results and discussion

In this study, we developed the GA–SVR–PO method for the prediction of molecular retention indices in gas chromatography. The performance of our GA–SVR–PO model was compared with the performance of three existing models: GA–MLR, GA–MLR–SVR, and GA–SVR. In the GA–MLR model, the altered RMSE of the validation set based on MLR was used as the fitness of the GA to find the optimal subset of molecular descriptors. GA–MLR–SVR is a model that uses the optimal molecular descriptors found by the GA–MLR model to train SVR. In the third model, GA–SVR, the altered RMSE of the validation set based on SVR was directly used as the fitness of the GA to find the best molecular descriptors. The parameters of SVR used in the second and third models were all optimized by grid search. In our proposed model, GA–SVR–PO, the GA


Table 1
Predictive performance of each QSRR model.

                          GA–MLR           GA–MLR–SVR       GA–SVR            GA–SVR–PO
RMSE   Training set       53.82 ± 7.68     48.02 ± 9.38     38.27 ± 4.70      32.26 ± 6.88
       Validation set     54.30 ± 7.23     –                43.36 ± 4.42      38.45 ± 5.82
       Test set           65.07 ± 28.87    57.94 ± 27.76    47.31 ± 32.20     41.91 ± 29.50
R²     Training set       0.961 ± 0.012    0.968 ± 0.011    0.979 ± 0.0047    0.985 ± 0.005
       Validation set     0.962 ± 0.012    –                0.975 ± 0.0048    0.980 ± 0.005
       Test set           0.927 ± 0.070    0.936 ± 0.068    0.946 ± 0.080     0.955 ± 0.728


was used not only to optimize the subset of molecular descriptors, but also to optimize the SVR parameters. We employed a 10-fold cross-validation strategy to construct and validate each of the four models. Each experiment was conducted 10 times using different training and test sets. Every RI value was predicted once, and the final predicted RI values of the 10 experiments were then combined to evaluate the predictive performance of each QSRR model.

Table 1 shows the performance of each QSRR model. As expected, the training set has the smallest RMSE and the largest R² values in all four models. Compared with the SVR-based models (GA–SVR and GA–SVR–PO), the MLR-based models (GA–MLR and GA–MLR–SVR) have larger RMSE values and smaller R² values: the mean RMSE and R² values for the MLR-based models on the test set are 65.07 and 57.94, and 0.927 and 0.936, respectively, whereas those for the SVR-based models are 47.31 and 41.91, and 0.946 and 0.955, respectively. Furthermore, there is no significant difference between the performance on the test and validation sets in the SVR-based models, whereas the performance of the MLR-based models is relatively poor on the test set compared to their performance on the validation set.

For all four QSRR models the standard deviations are small, indicating that all four methods have stable predictive performance. The standard deviations of RMSE and R² for the MLR-based models on the training set and the validation set range from 7.23 to 9.38 and from 0.012 to 0.070, respectively, while those for the SVR-based methods range from 4.70 to 6.88 and from 0.0047 to 0.0050, respectively. This indicates that the SVR-based models are more stable than the MLR-based models. Among all four tested models, the GA–SVR–PO model has the best performance, that is, the smallest RMSE value and the largest R² with smaller standard deviations.

In order to study the correlation between the predicted retention indices and the true retention indices extracted from the NIST RI database, we merged all of the predicted retention indices of the 10 cross-validation experiments and display the results in Fig. 2. The R² of the GA–MLR model is 0.935. Compared to the GA–MLR model, the value of R² is improved by 0.009, 0.022 and 0.0326 for the GA–MLR–SVR, GA–SVR and GA–SVR–PO models, respectively. The absolute difference between the predicted retention indices and the true RI values also improved, from 45.93 to 42.72, 18.77 and 13.76 for the four models, respectively.

Fig. 3 shows residual case order plots of the predicted results from the 10-fold cross-validation experiments. Each data point in Fig. 3 represents a residual, that is, the difference between the predicted RI and the true RI extracted from the NIST RI database. Each line represents a 95% confidence interval; a line intercepting the zero line means that the predicted RI value is within 5% variation of the true RI value. It can be seen that the residuals of the four QSRR models are all randomly distributed on the two sides of the zero lines, which indicates that these models describe the data well. The diamond points highlighted in blue have prediction variations larger than 5%. There are 14 molecules with a prediction variation larger than 5% in the GA–MLR model, while there are 11, 5 and 6 such molecules in the GA–MLR–SVR, GA–SVR and GA–SVR–PO models, respectively. This indicates that the SVR-based models perform better than the MLR-based models.

Fig. 4 displays the box plot of the predictive performance of each model on the test set in the 10-fold cross-validation experiments. It can be seen that the RMSE value decreases and the R² value increases in the order GA–MLR, GA–MLR–SVR, GA–SVR and GA–SVR–PO. Because of the randomness of the GA, the predictive performance of each QSRR model shows some variation. The SVR-based models have smaller variations than the MLR-based models, meaning that the GA–SVR-based models have more stable prediction performance than the MLR-based models. Between the SVR-based models, the performance of the GA–SVR–PO model is more stable than that of the GA–SVR model.

Each cross in Fig. 4 indicates that the predicted RI values of the test set in the corresponding cross-validation experiment differ greatly from the true RI values of the molecules in the test set, and the results of that cross-validation experiment may be treated as outliers of the 10-fold cross-validation experiments. Two cross-validation experiments were detected as outliers in the GA–SVR–PO model. We manually examined the prediction results of these two cross-validation experiments and found that two compounds, acetal (CAS number 50-78-2) and methylaldehyde (CAS number 50-00-0), had a remarkable difference between the predicted RI value and the true RI value. The true RI values for these two compounds are 1285 and 260, respectively, and the absolute predictive errors for acetal and methylaldehyde are 565 and 288, respectively. Further examination of the true RI values of the entire dataset indicates that such a large prediction error was caused by the small size of the dataset we used: there are not enough molecules in the training set with RI values similar to those of these two compounds. After manually removing these two compounds, no cross-validation experiment was detected as an outlier. Therefore, we believe that the lack of other similar compounds in the training set leads to a large predictive difference for these two compounds. For the same reason, the standard deviations of RMSE and R² on the test set for the SVR-based models listed in Table 1 are larger than those of the MLR-based models.

The prediction variation threshold η_pred is a critical parameter for the evaluation of model performance. η_pred is defined as the relative variation of the predicted RI value from the true RI value and can be calculated as follows:

η_pred = |y_pred − y| / y    (9)

where y_pred is the predicted retention index and y is the true RI value. Fig. 5 shows the overall prediction performance of the four models on the test set. It can be seen that the GA–SVR–PO model has better predictive performance than the other three models. With η_pred = 10%, the prediction accuracies of the four models GA–MLR, GA–MLR–SVR, GA–SVR and GA–SVR–PO are 80.16%, 88.10%, 93.65% and 96.03%, respectively.
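The accuracy-at-threshold statistic based on Eq. (9) can be sketched as:

```python
def prediction_accuracy(y_true, y_pred, threshold=0.10):
    """Fraction of molecules whose relative prediction variation
    eta_pred = |y_pred - y| / y (Eq. 9) is within the threshold."""
    hits = sum(abs(p - t) / t <= threshold for t, p in zip(y_true, y_pred))
    return hits / len(y_true)
```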

Table 2 lists the combined predictive results of all four models. Overall, the Q² and R² values agree with each other. The large values of Q² and R² indicate that all four QSRR models have good


Fig. 2. Predictive plots of the four models.

Table 2
The final predictive results of Q² and R².

QSRR model     Q²       R²
GA–MLR         0.9318   0.9351
GA–MLR–SVR     0.9413   0.9441
GA–SVR         0.9562   0.9571
GA–SVR–PO      0.9671   0.9676


predictive performance. The SVR-based models perform better than the MLR-based models, and the GA–SVR–PO model performed best, with Q² and R² values reaching 0.967 and 0.968, respectively.

Table 3 gives a partial list of the molecular descriptors selected by the three QSRR models over 20 experiments. The GA–MLR–SVR model uses the molecular descriptors selected by the GA–MLR model, that is, these two models use the same set of molecular descriptors. In this table, the linear correlation refers to Pearson's correlation between a molecular descriptor and the true RI value. The frequency is the number of experiments in which a molecular descriptor was selected by the model during the 10-fold cross-validation experiments. BP has the highest linear correlation with the retention index (0.94), followed by molecular weight (0.902), the Wiener polarity number (0.88) and the number of carbon atoms (0.846). In most cases, the frequency with which a molecular descriptor is selected by the SVR-based models differs from that of the MLR-based models, and the SVR-based models have a strong capability of selecting some crucial descriptors.

Despite the randomness nature of the GA, there still have

ome consistency between the three models. The BP, vsurf_ID6 and

pr_brigid are the descriptors that have selected most times by

he three models. The linear correlation coefficients of these three


Fig. 3. The residual case order plot of the four models.

Fig. 4. The box plot for 20 experiments.


Table 3
A partial list of molecular descriptors selected by each QSRR model and linear correlation coefficient.

Name           Description                                                                Correlation   Selection frequency
                                                                                                        GA–MLR  GA–SVR  GA–SVR–PO
BP             Boiling point                                                                0.94         10      10      10
vsurf_ID6      Hydrophobic integy moment                                                   −0.096         8       7       9
opr_brigid     The number of rigid bonds                                                    0.56          6       8       7
Weight         Molecular weight                                                             0.902         5       2       3
a_acc          Number of hydrogen bond acceptor atoms                                       0.103         4       1       3
PEOE_PC-       Total negative partial charge                                               −0.684         3       2       1
PEOE_RPC-      Relative negative partial charge                                            −0.357         3       2       1
E_vdw          van der Waals component of the potential energy                              0.796         3       2       1
SlogP_VSA9     Subdivided surface areas                                                    −0.236         3       1       4
PC+            Total positive partial charge                                                0.494         3       1       1
Q_VSA_FPOS     Fractional positive van der Waals surface area                              −0.269         3       1       0
E_nb           Value of the potential energy with all bonded terms disabled                 0.625         3       1       0
PEOE_VSA_NEG   Total negative van der Waals surface area                                    0.689         3       0       3
vsurf_DW13     Contact distances of vsurf_EWmin                                            −0.09          3       0       1
KierA1         First alpha modified shape index                                             0.613         2       7       6
vsurf_W2       Hydrophilic volume                                                           0.509         2       4       0
chi1v_C        Carbon valence connectivity index                                            0.733         2       0       3
Kier3          Third kappa shape index                                                      0.259         1       3       2
b_1rotN        Number of rotatable single bonds                                             0.50          1       3       1
vsurf_CW5      Capacity factor                                                             −0.049         1       3       0
AM1_dipole     The dipole moment calculated using the AM1 Hamiltonian                      −0.029         1       2       3
E_ele          Electrostatic component of the potential energy                             −0.19          1       2       1
DASA           Absolute value of the difference between ASA+ and ASA−                       0.053         1       2       1
weinerPol      Wiener polarity number                                                       0.88          1       1       3
PEOE_VSA-1     Sum of v_i where q_i is in the range (−0.10, −0.05)                          0.457         1       1       3
E_strain       The current energy minus the value of the energy at a near local minimum     0.6           1       1       2
pmiY           y component of the principal moment of inertia                               0.591         1       1       2
vsurf_ID1      Hydrophobic integy moment                                                   −0.074         1       1       2
MNDO_IP        The ionization potential (kcal/mol)                                         −0.321         1       0       4
E_stb          Bond stretch-bend cross-term potential energy                                0.1           1       0       3
PC-            Total negative partial charge                                               −0.494         1       0       2
E_oop          Out-of-plane potential energy                                                0.126         1       0       2
a_nC           Number of carbon atoms                                                       0.846         0       3       3
a_ICM          Atom information content (mean)                                              0.087         0       3       2

Fig. 5. The fraction of molecules with correctly predicted retention indices vs. the threshold of prediction variation on the test set.

molecular descriptors with the true RI value are 0.94, −0.096 and 0.56, respectively. It should be noted that some molecular descriptors with a strong linear correlation with the true RI value do not have a high selection frequency. For example, the correlation coefficient between the Wiener polarity number and the true RI value is 0.88, but it was selected only 1, 1, and 3 times during the 10-fold cross-validation experiments in the GA–MLR, GA–SVR and GA–SVR–PO models, respectively. Another example is the number of carbon atoms, which was selected only 0, 3, and 3 times, respectively. This phenomenon is mainly induced by the random nature of the GA. Another reason is that all regression models select molecular descriptors based on the overall prediction accuracy: some molecular descriptors with low linear correlation coefficients have a high selection frequency because their contribution to the RI prediction compensates for the other selected descriptors and therefore yields high prediction accuracy on the molecules present in the training and validation sets.
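The two quantities reported in Table 3 can be reproduced in outline as follows. This is a sketch with invented descriptor values and hypothetical GA runs, not the actual MOE descriptors or the paper's 20 experiments:

```python
import numpy as np
from collections import Counter

def pearson(x, y):
    """Pearson linear correlation between a descriptor column
    and the true RI values (the 'Correlation' column of Table 3)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc ** 2) * np.sum(yc ** 2)))

def selection_frequency(runs):
    """Tally how many GA experiments selected each descriptor.
    `runs` is a list of selected-descriptor name sets, one per experiment."""
    counts = Counter()
    for selected in runs:
        counts.update(selected)
    return counts

# Toy illustration (invented values)
bp = [100, 150, 200, 250]      # hypothetical boiling points
ri = [800, 1000, 1150, 1400]   # hypothetical retention indices
print(round(pearson(bp, ri), 3))

runs = [{"BP", "vsurf_ID6"}, {"BP"}, {"BP", "opr_brigid"}]
print(selection_frequency(runs)["BP"])  # BP selected in all 3 runs -> 3
```

A descriptor can score low on `pearson` yet high on `selection_frequency`, which is exactly the behaviour discussed above: the GA scores subsets by overall prediction accuracy, not by each descriptor's individual linear correlation.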

5. Conclusions

In this study, we developed a genetic algorithm and support vector regression with parameter optimization model (GA–SVR–PO) for the prediction of molecular retention indices in gas chromatography. The performance of the proposed GA–SVR–PO model was compared with that of three existing models: GA–MLR, GA–MLR–SVR, and GA–SVR. Our analyses show that the MLR-based models can achieve acceptable performance and that the SVR-based models improve on it. The SVR-based models also have more stable performance than the MLR-based models. Among all four models, the proposed GA–SVR–PO model achieved the best predictive performance, with R² > 0.96 and RMSE = 49.94. The prediction accuracy of the GA–SVR–PO model is 96% at 10% of prediction variation.

Acknowledgments

This work was supported by the National Natural Science Foundation of China under grant nos. 61271098, 61672035, 61300058,



61472282 and 61032007, the Provincial Natural Science Research Program of Higher Education Institutions of Anhui Province under grant no. KJ2012A005, and the Anhui Provincial Natural Science Foundation under grant no. 1508085MF129.

References

[1] S.E. Stein, An integrated method for spectrum extraction and compound identification from gas chromatography/mass spectrometry data, J. Am. Soc. Mass Spectrom. 10 (1999) 770–781.
[2] E. Kováts, Gas-chromatographische Charakterisierung organischer Verbindungen. Teil 1: Retentionsindices aliphatischer Halogenide, Alkohole, Aldehyde und Ketone, Helv. Chim. Acta 41 (1958) 1915–1932.
[3] H. van den Dool, P.D. Kratz, A generalization of the retention index system including linear temperature programmed gas–liquid partition chromatography, J. Chromatogr. 11 (1963) 463–471.
[4] K. Heberger, Quantitative structure-(chromatographic) retention relationships, J. Chromatogr. A 1158 (2007) 273–305.
[5] R. Kaliszan, Quantitative Structure-Chromatographic Retention Relationships, Wiley, New York, 1987.
[6] E. Dossin, E. Martin, P. Diana, A. Castellon, A. Monge, P. Pospisil, M. Bentley, P.A. Guy, Prediction models of retention indices for increased confidence in structural elucidation during complex matrix analysis: application to gas chromatography coupled with high-resolution mass spectrometry, Anal. Chem. 88 (2016) 7539–7547.
[7] R.J. Hu, H.X. Liu, R.S. Zhang, C.X. Xue, X.J. Yao, M.C. Liu, Z.D. Hu, B.T. Fan, QSPR prediction of GC retention indices for nitrogen-containing polycyclic aromatic compounds from heuristically computed molecular descriptors, Talanta 68 (2005) 31–39.
[8] Y.W. Wang, X.J. Yao, X.Y. Zhang, R.S. Zhang, M.C. Liu, Z.D. Hu, B.T. Fan, The prediction for gas chromatographic retention indices of saturated esters on stationary phases of different polarity, Talanta 57 (2002) 641–652.
[9] K. Heberger, M. Gorgenyi, M. Sjostrom, Partial least squares modeling of retention data of oxo compounds in gas chromatography, Chromatographia 51 (2000) 595–600.
[10] L.I. Nord, D. Fransson, S.P. Jacobsson, Prediction of liquid chromatographic retention times of steroids by three-dimensional structure descriptors and partial least squares modeling, Chemom. Intell. Lab. Syst. 44 (1998) 257–269.
[11] Z. Garkani-Nejad, Use of self-training artificial neural networks in a QSRR study of a diverse set of organic compounds, Chromatographia 70 (2009) 869–874.
[12] D.-S. Huang, J.-X. Du, A constructive hybrid structure optimization methodology for radial basis probabilistic neural networks, IEEE Trans. Neural Netw. 19 (2008) 2099–2115.
[13] D.-S. Huang, Systematic Theory of Neural Networks for Pattern Recognition, Publishing House of Electronic Industry of China, Beijing, 1996, p. 8.
[14] D.-S. Huang, Radial basis probabilistic neural networks: model and application, Int. J. Pattern Recognit. Artif. Intell. 13 (1999) 1083–1101.
[15] X.J. Yao, X.Y. Zhang, R.S. Zhang, M.C. Liu, Z.D. Hu, B.T. Fan, Prediction of gas chromatographic retention indices by the use of radial basis function neural networks, Talanta 57 (2002) 297–306.
[16] C.L. Wang, M.J. Skibic, R.E. Higgs, I.A. Watson, H. Bui, J.B. Wang, J.M. Cintron, Evaluating the performances of quantitative structure–retention relationship models with different sets of molecular descriptors and databases for high-performance liquid chromatography predictions, J. Chromatogr. A 1216 (2009) 5030–5038.
[17] M.H. Fatemi, E. Baher, M. Ghorbanzade'h, Predictions of chromatographic retention indices of alkylphenols with support vector machines and multiple linear regression, J. Sep. Sci. 32 (2009) 4133–4142.
[18] C.-C. Chang, C.-J. Lin, LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. 2 (2011) 27.
[19] G. Oliveri, A. Massa, Genetic algorithm (GA)-enhanced almost difference set (ADS)-based approach for array thinning, IET Microw. Antennas Propag. 5 (2011) 305–315.
[20] T. Hancock, R. Put, D. Coomans, Y. Vander Heyden, Y. Everingham, A performance comparison of modern statistical techniques for molecular descriptor selection and retention prediction in chromatographic QSRR studies, Chemom. Intell. Lab. Syst. 76 (2005) 185–196.
[21] V.V. Mihaleva, H.A. Verhoeven, R.C.H. de Vos, R.D. Hall, R.C.H.J. van Ham, Automated procedure for candidate compound selection in GC-MS metabolomics based on prediction of Kovats retention index, Bioinformatics 25 (2009) 787–794.
[22] C.W. Hsu, C.C. Chang, C.J. Lin, A Practical Guide to Support Vector Classification, Department of Computer Science and Information Engineering, National Taiwan University, 2003.
[23] B. Ustun, W.J. Melssen, M. Oudenhuijzen, L.M.C. Buydens, Determination of optimal support vector regression parameters by genetic algorithms and simplex optimization, Anal. Chim. Acta 544 (2005) 292–305.
[24] S.W. Lin, Z.J. Lee, S.C. Chen, T.Y. Tseng, Parameter determination of support vector machine and feature selection using simulated annealing approach, Appl. Soft Comput. 8 (2008) 1505–1512.
[25] Chemical Computing Group Inc., Molecular Operating Environment (MOE), 2008. http://www.chemcomp.com/.
[26] W.P. Eckel, T. Kind, Use of boiling point-Lee retention index correlation for rapid review of gas chromatography-mass spectrometry data, Anal. Chim. Acta 494 (2003) 235–243.
[27] S.E. Stein, Retention Indices in NIST Chemistry WebBook, NIST Standard Reference Database Number 69, versions 2005 and 2008 (http://webbook.nist.gov), 2008.
[28] NIST/SEMATECH e-Handbook of Statistical Methods, http://www.itl.nist.gov/div898/handbook/eda/section3/eda35h.htm.
[29] R.B. Dean, W.J. Dixon, Simplified statistics for small numbers of observations, Anal. Chem. 23 (1951) 636–638.
[30] V. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, New York, USA, 1995.
[31] B. Scholkopf, K.K. Sung, C.J.C. Burges, F. Girosi, P. Niyogi, T. Poggio, V. Vapnik, Comparing support vector machines with Gaussian kernels to radial basis function classifiers, IEEE Trans. Signal Process. 45 (1997) 2758–2765.
[32] A.J. Smola, B. Scholkopf, A tutorial on support vector regression, Stat. Comput. 14 (2004) 199–222.
[33] A.J. Chipperfield, P.J. Fleming, C.M. Fonseca, Genetic algorithm tools for control systems engineering, in: Proceedings of the 1994 Adaptive Computing in Engineering Design and Control, Plymouth Engineering Design Centre, 1994, pp. 128–133.
[34] D.M. Hawkins, The problem of overfitting, J. Chem. Inf. Comput. Sci. 44 (2004) 1–12.
[35] R. Todeschini, V. Consonni, A. Mauri, M. Pavan, Detecting "bad" regression models: multicriteria fitness functions in regression analysis, Anal. Chim. Acta 515 (2004) 199–208.

Jun Zhang was born in Anhui Province, China, in 1971. He received the M.S. degree in Pattern Recognition and Intelligent Systems in 2004 from the Institute of Intelligent Machines, Chinese Academy of Sciences, and the Ph.D. degree from the University of Science and Technology of China, Hefei, China, in 2007. He is currently an associate professor in the School of Electrical Engineering and Automation, Anhui University, China. His research interests focus on deep learning, ensemble learning and cheminformatics.