QSAR model reproducibility and applicability: A case study of rate constants of hydroxyl radical reaction models applied to polybrominated diphenyl ethers and (benzo-)triazoles


PARTHA PRATIM ROY, SIMONA KOVARICH, PAOLA GRAMATICA

Department of Structural and Functional Biology, University of Insubria, Via Dunant 3, 21100, Varese, Italy

Received 14 February 2011; Revised 22 March 2011; Accepted 22 March 2011

DOI 10.1002/jcc.21820

Published online 3 May 2011 in Wiley Online Library (wileyonlinelibrary.com).

Abstract: The crucial importance of the three central OECD principles for quantitative structure-activity relationship

(QSAR) model validation is highlighted in a case study of tropospheric degradation of volatile organic compounds

(VOCs) by OH, applied to two CADASTER chemical classes (PBDEs and (benzo-)triazoles). The application of any

QSAR model to chemicals without experimental data largely depends on model reproducibility by the user. The reproducibility of an unambiguous algorithm (OECD Principle 2) is guaranteed by redeveloping MLR models based on both an updated version of the DRAGON software for molecular descriptor calculation and some freely available online descriptors. The Genetic Algorithm has confirmed its ability to always select the most informative descriptors independently of the input pool of variables. The ability of the GA-selected descriptors to model chemicals not used in model development is verified by three different splittings (random by response, K-ANN and K-means clustering), thus ensuring the external predictivity of the new models, independently of the training/prediction set composition (OECD Principle 4).

The relevance of checking the structural applicability domain (OECD Principle 3) becomes very evident on comparing

the predictions for CADASTER chemicals, using the new models proposed herein, with those obtained by EPI Suite.

© 2011 Wiley Periodicals, Inc. J Comput Chem 32: 2386–2396, 2011

Key words: reproducible algorithm; molecular descriptors; external validation; applicability domain; CADASTER

chemicals

Introduction

Quantitative structure-activity relationships (QSARs) are predic-

tive models derived from the application of statistical tools

correlating biological activity, physico-chemical properties or

reactivity of chemicals (drugs/industrial chemicals/environmental

pollutants) with descriptors representative of molecular structure

and/or property. QSAR models have demonstrated their utility

for a long time, initially in drug design and more recently also

in general chemical screening of big libraries of compounds. It

is important to distinguish between ‘‘descriptive QSARs’’ and

‘‘predictive QSARs.’’1 In ‘‘descriptive QSARs,’’ the main atten-

tion is focused on modelling the existing data, fitting them as

best as possible, using molecular descriptors that are mostly

selected by a supposed ‘‘understanding’’ of the correlation/cau-

sality, in terms of mechanism interpretability. These kinds of

QSAR models are highly useful for mechanism interpretation,

particularly in local models developed on homogeneous data

sets of congeneric compounds, and are widely applied, mainly

in drug design. However, in virtual screening a ‘‘predictive

QSAR’’ approach should be preferred: global models exploit the

limited existing experimental information to predict information

relative to chemicals without experimental data. This can be

highly useful to screen big data sets and prioritize, for experi-

mental tests, compounds that are in silico highlighted as poten-

tially more dangerous. Thus, the check of predictivity should be

the most important and primary aspect of ‘‘predictive QSARs.’’

Additional Supporting Information may be found in the online version of this article.

Correspondence to: P. Gramatica; e-mail: [email protected]

Contract/grant sponsor: European Union (CADASTER); contract/grant numbers: FP7-ENV-2007-1-212668

The recent European legislation REACH (Registration, Evaluation, Authorization and Restriction of Chemicals)2 includes the use of QSAR models for the prediction of data not experimentally available. However, the predicted values must be

reliably obtained by QSAR models validated according to

OECD principles for the validation, for regulatory purposes, of

(Q)SAR models.3 These principles, defined after much discus-

sion in QSAR and regulatory communities, are an optimum

summary of the most important points that need to be addressed

to obtain reliable QSAR models. A guidance document on the

validation of QSAR models,4 including useful information on

good practices in QSAR modeling, has been prepared from the

collaborative work of various international experts.

In this article, we focused on three central principles: Principle 2, an unambiguous algorithm; Principle 3, a defined domain of applicability; and Principle 4, appropriate measures of goodness-of-fit, robustness and predictivity, in the context of a specific case study, i.e., the reactivity with hydroxyl radicals in the troposphere.

The intent of Principle 2 (unambiguous algorithm) is to

ensure transparency in the model algorithm that generates pre-

dictions of an endpoint from information on chemical structure

and/or physicochemical properties, so that others can reproduce

the model. In fact, without information on how QSAR estimates

are derived, the performance of a model cannot be

independently established. The algorithms used in QSAR model-

ling (in terms of methods and molecular descriptors) should be

described thoroughly, so that the user can understand exactly

how the estimated value was produced, and be able to reproduce

the calculations, if desired. Thus, the important issue of predic-

tion reproducibility is covered by this OECD principle.

The need to define an applicability domain (Principle 3) expresses the fact that QSAR models are inevitably associated

with limitations in terms of types of chemical structures, phys-

ico-chemical properties and mechanisms of action for which the

models can generate reliable predictions. Even a robust, signifi-

cant and validated QSAR model cannot be expected to reliably

predict a studied end-point for the entire universe of chemicals.

The applicability domain of a QSAR model has been defined5 as

the response and chemical structure space in which the model

makes predictions with a given reliability and is defined by the

nature of the chemicals in the training set. It is generally

felt that if a new molecule is somehow similar, or is in the

‘‘domain’’ or ‘‘space’’ of the training set, it is likely to be well-

predicted (interpolation), otherwise there is significant ‘‘extrapo-

lation’’ and the prediction could be less reliable: it is highly

useful that a user has this type of information.

Principle 4 expresses the need to perform statistical validation to establish the performance of a model, which consists

of internal model performance (goodness-of-fit and robustness)

and external model performance (predictivity).6–8

The real utility of QSAR models in the REACH context, and

in specific EU-funded projects dedicated to develop or apply

QSAR models for the REACH legislation, is to obtain reliable

predicted data for compounds without experimental data. In the

CADASTER Project,9 in which the authors are involved, some

classes of emerging pollutants (flame retardants including

PBDEs, perfluorinated chemicals, fragrances and (benzo)tria-

zoles) are studied for the possible application of QSAR predic-

tions in their risk assessment. This offered us the opportunity to

apply some of our previously developed and published QSAR

models10 to CADASTER chemicals to study their persistence,

even without experimental data. Reaction with hydroxyl radicals in the troposphere is the dominant removal pathway for many industrial chemicals, and for this reason it is crucial in determining chemical persistence in air.

Recently, QSAR models with validated external predictivity for degradation by OH of a large set (460) of volatile organic compounds (VOCs)10 were developed in our laboratory, using version 5.0 (2004) of DRAGON for molecular descriptor calculation. However, when we tried to apply those models to predict new data sets in the CADASTER Project, we found that the models were no longer reproducible because some descriptors were missing or changed in version 5.5 (2007)11 and in the latest version, 6 (2009), of the DRAGON software. This is a serious drawback for QSAR modeling, both for model developers and users.

So this raises the question of what to do when molecular

descriptors of already developed and validated models are no lon-

ger reproducible because of the new software versions used for

molecular descriptors calculation (deletion of some descriptors,

changes of their values, etc). This problem was already faced and

solved in a previous publication when a BCF model had to be

updated for the above reason.12 In fact, DRAGON is a software

for molecular descriptors calculation that is continuously updated,

not only in terms of addition of new descriptors, but also in the

revision of the old ones. This highlights the fact that QSAR mod-

eling is a dynamic process and continuous updating of QSAR

models is useful to have constantly applicable models. Neverthe-

less it is problematic that descriptors, which had demonstrated

their ability to model some data sets can no longer be calculated,

or are calculated differently, by the newer software versions used

by QSAR model developers: thus the practical utility of that spe-

cific model, which is no longer reproducible, is lost, and it is

therefore no longer suitable for new users to apply.

Thus, we decided to verify whether new predictive QSAR models could be proposed, based on the same data set but using more and different descriptors, both from the new versions of DRAGON11 and from descriptors that can be calculated freely on the web.13 In this work, we also aimed to verify the ability of our variable selection method, Genetic Algorithms, to select descriptors for predictive models from a changed pool of input descriptors with similar mechanistic meaning. The final goal is the proposal of reproducible QSAR models for OH tropospheric degradation, with external predictivity6–8 rigorously verified on different splittings and also by applying various statistical parameters,14–18 some of them15–18 proposed after our previous work. These new models will be practically and regularly applicable to chemicals in the CADASTER Project (here PBDEs and (benzo)triazoles) and also, for regulation under REACH, to a wide set of new chemicals, always verifying the applicability domain. A comparison of our predicted data with predictions for the same classes obtained by the widely used EPI Suite software is also performed and discussed.

Materials and Methods

Experimental Data Set

Experimental data of the OH radical degradation rate constants

of 460 heterogeneous organic compounds were obtained from


literature reported by Atkinson.19 The selected data were for reactions at 25 °C and 1 atm; all the rate constants, reported in cm³ s⁻¹ per molecule, were transformed to logarithmic units and multiplied by −1 to obtain positive values (the higher the value on the −log scale, the lower the reactivity, and vice versa) and used as the response variable for subsequent QSAR analyses.

The data set includes alkanes, alkenes, alcohols, halogenated

chemicals, amines, aromatics, and other functional groups.
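The response transformation described above can be sketched in a few lines. This is only an illustration: the base-10 logarithm is an assumption (the text says only "logarithmic units"), and the function name and the two rate constants are hypothetical, not taken from the data set.

```python
import math

def neg_log_rate_constant(k_oh: float) -> float:
    """Return -log10(k_OH) for a rate constant given in cm^3 s^-1 per molecule.

    Base 10 is assumed here; the source only states "logarithmic units".
    """
    if k_oh <= 0:
        raise ValueError("rate constant must be positive")
    return -math.log10(k_oh)

# Hypothetical rate constants, for illustration only
fast = neg_log_rate_constant(1.0e-10)  # highly reactive compound
slow = neg_log_rate_constant(1.0e-14)  # poorly reactive compound

# The less reactive compound gets the higher response value
print(fast, slow)
```

On this scale a larger response means a slower reaction, which is why descriptors that increase reactivity carry negative coefficients in the models.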

In Supporting Information Table S-I all the chemicals in the

experimental data set, ordered according to their CAS number,

are listed with names, SMILES, molecular descriptor values,

experimental and predicted response values.

Molecular Descriptors

The molecular descriptors for the given compounds were mainly

calculated using DRAGON software11 on the (x, y, z)-atomic

coordinates of the minimal energy conformations determined by

the AM1 method in HYPERCHEM Package.20

In this study we consider only zero-, mono-, bi-dimensional

descriptors in DRAGON 5.5 version. Then we deleted those

descriptors that are no longer available or that have somewhat

different values in the updated version (DRAGON 6.0). Finally

constant values and descriptors found to be correlated pair-wise

were excluded in a pre-reduction step (one of any two descrip-

tors with a K correlation greater than 0.95 was removed to

reduce redundant and not useful information), thus obtaining a

pruned set of 341 molecular descriptors.
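A minimal sketch of such a pre-reduction step is given below. It removes constant descriptors and then drops one descriptor from every pair whose absolute pairwise correlation exceeds 0.95. The Pearson correlation is used here as a stand-in for the pairwise K correlation mentioned in the text, and the rule of keeping the first descriptor of a correlated pair is an assumption; the descriptor names and values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def prune_descriptors(columns, threshold=0.95):
    """columns maps descriptor name -> list of values (one per compound).

    Constant descriptors are dropped; of any two descriptors correlated
    above the threshold, only the one encountered first is kept (assumption).
    """
    kept = []
    for name, values in columns.items():
        if len(set(values)) < 2:
            continue  # constant descriptor carries no information
        if any(abs(pearson(values, columns[k])) > threshold for k in kept):
            continue  # redundant with a descriptor already kept
        kept.append(name)
    return kept

# Hypothetical descriptor table for five compounds
columns = {
    "nC": [1, 2, 3, 4, 5],
    "MW": [16, 30, 44, 58, 72],   # perfectly correlated with nC
    "const": [1, 1, 1, 1, 1],     # constant
    "nX": [0, 2, 0, 1, 0],
}
selected = prune_descriptors(columns)
```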

For the calculation of online descriptors we used the online platform of molecular descriptors available on the CADASTER web site.13 Different 2D descriptors (E-state, ALogPS, Molprint fragment, AMBIT descriptors, GSFragment, ISIDA fragments, etc.) were calculated and then pruned by deleting descriptors with fewer than 2 unique values, as well as those with a pairwise correlation > 0.95. In addition, we added ETA descriptors,21 obtaining a large pool of 1023 input descriptors.

Furthermore, to provide energy information, the following electronic descriptors were added: three quantum-chemical descriptors (Highest Occupied Molecular Orbital (HOMO) and Lowest Unoccupied Molecular Orbital (LUMO) energies, and the HOMO-LUMO gap), calculated by the semiempirical molecular orbital program MOPAC (AM1 method for energy minimization) in the software HYPERCHEM. We used quantum-chemical descriptors previously10,22 calculated in our group, but the same descriptors can also be freely calculated on the web.13 The resulting input sets of 344 descriptors (DRAGON and MOPAC) and 1026 descriptors (online and MOPAC) underwent the subsequent selection of the best modeling variables.

QSAR Modeling

Multiple linear regression (MLR) and variable selection were

performed by Ordinary Least Squares regression (OLS).23 The

Genetic Algorithm-Variable Subset Selection (GA-VSS)24

approach was applied separately on a set of 344 (DRAGON and

MOPAC) and 1026 (Online and MOPAC) descriptors to select

those most relevant to obtain models with the highest predictive

power. First of all, models with 1–2 variables were developed

by the all-subset-method procedure to explore all the low dimen-

sion combinations. The number of descriptors was subsequently

increased one by one, and new models were formed. The out-

come of the Genetic Algorithms in MOBY DIGS software is a

population of 100 regression models, ordered according to their

decreasing internal predictive performance. The coefficient of

determination (R2) was reported as a measure of the total

variance of the response explained by the regression models (fit-

ting). All the models were validated internally by the leave-one-

out procedure (Q2LOO), and the robustness of the models was fur-

ther evaluated by bootstrap (Q2BOOT). The GA was stopped when

increasing the model size did not increase the Q2LOO value to any

significant degree.
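The fitting and internal validation statistics named above can be illustrated with a short NumPy sketch. It fits an MLR model by ordinary least squares and computes R² and Q²LOO, obtaining the leave-one-out residuals from the leverages rather than by refitting n times. The definition Q²LOO = 1 − PRESS/TSS with the full-set mean is an assumption about the exact formula used by the software, and the data are synthetic.

```python
import numpy as np

def fit_ols(X, y):
    """Return (coefficients with intercept first, R2, Q2_LOO) for an MLR model."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])        # add intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)     # ordinary least squares
    resid = y - A @ coef
    # Leverages: diagonal of the hat matrix A (A'A)^-1 A'
    h = np.einsum("ij,jk,ik->i", A, np.linalg.pinv(A.T @ A), A)
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - np.sum(resid ** 2) / tss
    press = np.sum((resid / (1.0 - h)) ** 2)         # leave-one-out residuals
    q2_loo = 1.0 - press / tss
    return coef, r2, q2_loo

# Synthetic example: 30 compounds, 2 descriptors
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(30, 2))
y_demo = 1.0 + 2.0 * X_demo[:, 0] - X_demo[:, 1] + rng.normal(scale=0.1, size=30)
coef_demo, r2_demo, q2_demo = fit_ols(X_demo, y_demo)
```

Q²LOO can never exceed R² in this formulation, so a large gap between the two is a quick warning sign of overfitting.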

Evidence that the proposed models were well founded,

and not just the result of chance correlation, was provided by

Y-scrambling testing: new models, based on the GA-selected

descriptors, were recalculated for a randomly reordered

response, which resulted in a significantly lower R2 than the

originally proposed models. The averaged scrambled R2 (R2YS)

was calculated after 500 scrambling iterations.6,7 Additionally, another parameter, $^{c}R_p^2$ ($^{c}R_p^2 = R\sqrt{R^2 - R_r^2}$, $R_r^2$ being the squared mean correlation coefficient of random models), was also calculated25 to check the distance of our developed models from chance models (Supporting Information Table S-II).
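A Y-scrambling test of this kind can be sketched as follows. The response is randomly permuted, the model is refitted on the same descriptors, and the average scrambled R² is compared with the original one; the chance-correlation parameter is then computed as R multiplied by the square root of (R² minus the random-model R²). Using the mean scrambled R² for the random-model term is an assumption, as are the function names and the synthetic data.

```python
import numpy as np

def r_squared(X, y):
    """Coefficient of determination of an OLS fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)

def y_scrambling(X, y, n_iter=500, seed=0):
    """Return (R2, mean scrambled R2, cR2p) for the given descriptors."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    r2 = r_squared(X, y)
    # Refit on randomly reordered responses; descriptors stay fixed
    r2_random = [r_squared(X, rng.permutation(y)) for _ in range(n_iter)]
    r2_ys = float(np.mean(r2_random))
    # cR2p = R * sqrt(R2 - R2_random); clipped at zero to keep the root real
    c_r2_p = float(np.sqrt(r2) * np.sqrt(max(r2 - r2_ys, 0.0)))
    return r2, r2_ys, c_r2_p

# Synthetic example: 40 compounds, 2 descriptors
rng_demo = np.random.default_rng(1)
X_demo = rng_demo.normal(size=(40, 2))
y_demo = 0.5 + X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng_demo.normal(scale=0.2, size=40)
r2_demo, r2_ys_demo, c_r2_p_demo = y_scrambling(X_demo, y_demo)
```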

Particular attention was devoted to the collinearity of the selected molecular descriptors: to avoid multicollinearity without, or with only ‘‘apparent’’, prediction power (due to chance correlation), regression was calculated only for variable subsets with an acceptable multivariate correlation with the response, by applying the QUIK rule (Q Under Influence of K).26 The acceptable models were only those with a global correlation of the [X + y] block (KXY) greater than the global correlation of the X block (KXX), X being the molecular descriptors and y the response variable. The collinearity in the original set of molecular descriptors resulted in many similar models with more or less the same predictive power (in the MOBY-DIGS software,23 100 models of different dimensionality). Therefore, when there were models of similar performance, those with higher ΔK (KXY − KXX) were selected and further verified.
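The QUIK check can be sketched as below. The text does not give the formula for K, so the definition used here is an assumption: K is computed from the eigenvalues of the correlation matrix of a data block as the sum of |λj/Σλ − 1/p| divided by 2(p − 1)/p, which is 0 for uncorrelated variables and 1 for perfectly collinear ones. The function names, the threshold argument and the synthetic data are hypothetical.

```python
import numpy as np

def k_index(block):
    """Multivariate correlation index K of a data block (rows = compounds).

    Assumed definition: based on the eigenvalue spread of the correlation matrix.
    Requires at least two columns.
    """
    corr = np.corrcoef(np.asarray(block, dtype=float), rowvar=False)
    eig = np.linalg.eigvalsh(corr)
    p = len(eig)
    return float(np.sum(np.abs(eig / eig.sum() - 1.0 / p)) / (2.0 * (p - 1) / p))

def passes_quik(X, y, delta_k=0.0):
    """Accept a descriptor subset only if K of [X + y] exceeds K of X by more than delta_k."""
    X = np.asarray(X, dtype=float)
    k_xx = k_index(X)
    k_xy = k_index(np.column_stack([X, y]))
    return bool(k_xy - k_xx > delta_k), k_xx, k_xy

rng = np.random.default_rng(1)

# Two nearly independent descriptors that together explain the response
X_good = rng.normal(size=(200, 2))
y_good = X_good[:, 0] + X_good[:, 1]
ok_good, kxx_good, kxy_good = passes_quik(X_good, y_good)

# Two almost identical descriptors and an unrelated response
x = rng.normal(size=200)
X_bad = np.column_stack([x, x + 1e-3 * rng.normal(size=200)])
y_bad = rng.normal(size=200)
ok_bad, kxx_bad, kxy_bad = passes_quik(X_bad, y_bad)
```

In the second case the descriptor block is far more internally correlated than it is with the response, which is exactly the situation the rule is meant to reject.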

Data Splitting for External Validation

For this study three different splitting techniques were applied

to select the training set for model development and the pre-

diction set for model external validation: Random by response,

Kohonen Artificial Neural Network (K-ANN) and K-means

clustering.

The random by response splitting was obtained by ordering the chemicals according to their descending kinetic constant value, and then putting the most and the least reactive in the training set and one out of every two chemicals in the prediction set (50% of the full dataset). This splitting guarantees that the prediction set spans the entire range of the experimental measurements and is numerically representative of the dataset. However, such splitting does not guarantee that the training set represents the entire molecular descriptor space of the original dataset, being only dependent on response values.
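One possible reading of this procedure is sketched below: the compounds are sorted by response, the two extremes are forced into the training set, and every second compound in between goes to the prediction set. The exact alternation rule is an assumption (with this rule the prediction set holds roughly, not exactly, half of the compounds), and the response values are hypothetical.

```python
def split_by_response(responses):
    """Return (training_idx, prediction_idx) for a list of response values."""
    # Positions of the compounds sorted by descending response
    order = sorted(range(len(responses)), key=lambda i: responses[i], reverse=True)
    # Every second compound, skipping the first and last of the ordered list
    prediction = order[1:-1:2]
    held_out = set(prediction)
    training = [i for i in order if i not in held_out]
    return training, prediction

# Hypothetical response values for eight compounds
responses = [11.2, 10.1, 12.5, 9.8, 13.0, 10.9, 11.7, 12.1]
training_idx, prediction_idx = split_by_response(responses)
```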

Roy, Kovarich, and Gramatica • Vol. 32, No. 11 • Journal of Computational Chemistry


The splitting of the data set realized by Kohonen Artificial

Neural Network (K-ANN)27 takes advantage of the clustering

capabilities of K-ANN, allowing the selection of a structurally

meaningful training set and a representative prediction set. The

211 most significant principal components, calculated from each

group of DRAGON molecular descriptors, were used to describe

the relevant structural information of the chemicals. This structural information and the response were used as variables to build a Kohonen map (12 × 12 neurons, 500 epochs). At the end of

500 epochs of the net training, similar chemicals fall within the

same neuron, i.e., they carry the same information. To select the

training set of chemicals, it is assumed that the compound closest

to each neuron centroid is the most representative of all the chem-

icals within the same neuron. Thus, the selection of the training

set chemicals was performed by the minimal distance from the

centroid of each cell in the top map. The remaining objects, close

to the training set chemicals, were used for the prediction set.

Another approach for splitting into training and prediction

sets is by using K-means clustering28 based on the standardized

predictor variables (DRAGON zero-, mono-, and bi-dimensional

descriptors). This nonhierarchical approach (clustering) ensures

that the similarity principle can be employed for grouping chem-

icals and splitting them in balanced training and prediction sets.

It must be supplied with the number of clusters (K) into which

the data are to be grouped and it expresses only the final cluster

membership for each compound. This procedure ensures that

any chemical classes (as determined by the clusters derived

from the K-means clustering technique) will be represented in

both series of compounds, i.e., training and prediction sets,

selecting randomly from each cluster.
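The final selection step can be sketched as follows, assuming the cluster membership of each compound has already been obtained from some K-means implementation. Within each cluster a fixed fraction of compounds is drawn at random for the prediction set, so every cluster with at least two members is represented in both sets. The 50% fraction, the function name and the labels are hypothetical.

```python
import random
from collections import defaultdict

def split_within_clusters(cluster_labels, fraction=0.5, seed=0):
    """Return (training_idx, prediction_idx) with random selection inside each cluster."""
    rng = random.Random(seed)
    members = defaultdict(list)
    for idx, label in enumerate(cluster_labels):
        members[label].append(idx)
    training, prediction = [], []
    for label in sorted(members):
        idx = members[label][:]
        rng.shuffle(idx)                      # random choice within the cluster
        n_pred = int(len(idx) * fraction)     # share sent to the prediction set
        prediction.extend(idx[:n_pred])
        training.extend(idx[n_pred:])
    return sorted(training), sorted(prediction)

# Hypothetical cluster labels for ten compounds (three clusters)
labels = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
training_idx, prediction_idx = split_within_clusters(labels)
```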

External Validation

The statistically internally optimized models were further eval-

uated for their real predictive power on the prediction set chemi-

cals not used in the model building process. The developed

models were judged by different external validation parameters, namely Q²F1 (ref. 14), Q²F2 (ref. 15), Q²F3 (ref. 16), r²m (ref. 17) and a parameter recently proposed by our group.18

Q²F1 (ref. 14), which has long been widely used as a metric for external validation, was calculated for all the developed models to assess their external predictivity. The two other variants of Q² for external validation, Q²F2 (ref. 15) and Q²F3 (ref. 16), were also calculated.

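They are expressed as follows:

$$Q^2_{F1} = 1 - \frac{\sum_{i=1}^{n_{EXT}} (\hat{y}_i - y_i)^2}{\sum_{i=1}^{n_{EXT}} (y_i - \bar{y}_{TR})^2} = 1 - \frac{PRESS}{SS_{EXT}(\bar{y}_{TR})} \qquad \text{(i)}$$

$$Q^2_{F2} = 1 - \frac{\sum_{i=1}^{n_{EXT}} (\hat{y}_i - y_i)^2}{\sum_{i=1}^{n_{EXT}} (y_i - \bar{y}_{EXT})^2} = 1 - \frac{PRESS}{SS_{EXT}(\bar{y}_{EXT})} \qquad \text{(ii)}$$

$$Q^2_{F3} = 1 - \frac{\left[\sum_{i=1}^{n_{EXT}} (\hat{y}_i - y_i)^2\right] / n_{EXT}}{\left[\sum_{i=1}^{n_{TR}} (y_i - \bar{y}_{TR})^2\right] / n_{TR}} = 1 - \frac{PRESS / n_{EXT}}{TSS / n_{TR}} \qquad \text{(iii)}$$

where $\hat{y}_i$ and $y_i$ indicate the calculated and observed activity values, respectively. $\bar{y}_{TR}$ and $\bar{y}_{EXT}$ indicate the response means of the training and external test sets, respectively. PRESS is the predictive sum of squares; $SS_{EXT}(\bar{y}_{TR})$ and $SS_{EXT}(\bar{y}_{EXT})$ are the total sums of squares of the external set calculated by means of the training set mean and the external set mean, respectively. TSS is the total sum of squares of the training set.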
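The three external validation parameters differ only in the reference sum of squares, which a short sketch makes explicit. The function name and the numerical values are hypothetical; the formulas follow the definitions given above (training-set mean for Q²F1, external-set mean for Q²F2, and size-normalized training-set sum of squares for Q²F3).

```python
def external_q2(y_train, y_ext, y_ext_pred):
    """Return (Q2_F1, Q2_F2, Q2_F3) for an external prediction set."""
    n_tr, n_ext = len(y_train), len(y_ext)
    mean_tr = sum(y_train) / n_tr
    mean_ext = sum(y_ext) / n_ext
    # Predictive sum of squares on the external set
    press = sum((p - o) ** 2 for p, o in zip(y_ext_pred, y_ext))
    ss_ext_tr = sum((o - mean_tr) ** 2 for o in y_ext)    # external set, training mean
    ss_ext_ext = sum((o - mean_ext) ** 2 for o in y_ext)  # external set, external mean
    tss_tr = sum((o - mean_tr) ** 2 for o in y_train)     # training set
    q2_f1 = 1 - press / ss_ext_tr
    q2_f2 = 1 - press / ss_ext_ext
    q2_f3 = 1 - (press / n_ext) / (tss_tr / n_tr)
    return q2_f1, q2_f2, q2_f3

# Hypothetical observed and predicted responses
y_train = [10.0, 11.0, 12.0, 13.0]
y_ext = [10.5, 13.0]
y_ext_pred = [10.7, 12.7]
q2_f1, q2_f2, q2_f3 = external_q2(y_train, y_ext, y_ext_pred)
```

Because the external-set sum of squares is smallest when taken around the external mean, Q²F2 can never exceed Q²F1 for the same predictions, which is one reason the parameters are not always concordant.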

An additional parameter, $r_m^2$,17 which penalizes a model for large differences between observed and predicted values of the prediction set compounds and is independent of the means of the training and prediction sets, was also calculated to assess model external predictivity. The expression of $r_m^2$ is defined as:

$$r_m^2 = r^2\left(1 - \sqrt{r^2 - r_0^2}\right) \qquad \text{(iv)}$$

where $r^2$ and $r_0^2$ are the determination coefficients of the linear relations between the observed and predicted values of the prediction set compounds with and without intercept, respectively.
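A possible implementation of this parameter is sketched below. Two details are assumptions rather than statements from the text: the zero-intercept line regresses the observed values on the predicted ones, and the absolute value of the difference is taken under the square root to keep it real. The data are hypothetical.

```python
import math

def r2m(observed, predicted):
    """r2m = r2 * (1 - sqrt(|r2 - r2_0|)) for a prediction set."""
    n = len(observed)
    mo, mp = sum(observed) / n, sum(predicted) / n
    sop = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    soo = sum((o - mo) ** 2 for o in observed)
    spp = sum((p - mp) ** 2 for p in predicted)
    r2 = sop ** 2 / (soo * spp)                      # fit with intercept
    # Fit without intercept: observed = k * predicted (assumed axis convention)
    k = sum(o * p for o, p in zip(observed, predicted)) / sum(p * p for p in predicted)
    r2_0 = 1 - sum((o - k * p) ** 2 for o, p in zip(observed, predicted)) / soo
    return r2 * (1 - math.sqrt(abs(r2 - r2_0)))

# Hypothetical prediction set
obs = [10.0, 11.0, 12.0, 13.0]
perfect = r2m(obs, obs)
shifted = r2m(obs, [o + 0.5 for o in obs])  # systematic offset of 0.5 log units
```

The second case shows the intended penalty: a constant offset leaves r² at 1 but lowers r²m, because the line through the origin no longer fits as well.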

Finally, an additional measure of the accuracy of the proposed QSARs is the Root Mean Squared Error (RMSE), which summarizes the overall error of the model. It is calculated as the square root of the mean squared error, i.e., the sum of squared errors in prediction divided by their total number. This parameter was used to compare the accuracy and the stability of our models in the training (RMSET) and in the prediction (RMSEP) sets.
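As a small illustration, the RMSE can be computed for the training and prediction sets separately and the two values compared; similar values suggest a stable model. The function name and the numbers are hypothetical.

```python
import math

def rmse(observed, predicted):
    """Square root of the mean squared difference between observed and predicted values."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical observed and predicted responses
rmse_training = rmse([10.0, 11.0, 12.0, 13.0], [10.3, 10.8, 12.4, 12.9])
rmse_prediction = rmse([10.5, 13.0], [10.7, 12.7])
```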

Applicability Domain

In this study, the AD was defined by the leverage approach29

(for the structural domain), and by the identification of response

outliers (compounds with cross-validated standardized residuals

greater than 2.5 standard deviation units).

Graphically, the plot of hat values (h) versus standardized residuals, i.e., the Williams graph, verified the presence of response outliers and of training set chemicals that are structurally very influential in determining model parameters (compounds with a leverage value (h) greater than 3p′/n (h*), where p′ is the number of model variables plus one, and n is the number of objects used to calculate the model). The data predicted for high-leverage chemicals in the prediction set are extrapolated and could be less reliable.
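The structural part of this check can be sketched with NumPy as follows. The leverage of a query compound is computed from the training descriptor matrix (with an intercept column), and the warning threshold h* = 3p′/n uses the number of model variables plus one. The function name and the one-descriptor example are hypothetical.

```python
import numpy as np

def leverages(X_train, X_query):
    """Return (hat values of the query compounds, warning leverage h*)."""
    A = np.column_stack([np.ones(len(X_train)), np.asarray(X_train, dtype=float)])
    Q = np.column_stack([np.ones(len(X_query)), np.asarray(X_query, dtype=float)])
    core = np.linalg.pinv(A.T @ A)
    # h_i = q_i (A'A)^-1 q_i' for each query row q_i
    h = np.einsum("ij,jk,ik->i", Q, core, Q)
    h_star = 3.0 * A.shape[1] / A.shape[0]   # 3 p'/n, with p' = variables + 1
    return h, h_star

# Hypothetical training set: 20 compounds, one descriptor with values 0..19
X_train = np.arange(20, dtype=float).reshape(-1, 1)
# Two query compounds: one inside the training range, one far outside it
X_query = np.array([[10.0], [100.0]])
h_query, h_star = leverages(X_train, X_query)
outside_domain = h_query > h_star
```

The second query compound lies well beyond the descriptor range of the training set, so its prediction would be an extrapolation.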

Results and Discussion

The studied data set of the kinetic constant for degradation by

OH (kOH) of VOCs has been modeled in the past by some

authors,10,30–34 including our group,10,31 with similar performan-

ces, and it is also included in the training set of the AOPWIN

package in the widely used software EPI Suite.35 We have rede-

veloped QSAR models on this data set to apply them to

CADASTER chemicals, and to have an updated reproducible

model, rigorously validated for its external predictivity and

applicability domain, for possible application in REACH. A

wide range of theoretical molecular descriptors (zero-, mono-,

and bi-dimensional) were here used as input descriptors (some

calculated from new versions of DRAGON software11 and some

freely-available online13) to find the statistical correlation

with the studied response, based on updated and reproducible



descriptors. Additionally, quantum-chemical descriptors, such as HOMO and LUMO energies and the HOMO-LUMO gap, were used in the input pool of variables, as previous works10,31,33,34 had already shown them to have a pivotal role in reactivity modeling. The Genetic Algorithm, as Variable Subset Selection (GA-VSS), was applied to select only the best combination of descriptors from both pools, affording models with the highest internal predictive power (verified by cross-validation). Since the main utility of QSAR models, mainly for virtual screening, is their ability to make accurate predictions for new query compounds, never used in model development but within their applicability domain, we treated a part of the experimentally available data as unknown and put it in the prediction set, which was not used for model development but only later, to check the predictive power of our models developed on the reduced training set. Three different splitting procedures were adopted, two based on structural similarity analysis (K-ANN, K-means) and one random by sorting the response, to propose models with a demonstrated high performance in predicting external chemicals of different typology, avoiding the bias derived from a unique split. The selection of modeling variables by GA was performed by Multiple Linear Regression separately in the three different training sets, obtaining three parallel populations of good models with similar internal predictivity (Q² > 0.7), which were then verified for performance on the corresponding external prediction sets. Those models based on the same combination of descriptors, selected independently in the three splittings and demonstrating high predictivity on the respective prediction set chemicals, were chosen as the best for external predictive performance. In fact, similarly good performance in the prediction of ‘‘supposed unknown’’ chemicals in each splitting demonstrates the validity of that particular combination of structural information in predicting the studied response, regardless of the composition of the training sets (thus unbiased by structure and response value).

For external predictivity check, because of the recent increase

of various statistical parameters, proposed and preferred by vari-

ous authors14–17 and because we have verified that they are not

always concordant,18 we have applied all the parameters

reported in Table 1, those already published14–17 and one that

we recently adopted in our lab.18 Finally, the set of combined

descriptors, which had been demonstrated as useful for the pre-

diction of chemicals not used in model development, was

applied to derive a full model from the complete data set, in

order not to lose any available information36 (Scheme of the

procedure in Fig. 1).

Model Based on DRAGON Descriptors

The chosen predictive models selected from a population of 100

different models were based on the same 4 variables in the three

split training sets (by K-ANN, by K-means algorithm and by

random on response). They are listed in Table 1 with their statis-

tical parameters.

It is evident that all models perform similarly in their ability to predict external chemicals, independently of the splitting. Additionally, similar values of RMSE in both training and prediction sets are a guarantee of model generalizability. The difference between the split models lies in the regression coefficients, which depend on the training set composition. By Principal Component Analysis of the compounds, represented by the selected modeling descriptors, it is possible to verify that in all three splittings the distribution between the training and prediction sets is balanced and representative of the chemical domain. The PCA score plots of the first three components of the selected descriptor matrix (Supporting Information Fig. S1) show the distribution of training and prediction set compounds in 3D space: each prediction set member is close to at least one training set member in the multidimensional space.

Table 1. Comparative Statistical Performances of Different Developed Models.

| Descriptors | Splitting method (no. of chemicals) | R² | Q²LOO | Q²F1 (ref. 14) | Q²F2 (ref. 15) | Q²F3 (ref. 16) | r²m (ref. 17) | Conc. coeff. (ref. 18) | RMSE (prediction) | R²YS |
|---|---|---|---|---|---|---|---|---|---|---|
| DRAGON descriptors: HOMO, nX, IDE, nCbH | K-ANN (191ᵃ/269ᵇ) | 0.867 | 0.856 | 0.797 | 0.794 | 0.766 | 0.77 | 0.89 | 0.47 | 0.021 |
| | Random (230ᵃ/230ᵇ) | 0.826 | 0.817 | 0.819 | 0.819 | 0.810 | 0.80 | 0.90 | 0.44 | 0.018 |
| | K-means (230ᵃ/230ᵇ) | 0.836 | 0.827 | 0.804 | 0.802 | 0.836 | 0.75 | 0.90 | 0.43 | 0.017 |
| | Full model (460) | 0.824 | 0.819 | | | | | 0.90¹ | 0.43¹ | 0.009 |
| Online descriptors: HOMO, SeaC2C2aa, G_([Cl,Br,I]), D_PathSum(F,rel) | K-ANN (191ᵃ/269ᵇ) | 0.847 | 0.834 | 0.778 | 0.775 | 0.745 | 0.76 | 0.88 | 0.49 | 0.020 |
| | Random (230ᵃ/230ᵇ) | 0.814 | 0.803 | 0.796 | 0.795 | 0.786 | 0.76 | 0.89 | 0.47 | 0.017 |
| | K-means (230ᵃ/230ᵇ) | 0.813 | 0.803 | 0.795 | 0.793 | 0.829 | 0.75 | 0.89 | 0.44 | 0.018 |
| | Full model (460) | 0.806 | 0.801 | | | | | 0.89¹ | 0.45¹ | 0.008 |

ᵃ Training compounds; ᵇ prediction compounds; ¹ results for all the chemicals.

Figure 1. Scheme of the QSAR procedure for model development and external validation.



All the plots of experimental vs. predicted values in the three

splittings, as well as the corresponding Williams plots for analy-

sis of the applicability domain (AD) are in Supporting Informa-

tion (Figs. S2 and S3), here we report the equation and the

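graphs for the full model [eq. (i); Fig. 2]:

$$-\log(\mathrm{OH}) = 4.07(\pm 0.48) - 0.72(\pm 0.04)\,\mathrm{HOMO} + 0.37(\pm 0.04)\,\mathrm{nX} + 0.16(\pm 0.02)\,\mathrm{nCbH} - 0.34(\pm 0.07)\,\mathrm{IDE} \qquad \text{eq. (i)}$$

$n = 460$; $R^2 = 0.824$; $Q^2_{LOO} = 0.819$; $Q^2_{BOOT} = 0.817$; $RMSE_{tr} = 0.43$; $RMSE_{CV} = 0.43$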
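Applying the full DRAGON-based model to a new compound is a direct evaluation of the regression equation, sketched below with the coefficients reported for eq. (i): 4.07 (intercept), −0.72 (HOMO), +0.37 (nX), +0.16 (nCbH) and −0.34 (IDE). The descriptor values are hypothetical, HOMO is assumed to be expressed in eV, and the leverage of the compound would still have to be checked before the prediction is trusted.

```python
def predict_neg_log_koh(homo, n_x, n_cbh, ide):
    """Evaluate the full-model regression, eq. (i), for one compound."""
    return 4.07 - 0.72 * homo + 0.37 * n_x + 0.16 * n_cbh - 0.34 * ide

# Hypothetical descriptor values for an unsubstituted aromatic compound
# (HOMO assumed to be in eV)
aromatic = predict_neg_log_koh(homo=-9.5, n_x=0, n_cbh=5, ide=2.0)

# Same hypothetical compound with two halogen atoms replacing two ring hydrogens
halogenated = predict_neg_log_koh(homo=-9.5, n_x=2, n_cbh=3, ide=2.0)
```

With these inputs the halogenated variant receives the higher response value, i.e., a lower predicted reactivity toward OH, in line with the positive coefficient of nX.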

It is important to note that the applied variable selection procedure, GA, was able to select, from a wider and slightly different set of descriptors calculated with updated versions of DRAGON, four descriptors for the current model (HOMO, nX, nCbH, IDE) that are either identical to, or carry almost the same information as, those in the previously published model10 (HOMO, nX, CIC0, nCaH), and was able to confirm their respective negative or positive influence on the studied response. Highest occupied molecular orbital (HOMO) energy, already a well-recognized molecular property for OH modeling, was again found to be the best descriptor in all the models, negatively correlated to the response (here standardized regr. coefficient = −0.755). This descriptor characterizes the susceptibility of a molecule toward attack by the electrophilic OH radical, more reactive chemicals having higher HOMO energy. Further, nX (standardized regr. coefficient = 0.356) is the number of halogen atoms. Molecules with more halogen atoms tend to be less reactive (higher −log kOH values). No longer present in the updated versions of the DRAGON software is nCaH, which was the number of unsubstituted sp2-carbons in any ring, mainly aromatics. The new descriptor, selected here as an alternative to nCaH, is nCbH (the number of unsubstituted sp2-carbons only in benzene-type rings; standardized regr. coefficient = 0.324). These descriptors, which are negatively correlated to the response in univariate models, are both able to condense information on possible reactive sites in aromatic rings. Chemicals with a higher number of hydrogen atoms can be attacked more readily by the hydroxyl radical and are, for this reason, more reactive. Less important are the topological descriptors, CIC0 in the old version and IDE in the new version (standardized regr. coefficient = −0.201). They are information content indices carrying similar structural information, and are interchangeable without significant loss of model quality. Thus it can be stated that the Genetic Algorithm reliably extracted the structural information included in the above combination of descriptors, which was obtained from different training set inputs for model development and demonstrated its ability also in external prediction.

Model Based on Online Descriptors

We also developed QSAR models based on freely available online 2D descriptors,13 so that the models can be applied
without commercial software for descriptor calculation. The best predictive models were found in populations
of 100 models with 4 variables, obtained separately with the K-ANN, random and K-means splitting algorithms; they are listed in Table 1
with their statistical parameters. The final stable combination of descriptors, present in all the model populations obtained
from the three different training set inputs and with maximum predictive performance on the prediction set compounds, was:
HOMO, SeaC2C2aa, D_path(F, rel), G_([Cl, Br, I]). Finally, a full model of significant statistical quality [eq. (ii)] was developed
on the basis of the above-mentioned descriptors (Fig. 3):

Figure 2. (a) Plot of experimental vs. calculated values for the full model based on DRAGON

descriptors; (b) Williams plot for the AD of the DRAGON full model.



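$$-\log(\mathrm{OH}) = 3.83(\pm 0.48) - 0.69(\pm 0.05)\,\mathrm{HOMO} + 1.26(\pm 0.17)\,\text{D\_PathSum(F, rel)} + 0.43(\pm 0.07)\,\text{G\_([Cl, Br, I])} + 0.06(\pm 0.01)\,\mathrm{SeaC2C2aa} \qquad \text{eq. (ii)}$$

$$n = 460;\; R^2 = 0.806;\; Q^2_{\mathrm{LOO}} = 0.801;\; Q^2_{\mathrm{BOOT}} = 0.797;\; \mathrm{RMSE}_{\mathrm{tr}} = 0.45;\; \mathrm{RMSE}_{\mathrm{CV}} = 0.45$$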

Also from this completely different pool of input descriptors, GA selected HOMO as the most relevant (Std coeff. = −0.718). The
E-state index37 SeaC2C2aa (Std coeff. = 0.266) is the sum of the bond electrotopological values of carbon–carbon aromatic bonds in
which the carbons are not substituted. This descriptor, which is inversely correlated with the modeled response in the univariate
model, gives similar information to the DRAGON descriptor nCbH. The remaining two descriptors, D_pathSum(F, rel) (Std coeff. = 0.319)
and G_([Cl, Br, I]) (Std coeff. = 0.265), both positively correlated to the response like nX, are AMBIT descriptors38 and are
counts of the number of halogen atoms in the molecules. Thus the above descriptors together give the same information as that
obtained from nX in the current and previous DRAGON descriptor models. Interestingly, it can again be stated that GA identified the
useful variables for the modeling of hydroxyl radical rate constants irrespective of the different input descriptors.
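To make the use of eq. (ii) concrete, the sketch below evaluates it for a single chemical. The coefficients are those reported in eq. (ii); the function name `predict_neg_log_koh` and the descriptor values in `example` are hypothetical and were chosen only to show the arithmetic, not taken from the data set.

```python
# Coefficients of eq. (ii) as reported in the text
EQ_II = {
    "intercept": 3.83,
    "HOMO": -0.69,
    "D_PathSum(F, rel)": 1.26,
    "G_([Cl, Br, I])": 0.43,
    "SeaC2C2aa": 0.06,
}

def predict_neg_log_koh(descriptors):
    """Evaluate eq. (ii) for one chemical given a dict of descriptor values."""
    value = EQ_II["intercept"]
    for name, coef in EQ_II.items():
        if name != "intercept":
            value += coef * descriptors[name]
    return value

# Hypothetical descriptor values, for illustration only
example = {
    "HOMO": -10.0,
    "D_PathSum(F, rel)": 0.0,
    "G_([Cl, Br, I])": 2.0,
    "SeaC2C2aa": 0.0,
}
pred = predict_neg_log_koh(example)  # 3.83 + 6.90 + 0.86
```

Because the HOMO coefficient is negative, raising the HOMO energy lowers the predicted −log(OH) value, whereas each additional halogen counted by G_([Cl, Br, I]) raises it, in line with the interpretation given above.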

Applicability Domain

QSAR models are developed on a defined domain of compounds

based on properties and structures of training set compounds.

Therefore, predictions for new chemicals outside this chemical domain are extrapolations and are more likely to be
poor. Thus, there is the need for a quantitative measure of the applicability domain (AD) to identify problematic chemicals5 in
the modeled data set, highlighting chemicals that could be outliers for the response (not well predicted) or for their peculiar
structure (influential or high-leverage outliers). An interesting extension of the applicability domain study, particularly for
"predictive" QSAR models, is to check whether new chemicals without experimental data belong to the training chemical space,
in order to verify whether the predicted data are interpolations or extrapolations of the proposed model.

The outlier compounds in the training and prediction sets of the different splittings (Supporting Information Figs. S2 and S3)
differ somewhat, owing to the different combinations of compounds and modeling descriptors. However, on analyzing the
applicability domain for the above models, and also for the full models (Figs. 2b and 3b), some common compounds were
found to be outliers or influential in all the models:

i. Triethyl phosphate (61) and 2-(chloromethyl)-3-chloro-1-pro-

pene (403) are two response outliers that were predicted as

less reactive by all the models;

ii. Bromomethane (18), dimethyl sulfide (37), diethyl sulfide
(263), ethyl methyl sulfide (353) and 3-methyl-1,2-butadiene
(342) are response outliers that were predicted as more
reactive by all the models, raising some doubts about
the quality of the experimental data for these compounds,
for which new experimental measurements are suggested;

iii. Fluorinated chemicals: 1,1,2,2-tetrachloroethene (232), 1,1-

dichloro-2,2,2-trifluoroethane (262), 1,1,1,2,2-pentafluoro-

ethane (265), hexafluorobenzene (267), 1-chloro-1,2,2,2-tet-

rafluoroethane (414) and propylpentafluorobenzene (457), are

highly structurally influential compounds in all the models.

This was already found in our previous study.10
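The two kinds of problematic chemicals listed above correspond to the two axes of a Williams plot: standardized residuals (response outliers) and leverages (structurally influential chemicals). The sketch below shows one common way to flag them for an MLR model. The cut-offs used, 3 standard deviation units and h* = 3(p + 1)/n, are conventional choices assumed here because this section does not state the values applied; the function name `williams_flags` is hypothetical and the data are synthetic.

```python
import numpy as np

def williams_flags(X, y, resid_cutoff=3.0):
    """Flag response outliers and high-leverage chemicals for an MLR model."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])          # design matrix with intercept
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)  # ordinary least squares
    resid = y - A @ beta
    # Leverages are the diagonal of the hat matrix A (A'A)^-1 A'
    hat = np.einsum("ij,jk,ik->i", A, np.linalg.inv(A.T @ A), A)
    s = np.sqrt(resid @ resid / (n - p - 1))      # residual standard error
    std_resid = resid / s
    h_star = 3.0 * (p + 1) / n                    # assumed warning leverage
    return np.abs(std_resid) > resid_cutoff, hat > h_star, h_star

# Synthetic data set with one unusual structure and one poorly fitted response
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))
X[0] = [8.0, 8.0]                                 # far from the other chemicals
y = 1.0 + X @ np.array([0.5, -0.3]) + rng.normal(0.0, 0.1, 50)
y[1] += 2.0                                       # response inconsistent with the model
resp_out, high_lev, h_star = williams_flags(X, y)
```

A chemical can be flagged on one axis only: a high-leverage compound may still be well predicted, and a response outlier may lie well inside the descriptor space.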

Figure 3. (a) Plot of experimental vs. calculated values for the full model based on Online descriptors;

(b) Williams plot for the AD of the online full model.



Application to CADASTER Chemicals: PBDEs and (B)TAZs

Two classes of CADASTER chemicals, namely polybrominated diphenyl ethers (PBDEs) and (benzo)triazoles (BTAZs),
were used to verify the applicability of our models to the prediction of chemicals without experimental data. At the same
time, we checked whether the new chemicals lie within the structural AD of our models by comparing their
leverage (hat value) with the h* cut-off value. Figure 4 reports two plots of predicted values vs. hat values,
one for each set (PBDEs and BTAZs), for the DRAGON model. It is evident from these plots that all the PBDEs are
outside the applicability domain of our model, whereas almost 75% of the BTAZs are within it.
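The leverage check for chemicals without experimental data needs only their descriptor values and the training descriptor matrix. The sketch below illustrates the idea with a single, hypothetical halogen-count descriptor; the function name `leverage_of_new`, the toy training set and the cut-off h* = 3(p + 1)/n are assumptions made for illustration, not values taken from this study.

```python
import numpy as np

def leverage_of_new(X_train, X_new):
    """Leverage of new chemicals with respect to the training descriptor space."""
    X_train = np.asarray(X_train, dtype=float)
    X_new = np.atleast_2d(np.asarray(X_new, dtype=float))
    n, p = X_train.shape
    A = np.column_stack([np.ones(n), X_train])        # training design matrix
    B = np.column_stack([np.ones(len(X_new)), X_new])
    core = np.linalg.inv(A.T @ A)
    h_new = np.einsum("ij,jk,ik->i", B, core, B)      # b_i' (A'A)^-1 b_i
    h_star = 3.0 * (p + 1) / n                        # assumed cut-off value
    return h_new, h_star

# Toy training set: halogen counts between 0 and 4 (n = 10, p = 1)
X_train = np.array([[0], [1], [1], [2], [2], [3], [3], [4], [0], [2]])
# Two new chemicals: one typical, one with far more halogens than any training chemical
X_new = np.array([[2], [10]])
h_new, h_star = leverage_of_new(X_train, X_new)
inside_ad = h_new <= h_star
```

In this toy case the chemical with ten halogen atoms has a hat value well above h*, so its prediction would be an extrapolation, which mirrors the behaviour described for the highly brominated PBDEs.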

From the PBDE plot (Fig. 4a) it can be seen that chemicals with an increasing number of bromine atoms tend
to lie farther from the domain and were extrapolated as less reactive than those with fewer Br atoms, which were
extrapolated as more reactive.

On evaluating the domain of applicability for BTAZs (Fig. 4b), we did not observe any significant trend. We verified that
chemicals within the applicability domain, interpolated as highly reactive, have a thio linkage in their structures,
whereas chemicals far from the AD, extrapolated as less reactive, have more fluorine atoms or a metal atom
in their chemical structure.

In addition, we obtained predicted data for the same chemicals by applying the widely used online package EPI Suite.35
Supporting Information Table S-III shows that the predictions of our models and of EPI Suite for PBDEs differ by no more than
0.8 log units (91% within 0.5 log units); indeed, a good correlation (94%) between the two sets
of predicted values is observed. The dominant trend in both modeling approaches is determined mainly by the number of
bromines. Thus, we can conclude that our model and EPI Suite give similar predicted data, but our AD check shows that
all these data are extrapolated and, for this reason, could be unreliable; similar information on reliability with respect
to the AD is not available in EPI Suite.
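The agreement statistics quoted above (largest difference, share of chemicals within 0.5 log units, correlation) can be computed for any two sets of predictions as sketched below. The function name `compare_predictions` is hypothetical and the numbers are invented for illustration; they are not taken from Table S-III. The Pearson coefficient is used here as one possible reading of "correlation".

```python
import numpy as np

def compare_predictions(pred_a, pred_b, tol=0.5):
    """Summarize the agreement between two sets of predicted values."""
    a = np.asarray(pred_a, dtype=float)
    b = np.asarray(pred_b, dtype=float)
    diff = np.abs(a - b)
    return {
        "max_abs_diff": float(diff.max()),
        "fraction_within_tol": float(np.mean(diff <= tol)),
        "pearson_r": float(np.corrcoef(a, b)[0, 1]),
    }

# Invented values for illustration only
ours = [11.2, 11.5, 11.9, 12.3, 12.6]
other = [11.0, 11.6, 12.2, 12.9, 12.7]
summary = compare_predictions(ours, other)
```

Good agreement between two models says nothing about reliability when both are extrapolating, which is why such a comparison is best read together with the leverage check.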

Larger prediction differences were observed for (benzo)tria-

zoles between our models and EPI Suite (Supporting Informa-

tion Table S-IV).

It is interesting to note that the majority of chemicals within the applicability domain were predicted by our model
to be more reactive than by EPI Suite. On the contrary, most of the compounds outside the applicability domain were
predicted by our model to be less reactive than by EPI Suite.

The information on AD for completely new chemicals is an advantageous aspect of our approach in comparison to EPI
Suite: predicted values can always be obtained from QSAR models, but it is crucial to know whether they are
interpolated or extrapolated. Also important to note is that PBDEs

and BTAZs are structurally quite different from the volatile

organic compounds present in the training sets of our models

and EPI Suite.

Our models were not considered to be reliably applicable to perfluorinated chemicals (PFCs), another CADASTER class, as
the fluorinated compounds present in the original data set were always structurally influential. Our models tended to predict
PFCs, which were all outside the model AD, as much less reactive than EPI Suite did. We found large discrepancies
between our predicted values and those obtained by EPI Suite, with differences higher than 1 log unit for 84% of the
checked compounds.

Figure 4. (a) Plot of predicted values vs. hat values for PBDEs; (b) plot of predicted values vs. hat values for BTAZs.

Comparison with Published Models

The statistical qualities of the different published models and current models are listed in Table 2. Comparative comments can
be made, although a perfect comparison of the published models is not possible, as different data sets and different

algorithms were used for model building and validation. It is

interesting to note that the descriptors selected in different

models, mainly in those obtained from training sets similar in

dimension and typology, have comparable structural and mecha-

nistic meaning. Also the statistical quality of all these models is

similar and satisfactory.

Bakken and Jurs30 used their nonlinear computational neural network (CNN) model to provide accurate predictions over a wide
range of functionalities. Neural networks are a more complex method but generally give better statistics. The distinctive
feature of the Oberg model32 is its application to the screening of a large set of chemicals with half-lives ranging from days
to years, also considering the percentage of compounds inside or outside the applicability domain of the model. Recently,
Wang et al.33 developed statistically validated models for the rate constants of degradation by OH of phenols, alkenes and
alcohols, with the applicability domain limited to the chemical domain of the model. These authors also developed global
PLS models34 with an extended applicability domain, but no comment was made on influential chemicals with high leverage values.

Our models, developed on a large data set like those of Oberg32 and Wang et al.,33 are in full accordance with three central
OECD Principles: (i) Principle 2: simple and now easily reproducible unambiguous algorithms [eqs. (i) and (ii)], obtained by
the simplest MLR method based on only four easily interpretable molecular descriptors; (ii) Principle 3: the possibility to
verify the AD, not only for the split training and prediction sets, but also for new chemicals without experimental data;
(iii) Principle 4: rigorous external validation by different splittings and the application of different statistical parameters.
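External validation statistics such as those compared in Table 2 can be computed from the experimental and predicted values of a prediction set. The sketch below uses one of several definitions of external Q2 found in the literature, with the training-set mean as reference; this choice, the function name `external_validation` and the numbers are assumptions made for illustration, not the exact procedure of this study.

```python
import numpy as np

def external_validation(y_train, y_ext, y_ext_pred):
    """Return an external Q2 (training-mean reference) and the external RMSE."""
    y_train = np.asarray(y_train, dtype=float)
    y_ext = np.asarray(y_ext, dtype=float)
    y_ext_pred = np.asarray(y_ext_pred, dtype=float)
    press = np.sum((y_ext - y_ext_pred) ** 2)        # prediction error sum of squares
    ss_ext = np.sum((y_ext - y_train.mean()) ** 2)   # deviation from the training mean
    return {
        "Q2_ext": float(1.0 - press / ss_ext),
        "RMSE_ext": float(np.sqrt(press / len(y_ext))),
    }

# Invented values for illustration only
stats = external_validation(
    y_train=[10.0, 11.0, 12.0, 13.0],
    y_ext=[10.5, 12.5],
    y_ext_pred=[10.7, 12.2],
)
```

Repeating the calculation for each splitting (random by response, K-ANN, K-means) shows whether the external performance depends on the composition of the prediction set.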

Conclusions

The need for regular checking and updating of published QSAR
models is again demonstrated, if these models are to be useful
for practical applications and not just for scientific purposes.

Indeed, QSAR models must be reproducible, and must be practi-

cally applicable to new chemicals that have no experimental

data, in this case CADASTER classes.

The newly developed models, both from the more recent

DRAGON versions and the online descriptors plus HOMO

energy, were found to be statistically valid both internally and

especially externally, considering different composition of the

external prediction sets, obtained by applying different splitting

methods for leaving out some chemicals (those of prediction

sets) from the model development procedure. The present work

also confirmed the ability of Genetic Algorithms to extract, and

not by chance, important information related to the studied

response, from different pools of input descriptors. The relevant

information included in the selected descriptors has interpretable

mechanistic meaning.

Furthermore, our study placed special emphasis on the applicability domain of the models, identifying not only
response outliers or structurally influential chemicals in the original set, but also verifying which of the CADASTER chemicals,
for which no experimental reactivity values are available, are within or outside the AD of our models. We compared our
predictions with those of the widely used software EPI Suite, and found some (PBDEs) to be in good agreement, whereas others
(BTAZs) had limited comparability. One of the advantages of our model is that a chemical's position inside or outside the
model AD is known, which is not the case for the EPI Suite software. Such AD information is highly important to
users of QSAR predictions as it facilitates their decision-making.

Table 2. Comparison of the Present Models with Previously Published QSARs.

| Reference | Modeling technique | No. of descriptors/PLS components | No. of compounds | Descriptors | Q2LOO | RMSEtr | Q2EXT | RMSEExt |
|---|---|---|---|---|---|---|---|---|
| 30 | CNN | 5 | 52/5 (a) | Topological | | 0.071 | | 0.064 |
| | CNN | 10 | 281/31 (a) | Topological, electronic | | 0.230 | | 0.250 |
| 10 | MLR | 4 | 234/226 (a) | HOMO, nX, nCaH, CIC0 | 0.816 | 0.422 | 0.813 | 0.436 |
| 31 | MLR | 6 | 460 | HOMO, MATS1m, nDB, nO, CIC2, RTe+ | 0.841 | 0.407 | | |
| 32 | PLS | 333/7 | 495/238 (a) | – | 0.875 | 0.449 | 0.840 | 0.501 |
| 33 | MLR | 4 | 44/11 (a) | HOMO, QH, MSA and μ | 0.806 | 0.139 | 0.922 | 0.079 |
| 34 | PLS | 22/3 | 576/146 (a) | (Ds, HOMO, nX, BELm2) (b) | 0.865 | 0.391 | 0.872 | 0.430 |
| This study (external validation in Table 1) | MLR | 4 | 460 | HOMO, nX, nCbH, IDE | 0.819 | 0.430 | | |
| This study (external validation in Table 1) | MLR | 4 | 460 | HOMO, SeaC2C2aa, D_path(F, rel), G_([Cl, Br, I]) | 0.801 | 0.450 | | |

(a) Number of external set compounds. (b) Influential descriptors in PLS latent variables.

Acknowledgments

We wish to thank Ester Papa, Nicola Chirico and Stefano Cassani

for their support to P.P. Roy. We thank the University of Insubria

for providing a post-doc fellowship to Dr. P.P. Roy.

References

1. Zefirov, N. S.; Palyulin, V. A. J Chem Inf Comput Sci 2001, 41,

1022.

2. http://ec.europa.eu/environment/chemicals/reach/reach_intro.htm (accessed

27 January 2011).

3. http://www.oecd.org/dataoecd/33/37/37849783.pdf (accessed 27 Jan-

uary 2011)

4. http://www.oecd.org/officialdocuments/displaydocumentpdf (accessed

27 January 2011).

5. Netzeva, T. I.; Worth, A. P.; Aldenberg, T.; Benigni, R.; Cronin, M.

T. D.; Gramatica, P.; Jaworska, J. S.; Kahn, S.; Klopman, G.; March-

ant, C. A.; Myatt, G.; Nikolova-Jeliazkova, N.; Patlewicz, G. Y.; Per-

kins, R.; Roberts, D. W.; Schultz, T. W.; Stanton, D. T.; van de

Sandt, J. J. M.; Tong, W.; Veith, G.; Yang, C. ATLA 2005, 33, 155.

6. Tropsha, A.; Gramatica, P.; Gombar, V. K. QSAR Comb Sci 2003,

22, 69.

7. Gramatica, P. QSAR Comb Sci 2007, 26, 694.

8. Tropsha, A. Mol Inf 2010, 29, 476.

9. http://www.cadaster.eu (accessed 27 January 2011).

10. Gramatica, P.; Pilutti, P.; Papa, E. J Chem Inf Comput Sci 2004, 44,

1794.

11. DRAGON for Windows, ver.5.5, 2007, Talete srl, Milano, Italy.

12. Gramatica, P.; Papa, E. QSAR Comb Sci 2005, 24, 953.

13. www.cadaster.eu/database (accessed 27 January 2011).

14. Shi, L. M.; Fang, H.; Tong, W.; Wu, J.; Perkins, R.; Blair, R. M.;

Branham, W. S.; Dial, S. L.; Moland, C. L.; Sheehan, D. M. J Chem

Inf Comput Sci 2001, 41, 186.

15. Schuurmann, G.; Ebert, R. U.; Chen, J.; Wang, B.; Kuhne, R.

J Chem Inf Model 2008, 48, 2140.

16. Consonni, V.; Ballabio, D.; Todeschini, R. J Chem Inf Model 2009,

49, 1669.

17. Roy, P. P.; Roy, K. QSAR Comb Sci 2008, 27, 302.

18. Chirico, N.; Papa, E.; Gramatica, P. Presented at the 21 SETAC

Europe Meeting, May 2011, Milan, Italy.

19. Atkinson, R. J Phys Chem Ref Data 1989, Monograph 1, 1.

20. HyperChem, Rel. 7.03 for Windows, 2002. Hypercube. Inc. Gaines-

ville, Florida, USA.

21. Roy, K.; Ghosh, G. Int Electron J Mol Des 2003, 2, 599.

22. Papa, E.; Kovarich, S.; Gramatica, P. QSAR Comb Sci 2009, 28, 790.

23. MOBYDIGS Professional for Windows Ver. 1.0 beta, 2004. Talete

srl, Milano, Italy.

24. Leardi, R.; Boggia, R.; Terrile, M. J Chemom 1992, 6, 267.

25. Mitra, I.; Saha, A.; Roy, K. Mol Simul 2010, 36, 1067.

26. Todeschini, R.; Consonni, V.; Maiocchi, A. Chemom Int Lab Syst

1999, 46, 13.

27. Gasteiger, J.; Zupan, J. Angew Chem Int Ed Engl 1993, 32, 503.

28. Leonard, J. T.; Roy, K. QSAR Comb Sci 2006, 25, 235.

29. Atkinson, A. C. Plots, Transformations and Regression; Clarendon

Press: Oxford, 1985.

30. Bakken, G.; Jurs, P. J Chem Inf Comput Sci 1999, 39, 1064.

31. Gramatica, P.; Pilutti, P.; Papa, E. Atmos Environ 2004, 38, 6167.

32. Oberg, T. Atmos Environ 2005, 39, 2189.

33. Wang, Y.; Chen, J.; Li, X.; Zhang, S.; Qiao, X. QSAR Comb Sci

2009, 28, 1309.

34. Wang, Y.; Chen, J.; Li, X.; Wang, B.; Cai, X.; Huang, L. Atmos En-

viron 2009, 43, 1131.

35. EPI Suite. http://www.epa.gov/oppt/exposure/pubs/EPI Suite.htm

(accessed 27 January 2011).

36. Bhhatarai, B.; Gramatica, P. Chem Res Toxicol 2010, 23, 528.

37. Hall, L. H.; Kier, L. B. J Chem Inf Comput Sci 2000, 40, 784.

38. http://ambit.sourceforge.net/intro.html (accessed 27 January 2011).


