QSAR modeling - Katedra fyzikální chemie UPOLfch.upol.cz/wp-content/uploads/2018/01/ADD_12_Polishchuk_QSAR.pdfOverall QSAR workflow Input data Bioassays Databases Preprocessing Feature

QSAR modeling

Pavel Polishchuk

Institute of Molecular and Translational MedicineFaculty of Medicine and Dentistry

Palacky University

[email protected]

Activity = F(structure)

M – mapping functionE – encoding function

Modeling of compounds properties

Activity = M(E(structure))

D1 D2 D3 D4 D5 D6 … DN

1 0 9 0 11 1 … 1

4 0 1 0 0 0 … 1

0 0 0 0 0 4 … 6

0 2 3 6 0 0 … 3

… … … … … … … …

4 0 0 0 1 2 … 1

QSAR modeling workflow

D1 D2 D3 D4 D5 D6 … DN

1 0 9 0 11 1 … 1

4 0 1 0 0 0 … 1

0 0 0 0 0 4 … 6

0 2 3 6 0 0 … 3

… … … … … … … …

4 0 0 0 1 2 … 1

Structure Descriptors (features) Model

Encoding(represent structure with

numerical features)

Mapping(machine learning)

Overall QSAR workflow

Input data

Bioassays

Databases

PreprocessingFeature

engineeringModel

trainingModel

validation

Classification

Regression

Clustering

Cross-validation

Bootstrap

Test set

Applicability Domain

Feature selection

Feature combination

Data normalization & curation

Feature extraction

Interpretation

j

j

ii

z

xxx '

4

1) a defined endpoint2) an unambiguous algorithm3) a defined domain of applicability4) appropriate measures of goodness-of–fit, robustness and predictivity5) a mechanistic interpretation, if possible

OECD principles for the validation, for regulatory purposes, of (Q)SAR models

Step 1. Data collection

Scientific literature and patentsDatabases (ChEMBL, PubChem, BindingDB, etc)

Traditionally modeled compounds should have the same mechanism of action, however using of complex non-linear machine learning method allows to model data sets with mixed or even unknown mechanism of action with reasonable accuracy.

Conditions may substantially influence the results of bioassays (change in temperature, activators, detectors, etc)

Units checking

Step 2. Data curation (normalization)



Removal of mixtures, inorganics, metalorganics, etc

Strip of salts, counterions, etc

Ionization, if necessary (at the particular pH level)

Chemotype normalization, resonance structure and tautomers

Duplicates removal

Manual checking

Step 3. Descriptors: classificationObject type:

molecular descriptors (single molecules)descriptors of molecular ensemble (mixtures, materials)reaction descriptors (reactions)

Descriptor origin:calculated from the structureempirical (Hammet constants, lipophilicity chemical shifts in NMR, etc)

Locality:local (atom charge)global (molecular weight, molecular volume, lipophilicity, etc)

Dimensionality:1D (number of methyl groups, molecular weight, etc)2D (topological indices, fragmental descriptors)3D (molecular volume, quantum chemical descriptors)4D (based on a set of conformers)

Calculation method:physico-chemical (lipophilicity, etc)topological (invariants of molecular graph, Randic index, Wiener index, etc)fragmental (fingerprints, etc)pharmacophorespatial (moment of inertia, etc)quantum-chemical (energy of HOMO/LUMO, etc)etc. R. Todeschini and V. Consonni Handbook of Molecular Descriptors, 2008

Atom-centric (augmented atoms) fingerprints

Generate substructures starting from each atom and considering all its neighbors up to the specified distance (radius or diameter).

Morgan fingerprints, Extended-connectivity fingerprints (ECFP). Functional-class fingerprints (FCFP) , etc.

radius=2 (diameter=4)

Rogers, D. & Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 50, 742-754 (2010)

Atom-centric (augmented atoms) fingerprints

Fingerprints

Each molecule has variable length set of substructures – variable length fingerprints

N-C-C C-C-O C-C=O C-C-C C:C:C C:C:N C:N:C

1 1 1 0 0 0 0

1 1 1 1 0 0 0

0 0 0 0 1 1 1

2-bond sequences

Hashed fingerprints

Have fixed length (usually 512, 1024 or 2048 bits)

hash code

N-C-C C-C-O C-C=O C-C-C C:C:C C:C:N

13823 9740 37278 28478 874 283764

0 0 0 1 0 0 1 1 0 1 0 0 1 0 1 0 0 0 1 1

pseudo-randomnumber generator

fixed-length bit string

Each substructure activates several bits (usually 4-5) to avoid collisions and produce bit string of enough density

Missing bits mean that certain substructures are not presented, Active bits mean that certain substructure may be present (but due to possible collisions one cannot be sure)

Folding of fingerprints and keys bit strings

To increase density of very sparse bit vectors

0 0 0 1 0 0 1 1 0 1 0 0 1 0 1 0 0 0 1 1

0 0 0 1 0 0 1 1 0 1

0 0 1 0 1 0 0 0 1 1

OR

n bit fingerprints

n/2 bits

n/2 bits

0 0 1 1 1 0 1 1 1 1 n/2 bit fingerprint

Usually optimal density is 0.2-0.3

Step 4. Feature processing

Feature transformations:linear and non-linear scaling

Feature combinations:

Feature selection:removal of correlated descriptors (important e.g. for linear regression models)removal of descriptors with missing valuesremoval constant and near constant featuresremoval of descriptors with low correlation with the target propertystep-wise feature selectionevolutionary feature selection (genetic algorithm, etc)

sd

xxz i

i

n

i

i xxn

nsd

1

2)(1

ixie

z

1

1range (0; 1)

jiij xxz

2

ii xz add quadratic term

Step 5. Model building

Unsupervisedclustering

Supervised

Regression Classification

Multiple linear regression (MLR)

Partial linear regression (PLS) Logistic regression

Gaussian Process (GP) Naïve Bayes (NB)

Decision trees (DT)

Support vector machine (SVM)

Neural nets (NN)

Random forest (RF)

k-Nearest neighbors (kNN)

1/C = 4.08π – 2.14π2 + 2.78σ + 3.38

π = logPX – logPH

σ - Hammet constant

plant growth inhibition activity of phenoxyacetic acids

Hansch equation

R is H or CH3;

X is Br, Cl, NO2 and

Y is NO2, NH2, NHC(=O)CH3

Inhibition activity of compounds against Staphylococcus aureus

Act = 75RH – 112RCH3 + 84XCl – 16XBr – 26XNO2 + 123YNH2 + 18YNHC(=O)CH3 – 218YNO2

Free-Wilson models

QSAR model building

18

Simulated data set of actives and inactives with two descriptors – MW and logP

logP

MW

Decision tree

IF logP >= 3.2

AND MW >= 255

THENcompound is

19

logP < 3.2YES NO

MW < 358

YES NO

MW < 255

YES NO

Decision tree

20

LogP

MW

Decision tree

21

Initial dataset

Bootstrapsample

Bootstrapsample

Bootstrapsample

CART tree1 CART tree2 CART tree3

Combined prediction

…

Random feature subspace in each node

Random Forest

Consensus (ensemble) modeling

Consensus (ensemble) modeling

Step 6. Validation

Test set (usually 20-25% of the work set)

1.2 1.3 1.7 2.0 2.2 2.8 3.1 3.2 3.2 3.6 4.7 5.7 5.8 6.4 7.2 8.1 9.0 9.1 9.2working set

random test set 1.2 1.3 1.7 2.0 2.2 2.8 3.1 3.2 3.2 3.6 4.7 5.7 5.8 6.4 7.2 8.1 9.0 9.1 9.2

stratified test set

1.2 1.3 1.7 2.0 2.2 2.8 3.1 3.2 3.2 3.6 4.7 5.7 5.8 6.4 7.2 8.1 9.0 9.1 9.2

Cross-validation

1.2 1.3 1.7 2.0 2.2 2.8 3.1 3.2 3.2 3.6 4.7 5.7 5.8 6.4 7.2 8.1 9.0 9.1 9.2working set

1.2 1.3 1.7 2.0 2.2 2.8 3.1 3.2 3.2 3.6 4.7 5.7 5.8 6.4 7.2 8.1 9.0 9.1 9.2

1.2 1.3 1.7 2.0 2.2 2.8 3.1 3.2 3.2 3.6 4.7 5.7 5.8 6.4 7.2 8.1 9.0 9.1 9.2

1.2 1.3 1.7 2.0 2.2 2.8 3.1 3.2 3.2 3.6 4.7 5.7 5.8 6.4 7.2 8.1 9.0 9.1 9.2

fold 1

fold 2

fold 3

predictions of different folds are combined to calculate the final predictive measure

Step 6. Measures of predictive ability of models

Classification

FPTN

TNySpecificit

FNTP

TPySensitivit

2

ySensitivitySpecificitaccuracy Balanced

baseline-1

baseline -accuracy Kappa

N

TNTPaccuracy

2N

FP)FN)(TP(TPFN)FP)(TN(TNbaseline

Confusion matrix

Predicted

positive class (1)

negative class (0)

observed

positive class (1)

truepositive

(TP)

false negative

(FN)

negative class (0)

false positive

(FP)

true negative

(TN)

))()()(( FNTNFPTNFNTPFPTP

FNFPTNTPMCC

[-1; 1]

[0; 1]

[0; 1]

[0; 1]

[0; 1]

[0; 1]

Step 6. Measures of predictive ability of models

Regression

i

2

obspredi,

i

2

obsi,predi,2

)y(y

)y(y

1Q

1N

)y(y

RMSE i

2

obsi,predi,

N

1i

obsi,predi, yyN

1MAE

Determination coefficient

Root mean squared error

Mean absolute error

Step 7. Applicability domain (AD)Extrapolation to very distant objects is dangerous

There is a need to define the domain where our model is reliable (models are not universal!)

Only compounds which are similar to the training set compounds should be included in applicability domain of the model. One should estimate similarity of new compounds (test set, etc) to the training set compounds.

Step 7. Applicability domain (AD) measures

Bounding box - based on descriptor range

Desc_1

Desc_2

AD• internal regions are usually empty, especially if the number of descriptors is big

• it doesn’t take into account descriptor correlation

Distance from training set compounds in descriptor space

Desc_1

Desc_2Distance from training set compounds in model space

Model_1

Model_2

Requires several models (e.g. consensus model, bootstrap models)

One-class SVM

Conformal predictions

Step 7. Applicability domain (AD) example

J. Chem. Inf. Model., Vol. 50, No. 12, 2010

Step 8. Interpretation of QSAR models

Why interpretation is important?

Found active/inactive patterns which can be used for optimization of compound properties

Retrieve trends of stricture-activity relationships which can be used for knowledge-base model validation

Regulatory purposes


Principles and issues

Model should be predictive

Interpretation is valid within the applicability domain of the model

Interpretation results are data set dependent


1/C = 4.08π – 2.14π2 + 2.78σ + 3.38

π = logPX – logPH

σ - Hammet constant

plant growth inhibition activity of phenoxyacetic acids

electronic factorsrate of penetration of membranes in the plant cell

Hansch equation

R is H or CH3;

X is Br, Cl, NO2 and

Y is NO2, NH2, NHC(=O)CH3

Inhibition activity of compounds against Staphylococcus aureus

Act = 75RH – 112RCH3 + 84XCl – 16XBr – 26XNO2 + 123YNH2 + 18YNHC(=O)CH3 – 218YNO2

Free-Wilson models


347 agonists of 5-HT1A receptor

Ar - substituted (hetero)arylsL - polymethylene chainR - various (poly)cyclic residues

O O OMe Cl Cl CF3

N

N

CH3

F

O2N

Cl

PLS 0.84 0.18 0.03 -0.04 -0.06 -0.09 -0.11 -0.66 -0.73 -0.94 -0.96RF 0.27 0.24 0.04 0.07 -0.02 0.11 0.04 -0.04 -0.55 -0.66 -0.66

Ar

L -(CH2)6- -(CH2)5- -(CH2)4- -(CH2)3- -(CH2)2- -CH2-

PLS 0.8 0.71 0.81 0.08 -0.04 0.06

RF 0.14 0.19 0.14 -0.01 -0.03 0.05

Kuz’min, V.E., et al. Molecular Informatics, 2011, 30, 593-603.



=–

A B C

Activitypred(A) Activitypred(B) Contribution(C)

f(A) = x f(B) = y W(C) = x – y

Polishchuk, P. G.; Kuz'min, V. E.; Artemenko, A. G.; Muratov, E. N. Universal Approach for Structural Interpretation of QSAR/QSPR Models. Molecular Informatics 2013, 32, 843-853

Universal approach



Polishchuk P, Tinkov O, Khristova T, Ognichenko L, Kosinskaya A, Varnek A, Kuz’min V. Journal of Chemical Information and Modeling 2016, 56, 1455-1469.

Acute oral toxicity on rats

?

?




Interpretation results of valid predictive models should converge independent of:

interpretation approach

descriptors

machine learning method

All models are interpretable but not all end-points


Overall QSAR workflow

Input data

Bioassays

Databases

PreprocessingFeature

engineeringModel

learningModel

validation

Classification

Regression

Clustering

Cross-validation

Bootstrap

Test set

Applicability Domain

Feature selection

Feature combination

Data normalization

Feature extraction

Interpretation

j

j

ii

z

xxx '


2δ

δ)f(xδ)f(xC

f(x)δ)f(xC

δ)f(xf(x)C

forward difference

backward difference

central difference

Partial derivatives is used to calculate contributions of single features.It is applicable to any model built on meaningful (interpretable) descriptors.May be calculated analytically or numerically.

Estimated contributions of separate substructural features can be mapped back on the structure to reveal contribution of fragments.

Marcou, G.; Horvath, D.; Solov'ev, V.; Arrault, A.; Vayer, P.; Varnek, A. Interpretability of SAR/QSAR Models of any Complexity by Atomic Contributions. Molecular Informatics 2012, 31, 639-642

Documents

QSAR modeling - Katedra fyzikální chemie UPOLfch.upol.cz/wp-content/uploads/2018/01/ADD_12_Polishchuk_QSAR.pdfOverall QSAR workflow Input data Bioassays Databases Preprocessing Feature