QSAR modeling - Katedra fyzikální chemie...

QSAR modeling

Pavel Polishchuk

Institute of Molecular and Translational MedicineFaculty of Medicine and Dentistry

Palacky University

pavlo.polishchuk@upol.czqsar4u.com

Activity = F(structure)

M – mapping functionE – encoding function

Modeling of compounds properties

Activity = M(E(structure))

D1 D2 D3 D4 D5 D6 … DN

1 0 9 0 11 1 … 1

4 0 1 0 0 0 … 1

0 0 0 0 0 4 … 6

0 2 3 6 0 0 … 3

… … … … … … … …

4 0 0 0 1 2 … 1

QSAR modeling workflow

D1 D2 D3 D4 D5 D6 … DN

1 0 9 0 11 1 … 1

4 0 1 0 0 0 … 1

0 0 0 0 0 4 … 6

0 2 3 6 0 0 … 3

… … … … … … … …

4 0 0 0 1 2 … 1

Structure Descriptors (features) Model

Encoding(represent structure with

numerical features)

Mapping(machine learning)

Overall QSAR workflow

Input data

Bioassays

Databases

PreprocessingFeature

engineeringModel

trainingModel

validation

Classification

Regression

Clustering

Cross-validation

Bootstrap

Test set

Applicability Domain

Feature selection

Feature combination

Data normalization & curation

Feature extraction

Interpretation

1) a defined endpoint2) an unambiguous algorithm3) a defined domain of applicability4) appropriate measures of goodness-of–fit, robustness and predictivity5) a mechanistic interpretation, if possible

OECD principles for the validation, for regulatory purposes, of (Q)SAR models

Step 1. Data collection

Scientific literature and patentsDatabases (ChEMBL, PubChem, BindingDB, etc)

Traditionally modeled compounds should have the same mechanism of action, however using of complex non-linear machine learning method allows to model data sets with mixed or even unknown mechanism of action with reasonable accuracy.

Conditions may substantially influence the results of bioassays (change in temperature, activators, detectors, etc)

Units checking

Step 2. Data curation (normalization)

Removal of mixtures, inorganics, metalorganics, etc

Strip of salts, counterions, etc

Ionization, if necessary (at the particular pH level)

Chemotype normalization, resonance structure and tautomers

Duplicates removal

Manual checking

Step 3. Descriptors: classificationObject type:

molecular descriptors (single molecules)descriptors of molecular ensemble (mixtures, materials)reaction descriptors (reactions)

Descriptor origin:calculated from the structureempirical (Hammet constants, lipophilicity chemical shifts in NMR, etc)

Locality:local (atom charge)global (molecular weight, molecular volume, lipophilicity, etc)

Dimensionality:1D (number of methyl groups, molecular weight, etc)2D (topological indices, fragmental descriptors)3D (molecular volume, quantum chemical descriptors)4D (based on a set of conformers)

Calculation method:physico-chemical (lipophilicity, etc)topological (invariants of molecular graph, Randic index, Wiener index, etc)fragmental (fingerprints, etc)pharmacophorespatial (moment of inertia, etc)quantum-chemical (energy of HOMO/LUMO, etc)etc. R. Todeschini and V. Consonni Handbook of Molecular Descriptors, 2008

Atom-centric (augmented atoms) fingerprints

Generate substructures starting from each atom and considering all its neighbors up to the specified distance (radius or diameter).

Morgan fingerprints, Extended-connectivity fingerprints (ECFP). Functional-class fingerprints (FCFP) , etc.

radius=2 (diameter=4)

Rogers, D. & Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 50, 742-754 (2010)

Atom-centric (augmented atoms) fingerprints

Fingerprints

Each molecule has variable length set of substructures – variable length fingerprints

N-C-C C-C-O C-C=O C-C-C C:C:C C:C:N C:N:C

1 1 1 0 0 0 0

1 1 1 1 0 0 0

0 0 0 0 1 1 1

2-bond sequences

Hashed fingerprints

Have fixed length (usually 512, 1024 or 2048 bits)

hash code

N-C-C C-C-O C-C=O C-C-C C:C:C C:C:N

13823 9740 37278 28478 874 283764

0 0 0 1 0 0 1 1 0 1 0 0 1 0 1 0 0 0 1 1

pseudo-randomnumber generator

fixed-length bit string

Each substructure activates several bits (usually 4-5) to avoid collisions and produce bit string of enough density

Missing bits mean that certain substructures are not presented, Active bits mean that certain substructure may be present (but due to possible collisions one cannot be sure)

Folding of fingerprints and keys bit strings

To increase density of very sparse bit vectors

0 0 0 1 0 0 1 1 0 1 0 0 1 0 1 0 0 0 1 1

0 0 0 1 0 0 1 1 0 1

0 0 1 0 1 0 0 0 1 1

n bit fingerprints

n/2 bits

0 0 1 1 1 0 1 1 1 1 n/2 bit fingerprint

Usually optimal density is 0.2-0.3

Step 4. Feature processing

Feature transformations:linear and non-linear scaling

Feature combinations:

Feature selection:removal of correlated descriptors (important e.g. for linear regression models)removal of descriptors with missing valuesremoval constant and near constant featuresremoval of descriptors with low correlation with the target propertystep-wise feature selectionevolutionary feature selection (genetic algorithm, etc)

1range (0; 1)

jiij xxz

ii xz add quadratic term

Step 5. Model building

Unsupervisedclustering

Supervised

Regression Classification

Multiple linear regression (MLR)

Partial linear regression (PLS) Logistic regression

Gaussian Process (GP) Naïve Bayes (NB)

Decision trees (DT)

Support vector machine (SVM)

Neural nets (NN)

Random forest (RF)

k-Nearest neighbors (kNN)

1/C = 4.08π – 2.14π2 + 2.78σ + 3.38

π = logPX – logPH

σ - Hammet constant

plant growth inhibition activity of phenoxyacetic acids

Hansch equation

R is H or CH3;

X is Br, Cl, NO2 and

Y is NO2, NH2, NHC(=O)CH3

Inhibition activity of compounds against Staphylococcus aureus

Act = 75RH – 112RCH3 + 84XCl – 16XBr – 26XNO2 + 123YNH2 + 18YNHC(=O)CH3 – 218YNO2

Free-Wilson models

QSAR model building

Simulated data set of actives and inactives with two descriptors – MW and logP

Decision tree

IF logP >= 3.2

AND MW >= 255

THENcompound is

logP < 3.2YES NO

MW < 358

YES NO

MW < 255

YES NO

Decision tree

Initial dataset

Bootstrapsample

CART tree1 CART tree2 CART tree3

Combined prediction

Random feature subspace in each node

Random Forest

Consensus (ensemble) modeling

Step 6. Validation

Test set (usually 20-25% of the work set)

1.2 1.3 1.7 2.0 2.2 2.8 3.1 3.2 3.2 3.6 4.7 5.7 5.8 6.4 7.2 8.1 9.0 9.1 9.2working set

random test set 1.2 1.3 1.7 2.0 2.2 2.8 3.1 3.2 3.2 3.6 4.7 5.7 5.8 6.4 7.2 8.1 9.0 9.1 9.2

stratified test set

1.2 1.3 1.7 2.0 2.2 2.8 3.1 3.2 3.2 3.6 4.7 5.7 5.8 6.4 7.2 8.1 9.0 9.1 9.2

Cross-validation

1.2 1.3 1.7 2.0 2.2 2.8 3.1 3.2 3.2 3.6 4.7 5.7 5.8 6.4 7.2 8.1 9.0 9.1 9.2working set

1.2 1.3 1.7 2.0 2.2 2.8 3.1 3.2 3.2 3.6 4.7 5.7 5.8 6.4 7.2 8.1 9.0 9.1 9.2

fold 1

fold 2

fold 3

predictions of different folds are combined to calculate the final predictive measure

Step 6. Measures of predictive ability of models

Classification

TNySpecificit

TPySensitivit

ySensitivitySpecificitaccuracy Balanced

baseline-1

baseline -accuracy Kappa

TNTPaccuracy

FP)FN)(TP(TPFN)FP)(TN(TNbaseline

Confusion matrix

Predicted

positive class (1)

negative class (0)

observed

positive class (1)

truepositive

false negative

negative class (0)

false positive

true negative

))()()(( FNTNFPTNFNTPFPTP

FNFPTNTPMCC

[-1; 1]

[0; 1]

Step 6. Measures of predictive ability of models

Regression

obspredi,

obsi,predi,2

RMSE i

obsi,predi,

obsi,predi, yyN

Determination coefficient

Root mean squared error

Mean absolute error

Step 7. Applicability domain (AD)Extrapolation to very distant objects is dangerous

There is a need to define the domain where our model is reliable (models are not universal!)

Only compounds which are similar to the training set compounds should be included in applicability domain of the model. One should estimate similarity of new compounds (test set, etc) to the training set compounds.

Step 7. Applicability domain (AD) measures

Bounding box - based on descriptor range

Desc_1

Desc_2

AD• internal regions are usually empty, especially if the number of descriptors is big

• it doesn’t take into account descriptor correlation

Distance from training set compounds in descriptor space

Desc_1

Desc_2Distance from training set compounds in model space

Model_1

Model_2

Requires several models (e.g. consensus model, bootstrap models)

One-class SVM

Conformal predictions

Step 7. Applicability domain (AD) example

J. Chem. Inf. Model., Vol. 50, No. 12, 2010

Step 8. Interpretation of QSAR models

Why interpretation is important?

Found active/inactive patterns which can be used for optimization of compound properties

Retrieve trends of stricture-activity relationships which can be used for knowledge-base model validation

Regulatory purposes

Principles and issues

Model should be predictive

Interpretation is valid within the applicability domain of the model

Interpretation results are data set dependent

1/C = 4.08π – 2.14π2 + 2.78σ + 3.38

π = logPX – logPH

σ - Hammet constant

plant growth inhibition activity of phenoxyacetic acids

electronic factorsrate of penetration of membranes in the plant cell

Hansch equation

R is H or CH3;

X is Br, Cl, NO2 and

Y is NO2, NH2, NHC(=O)CH3

Inhibition activity of compounds against Staphylococcus aureus

Act = 75RH – 112RCH3 + 84XCl – 16XBr – 26XNO2 + 123YNH2 + 18YNHC(=O)CH3 – 218YNO2

Free-Wilson models

347 agonists of 5-HT1A receptor

Ar - substituted (hetero)arylsL - polymethylene chainR - various (poly)cyclic residues

O O OMe Cl Cl CF3

PLS 0.84 0.18 0.03 -0.04 -0.06 -0.09 -0.11 -0.66 -0.73 -0.94 -0.96RF 0.27 0.24 0.04 0.07 -0.02 0.11 0.04 -0.04 -0.55 -0.66 -0.66

L -(CH2)6- -(CH2)5- -(CH2)4- -(CH2)3- -(CH2)2- -CH2-

PLS 0.8 0.71 0.81 0.08 -0.04 0.06

RF 0.14 0.19 0.14 -0.01 -0.03 0.05

Kuz’min, V.E., et al. Molecular Informatics, 2011, 30, 593-603.

Activitypred(A) Activitypred(B) Contribution(C)

f(A) = x f(B) = y W(C) = x – y

Polishchuk, P. G.; Kuz'min, V. E.; Artemenko, A. G.; Muratov, E. N. Universal Approach for Structural Interpretation of QSAR/QSPR Models. Molecular Informatics 2013, 32, 843-853

Universal approach

Polishchuk P, Tinkov O, Khristova T, Ognichenko L, Kosinskaya A, Varnek A, Kuz’min V. Journal of Chemical Information and Modeling 2016, 56, 1455-1469.

Acute oral toxicity on rats

Interpretation results of valid predictive models should converge independent of:

interpretation approach

descriptors

machine learning method

All models are interpretable but not all end-points

Overall QSAR workflow

Input data

Bioassays

Databases

PreprocessingFeature

engineeringModel

learningModel

validation

Classification

Regression

Clustering

Cross-validation

Bootstrap

Test set

Applicability Domain

Feature selection

Feature combination

Data normalization

Feature extraction

Interpretation

δ)f(xδ)f(xC

f(x)δ)f(xC

δ)f(xf(x)C

forward difference

backward difference

central difference

Partial derivatives is used to calculate contributions of single features.It is applicable to any model built on meaningful (interpretable) descriptors.May be calculated analytically or numerically.

Estimated contributions of separate substructural features can be mapped back on the structure to reveal contribution of fragments.

Marcou, G.; Horvath, D.; Solov'ev, V.; Arrault, A.; Vayer, P.; Varnek, A. Interpretability of SAR/QSAR Models of any Complexity by Atomic Contributions. Molecular Informatics 2012, 31, 639-642

QSAR modeling - Katedra fyzikální chemie...

Documents

metodologie qsar

DISEÑO DE FÁRMACOS - doctoradoquimed.es · 3 SARMC Diseño de Fármacos. QSAR y QSAR-3D. QSAR y QSAR-3D I- Introducción. II- Caracterización de receptores. Mapeo del receptor

QSAR itq21 param

Quantitative Structure Activity Relationships QSAR and 3D-QSAR

Ecotoxicological Bioassays with the microalga

Analysis of Bioassays

QSAR (Summary)

Qsar Sabah

Applicazione delle metodologie QSAR a problematiche ...dipbsf.uninsubria.it/qsar/education/Mat Didattico/MaterialeCorsi... · Prof. Paola Gramatica -QSAR Research Unit -DBSF -University

Therapeutically relevant bioassays

Quantitative Structure Activity Relationships QSAR and 3D-QSAR Chapter 18

QSAR Present 2010

Qsar Lecture

Measurement in Science Bioassays and Immunoassays

Potency Bioassays IndustryPerspPatankar

Bioassays praveen tk

Introducción Al QSAR

BIOASSAYS - scaht.org

ResearchArticle 2D-QSAR and 3D-QSAR Analyses for EGFR ...downloads.hindawi.com/journals/bmri/2017/4649191.pdfResearchArticle 2D-QSAR and 3D-QSAR Analyses for EGFR Inhibitors ManmanZhao,1

QSAR Design of Discovery Libraries for Solids Based on QSAR Models 2005 QSAR and rial Science