Ahmad Reza Mehdipour 07.11.2017
Quantitative Structure-Activity Relationship (QSAR)
http://www.biophys.mpg.de/en/theoretical-biophysics/computational-drug-design.html
Course Outline
1. Ligand-based approaches
   1. (Quantitative) structure-activity relationship (SAR & QSAR)
   2. Pharmacophore modeling
2. Bioinformatics approaches (target recognition and structural modeling)
   1. Sequence alignments and searches
   2. Gene identification and prediction
   3. Homology modeling
3. Structure-based approaches
   1. Molecular docking
      1. Ligand docking: theory and scoring functions
      2. Virtual screening
      3. Protein-protein docking and interaction
   2. Molecular dynamics simulation
      1. Introduction to molecular dynamics
   3. Estimation of ligand binding affinity
      1. Free energy perturbation
      2. Enhanced sampling methods
Ligand-based approach
• Structure-Activity Relationships (SAR)
• Quantitative Structure-Activity Relationships (QSAR)
Biological activity = f(molecular descriptors)
QSAR: Historical perspective
1900. Meyer-Overton
1964. Hansch analysis
Hansch & Fujita, JACS 1964
log 1! = −!!! + !!!!! − !!!!!! + log !! + !!!!!
Quantitative Structure-Activity Relationships (QSAR)
Definition
QSAR builds a mathematical model that correlates a set of structural descriptors of a set of chemical compounds with their biological activity.
In general terms (QYXR), it builds a mathematical model that correlates a set of independent variables (X) of a set of samples with a set of dependent variables (Y).
Quantitative Structure-Activity Relationships (QSAR)
1. Set of compounds
4. Biological activities
Considerations
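All compounds should belong to a congeneric series
All compounds should share the same mechanism of action
All compounds should have a similar binding mechanism
The biological activity should be of exactly the same type (same measured endpoint) for every compound
The biological activity should be correlated with binding affinity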
Quantitative Structure-Activity Relationships (QSAR)
1. Set of compounds
2. Molecular descriptors
3. Mathematical models
4. Biological activities
Multiple Linear Regression (MLR): $Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$
Partial Least Squares (PLS)
Artificial Neural Network (ANN)
Genetic Algorithm (GA)
Molecular descriptors
1D descriptors: molecular weight, LogP, number of functional groups
2D descriptors: topological indices
3D descriptors: geometrical parameters, molecular surfaces, quantum chemistry descriptors
2D descriptors
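Topological indices based on adjacency matrix
[Figure: six-atom molecular graph with atoms numbered 1-6; bonds 1-3, 2-3, 3-4, 4-5, 4-6]
Distance matrix D (entry $d_{ij}$ = number of bonds on the shortest path between atoms i and j):

      1  2  3  4  5  6
  1   0  2  1  2  3  3
  2   2  0  1  2  3  3
  3   1  1  0  1  2  2
  4   2  2  1  0  1  1
  5   3  3  2  1  0  2
  6   3  3  2  1  2  0

$TI = \frac{1}{2} \sum_{i=1}^{6} \sum_{j=1}^{6} d_{ij} = 29$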
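The index on this slide can be checked with a few lines of Python. This is a minimal sketch, assuming the six-atom graph has the bonds 1-3, 2-3, 3-4, 4-5 and 4-6 (read off the distance-1 entries of the slide's matrix); the names `bonds`, `distances_from` and `topological_index` are illustrative and not part of the lecture.

```python
from collections import deque

# Bonds of the six-atom graph (assumed from the distance-1 entries of the matrix).
bonds = [(1, 3), (2, 3), (3, 4), (4, 5), (4, 6)]
n_atoms = 6

# Adjacency list: which atoms are directly bonded to each atom.
neighbors = {atom: [] for atom in range(1, n_atoms + 1)}
for a, b in bonds:
    neighbors[a].append(b)
    neighbors[b].append(a)


def distances_from(start):
    """Breadth-first search: number of bonds on the shortest path to every atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for nxt in neighbors[atom]:
            if nxt not in dist:
                dist[nxt] = dist[atom] + 1
                queue.append(nxt)
    return [dist[atom] for atom in range(1, n_atoms + 1)]


distance_matrix = [distances_from(atom) for atom in range(1, n_atoms + 1)]

# TI = 1/2 * sum over all i, j of d_ij (each atom pair is counted twice in the matrix).
topological_index = sum(sum(row) for row in distance_matrix) // 2
print(topological_index)  # 29
```

Summing half of the distance matrix in this way is the usual definition of the Wiener index, so other topological indices would need a different final step.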
3D descriptors
Quantum chemical descriptors
Descriptors calculated by quantum-mechanical methods
(semi-empirical, ab initio or DFT)
Partial atomic charges
Lowest unoccupied molecular orbital energy (LUMO)
Highest occupied molecular orbital energy (HOMO)
Electrostatic potential
Molecular polarizability
Molecular descriptor software
Dragon
GAUSSIAN
HyperChem
CODESSA
MOE
Multiple Linear Regression (MLR)
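Model (intercept $b_0$, coefficients $b_1, \dots, b_k$):
$Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$
Objective function (minimized over the intercept and coefficients):
$\sum_i \left( y_i - b_0 - b_1 x_{1,i} - \dots - b_k x_{k,i} \right)^2$
Least-squares solution:
$b = (X'X)^{-1} X'Y$
$\hat{Y} = Xb = X (X'X)^{-1} X'Y$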
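The least-squares solution can be tried out numerically. This is a minimal NumPy sketch on synthetic data (the descriptor values and `true_coefficients` are made up for illustration and are not the lecture's data set); it assumes the design matrix X carries a leading column of ones so that the first coefficient is the intercept.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, illustrative data: 25 compounds, 4 descriptors (not the slide's data).
n_compounds, n_descriptors = 25, 4
descriptors = rng.normal(size=(n_compounds, n_descriptors))
true_coefficients = np.array([4.0, -1.0, 0.5, 0.0, 0.8])  # b0, b1..b4 (made up)

# Design matrix X with a leading column of ones for the intercept b0.
X = np.column_stack([np.ones(n_compounds), descriptors])
Y = X @ true_coefficients + rng.normal(scale=0.1, size=n_compounds)

# Normal equations b = (X'X)^-1 X'Y, solved without forming the inverse explicitly.
b = np.linalg.solve(X.T @ X, X.T @ Y)
Y_hat = X @ b
residuals = Y - Y_hat

# np.linalg.lstsq minimizes the same sum of squared residuals in a more stable way.
b_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
```

Solving the linear system rather than inverting X'X is a design choice for numerical stability; with strongly correlated descriptors even this can become ill-conditioned, which is one motivation for variable selection and PLS.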
Multiple Linear Regression (MLR)
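Fit statistics (n compounds, k descriptors, $\hat{y}_i$ the calculated and $\bar{y}$ the mean activity):
$\sigma^2 = \frac{\sum_i (y_i - \hat{y}_i)^2}{n - k - 1}$
$Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$
$F = \frac{\sum_i (\hat{y}_i - \bar{y})^2 / k}{\sum_i (y_i - \hat{y}_i)^2 / (n - k - 1)}$
$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$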
[Figure: experimental (Expr) vs. estimated activity]
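Akaike Information Criterion:
$AIC = n \log\left( \frac{\sum_i e_i^2}{n} \right) + 2(k + 1)$
($e_i$: residuals, n: number of compounds, k: number of descriptors)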
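The statistics on this slide are simple to compute once experimental and calculated activities are available. This is a hedged sketch: the function name `fit_statistics`, the use of the natural logarithm in the AIC, and the four-point example values are assumptions made for illustration.

```python
import numpy as np


def fit_statistics(y, y_hat, n_descriptors):
    """Residual variance, R2, F and AIC for a fitted linear model."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    n = len(y)
    k = n_descriptors
    ss_res = np.sum((y - y_hat) ** 2)         # sum of squared residuals
    ss_tot = np.sum((y - y.mean()) ** 2)      # total sum of squares
    ss_reg = np.sum((y_hat - y.mean()) ** 2)  # sum of squares explained by the model
    return {
        "sigma2": ss_res / (n - k - 1),
        "R2": 1.0 - ss_res / ss_tot,
        "F": (ss_reg / k) / (ss_res / (n - k - 1)),
        "AIC": n * np.log(ss_res / n) + 2 * (k + 1),  # natural log assumed
    }


# Tiny made-up example: four compounds, one descriptor.
stats = fit_statistics(y=[1.0, 2.0, 3.0, 4.0], y_hat=[1.1, 1.9, 3.2, 3.8], n_descriptors=1)
```

A lower AIC indicates a better trade-off between fit and number of descriptors, which is why it serves as the selection criterion in the variable selection slides.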
Multiple Linear Regression (MLR)
No.  X1     X2      X3    X4      Yexp  Ycalc  Residual
1    3.42   38.51   6.62  6.63    3     2.9    0.1
2    3.05   38.91   6.61  6.04    3.15  3.37   -0.22
3    2.52   54.28   6.58  6.23    3.28  3.07   0.21
4    3.29   54.27   6.63  6.09    4.24  3.91   0.33
5    2.25   54.62   6.61  6.03    3.28  3.14   0.14
6    2.42   55.37   6.59  5.67    4.35  3.75   0.6
7    3.15   70.6    6.67  6.51    3.88  3.69   0.19
8    1.67   69.77   6.49  5.79    3.64  3.3    0.34
9    2.91   70.03   6.64  6.11    4.35  3.99   0.36
10   1.73   70.57   6.61  6.04    3.4   3.11   0.29
11   1.36   86.18   6.64  6.12    3.3   3.12   0.18
12   2.81   85.83   6.62  6.05    4.7   4.38   0.32
13   2.96   102.96  6.66  6.52    4.67  4.35   0.32
14   0.65   102.7   6.61  6.04    3.34  3.06   0.28
15   2.22   117.89  6.62  6.04    4.11  4.74   -0.63
16   0.19   118.98  6.61  6.18    3.37  2.92   0.45
17   2.85   135.34  6.67  6.52    5.93  5.1    0.83
18   0.39   134.08  6.65  6.32    3.65  3.31   0.34
19   3.58   22.34   6.7   6.6     2.7   2.69   0.01
20   3.41   54.34   6.62  6.64    3.49  3.29   0.2
21   0.43   77.39   1.87  4.37    1.99  1.87   0.12
22   0.35   93.05   1.88  4.34    2.38  2.25   0.13
23   0.09   109.53  1.87  4.34    2.76  2.46   0.3
24   -0.2   125.8   1.88  4.34    3.29  2.65   0.64
25   1.41   87.61   0.35  -14.65  0.87  0.85   0.02

σ² = 0.170, R² = 0.899, F = 42.4
$Y = 4.224 - 1.305 X_1 + 0.535 X_2 + 0.026 X_3 + 0.817 X_4$
σ²(Y) = 0.712
Variable selection
1. Systematic approaches
   1. Forward selection
   2. Backward elimination
2. Heuristic approaches
   1. Genetic algorithm
   2. Simulated annealing
Forward selection
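Response Y, candidate descriptors X1 X2 X3 X4 X5. At each step the descriptor that gives the lowest AIC is added to the model.

Step  Model                  AIC when adding
                             X1     X2     X3     X4     X5
1     Y=a+Xn                 57.7   60.70  54.7   56.1   56.5
2     Y=a+X3+Xn              56.3   47.55  -      56.7   56.5
3     Y=a+X3+X2+Xn           29.4   -      -      49.5   48.3
4     Y=a+X3+X2+X1+Xn        -      -      -      13.8   25.1
5     Y=a+X3+X2+X1+X4+Xn     -      -      -      -      15.7

("-" marks a descriptor that is already in the model.)
Adding X5 in step 5 raises the AIC from 13.8 to 15.7, so selection stops at Y=a+X3+X2+X1+X4.
$AIC = n \log\left( \frac{\sum_i e_i^2}{n} \right) + 2(k + 1)$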
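The forward selection procedure can be sketched in Python as a greedy loop that keeps adding the descriptor with the lowest AIC until no addition lowers it further. This is an illustrative sketch on synthetic data; `aic_of_subset` and `forward_selection` are hypothetical helper names, and the natural logarithm is assumed in the AIC.

```python
import numpy as np


def aic_of_subset(X, y, columns):
    """AIC = n log(RSS/n) + 2(k+1) for an MLR model on the given descriptor columns."""
    n = len(y)
    design = np.column_stack([np.ones(n)] + [X[:, j] for j in columns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ coef) ** 2)
    return n * np.log(rss / n) + 2 * (len(columns) + 1)


def forward_selection(X, y):
    """Greedily add the descriptor with the lowest AIC; stop when AIC no longer drops."""
    selected, best_aic = [], np.inf
    remaining = list(range(X.shape[1]))
    while remaining:
        trials = {j: aic_of_subset(X, y, selected + [j]) for j in remaining}
        candidate = min(trials, key=trials.get)
        if trials[candidate] >= best_aic:
            break
        selected.append(candidate)
        remaining.remove(candidate)
        best_aic = trials[candidate]
    return selected, best_aic


# Synthetic, illustrative data: 5 candidate descriptors, only the first three matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + 1.0 * X[:, 2] + rng.normal(scale=0.1, size=40)
selected, best_aic = forward_selection(X, y)
```

Column indices in `selected` are zero-based, so index 0 corresponds to X1 on the slide.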
Backward elimination
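Response Y, descriptors X1 X2 X3 X4 X5. Starting from the full model, the descriptor whose removal gives the lowest AIC is dropped at each step.
$AIC = n \log\left( \frac{\sum_i e_i^2}{n} \right) + 2(k + 1)$

Full model Y=a+X1+X2+X3+X4+X5: AIC 15.7

Step  AIC after removing
      X1     X2     X3     X4     X5
1     21.8   50.6   59.9   25.1   13.8
2     31.9   49.5   58.0   29.4   -

Step 1: removing X5 gives the lowest AIC (13.8), leaving Y=a+X1+X2+X3+X4.
Step 2: every further removal gives an AIC above 13.8, so the model stays Y=a+X1+X2+X3+X4.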
Genetic algorithm
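Each genome is a bit string over the candidate descriptors (1 = descriptor included in the model):

GENOME  X1 X2 X3 X4 X5 X6 X7 X8 X9 X10
        0  1  0  0  1  0  0  1  0  0

This genome encodes $Y = b_0 + b_2 X_2 + b_5 X_5 + b_8 X_8$; each genome is scored by the AIC of its model.

Further genomes shown on the slide (two per row):
0 0 0 0 1 0 0 1 0 1    0 1 0 0 1 0 0 0 1 0
1 0 0 0 1 0 0 1 0 1    0 1 0 0 1 0 0 0 1 1    (Mutation: one bit flipped in each genome of the previous row)
0 1 0 0 1 0 0 1 0 1    0 1 0 0 1 0 0 0 0 0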
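A genetic algorithm for descriptor selection can be sketched as follows. This is one possible minimal variant, not the lecture's implementation: truncation selection, single-point crossover, bit-flip mutation and elitism are assumptions, as are the population size, the mutation rate and the synthetic data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic, illustrative data: 10 candidate descriptors, only X2, X5 and X8 matter.
n_compounds, n_descriptors = 40, 10
X = rng.normal(size=(n_compounds, n_descriptors))
y = (1.0 + 2.0 * X[:, 1] - 1.5 * X[:, 4] + 1.0 * X[:, 7]
     + rng.normal(scale=0.2, size=n_compounds))


def aic(genome):
    """AIC of the MLR model that uses the descriptors switched on in the genome."""
    columns = np.flatnonzero(genome)
    design = np.column_stack([np.ones(n_compounds), X[:, columns]])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    rss = np.sum((y - design @ coef) ** 2)
    return n_compounds * np.log(rss / n_compounds) + 2 * (len(columns) + 1)


def crossover(parent_a, parent_b):
    """Single-point crossover: head of one parent, tail of the other."""
    point = rng.integers(1, n_descriptors)
    return np.concatenate([parent_a[:point], parent_b[point:]])


def mutate(genome, rate=0.05):
    """Flip each bit independently with a small probability."""
    flips = rng.random(n_descriptors) < rate
    return np.where(flips, 1 - genome, genome)


population = rng.integers(0, 2, size=(30, n_descriptors))
history = []  # best AIC found at the start of each generation
for generation in range(40):
    scores = np.array([aic(genome) for genome in population])
    order = np.argsort(scores)             # lower AIC is better
    population = population[order]
    history.append(scores[order[0]])
    parents = population[:10]              # truncation selection
    children = [population[0].copy()]      # elitism: keep the best genome unchanged
    while len(children) < len(population):
        a, b = parents[rng.integers(0, 10, size=2)]
        children.append(mutate(crossover(a, b)))
    population = np.array(children)

best_genome = min(population, key=aic)
best_aic = aic(best_genome)
```

Because the best genome is always carried over unchanged, the best AIC cannot get worse from one generation to the next; without elitism the search may lose good solutions.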
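Partial least squares
Suited to cases where:
The X-variables are correlated
The number of X-variables is relatively high compared with the number of samples
Decomposition into scores and loadings: $X = T P^{T}$, $Y = U Q^{T}$
Regression on the original variables: $Y = \beta X + \varepsilon$
Regression on the scores: $U = \beta T + \varepsilon$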
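For a single response, the score-based regression can be sketched with the NIPALS form of PLS (often called PLS1). This is an illustrative NumPy implementation under stated assumptions: one response variable, mean-centered data, and synthetic correlated descriptors; the function name `pls1_fit` is hypothetical.

```python
import numpy as np


def pls1_fit(X, y, n_components):
    """PLS with a single response via NIPALS; returns (intercept, coefficients)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    E, f = X - x_mean, y - y_mean      # centered copies, deflated in the loop
    W, P, q = [], [], []
    for _ in range(n_components):
        w = E.T @ f
        w = w / np.linalg.norm(w)      # weight vector
        t = E @ w                      # X-scores (one column of T)
        tt = t @ t
        p = E.T @ t / tt               # X-loadings (one column of P)
        c = f @ t / tt                 # regression of y on the scores
        E = E - np.outer(t, p)         # deflate X
        f = f - c * t                  # deflate y
        W.append(w)
        P.append(p)
        q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)   # coefficients on the original X-variables
    intercept = y_mean - x_mean @ coef
    return intercept, coef


# Synthetic, illustrative data: four correlated descriptors built from two latent factors.
rng = np.random.default_rng(2)
n_samples = 30
latent = rng.normal(size=(n_samples, 2))
X = latent @ rng.normal(size=(2, 4)) + 0.3 * rng.normal(size=(n_samples, 4))
y = latent[:, 0] - 0.5 * latent[:, 1] + 0.1 * rng.normal(size=n_samples)

intercept, coef = pls1_fit(X, y, n_components=2)
y_hat = intercept + X @ coef
```

With as many components as descriptors the result coincides with ordinary MLR; the benefit of PLS comes from stopping at a small number of components, usually chosen by cross-validation.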
Other modeling methods
Non-linear regression
Artificial neural network
Classification methods
Multiple logistic regression
Support vector machine
[Figure: non-linear regression equation and a neural-network diagram with inputs X1, X2, X3 and output Y]
Validation
Validation is required to ensure model quality
Over-fitting
Chance correlation
1. Cross-validation
   1. Leave-one-out
   2. Leave-N-out
2. Bootstrapping
3. External validation (prediction set)
4. Y-randomization
Cross-validation
Leave-one-out (repeated P times)
Full set: Y1 Y2 ... Y20
Round 1: model built on Y1-Y19; Y20 left out and predicted
Round 2: model built on Y1-Y18, Y20; Y19 left out and predicted
Result: R²cv(LOO)

Leave-N-out (repeated P/N times)
Full set: Y1 Y2 ... Y20
Round 1: model built on Y1-Y16; Y17-Y20 left out and predicted
Round 2: model built on Y1, Y2, Y7-Y18, Y20; Y3-Y6 left out and predicted
Result: R²cv(LNO)
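Both cross-validation schemes can be written as one function with the number of left-out samples as a parameter. This is a minimal sketch with an MLR model and synthetic data; `fit_predict` and `r2_cv` are hypothetical names, and leaving out contiguous blocks assumes the compounds are in arbitrary order.

```python
import numpy as np


def fit_predict(X_train, y_train, X_test):
    """Fit MLR on the training samples and predict the test samples."""
    design = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(design, y_train, rcond=None)
    return coef[0] + X_test @ coef[1:]


def r2_cv(X, y, n_out=1):
    """Cross-validated R2: every block of n_out samples is predicted by a model
    that was built without it (n_out=1 is leave-one-out)."""
    n = len(y)
    y_pred = np.empty(n)
    for start in range(0, n, n_out):
        test = np.arange(start, min(start + n_out, n))
        train = np.setdiff1d(np.arange(n), test)
        y_pred[test] = fit_predict(X[train], y[train], X[test])
    press = np.sum((y - y_pred) ** 2)  # predictive residual sum of squares
    return 1.0 - press / np.sum((y - y.mean()) ** 2)


# Synthetic, illustrative data: 20 compounds, 2 descriptors (made up).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.3, size=20)

r2_fitted = 1.0 - np.sum((y - fit_predict(X, y, X)) ** 2) / np.sum((y - y.mean()) ** 2)
r2_loo = r2_cv(X, y, n_out=1)   # P = 20 rounds
r2_lno = r2_cv(X, y, n_out=4)   # P/N = 5 rounds
```

The leave-one-out value can never exceed the fitted R², because each left-out prediction error is at least as large as the corresponding fitted residual.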
Bootstrapping
Full set: Y1 Y2 ... Y20
Round 1: model built on Y1-Y13, Y15, Y17, Y19; Y14, Y16, Y18, Y20 left out and predicted
Round 2: model built on Y2-Y5, Y8, Y9, Y11-Y18, Y20; Y1, Y6, Y7, Y10 left out and predicted
Randomly chosen samples are left out in each round; repeated N times
Result: R²BS
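The scheme drawn on this slide, random subsets left out over many rounds, can be sketched as below. Note the assumption: this follows the figure (hold-out sets drawn without replacement), whereas a classical bootstrap would instead draw the training set with replacement. Function names, the pooled R² and the synthetic data are illustrative.

```python
import numpy as np


def fit_predict(X_train, y_train, X_test):
    """Fit MLR on the training samples and predict the test samples."""
    design = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(design, y_train, rcond=None)
    return coef[0] + X_test @ coef[1:]


def r2_random_holdout(X, y, n_out=4, n_rounds=100, seed=0):
    """R2 pooled over n_rounds random hold-out sets of size n_out."""
    rng = np.random.default_rng(seed)
    n = len(y)
    ss_res = ss_tot = 0.0
    for _ in range(n_rounds):
        test = rng.choice(n, size=n_out, replace=False)  # random left-out samples
        train = np.setdiff1d(np.arange(n), test)
        y_pred = fit_predict(X[train], y[train], X[test])
        ss_res += np.sum((y[test] - y_pred) ** 2)
        ss_tot += np.sum((y[test] - y[train].mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic, illustrative data: 20 compounds, 2 descriptors (made up).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.3, size=20)
r2_bs = r2_random_holdout(X, y)
```

Pooling the squared errors over all rounds, rather than averaging one R² per round, avoids unstable values when a small hold-out set happens to have little spread in Y.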
External validation
Full set: Y1 Y2 ... Y20
Training set: Y1, Y3-Y8, Y10-Y13, Y15-Y17, Y19
Prediction set: Y2, Y9, Y14, Y18, Y20
On the training set: variable selection, cross-validation, final model
The final model is used to predict the prediction set
Result: R²EV
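External validation can be sketched with the same split as on the slide. The data here are synthetic, and the definition of R²EV relative to the training-set mean is one of several conventions in use, so both are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic, illustrative data: 20 compounds, 2 descriptors (made up).
X = rng.normal(size=(20, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.2, size=20)

# Prediction set as drawn on the slide (1-based labels Y2, Y9, Y14, Y18, Y20).
prediction_idx = np.array([2, 9, 14, 18, 20]) - 1
training_idx = np.setdiff1d(np.arange(20), prediction_idx)

# Variable selection, cross-validation and fitting all use the training set only.
design = np.column_stack([np.ones(len(training_idx)), X[training_idx]])
coef, *_ = np.linalg.lstsq(design, y[training_idx], rcond=None)

# The prediction set is touched once, to assess the final model.
y_pred = coef[0] + X[prediction_idx] @ coef[1:]
ss_res = np.sum((y[prediction_idx] - y_pred) ** 2)
ss_tot = np.sum((y[prediction_idx] - y[training_idx].mean()) ** 2)
r2_ev = 1.0 - ss_res / ss_tot
```

Keeping the prediction set out of every modeling decision is what distinguishes external validation from cross-validation.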
Y-randomization
Original data: activities Y1 ... Y20 paired with descriptor rows X1 ... X20
$Y = \beta X + \varepsilon$
Randomized data: the Y values are reordered (Y20 ... Y1 in the figure) while X1 ... X20 stay in place
$Y_{new} = \beta X + \varepsilon$
Repeated N times
Result: R²Yrand
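Y-randomization is easy to simulate: the activities are shuffled, the model is refitted, and the resulting R² values are compared with the R² of the real model. This is a minimal sketch with an MLR model on synthetic data; `r2_fit` and `y_randomization` are hypothetical names.

```python
import numpy as np


def r2_fit(X, y):
    """R2 of an MLR model fitted to (X, y)."""
    design = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    ss_res = np.sum((y - design @ coef) ** 2)
    return 1.0 - ss_res / np.sum((y - y.mean()) ** 2)


def y_randomization(X, y, n_rounds=100, seed=0):
    """R2 values of models fitted to randomly permuted activities."""
    rng = np.random.default_rng(seed)
    return np.array([r2_fit(X, rng.permutation(y)) for _ in range(n_rounds)])


# Synthetic, illustrative data: 20 compounds, 2 descriptors (made up).
rng = np.random.default_rng(4)
X = rng.normal(size=(20, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.3, size=20)

r2_true = r2_fit(X, y)
r2_yrand = y_randomization(X, y)
```

If the shuffled models reach R² values close to the real one, the original correlation is likely due to chance.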
Good model?
$Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$
Model Robustness: σ², R², F, (R)MSE
$RMSE = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 1}}$
Model Quality: R²cv(LOO), R²cv(LNO), R²BS, RMSEcv
Model Reliability: R²Yrand, RMSEYrand
Model Predictability: R²EV, RMSEEV
Good model?
$Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$
Model Robustness: σ², R² > 0.8, F, (R)MSE
Model Quality: R²cv(LOO) > 0.6, R²cv(LNO) > 0.6, R²BS > 0.6, RMSEcv
   R² - R²cv < 0.3
Model Reliability: R²Yrand < 0.3, RMSEYrand
   R² - R²Yrand > 0.4
Model Predictability: R²EV > 0.6, RMSEEV
   R² - R²EV < 0.3
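The thresholds above can be collected in a small checklist function. This is an illustrative sketch: `assess_model` is a hypothetical helper, a single cross-validated R² stands in for the LOO, LNO and bootstrapping values, and the statistics passed in are made-up numbers, not results from the lecture.

```python
def assess_model(r2, r2_cv, r2_yrand, r2_ev):
    """Apply the rule-of-thumb thresholds from the slide; returns pass/fail flags."""
    return {
        "R2 > 0.8": r2 > 0.8,
        "R2cv > 0.6": r2_cv > 0.6,
        "R2 - R2cv < 0.3": r2 - r2_cv < 0.3,
        "R2Yrand < 0.3": r2_yrand < 0.3,
        "R2 - R2Yrand > 0.4": r2 - r2_yrand > 0.4,
        "R2EV > 0.6": r2_ev > 0.6,
        "R2 - R2EV < 0.3": r2 - r2_ev < 0.3,
    }


# Hypothetical statistics, for illustration only.
report = assess_model(r2=0.90, r2_cv=0.75, r2_yrand=0.10, r2_ev=0.70)
acceptable = all(report.values())

# An over-fitted model: high R2 but poor cross-validated R2.
overfit_report = assess_model(r2=0.95, r2_cv=0.40, r2_yrand=0.10, r2_ev=0.70)
```

Returning the individual flags rather than a single verdict shows which aspect (quality, reliability or predictability) a model fails on.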
Applicability domain
$Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_k X_k$
[Figure: compounds plotted in descriptor space, axes X1 and X2]
Principal component analysis
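One way to connect principal component analysis to the applicability domain is to project the training descriptors onto a few principal components and accept a new compound only if its scores fall inside the range spanned by the training set. This is a sketch of that idea under stated assumptions (autoscaled descriptors, two components, a simple bounding box, synthetic data); the lecture does not specify which domain definition it uses, and `pca_domain` and `in_domain` are hypothetical names.

```python
import numpy as np


def pca_domain(X_train, n_components=2):
    """Fit PCA (via SVD) on autoscaled descriptors and record the score ranges."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0, ddof=1)
    Z = (X_train - mean) / std
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    loadings = Vt[:n_components].T          # columns are principal axes
    scores = Z @ loadings
    return mean, std, loadings, scores.min(axis=0), scores.max(axis=0)


def in_domain(x, domain):
    """True if the compound's PC scores lie within the training-set ranges."""
    mean, std, loadings, low, high = domain
    t = ((x - mean) / std) @ loadings
    return bool(np.all((t >= low) & (t <= high)))


# Synthetic, illustrative training descriptors: 30 compounds, 4 descriptors.
rng = np.random.default_rng(5)
X_train = rng.normal(size=(30, 4))
domain = pca_domain(X_train)

typical = X_train.mean(axis=0)                            # centre of the training set
outlier = typical + 50.0 * domain[1] * domain[2][:, 0]    # far out along the first PC
```

A prediction for a compound outside this box is an extrapolation and should be treated with caution; distance- or leverage-based definitions are common alternatives.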
Prediction Vs Description
$Y = 2.34 + 3.5\,VE\_b(e) - 0.87\,ATS1v + 3.76\,SM02\_AEA$
σ² = 0.003, R² = 0.951, F = 260.2, R²EV = 0.891
VE_b(e): coefficient sum of the last eigenvector from Burden matrix weighted by Sanderson electronegativity
ATS1v: Broto-Moreau autocorrelation of lag 1 (log function) weighted by van der Waals volume
SM02_AEA: spectral moment of order 2 from augmented edge adjacency matrix weighted by resonance integral

$Y = 8.34 + 2.5\,LogP + 0.93\,NAR$
σ² = 0.113, R² = 0.811, F = 43.2, R²EV = 0.761
LogP: water-oil partition coefficient
NAR: number of aromatic rings