CZ3253: Computer Aided Drug Design
Lecture 7: Drug Design Methods II: SVM
Prof. Chen Yu Zong
Tel: 6874-6877
Email: [email protected]
http://xin.cz3.nus.edu.sg
Room 07-24, level 7, SOC1, National University of Singapore
Classification of Drugs by SVM
• A drug is classified as either belonging (+) or not belonging (-) to a class.
• Examples of drug classes: inhibitor of a protein, BBB penetrating, genotoxic.
• Examples of protein classes: enzyme EC 3.4 family, DNA-binding.
• By screening against all classes, the property of a drug or the function of a protein can be identified.
[Figure: a drug is screened by the Class-1, Class-2, and Class-3 SVMs. The outputs are -, -, and +, so the drug belongs to Family-3.]
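The screening step can be sketched as follows. This is a minimal illustration, assuming each class already has a trained linear decision function; the weights, offsets, and feature vector are hypothetical values chosen so that the outputs reproduce the -, -, + pattern of the slide, and they are not parameters from the lecture.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

# Hypothetical per-class linear classifiers (weights w, offset b); illustrative only.
CLASSIFIERS = {
    "Class-1": ((1.0, 1.0, 0.0), 1.5),
    "Class-2": ((0.0, 1.0, 1.0), 1.5),
    "Class-3": ((1.0, 0.0, 1.0), 1.5),
}

def screen(x, classifiers):
    """Return {class name: '+' or '-'} from the sign of w.x - b."""
    return {
        name: "+" if dot(w, x) - b > 0 else "-"
        for name, (w, b) in classifiers.items()
    }

drug = (1.0, 0.0, 1.0)  # hypothetical feature vector
labels = screen(drug, CLASSIFIERS)
member_of = [name for name, sign in labels.items() if sign == "+"]
print(labels, member_of)
```

Each classifier answers only "member or not", so a drug can in principle come back positive for several classes or for none.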
Classification of Drugs or Proteins by SVM
What is SVM?
• Support vector machines (SVM) are a statistical machine learning method that learns from examples and classifies objects into one of two classes.
Advantages of SVM:
• Class members can be diverse.
• Structure-derived physico-chemical features are used as the basis for drug classification (the algorithm requires no structural similarity).
SVM References
• C. Burges, "A tutorial on support vector machines for pattern recognition", Data Mining and Knowledge Discovery, Kluwer Academic Publishers, 1998 (on-line).
• R. Duda, P. Hart, and D. Stork, Pattern Classification, John Wiley, 2nd edition, 2001 (section 5.11, hard copy).
• S. Gong et al., Dynamic Vision: From Images to Face Recognition, Imperial College Press, 2001 (sections 3.6.2, 3.7.2, hard copy).
• Online lecture notes (http://www.cs.unr.edu/~bebis/MathMethods/SVM/lecture.pdf)
• Publications on SVM drug prediction:
– J. Chem. Inf. Comput. Sci. 44, 1630 (2004)
– J. Chem. Inf. Comput. Sci. 44, 1497 (2004)
– Toxicol. Sci. 79, 170 (2004)
Machine Learning Method
Inductive learning: example-based learning
[Figure: positive examples and negative examples, each described by a descriptor.]
Machine Learning Method
Feature vectors: A=(1, 1, 1), B=(0, 1, 1), C=(1, 1, 1), D=(0, 1, 1), E=(0, 0, 0), F=(1, 0, 1)
[Figure: the descriptors of each positive and negative example are turned into a feature vector.]
SVM Method
Feature vectors in input space: A=(1, 1, 1), B=(0, 1, 1), C=(1, 1, 1), D=(0, 1, 1), E=(0, 0, 0), F=(1, 0, 1)
[Figure: feature vectors A, B, E, and F plotted as points in the input space with axes X, Y, and Z.]
SVM Method
Project to a higher dimensional space.
[Figure: protein family members and nonmembers, with the border in the input space and the new border after projection.]
[Figure: the new border between protein family members and nonmembers, with the support vectors marked on each side.]
Best Linear Separator?
Find Closest Points in Convex Hulls
[Figure: c and d are the closest points of the two convex hulls.]
Plane Bisects Closest Points
$x \cdot w = b$, $w = d - c$
[Figure: the separating plane bisects the segment joining c and d.]
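A small worked example of the bisecting plane, as a sketch: the normal is w = d - c as on the slide, and the offset b is taken so that the plane passes through the midpoint of c and d (this midpoint formula is the standard geometric choice, stated here as an assumption since the slide does not spell it out). The points c and d are illustrative values.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def bisecting_plane(c, d):
    """Plane x.w = b with normal w = d - c through the midpoint of c and d."""
    w = [di - ci for ci, di in zip(c, d)]
    midpoint = [(ci + di) / 2 for ci, di in zip(c, d)]
    b = dot(w, midpoint)
    return w, b

c = (0.0, 1.0)  # illustrative closest point of one hull
d = (2.0, 1.0)  # illustrative closest point of the other hull
w, b = bisecting_plane(c, d)
print(w, b)                          # normal and offset of the plane
print(dot(w, c) - b, dot(w, d) - b)  # equal size, opposite sign
```

The two printed values have the same magnitude and opposite signs, which is what "bisects" means here: c and d lie at equal distance on either side of the plane.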
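Find using quadratic program
$$
\min_{\alpha} \; \frac{1}{2} \lVert c - d \rVert^2, \qquad c = \sum_{i \in \text{Class } 1} \alpha_i x_i, \quad d = \sum_{i \in \text{Class } -1} \alpha_i x_i
$$
$$
\text{s.t.} \quad \sum_{i \in \text{Class } 1} \alpha_i = 1, \quad \sum_{i \in \text{Class } -1} \alpha_i = 1, \quad \alpha_i \ge 0, \; i = 1, \ldots, \ell
$$
Many existing and new solvers.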
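The slide leaves the choice of solver open. As one simple option (a swapped-in technique, not one named in the lecture), a Frank-Wolfe iteration can find the closest points of two convex hulls without a QP library. The sketch below assumes small point sets given as tuples; the function name and the example points are illustrative.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def closest_hull_points(class_pos, class_neg, iters=200):
    """Frank-Wolfe iteration for min ||c - d||^2 over the two convex hulls."""
    c = list(class_pos[0])
    d = list(class_neg[0])
    for _ in range(iters):
        u = [ci - di for ci, di in zip(c, d)]  # current c - d
        # Linear minimization step: best hull vertex for each class.
        c_new = min(class_pos, key=lambda x: dot(u, x))
        d_new = max(class_neg, key=lambda x: dot(u, x))
        v = [(cn - dn) - ui for cn, dn, ui in zip(c_new, d_new, u)]
        vv = dot(v, v)
        if vv == 0.0:
            break
        # Exact line search for ||u + step * v||^2, clipped to [0, 1].
        step = max(0.0, min(1.0, -dot(u, v) / vv))
        if step == 0.0:
            break  # no descent direction left: c and d are optimal
        c = [ci + step * (cn - ci) for ci, cn in zip(c, c_new)]
        d = [di + step * (dn - di) for di, dn in zip(d, d_new)]
    return c, d

class_pos = [(0.0, 0.0), (0.0, 2.0)]
class_neg = [(2.0, 1.0), (3.0, 0.0), (3.0, 2.0)]
c, d = closest_hull_points(class_pos, class_neg)
print(c, d)
```

Because every update moves toward a data point by a step in [0, 1], c and d always remain convex combinations of their class members, which matches the constraints on the coefficients in the quadratic program. A dedicated QP solver would normally be preferred for larger data sets.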
Best Linear Separator: Supporting Plane Method
Supporting planes: $x \cdot w = b + 1$ and $x \cdot w = b - 1$
Maximize the distance between the two parallel supporting planes.
Distance = “Margin” = $\frac{2}{\lVert w \rVert}$
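The margin formula can be checked numerically. The sketch below computes 2/||w|| directly and also as the distance between the two parallel planes x.w = b + 1 and x.w = b - 1; the values of w and b are illustrative.

```python
import math

def margin(w):
    """Distance between the planes x.w = b + 1 and x.w = b - 1."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def plane_distance(w, offset_a, offset_b):
    """Distance between parallel planes x.w = offset_a and x.w = offset_b."""
    return abs(offset_a - offset_b) / math.sqrt(sum(wi * wi for wi in w))

w, b = (3.0, 4.0), 0.5  # illustrative values, ||w|| = 5
print(margin(w))
print(plane_distance(w, b + 1, b - 1))
```

Both calls give the same number, and the offset b cancels out: the margin depends only on the length of w, so maximizing the margin amounts to minimizing ||w||.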
SVM Method
Border line is nonlinear
SVM Method
Non-linear transformation: use of kernel function
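To illustrate what a kernel function does, the sketch below uses two common kernels chosen for illustration (the slide does not say which kernel is meant here): a degree-2 polynomial kernel, whose value equals an ordinary inner product after an explicit non-linear map, and a Gaussian (RBF) kernel. The `gamma` default is an arbitrary assumption.

```python
import math

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: K(x, z) = (x.z)^2."""
    return dot(x, z) ** 2

def poly2_features(x):
    """Explicit map for 2-D input: (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def rbf_kernel(x, z, gamma=0.5):
    """Gaussian (RBF) kernel exp(-gamma * ||x - z||^2); gamma is a free parameter."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

x, z = (1.0, 2.0), (3.0, 4.0)
k_direct = poly2_kernel(x, z)                             # computed in input space
k_mapped = dot(poly2_features(x), poly2_features(z))      # computed in mapped space
print(k_direct, k_mapped, rbf_kernel(x, z))
```

The two polynomial values agree up to rounding, so a linear border in the mapped three-dimensional space corresponds to a non-linear border in the original two-dimensional space, without the map ever having to be computed during training.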
SVM for Classification of Drugs
How to represent a drug?
• Each structure is represented by a specific feature vector assembled from structural and physico-chemical properties:
– Simple molecular properties (molecular weight, no. of rotatable bonds, etc.; 18 in total)
– Molecular connectivity and shape (28 in total)
– Electro-topological state polarity (84 in total)
– Quantum chemical properties (electric charge, polarizability, etc.; 13 in total)
– Geometrical properties (molecular size vector, van der Waals volume, molecular surface, etc.; 16 in total)
J. Chem. Inf. Comput. Sci. 44, 1630 (2004); J. Chem. Inf. Comput. Sci. 44, 1497 (2004); Toxicol. Sci. 79, 170 (2004).
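Assembling the feature vector amounts to concatenating the five descriptor groups in a fixed order. The sketch below uses the group sizes listed on the slide; the short group names and the zero placeholder values are assumptions made for illustration, and real descriptors would be computed from the chemical structure.

```python
# Group sizes as listed on the slide; the short names are labels chosen for this sketch.
GROUP_SIZES = {
    "simple": 18,
    "connectivity_shape": 28,
    "electrotopological": 84,
    "quantum_chemical": 13,
    "geometrical": 16,
}

def assemble_feature_vector(groups):
    """Concatenate descriptor groups in a fixed order, checking each group's length."""
    vector = []
    for name, size in GROUP_SIZES.items():
        values = groups[name]
        if len(values) != size:
            raise ValueError(f"{name}: expected {size} descriptors, got {len(values)}")
        vector.extend(values)
    return vector

# Placeholder values only; real descriptors would be computed from the structure.
placeholder = {name: [0.0] * size for name, size in GROUP_SIZES.items()}
feature_vector = assemble_feature_vector(placeholder)
print(len(feature_vector))
```

With the listed group sizes the vector has 18 + 28 + 84 + 13 + 16 = 159 components. Keeping the order fixed matters because the SVM treats each position as the same descriptor for every compound.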
SVM Feature Selection
CACO2 - 718 descriptors, average of 10 models
[Figure: scatter plot of predicted RT (min) against observed RT (min); both axes run from -8 to -3.]
Test Q2 = 0.7073
Q2 is the MSE scaled by the variance: Q2 = (mean square error) / (true variance)
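The Q2 definition can be written out as a short function. This is a sketch assuming that "true variance" means the population variance of the observed values (dividing by n); the observed and predicted numbers are made up for illustration.

```python
def q2(observed, predicted):
    """Mean square error divided by the variance of the observed values."""
    n = len(observed)
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n
    mean = sum(observed) / n
    variance = sum((o - mean) ** 2 for o in observed) / n  # population variance (assumption)
    return mse / variance

observed = [-8.0, -6.0, -4.0]   # illustrative values
predicted = [-7.0, -6.0, -5.0]  # illustrative values
score = q2(observed, predicted)
baseline = q2(observed, [-6.0, -6.0, -6.0])  # always predicting the mean
print(score, baseline)
```

With this definition smaller is better: a perfect model scores 0, and a model that always predicts the mean of the observed values scores 1.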
Feature Selection
Using a subset of descriptors might greatly improve results.
• Do feature selection using linear SVM with 1-norm regularization.
[Figure: 1-norm versus 2-norm regularization.]
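Feature Selection via Sparse SVM/LP
• Construct linear $\nu$-SVM using 1-norm LP:
$$
\min_{w, b, z, z^*, \epsilon} \; \lVert w \rVert_1 + C \sum_{i=1}^{\ell} (z_i + z_i^*) + C \nu \epsilon
$$
$$
\text{s.t.} \quad x_i \cdot w - b - y_i \le \epsilon + z_i, \quad y_i - x_i \cdot w + b \le \epsilon + z_i^*, \quad z_i, z_i^*, \epsilon \ge 0, \; i = 1, \ldots, \ell
$$
• Pick best $C$, $\nu$ for SVM.
• Keep descriptors with nonzero coefficients, $|w_i| > 0$.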
Bagged Feature Selection
• Partition the training data into a training set and a validation set.
• Run the linear SVM algorithm for feature selection to obtain a linear regression model.
• Repeat B times.
• Bag the B models and obtain a subset of features.
Make 20 models of the form
$$
w \cdot x - b = w_1 x_1 + w_2 x_2 + \cdots + w_{718} x_{718} + w_r r - b
$$
with only a few $w_i \ne 0$, where $r$ is a random variable.
Keep attributes with $|w_i| > |w_r|$.
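The selection rule can be sketched as below. The slide does not say how the weights of the bagged models are combined, so this example assumes that the mean absolute weight over the models is compared with that of the random variable r; the function name, descriptor names, and weights are all hypothetical.

```python
def select_features(names, weight_vectors, probe="r"):
    """Keep features whose mean |w_i| over the bagged models exceeds that of the probe."""
    n_models = len(weight_vectors)
    mean_abs = {
        name: sum(abs(w[j]) for w in weight_vectors) / n_models
        for j, name in enumerate(names)
    }
    threshold = mean_abs[probe]  # weight earned by a purely random descriptor
    return [name for name in names if name != probe and mean_abs[name] > threshold]

# Hypothetical weights from three sparse linear models; last column is the random variable r.
names = ["x1", "x2", "x3", "r"]
weight_vectors = [
    [0.9, 0.0, 0.4, 0.1],
    [1.1, 0.05, 0.0, 0.2],
    [1.0, 0.0, 0.5, 0.0],
]
selected = select_features(names, weight_vectors)
print(selected)
```

The random variable acts as a noise floor: a descriptor that does not earn more weight than pure noise is unlikely to carry real signal, so it is dropped.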
Bagged SVM (RBF)
CACO2 - 31 descriptors
[Figure: scatter plot of predicted RT (min) against observed RT (min); both axes run from -8 to -3.]
Test Q2 = 0.134
Starplot Caco2 - 31 Descriptors
[Figure: star plot of the 31 selected descriptors.]
ABSDRN6, a.don, KB54, SMR.VSA2, BNP8, DRNB10, KB11, PEOE.VSA.FPPOS, ANGLEB45, PIPB53, DRNB00, PEOE.VSA.4, SlogP.VSA6, apol, ABSFUKMIN, PIPB04, PEOE.VSA.FPOL, PIPMAX, BNPB50, BNPB21, PEOE.VSA.FHYD, PEOE.VSA.PPOS, EP2, SlogP.VSA9, ABSKMIN, PEOE.VSA.FNEG, BNPB31, FUKB14, pmiZ, SIKIA, SlogP.VSA0
Chemistry In/Out Modeling
[Flowchart with the following boxes: Data + Descriptors; Feature Selection; Visualize Features; Assess Chemistry; Chemistry Interpretation; Construct SVM Nonlinear Model; SVM Model; Test Data; Predict Bioactivities.]
Bagged SVM (RBF)
CACO2 - 15 descriptors
[Figure: scatter plot of predicted RT (min) against observed RT (min); both axes run from -8 to -3.]
Test Q2 = 0.166
CACO2 – 15 Variables
[Figure: star plot of the 15 selected descriptors.]
a.don, KB54, SMR.VSA2, ANGLEB45, DRNB10, ABSDRN6, PEOE.VSA.FPPOS, DRNB00, PEOE.VSA.FNEG, ABSKMIN, SIKIA, pmiZ, BNPB31, FUKB14, SlogP.VSA0
Chemical Insights
• Hydrophobicity: a.don
• Size and shape: ABSDRN6, SMR.VSA2, ANGLEB45, pmiZ. Large is bad. Flat is bad. Globular is good.
• Polarity: PEOE.VSA.FPPOS, PEOE.VSA.FNEG. Negative partial charge is good.
Corresponds to conventional wisdom (rule of 5).
Hybrid TAE/SHAPE
• Shape is an important overall factor:
– DRNB10, DRNB00: del rho dot N
– BNP31: bare nuclear potential
– KB54: kinetic energy descriptors; very large lipophilic molecules don't work
– FUKB14: Fukui surface
• Interpretations are difficult.
• Point to chemistry challenges/hypotheses.
Final SVM Approach
• Construct a large set of descriptors.
• Perform feature selection:
– Sensitivity analysis or SVM-LP
• Construct many SVM models:
– Optimize using QP or LP
– Evaluate by validation set or leave-one-out
– Select best models by grid or pattern search
• Bag the best k models to create the final function.
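The last two steps (evaluate on a validation set, then bag the best k models) can be sketched as follows. The three lambda functions are hypothetical stand-ins for trained SVM regression models, averaging is assumed as the bagging rule, and the validation data are made up.

```python
def mse(model, xs, ys):
    """Mean square error of a model on a validation set."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def bag_best(models, xs_val, ys_val, k):
    """Average the k models with the lowest validation-set error."""
    best = sorted(models, key=lambda m: mse(m, xs_val, ys_val))[:k]
    return lambda x: sum(m(x) for m in best) / len(best)

# Hypothetical stand-ins for trained SVM regression models.
models = [lambda x: 2.0 * x, lambda x: 2.0 * x + 1.0, lambda x: 0.0]
xs_val, ys_val = [0.0, 1.0, 2.0], [0.5, 2.5, 4.5]  # illustrative validation set
final = bag_best(models, xs_val, ys_val, k=2)
print(final(3.0))
```

Here the two linear models err in opposite directions on the validation set, so their average predicts better than either one alone, which is the usual motivation for bagging.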
Drug Discovery Results (LOO)

Data      # Samples   # Var. (Full)   # Var. (FS, Avg)   Q2 (Full)   Q2 (FS)
Caco2     27          713             41                 0.33        0.29
Barrier   62          569             51                 0.31        0.28
HIV       64          561             17                 0.46        0.40
Cancer    46          362             34                 0.50        0.16
LCCK      66          350             69                 0.40        0.37
Aquasol   197         525             57                 0.08        0.06
SVM-based drug design and property prediction software
Useful for inhibitor/activator/substrate prediction, drug safety and pharmacokinetic prediction.
[Figure: workflow answering "Which class does your drug belong to?". The chemical structure of your drug is input either through the internet (http://jing.cz3.nus.edu.sg/cgi-bin/svmprot.cgi) or on a local machine (Options 1 and 2). The structure is sent to a computer loaded with SVMProt, which holds a support vector machine classifier for every drug class. The identified classes give the drug designed or the property predicted.]
J. Chem. Inf. Comput. Sci. 44, 1630 (2004); J. Chem. Inf. Comput. Sci. 44, 1497 (2004); Toxicol. Sci. 79, 170 (2004).
SVM Drug Prediction Results
Protein inhibitor/activator/substrate prediction:
• 86% of 129 estrogen receptor activators and 84% of 101 non-activators correctly predicted.
• 81% of 116 P-glycoprotein substrates and 79% of 85 non-substrates correctly predicted.
Drug toxicity prediction:
• 97% of 102 TdP+ and 84% of 243 TdP- agents correctly predicted.
• 73% of 229 genotoxic and 93% of 631 non-genotoxic agents correctly predicted.
Pharmacokinetics prediction:
• 95% of 276 BBB+ and 82% of 139 BBB- agents correctly predicted.
• 90% of 131 human intestine absorption and 80% of 65 non-absorption agents correctly predicted.
J. Chem. Inf. Comput. Sci. 44, 1630 (2004); J. Chem. Inf. Comput. Sci. 44, 1497 (2004); Toxicol. Sci. 79, 170 (2004).
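The per-class rates can be combined into a single overall accuracy by weighting each rate with its class size. The sketch below does this for the BBB figures quoted on the slide; since the reported percentages are rounded, the result is approximate.

```python
def overall_accuracy(pos_rate, n_pos, neg_rate, n_neg):
    """Overall fraction correct from per-class rates and class sizes."""
    correct = pos_rate * n_pos + neg_rate * n_neg
    return correct / (n_pos + n_neg)

# BBB figures from the slide: 95% of 276 BBB+ and 82% of 139 BBB-.
bbb = overall_accuracy(0.95, 276, 0.82, 139)
print(round(bbb, 3))
```

The overall figure sits closer to the rate of the larger class, which is why the two per-class rates are more informative than a single accuracy when the classes are unbalanced, as in the genotoxicity data (229 versus 631 agents).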