© 2009 Optibrium Ltd. Optibrium™, StarDrop™, Auto-Modeler™ and Glowing Molecule™ are trademarks of Optibrium Ltd.
Rasmus Leth*, Peter Hunt, Jonathan Tyzak, Matthew Segall
UKQSAR/PCF Meeting, Stevenage - 15th Mar 2016
WhichP450: Predicting which CYP450 isoforms are involved in the metabolism of a xenobiotic
© 2016 Optibrium Ltd.
Outline
• Why do we need this..?
• Methodology
• Descriptors
• Results
• Conclusions
Why
• Various CYP isoforms
− Different active site requirements
− Molecule orientations differ
− Hence different oxidative metabolite profiles are possible
• Possibility of drug-drug interactions (DDIs)
• Possibility of effects from polymorphism
• Overall toxicity profile may differ
• Different databases
− Expand the list of isoforms covered
− Use substrate data only, not inhibitor data
− Comparison with WhichCYP web-based tool.
Pockets for PDB entries 4K9T (hCYP3A4 - purple) & 3E6I (hCYP2E1 - blue)
Methodology
• Utilised the literature data collected for our Regioselectivity models
• Annotated those molecules not only with where they are metabolised but also by which CYP, plus a personal judgement of whether that CYP is a MAJOR or MINOR metaboliser
• 484 molecules were used in total, with a test set of 196 molecule/isoform pairs
− A molecule, and site, can be a substrate (major or minor) for more than one isoform
• Assumes that a test compound is a substrate & predicts which P450 from a list of 7 isoforms 2D6, 2C8, 2C9, 1A2, 2E1, 2C19, 3A4
− Not a modelling exercise that predicts if a molecule is going to be a CYP substrate or not
• Used SVM as the statistical method, producing a single multi-class model with probabilities for each isoform.
− Builds n(n-1)/2 binary classifiers (21 for n = 7 isoforms), which then vote to determine the class among the n classes
− SVM package is ‘e1071’ (an R interface to LIBSVM)
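The one-vs-one scheme described above can be sketched in a few lines of Python. This is a minimal illustration of plain vote counting only, not the e1071/LIBSVM implementation; the `vote` function and the toy pairwise outcomes are hypothetical, and how the package turns pairwise results into per-isoform probabilities is not shown here.

```python
from collections import Counter
from itertools import combinations

ISOFORMS = ["2D6", "2C8", "2C9", "1A2", "2E1", "2C19", "3A4"]

# One binary classifier per unordered pair of classes: n*(n-1)/2 of them.
PAIRS = list(combinations(ISOFORMS, 2))


def vote(pairwise_winners):
    """Rank classes by how many pairwise contests each one wins.

    pairwise_winners maps each (class_a, class_b) pair to the winner.
    """
    tally = Counter({isoform: 0 for isoform in ISOFORMS})
    for pair, winner in pairwise_winners.items():
        if winner not in pair:
            raise ValueError(f"{winner} is not in {pair}")
        tally[winner] += 1
    # Sort by votes, highest first; ties keep the ISOFORMS order (stable sort).
    return sorted(ISOFORMS, key=lambda iso: -tally[iso])


# Hypothetical outcome: 3A4 wins every contest it takes part in and,
# otherwise, the first-listed class of each pair wins.
toy_winners = {pair: ("3A4" if "3A4" in pair else pair[0]) for pair in PAIRS}
ranking = vote(toy_winners)
print(len(PAIRS), ranking[:3])
```

With seven isoforms the sketch builds 21 pairs, and the ranking it returns is the kind of ordered list that a top-k evaluation can consume.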
SVM – Support Vector Machine
• Machine learning algorithm
• Applied to classification or regression
• Numerous applications in chemistry (discrimination between ligands, QSAR, text mining, etc.)
[Figures: maximum separation hyperplane; linear kernel vs polynomial kernel]
Descriptor Fingerprints
Describing molecular features in binary or numerical language (01000100…01001000)
• Atom Pairs (acetone atom pairs)
− CX – (2) – C.X3
− CX – (2) – C.X3
− C.X3 – (2) – O.X1
− CX – (3) – CX
− CX – (3) – O.X1
− CX – (3) – O.X1
• Morgan Radius (considering atom 1 in benzoic acid amide)
− [Figure: atom environments at Radius 1, Radius 2, Radius 3]
• Topological Torsions
− (NPI-TYPE-NBR)-(NPI-TYPE-NBR)-(NPI-TYPE-NBR)-(NPI-TYPE-NBR)
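The acetone atom pairs listed above can be reproduced with a small, dependency-free sketch. It is an illustration only, not the RDKit or StarDrop implementation. It assumes each atom type is the element symbol plus its heavy-atom neighbour count (so a methyl carbon is written `C.X1` here), and that the number in brackets counts the atoms on the shortest path, end atoms included.

```python
from collections import Counter, deque
from itertools import combinations

# Heavy-atom graph of acetone, CH3-C(=O)-CH3: atom index -> (element, neighbours).
ACETONE = {
    0: ("C", [1]),        # methyl carbon
    1: ("C", [0, 2, 3]),  # carbonyl carbon
    2: ("O", [1]),        # carbonyl oxygen
    3: ("C", [1]),        # methyl carbon
}


def atom_type(mol, idx):
    """Element symbol plus heavy-atom neighbour count, e.g. 'C.X3'."""
    element, neighbours = mol[idx]
    return f"{element}.X{len(neighbours)}"


def bond_distances(mol, start):
    """Breadth-first search: shortest bond count from start to every atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nbr in mol[current][1]:
            if nbr not in dist:
                dist[nbr] = dist[current] + 1
                queue.append(nbr)
    return dist


def atom_pairs(mol):
    """Count (type, path length in atoms, type) triples over all atom pairs."""
    counts = Counter()
    for a, b in combinations(sorted(mol), 2):
        n_atoms = bond_distances(mol, a)[b] + 1  # atoms on the path, ends included
        type_a, type_b = sorted((atom_type(mol, a), atom_type(mol, b)))
        counts[(type_a, n_atoms, type_b)] += 1
    return counts


pairs = atom_pairs(ACETONE)
for (type_a, n_atoms, type_b), count in sorted(pairs.items()):
    print(f"{type_a} - ({n_atoms}) - {type_b}  x{count}")
```

Four heavy atoms give six pairs in total. A count-based fingerprint would keep these counts, whereas a binary fingerprint would record only the presence of each pair type.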
Descriptors
• RDKit code used to generate fingerprints
− via KNIME or Python protocol – binary fingerprints
− ((5 FPs × 3 lengths) + 7 other FPs) × 5 oversampling methods = 110 SVM models per run
• StarDrop fingerprint – a frequency/count based fingerprint with a ‘Harlequin’ nature
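The model count per run can be checked by enumerating the combinations. The fingerprint and oversampling labels below are placeholders invented for this sketch; the slides give the counts but not this exact breakdown.

```python
from itertools import product

# Placeholder labels only: the slide gives the counts, not this exact breakdown.
variable_length_fps = [f"fp_{i}" for i in range(1, 6)]      # 5 FPs ...
lengths = [256, 512, 1024]                                  # ... at 3 lengths each
other_fps = [f"other_fp_{i}" for i in range(1, 8)]          # 7 other FPs
oversampling = [f"oversampling_{i}" for i in range(1, 6)]   # 5 oversampling methods

# Every descriptor set is either a (fingerprint, length) pair or a fixed-length FP.
descriptor_sets = list(product(variable_length_fps, lengths)) + other_fps

# One SVM model per (descriptor set, oversampling method) combination.
runs = list(product(descriptor_sets, oversampling))
print(len(descriptor_sets), len(runs))
```

This gives 22 descriptor sets and 110 models, matching the figure quoted above.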
RDKit via KNIME (v2.11) | RDKit via Python (v2.7) | StarDrop (v6.2)
Atom Pairs (AP) {256, 512, 1024 bit} | - | StarDrop {257 descriptors}
Topological Torsions (TT) {256, 512, 1024 bit} | - | -
Morgan radius 2 & 3 {256, 512, 1024 bit} | Morgan radius 2 & 3 {256 bit only} | -
Feature Morgan radius 2 & 3 {256, 512, 1024 bit} | Feature Morgan radius 2 & 3 {256 bit only} | -
RDKit topological {256, 512, 1024 bit} | RDKit topological {256 bit only} | -
- | MACCS {167 descriptors} | -
Results
• SVM model produces an ordered list of probabilities for all 7 isoforms
• Measure how successful a prediction is by seeing if a MAJOR isoform is one of the top-k guesses
• Report the % of the 196 molecule/isoform test set that is predicted correctly
− 140 unique compounds in test set
• Also calculated a “random” prediction and a “guided random” biased by the known proportion of compounds that each isoform metabolises
3A4 = 0.35, 2D6 = 0.23, 2C9 = 0.15, 1A2 = 0.10, 2C19 = 0.08, 2C8 = 0.06, 2E1 = 0.04
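The top-k success measure and a "guided random" baseline might look like the sketch below. How the random baselines were actually generated is not stated on the slides, so the weighted sampling without replacement used here is an assumption, as are the function names and the two-entry toy test set.

```python
import random

# Proportions as given on the slide (they sum to 1.01 because of rounding).
PROPORTIONS = {"3A4": 0.35, "2D6": 0.23, "2C9": 0.15, "1A2": 0.10,
               "2C19": 0.08, "2C8": 0.06, "2E1": 0.04}


def top_k_success(rankings, major_isoforms, k):
    """Fraction of test entries whose MAJOR isoform is among the top-k guesses."""
    hits = sum(major in ranking[:k]
               for ranking, major in zip(rankings, major_isoforms))
    return hits / len(major_isoforms)


def guided_random_ranking(rng, weights=PROPORTIONS):
    """Order all isoforms by weighted sampling without replacement (assumed scheme)."""
    remaining = dict(weights)
    ranking = []
    while remaining:
        names = list(remaining)
        pick = rng.choices(names, weights=[remaining[n] for n in names])[0]
        ranking.append(pick)
        del remaining[pick]
    return ranking


rng = random.Random(0)
# Toy test set: two entries with hypothetical MAJOR isoforms and model rankings.
majors = ["3A4", "2E1"]
model_rankings = [["3A4", "2D6", "2C9"], ["2D6", "3A4", "2E1"]]
print(top_k_success(model_rankings, majors, k=1))  # 1 of 2 entries hit
print(top_k_success(model_rankings, majors, k=3))  # 2 of 2 entries hit
print(guided_random_ranking(rng))
```

An unguided "random" baseline would follow from passing equal weights for all seven isoforms.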
Results for Morgan 256bit FPs in detail (KNIME)
• % correct prediction of a MAJOR isoform
− Top-1 (blue), Top-2 (orange), or Top-3 (grey) criteria
• The models produced perform much better than random or educated guess (mostly)
• Incorporation of the MINOR isoform data has a dramatic deleterious effect on success
− Could be considered as ‘noise’ to the MAJOR signal
− Generally the Morgan FPs (radius 2 & 3) suffer the most
[Chart: % correct for the FeatM and Morgan fingerprints and the random predictions]
Results for other 256bit FPs in detail (KNIME)
• MINOR noise doesn’t affect the RDKit or TT fingerprints
− RDKit FP is the poorest for Top-2 or Top-3 performance
− TT is the only FP that out-performs random for all points & all criteria
• Oversampling helps to overcome this noise and helps the Top-1 prediction
− Not so for the Top-2 or Top-3 criteria
• StarDrop FP is the best performer, with ~85% success for the Top-2 criterion
− This is a slightly longer FP (257 descriptors) and it is frequency-based rather than binary
− Hence, do longer FPs help?
[Chart: % correct for the RDKit, TT, AP and StarDrop fingerprints]
Results comparison of 512 & 1024bit FPs (KNIME)
• 512-bit fingerprints differ little from 256-bit
− although Feature Morgan radius 2 performance improves
• Morgan fingerprints are still the most affected by the MINOR ‘noise’
− RDKit now affected
• Longer FPs: 1024-bit helps the Morgan FPs
− 512 or 1024 bits for the APs helps to reduce the effect of the MINOR noise
− Brings the AP & Feature Morgan FP up to StarDrop FP performance levels
[Chart: % correct for the AP, TT, RDKit, FeatM3, Morgan3, FeatM2, Morgan2 and StarDrop fingerprints]
Results summary comparison for each FP
• Feature Morgan radius 2: tolerates the MINOR noise better in going from 256 to 512 bits, but no further gain at 1024
• Feature Morgan radius 3: improvements seen only at the 1024-bit length, not before
• Morgan radius 2: some improvement in MINOR noise tolerance from 256 to 512 and on to 1024 bits, but no effect on performance with any of the other oversampling methods
• Morgan radius 3: similar to the Feature Morgan effect, with improvements seen only at the 1024-bit level
• Topological Torsions: virtually no effect on performance as FP length increases
• Atom Pairs: tolerance of the MINOR noise improves from 256 to 512 bits but not at 1024, and only in Top-2 or Top-3 predictions, not in Top-1
• RDKit: MINOR noise tolerance gets worse in moving away from 256 bits whilst other success improves (no difference between the 512- and 1024-bit FPs)
Results comparison KNIME with Python versions
• MACCS gave >90% success in Top-2 criterion with inclusion of MINOR data & oversampling
• Surprising for such a simple descriptor
• Performance was generally better with the Python-derived FPs, although they should have been equivalent to the KNIME ones
• The FP was reversed in one generation method compared to the other
− i.e. Bit 1 to Bit 256 in one method corresponded to Bit 256 to Bit 1 in the other
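To make the reversal concrete, here is a minimal sketch with two made-up 8-bit fingerprints. Reversing the bit order is a permutation of the columns, so inner products and Tanimoto similarities between any two compounds are unchanged. A change in results after reversal is therefore not explained by the similarity values themselves.

```python
def reverse_bits(fp):
    """Return the fingerprint with its bit order reversed (Bit1 <-> BitN)."""
    return fp[::-1]


def dot(a, b):
    """Inner product of two equal-length bit vectors (the linear kernel)."""
    return sum(x * y for x, y in zip(a, b))


def tanimoto(a, b):
    """Tanimoto similarity of two binary fingerprints."""
    common = dot(a, b)
    union = sum(a) + sum(b) - common
    return common / union if union else 1.0


# Two toy 8-bit fingerprints (made up for illustration).
fp_a = [0, 1, 0, 0, 0, 1, 0, 0]
fp_b = [0, 1, 0, 0, 1, 0, 0, 0]

rev_a, rev_b = reverse_bits(fp_a), reverse_bits(fp_b)
print(dot(fp_a, fp_b), dot(rev_a, rev_b))
print(tanimoto(fp_a, fp_b), tanimoto(rev_a, rev_b))
```

The same check applies to any consistent reordering of columns, provided the same permutation is applied to every compound.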
[Chart: % correct for the RDKit, FeatM3, Morgan3, FeatM2, Morgan2 and MACCS fingerprints]
Results reversing Python fingerprints
• Specifically reversing the order of the bits in the Python-derived FP reduces the predictive ability of the system
− With the ‘e1071’ SVM package
• So reversing the KNIME-derived FPs should be beneficial
[Chart: % correct for the RDKit, FeatM3, Morgan3, FeatM2, Morgan2 and MACCS fingerprints]
Results comparison of KNIME forward with reversed versions
• Reversed versions of the KNIME FPs didn’t produce consistently better results
• Fingerprints shown for illustration
− Feature Morgan 1024bit,
− RDKit 256bit,
− StarDrop
• Why the difference..?
− Different order of compounds in each training set
[Chart: % correct for RDKit, RDKit rev, FeatM2, FeatM2 rev, StarDrop and StarDrop rev]
Results comparison of KNIME forward, reversed, & reordered versions
• The reordered descriptor files gave a 0-5% improvement in predictive success
− depending on the fingerprint type and the length
• Reversing AND reordering did not reach the predictive success of the MACCS FP
• No consistent pattern for the improvement
[Chart: % correct for AP256, AP512, AP1024, TT256, TT512, TT1024 and MACCS]
Our MACCS model vs WhichCYP web model
• The test set was predicted using the WhichCYP version 1.2 web resource
− http://www.farma.ku.dk/whichcyp/index.php
− libSVM and CDK libraries used
− Individual two-category (binary) predictions for 1A2, 2C9, 2C19, 2D6, 3A4
− Built on inhibition data rather than substrate data
• The predictions were converted into probabilities
− one CYP probability was 1.0, two CYPs were 0.5 each…
• Results show that our SVM with MACCS fingerprint performs very well by comparison.
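The conversion of WhichCYP's binary predictions into probabilities can be sketched as below. The slide states only the first two cases (1.0 for one CYP, 0.5 each for two). Continuing that pattern as an equal share of 1/n, and returning all zeros when no isoform is predicted, are assumptions of this sketch, as is the function name.

```python
WHICHCYP_ISOFORMS = ["1A2", "2C9", "2C19", "2D6", "3A4"]


def to_probabilities(predicted_positive):
    """Share probability 1.0 equally among the isoforms predicted positive.

    Assumption: the slide's pattern (1.0 for one CYP, 0.5 each for two)
    continues as 1/n for n positives; zero positives gives all zeros.
    """
    positives = [iso for iso in WHICHCYP_ISOFORMS if iso in predicted_positive]
    share = 1.0 / len(positives) if positives else 0.0
    return {iso: (share if iso in positives else 0.0)
            for iso in WHICHCYP_ISOFORMS}


print(to_probabilities({"3A4"}))
print(to_probabilities({"2D6", "3A4"}))
```

Ties between equally weighted isoforms would still need a tie-breaking rule before a top-k comparison; none is given on the slides.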
[Chart: % correct for WhichCYP and MACCS]
Conclusions
• MACCS keys perform surprisingly well
− although other fingerprints, e.g. StarDrop or Atom Pairs, are good performers
• SVM package appears to have some sensitivity to order of columns and/or order of rows
• Oversampled MAJOR data can provide better models for the most stringent Top-1 criterion with some fingerprints (e.g. Atom Pairs).
− Mixing MAJOR & MINOR data equally simply provides noise
− Not all descriptor fingerprints are sensitive to the ‘MINOR’ noise.
• Fingerprint length (256, 512 or 1024 bits) – longer was not always better
− Morgan FP performances benefitted from longer fingerprints.
Acknowledgements
• Hepatic and Cardio Toxicity Systems modelling
− The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 602156