Quantifying Degree of Aromaticity From Structural Features
David J. Ponting,
Ruud van Deursen,
Martin A. Ott
Why model aromaticity?
• Not all aromatic systems are equal!
  • Furans can undergo Diels-Alder reactions [1,2]
• Direct effects on in vivo behaviour, e.g. toxicity [3,4]
1: LaPorte et al (2013), J. Org. Chem., 78, 167-174
2: Jursic (1998), J. Mol. Struct. (TheoChem), 454, 105-116
3: Opinion on Dihydroxyindole, SCCP (2006)
4: Chichirau et al (2005), Free Rad. Biol. Med., 38, 344-355
Quantifying Aromaticity
HOMED model [1]
• Delocalisation can be thought of as ‘smearing-out’ bonds [2]
• C-C: Rd double (ethene) ≈ 1.3288 Å, Rs single (ethane) ≈ 1.5300 Å, Ra aromatic (benzene) ≈ 1.3943 Å
• Compare bond lengths (Ri) with ideal reference compounds [1]
• Calculate per-bond and sum for all in ring system
• Normalise against effect of ideal aromatic system (Rs-Ra or Ra-Rd)
• HOMED = $1 - \frac{1}{n_d + n_s} \sum_i \alpha_i (R_a - R_i)^2$
• where $\alpha_i = \frac{n_d + n_s}{n_s (R_a - R_s)^2 + n_d (R_a - R_d)^2}$
• α is therefore constant for a given atom pair in a given ring type
• Reference compound selection is critical!
• Formally applies to all delocalised systems
  • Filter by Hückel’s rule to remove non-aromatics
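The index above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it handles only rings built from C-C bonds, uses the three reference lengths quoted on the slide, and assumes that n_s and n_d are the numbers of single and double bonds in the localised reference structure (the slide does not define them). The function names are hypothetical, and the Hückel check assumes the π-electron count has already been worked out elsewhere.

```python
# Reference C-C bond lengths (in Å) quoted on the slide.
R_SINGLE = 1.5300    # Rs, ethane
R_DOUBLE = 1.3288    # Rd, ethene
R_AROMATIC = 1.3943  # Ra, benzene


def alpha(n_single, n_double, r_s=R_SINGLE, r_d=R_DOUBLE, r_a=R_AROMATIC):
    """Normalisation constant for one atom pair in one ring type."""
    denom = n_single * (r_a - r_s) ** 2 + n_double * (r_a - r_d) ** 2
    return (n_single + n_double) / denom


def homed(bond_lengths, n_single, n_double):
    """HOMED index for a ring made only of C-C bonds (illustrative assumption)."""
    a = alpha(n_single, n_double)
    penalty = sum(a * (R_AROMATIC - r) ** 2 for r in bond_lengths)
    return 1.0 - penalty / (n_single + n_double)


def is_huckel(pi_electrons):
    """True if the pi-electron count satisfies Hückel's 4n + 2 rule."""
    return pi_electrons >= 2 and (pi_electrons - 2) % 4 == 0


benzene = [R_AROMATIC] * 6            # fully delocalised six-membered ring
kekule = [R_SINGLE, R_DOUBLE] * 3     # fully localised alternating ring
print(homed(benzene, 3, 3))           # ideal aromatic geometry
print(homed(kekule, 3, 3))            # localised geometry, close to zero
print(is_huckel(6), is_huckel(4))
```

With these definitions an ideal benzene geometry scores 1 and a fully localised alternating ring scores approximately 0, which is the normalisation the α term is designed to give.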
1: Raczynska et al (2010), Symmetry, 2, 1485-1509
2: Jursic (1998), J. Mol. Struct. (TheoChem), 454, 105-116
Methods
• Reference compounds selected for 29 pairs of atoms
  • Extension of published HOMED method [1]
• References modelled at B3LYP/6-31G*
  • Not the highest accuracy, but the same level of theory as the dataset
• Large dataset of compounds derived from PubChemQC [2]
  • ~4 million structures, with about as many aromatic rings
• Calculate aromaticity for all of these ring systems
1: Raczynska et al (2010), Symmetry, 2, 1485-1509
2: Nakata and Shimazaki (2017), J. Chem. Inf. Model., 57, 1300-1308
Furan and analogues
• Furan is significantly less aromatic than thiophene and pyrrole
• Several distinct sub-classes
  • Compounds with electron-donating groups at the 2-position are less aromatic
  • Compounds with electron-withdrawing groups at the 2-position are more aromatic
  • The 3-position has less effect
  • 2,3-fused systems are typically less aromatic
Furan and analogues
• Electron-withdrawing groups encourage delocalisation
• Electron-donating groups reduce delocalisation
• Fusions typically reduce aromaticity
2-Pyridone and analogues
• 2-Pyridone is more aromatic than both uracil and isocyanuric acid
  • For the latter two, any substitution pattern makes them less aromatic
• Electron-withdrawing groups on nitrogens reduce aromaticity; donating groups increase it
• The 4- and 6-positions have effects analogous to the 2-position in furans
2-Pyridone and analogues
• Electron-donating groups reduce delocalisation
• N-substitution can reduce aromaticity (EWG) or increase it (EDG)
Predicting Aromaticity
• Measuring aromaticity is useful, but requires an accurate geometry
  • X-ray crystal structure
  • Expensive QM calculation
• Can we machine-learn the HOMED index, and predict it from a SMILES string?
Methods
• Fragment molecules to isolate individual rings
  • Keep conjugated substituents
• Fingerprint the structures and group by ring systems
  • Machine learn values for each structure group
Fragmentation
• Assume sp3 carbons interrupt conjugation
  • Remove acyclic bonds between them from the molecule
  • Then remove all acyclic C-C bonds with an sp3 carbon at one end
• Also remove halogens to allow grouping of substructures
• Calculate HOMED indices on the disconnected fragments
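The fragmentation steps above can be sketched on a toy molecular graph. The data model here (atoms as `(element, is_sp3)` tuples, bonds as `(i, j, in_ring)` tuples) and the function name `fragment` are hypothetical and chosen only to keep the example dependency-free; a real workflow would take hybridisation and ring membership from a cheminformatics toolkit.

```python
HALOGENS = {"F", "Cl", "Br", "I"}


def fragment(atoms, bonds):
    """Return connected fragments (sets of atom indices) after the cuts.

    atoms: list of (element, is_sp3) tuples   -- hypothetical toy data model
    bonds: list of (i, j, in_ring) tuples
    """
    def sp3_carbon(i):
        return atoms[i] == ("C", True)

    def carbon(i):
        return atoms[i][0] == "C"

    # Step 1: drop acyclic bonds joining two sp3 carbons.
    kept = [b for b in bonds
            if b[2] or not (sp3_carbon(b[0]) and sp3_carbon(b[1]))]
    # Step 2: drop remaining acyclic C-C bonds with an sp3 carbon at one end.
    kept = [b for b in kept
            if b[2] or not (carbon(b[0]) and carbon(b[1])
                            and (sp3_carbon(b[0]) or sp3_carbon(b[1])))]
    # Step 3: delete halogen atoms and any bonds to them.
    alive = {i for i, (el, _) in enumerate(atoms) if el not in HALOGENS}
    kept = [b for b in kept if b[0] in alive and b[1] in alive]

    # Collect connected components with a simple depth-first search.
    neighbours = {i: set() for i in alive}
    for i, j, _ in kept:
        neighbours[i].add(j)
        neighbours[j].add(i)
    fragments, seen = [], set()
    for start in sorted(alive):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            node = stack.pop()
            if node in comp:
                continue
            comp.add(node)
            stack.extend(neighbours[node] - comp)
        seen |= comp
        fragments.append(comp)
    return fragments


# Benzene ring (atoms 0-5) carrying an ethyl group (6, 7), a chlorine (8)
# and a conjugated formyl group (9 = carbonyl C, 10 = O).
atoms = [("C", False)] * 6 + [("C", True), ("C", True), ("Cl", False),
                              ("C", False), ("O", False)]
bonds = [(i, (i + 1) % 6, True) for i in range(6)]
bonds += [(0, 6, False), (6, 7, False), (3, 8, False),
          (1, 9, False), (9, 10, False)]
print(fragment(atoms, bonds))
```

In this example the ring keeps its conjugated formyl substituent, the ethyl carbons are split off as isolated atoms, and the chlorine is dropped, so the ring fragment would be grouped together with other halogenated analogues.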
[Figure: fragmentation example, with fragments labelled 0.998, "Not Hückel compliant" and 0.990]
Fingerprinting
• Structures with the same conjugated system grouped
  • Combinations of HOMED index taken
  • Only those with >3 instances kept
  • 15065 substituent patterns from 954 different rings
• Variety of fingerprints and hashed sizes tried
  • Lengths from 256 to 16384
  • Morgan Circular (ECFP2/4, both bits and counts)
  • [Extended] Sybyl Atom-Pair
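The grouping and filtering step can be illustrated as follows. This is only a sketch: the string keys standing in for "the same conjugated system" are hypothetical, and the threshold of more than 3 instances is the only number taken from the slide.

```python
from collections import defaultdict


def group_by_system(records, min_exclusive=3):
    """Group HOMED values by conjugated-system key and drop sparse groups.

    records: iterable of (system_key, homed_value) pairs; the keys are
    hypothetical identifiers for a ring plus its conjugated substituents.
    """
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    # Keep only systems seen more than `min_exclusive` times (">3 instances").
    return {k: v for k, v in groups.items() if len(v) > min_exclusive}


records = [
    ("furan/2-CHO", 0.81), ("furan/2-CHO", 0.80),
    ("furan/2-CHO", 0.82), ("furan/2-CHO", 0.81),
    ("furan/2-OMe", 0.70), ("furan/2-OMe", 0.69),
]
kept = group_by_system(records)
print(sorted(kept))
```

Here the first pattern has four instances and is kept, while the second has only two and is discarded.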
[Figure: Performance effect of fingerprint size. Two panels plot value against hologram hash size (256 to 16384): gradient and R^2 (value axis 0.85 to 1), and intercept and RMSE (value axis 0.075 to 0.1)]
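To make the "hologram hash size" concrete, the sketch below folds a list of substructure features into a fixed-length vector, either as counts (a hologram) or as bits (a bitset). It is an assumption-laden illustration: the feature strings are hypothetical stand-ins for circular or atom-pair features, and the SHA-1 based hashing is chosen here only for reproducibility, not because the slides say it was used.

```python
import hashlib


def fold(features, size, counts=True):
    """Fold string features into a vector of length `size`.

    counts=True gives a hologram (occurrence counts per slot);
    counts=False gives a bitset (1 if any feature landed in the slot).
    """
    vec = [0] * size
    for feat in features:
        digest = hashlib.sha1(feat.encode("utf-8")).digest()
        idx = int.from_bytes(digest[:8], "big") % size
        vec[idx] = vec[idx] + 1 if counts else 1
    return vec


features = ["c:o", "c:c", "c:c", "C=O"]   # hypothetical feature identifiers
hologram = fold(features, 256)
bitset = fold(features, 256, counts=False)
print(sum(hologram), sum(bitset))
```

Smaller vectors force more distinct features into the same slot, which is one plausible reason for performance to vary with hash size; the hologram also retains how often a feature occurs, which the bitset discards.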
Learning
• Machine learning was performed in Python
  • ‘First try’ random forest achieved R^2 = 0.98, RMSE = 0.08 as out-of-bag estimates
  • This was due to a cluster of highly aromatic species being well predicted; there was much more scatter elsewhere
• Some aromatic systems can have outliers
  • Often due to ring strain
  • e.g. 3 instances at 0.954, 1 at 0.044: median 0.954, mean 0.727
  • Learned against median, not mean
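The worked numbers on this slide can be checked directly with the standard library; the snippet only reproduces the arithmetic for the quoted example.

```python
from statistics import mean, median

# Four instances of one ring system: three consistent, one strained outlier.
values = [0.954, 0.954, 0.954, 0.044]

print("median:", median(values))
print("mean:  ", mean(values))
```

The median stays at the value shared by the three well-behaved instances, whereas the mean is pulled down to roughly 0.727 by the single outlier, which is the motivation for learning against the median.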
Data transformation
• The HOMED indices are clustered around 0.9-1, with relatively few less-aromatic systems
  • Causes problems for some learning algorithms
• Calculated index was transformed using logistic functions
• Models built on this data, then predictions transformed back
  • i.e. $y_{pred} = \frac{1}{1 + e^{-y_{model}}}$
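A minimal sketch of such a transformation pair is shown below. The back-transformation is the logistic function given on the slide; the forward transformation is assumed here to be its inverse (the logit), and the clipping margin `EPS` is an added assumption to keep values of exactly 1 (or values at or below 0) finite. The function names are hypothetical.

```python
import math

EPS = 1e-6  # assumed clipping margin; not taken from the slides


def to_model_space(y, eps=EPS):
    """Logit transform of a HOMED index, clipped into (0, 1) first."""
    y = min(max(y, eps), 1.0 - eps)
    return math.log(y / (1.0 - y))


def to_homed_space(y_model):
    """Logistic back-transformation: 1 / (1 + exp(-y_model))."""
    return 1.0 / (1.0 + math.exp(-y_model))


for y in (0.5, 0.9, 0.99, 0.999):
    z = to_model_space(y)
    print(y, round(z, 3), round(to_homed_space(z), 6))
```

The logit stretches the crowded 0.9-1 region over a much wider range, so differences between highly aromatic systems carry more weight during training.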
Choosing a fingerprint and metric
• Python allowed rapid model development, optimising:
  • Model choice
  • Model hyperparameters
  • Choice of fingerprint
• 10-fold cross-validation within model selection
• ECFP2 Hologram gave best results
  • Better than ECFP2 bitset, ECFP4 or atom-pair fingerprints
• Very high R^2 for all decent models
  • Makes it a poor metric
• RMSE is affected by a few outliers
• Median Absolute Deviation (MAD)
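The three metrics can be compared on a small made-up example. Two points are assumptions rather than facts from the slides: R^2 is computed as the coefficient of determination, and MAD is read as the median of the absolute prediction errors. The data values are invented purely to show the effect of one outlier.

```python
import math
from statistics import mean, median


def r_squared(y_true, y_pred):
    """Coefficient of determination (assumed definition of R^2)."""
    centre = mean(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - centre) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root-mean-square error."""
    return math.sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


def median_abs_dev(y_true, y_pred):
    """Median of absolute prediction errors (assumed reading of MAD)."""
    return median(abs(t - p) for t, p in zip(y_true, y_pred))


# Invented data: four perfect predictions and one badly predicted outlier.
y_true = [0.95, 0.96, 0.97, 0.98, 0.10]
y_pred = [0.95, 0.96, 0.97, 0.98, 0.60]

print("R^2 :", r_squared(y_true, y_pred))
print("RMSE:", rmse(y_true, y_pred))
print("MAD :", median_abs_dev(y_true, y_pred))
```

A single poor prediction dominates the RMSE, while the median-based measure reflects the typical error on the well-predicted majority.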
[Figure: scatter plot of Predicted against Experimental values]
Choosing a model
• NN and RF give good results
• kNN gives middling performance
• Kernel methods poor
Fingerprint        Model   R2      RMSE     MAD
ECFP2 Hologram     RF      .9934   .07485   .004181
ECFP2 Hologram     NN      .9941   .07396   .004485
ECFP2 Hologram     SVR     .8315   .1538    .009593
ECFP2 Hologram     KNN     .8499   .09749   .006940
ECFP2 Bitset       RF      .9928   .07580   .004346
ECFP4 Hologram     RF      .9932   .07668   .004206
ESybyl Atom Pair   RF      .9914   .08200   .004295
Neural Network or Random Forest?
• Very similar performance
• Random Forest (L) has a slightly tighter well-predicted band
• Neural Network (R) has fewer extreme outliers
[Figure: two scatter plots of Predicted against Experimental values, Random Forest (left) and Neural Network (right)]
Kernel methods perform poorly
[Figure: two scatter plots of Predicted against Experimental values]
• Nearest Neighbours better but not spectacular
[Figure: scatter plot of Predicted against Experimental values]
Conclusions
• The HOMED index allows quantification of aromaticity given an accurate geometry
• The HOMED index provides a numeric response variable for machine learning
• This machine learning method allows us to bypass the need for an expensive geometry and predict degree of aromaticity from a SMILES string
• The predicted degree of aromaticity can be used to better predict chemical behaviour
Lhasa Limited
Granary Wharf House, 2 Canal Wharf
Leeds, LS11 5PS
Registered Charity (290866)
Company Registration Number 01765239
+44(0)113 394 6020
www.lhasalimited.org
Questions?
PCCDB
PubChemQC
Work in progress disclaimer
This document is intended to outline our general product
direction and is for information purposes only, and may not be
incorporated into any contract. It is not a commitment to deliver
any material, code, or functionality, and should not be relied
upon. The development, release, and timing of any features or
functionality described for Lhasa Limited’s products remains at
the sole discretion of Lhasa Limited.