doi.org/10.26434/chemrxiv.7770809.v1
Medicinal Chemistry Database GDBMedChem
Mahendra Awale, Finton Sirockin, Nikolaus Stiefl, Jean-Louis Reymond
Submitted date: 26/02/2019 • Posted date: 27/02/2019
Licence: CC BY-NC-ND 4.0
Citation information: Awale, Mahendra; Sirockin, Finton; Stiefl, Nikolaus; Reymond, Jean-Louis (2019): Medicinal Chemistry Database GDBMedChem. ChemRxiv. Preprint.
The generated database GDB17 enumerates 166.4 billion possible molecules up to 17 atoms of C, N, O, S and halogens following simple chemical stability and synthetic feasibility rules; however, medicinal chemistry criteria are not taken into account. Here we applied rules inspired by medicinal chemistry to exclude problematic functional groups and complex molecules from GDB17, and sampled the resulting subset evenly across molecular size, stereochemistry and polarity to form GDBMedChem as a compact collection of 10 million small molecules. This collection has reduced complexity and better synthetic accessibility than the entire GDB17 but retains higher sp3-carbon fraction and natural product likeness scores compared to known drugs. GDBMedChem molecules are more diverse and very different from known molecules in terms of substructures and represent an unprecedented source of diversity for drug design. GDBMedChem is available for 3D-visualization, similarity searching and for download at http://gdb.unibe.ch.
Medicinal Chemistry Database GDBMedChem
Mahendra Awale,[a] Finton Sirockin,[b] Nikolaus Stiefl[b] and Jean-Louis Reymond*[a]
Abstract: The generated database GDB17 enumerates
166.4 billion possible molecules up to 17 atoms of C, N, O, S
and halogens following simple chemical stability and
synthetic feasibility rules; however, medicinal chemistry
criteria are not taken into account. Here we applied rules
inspired by medicinal chemistry to exclude problematic
functional groups and complex molecules from GDB17, and
sampled the resulting subset evenly across molecular size,
stereochemistry and polarity to form GDBMedChem as a
compact collection of 10 million small molecules.
This collection has reduced complexity and better synthetic
accessibility than the entire GDB17 but retains higher sp3-
carbon fraction and natural product likeness scores
compared to known drugs. GDBMedChem molecules are
more diverse and very different from known molecules in
terms of substructures and represent an unprecedented
source of diversity for drug design. GDBMedChem is
available for 3D-visualization, similarity searching and for
download at http://gdb.unibe.ch.
Keywords: chemical space, drug design, small molecules, medicinal chemistry, virtual screening
1 Introduction
All attempts to estimate the total number of possible organic
molecules or to specifically enumerate them have shown that
synthetic chemistry including medicinal chemistry to date has
barely scratched the surface of the chemical universe, even
when including billions of compounds that can be readily
synthesized by combining known building blocks with known
reactions.[1] In one such attempt we enumerated all possible
molecules up to 11,[2] 13[3] and 17 atoms[4] starting from
mathematical graphs[5] by selecting the corresponding
hydrocarbons for chemically acceptable ring strain and
topologies and introducing unsaturations and heteroatoms
following simple chemical stability and synthetic feasibility
rules (Figure 1). More recently we similarly enumerated all
possible ring systems up to four rings and 30 atoms.[6] The
corresponding generated databases (GDBs) contain almost
exclusively unknown molecules and therefore represent a vast
reservoir of possible innovation.
GDB enumeration focuses on chemistry rules and
does not take specific medicinal chemistry criteria into
account, such as the type and number of functional
groups and the overall structural complexity that would be
compatible with a drug-type molecule.[7] Here we aimed
to define a subset of GDB17 for medicinal chemistry,
named GDBMedChem, by filtering GDB17 using such
criteria. We followed a similar approach to that used for
the fragment database FDB17, [8] which was recently
reported as a fragment-like subset of GDB17 following
fragment-likeness criteria.[9] We present the database
assembly procedure and discuss the resulting
GDBMedChem database in comparison to our previously
reported fragment database FDB17, and to known drugs,
bioactive molecules and natural products up to 17 atoms
from DrugBank, [10] ChEMBL,[11] and the natural products
directory (Table 1).[12] We show that GDBMedChem
represents a vast and diverse source of new molecular
structures for drug design.
Figure 1. GDBMedChem generation workflow. Steps 4) and 5) are discussed in this publication.
Workflow: Graphs (114,304,569,097) → 1) Ring strain → Hydrocarbons (5,422,153) → 2) Unsaturations → Skeletons (1,330,958,530) → 3) Heteroatoms → GDB-17: Molecules (166,443,860,262) → 4) Functional groups and complexity filters → 17.8G set: Molecules (17,804,900,000) → 5) Even sampling → GDBMedChem (10,007,380)
[a] Department of Chemistry and Biochemistry, University
of Bern
Freiestrasse 3, 3012 Bern, Switzerland
*e-mail: [email protected]
[b] Novartis Institutes for Biomedical Research, Basel,
Switzerland
Table 1. Databases discussed in this publication.

| Database | Size | Description |
|---|---|---|
| GDB17 | 166.4 G | Virtually enumerated molecules of up to 17 atoms of C, N, O, S, and halogens |
| 17.8G set | 17.8 G | GDB17 molecules passing filters in Table 2 |
| GDBMedChem | 10 M | GDB17 molecules passing filters in Table 2, evenly sampled across size, heteroatoms and stereocenters |
| 4.6G set | 4.6 G | Fragment-like subset of GDB17 |
| FDB17 | 10 M | Fragment-like subset of GDB17, evenly sampled across size, heteroatoms and stereocenters |
| ChEMBL17 | 105,423 | Compounds with HAC ≤ 17 extracted from ChEMBL 22 |
| DrugBank17 | 2,284 | Approved and experimental drugs with HAC ≤ 17 extracted from DrugBank |
| UNPD17 | 20,302 | Natural products with HAC ≤ 17 extracted from the Universal Natural Product Database (UNPD) |
| ChEMBL | 1.4 M | Compounds with HAC ≤ 50 extracted from ChEMBL 22 |
| DrugBank | 8,299 | Approved and experimental drugs with HAC ≤ 50 extracted from DrugBank |
| ZINC | 15 M | Commercially available compounds from the ZINC 12 database |
2 Results and Discussion
2.1 Selecting GDBMedChem from GDB17
To identify a subset of GDB17 suitable for medicinal
chemistry we applied structural filters considering
problematic functional groups and complexity criteria
(Table 2). Overall, these filters reduced GDB17 by 89 %,
leaving 17.8 billion molecules, referred to as 17.8G set, of
which only 480 million also occurred in our previously
reported fragment-like set of 4.6 billion molecules (4.6G
set),[8] illustrating the very different approach chosen.
While the calculation was performed by iterative steps, we
discuss here the effect of each filter on the entire GDB17
in comparison to other reference databases.
The first set of filters addresses functional groups
(FGs). We remove FGs which are very abundant in
GDB17 due to the combinatorial enumeration but are not
desirable in drugs due to poor chemical or metabolic
stability (amidines, imidates, and terminal esters) or
undesirable reactivity (aldehydes, aziridines, epoxides).
Note that hydrolytically reactive groups such as
anhydrides and acyl chlorides are not present in GDB17.
We also eliminate or cap the number of potentially
problematic FGs for drug design (no aromatic rings larger
than 6 atoms, no Br or I, no halogen on heterocycle,
maximum one nitrile, acetylene or sulfone, maximum two
acyclic esters, amides, or ethers). Overall FG filters
eliminate approximately half of GDB17 but reduce
bioactive molecules from ChEMBL17 and DrugBank17 by
less than 15% and natural products from UNPD17 by
21 % (Table 2, upper part).
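The capping logic of these functional group filters can be sketched in a few lines of Python. The sketch assumes that the occurrences of each functional group have already been counted per molecule (for example by SMARTS matching); the dictionary keys and function names are hypothetical labels chosen for illustration, and only part of the filters in Table 2 is shown.

```python
# Hypothetical caps for a subset of the functional-group filters in Table 2.
# A cap of 0 means the group is not allowed at all.
FG_CAPS = {
    "amidine": 0, "imidate": 0, "aldehyde": 0, "aziridine": 0,
    "epoxide": 0, "terminal_ester": 0,
    "nitrile": 1, "acetylene": 1, "sulfone": 1,
    "acyclic_ester": 2, "amide": 2, "ether": 2,
}

def violated_filters(fg_counts, caps=FG_CAPS):
    """Return the sorted names of all caps exceeded by a molecule.

    fg_counts maps a functional-group label to its number of occurrences;
    labels missing from fg_counts are treated as zero occurrences.
    """
    return sorted(name for name, cap in caps.items()
                  if fg_counts.get(name, 0) > cap)

def passes_fg_filters(fg_counts, caps=FG_CAPS):
    """A molecule passes if it violates none of the caps."""
    return not violated_filters(fg_counts, caps)

# Illustrative counts for two made-up molecules.
print(passes_fg_filters({"ether": 2, "amide": 1}))     # within all caps
print(violated_filters({"nitrile": 2, "aldehyde": 1}))  # exceeds two caps
```

Keeping the caps in a single table makes it easy to report which filter removed a molecule, which is how the per-filter percentages in Table 2 could in principle be tallied.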
Table 2. Filters used for selecting the 17.8G set from GDB17, and percentages of databases that are eliminated by each filter.

| Filter | GDB17 a) | 4.6G set a) | FDB17 | ChEMBL17 | DrugBank17 | UNPD17 |
|---|---|---|---|---|---|---|
| FGs: | | | | | | |
| no amidine | 23.55 | 17.27 | 11.76 | 1.80 | 1.10 | 0.23 |
| no imidate | 12.99 | 0 | 0 | 0.65 | 1.84 | 0.12 |
| no aldehyde | 10.14 | 0 | 0 | 1.22 | 2.32 | 5.65 |
| no aziridine | 5.48 | 0 | 0 | 0.22 | 0.13 | 0.01 |
| no aromatic ring > 6 atoms | 3.52 | 0 | 0 | 0.15 | 0 | 0.46 |
| no Br, I | 2.96 | 0 | 0 | 6.92 | 3.42 | 3.65 |
| ≤ 2 ethers | 2.70 | 1.84 | 0.86 | 0.65 | 0.70 | 2.51 |
| no epoxide | 2.28 | 0 | 0 | 0.39 | 0.44 | 1.97 |
| no terminal ester b) | 1.75 | 0.73 | 0.48 | 2.64 | 1.18 | 6.87 |
| no Cl or F on heterocycle | 1.73 | 0 | 0 | 3.50 | 1.62 | 0.33 |
| ≤ 1 acetylene | 1.37 | 0 | 0 | 0.12 | 0 | 2.73 |
| ≤ 1 nitrile | 0.96 | 0 | 0 | 0.38 | 0.09 | 0.07 |
| ≤ 1 sulfone | 0.25 | 0 | 0 | 0.20 | 0.31 | 0.03 |
| ≤ 2 amides | 0.01 | 0.01 | 0.0067 | 0.075 | 0.04 | 0.0099 |
| ≤ 2 acyclic esters | 0.000010 | 0 | 0 | 0.0038 | 0 | 0.03 |
| All FG filters c) | 52.73 | 19.7 | 13.02 | 14.6 | 11.21 | 21.77 |
| Structural complexity: | | | | | | |
| ≤ 18 avalon density | 34.81 | 47.05 | 38.12 | 9.21 | 6.26 | 3.59 |
| ≤ 1 cyclic tetravalent node | 24.18 | 20.86 | 21.57 | 1.51 | 0.96 | 8.03 |
| ≤ 4 stereocenters | 22.48 | 0 | 0 | 0.72 | 3.11 | 5.46 |
| ≤ 3 bonds in fused ring systems | 17.42 | 16.47 | 15.91 | 1.83 | 0.96 | 2.93 |
| ≤ 3 rings | 6.38 | 0 | 0 | 0.82 | 0.35 | 0.92 |
| All complexity filters d) | 62.07 | 64.83 | 59.11 | 12.43 | 10.64 | 15.6 |
| Polarity: | | | | | | |
| ≤ 0.7 hetero atoms to carbon ratio | 6.15 | 0.10 | 4.78 | 15.60 | 39.82 | 10.03 |
| All filters combined e) | 86.27 | 73.68 | 65.98 | 35.77 | 50.28 | 41.07 |

a) In case of GDB17 and the 4.6G set the calculation was performed on a 50M random subset of molecules. b) Terminal esters are defined as methyl esters, acetates and formates. c) Percentage of compounds eliminated from each database by applying all functional group filters. d) Percentage of compounds eliminated from each database by applying all structural complexity filters. e) Percentage of compounds eliminated from each database by applying all filters.
Secondly, we limit structural complexity by capping the
number of rings, stereocenters, cyclic quaternary centers
and bonds in fused rings. We also apply an overall
complexity filter calculated from the avalon fingerprint as
a limit on the fingerprint density value, defined here as the
number of on-bits in the avalon fingerprint scaled to the
heavy atom count.[13] The cut-off values for these
parameters were selected by analyzing histograms of
these values in GDB17 in comparison to known drugs,
which shows that GDB17 has generally higher values
compared to drug molecules (Figure 2). Accordingly,
these structural complexity reduction criteria eliminate
many GDB17 molecules (62 %) corresponding to complex
polycyclic scaffolds and/or difficult functional group
combinations but have a much smaller reduction effect on
the reference databases ChEMBL17 (12 %), DrugBank17
(11 %) and UNPD17 (16 %) (Table 2, lower part).
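As an illustration of these complexity criteria, the following sketch applies the cut-offs of Table 2 to precomputed descriptors. It assumes that "scaled to the heavy atom count" means division by the heavy atom count, and that the descriptor values (ring count, number of on-bits, and so on) come from a cheminformatics toolkit; the key and function names are hypothetical.

```python
# Cut-off values taken from the structural complexity rows of Table 2.
COMPLEXITY_CAPS = {
    "rings": 3,
    "stereocenters": 4,
    "cyclic_tetravalent_nodes": 1,
    "bonds_in_fused_ring_systems": 3,
    "avalon_density": 18.0,
}

def fingerprint_density(n_on_bits, heavy_atom_count):
    """Number of on-bits per heavy atom (assumed meaning of 'scaled')."""
    if heavy_atom_count <= 0:
        raise ValueError("heavy_atom_count must be positive")
    return n_on_bits / heavy_atom_count

def passes_complexity_filters(descriptors, caps=COMPLEXITY_CAPS):
    """True if every descriptor is at or below its cut-off."""
    return all(descriptors[name] <= cap for name, cap in caps.items())

# Made-up descriptor values for one molecule with 10 heavy atoms.
molecule = {
    "rings": 2,
    "stereocenters": 1,
    "cyclic_tetravalent_nodes": 0,
    "bonds_in_fused_ring_systems": 1,
    "avalon_density": fingerprint_density(n_on_bits=170, heavy_atom_count=10),
}
print(passes_complexity_filters(molecule))
```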
Finally, the heteroatom to carbon ratio is capped at 0.7,
which removes 6 % of GDB17. Although this criterion
eliminates a very significant fraction of ChEMBL17 (36 %),
DrugBank17 (50 %) and UNPD17 (41 %), molecules taken
out by this filter have mostly negative clogP values and
are not desirable from a drug design point of view (Figure
3). Note that many of the filters applied to GDB17 are
size-dependent; however, they do not strongly affect
target-assigned molecules from ChEMBL up to 50 non-hydrogen
atoms,[14] except for the ring count filter (≤ 3 rings), which
eliminates 50-80 % of the compounds depending on the
target class (Table S1).
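The polarity criterion is simple enough to state as code. The sketch below computes the ratio from a list of heavy-atom element symbols; it assumes that every non-carbon, non-hydrogen atom (including halogens) counts as a heteroatom and that molecules without carbon are rejected, neither of which is specified in the text.

```python
def hetero_to_carbon_ratio(elements):
    """Ratio of heteroatoms to carbon atoms for a list of element symbols.

    Assumption: any atom other than C or H counts as a heteroatom.
    Molecules without carbon get an infinite ratio and therefore fail.
    """
    carbons = sum(1 for e in elements if e == "C")
    hetero = sum(1 for e in elements if e not in ("C", "H"))
    if carbons == 0:
        return float("inf")
    return hetero / carbons

def passes_polarity_filter(elements, max_ratio=0.7):
    """Apply the 0.7 cut-off from Table 2."""
    return hetero_to_carbon_ratio(elements) <= max_ratio

# Heavy atoms of nicotine (C10 N2) and glycine (C2 N O2).
nicotine = ["C"] * 10 + ["N"] * 2
glycine = ["C"] * 2 + ["N"] + ["O"] * 2
print(hetero_to_carbon_ratio(nicotine), passes_polarity_filter(nicotine))
print(hetero_to_carbon_ratio(glycine), passes_polarity_filter(glycine))
```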
Despite the functional group and complexity filters
applied, the composition of the 17.8G set was heavily
skewed towards the largest and most complex molecules.
To obtain a more realistic selection for medicinal
chemistry, we sampled this subset across molecular size,
stereochemistry and polarity by randomly selecting a
comparable number of molecules from each of the 244
occupied triplet value bins (heavy atom count, number of
stereocenters, number of heteroatoms) (Figure S1). This
procedure resulted in a database of 10 million
2D-structures as SMILES, defined here as GDBMedChem.
Figure 3. Clog P histogram for the compounds which were
eliminated due to hetero atoms to carbon ratio filter.
Figure 2. Histogram of structural complexity and polarity parameters used during GDBMedChem generation, for GDBMedChem (red),
its parent 17.8G set (magenta), the entire GDB17 (black), the fragment database FDB17 (blue) and its parent 4.6G set (cyan),
ChEMBL17 (green), DrugBank17 (grey) and UNPD17 (orange).
Figure 4. Molecular property histograms for GDBMedChem (red), its parent 17.8G set (magenta), the entire GDB17 (black), the
fragment database FDB17 (blue) and its parent 4.6G set (cyan), ChEMBL17 (green), DrugBank17 (grey) and UNPD17 (orange). In
plot (k) each molecule is assigned to a single category as a function of its ring types in priority order heteroaromatic > aromatic >
heterocyclic > carbocyclic > acyclic.
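The priority rule used for plot (k) can be written as a short lookup. This is only a sketch: it assumes that the ring types present in a molecule have already been determined, and the function name is hypothetical.

```python
# Priority order stated in the caption of Figure 4, highest first.
RING_PRIORITY = ["heteroaromatic", "aromatic", "heterocyclic", "carbocyclic"]

def ring_category(ring_types):
    """Assign a molecule to the highest-priority ring type it contains.

    ring_types is an iterable with the type of each ring in the molecule;
    a molecule without rings is labeled 'acyclic'.
    """
    present = set(ring_types)
    for category in RING_PRIORITY:
        if category in present:
            return category
    return "acyclic"

print(ring_category(["carbocyclic", "heteroaromatic"]))
print(ring_category([]))
```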
2.2 Property analysis
To gain an insight into the composition of GDBMedChem,
we analyzed the distribution of molecules across various
molecular properties in comparison to the 17.8G set from
which it was sampled, the entire GDB17, the related
fragment database FDB17 and its parent 4.6G set, and to
ChEMBL17, DrugBank17, and UNPD17 (Figure 4). Both
molecular size histograms (HAC and MW) show that the
even sampling procedure used to compose
GDBMedChem and FDB17 from their respective larger
17.8G sets and 4.6G sets corrects their highly skewed
distribution towards the largest molecules, resulting in a
distribution closer to known molecules in ChEMBL17,
DrugBank17 and UNPD17 (Figure 4a/b).
The rotatable bond count (RBC) profiles show that
GDBMedChem molecules have similar structural
flexibility compared to known molecules, in contrast to the
fragment database FDB17 and its parent 4.6G set, which
stand out by their low number of rotatable bonds due to the
fragment-likeness rule RBC ≤ 3 (Figure 4c). The HBD,
HBA, clogP and O+N count profiles in GDBMedChem are
also similar to those of known molecules. This effect
results from the even sampling procedure, since the profile of
the parent 17.8G set is clearly different and matches that of GDB17
(Figure 4d, e, f, g).
In terms of synthetic accessibility score,
GDBMedChem and its parent 17.8G set show slightly
lower values compared to GDB17 and are quite similar to
FDB17, reflecting the role of structural complexity filters
(Figure 4i).[15] However, GDBMedChem and FDB17
molecules remain significantly less synthetically
accessible compared to known molecules according to
this score.
The natural product likeness of all GDB databases is
significantly higher than that of DrugBank17 and
ChEMBL17, although it is still lower than for the natural
products themselves in UNPD17 (Figure 4j).[16] The
higher natural product likeness of GDB molecules
compared to drugs and ChEMBL molecules probably
reflects their fraction of sp3 carbon atoms, which is
higher in natural products and GDB molecules than
in drugs and ChEMBL molecules (Figure 4h).
Finally, all GDB databases show a very low
percentage of aromatic molecules but a much higher
percentage of heterocyclic molecules compared to known
molecules. This results from the combinatorial
enumeration used to generate GDBs, which produces
many more combinations when heteroatoms are present
in rings (Figure 4k).
2.3 Substructure analysis
To assess if GDBMedChem molecules are
significantly different from known molecules, we
compared substructures in GDBMedChem molecules to
those from molecules in the entire DrugBank, ChEMBL,
and ZINC database independent of molecular size. To
perform this analysis, we collected the molecular shingles
used in the calculation of MinHash fingerprint MHFP6, an
extended connectivity fingerprint which outperforms
ECFP4 in benchmarking studies. [17] In MHFP6 molecular
shingles comprise extended connectivity substructures
around each atom up to a diameter of 6 bonds as well as
all ring structures, and are written as SMILES with the
rooted atom appearing as the first atom in the SMILES
string.
GDBMedChem molecules contain on average 38
shingles per molecule, which is approximately two-thirds of
the number of shingles per molecule found in known
molecules from ChEMBL, DrugBank and ZINC, reflecting
the smaller size of GDBMedChem molecules (Table 3).
The total number of unique shingles in each database
grows as a function of the number of molecules surveyed in
the database; however, this number grows faster and more
steadily in GDBMedChem compared to known molecules
(Figure 5a). The number of occurrences of shingles in
each of the four databases follows a power law
distribution, with a small number of shingles appearing in
almost all molecules, and a large number of shingles
appearing in only few molecules (Figure 5b).
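The cumulative count of unique shingles described above can be sketched as follows, assuming each molecule is already represented by its set of shingle strings. The function name and the toy data are illustrative only.

```python
import random

def cumulative_unique_shingles(shingle_sets, seed=0):
    """Cumulative number of unique shingles over randomly ordered molecules.

    shingle_sets holds one set of shingle strings per molecule. The
    molecules are shuffled (reproducibly, via seed) and the size of the
    union of all shingles seen so far is recorded after each molecule.
    """
    order = list(shingle_sets)
    random.Random(seed).shuffle(order)
    seen = set()
    curve = []
    for shingles in order:
        seen.update(shingles)
        curve.append(len(seen))
    return curve

# Toy database of four molecules with made-up shingle strings.
toy = [{"CO", "CC"}, {"CC", "CN"}, {"c1ccccc1"}, {"CO"}]
print(cumulative_unique_shingles(toy))
```

A database whose curve keeps rising steadily, as reported for GDBMedChem, keeps contributing new substructures as more molecules are surveyed.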
A Venn diagram analysis shows that almost all 17.3
million unique shingles in GDBMedChem (97 %) are
unique to this database, illustrating that GDBMedChem
molecules are very different from known molecules at the
level of their substructures (Figure 5c). By comparison,
the three databases of known molecules contain a much
smaller fraction of unique shingles, mostly because they
contain a much smaller number of unique shingles. This
is particularly striking with ZINC, which contains 15 million
molecules but only 1.5 million shingles. Interestingly, the
difference between GDBMedChem and known molecules
also holds true when focusing on the 100 most frequent
shingles in each database (Figure 5d). Among these
frequent shingles, oxygen containing saturated or singly
unsaturated shingles stand out in GDBMedChem, in
contrast to aromatic and nitrogen heterocycles in ZINC
(Figure S2). The 29 frequent shingles shared by all four
databases correspond to simple aliphatic substructures,
alcohol, carbonyl, and the benzene ring.
Table 3. MHFP6 shingle analysis for GDBMedChem and other databases.

| Database | no. of molecules | av. ± SD shingles per molecule a) | no. of shingles in database b) | no. of shingles unique to the database c) |
|---|---|---|---|---|
| GDBMedChem | 10,006,044 | 38 ± 6 | 17,317,417 | 16,798,833 (97 %) |
| DrugBank | 8,299 | 57 ± 26 | 82,193 | 9,995 (12 %) |
| ChEMBL | 1,446,502 | 64 ± 16 | 1,593,674 | 911,181 (57 %) |
| ZINC | 15,149,974 | 59 ± 12 | 1,477,745 | 780,276 (53 %) |

a) Average number of MHFP6 shingles per molecule, considering unique shingles in each molecule separately. MHFP6 shingles are substructures around each atom with a diameter of up to six bonds. See Ref. [17] for details. b) Total number of unique shingles across the entire database. c) Shingles that do not occur in any of the other three databases.
Figure 5. MHFP6 shingle analysis. (a) Cumulative number of unique shingles and (b) frequency distribution of shingles for
GDBMedChem, DrugBank, ChEMBL and ZINC. To compute the cumulative number of unique shingles, compounds were randomly
ordered in each database (x-axis). (c-d) Venn diagrams showing the number of MHFP6 shingles unique and shared among the different
databases. In (d) only the top 100 most frequent shingles from each database were considered.
2.4 Interactive visualization and search tools
To enable a closer insight into GDBMedChem we
represented the 10 million molecules in a principal
component analysis (PCA) 3D-projection of the 42D-
Molecular Quantum Number (MQN)[18] chemical space
using Faerun,[19] which is available at http://gdb.unibe.ch.
We also generated MQN 3D-maps for the fragment
database FDB17, and for the 128,011 known molecules in
ChEMBL17, DrugBank17 and UNPD17. PCA is a classical
approach for visualizing chemical space[20] which is much
faster than other dimensionality reduction methods[21] and
therefore well suited for large datasets such as GDB.[22]
MQN is a composition fingerprint counting different
types of atoms, bonds, polar groups and topologies. Since
many of these descriptors are correlated, MQN datasets
project well into 3D by PCA.[23] The resulting MQN maps
typically order compounds by size, number of rings, and
polarity. This is illustrated here for color-coding by HAC and
ring atom count (Figure 6). These maps illustrate that
GDBMedChem covers a similar chemical space as FDB17
and known molecules; however, the coverage by
GDBMedChem and FDB17 is much denser than for known
molecules. Note that GDBMedChem covers a broader
range of structures than FDB17 by allowing a higher
number of rotatable bonds, which leads to more acyclic
molecules. The interactive versions of the maps in Faerun
allow one to efficiently browse the contents of
GDBMedChem and related databases and gain an
overview of their contents.
To further facilitate the exploitation of GDBMedChem,
we have generated a multi-fingerprint browser accessible at
http://gdb.unibe.ch (Figure 7).[24] In this browser we allow
nearest neighbor (NN) searches of any query molecule in
GDBMedChem, or for comparison in the combined set
ChEMBL17 + DrugBank17 + UNPD17. The search is
implemented using Annoy (Approximate Nearest Neighbors
Oh Yeah, https://github.com/spotify/annoy), which provides
very fast search results even for relatively large
databases.[25] Similarity searching is possible by shortest
city-block distance according to the MQN fingerprint,[18] by
highest similarity according to the extended connectivity
fingerprint ECFP4,[26] or by a combined search retrieving
NNs in MQN, followed by sorting these NNs by MHFP6
similarity to the query, which orders results according to a
detailed substructure logic.[17]
Similarity searches in GDBMedChem with this browser
readily return high-similarity analogs for any query molecule.
Typical results of such searches are exemplified here for
ten drugs of 17 atoms or less, for which we identified two
nearest neighbors from GDBMedChem that are not
currently documented in Scifinder (Figure 8).
3 Conclusion
To address the overwhelming complexity of molecules in
GDB17 we applied a set of medicinal chemistry criteria and
complexity filters to define GDBMedChem, a subset of 10 million
drug-like molecules covering a broad range of
molecular size, polarity and stereochemistry. The vast
majority of molecules in GDBMedChem are as yet unknown
and represent a valuable resource for medicinal chemistry.
The database is available for 3D-visualization as well as for
similarity searching and for download at http://gdb.unibe.ch.
Figure 6. MQN PCA-maps of GDBMedChem, FDB17 and merged database (DrugBank17 + ChEMBL17 + UNPD17), color coded
according to the count of heavy atoms, and number of atoms in rings. Color changes from blue to cyan to green to yellow to red to
magenta with increasing count of a property. PCA variance covered: GDBMedChem (PC1: 42%, PC2: 18%, PC3: 11%), FDB17 (PC1:
32%, PC2: 22%, PC3: 10%), DrugBank17 + ChEMBL17 + UNPD17 (PC1: 41%, PC2: 17%, PC3: 15%). The images were generated
with Faerun and are accessible at http://faerun.gdb.tools.
Figure 7. Multifingerprint browser interface for GDBMedChem and ChEMBL17 + DrugBank17 + UNPD17 database. (a) Entry page
of browser showing nicotine as query compound. (b) Search result window displaying MQN nearest neighbors of nicotine. The
multifingerprint browser is publicly accessible at http://gdb-medchem-simsearch.gdb.tools.
[Figure 6 panels. Columns: GDBMedChem, FDB17, DrugBank17 + UNPD17 + ChEMBL17. Rows: (a) heavy atom count, (b) number of atoms in rings.]
Figure 8. Ten representative drugs of 17 or fewer heavy atoms (blue) and in each case two NNs (black) retrieved from GDBMedChem
using the multifingerprint browser with the combined MQN-MHFP6 method (see Methods for details). Numbers indicate MQN city-block
distance / MHFP6 Tanimoto coefficient / rank in NN list. The NNs shown are currently not documented in Scifinder.
4 Methods
4.1 Assembly of GDBMedChem
The functional group and structural complexity filters
listed in Table 2 were applied to GDB17 (stored as
split SMILES files) to obtain the GDBMedChem database.
All calculations were performed using RDKit (version
2017_09_03) and the PySpark (version 2.3.2) parallel
computing framework on a 98-node cluster with 252 GB of
RAM. Sixteen out of the 21 filters listed in Table 2 were
implemented as SMARTS queries, and the remaining five
filters (stereocenters, ring count, avalon density,
heteroatom to carbon ratio and largest aromatic ring size)
were implemented using other functions provided in RDKit.
It should be noted that filters were applied in a progressive
manner (simple and obvious filters first) and not in the order
of Table 2. Molecules violating any of the filtering criteria
were removed from the GDB17 database. This resulted in
a subset of 17,804,900,000 molecules (17.8G set).
The molecules from the 17.8G set were binned into
425 triplet bins generated from all the possible
combinations of the values of three descriptors,
namely: heavy atom count (1 to 17), heteroatoms (≤1,
2, 3, 4, ≥5) and stereocenters (0, 1, 2, 3, 4). Of these
425 triplet bins, 181 bins were not occupied by any
molecule, thus leaving 244 bins for further
consideration. The binned 17.8G set was then stored in
the form of a PySpark DataFrame (data schema:
[SMILES: string, Triplet bin: string]), wherein each entry
contains two fields, namely the SMILES of a compound
and its Triplet bin. Next, 10 million molecules were
sampled from the DataFrame using the PySpark
“sampleBy” function to form GDBMedChem. Stratified
sampling without replacement was used.
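The triplet binning can be illustrated with a short sketch that enumerates all possible bins and assigns a molecule to one of them. The bin label format (values joined by underscores) and the function names are assumptions made for this example; only the descriptor ranges come from the text.

```python
from itertools import product

HAC_VALUES = range(1, 18)                    # heavy atom count 1 to 17
HETERO_BINS = ["<=1", "2", "3", "4", ">=5"]  # heteroatom bins
STEREO_BINS = [0, 1, 2, 3, 4]                # stereocenter counts

def hetero_bin(n_hetero):
    """Map a heteroatom count onto one of the five heteroatom bins."""
    if n_hetero <= 1:
        return "<=1"
    if n_hetero >= 5:
        return ">=5"
    return str(n_hetero)

def triplet_bin(hac, n_hetero, n_stereo):
    """Hypothetical bin label: HAC, heteroatom bin and stereocenters."""
    return f"{hac}_{hetero_bin(n_hetero)}_{n_stereo}"

# All possible combinations of the three descriptors: 17 x 5 x 5 bins.
ALL_BINS = [f"{h}_{x}_{s}"
            for h, x, s in product(HAC_VALUES, HETERO_BINS, STEREO_BINS)]

print(len(ALL_BINS))
print(triplet_bin(hac=17, n_hetero=7, n_stereo=2))
```

The enumeration reproduces the 425 possible bins mentioned above; which of them are empty depends on the actual content of the 17.8G set.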
The PySpark “sampleBy” function generates
stratified samples from the DataFrame given two input
parameters: i) the name of the column used to
define the different strata and ii) a Python
dictionary containing the names of the
strata as keys and the fraction of entries to sample
from each stratum as the corresponding values. The
strata are the bins used to group the entries in the
DataFrame. In our case the column “Triplet bin” was
used to define the strata (244 strata
corresponding to the 244 unique triplet bins in the
DataFrame). The fraction of entries to sample from
each stratum was computed as follows. The Python
dictionary variables n_selected = {stratum1: 0,
stratum2: 0, …, stratum244: 0} and n_total = {stratum1:
x, stratum2: x, …, stratum244: x} were initialized. In both
variables, the keys are the names of the strata
(triplet bins). The values in n_selected indicate
the number of compounds to sample from each of the
244 strata. The values in n_total indicate
the total number of compounds present in each of the
244 strata. The values x were computed beforehand.
Thereafter, the items in n_selected were
iterated over repeatedly (until the sum of the values in
n_selected equaled 10M), each time incrementing
the value of each stratum by 1, provided that
the n_selected value for that stratum was less than its
n_total value. Finally, the fraction of
compounds to sample from each stratum was computed
by dividing the n_selected value for a given stratum by
the corresponding value in n_total.
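The fraction computation described above can be sketched in plain Python as follows. The function name and the toy stratum sizes are hypothetical, and the early exit once the target is reached is an assumption about how the stopping condition was handled.

```python
def compute_sampling_fractions(n_total, target):
    """Spread `target` picks evenly over strata, one pick per pass.

    n_total maps each stratum to its number of compounds. On every pass
    each stratum that is not yet exhausted receives one more pick, until
    the picks sum to `target`. Returns (n_selected, fractions).
    """
    if target > sum(n_total.values()):
        raise ValueError("target exceeds the number of available compounds")
    n_selected = {stratum: 0 for stratum in n_total}
    remaining = target
    while remaining > 0:
        for stratum in n_selected:
            if remaining == 0:
                break
            if n_selected[stratum] < n_total[stratum]:
                n_selected[stratum] += 1
                remaining -= 1
    fractions = {stratum: n_selected[stratum] / n_total[stratum]
                 for stratum in n_total if n_total[stratum] > 0}
    return n_selected, fractions

# Toy example with three strata of very different sizes.
n_selected, fractions = compute_sampling_fractions(
    {"a": 2, "b": 10, "c": 100}, target=12)
print(n_selected)
print(fractions)

# With PySpark, such a dictionary would then be passed on, e.g.:
#   sampled = df.sampleBy("Triplet bin", fractions=fractions, seed=42)
# (df and the seed value are placeholders, not taken from the text).
```

Small strata are taken completely while large strata contribute only a small fraction, which is what flattens the size distribution. Note that `sampleBy` draws each row independently with the given probability, so the realized sample size is only approximately equal to the target.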
4.2 Other databases
The random subset containing 50M molecules from GDB17
and the FDB17 database were downloaded from the
http://gdb.unibe.ch website. For the 4.6 G set from which
FDB17 is sampled, we used an in-house copy of the
database. Random subsets (containing 50M molecules) of
the 17.8G and 4.6G sets were generated using the PySpark
sample function. ChEMBL version 22 was downloaded from
https://www.ebi.ac.uk/chembl/, DrugBank version 5.011
from https://www.drugbank.ca/ and UNPD from
http://oolonek.github.io/ISDB/. ChEMBL17, DrugBank17
and UNPD17 databases were generated by removing the
compounds containing more than 17 heavy atoms. The
merged database (DrugBank17_UNPD17_ChEMBL17)
was formed by merging all molecules from DrugBank17,
ChEMBL17 and UNPD17. For the ppb2 set we used an in-
house copy of the database previously prepared for the
polypharmacology browser ppb2.[14] In the ppb2 set, each
compound is annotated with its SMILES, ChEMBL
compound ID and ChEMBL target IDs. Compounds from
the ppb2 set were assigned to target families based on the
classification provided in ChEMBL22. Within each target
family only unique compounds were considered for further
computation.
4.3 Processing of molecules and calculation of MQN
and molecular properties
For each molecule counter ions were removed (if present)
and the largest fragment was retained (if a molecule
contains multiple fragments). Thereafter, all molecules
were processed in non-chiral SMILES format, checked for
valence errors, and protonated at pH 7.4 using an in-house
written Java-program utilizing the JChem Chemistry library
from ChemAxon Pvt. Ltd. Next, based on unique SMILES
notation, duplicate molecules were removed in the context
of each database. Molecular properties were calculated for
each molecule using RDKit, except for the number of rings and the
hydrogen bond acceptor and donor counts, which were
calculated using JChem. MQN fingerprints were calculated
using an in-house written Java program utilizing the JChem
Chemistry library.
4.4 MHFP6 shingle analysis
MHFP6 (diameter of 6) shingles were calculated for each
molecule in each of the GDB17 (10M), GDBMedChem,
FDB17, ChEMBL17, DrugBank17, UNPD17, ChEMBL,
DrugBank and ZINC databases using the MHFP Python
package (GitHub repository: https://github.com/reymond-group/mhfp,
pip: https://pypi.org/project/mhfp/). More
specifically, the function “shingling_from_smiles” from the
MHFPEncoder class was used to generate the shingles.
Then, for each database, the total number of unique
MHFP6 shingles was calculated by simple string
comparison of all MHFP6 shingles for a given database.
Additionally, for each database, the number of shingles per
compound was calculated by dividing the number of unique
shingles by the total number of compounds in a given
database.
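The set operations behind such a shingle analysis can be sketched with Python sets. The sketch assumes the shingles of each molecule are already available as a set of strings; it computes the per-molecule average in the sense of footnote a) of Table 3 (mean size of the per-molecule shingle sets), and the function name and toy data are illustrative.

```python
from statistics import mean

def shingle_statistics(databases):
    """Summarize shingle content for several databases.

    databases maps a database name to a list with one set of shingle
    strings per molecule. For each database the result reports the
    number of molecules, the mean number of shingles per molecule, the
    number of unique shingles, and how many of those occur in no other
    database.
    """
    unique = {name: set().union(*mols) for name, mols in databases.items()}
    stats = {}
    for name, mols in databases.items():
        others = set().union(*(u for n, u in unique.items() if n != name))
        stats[name] = {
            "n_molecules": len(mols),
            "mean_per_molecule": mean(len(m) for m in mols) if mols else 0.0,
            "n_unique": len(unique[name]),
            "n_exclusive": len(unique[name] - others),
        }
    return stats

# Toy data: two tiny databases with made-up shingles.
toy_dbs = {
    "X": [{"a", "b"}, {"b", "c"}],
    "Y": [{"c", "d"}],
}
stats = shingle_statistics(toy_dbs)
print(stats["X"])
print(stats["Y"])
```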
4.5 Web-application for visualization
The web application for interactive visualization of property
color-coded 3D-spaces of GDBMedChem, FDB17 and the
DrugBank17_UNPD17_ChEMBL17 database was built
using FUn (http://doc.gdb.tools/fun/), an in-house developed
framework for visualization of chemical spaces on the web.
The three main components of the FUn framework are the data
preprocessing toolchain, the data service (Underdark Go)
and Faerun, a web application for interactive visualization.
Initially, we stored each database in a plain text file format.
In this file each line contains 4 fields separated by spaces:
i) SMILES of a compound, ii) name or id of a compound, iii)
42 MQN values (separated by colons) for a compound and
iv) molecular properties (separated by colons) to use for
color coding. Next, the plain text file for each database was
pre-processed using the data preprocessing toolchain
(https://github.com/reymond-group/pca), which projects the
42-dimensional MQN chemical space into 3 dimensions
using Principal Component Analysis (PCA) and generates
all the necessary files for visualization. Thereafter, the
Underdark data service and Faerun visualization containers
were run using Docker.
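The plain text input format described above can be illustrated with a small writer and parser. This sketch assumes that names contain no spaces, that MQN values are integers and that properties are numeric; the function names and the example record are hypothetical.

```python
def format_record(smiles, name, mqn, properties):
    """Build one input line: four space-separated fields, with the MQN
    values and the properties each joined by colons."""
    if len(mqn) != 42:
        raise ValueError("expected 42 MQN values")
    return " ".join([
        smiles,
        name,
        ":".join(str(v) for v in mqn),
        ":".join(str(v) for v in properties),
    ])

def parse_record(line):
    """Inverse of format_record; returns (smiles, name, mqn, properties)."""
    smiles, name, mqn_field, prop_field = line.split()
    mqn = [int(v) for v in mqn_field.split(":")]
    if len(mqn) != 42:
        raise ValueError("expected 42 MQN values")
    properties = [float(v) for v in prop_field.split(":")]
    return smiles, name, mqn, properties

# Made-up record: the MQN values here are placeholders, not a real
# MQN fingerprint of the molecule.
line = format_record("CN1CCCC1c1cccnc1", "mol_1", list(range(42)), [12.0, 0.5])
print(line)
print(parse_record(line)[1])
```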
4.6 Web-application for similarity searching
The web application for similarity searching in
GDBMedChem and the DrugBank17_UNPD17_ChEMBL17
database is implemented using HTML, Bootstrap and
JavaScript (frontend) and the Flask Python web framework. To
enable fast similarity searching, we implemented
approximate nearest neighbor searching using Annoy
(Approximate Nearest Neighbors Oh Yeah,
https://github.com/spotify/annoy). The option is provided to
perform a similarity search using the 42-dimensional MQN, the
256-dimensional extended connectivity fingerprint (ECFP4),
or a combination of MQN and the 128-dimensional MinHash
(MHFP6) fingerprint. In the case of MQN, the similarity search
application retrieves the nearest neighbors of a query
molecule using the MQN-based Annoy tree and ranks them
by increasing city-block distance (Manhattan distance).
In the case of MQN-MHFP6, the similarity search application
first retrieves the nearest neighbors of a query molecule
using the MQN-based Annoy tree and then re-sorts these nearest
neighbors based on their Jaccard distances to
the query molecule in MHFP6 fingerprint space. The
calculation of the MQN of a molecule is implemented using an
in-house written Java program, while the calculation of the MHFP6
fingerprint is implemented using the GitHub Python repository
https://github.com/reymond-group/mhfp.
Acknowledgements
This work was supported financially by a grant from NIBR to
MA. We thank ChemAxon Pvt. Ltd. for providing free
academic and web licenses for their products.
References
[1] a) R. S. Bohacek, C. McMartin, W. C. Guida, Med. Res. Rev. 1996, 16, 3-50; b) M. Hartenfeller, H. Zettl, M. Walter, M. Rupp, F. Reisen, E. Proschak, S. Weggen, H. Stark, G. Schneider, PLoS Comput. Biol. 2012, 8, e1002380; c) J. L. Reymond, L. Ruddigkeit, L. C. Blum, R. Van Deursen, WIREs comput. Mol. Sci. 2012, doi: 10.1002/wcms.1104; d) F. Chevillard, P. Kolb, J. Chem. Inf. Model. 2015, 55, 1824-1835; e) M. Awale, R. Visini, D. Probst, J. Arus-Pous, J. L. Reymond, Chimia 2017, 71, 661-666; f) J. Boström, D. G. Brown, R. J. Young, G. M. Keserü, Nat. Rev. Drug Discovery 2018, 17, 709-727; g) N. van Hilten, F. Chevillard, P. Kolb, J. Chem. Inf. Model. 2019, doi: 10.1021/acs.jcim.1028b00737.
[2] T. Fink, J. L. Reymond, J. Chem. Inf. Model. 2007, 47, 342-353.
[3] L. C. Blum, J.-L. Reymond, J. Am. Chem. Soc. 2009, 131, 8732-8733.
[4] L. Ruddigkeit, R. van Deursen, L. C. Blum, J. L. Reymond, J. Chem. Inf. Model. 2012, 52, 2864-2875.
[5] B. D. McKay, Congressus Numerantium 1981, 30, 45-87.
[6] R. Visini, J. Arus-Pous, M. Awale, J. L. Reymond, J. Chem. Inf. Model. 2017, 57, 2707-2718.
[7] a) M. Hann, B. Hudson, X. Lewell, R. Lifely, L. Miller, N. Ramsden, J. Chem. Inf. Comp. Sci. 1999, 39, 897-902; b) I. Muegge, S. L. Heald, D. Brittelli, J. Med. Chem. 2001, 44, 1841-1846; c) D. F. Veber, S. R. Johnson, H.-Y. Cheng, B. R. Smith, K. W. Ward, K. D. Kopple, J. Med. Chem. 2002, 45, 2615-2623; d) W. P. Walters, M. A. Murcko, Advanced Drug Deliv. Rev. 2002, 54, 255-271; e) R. F. Bruns, I. A. Watson, J. Med. Chem. 2012, 55, 9763-9772; f) S. J. Capuzzi, E. N. Muratov, A. Tropsha, J. Chem. Inf. Model. 2017, 57, 417-427.
[8] R. Visini, M. Awale, J.-L. Reymond, J. Chem. Inf. Model. 2017, 57, 700-709.
[9] M. Congreve, R. Carr, C. Murray, H. Jhoti, Drug Discovery Today 2003, 8, 876-877.
[10] V. Law, C. Knox, Y. Djoumbou, T. Jewison, A. C. Guo, Y. Liu, A. Maciejewski, D. Arndt, M. Wilson, V. Neveu, A. Tang, G. Gabriel, C. Ly, S. Adamjee, Z. T. Dame, B. Han, Y. Zhou, D. S. Wishart, Nucleic Acids Res. 2014, 42, D1091-D1097.
[11] A. P. Bento, A. Gaulton, A. Hersey, L. J. Bellis, J. Chambers, M. Davies, F. A. Krüger, Y. Light, L. Mak, S. McGlinchey, M. Nowotka, G. Papadatos, R. Santos, J. P. Overington, Nucleic Acids Res. 2014, 42, D1083-D1090.
[12] P. Banerjee, J. Erehman, B.-O. Gohlke, T. Wilhelm, R. Preissner, M. Dunkel, Nucleic Acids Res. 2015, 43, D935-D939.
[13] P. Gedeck, B. Rohde, C. Bartels, J. Chem. Inf. Model. 2006, 46, 1924-1936.
[14] M. Awale, J. L. Reymond, J. Chem. Inf. Model. 2019, 59, 10-17.
[15] P. Ertl, A. Schuffenhauer, J. Cheminform. 2009, 1, 8.
[16] a) P. Ertl, S. Roggo, A. Schuffenhauer, J. Chem. Inf. Model. 2008, 48, 68-74; b) K. V. Jayaseelan, P. Moreno, A. Truszkowski, P. Ertl, C. Steinbeck, BMC Bioinformatics 2012, 13, 106.
[17] D. Probst, J. L. Reymond, J. Cheminform. 2018, 10, 66.
[18] K. T. Nguyen, L. C. Blum, R. van Deursen, J.-L. Reymond, ChemMedChem 2009, 4, 1803-1805.
[19] D. Probst, J. L. Reymond, Bioinformatics 2018, 34, 1433-1435.
[20] a) T. I. Oprea, J. Gottfries, J. Comb. Chem. 2001, 3, 157-166; b) J. Rosen, J. Gottfries, S. Muresan, A. Backlund, T. I. Oprea, J. Med. Chem. 2009, 52, 1953-1962.
[21] a) A. M. Wassermann, M. Wawer, J. Bajorath, J. Med. Chem. 2010, 53, 8209-8923; b) H. A. Gaspar, I. I. Baskin, G. Marcou, D. Horvath, A. Varnek, J. Chem. Inf. Model. 2014, 55, 84-94; c) T. Sander, J. Freyss, M. von Korff, C. Rufener, J. Chem. Inf. Model. 2015, 55, 460-473; d) A. Lin, D. Horvath, V. Afonina, G. Marcou, J.-L. Reymond, A. Varnek, ChemMedChem 2018, 13, 540-554.
[22] a) L. C. Blum, R. van Deursen, J. L. Reymond, J. Comput.-Aided Mol. Des. 2011, 25, 637-647; b) L. Ruddigkeit, L. C. Blum, J.-L. Reymond, J. Chem. Inf. Model. 2013, 53, 56-65.
[23] R. van Deursen, L. C. Blum, J. L. Reymond, J. Chem. Inf. Model. 2010, 50, 1924-1934.
[24] M. Awale, J. L. Reymond, Nucleic Acids Res. 2014, 42, W234-239.
[25] A. Capecchi, M. Awale, D. Probst, J. L. Reymond, ChemRXiv 2019, doi: 10.26434/chemrxiv.7650071.v7650072.
[26] D. Rogers, M. Hahn, J. Chem. Inf. Model. 2010, 50, 742-754.
Supporting information for:
Medicinal Chemistry Database GDBMedChem
Mahendra Awale,[a] Finton Sirockin,[b] Nikolaus Stiefl[b] and Jean-Louis Reymond*[a]
[a] Department of Chemistry and Biochemistry, University of Bern Freiestrasse 3, 3012 Bern, Switzerland
*e-mail: [email protected]
[b] Novartis Institutes for Biomedical Research, Basel, Switzerland
Table of contents
Figure S1
Figure S2
Table S1
Figure S1. Frequency histograms for the 17.8G set (magenta line) and the 10 million
GDBMedChem database (red line) across (A) molecular size (1-17), (B) stereocenters (0, 1,
2, 3, 4) and (C) heteroatoms (N+O+S: ≤1, 2, 3, 4, ≥5). In (D) the frequency histogram is shown
by individual triplet value bins (HAC, heteroatoms, stereocenters) sorted by decreasing
occupancy in the 17.8G set (magenta line) and in GDBMedChem (red line).
Figure S2. Structures of MHFP6 shingles selected from the 100 most frequent shingles in
GDBMedChem, ZINC, ChEMBL and DrugBank. * indicates the rooted atom in each shingle.
Shingles without rooted atom are ring shingles. (a) Examples of shingles unique to
GDBMedChem molecules among the set of 100 most frequent shingles in the four databases.
(b) Examples of shingles unique to ZINC among the set of 100 most frequent shingles in the
four databases. (c) Examples of shingles common to GDBMedChem, ZINC, ChEMBL and
DrugBank. For each shingle, a SMILES string and the number of compounds in the database
containing that shingle are shown.
Table S1. Filters used for selecting GDBMedChem from GDB17, and percentages of
bioactive compounds from given target family that are eliminated by each filter.
| Filter | Kinases | Proteases | Other enzymes | Membrane receptors | Ion channels | Transporters | Transcription factors | Others |
|---|---|---|---|---|---|---|---|---|
| Functional groups: | | | | | | | | |
| no amidine | 0.18 | 2.08 | 0.64 | 0.66 | 0.19 | 0.27 | 0.16 | 0.88 |
| no imidate | 0.12 | 2.03 | 0.34 | 0.21 | 0.33 | 0.09 | 0.36 | 0.13 |
| no aldehyde | 0.13 | 1.57 | 0.55 | 0.13 | 0.18 | 0.07 | 0.17 | 0.19 |
| no aziridine | 0.01 | 0.01 | 0.01 | 0.00 | 0.00 | 0.01 | 0.10 | 0.01 |
| no aromatic ring > 6 atoms | 0.22 | 0.00 | 0.03 | 0.04 | 0.06 | 0.08 | 0.03 | 0.21 |
| no Br, I | 5.25 | 3.37 | 4.04 | 4.51 | 3.71 | 4.74 | 4.02 | 4.38 |
| ≤ 2 ethers | 4.91 | 4.37 | 4.90 | 4.05 | 3.13 | 7.40 | 5.76 | 2.81 |
| no epoxide | 0.04 | 0.23 | 0.12 | 0.03 | 0.00 | 0.04 | 0.07 | 0.12 |
| no formate, acetate, or methyl ester | 4.95 | 4.60 | 5.02 | 4.08 | 3.13 | 7.44 | 5.82 | 2.93 |
| no Cl or F on heterocycle | 5.71 | 8.46 | 3.78 | 4.37 | 6.78 | 2.58 | 2.18 | 6.01 |
| ≤ 1 acetylene | 0.03 | 0.03 | 0.03 | 0.03 | 0.14 | 0.04 | 0.11 | 0.04 |
| ≤ 1 nitrile | 0.36 | 0.56 | 0.29 | 0.15 | 0.27 | 0.44 | 0.42 | 0.28 |
| ≤ 1 sulfone | 0.20 | 1.29 | 1.84 | 0.53 | 0.49 | 0.21 | 0.80 | 0.72 |
| ≤ 2 amides | 1.04 | 9.07 | 0.86 | 2.64 | 0.53 | 0.15 | 0.92 | 6.82 |
| ≤ 2 acyclic esters | 0.01 | 0.06 | 0.22 | 0.03 | 0.11 | 0.46 | 0.01 | 0.04 |
| Combined FG filters b) | 13.65 | 25.91 | 16.99 | 16.26 | 10.53 | 25.22 | 20.73 | 18.90 |
| Structural complexity: | | | | | | | | |
| ≤ 18 avalon density | 1.36 | 0.25 | 1.66 | 1.26 | 0.83 | 0.37 | 0.60 | 0.64 |
| ≤ 1 cyclic tetravalent node | 0.62 | 3.26 | 3.42 | 2.88 | 2.60 | 4.24 | 9.11 | 2.27 |
| ≤ 4 stereocenters | 0.32 | 3.25 | 3.57 | 3.32 | 1.09 | 14.01 | 6.86 | 3.85 |
| ≤ 3 bonds in fused ring systems | 3.88 | 4.00 | 6.28 | 5.96 | 7.46 | 5.40 | 11.98 | 3.90 |
| ≤ 3 rings | 78.42 | 47.75 | 51.01 | 66.12 | 59.14 | 56.16 | 55.02 | 56.12 |
| Combined complexity filters c) | 79.27 | 50.40 | 53.71 | 67.50 | 60.25 | 62.24 | 57.32 | 58.24 |
| Polarity: | | | | | | | | |
| ≤ 0.7 hetero atoms to carbon ratio | 0.89 | 2.64 | 5.43 | 1.25 | 2.80 | 0.68 | 1.15 | 2.40 |
| All filters combined d) | 83.97 | 63.83 | 64.26 | 72.71 | 66.09 | 68.17 | 65.57 | 67.40 |

a) 350K bioactive compounds from our recently reported polypharmacology browser (ppb2), classified according to their target family. These compounds were originally extracted from the ChEMBL22 database.
b) Percentage of compounds eliminated from each database by applying all functional group filters.
c) Percentage of compounds eliminated from each database by applying all structural complexity filters.
d) Percentage of compounds eliminated from each database by applying all filters.
download fileview on ChemRxivGDBMedChemSI.pdf (674.53 KiB)