Medicinal Chemistry Database GDBMedChem

doi.org/10.26434/chemrxiv.7770809.v1

Mahendra Awale, Finton Sirockin, Nikolaus Stiefl, Jean-Louis Reymond

Submitted date: 26/02/2019 • Posted date: 27/02/2019

Licence: CC BY-NC-ND 4.0

Citation information: Awale, Mahendra; Sirockin, Finton; Stiefl, Nikolaus; Reymond, Jean-Louis (2019): Medicinal Chemistry Database GDBMedChem. ChemRxiv. Preprint.



Medicinal Chemistry Database GDBMedChem

Mahendra Awale,[a] Finton Sirockin,[b] Nikolaus Stiefl[b] and Jean-Louis Reymond*[a]

Abstract: The generated database GDB17 enumerates 166.4 billion possible molecules up to 17 atoms of C, N, O, S and halogens following simple chemical stability and synthetic feasibility rules; however, medicinal chemistry criteria are not taken into account. Here we applied rules inspired by medicinal chemistry to exclude problematic functional groups and complex molecules from GDB17, and sampled the resulting subset evenly across molecular size, stereochemistry and polarity to form GDBMedChem as a compact collection of 10 million small molecules. This collection has reduced complexity and better synthetic accessibility than the entire GDB17 but retains a higher sp3-carbon fraction and higher natural product likeness scores than known drugs. GDBMedChem molecules are more diverse than known molecules and very different from them in terms of substructures, and represent an unprecedented source of diversity for drug design. GDBMedChem is available for 3D-visualization, similarity searching and download at http://gdb.unibe.ch.

Keywords: chemical space, drug design, small molecules, medicinal chemistry, virtual screening

1 Introduction

All attempts to estimate the total number of possible organic

molecules or to specifically enumerate them have shown that

synthetic chemistry including medicinal chemistry to date has

barely scratched the surface of the chemical universe, even

when including billions of compounds that can be readily

synthesized by combining known building blocks with known

reactions.[1] In one such attempt we enumerated all possible

molecules up to 11,[2] 13[3] and 17 atoms[4] starting from

mathematical graphs[5] by selecting the corresponding

hydrocarbons for chemically acceptable ring strain and

topologies and introducing unsaturations and heteroatoms

following simple chemical stability and synthetic feasibility

rules (Figure 1). More recently we similarly enumerated all

possible ring systems up to four rings and 30 atoms.[6] The

corresponding generated databases (GDBs) contain almost

exclusively unknown molecules and therefore represent a vast

reservoir of possible innovation.

GDB enumeration focuses on chemistry rules and

does not take specific medicinal chemistry criteria into

account, such as the type and number of functional

groups and the overall structural complexity that would be

compatible with a drug-type molecule.[7] Here we aimed

to define a subset of GDB17 for medicinal chemistry,

named GDBMedChem, by filtering GDB17 using such

criteria. We followed a similar approach to that used for

the fragment database FDB17,[8] which was recently

reported as a fragment-like subset of GDB17 following

fragment-likeness criteria.[9] We present the database

assembly procedure and discuss the resulting

GDBMedChem database in comparison to our previously

reported fragment database FDB17, and to known drugs,

bioactive molecules and natural products up to 17 atoms

from DrugBank,[10] ChEMBL,[11] and the natural products

directory (Table 1).[12] We show that GDBMedChem

represents a vast and diverse source of new molecular

structures for drug design.

Figure 1. GDBMedChem generation workflow. Steps 4) and 5) are discussed in this publication.

Graphs (114,304,569,097)

→ 1) Ring strain → Hydrocarbons (5,422,153)

→ 2) Unsaturations → Skeletons (1,330,958,530)

→ 3) Heteroatoms → GDB-17: Molecules (166,443,860,262)

→ 4) Functional groups and complexity filters → 17.8G set: Molecules (17,804,900,000)

→ 5) Even sampling → GDBMedChem (10,007,380)

[a] Department of Chemistry and Biochemistry, University

of Bern

Freiestrasse 3, 3012 Bern, Switzerland

*e-mail: [email protected]

[b] Novartis Institutes for Biomedical Research, Basel,

Switzerland


Table 1. Databases discussed in this publication.

| Database | Size | Description |
|---|---|---|
| GDB17 | 166.4 G | Virtually enumerated molecules of up to 17 atoms of C, N, O, S, and halogens |
| 17.8G set | 17.8 G | GDB17 molecules passing filters in Table 2 |
| GDBMedChem | 10 M | GDB17 molecules passing filters in Table 2, evenly sampled across size, heteroatoms and stereocenters |
| 4.6G set | 4.6 G | Fragment-like subset of GDB17 |
| FDB17 | 10 M | Fragment-like subset of GDB17, evenly sampled across size, heteroatoms and stereocenters |
| ChEMBL17 | 105,423 | Compounds with HAC ≤ 17 extracted from ChEMBL 22 |
| DrugBank17 | 2,284 | Approved and experimental drugs with HAC ≤ 17 extracted from DrugBank |
| UNPD17 | 20,302 | Natural products with HAC ≤ 17 extracted from the Universal Natural Product Database (UNPD) |
| ChEMBL | 1.4 M | Compounds with HAC ≤ 50 extracted from ChEMBL 22 |
| DrugBank | 8,299 | Approved and experimental drugs with HAC ≤ 50 extracted from DrugBank |
| ZINC | 15 M | Commercially available compounds from the ZINC 12 database |

2 Results and Discussion

2.1 Selecting GDBMedChem from GDB17

To identify a subset of GDB17 suitable for medicinal

chemistry we applied structural filters considering

problematic functional groups and complexity criteria

(Table 2). Overall, these filters reduced GDB17 by 89 %,

leaving 17.8 billion molecules, referred to as 17.8G set, of

which only 480 million also occurred in our previously

reported fragment-like set of 4.6 billion molecules (4.6G

set),[8] illustrating the very different approach chosen.

While the calculation was performed by iterative steps, we

discuss here the effect of each filter on the entire GDB17

in comparison to other reference databases.

The first set of filters addresses functional groups (FGs). We remove FGs which are very abundant in GDB17 due to the combinatorial enumeration but are not desirable in drugs due to poor chemical or metabolic stability (amidines, imidates, and terminal esters) or undesirable reactivity (aldehydes, aziridines, epoxides; note that hydrolytically reactive groups such as anhydrides and acyl chlorides are not present in GDB17). We also eliminate or cap the number of potentially problematic FGs for drug design (no aromatic rings larger than 6 atoms, no Br or I, no halogen on a heterocycle, maximum one nitrile, acetylene or sulfone, maximum two acyclic esters, amides, or ethers). Overall, the FG filters eliminate approximately half of GDB17 but reduce bioactive molecules from ChEMBL17 and DrugBank17 by less than 15 % and natural products from UNPD17 by 22 % (Table 2, upper part).
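The functional group rules above amount to a table of per-group caps. The following Python sketch illustrates this logic only; it assumes that the number of occurrences of each functional group has already been determined for a molecule (for example by substructure matching), and the dictionary keys and function names are hypothetical labels chosen for this illustration, not identifiers from the original implementation.

```python
# Maximum allowed count per functional group, transcribed from the text and Table 2.
# A cap of 0 means the group is not allowed at all.
FG_CAPS = {
    "amidine": 0,
    "imidate": 0,
    "aldehyde": 0,
    "aziridine": 0,
    "epoxide": 0,
    "terminal_ester": 0,           # methyl esters, acetates and formates
    "aromatic_ring_gt6": 0,        # aromatic rings larger than 6 atoms
    "Br_or_I": 0,
    "halogen_on_heterocycle": 0,   # Cl or F on a heterocycle
    "nitrile": 1,
    "acetylene": 1,
    "sulfone": 1,
    "ether": 2,
    "amide": 2,
    "acyclic_ester": 2,
}


def violated_fg_filters(fg_counts: dict) -> list:
    """Return the names of all FG filters violated by one molecule.

    `fg_counts` maps a functional group name to its number of occurrences;
    groups missing from the dictionary are treated as absent.
    """
    return [name for name, cap in FG_CAPS.items() if fg_counts.get(name, 0) > cap]


def passes_fg_filters(fg_counts: dict) -> bool:
    """A molecule passes only if it violates none of the FG filters."""
    return not violated_fg_filters(fg_counts)
```

Returning the list of violated filters, rather than a single yes/no answer, makes it straightforward to tabulate per-filter elimination percentages of the kind reported in Table 2.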

Table 2. Filters used for selecting the 17.8G set from GDB17, and percentages of databases that are eliminated by each filter.

| Filter | GDB17 a) | 4.6G set a) | FDB17 | ChEMBL17 | DrugBank17 | UNPD17 |
|---|---|---|---|---|---|---|
| FGs: | | | | | | |
| no amidine | 23.55 | 17.27 | 11.76 | 1.80 | 1.10 | 0.23 |
| no imidate | 12.99 | 0 | 0 | 0.65 | 1.84 | 0.12 |
| no aldehyde | 10.14 | 0 | 0 | 1.22 | 2.32 | 5.65 |
| no aziridine | 5.48 | 0 | 0 | 0.22 | 0.13 | 0.01 |
| no aromatic ring > 6 atoms | 3.52 | 0 | 0 | 0.15 | 0 | 0.46 |
| no Br, I | 2.96 | 0 | 0 | 6.92 | 3.42 | 3.65 |
| ≤ 2 ethers | 2.70 | 1.84 | 0.86 | 0.65 | 0.70 | 2.51 |
| no epoxide | 2.28 | 0 | 0 | 0.39 | 0.44 | 1.97 |
| no terminal ester b) | 1.75 | 0.73 | 0.48 | 2.64 | 1.18 | 6.87 |
| no Cl or F on heterocycle | 1.73 | 0 | 0 | 3.50 | 1.62 | 0.33 |
| ≤ 1 acetylene | 1.37 | 0 | 0 | 0.12 | 0 | 2.73 |
| ≤ 1 nitrile | 0.96 | 0 | 0 | 0.38 | 0.09 | 0.07 |
| ≤ 1 sulfone | 0.25 | 0 | 0 | 0.20 | 0.31 | 0.03 |
| ≤ 2 amides | 0.01 | 0.01 | 0.0067 | 0.075 | 0.04 | 0.0099 |
| ≤ 2 acyclic esters | 0.000010 | 0 | 0 | 0.0038 | 0 | 0.03 |
| All FG filters c) | 52.73 | 19.7 | 13.02 | 14.6 | 11.21 | 21.77 |
| Structural complexity: | | | | | | |
| ≤ 18 avalon density | 34.81 | 47.05 | 38.12 | 9.21 | 6.26 | 3.59 |
| ≤ 1 cyclic tetravalent node | 24.18 | 20.86 | 21.57 | 1.51 | 0.96 | 8.03 |
| ≤ 4 stereocenters | 22.48 | 0 | 0 | 0.72 | 3.11 | 5.46 |
| ≤ 3 bonds in fused ring systems | 17.42 | 16.47 | 15.91 | 1.83 | 0.96 | 2.93 |
| ≤ 3 rings | 6.38 | 0 | 0 | 0.82 | 0.35 | 0.92 |
| All complexity filters d) | 62.07 | 64.83 | 59.11 | 12.43 | 10.64 | 15.6 |
| Polarity: | | | | | | |
| ≤ 0.7 heteroatoms to carbon ratio | 6.15 | 0.10 | 4.78 | 15.60 | 39.82 | 10.03 |
| All filters combined e) | 86.27 | 73.68 | 65.98 | 35.77 | 50.28 | 41.07 |

a) In the case of GDB17 and the 4.6G set, the calculation was performed on a 50M random subset of molecules. b) Terminal esters are defined as methyl esters, acetates and formates. c) Percentage of compounds eliminated from each database by applying all functional group filters. d) Percentage of compounds eliminated from each database by applying all structural complexity filters. e) Percentage of compounds eliminated from each database by applying all filters.


Secondly, we limit structural complexity by capping the

number of rings, stereocenters, cyclic quaternary centers

and bonds in fused rings. We also apply an overall

complexity filter calculated from the avalon fingerprint as

a limit on the fingerprint density value, defined here as the

number of on-bits in the avalon fingerprint scaled to the

heavy atom count.[13] The cut-off values for these

parameters were selected by analyzing histograms of

these values in GDB17 in comparison to known drugs,

which shows that GDB17 has generally higher values

compared to drug molecules (Figure 2). Accordingly,

these structural complexity reduction criteria eliminate

many GDB17 molecules (62 %) corresponding to complex

polycyclic scaffolds and/or difficult functional group

combinations but have a much smaller reduction effect on

the reference databases ChEMBL17 (12 %), DrugBank17

(11 %) and UNPD17 (16 %) (Table 2, lower part).

Finally, the heteroatom to carbon ratio is capped at 0.7, which removes 6 % of GDB17. Although this criterion eliminates a significant fraction of ChEMBL17 (16 %), DrugBank17 (40 %) and UNPD17 (10 %) (Table 2), molecules taken out by this filter have mostly negative clogP values and are not desirable from a drug design point of view (Figure 3). Note that many of the filters applied to GDB17 are size dependent; however, they do not strongly affect target-assigned molecules from ChEMBL up to 50 non-hydrogen atoms,[14] except for the ring count filter (≤ 3 rings), which eliminates 50-80 % of the compounds depending on the target class (Table S1).
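The polarity criterion can be illustrated with a short Python sketch operating on a molecular formula. It is only a sketch under stated assumptions: the text does not specify which atoms count as heteroatoms, so every heavy atom other than carbon (including halogens) is counted here, and the function names are hypothetical.

```python
HETERO_TO_CARBON_MAX = 0.7  # cut-off stated in the text


def hetero_to_carbon_ratio(element_counts: dict) -> float:
    """Ratio of heteroatoms to carbon atoms for one molecular formula.

    Assumption: every element other than C and H counts as a heteroatom.
    """
    carbons = element_counts.get("C", 0)
    heteroatoms = sum(n for element, n in element_counts.items()
                      if element not in ("C", "H"))
    if carbons == 0:
        # No carbon at all: treat the ratio as unbounded so the filter rejects it.
        return float("inf")
    return heteroatoms / carbons


def passes_polarity_filter(element_counts: dict) -> bool:
    """True if the molecule satisfies the heteroatom-to-carbon cap."""
    return hetero_to_carbon_ratio(element_counts) <= HETERO_TO_CARBON_MAX
```

For example, nicotine (C10H14N2) has a ratio of 2/10 and passes, whereas a small, highly polar drug such as metformin (C4H11N5) has a ratio of 5/4 and is rejected, which is in line with the observation that this filter mostly removes very polar molecules.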

Despite the functional group and complexity filters applied, the composition of the 17.8G set was heavily skewed towards the largest and most complex molecules. To obtain a more realistic selection for medicinal chemistry, we sampled this subset across molecular size, stereochemistry and polarity by randomly selecting a comparable number of molecules from each of the 244 occupied triplet value bins (heavy atom count, number of stereocenters, number of heteroatoms) (Figure S1; see Methods). This procedure resulted in a database of 10 million 2D-structures as SMILES, defined here as GDBMedChem.

Figure 3. clogP histogram for the compounds eliminated by the heteroatom to carbon ratio filter.

Figure 2. Histogram of structural complexity and polarity parameters used during GDBMedChem generation, for GDBMedChem (red),

its parent 17.8G set (magenta), the entire GDB17 (black), the fragment database FDB17 (blue) and its parent 4.6G set (cyan),

ChEMBL17 (green), DrugBank17 (grey) and UNPD17 (orange).


Figure 4. Molecular property histograms for GDBMedChem (red), its parent 17.8G set (magenta), the entire GDB17 (black), the fragment database FDB17 (blue) and its parent 4.6G set (cyan), ChEMBL17 (green), DrugBank17 (grey) and UNPD17 (orange). In plot (k) each molecule is assigned to a single category as a function of its ring types in the priority order heteroaromatic > aromatic > heterocyclic > carbocyclic > acyclic.

2.2 Property analysis

To gain an insight into the composition of GDBMedChem,

we analyzed the distribution of molecules across various

molecular properties in comparison to the 17.8G set from

which it was sampled, the entire GDB17, the related

fragment database FDB17 and its parent 4.6G set, and to

ChEMBL17, DrugBank17, and UNPD17 (Figure 4). Both

molecular size histograms (HAC and MW) show that the

even sampling procedure used to compose

GDBMedChem and FDB17 from their respective larger

17.8G sets and 4.6G sets corrects their highly skewed

distribution towards the largest molecules, resulting in a

distribution closer to known molecules in ChEMBL17,

DrugBank17 and UNPD17 (Figure 4a/b).


The rotatable bond count (RBC) profiles show that GDBMedChem molecules have similar structural flexibility compared to known molecules, in contrast to the fragment database FDB17 and its parent 4.6G set, which stand out by their low number of rotatable bonds due to the fragment-likeness rule RBC ≤ 3 (Figure 4c). The HBD, HBA, clogP and O+N count profiles in GDBMedChem are also similar to those of known molecules. This similarity results from the even sampling procedure, since the profiles of the parent 17.8G set are clearly different and match those of GDB17 (Figure 4d, e, f, g).

In terms of synthetic accessibility score, GDBMedChem and its parent 17.8G set show slightly lower values compared to GDB17 and are quite similar to FDB17, reflecting the role of the structural complexity filters (Figure 4i).[15] However, GDBMedChem and FDB17 molecules remain significantly less synthetically accessible than known molecules according to this score.

The natural product likeness of all GDB databases is significantly higher than that of DrugBank17 and ChEMBL17, although it is still lower than for the natural products themselves in UNPD17 (Figure 4j).[16] The higher natural product likeness of GDB molecules compared to drugs and ChEMBL molecules probably reflects their fraction of sp3 carbon atoms, which is higher in natural products and GDB molecules than in drugs and ChEMBL molecules (Figure 4h).

Finally, all GDB databases show a very low

percentage of aromatic molecules but a much higher

percentage of heterocyclic molecules compared to known

molecules. This results from the combinatorial

enumeration used to generate GDBs, which produces

many more combinations when heteroatoms are present

in rings (Figure 4k).

2.3 Substructure analysis

To assess whether GDBMedChem molecules are significantly different from known molecules, we compared substructures in GDBMedChem molecules to those from molecules in the entire DrugBank, ChEMBL, and ZINC databases, independent of molecular size. To perform this analysis, we collected the molecular shingles used in the calculation of the MinHash fingerprint MHFP6, an extended connectivity fingerprint which outperforms ECFP4 in benchmarking studies.[17] In MHFP6, molecular shingles comprise extended connectivity substructures around each atom up to a diameter of 6 bonds as well as all ring structures, and are written as SMILES with the rooted atom appearing as the first atom in the SMILES string.

GDBMedChem molecules contain on average 38 shingles per molecule, which is approximately two-thirds of the number of shingles per molecule found in known molecules from ChEMBL, DrugBank and ZINC, reflecting the smaller size of GDBMedChem molecules (Table 3). The total number of unique shingles in each database grows as a function of the number of molecules surveyed in the database; however, this number grows faster and more steadily in GDBMedChem than in the databases of known molecules (Figure 5a). The number of occurrences of shingles in each of the four databases follows a power law distribution, with a small number of shingles appearing in almost all molecules, and a large number of shingles appearing in only a few molecules (Figure 5b).

A Venn diagram analysis shows that almost all 17.3

million unique shingles in GDBMedChem (97 %) are

unique to this database, illustrating that GDBMedChem

molecules are very different from known molecules at the

level of their substructures (Figure 5c). By comparison,

the three databases of known molecules contain a much

smaller fraction of unique shingles, mostly because they

contain a much smaller number of unique shingles. This

is particularly striking with ZINC, which contains 15 million

molecules but only 1.5 million shingles. Interestingly, the

difference between GDBMedChem and known molecules

also holds true when focusing on the 100 most frequent

shingles in each database (Figure 5d). Among these frequent shingles, oxygen-containing saturated or singly unsaturated shingles stand out in GDBMedChem, in contrast to aromatic and nitrogen heterocycles in ZINC (Figure S2). The 29 frequent shingles shared by all four databases correspond to simple aliphatic substructures, alcohol, carbonyl, and the benzene ring.

Table 3. MHFP6 shingle analysis for GDBMedChem and other databases.

| Database | No. of molecules | Av. ± SD shingles per molecule a) | No. of shingles in database b) | No. of shingles unique to the database c) |
|---|---|---|---|---|
| GDBMedChem | 10,006,044 | 38 ± 6 | 17,317,417 | 16,798,833 (97 %) |
| DrugBank | 8,299 | 57 ± 26 | 82,193 | 9,995 (12 %) |
| ChEMBL | 1,446,502 | 64 ± 16 | 1,593,674 | 911,181 (57 %) |
| ZINC | 15,149,974 | 59 ± 12 | 1,477,745 | 780,276 (53 %) |

a) Average number of MHFP6 shingles per molecule, considering unique shingles in each molecule separately. MHFP6 shingles are substructures around each atom with a diameter of up to six bonds; see Ref. [17] for details. b) Total number of unique shingles across the entire database. c) Shingles that do not occur in any of the other three databases.


Figure 5. MHFP6 shingle analysis. (a) Cumulative number of unique shingles and (b) frequency distribution of shingles for GDBMedChem, DrugBank, ChEMBL and ZINC. To compute the cumulative number of unique shingles, the compounds in each database were randomly ordered (x-axis). (c-d) Venn diagrams showing the number of MHFP6 shingles unique to and shared among the different databases. In (d) only the 100 most frequent shingles from each database were considered.

2.4 Interactive visualization and search tools

To enable a closer insight into GDBMedChem we

represented the 10 million molecules in a principal

component analysis (PCA) 3D-projection of the 42D-

Molecular Quantum Number (MQN)[18] chemical space

using Faerun,[19] which is available at http://gdb.unibe.ch.

We also generated MQN 3D-maps for the fragment

database FDB17, and for the 128,011 known molecules in

ChEMBL17, DrugBank17 and UNPD17. PCA is a classical

approach for visualizing chemical space[20] which is much

faster than other dimensionality reduction methods[21] and

therefore well suited for large datasets such as GDB.[22]

MQN is a composition fingerprint counting different types of atoms, bonds, polar groups and topologies. Since many of these descriptors are correlated, MQN datasets project well into 3D by PCA.[23] The resulting MQN maps typically order compounds by size, number of rings, and polarity. This is illustrated here by color-coding by HAC and ring atom count (Figure 6). These maps illustrate that GDBMedChem covers a similar chemical space as FDB17 and known molecules; however, the coverage by GDBMedChem and FDB17 is much denser than for known molecules. Note that GDBMedChem covers a broader range of structures than FDB17 by allowing a higher number of rotatable bonds, which leads to more acyclic molecules. The interactive versions of the maps in Faerun allow one to efficiently browse GDBMedChem and related databases and gain an overview of their contents.

To further facilitate the exploitation of GDBMedChem,

we have generated a multi-fingerprint browser accessible at

http://gdb.unibe.ch (Figure 7).[24] In this browser we allow

nearest neighbor (NN) searches of any query molecule in

GDBMedChem, or for comparison in the combined set

ChEMBL17 + DrugBank17 + UNPD17. The search is

implemented using Annoy (Approximate Nearest Neighbors

Oh Yeah, https://github.com/spotify/annoy), which provides

very fast search results even for relatively large

databases.[25] Similarity searching is possible by shortest

city-block distance according to the MQN fingerprint,[18] by

highest similarity according to the extended connectivity

fingerprint ECFP4,[26] or by a combined search retrieving

NNs in MQN, followed by sorting these NNs by MHFP6

similarity to the query, which orders results according to a

detailed substructure logic.[17]

Similarity searches in GDBMedChem with this browser

readily return high-similarity analogs for any query molecule.

Typical results of such searches are exemplified here for

ten drugs of 17 atoms or less, for which we identified two

nearest neighbors from GDBMedChem that are not

currently documented in Scifinder (Figure 8).

3 Conclusion

To address the overwhelming complexity of molecules in GDB17 we applied a set of medicinal chemistry criteria and complexity filters to define GDBMedChem, a subset of 10 million drug-like molecules covering a broad range of molecular size, polarity and stereochemistry. The vast majority of molecules in GDBMedChem are as yet unknown and represent a valuable resource for medicinal chemistry. The database is available for 3D-visualization as well as for similarity searching and download at http://gdb.unibe.ch.


Figure 6. MQN PCA-maps of GDBMedChem, FDB17 and the merged database (DrugBank17 + ChEMBL17 + UNPD17), color-coded according to the heavy atom count and the number of atoms in rings. Color changes from blue to cyan to green to yellow to red to magenta with increasing value of a property. PCA variance covered: GDBMedChem (PC1: 42%, PC2: 18%, PC3: 11%), FDB17 (PC1: 32%, PC2: 22%, PC3: 10%), DrugBank17 + ChEMBL17 + UNPD17 (PC1: 41%, PC2: 17%, PC3: 15%). The images were generated with Faerun and are accessible at http://faerun.gdb.tools.

Figure 7. Multifingerprint browser interface for GDBMedChem and ChEMBL17 + DrugBank17 + UNPD17 database. (a) Entry page

of browser showing nicotine as query compound. (b) Search result window displaying MQN nearest neighbors of nicotine. The

multifingerprint browser is publicly accessible at http://gdb-medchem-simsearch.gdb.tools.

[Figure 6 panels, left to right: GDBMedChem, FDB17, DrugBank17 + UNPD17 + ChEMBL17; a) color-coded by heavy atom count, b) color-coded by number of atoms in rings.]

Figure 8. Ten representative drugs of 17 or fewer heavy atoms (blue) and in each case two NNs (black) retrieved from GDBMedChem using the multifingerprint browser with the combined MQN-MHFP6 method (see Methods for details). Numbers indicate MQN city-block distance / MHFP6 Tanimoto coefficient / rank in NN list. The NNs shown are currently not documented in Scifinder.

4 Methods

4.1 Assembly of GDBMedChem

The functional group and structural complexity filters listed in Table 2 were applied to GDB17 (stored as split SMILES files) as the first step in assembling the GDBMedChem database. All calculations were performed using RDKit (version 2017_09_03) and the PySpark (version 2.3.2) parallel computing framework on a 98-node cluster with 252 GB of RAM. Sixteen of the 21 filters listed in Table 2 were implemented as SMARTS queries, and the remaining five filters (stereocenters, ring count, Avalon density, heteroatom to carbon ratio and largest aromatic ring size) were implemented using other functions provided in RDKit. It should be noted that filters were applied in a progressive manner (simple and obvious filters first) and not in the order of Table 2. Molecules violating any of the filtering criteria were removed from the GDB17 database. This resulted in a subset of 17,804,900,000 molecules (17.8G set).

The molecules from the 17.8G set were assigned to 425 triplet bins generated from all possible combinations of the values of three descriptors, namely heavy atom count (1 to 17), number of heteroatoms (≤ 1, 2, 3, 4, ≥ 5) and number of stereocenters (0, 1, 2, 3, 4). Of these 425 triplet bins, 181 were not occupied by any molecule, leaving 244 bins for further consideration. The binned 17.8G set was then stored as a PySpark DataFrame (data schema: [SMILES: string, Triplet bin: string]), in which each entry contains two fields, namely the SMILES of a compound and its triplet bin. Next, 10 million molecules were sampled from the DataFrame using the PySpark “sampleBy” function to form GDBMedChem, using stratified sampling without replacement.
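The binning scheme can be sketched in a few lines of Python. This is a minimal illustration, assuming that the three descriptor values have already been computed for each molecule; the function names and the label format of the bins are hypothetical and not taken from the original implementation.

```python
from itertools import product

# Descriptor levels as described in the text.
HAC_LEVELS = list(range(1, 18))                 # heavy atom count 1..17
HETERO_LEVELS = ["<=1", "2", "3", "4", ">=5"]   # number of heteroatoms
STEREO_LEVELS = [0, 1, 2, 3, 4]                 # number of stereocenters


def hetero_level(n_hetero: int) -> str:
    """Map a heteroatom count onto one of the five heteroatom levels."""
    if n_hetero <= 1:
        return "<=1"
    if n_hetero >= 5:
        return ">=5"
    return str(n_hetero)


def triplet_bin(hac: int, n_hetero: int, n_stereo: int) -> str:
    """Return the triplet bin label (hypothetical format) of one molecule."""
    if not 1 <= hac <= 17:
        raise ValueError("heavy atom count must be between 1 and 17")
    if not 0 <= n_stereo <= 4:
        raise ValueError("number of stereocenters must be between 0 and 4")
    return f"{hac}_{hetero_level(n_hetero)}_{n_stereo}"


# All possible bins: 17 x 5 x 5 combinations of the three descriptors.
ALL_BINS = [f"{h}_{x}_{s}"
            for h, x, s in product(HAC_LEVELS, HETERO_LEVELS, STEREO_LEVELS)]
```

Enumerating `ALL_BINS` reproduces the 425 possible bins; many of them cannot be occupied (a molecule with a single heavy atom cannot have four stereocenters, for instance), which is consistent with only a subset of bins being populated.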

The PySpark “sampleBy” function generates stratified samples from a DataFrame given two input parameters: i) the name of the column that defines the strata and ii) a Python dictionary containing the names of the strata as keys and the fraction of entries to sample from each stratum as the corresponding values. The strata are the bins used to group the entries in the DataFrame. In our case the column “Triplet bin” was used to define the strata (244 strata corresponding to the 244 unique triplet bins in the DataFrame). The fraction of entries to sample from each stratum was computed as follows. The Python dictionary variables n_selected = {stratum1: 0, stratum2: 0, …, stratum244: 0} and n_total = {stratum1: x, stratum2: x, …, stratum244: x} were initialized. In both variables, the keys are the names of the strata (triplet bins). The values in n_selected indicate the number of compounds to sample from each of the 244 strata, and the values in n_total indicate the total number of compounds present in each of the 244 strata. The values x were computed beforehand. Thereafter, the items in n_selected were iterated over repeatedly (until the sum of the values in n_selected == 10M), each time incrementing the value of each stratum by 1, provided that the n_selected value for that stratum was less than its n_total value. Finally, the fraction of compounds to sample from each stratum was computed by dividing the n_selected value for that stratum by the corresponding n_total value.
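The allocation procedure described above can be sketched in plain Python as follows. This is an illustrative reimplementation under stated assumptions, not the original code: the function name `sampling_fractions` is hypothetical, and the loop stops in the middle of a pass once the target is reached so that the allocated counts sum exactly to the target.

```python
def sampling_fractions(n_total: dict, target: int) -> dict:
    """Allocate `target` picks across strata round-robin, then convert to fractions.

    `n_total` maps each stratum name to the number of molecules it contains.
    Small strata are exhausted first; the remaining picks are spread evenly
    over the larger strata.
    """
    if target > sum(n_total.values()):
        raise ValueError("target exceeds the number of available molecules")

    n_selected = {stratum: 0 for stratum in n_total}
    remaining = target
    while remaining > 0:
        for stratum, available in n_total.items():
            if remaining == 0:
                break
            # Increment only strata that still have unselected molecules.
            if n_selected[stratum] < available:
                n_selected[stratum] += 1
                remaining -= 1

    # Fraction of each (non-empty) stratum to sample.
    return {s: n_selected[s] / n_total[s] for s in n_total if n_total[s] > 0}
```

The resulting dictionary has the form expected by the `fractions` argument of the PySpark `sampleBy` function, for example `df.sampleBy("Triplet bin", fractions=fractions, seed=42)`, where `df` and the seed value are placeholders. Since `sampleBy` draws rows according to per-stratum probabilities, the realized sample size is approximate rather than exactly equal to the target.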

4.2 Other databases

The random subset containing 50M molecules from GDB17 and the FDB17 database were downloaded from the http://gdb.unibe.ch website. For the 4.6G set from which FDB17 is sampled, we used an in-house copy of the database. Random subsets (containing 50M molecules) of the 17.8G and 4.6G sets were generated using the PySpark sample function. ChEMBL version 22 was downloaded from https://www.ebi.ac.uk/chembl/, DrugBank version 5.011 from https://www.drugbank.ca/ and UNPD from http://oolonek.github.io/ISDB/. The ChEMBL17, DrugBank17 and UNPD17 databases were generated by removing the compounds containing more than 17 heavy atoms. The merged database (DrugBank17_UNPD17_ChEMBL17) was formed by merging all molecules from DrugBank17, ChEMBL17 and UNPD17. For the ppb2 set we used an in-house copy of the database previously prepared for the polypharmacology browser ppb2.[14] In the ppb2 set, each compound is annotated with its SMILES, ChEMBL compound ID and ChEMBL target IDs. Compounds from the ppb2 set were assigned to target families based on the classification provided in ChEMBL 22. Within each target family only unique compounds were considered for further computation.

4.3 Processing of molecules and calculation of MQN

and molecular properties

For each molecule, counter ions were removed (if present) and the largest fragment was retained (if the molecule contained multiple fragments). Thereafter, all molecules were processed in non-chiral SMILES format, checked for valence errors, and protonated at pH 7.4 using an in-house Java program utilizing the JChem chemistry library from ChemAxon Pvt. Ltd. Next, duplicate molecules were removed within each database based on unique SMILES notation. Molecular properties were calculated for each molecule using RDKit, except for the number of rings and the hydrogen bond acceptor and donor counts, which were calculated using JChem. MQN fingerprints were calculated using an in-house Java program utilizing the JChem chemistry library.

4.4 MHFP6 shingle analysis

MHFP6 (diameter of 6) shingles were calculated for each molecule in each of the GDB17 (10M), GDBMedChem, FDB17, ChEMBL17, DrugBank17, UNPD17, ChEMBL, DrugBank and ZINC databases using the MHFP Python package (GitHub repository: https://github.com/reymond-group/mhfp, pip: https://pypi.org/project/mhfp/). More specifically, the function “shingling_from_smiles” of the MHFPEncoder class was used to generate the shingles. Then, for each database, the total number of unique MHFP6 shingles was calculated by simple string comparison of all MHFP6 shingles in that database. Additionally, for each database, the average number of shingles per compound was calculated by dividing the summed number of unique shingles of the individual compounds by the total number of compounds in the database (see Table 3).
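Once the shingles of each molecule are available as a set of strings, the statistics of Table 3 and Figure 5a reduce to set operations. The sketch below illustrates this in plain Python; it assumes each database is given as a list of per-molecule shingle sets (however they were generated), and all function names are hypothetical.

```python
from statistics import mean


def database_shingles(molecules):
    """Union of the shingle sets of all molecules in one database."""
    return set().union(*molecules)


def mean_shingles_per_molecule(molecules):
    """Average number of unique shingles per molecule."""
    return mean(len(shingles) for shingles in molecules)


def shingles_unique_to(name, databases):
    """Shingles of database `name` that occur in none of the other databases."""
    others = set()
    for other_name, molecules in databases.items():
        if other_name != name:
            others |= database_shingles(molecules)
    return database_shingles(databases[name]) - others


def cumulative_unique_counts(molecules):
    """Number of distinct shingles seen after each successive molecule."""
    seen, counts = set(), []
    for shingles in molecules:
        seen |= shingles
        counts.append(len(seen))
    return counts
```

For databases with millions of molecules the union sets become large, so a distributed or disk-backed approach may be preferable in practice; the logic, however, stays the same.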

4.5 Web-application for visualization

The web application for interactive visualization of property color-coded 3D-spaces of GDBMedChem, FDB17 and the DrugBank17_UNPD17_ChEMBL17 database was built using FUn (http://doc.gdb.tools/fun/), an in-house developed framework for visualization of chemical spaces on the web. The three main components of the FUn framework are the data preprocessing toolchain, the data service (Underdark Go) and Faerun, a web application for interactive visualization. Initially, we stored each database in a plain text file format. In this file each line contains 4 fields separated by spaces: i) SMILES of a compound, ii) name or id of a compound, iii) 42 MQN values (separated by colon) for a compound and iv) molecular properties (separated by colon) to use for color coding. Next, the plain text file for each database was pre-processed using the data preprocessing toolchain (https://github.com/reymond-group/pca), which projects the 42-dimensional MQN chemical space into 3 dimensions using Principal Component Analysis (PCA) and generates all the necessary files for visualization. Thereafter, the Underdark data service and Faerun visualization containers were run using Docker.
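The input file layout described above can be illustrated with a small parser. This is a sketch based only on the description in the text: it assumes that compound names contain no spaces, that MQN values are integers and that properties are numeric; the `Record` type and `parse_line` function are hypothetical names, not part of the FUn toolchain.

```python
from typing import NamedTuple


class Record(NamedTuple):
    smiles: str
    name: str
    mqn: list          # 42 integer MQN descriptor values
    properties: list   # numeric properties used for color coding


def parse_line(line: str, sep: str = ":") -> Record:
    """Parse one space-separated line: SMILES, name, MQN values, properties."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, found {len(fields)}")
    smiles, name, mqn_field, property_field = fields

    mqn = [int(value) for value in mqn_field.split(sep)]
    if len(mqn) != 42:
        raise ValueError(f"expected 42 MQN values, found {len(mqn)}")

    properties = [float(value) for value in property_field.split(sep)]
    return Record(smiles, name, mqn, properties)
```

Validating the number of fields and MQN values at parsing time helps catch malformed lines before the comparatively expensive projection step.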

4.6 Web-application for similarity searching

The web application for similarity searching in

GDBMedChem and the DrugBank17_UNPD17_ChEMBL17

database is implemented using HTML, Bootstrap and

JavaScript (frontend) and the Flask Python web framework. To

enable fast similarity searching, we implemented

approximate nearest neighbor searching using Annoy

(Approximate Nearest Neighbors Oh Yeah,

https://github.com/spotify/annoy). The option is provided to

perform a similarity search using the 42-dimensional MQN, the

256-dimensional extended connectivity fingerprint (ECfp4),

or a combination of MQN and the 128-dimensional MinHash

fingerprint (MHFP6). In the case of MQN, the similarity search

application retrieves the nearest neighbors of a query

molecule using an MQN-based Annoy tree and ranks them

by increasing city block (Manhattan) distance.

In the case of MQN-MHFP6, the similarity search application

first retrieves the nearest neighbors of a query molecule

using the MQN-based Annoy tree and then re-sorts these

neighbors by their Jaccard distance to the

query molecule in MHFP6 fingerprint space. The

calculation of the MQN of a molecule is implemented using a

Java program written in-house, while the calculation of the MHFP6

fingerprint is implemented using the Python package from the GitHub repository

https://github.com/reymond-group/mhfp.
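The two-stage MQN-MHFP6 search can be sketched as follows. This is an illustration only: an exhaustive scan replaces the approximate Annoy index, the exact Jaccard distance on shingle sets replaces the MinHash estimate, the vectors are shortened to 3 dimensions, and all function names and data are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple


def city_block(a: Sequence[int], b: Sequence[int]) -> int:
    """City block (Manhattan) distance between two descriptor vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """Exact Jaccard distance between two sets (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def mqn_search(query: Sequence[int], db: Dict[str, Sequence[int]],
               k: int) -> List[Tuple[str, int]]:
    """Stage 1: k nearest neighbors by increasing city block distance."""
    scored = [(name, city_block(query, vec)) for name, vec in db.items()]
    scored.sort(key=lambda item: item[1])
    return scored[:k]


def mqn_mhfp6_search(query_mqn: Sequence[int], query_shingles: Set[str],
                     db_mqn: Dict[str, Sequence[int]],
                     db_shingles: Dict[str, Set[str]],
                     k: int) -> List[Tuple[str, float]]:
    """Stage 2: re-sort the MQN neighbors by Jaccard distance."""
    candidates = mqn_search(query_mqn, db_mqn, k)
    rescored = [(name, jaccard_distance(query_shingles, db_shingles[name]))
                for name, _ in candidates]
    rescored.sort(key=lambda item: item[1])
    return rescored


# Toy data: shortened vectors and hypothetical shingle strings.
db_mqn = {"A": [2, 1, 0], "B": [2, 0, 0], "C": [9, 9, 9]}
db_shingles = {"A": {"CC", "CO"}, "B": {"CC", "CN", "CO"}, "C": {"c1ccccc1"}}
by_mqn = mqn_search([2, 1, 0], db_mqn, k=2)
by_mhfp6 = mqn_mhfp6_search([2, 1, 0], {"CC", "CN"}, db_mqn, db_shingles, k=2)
print([name for name, _ in by_mqn], [name for name, _ in by_mhfp6])
```

In this toy case the second stage reverses the order of the two MQN neighbors, which shows how the re-sorting step can promote compounds that share more substructures with the query.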





Acknowledgements

This work was supported financially by a grant from NIBR to

MA. We thank ChemAxon Pvt. Ltd. for providing free

academic and web licenses for their products.

References

[1] a) R. S. Bohacek, C. McMartin, W. C. Guida, Med. Res. Rev. 1996, 16, 3-50; b) M. Hartenfeller, H. Zettl, M. Walter, M. Rupp, F. Reisen, E. Proschak, S. Weggen, H. Stark, G. Schneider, PLoS Comput. Biol. 2012, 8, e1002380; c) J. L. Reymond, L. Ruddigkeit, L. C. Blum, R. Van Deursen, WIREs comput. Mol. Sci. 2012, doi: 10.1002/wcms.1104; d) F. Chevillard, P. Kolb, J. Chem. Inf. Model. 2015, 55, 1824-1835; e) M. Awale, R. Visini, D. Probst, J. Arus-Pous, J. L. Reymond, Chimia 2017, 71, 661-666; f) J. Boström, D. G. Brown, R. J. Young, G. M. Keserü, Nat. Rev. Drug Discovery 2018, 17, 709-727; g) N. van Hilten, F. Chevillard, P. Kolb, J. Chem. Inf. Model. 2019, doi: 10.1021/acs.jcim.1028b00737.

[2] T. Fink, J. L. Reymond, J. Chem. Inf. Model. 2007, 47, 342-353.

[3] L. C. Blum, J.-L. Reymond, J. Am. Chem. Soc. 2009, 131, 8732-8733.

[4] L. Ruddigkeit, R. van Deursen, L. C. Blum, J. L. Reymond, J. Chem. Inf. Model. 2012, 52, 2864-2875.

[5] B. D. McKay, Congressus Numerantium 1981, 30, 45-87.

[6] R. Visini, J. Arus-Pous, M. Awale, J. L. Reymond, J. Chem. Inf. Model. 2017, 57, 2707-2718.

[7] a) M. Hann, B. Hudson, X. Lewell, R. Lifely, L. Miller, N. Ramsden, J. Chem. Inf. Comp. Sci. 1999, 39, 897-902; b) I. Muegge, S. L. Heald, D. Brittelli, J. Med. Chem. 2001, 44, 1841-1846; c) D. F. Veber, S. R. Johnson, H.-Y. Cheng, B. R. Smith, K. W. Ward, K. D. Kopple, J. Med. Chem. 2002, 45, 2615-2623; d) W. P. Walters, M. A. Murcko, Advanced Drug Deliv. Rev. 2002, 54, 255-271; e) R. F. Bruns, I. A. Watson, J. Med. Chem. 2012, 55, 9763-9772; f) S. J. Capuzzi, E. N. Muratov, A. Tropsha, J. Chem. Inf. Model. 2017, 57, 417-427.

[8] R. Visini, M. Awale, J.-L. Reymond, J. Chem. Inf. Model. 2017, 57, 700-709.

[9] M. Congreve, R. Carr, C. Murray, H. Jhoti, Drug Discovery Today 2003, 8, 876-877.

[10] V. Law, C. Knox, Y. Djoumbou, T. Jewison, A. C. Guo, Y. Liu, A. Maciejewski, D. Arndt, M. Wilson, V. Neveu, A. Tang, G. Gabriel, C. Ly, S. Adamjee, Z. T. Dame, B. Han, Y. Zhou, D. S. Wishart, Nucleic Acids Res. 2014, 42, D1091-D1097.

[11] A. P. Bento, A. Gaulton, A. Hersey, L. J. Bellis, J. Chambers, M. Davies, F. A. Krüger, Y. Light, L. Mak, S. McGlinchey, M. Nowotka, G. Papadatos, R. Santos, J. P. Overington, Nucleic Acids Res. 2014, 42, D1083-D1090.

[12] P. Banerjee, J. Erehman, B.-O. Gohlke, T. Wilhelm, R. Preissner, M. Dunkel, Nucleic Acids Res. 2015, 43, D935-D939.

[13] P. Gedeck, B. Rohde, C. Bartels, J. Chem. Inf. Model. 2006, 46, 1924-1936.

[14] M. Awale, J. L. Reymond, J. Chem. Inf. Model. 2019, 59, 10-17.

[15] P. Ertl, A. Schuffenhauer, J. Cheminform. 2009, 1, 8.

[16] a) P. Ertl, S. Roggo, A. Schuffenhauer, J. Chem. Inf. Model. 2008, 48, 68-74; b) K. V. Jayaseelan, P. Moreno, A. Truszkowski, P. Ertl, C. Steinbeck, BMC Bioinformatics 2012, 13, 106.

[17] D. Probst, J. L. Reymond, J. Cheminform. 2018, 10, 66.

[18] K. T. Nguyen, L. C. Blum, R. van Deursen, J.-L. Reymond, ChemMedChem 2009, 4, 1803-1805.

[19] D. Probst, J. L. Reymond, Bioinformatics 2018, 34, 1433-1435.

[20] a) T. I. Oprea, J. Gottfries, J. Comb. Chem. 2001, 3, 157-166; b) J. Rosen, J. Gottfries, S. Muresan, A. Backlund, T. I. Oprea, J. Med. Chem. 2009, 52, 1953-1962.

[21] a) A. M. Wassermann, M. Wawer, J. Bajorath, J. Med. Chem. 2010, 53, 8209-8923; b) H. A. Gaspar, I. I. Baskin, G. Marcou, D. Horvath, A. Varnek, J. Chem. Inf. Model. 2014, 55, 84-94; c) T. Sander, J. Freyss, M. von Korff, C. Rufener, J. Chem. Inf. Model. 2015, 55, 460-473; d) A. Lin, D. Horvath, V. Afonina, G. Marcou, J.-L. Reymond, A. Varnek, ChemMedChem 2018, 13, 540-554.

[22] a) L. C. Blum, R. van Deursen, J. L. Reymond, J. Comput.-Aided Mol. Des. 2011, 25, 637-647; b) L. Ruddigkeit, L. C. Blum, J.-L. Reymond, J. Chem. Inf. Model. 2013, 53, 56-65.

[23] R. van Deursen, L. C. Blum, J. L. Reymond, J. Chem. Inf. Model. 2010, 50, 1924-1934.

[24] M. Awale, J. L. Reymond, Nucleic Acids Res. 2014, 42, W234-239.

[25] A. Capecchi, M. Awale, D. Probst, J. L. Reymond, ChemRXiv 2019, doi: 10.26434/chemrxiv.7650071.v7650072.

[26] D. Rogers, M. Hahn, J. Chem. Inf. Model. 2010, 50, 742-754.


Supporting information for:

Medicinal Chemistry Database GDBMedChem

Mahendra Awale,[a] Finton Sirockin,[b] Nikolaus Stiefl[b] and Jean-Louis Reymond*[a]

[a] Department of Chemistry and Biochemistry, University of Bern Freiestrasse 3, 3012 Bern, Switzerland

*e-mail: [email protected]

[b] Novartis Institutes for Biomedical Research, Basel, Switzerland

Table of contents

Figure S1

Figure S2

Table S1


Figure S1. Frequency histograms for the 17.8G set (magenta line) and the 10 million

GDBMedChem database (red line) across (A) molecular size (1-17), (B) stereocenters (0, 1,

2, 3, 4) and (C) heteroatoms (N+O+S: ≤1, 2, 3, 4, ≥5). In D the frequency histogram is shown

by individual triplet value bins (HAC, heteroatoms, stereocenters) sorted by decreasing

occupancy in the 17.8G set (magenta line) and in GDBMedChem (red line).


Figure S2. Structures of MHFP6 shingles selected from the 100 most frequent shingles in

GDBMedChem, ZINC, ChEMBL and DrugBank. * indicates the rooted atom in each shingle.

Shingles without rooted atom are ring shingles. (a) Examples of shingles unique to

GDBMedChem molecules among the set of 100 most frequent shingles in the four databases.

(b) Examples of shingles unique to ZINC among the set of 100 most frequent shingles in the

four databases. (c) Examples of shingles common to GDBMedChem, ZINC, ChEMBL and

DrugBank. For each shingle, the SMILES string and the number of compounds from

the database containing that shingle are shown.


Table S1. Filters used for selecting GDBMedChem from GDB17, and percentages of bioactive compounds a) from a given target family that are eliminated by each filter.

| Filter | Kinases | Proteases | Other enzymes | Membrane receptors | Ion channels | Transporters | Transcription factors | Others |
|---|---|---|---|---|---|---|---|---|
| Functional groups: | | | | | | | | |
| no amidine | 0.18 | 2.08 | 0.64 | 0.66 | 0.19 | 0.27 | 0.16 | 0.88 |
| no imidate | 0.12 | 2.03 | 0.34 | 0.21 | 0.33 | 0.09 | 0.36 | 0.13 |
| no aldehyde | 0.13 | 1.57 | 0.55 | 0.13 | 0.18 | 0.07 | 0.17 | 0.19 |
| no aziridine | 0.01 | 0.01 | 0.01 | 0.00 | 0.00 | 0.01 | 0.10 | 0.01 |
| no aromatic ring > 6 atoms | 0.22 | 0.00 | 0.03 | 0.04 | 0.06 | 0.08 | 0.03 | 0.21 |
| no Br, I | 5.25 | 3.37 | 4.04 | 4.51 | 3.71 | 4.74 | 4.02 | 4.38 |
| ≤ 2 ethers | 4.91 | 4.37 | 4.90 | 4.05 | 3.13 | 7.40 | 5.76 | 2.81 |
| no epoxide | 0.04 | 0.23 | 0.12 | 0.03 | 0.00 | 0.04 | 0.07 | 0.12 |
| no formate, acetate, or methyl ester | 4.95 | 4.60 | 5.02 | 4.08 | 3.13 | 7.44 | 5.82 | 2.93 |
| no Cl or F on heterocycle | 5.71 | 8.46 | 3.78 | 4.37 | 6.78 | 2.58 | 2.18 | 6.01 |
| ≤ 1 acetylene | 0.03 | 0.03 | 0.03 | 0.03 | 0.14 | 0.04 | 0.11 | 0.04 |
| ≤ 1 nitrile | 0.36 | 0.56 | 0.29 | 0.15 | 0.27 | 0.44 | 0.42 | 0.28 |
| ≤ 1 sulfone | 0.20 | 1.29 | 1.84 | 0.53 | 0.49 | 0.21 | 0.80 | 0.72 |
| ≤ 2 amides | 1.04 | 9.07 | 0.86 | 2.64 | 0.53 | 0.15 | 0.92 | 6.82 |
| ≤ 2 acyclic esters | 0.01 | 0.06 | 0.22 | 0.03 | 0.11 | 0.46 | 0.01 | 0.04 |
| Combined FG filters b) | 13.65 | 25.91 | 16.99 | 16.26 | 10.53 | 25.22 | 20.73 | 18.90 |
| Structural complexity: | | | | | | | | |
| ≤ 18 avalon density | 1.36 | 0.25 | 1.66 | 1.26 | 0.83 | 0.37 | 0.60 | 0.64 |
| ≤ 1 cyclic tetravalent node | 0.62 | 3.26 | 3.42 | 2.88 | 2.60 | 4.24 | 9.11 | 2.27 |
| ≤ 4 stereocenters | 0.32 | 3.25 | 3.57 | 3.32 | 1.09 | 14.01 | 6.86 | 3.85 |
| ≤ 3 bonds in fused ring systems | 3.88 | 4.00 | 6.28 | 5.96 | 7.46 | 5.40 | 11.98 | 3.90 |
| ≤ 3 rings | 78.42 | 47.75 | 51.01 | 66.12 | 59.14 | 56.16 | 55.02 | 56.12 |
| Combined complexity filters c) | 79.27 | 50.40 | 53.71 | 67.50 | 60.25 | 62.24 | 57.32 | 58.24 |
| Polarity: | | | | | | | | |
| ≤ 0.7 hetero atoms to carbon ratio | 0.89 | 2.64 | 5.43 | 1.25 | 2.80 | 0.68 | 1.15 | 2.40 |
| All filters combined d) | 83.97 | 63.83 | 64.26 | 72.71 | 66.09 | 68.17 | 65.57 | 67.40 |

a) 350K bioactive compounds from our recently reported polypharmacology browser (ppb2), classified according to their target family. These compounds were originally extracted from the ChEMBL22 database.

b) Percentage of compounds eliminated from each database by applying all functional group filters.

c) Percentage of compounds eliminated from each database by applying all structural complexity filters.

d) Percentage of compounds eliminated from each database by applying all filters.
