DOI: 10.1002/minf.201700130

Active Search for Computer-aided Drug Design

Dino Oglic,*[a, b] Steven A. Oatley,[c] Simon J. F. Macdonald,[d] Thomas Mcinally,[c] Roman Garnett,[e] Jonathan D. Hirst,[c] and Thomas Gärtner[a]

Abstract: We consider lead discovery as active search in a space of labelled graphs. In particular, we extend our recent data-driven adaptive Markov chain approach, and evaluate it on a focused drug design problem, where we search for an antagonist of an αv integrin, the target protein that belongs to a group of Arg-Gly-Asp integrin receptors. This group of integrin receptors is thought to play a key role in idiopathic pulmonary fibrosis, a chronic lung disease of significant pharmaceutical interest. As an in silico proxy of the binding affinity, we use a molecular docking score to an experimentally determined αvβ6 protein structure. The search is driven by a probabilistic surrogate of the activity of all molecules from that space. As the process evolves and the algorithm observes the activity scores of the previously designed molecules, the hypothesis of the activity is refined. The algorithm is guaranteed to converge in probability to the best hypothesis from an a priori specified hypothesis space. In our empirical evaluations, the approach achieves a large structural variety of designed molecular structures for which the docking score is better than the desired threshold. Some novel molecules, suggested to be active by the surrogate model, provoke a significant interest from the perspective of medicinal chemistry and warrant prioritization for synthesis. Moreover, the approach discovered 19 out of the 24 active compounds which are known to be active from previous biological assays.

Keywords: active search · antagonist · cheminformatics · drug design · integrin

1 Introduction

We investigate a data-driven adaptive Markov chain approach for computer-aided drug design with the goal of speeding up the discovery process. The approach is an adaptation of our recent work on active search in intensionally specified structured spaces.[1] An intensional description specifies a set of structures with necessary and sufficient conditions for any structure to be in that set. In contrast to this, an extensional definition simply lists all elements of the set. The intensional specification is often much smaller than the extensional one. The difference in the size of specification is important when considering the runtime and space complexities of algorithms. While the extensional definitions of significant parts of the set of all potentially synthesizable molecules (estimates of the cardinality are often larger than 10^60) cannot be stored on a disk nor enumerated in a feasible time, sampling from intensionally defined parts is by no means impossible.
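The distinction between the two kinds of specification can be illustrated with a toy example. The sketch below is not taken from the paper: the alphabet, the string length, and the "no two adjacent O" rule are hypothetical stand-ins for chemical validity conditions. It shows a set that is far too large to list, yet easy to sample from through its membership condition.

```python
import random

ALPHABET = "CNOS"  # hypothetical building blocks, for illustration only
LENGTH = 30        # 4**30 candidate strings (more than 10**18): too many to list

def is_member(s):
    """Intensional definition: a membership condition rather than a list."""
    return len(s) == LENGTH and set(s) <= set(ALPHABET) and "OO" not in s

def sample_member(rng):
    """Draw a random member by rejection sampling, without enumerating the set."""
    while True:
        s = "".join(rng.choice(ALPHABET) for _ in range(LENGTH))
        if is_member(s):
            return s
```

Calling `sample_member(random.Random(0))` returns one member of the set; an extensional definition would instead require storing every such string.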

In drug design, a chemical space of interest is specified only implicitly by the binding affinity to a target protein site or an in silico proxy of it. Faced with such an implicit specification, medicinal chemists devise an intensional description which either covers the whole chemical space of interest or a significant part of it. Access to any such specification can be provided by a proposal generator, which randomly samples compounds from the intensional specification. In contrast to this, typical active search approaches[2–4] require an explicit list of compounds to be provided as input to the algorithm. Further, the runtime complexity of these approaches often has a linear dependence on the length of this list, so that only a small subset of the intensionally specified search space can be supplied if computational tractability is to be maintained. This hinders the ability of these algorithms to discover promising lead candidates. In contrast to this, our recently proposed approach[1] for active search can work directly and efficiently with intensional definitions. Interacting with the candidate space via a proposal generator allows the method to be computationally efficient while not artificially limiting access to a subset of the search space. In our recent work, we have evaluated the approach (presented in Section 2.1) on the space of unlabelled graphs with properties that share many characteristics with drug discovery. We continue this study here with a focused drug design problem where the goal is to find novel and effective pharmaceuticals for the treatment of a chronic lung disease.

[a] D. Oglic, T. Gärtner
School of Computer Science, University of Nottingham
Jubilee Campus, Wollaton Road, Nottingham NG8 1BB, United Kingdom
E-mail: [email protected]

[b] D. Oglic
Institut für Informatik III, Universität Bonn
Römerstraße 164, 53117 Bonn, Germany

[c] S. A. Oatley, T. Mcinally, J. D. Hirst
School of Chemistry, University of Nottingham
University Park, Nottingham NG7 2RD, United Kingdom

[d] S. J. F. Macdonald
GlaxoSmithKline, Medicines Research Centre
Gunnels Wood Road, Stevenage, Hertfordshire SG1 2NY, United Kingdom

[e] R. Garnett
Department of Computer Science and Engineering, Washington University in St. Louis
One Brookings Drive CB 1045, St. Louis, MO 63130, USA

Full Paper www.molinf.com

© 2018 Wiley-VCH Verlag GmbH & Co. KGaA, Weinheim Mol. Inf. 2018, 37, 1700130 (1 of 15)

Developing new pharmaceuticals is difficult and costly. The number of new drugs approved per billion US dollars (inflation-adjusted) spent on research and development has halved every nine years since 1950.[5] Even a modest acceleration in the process would equate to medicines reaching patients more quickly and foster significant economic benefits. Fueled by the recent developments in artificial intelligence, e.g., AI agents beating human world champions in games of chess and Go,[6] there is a resurgence of interest in applications of machine learning. For the past 15 years, drug discovery has been a popular testbed for the application of machine learning.[7–9] A key challenge is to reduce the amount of supervision needed to make useful predictions in drug discovery.[10] This can be achieved by exploiting more fully existing collections of molecules and by optimizing the design of novel compounds to be synthesized and assayed, which may ultimately lead to safer and more efficacious drugs. The growth in the number of available assays characterizing various binding affinities and potential lead compounds in drug design is a timely opportunity for more extensive development and application of machine learning algorithms.[11]

As with other instances of cyclic discovery processes, drug design can be characterized by four phases:[12] design, make, test, and analyze. Scheme 1 illustrates the cyclic discovery process characteristic to hypothesis-driven drug design. While make and test phases are inherently implemented in a lab (in vitro or in vivo), machine learning based algorithmic support is inherently focused on the design and analysis phases. In the context of ligand-based de novo drug design, numerous domain-specific approaches have been developed in the past 20 years to aid the design and analysis phases of the cyclic discovery process.[8,13] In several of these approaches,[8,13–17] hand-crafted energy-based functions are used to quantify the probability that a compound binds to a target protein site. The Metropolis-Hastings algorithm[18,19] can then be used to sample from such distributions provided that it has access to a chemical space of interest in the form of a proposal generator. The role of a proposal generator is to generate a sequence of novel compounds with a possible (but not necessary) dependence between successive structures. A simple proposal generator can, for example, be realized by sampling atoms/fragments and making random connections based on interaction properties of atoms/fragments and a protein binding site.[8,14,15] The Metropolis-Hastings algorithm allows for an expert-designed scoring function to modify the distribution of a proposal generator using an acceptance criterion that determines whether or not the currently selected design/sample should be replaced with a proposed structure.

Scheme 1. Schematic of a cyclic discovery process for hypothesis-driven de novo drug design.

Motivated by these search heuristics, we have previously proposed a Metropolis-Hastings sampling scheme,[1] where the acceptance probability is given by a probabilistic surrogate of the binding affinity to a target protein site, modelled with a maximum entropy conditional model. This is a data-driven approach that, in contrast to previous Metropolis-Hastings sampling schemes, learns and adapts the acceptance criterion as the constructive process evolves and the results of new experiments are observed. The access to compounds from an intensionally specified design space is provided via a proposal generator. As the samples from the proposal generator exhibiting the desired binding affinity are typically rare and evaluating this property is expensive, making and testing such samples would have a prohibitively high ratio of cost per discovered active structure (e.g., see Section 3.1). We have, therefore, embedded the proposal generator in a Metropolis-Hastings chain whose samples (once the chain has mixed) are the novel designs that are made and tested to observe their binding affinity to a target protein site. As the adaptation of the Metropolis-Hastings acceptance criterion using already tested designs changes the distribution of the chain in the following design step, the selected designs are no longer drawn independently nor from identical distributions. In previous work,[1] we have investigated theoretical properties of this stochastic process such as consistency and the worst case mixing bound of the independent Metropolis chain.

This study adapts our algorithm[1] to lead discovery and evaluates this adaptation on a focused drug design problem. In particular, we employ an oracle capable of approximating the binding affinity of a compound to a selected protein site and devise a proposal generator which specifies a search space of molecular structures. In designing the proposal generator, we have tried to mimic a practice in medicinal chemistry, where the process starts with a compound showing promising binding affinity to a selected protein binding site. We have applied the adapted approach to focused drug design aimed at designing pharmaceuticals that are effective against idiopathic pulmonary fibrosis (IPF). This is a chronic lung disease with an urgent need for new medicines. The disease is characterized by scar tissue which forms in the lungs with increasing severity and it is often caused by micro-injuries from tobacco smoking, inhalation of micro particles, such as wood and metal dust, or by viral infection. In the United States, about 100,000 people have IPF, and a similar number in Europe; each year approximately 35,000 new patients are diagnosed in Europe. The best current treatment, lung transplantation, is available to only 5 % of patients. The recently approved drugs, Pirfenidone and Nintedanib, slow the disease, but have side effects and do not reverse it.[20]

Integrins are relatively large proteins that act as transmembrane receptors. They link the extracellular matrix with the cytoskeleton of cells. The general structure of an integrin is a heterodimer, consisting of an α and a β subunit. The group of RGD integrins recognizes an arginine-glycine-aspartate sequence in the endogenous ligands that bind at the interface of the two subunits. For a subset of these (αvβ1, αvβ3, αvβ5, αvβ6, αvβ8) the αv subunit remains constant while the β subunit varies. The RGD integrin receptors are thought to play a key role in fibrosis (and many other diseases, including cancer) and are likely to be druggable targets.[21–23] One recent study[24] used molecular dynamics simulations to investigate the interactions between RGD-containing peptides and the αvβ3 integrin. There are a few integrin inhibitor agents on the market, e.g., Tirofiban (an αIIbβ3 RGD antagonist), and there have been substantial drug discovery efforts toward RGD integrin antagonists, many with small molecule RGD mimetics.[25] However, no small αv integrin antagonist has yet reached the market. Antagonism of αvβ6 is one promising avenue of inquiry and some success has been reported[26] in discovering compounds with significant activity against αvβ6 that have physico-chemical properties commensurate with oral bioavailability. Moreover, molecular dynamics simulations of peptides binding to αvβ6 have been conducted,[27] but there are only a few published studies on docking of small molecules to this integrin.

A series of small molecule mimetics of the RGD peptide has been synthesized and characterized[26] using a cell adhesion assay against several αvβ integrins, including αvβ6. The αvβ6 integrin binds to an RGDLXXL/I motif within the prodomains of transforming growth factor-β1 (TGF-β1) and TGF-β3.[28] The protein structure was recently solved by X-ray crystallography.[29] The binding mode of the RGD motif conformed to that observed in the crystal structure of the cyclic pentapeptide, cilengitide, bound to the αvβ3 extracellular domain.[30] In the β subunit, there is a metal ion-dependent adhesion site (abbreviated to MIDAS) which binds to an acidic residue present in all integrin ligands. There is a strong side-on bidentate hydrogen bonding interaction (guanidine carboxylate) in the subunit. The β2-β3 disulfide-bonded loop in the β1 domain appears to be an important influence on ligand specificity. A good correlation between cell adhesion and integrin binding assays (which give a truer sense of selectivity) has been demonstrated[22] with two compounds exhibiting 100 nM potency as antagonists of four of the αvβ integrins. The molecules comprise a naphthyridine (as an arginine guanidine mimetic) linked to a carboxylic acid (as a mimetic of the aspartic acid). Mono-substitution in the meta-position led to an approximately 10-fold range of activity and was more influential on αvβ6 antagonism than para-substitution. The size of the substituent appeared to be more important than its electronic properties.

In our empirical study, we focus on discovering molecular structures from a design space specified with a proposal generator in which an integrin antagonist acts as a parent compound and all the designs are obtained by altering that compound. The size of the design space is of the order of 185,000 compounds, which, although not particularly large in size, is sufficiently large for us to assess the efficiency of the algorithm and characterize its behavior. The ultimate goal of the empirical study is to investigate whether the algorithm is able to suggest novel compounds which would warrant prioritization for synthesis from the perspective of medicinal chemists. Some of the designed molecules, known from our previous work,[26] are presented in Section 3 and several novel molecular structures warranting further investigation are not disclosed here, because of potential commercial interest.

2 Computational Methods

This section describes an adaptation of our recent approach[1] for active search in structured spaces to computer-aided drug design. The approach can be characterized by three core components: i) a proposal generator that provides access to parts of the space of potentially synthesizable compounds, ii) an evaluation oracle capable of determining or approximating the binding affinity of a molecule to a target protein site, and iii) a probabilistic surrogate of the property evaluated by the oracle, modelled with a conditional density function from a family of such models provided as input to the algorithm. To tailor the approach to drug discovery, we employ a suitable evaluation oracle and devise a proposal generator which specifies a search/design space of molecular structures. As molecular structures are typically expensive to evaluate, i.e., to synthesize and assay, we employ an in silico evaluation oracle. Such oracles need to report a property representing a good approximation to a laboratory evaluation of designed compounds. We account for this by using a molecular docking program to evaluate the quality of docking of a designed molecular structure to a receptor site of interest. The docking program considers a single rigid protein structure with no water included, approximations that could be revisited in future work. Having chosen the evaluation oracle, the next challenge is to define a good proposal generator that specifies a search space containing structures that exhibit the property evaluated by the oracle (i.e., the binding affinity to a target protein site). We have in this process tried to mimic a practice in medicinal chemistry, where the process starts with a compound showing promising binding affinity to a protein site. The compound is then modified by attaching different functional groups in an attempt to achieve better docking, i.e., a more potent compound.

2.1 Active Search for Computer-aided Drug Design

Computer-aided drug design is a cyclic discovery process in which design and analyze steps are performed by computer programs, informed by medicinal chemists. The algorithms aiding this process are typically stochastic and independent instantiations of the process generate different compounds.[8] This section describes one such approach which is based on the Metropolis-Hastings algorithm and reflects on the related previous studies.

Already early attempts[14,15] to model the cyclic discovery process characteristic to drug design involved Markov chains. The stochastic process $\{x_t\}_{t \in \mathbb{N}}$ defined on a chemical space of interest is a Markov chain if for any $t \geq 1$ the conditional distribution of $x_t$ given $x_0, \ldots, x_{t-1}$ is the same as that of $x_t$ given $x_{t-1}$, i.e., $p(x_t \mid x_0, \ldots, x_{t-1}) = p(x_t \mid x_{t-1})$. After choosing the initial state $x_0$, the stochastic process is specified with a conditional density function $p(x_t \mid x_{t-1})$, also known as the transition kernel of the chain. A simulation of a Markov chain is quite similar to a practice in conventional drug design where medicinal chemists design novel compounds by starting with a molecular structure known to be moderately active and then sequentially alter their designs by replacing functional groups or fragments. A conditional density function $p(x_t \mid x_{t-1})$ guiding the design step of the discovery process needs to have specific theoretical properties[19] so that the state $x_T$ for sufficiently large $T \in \mathbb{N}$ (in practice, typically chosen in the range of 10,000 to 50,000) is an approximate sample from the distribution of molecules exhibiting a binding affinity of interest to medicinal chemists. The latter distribution of molecules covers a chemical space of interest, unknown to medicinal chemists prior to discovering sufficiently many satisfactory lead candidates (even then, the estimates can be inaccurate). In the absence of such information about the chemical space of interest, it is quite difficult to design a transition kernel which simulates that part of the space of potentially synthesizable compounds. While an approximate transition kernel based on the experience and background of medicinal chemists can produce pretty good candidates, there is no guarantee that the process will not fail to find satisfactory lead candidates.

Approaches[16,17] following such early attempts considered adapting plain Markov chains using the Metropolis-Hastings algorithm.[18] The Metropolis-Hastings algorithm is a Markov Chain Monte Carlo approach for the simulation of a probability distribution. The input to the algorithm is a target density function $p$ which needs to be specified up to a normalization constant, a proposal generator given by a transition kernel $g$, an instance $x_0$ from the state-space $\mathcal{X}$ as the initial state, and the number of Markov chain steps $n$. For a proposal generator that covers the domain of $p$, the output of the algorithm is an approximate sample from the target density function $p$. This convergence property of the chain holds subject to running the Metropolis-Hastings chain sufficiently long so that it forgets the initial state and moves away from the stationary distribution of the proposal generator to the target density function $p$. Approaches[8,15–17] based on this algorithm typically devise an energy-based scoring function (i.e., an in silico proxy of the binding affinity to a target protein site) that is computationally cheap to evaluate and run the Metropolis-Hastings sampler with that scoring function as the target density function. Such hand-crafted scoring functions are often devised using a small sample of active compounds and are usually not very good at approximating the binding affinity to a target protein site. Moreover, these scoring functions are kept static throughout the constructive process[8,13] and do not exploit the information from the evaluation of previously designed compounds.

Algorithm 1 is a pseudo-code description of the Metropolis algorithm. In the first step of the sampling process, the output variable is set to the initial state given as an input to the algorithm. Following this, the chain iterates for $n$ steps and the last accepted state is returned as a sample from the target density function $p$. Each iteration of the random process starts by using the transition kernel to sample the next candidate state conditioned on the currently accepted state of the chain. The chain then makes a transition to the sampled candidate state with an acceptance probability given by the well-established Metropolis-Hastings criterion. The number of steps/transitions until the chain is sufficiently close to the target density function is called the mixing time of the chain and is an important property for the analysis of approaches based on the Metropolis-Hastings algorithm.[19]
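The sampling loop described above can be sketched in a few lines. This is a generic textbook Metropolis-Hastings sampler written for illustration, not a transcription of Algorithm 1; the function and argument names (`log_target`, `propose`, `log_kernel`) are assumptions introduced here, and densities are handled in log space for numerical convenience.

```python
import math
import random

def metropolis_hastings(log_target, propose, log_kernel, x0, n, rng):
    """Run n Metropolis-Hastings steps and return the last accepted state.

    log_target(x): log of the target density, up to an additive constant.
    propose(x, rng): draws a candidate from the transition kernel g(x -> .).
    log_kernel(x, y): log g(x -> y), needed when the kernel is asymmetric.
    """
    x = x0  # the output variable starts at the initial state
    for _ in range(n):
        cand = propose(x, rng)
        # Log of the Metropolis-Hastings ratio for moving from x to cand.
        log_ratio = (log_target(cand) + log_kernel(cand, x)
                     - log_target(x) - log_kernel(x, cand))
        # Accept with probability min(1, ratio); otherwise keep x.
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = cand
    return x
```

As a usage example, a symmetric random walk on ten states with an unnormalized target weight of `x + 1` can be sampled by passing `lambda x: math.log(x + 1)` as `log_target`; only the ratio of target values is ever needed, which is why the normalization constant can remain unknown.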

Having reviewed the Metropolis-Hastings algorithm, we now focus on our active search approach which relies on that algorithm to sample from the posterior distribution over molecular structures conditioned on them having the desired binding affinity to a target protein site. Unlike previous approaches based on the Metropolis-Hastings algorithm, we refine our model of the binding affinity as we observe the activity of designed structures. To model the binding affinity, our algorithm takes as input an evaluation oracle (specifically a molecular docking program, described subsequently), which outputs a discrete label for a given molecular structure. The cost of evaluation is modelled by imposing a budget which limits the number of times the oracle can be accessed. Other parameters of the algorithm are the proposal generator (described below), target property, and parameters specifying a set of models from the conditional exponential family.[31,33] For this choice of a conditional model, the probabilistic surrogate for the oracle evaluations is a maximum entropy model subject to constraints on the first moments of the sample.[1,31,32]

Denote the space of candidate structures $\mathcal{X}$, the space of properties $\mathcal{Y}$, and a Hilbert space $\mathcal{H}$ with inner product $\langle \cdot, \cdot \rangle$. A parameter set $\Theta \subseteq \mathcal{H}$ together with the sufficient statistics $\phi : \mathcal{X} \times \mathcal{Y} \to \mathcal{H}$ of $y \mid x$ specifies a set of conditional exponential family models[31,33] via

$$p(y \mid x, \theta) = \exp\left(\langle \phi(x, y), \theta \rangle - A(\theta \mid x)\right),$$

where $A(\theta \mid x) = \ln \sum_{y \in \mathcal{Y}} \exp\left(\langle \phi(x, y), \theta \rangle\right)$ is the conditional log-partition function and $\theta \in \Theta$. In practice, we do not directly specify the set of parameters $\Theta$, but instead regularize the importance weighted negative log-likelihood of the sample by adding the term $\lVert \theta \rVert_{\mathcal{H}}^{2}$. The weighted negative log-likelihood minimization with such a regularization term is equivalent to weighted maximum a posteriori estimation with a Gaussian prior on the parameter vector.[33] To account for the implicit specification of the parameter set and avoid overfitting, the algorithm takes as input a hyperparameter which controls the amount of regularization and complexity of our models.
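Written out, the regularized estimation problem described above takes roughly the following form. The symbols $w_i$ (the importance weight of the $i$-th evaluated design), $\lambda$ (the regularization hyperparameter), and $\phi$ (the sufficient statistics) are notation assumed here for illustration, and any normalization by the sample size is left out.

```latex
\theta_{t+1} \in \arg\min_{\theta \in \mathcal{H}}
  \; -\sum_{i=1}^{t} w_i \ln p(y_i \mid x_i, \theta)
  \; + \; \lambda \, \lVert \theta \rVert_{\mathcal{H}}^{2},
\qquad
-\ln p(y_i \mid x_i, \theta)
  = A(\theta \mid x_i) - \langle \phi(x_i, y_i), \theta \rangle .
```

The second identity follows directly from the definition of the conditional exponential family model: taking the negative logarithm swaps the sign of the exponent.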

Scheme 2. Schematic of our approach for de novo drug design, as embodied in Algorithm 2.

Algorithm/Scheme 2 is a pseudo-code description of the approach. The constructive process is initialized by setting the parameter vector of the conditional exponential family to zero (line 1 in Algorithm 2). Thus, the first designed structure is a sample from the proposal generator (i.e., unbiased and uninformed). The algorithm then starts iterating until the oracle budget is depleted (line 2). In lines 3 to 7, the Metropolis-Hastings algorithm[18,19] is used to sample from the posterior

$$p(x \mid y^*, \theta_t) = \frac{p(y^* \mid x, \theta_t)\, p(x)}{p(y^*)},$$

where $p(y^*)$ is the marginal probability of $y^* \in \mathcal{Y}$ and $p(x)$ the stationary distribution of the proposal generator $G$ defined with a transition kernel $g$ for which the detailed balance condition holds.[19] To obtain samples from the posterior $p(x \mid y^*, \theta_t)$, the Metropolis-Hastings acceptance criterion is, thus, given by

$$\frac{p(y^* \mid x', \theta_t)}{p(y^* \mid x_t, \theta_t)} \cdot \frac{p(x')\, g(x' \to x_t)}{p(x_t)\, g(x_t \to x')} = \frac{p(y^* \mid x', \theta_t)}{p(y^* \mid x_t, \theta_t)},$$

where $x'$ is the proposed candidate, $x_t$ is the last accepted state, $\theta_t$ is the parameter vector of the conditional exponential family model, and $g(x_t \to x')$ denotes the probability of the transition from state $x_t$ to state $x'$. After the Metropolis-Hastings chain has mixed (line 7), the algorithm outputs its last accepted state $x_t$ as a candidate structure and presents it to the oracle (line 8), which evaluates it, providing feedback $y_t$ to the algorithm. The labelled pair $(x_t, y_t)$ is added to the training sample and an importance weight is assigned to it (line 8). The importance weighting is necessary to ensure the consistency of the algorithm, because the samples are, in general, neither independent nor identically distributed. Finally, the conditional exponential family model is updated (line 9) by optimizing the weighted negative log-likelihood of the sample, regularized with a Gaussian prior on the parameter vector. The updated model is used by the algorithm to sample a candidate structure in the next iteration.
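The simplification in the acceptance criterion above follows in one step from detailed balance. A short derivation, writing $x'$ for the proposed candidate and $\theta_t$ for the current parameter vector:

```latex
% Detailed balance of the kernel g with respect to its stationary distribution p(x):
p(x_t)\, g(x_t \to x') = p(x')\, g(x' \to x_t)
\;\;\Longrightarrow\;\;
\frac{p(x')\, g(x' \to x_t)}{p(x_t)\, g(x_t \to x')} = 1 .
% The marginal p(y^*) cancels in the ratio of posteriors, so only the surrogate remains:
\frac{p(x' \mid y^*, \theta_t)\, g(x' \to x_t)}{p(x_t \mid y^*, \theta_t)\, g(x_t \to x')}
= \frac{p(y^* \mid x', \theta_t)}{p(y^* \mid x_t, \theta_t)} .
```

In practical terms, neither the stationary distribution of the proposal generator nor its transition probabilities need to be evaluated; the chain only queries the surrogate model at the two states.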

In our previous theoretical study,[1] we have shown that the described random process is consistent under fairly standard assumptions: With high probability, the algorithm will, after at most polynomially many oracle queries, sample from the model given by a parameter that is arbitrarily close to the best parameter from a set of parameters specifying a family of conditional exponential family models. This theoretical result is important because it quantifies the amount of exploration performed by the algorithm. In particular, the consistency of the cyclic discovery process is a guarantee that the approach will eventually be able to discover structures from the whole search space. In addition to that result, the theoretical study of the approach provides a worst case bound on the mixing time of the independent Metropolis-Hastings chain for sampling from the posterior distribution $p(x \mid y^*, \theta_t)$. That bound is important, because it provides a worst case estimate on the mixing time of the Metropolis-Hastings chain described in lines 4–7 of Algorithm 2. In other words, we can select the number of steps in the Metropolis-Hastings algorithm such that the stationary distribution of the chain is arbitrarily close to the target density function $p(x \mid y^*, \theta_t)$.
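The overall loop described in this section can be sketched as follows. This is a toy illustration under several assumptions, not the authors' implementation: structures are mapped to small explicit feature vectors by a hypothetical `featurize` function, the property space is binary with the statistic taken as `y * featurize(x)`, the proposal generator is an independent sampler, the model is fitted by plain gradient descent, and the importance weight is a placeholder of 1.0 because its formula is not given in this section.

```python
import math
import random

def surrogate(theta, feats, y=1):
    """p(y | x, theta) for Y = {-1, +1}, assuming phi(x, y) = y * feats."""
    z = 2.0 * y * sum(t * f for t, f in zip(theta, feats))
    z = max(-500.0, min(500.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))

def fit(sample, dim, reg=0.1, steps=200, lr=0.1):
    """Gradient descent on mean weighted negative log-likelihood + reg * ||theta||^2."""
    theta = [0.0] * dim
    for _ in range(steps):
        grad = [2.0 * reg * t for t in theta]
        for feats, y, w in sample:
            p = surrogate(theta, feats, y)
            for j, f in enumerate(feats):
                # Gradient of -ln p(y | x, theta) is -2 * y * (1 - p) * feats.
                grad[j] -= w * 2.0 * y * (1.0 - p) * f / len(sample)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

def active_search(propose, featurize, oracle, budget, chain_steps, dim, rng):
    """Alternate between sampling a design from the chain and refitting the model."""
    theta = [0.0] * dim          # zero parameters: the first design is uninformed
    sample, designs = [], []
    x = propose(rng)
    for _ in range(budget):      # iterate until the oracle budget is spent
        for _ in range(chain_steps):
            cand = propose(rng)  # independent proposal
            ratio = (surrogate(theta, featurize(cand))
                     / surrogate(theta, featurize(x)))
            if rng.random() < ratio:  # accept with probability min(1, ratio)
                x = cand
        y = oracle(x)            # one oracle evaluation per outer iteration
        weight = 1.0             # placeholder importance weight (assumption)
        sample.append((featurize(x), y, weight))
        designs.append((x, y))
        theta = fit(sample, dim)
    return designs, theta
```

With `propose = lambda r: r.randrange(100)`, `featurize = lambda x: (1.0, x / 100.0)` and `oracle = lambda x: 1 if x >= 80 else -1`, the loop queries the oracle exactly `budget` times and returns the evaluated designs together with the final parameter vector.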

2.2 Molecular Representation

We represent molecular structures as vertex labelled graphs and use the Weisfeiler-Lehman graph kernel[34] to embed these structures to a reproducing kernel Hilbert space (Figure 1 provides an illustration of the embedding). This graph kernel has been shown to be highly expressive on prediction tasks involving molecules[34] and it is related to the molecular fingerprint called ECFP.[35] In contrast to the Weisfeiler-Lehman graph kernel, that fingerprint: i) ignores hydrogen atoms, ii) uses binary features instead of counts to express occurrences of vertex centered subtree patterns in molecular graphs, and iii) incorporates more information than atomic number into the initial labels.
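A minimal sketch of Weisfeiler-Lehman style feature counting on a vertex labelled graph is given below. It is an illustration rather than the implementation used in the study: labels are kept as readable strings instead of being compressed, bond types are ignored, hydrogens are kept as explicit vertices, and the function names and the graph encoding are assumptions introduced here.

```python
from collections import Counter

def wl_features(labels, edges, iterations):
    """Count Weisfeiler-Lehman vertex labels over `iterations` refinement rounds.

    labels: list of initial vertex labels (here: element symbols).
    edges: list of undirected edges (i, j) between vertex indices.
    """
    neighbours = [[] for _ in labels]
    for i, j in edges:
        neighbours[i].append(j)
        neighbours[j].append(i)
    counts = Counter(labels)  # round 0: plain vertex labels
    current = list(labels)
    for _ in range(iterations):
        # New label = own label plus the sorted multiset of neighbour labels.
        current = [
            current[v] + "(" + ",".join(sorted(current[u] for u in neighbours[v])) + ")"
            for v in range(len(labels))
        ]
        counts.update(current)
    return counts

def wl_kernel(a, b):
    """Inner product of two feature-count vectors."""
    return sum(n * b[key] for key, n in a.items())

# Methanol (CH3OH) and water (H2O) with explicit hydrogens.
methanol = wl_features(["C", "O", "H", "H", "H", "H"],
                       [(0, 1), (0, 2), (0, 3), (0, 4), (1, 5)], iterations=1)
water = wl_features(["O", "H", "H"], [(0, 1), (0, 2)], iterations=1)
```

For methanol, one refinement round yields the pattern `H(C)` three times and `H(O)` once, in addition to the plain atom counts; the kernel value between two molecules is the dot product of such count vectors.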

Figure 1. An illustration of the Weisfeiler-Lehman transformation for generating feature vectors from molecules. (a) An example molecular structure (Methyl anthranilate). (b) An undirected vertex labelled graph is formed from the molecule (notice that bond type is ignored). (c) Features are generated according to the appearance frequency of subtree patterns rooted at any of the vertices from the labelled graph. The height of any such subtree pattern needs to be less than a pre-specified parameter value (e.g., all subtree patterns of height less than 10). In this feature vector, additional components would be present for other subtree patterns such as P (here 0), N: CHH (here 1), C: COO (here 1), H: CHN (here 2) etc.

The Weisfeiler-Lehman graph kernel, depending on the structural diversity of the compounds, can embed molecular structures to a high dimensional space where due to a large number of parameters it is computationally expensive to solve the optimization problem in line 9 (Algorithm 2). We avoid such inefficient parametrization of the model by solving a kernelized version of the problem, where the number of parameters is given by the number of designs evaluated by the oracle. More specifically, the optimization problem in line 9 of Algorithm 2 is convex in $\theta$ and the representer theorem[36] guarantees that it is possible to express an optimal solution as a linear combination of sufficient statistics, i.e.,

$$\theta_{t+1} = \sum_{i=1}^{t} \sum_{c \in \mathcal{Y}} \alpha_{ic}\, \phi(x_i, c),$$

with $\alpha_{ic} \in \mathbb{R}$. For a separable sufficient statistic, $\phi(x, y) = \phi_1(x)\, \phi_2(y)$, we can then rewrite the conditional exponential family models using the kernels defined on the space of structures and the space of properties, respectively. More formally, we have that the probability of observing a property $y \in \mathcal{Y}$ for a given structure $x \in \mathcal{X}$ is given by

$$p(y \mid x, \alpha) = \frac{\exp\left(\sum_{i=1}^{t} \sum_{c \in \mathcal{Y}} \alpha_{ic}\, k(x, x_i)\, h(y, c)\right)}{\sum_{c \in \mathcal{Y}} \exp\left(\sum_{i=1}^{t} \sum_{c' \in \mathcal{Y}} \alpha_{ic'}\, k(x, x_i)\, h(c, c')\right)}.$$

In the latter equation, $k(x, x_i)$ denotes the value of the Weisfeiler-Lehman graph kernel for vertex labelled graphs $x$ and $x_i$ representing the corresponding molecules and $h(y, c)$ denotes a kernel function expressing similarity between properties $y$ and $c$. For binary property spaces, $\mathcal{Y} = \{-1, 1\}$, we set $h(y, c) = 1$ if $y = c$ and otherwise $h(y, c) = -1$.
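The kernelized model above can be evaluated directly once the coefficients are known. The following sketch assumes a binary property space and takes the structure kernel `k` as an argument, so any kernel can be plugged in; the data layout (`alpha[i][c]` as a list of dictionaries) and the function names are choices made here for illustration.

```python
import math

def h(y, c):
    """Property kernel for Y = {-1, +1}: 1 if the labels agree, otherwise -1."""
    return 1.0 if y == c else -1.0

def predict(y, x, alpha, designs, k, labels=(-1, 1)):
    """p(y | x, alpha) for the kernelized conditional exponential family model.

    alpha[i][c] is the coefficient of evaluated design i and label c;
    k(x, xi) is a kernel on structures.
    """
    def score(label):
        return sum(alpha[i][c] * k(x, xi) * h(label, c)
                   for i, xi in enumerate(designs) for c in labels)
    scores = {c: score(c) for c in labels}
    top = max(scores.values())  # subtract the maximum for numerical stability
    norm = sum(math.exp(s - top) for s in scores.values())
    return math.exp(scores[y] - top) / norm
```

With all coefficients set to zero the model returns 0.5 for either label, which matches the uninformed starting point of the search; the number of coefficients grows with the number of evaluated designs rather than with the dimension of the feature space.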

2.3 Proposal Generator

A typical approach to candidate generation in drug design is to make alterations to a parent compound having a moderate binding affinity to a target protein site. Changes to the parent compound can include modifications to its functional groups and attachment of different fragments in the place of hydrogen atoms or small fragments contained in the parent molecule. Motivated by these approaches, we develop a proposal generator for finding candidate molecules of which some are likely to dock well to the αvβ6 receptor site. In particular, we start with an integrin antagonist compound as the parent and consider substitutions at five possible points on the aryl ring (Figure 2), which, based on structure-activity relationships, are known to profoundly influence potency and selectivity.[26]

Based on our own data and the integrin medicinal chemistry literature, a variety of possible substituents were considered: H, F, Cl, Br, methyl, ethyl, propyl, iso-propyl, cyclopropyl, methoxy, hydroxyl, CF3, OCF3, SO2Me, nitrile and several heterocycles, imidazole, pyrazole and triazole (with possible substituents of H, methyl or ethyl). After a preliminary calculation, we elected to impose a couple of restraints on the molecules that could be generated, so that they would be more drug-like and more amenable to synthesis. Thus, catechols (where there are neighboring hydroxyls on the aryl ring) were precluded, as they are prone to autoxidation and, therefore, difficult to work with experimentally. The total number of hydrogen bond donors that could be present in a molecule was capped at five. A maximum of 500 was set for the molecular weight. Clearly, more of Lipinski's rules[37] or other in silico restrictions (polar surface area, for example) could be readily implemented, but the above restraints proved to be sufficient.

Algorithm 3 is a pseudo-code description of the proposal generator. The algorithm takes as input a parent compound, together with a set of fragments and a set of attachment points onto which the fragments can be substituted instead of hydrogen atoms. As described above, the Lipinski constraints are enforced with a maximum allowed molecular weight and a maximum allowed number of hydrogen bond donors in the resulting compound. The sampling process is initialized by setting the total molecular mass to that of the parent compound and the total attached fragment mass to zero. Also, the set of attachment points of fragments with hydrogen bond donors is initialized with the empty set. The algorithm then iterates until: i) there are no available attachment points, ii) there are no feasible fragments to be substituted at the available attachment points, or iii) a random interruption event occurs, which is defined to happen with probability given by the ratio of the attached molecular mass and the total available attachment mass. This probabilistic constraint is introduced so that it is more likely to sample lighter compounds. The alterations to the parent compound are achieved through two steps: i) sampling uniformly at random an attachment point from the available ones, and ii) sampling a fragment uniformly at random from the set of feasible fragments for the sampled attachment point. For the constraints that were applied, there are approximately 185,000 different compounds that define our search space.
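The sampling loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than a transcription of Algorithm 3: the parent mass, the donor count of the parent, the reduced fragment list, the ring adjacency and all function names are hypothetical, and the "total available attachment mass" is assumed to be the gap between the molecular weight cap and the parent mass. A position that is never sampled simply keeps its hydrogen atom.

```python
import random

H_MASS = 1.008  # mass of the hydrogen atom that a fragment replaces

# Illustrative inputs (assumptions, not the values used in the study).
PARENT_MASS, PARENT_DONORS = 380.0, 2
FRAGMENTS = {  # name -> (approximate mass, hydrogen bond donors)
    "F": (18.998, 0), "Cl": (35.45, 0), "Me": (15.035, 0),
    "OH": (17.007, 1), "CF3": (69.006, 0),
}
POINTS = [2, 3, 4, 5, 6]  # substitution positions on the aryl ring
NEIGHBORS = {2: {3}, 3: {2, 4}, 4: {3, 5}, 5: {4, 6}, 6: {5}}


def feasible_fragments(point, chosen, attached_mass, donors, budget, max_donors):
    """Fragments that keep the compound within all constraints at `point`."""
    result = []
    for name, (mass, n_donors) in FRAGMENTS.items():
        if attached_mass + mass - H_MASS > budget:
            continue  # molecular weight cap
        if donors + n_donors > max_donors:
            continue  # hydrogen bond donor cap
        if name == "OH" and any(chosen.get(q) == "OH" for q in NEIGHBORS[point]):
            continue  # no catechols (hydroxyls on neighboring positions)
        result.append(name)
    return result


def propose(rng, max_mass=500.0, max_donors=5):
    """Sample one compound as a dict {attachment point: fragment name}."""
    budget = max_mass - PARENT_MASS  # assumed "total available attachment mass"
    attached_mass, donors, chosen = 0.0, PARENT_DONORS, {}
    while budget > 0:
        # Random interruption: the more mass attached, the likelier a stop.
        if rng.random() < attached_mass / budget:
            break
        options = {}
        for point in POINTS:
            if point not in chosen:
                found = feasible_fragments(point, chosen, attached_mass,
                                           donors, budget, max_donors)
                if found:
                    options[point] = found
        if not options:
            break  # no attachment point with a feasible fragment remains
        point = rng.choice(sorted(options))  # uniform over available points
        name = rng.choice(options[point])    # uniform over feasible fragments
        chosen[point] = name
        attached_mass += FRAGMENTS[name][0] - H_MASS
        donors += FRAGMENTS[name][1]
    return chosen


compound = propose(random.Random(0))
```

Because the interruption probability grows with the attached mass, heavily substituted compounds are sampled less often than lightly substituted ones, which is the bias toward lighter compounds described in the text.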

Figure 2. The parent compound considered in this study; green circles denote points where substituents could be attached.




2.4 Docking: Evaluation Oracle

Whilst it is recognized that there are many conformational changes of αv integrins during their activation and signaling, we have elected to base our modelling on a published crystal structure. The αvβ6 structure[29] was taken from the Protein Data Bank (PDB code: 4um9). The zwitterionic forms of the ligands were considered, with the negatively charged carboxylate moiety at one end (coordinating to a metal in the MIDAS site) and the naphthyridine protonated (having a pKa of ~7.8), making the aromatic nitrogen atom positively charged.[38] This is important for a bidentate hydrogen bond interaction with Asp218. Molecular docking was performed using OpenEye FRED,[39] which uses a rigid ligand approach, where a large number of conformations are generated and each of those is docked successively. The chemgauss3 scoring function was used in an initial docking, and the highest scoring positions were evaluated using the more sophisticated (and more computationally expensive) chemgauss4 scoring function, which includes improved terms for ligand-receptor hydrogen bonds and metal-chelator interactions. The latter is particularly pertinent, considering the importance of binding with the divalent metal cations within the active site.[40] Both enantiomers were sampled separately, while individual conformers were deemed identical (and removed) if the root mean squared difference (RMSD) in the atomic positions was less than 0.5 Å. A maximum of 10,000 conformers was allowed per enantiomer and typically there were between 2,500 and 5,000 conformers per enantiomer. To allow more extended sampling of the conformational space, a truncated form of the MMFF94s forcefield[41] was used to calculate individual conformer energy and the maximum range between the global minimum and any conformer was limited to 20 kcal mol⁻¹. This truncated form of the forcefield excludes both Coulomb and the attractive part of van der Waals interactions. The binding box used for the docking was centered on the Thr221 residue, in the middle of the active site. The edges were extended past important features, namely Asp218 and the Mg²⁺ ion, such that the final size was 27.0 by 29.7 by 21.3 Å, giving a total search volume of 17,010 Å³. The binding site was further restricted by enforcing an interaction with the divalent metal cation as well as a hydrogen bond with the Asp218 residue; a single hydrogen bond was required in order not to restrict the search space unduly. The grid point spacing was 1 Å with a second pass grid point spacing of 0.5 Å.
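The duplicate-conformer criterion can be illustrated with a short sketch. It assumes that conformers are given as lists of (x, y, z) coordinates in a common atom order and that no superposition is performed before the comparison; the conformer generator used in the study may handle both points differently, and the function names are hypothetical.

```python
import math


def rmsd(a, b):
    """Root mean squared difference between two conformers.

    Each conformer is a list of (x, y, z) atom positions in the same atom
    order; no superposition is attempted in this sketch.
    """
    total = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(total / len(a))


def prune_conformers(conformers, threshold=0.5):
    """Keep a conformer only if its RMSD to every kept one is >= threshold."""
    kept = []
    for conf in conformers:
        if all(rmsd(conf, other) >= threshold for other in kept):
            kept.append(conf)
    return kept


# Toy two-atom conformers (coordinates in angstroms, made up for illustration).
first = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
near_copy = [(0.1, 0.0, 0.0), (1.1, 0.0, 0.0)]   # RMSD of about 0.1 to `first`
distinct = [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]    # RMSD of 1.0 to `first`
unique = prune_conformers([first, near_copy, distinct])
```

Lowering the threshold keeps more conformers and therefore yields a denser, but more expensive, conformational search.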

The docking program takes as input a molecular structure and outputs a real-valued score that quantifies the quality of the docking of that molecule to the αvβ6 receptor site. To obtain a discrete/binary label for the activity of a designed structure, we threshold the docking score. To determine a suitable threshold value, we have performed a series of preliminary experiments and defined a binary labelling oracle that assigns label 1 to molecular structures with a docking score below −11.75. The average time required by the oracle for the evaluation of a single structure is approximately 20 minutes. Thus, given the size of our search space, if one were to label all the structures with the docking program, it would take over 6.5 years of CPU time on a single processor or 25 days on 100 processors in parallel.
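The labelling rule and the cost estimate can be checked with a few lines of arithmetic. In this sketch the label −1 for non-hits follows the binary property space used earlier, perfect parallel scaling is assumed, and the function and constant names are hypothetical.

```python
THRESHOLD = -11.75           # docking score cut-off from the text
MINUTES_PER_DOCKING = 20     # approximate oracle cost per structure
SEARCH_SPACE_SIZE = 185_000  # approximate number of compounds


def label(docking_score, threshold=THRESHOLD):
    """Binary oracle: 1 for a hit (score below the threshold), otherwise -1."""
    return 1 if docking_score < threshold else -1


def exhaustive_cost_days(n_structures, minutes_each, n_processors=1):
    """Wall-clock days to dock every structure, assuming perfect scaling."""
    return n_structures * minutes_each / n_processors / (60 * 24)


single_cpu_years = exhaustive_cost_days(SEARCH_SPACE_SIZE, MINUTES_PER_DOCKING) / 365.25
days_on_100 = exhaustive_cost_days(SEARCH_SPACE_SIZE, MINUTES_PER_DOCKING, 100)
```

With the approximate figures above, the estimate comes to roughly seven CPU-years on a single processor and a little under 26 days on 100 processors, in line with the magnitudes quoted in the text.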

3 Results

In this section, we present our findings and analyze the results of our simulations. We assess the performance of the algorithm from several different perspectives. First, we confirm that the approach represents a significant improvement over plain Monte Carlo search performed with the proposal generator. Following this, we quantify the learning rate of our approach by measuring how much more likely the approach is to generate desired molecular structures compared to the proposal generator as a function of the budget expended. Having assessed the algorithmic performance of the approach, we proceed to analyze the designed molecules from the perspective of medicinal chemistry. In particular, we discuss some of the designed molecular structures in the context of compounds[26] already reported in the literature.

Before we proceed with the analysis, let us describe our experimental setup. We have performed five independent simulations of our approach, in each case running the algorithm for 50 rounds. In each round, the algorithm takes 10 independent and identically distributed samples from the posterior distribution of molecules by running 10 Metropolis-Hastings chains in parallel (note that samples from different rounds are dependent). The Metropolis-Hastings sampling of the posterior distribution was performed with a burn-in sample of 50,000 proposals. To allow for models of different complexity, we have estimated the conditional exponential family regularization parameter in each round using five-fold stratified cross-validation.
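One round of sampling can be sketched as follows. The sketch assumes an independence Metropolis-Hastings sampler in which candidates are drawn from the proposal generator and the target is proportional to the proposal density times the model probability of the target property, so that the acceptance ratio reduces to a ratio of model probabilities; this acceptance rule, the toy structure space and all names are assumptions made for illustration, and the burn-in is shortened from the 50,000 proposals used in the study.

```python
import random


def metropolis_hastings(propose, target_prob, n_burn_in, rng):
    """Independence Metropolis-Hastings chain; returns the state after burn-in.

    propose(rng) draws a structure from the proposal generator and
    target_prob(x) is the model probability that x has the target property
    (assumed strictly positive). Because candidates do not depend on the
    current state, the proposal densities cancel in the acceptance ratio.
    """
    current = propose(rng)
    for _ in range(n_burn_in):
        candidate = propose(rng)
        ratio = target_prob(candidate) / target_prob(current)
        if rng.random() < min(1.0, ratio):
            current = candidate
    return current


# Toy stand-ins (assumptions): structures are the integers 0..9, the proposal
# generator is uniform, and the model favors structures 8 and 9.
def toy_propose(rng):
    return rng.randrange(10)


def toy_target_prob(x):
    return 0.9 if x >= 8 else 0.1


# One round: 10 chains with independent seeds, one sample taken per chain.
round_samples = [
    metropolis_hastings(toy_propose, toy_target_prob, 1_000, random.Random(seed))
    for seed in range(10)
]
```

Running separate chains is what makes the samples within a round independent of each other; samples from different rounds remain dependent because the model is refitted on the labels collected so far.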

3.1 Performance

We assess the performance of our approach using the correct-construction curve that shows the cumulative number of discovered molecules exhibiting the target property, i.e., a docking score lower than −11.75, as a function of the budget expended. To quantify the improvement of our approach over plain Monte Carlo search performed with the proposal generator, we measure the lift of the correct-construction curve, given by the ratio between the expected number of hits generated by our approach and the expected number of hits observed in a sample of the same size from the distribution of a proposal generator.

Figure 3a, which shows correct-construction curves for our approach (blue curve) and the described proposal generator (red curve), confirms that our approach generates more hits than Monte Carlo search with the proposal generator. Moreover, the correct-construction curve of the proposal generator is, apart from a few initial rounds, always below the lower endpoint of the confidence interval for the curve of our approach. The lift of the correct-construction curve for our approach (shown in Figure 3b) indicates that the approach is approximately 2.8 times more likely to output a hit than the proposal generator after 50 rounds of model calibration.
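Both quantities are simple to compute from a sequence of oracle labels. The sketch below works on a single run per method, whereas the lift in the text is defined through expected hit counts (in practice, averages over repeated simulations); the label sequences are made up for illustration and the function names are hypothetical.

```python
from itertools import accumulate


def correct_construction_curve(labels):
    """Cumulative number of hits after each oracle evaluation (labels are 1 or -1)."""
    return list(accumulate(1 if y == 1 else 0 for y in labels))


def lift(hits_approach, hits_baseline):
    """Ratio of hit counts at equal budget; None while the baseline has no hits."""
    return [a / b if b > 0 else None for a, b in zip(hits_approach, hits_baseline)]


# Made-up label sequences for six oracle evaluations per method.
approach = correct_construction_curve([1, -1, 1, 1, -1, 1])
baseline = correct_construction_curve([-1, 1, -1, -1, 1, -1])
lift_curve = lift(approach, baseline)
```

A lift above one at a given budget means that the approach has produced more hits than the baseline for the same number of oracle evaluations.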

3.2 Illustrative Docked Molecule

Figure 4 shows an illustrative example of a docked complex. The S enantiomer consistently received a better (more negative) docking score than the corresponding R enantiomer (for the compound shown, the docking score for the S enantiomer was −11.9 compared to the score of −10.3 for the R enantiomer), in line with the previous experimental findings.[26] In each of the five simulations, the 3,4 substitution pattern appeared between 50% and 90% of the time in the top 20 hits, a preference which is in accord with the trend observed in earlier work.[26] Chelating interactions between the carboxyl group and Mg²⁺ (with distances of 2.6 and 3.8 Å) are present, as is the required hydrogen bond (distance 2.3 Å) from the naphthyridine fragment to Asp218 on the alpha subunit. The carboxyl group acts as a hydrogen bond acceptor for the backbone of nearby residues (Ala126, Ser127 and Asn218 on the beta subunit), and another hydrogen bond is present between the carboxyl and the Ser127 sidechain hydroxyl group. There is also a hydrogen bond between the ligand amide oxygen atom and the sidechain of residue Thr221 on the beta subunit.

Figure 3. Panel (a) shows the cumulative number of 'hits' as a function of the budget expended. The blue curve is the correct-construction curve of our approach (with corresponding confidence interval colored in light blue) and the red curve is the correct-construction curve of the proposal generator. Panel (b) indicates how much more likely it is to see a hit compared to a standard Monte Carlo search performed with the proposal generator.




3.3 Experimental Validation

The experimental knowledge provided to the algorithm was limited to the X-ray crystal structure of the receptor and some basic constraints on the mode of binding applied to the molecular docking. The parent compound and the possible substituents were informed, in a broad sense, by expert knowledge from medicinal chemistry, but no explicit data on the experimental activities of any compounds were used. Previous work[26] presented the synthesis and experimental assay (reported as pIC50 values, i.e., the negative logarithm of the concentration required for 50% inhibition) of 30 derivatives of the parent compound shown in Figure 2. A pIC50 = 6.0 corresponds to a 1 μM potency, at which a compound might be considered active or worth further investigation. From Table 1, we can see that the parent compound has a pIC50 of 5.7 and would, therefore, be considered inactive. Encouragingly, 19 out of the 26 reported active compounds were found by the algorithm. Several of the compounds reported as active were not found, but two of these were not discoverable by the algorithm, because the substitution pattern (forming new ring structures) was not part of the proposal generator. There were two compounds for which the docking score was not sufficiently low, which indicates that there is an opportunity to improve the docking protocol. Specifically, we will investigate reducing the threshold of 0.5 Å RMSD used for defining different conformers, which should lead to a more thorough conformational search during the docking. In total, 20 of the 30 compounds from previous work[26] were sampled.
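The relation between pIC50 and concentration is a base-10 logarithm of the molar concentration, which the following short sketch makes explicit. The activity cut-off of 6.0 is the one mentioned in the text; the function names are hypothetical.

```python
import math


def pic50_to_molar(pic50):
    """Concentration (mol/L) required for 50% inhibition."""
    return 10.0 ** (-pic50)


def molar_to_pic50(ic50_molar):
    """Negative base-10 logarithm of the IC50 given in mol/L."""
    return -math.log10(ic50_molar)


def is_active(pic50, cutoff=6.0):
    """Activity call at the 1 micromolar level used in the text."""
    return pic50 >= cutoff


parent_ic50_micromolar = pic50_to_molar(5.7) * 1e6  # parent compound, pIC50 = 5.7
```

Under this convention the parent compound (pIC50 = 5.7) has an IC50 of about 2 μM and falls just short of the cut-off, whereas every derivative with pIC50 of 6.0 or more counts as active.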

3.4 Medicinal Chemistry Perspective

The algorithm described is a "proof of principle" and still requires considerable refinement. Nonetheless, from a medicinal chemistry and drug discovery perspective, the molecules suggested for synthesis are promising for several reasons. First, many of the molecules suggested align with the structure activity relationships (both published[42,43] and unpublished) which were not part of the input to the algorithm, either in terms of parameters or design. For example, the algorithm predominantly suggests substituents at the meta position, which indeed appears to be crucial for αvβ6 activity. Suggested substituents also often feature heterocycles which are known to deliver αvβ6 activity.[42,43] Secondly, most of the molecules are drug-like: that is, they resemble both the structures and physicochemical properties of oral drugs. Thirdly, and particularly promising, is the speed at which new molecules can be evaluated computationally, allowing several iterations to be easily carried out to improve the design quality of the molecules (as detailed earlier). Moving forward, it will be straightforward to incorporate additional constraints, such as scoring molecules against αvβ3 (which should improve the selectivity window for αvβ6 over αvβ3), polar surface area cut-offs (which correlate with several important drug-like properties) and simple synthetic chemistry considerations.

4 Discussion and Related Work

In this section, we place our work in the context of approaches closely related to ours and discuss some directions for further development of the described cyclic discovery process.

Figure 4. An illustrative model of a docked complex, featuring the 3-CN derivative (compound 32 in Table 1), with a pIC50 of 6.6. To aid visualization, the Asp218 has been shown as stick and the metal ion as CPK. The image was generated using OpenEye Vida.

Table 1. Comparison of hits identified by the approach with compounds that have been experimentally assayed.

No.[a]  Substituents      Docking score  pIC50  No. times found by algorithm

Compounds independently identified as hits
4       3-F               −12.16         6.1    25
22      3-MeO             −11.79         6.5    8
25      4-Me              −11.96         6.1    7
32      3-CN              −11.94         6.6    23

Hits not found by the algorithm
31      4-Ph              −11.92         6.4    –
38      3,4-Me2           −12.18         6.7    0
39      3,4,–CH2CH2CH2    −11.93         6.8    –

Compounds with pIC50 ≥ 7.0, but with a docking score above the threshold (−11.75)
33      3-CF3             −11.54         7.0    10
43      3-CF3-4-Cl        −11.39         7.0    1

Parent compound
15      H                 −10.26         5.7    0

[a] Taken from reference [26].




4.1 Learning Task

Active learning is broadly defined as the learning setting in which a learning algorithm is allowed to select instances from an instance space and ask for the properties of any of these objects.[44,45] The goal of active learning is to generate an accurate hypothesis with as few such queries as possible. This learning setting is different from the standard passive model of supervised learning where the algorithm receives all labels and instances at once. The typical success measure for active learning is the quality of the found hypotheses. In drug design and several other applications of active learning the goal is, however, to discover objects of high quality and the algorithms should, therefore, be rewarded for the quality of the discovered objects, rather than the quality of the formed hypotheses. Several extensions of active learning have been developed to address this disparity, active search and active optimization being the most prominent ones. Active search[2] is a variant of active classification where the goal is to discover as many objects as possible from an unknown property class. Active optimization[46] focuses on finding a single high-quality item from an instance space rather than a (diverse) set of objects exhibiting a desired property. Active optimization has been investigated extensively in the context of learning with real-valued and binary feedback.[46,47] The former case[46] can often be cast as global optimization of a black-box function that is expensive to evaluate. The latter case[47] is a variant of active classification where the goal is to discover an item with the highest conditional probability of being a target. Thus, while active search does not discriminate between instances with desired properties and focuses on discovering such objects from the whole search space, active optimization with binary feedback focuses on potentially small regions of the design space consisting of objects with high conditional probability of being a target.

In earlier work,[3] we have investigated an active search approach for ranking a collection of compounds according to a hypothesized contribution to the total number of discovered actives, given a fixed budget of evaluations. While the approach can, in theory, generate an optimal ranking of a collection of compounds, it is computationally intractable to evaluate such ranking functions. To address this computational shortcoming, our earlier work provides an effective sequential approximation scheme.[2,3] The scheme selects a subset of molecular structures by iteratively ranking the available collection of molecules according to a hypothesized contribution to the total number of discovered structures with desired properties and incorporating the evaluation feedback into the next discovery step. Active search was assessed on various databases of molecules with different assigned activity classes.

In the past 15 years,[9,46,48–50] active machine learning research has been applied to various drug discovery problems. Apart from a few heuristic approaches, these applications only consider explicitly enumerated pools of molecules (i.e., extensionally specified design/search space). Intensional descriptions have the potential to outperform extensional ones because an algorithm requiring an extensional description can only consider small and often arbitrary subsets of an exponentially large search space of interest.[13] Moreover, approaches with an extensional search space are not tailored for discovery of novel active compounds that have not been synthesized previously and are not protected by intellectual property rights. Thus, the ability to generate molecules from the whole chemical space of interest is a key distinction between the investigated problem setting and previous active learning and active search approaches. Active optimization approaches typically rely on the Euclidean geometry in problems with intensionally specified instance spaces. In particular, a recent approach[51] represents molecules with SMILES[52] strings and embeds these to a vector space where active optimization is performed. However, an optimal preimage (i.e., SMILES string mapping to a particular vector) of the vector selected by an active optimization method does not necessarily correspond to a feasible SMILES string or a valid molecule. While it is possible to reduce the number of infeasible SMILES strings during decoding with the help of a context free grammar,[53] the problem of finding optimal preimages is still hard in the context of intensionally specified structured spaces.[54] However, several effective approaches have been proposed for finding preimages in special cases such as strings.[55,56]

Having reviewed different variants of active learning, we now provide an insight indicating that active search could indeed be the most appropriate variant for lead search in drug discovery. From Table 1, we can observe that while the employed docking program represents a good approximation to the actual activity level, there is still a considerable amount of unreliability in the computed docking score. In particular, two compounds[26] that are known to be active (with pIC50 = 7.0) are assigned worse docking scores than several other compounds with lower activity level (pIC50 < 7.0), known from the literature. Thus, if one were to use active optimization with the docking score as the black-box objective function, the unreliability in the computed score (a result of using a misspecified in silico oracle) could potentially focus the search on a region of the chemical space with suboptimal activity level (e.g., low pIC50 value measured in biological assays). In contrast to active optimization with real-valued feedback, active search assumes binary feedback and, as our empirical evaluation indicates, can be more robust to the unreliability inherent to in silico oracles.

4.2 De Novo Design

In drug discovery, de novo design refers to a family of approaches for finding novel molecules with desired properties from an intensionally specified chemical space of interest.[8,13] An algorithm from this family can be characterized by three core components: i) (adaptive) proposal generator, ii) scoring function, and iii) adaptation scheme that adapts/shrinks the search space by modifying the proposal generator using scores assigned to the previously generated compounds. Any de novo design approach is assessed by the quality of the designed compounds, which depends on the ability of the algorithm to cope with the combinatorial complexity of the search space.[8] This ability and, thus, the outcome of any de novo design approach, crucially depends on the adaptation scheme. As described earlier, our approach copes with the combinatorial complexity of the search space by focusing the search with a probabilistic surrogate of the binding affinity. Moreover, the focused search is iteratively refined by updating the surrogate model as we observe the binding affinity of the previously selected designs. The whole process is consistent and guaranteed not to perform arbitrarily badly.[1] In particular, after at most polynomially many oracle queries our approach is guaranteed to sample from the posterior distribution over molecular structures that is defined by the best conditional model of the binding affinity from a family of such models provided as input to the algorithm. In contrast to the presented approach, adaptation schemes in de novo design are typically not driven by adaptive models/hypotheses and the achieved reductions in the search space do not come with any type of guarantee. More specifically, de novo design methods are stochastic processes that usually discover good candidates but there is no guarantee that any of these random processes will not become arbitrarily bad (i.e., fail to discover satisfactory lead candidates after at most polynomially many queries). Moreover, our empirical results (e.g., see Figure 3b) indicate that our approach exhibits a fast learning rate and after several hundred oracle queries samples a model that approximates fairly well the in silico proxy of the binding affinity to the selected protein binding site.

Most in silico scoring functions used in de novo design are developed to approximate primary target constraints, that is, the binding affinity of a ligand to a target protein site.[8,13] In silico evaluation of any compound can be a challenging and computationally intensive task. For example, the docking oracle employed in this paper takes approximately 20–25 CPU minutes to dock an individual molecule and typically involves the explicit docking of 3,000 to 7,000 conformations per molecule. This number of conformations is somewhat larger than that usually considered due to the presence of stereo-centers as well as exploring extended, slightly higher energy conformations. Receptor-based scoring functions typically employed in de novo design can be divided into three groups:[8,13,57] i) explicit force-field methods, ii) empirical scoring functions, and iii) knowledge-based scoring functions. The approaches with force-field scoring functions can be computationally expensive and discovered compounds are evaluated by approximating the binding energy.[58–62] The empirical scoring functions rely on a small set of known actives to train a regression model that weights individual ligand-receptor interaction types.[63–66] However, as only a small set of known actives is available beforehand, such oracles tend to bias the discovery process toward structural components present in the set of known actives.[8,13] Knowledge-based evaluation oracles are based on statistical properties of ligand-receptor structures, that is, frequencies of interactions between all possible pairs of atoms.[67,68] Such oracles require only structural information to derive the interaction frequencies for all pairs of atoms and are known to be less biased than the empirical ones.

Access to compounds from an intensionally specified chemical space of interest is typically provided through proposal generators (also known as structure samplers). The existing compound generation procedures can be classified into two groups:[8,13] receptor and ligand based structure samplers. Proposal generators based on a particular receptor structure can provide additional information characteristic to a protein binding site. Several such approaches have been developed over the years with the prominent ones being: i) linking of fragments placed at key interaction sites of the receptor structure,[63,64,66] ii) growing of a fragment randomly selected from a set of possible initial fragments which have all been placed at interaction sites of the receptor using expert knowledge,[58,67,68] and iii) structure sampling where randomly selected fragments are assembled at the receptor with the help of molecular dynamics simulations.[60–62,69] Ligand based proposal generators are independent of the receptor structure and work by sampling atoms/fragments and connecting them using valence rules.[8,13–15,70] Atom-based samplers[14,15,71] are known to generate diverse compounds and span a large chemical space. This then increases the combinatorial complexity of the search space and makes the search for active compounds more difficult. Contrary to this, fragment based approaches[16,63,64,72–74] can significantly reduce the size of the search space. The reduction is deemed meaningful[8,13] when the used fragments are common structures found in a variety of known drug-like compounds. In our simulations, we take the latter approach and investigate a local neighborhood of a parent compound with relatively moderate activity level, known from our earlier work.[26]

While the binding affinity is of primary concern for de novo design, equally important are secondary target constraints: absorption, distribution, metabolism, excretion, and toxicity (ADMET properties).[8,13,70] Similar to binding affinity, ADMET properties can also be approximated in silico.[70,75–77] Previous attempts to approximate secondary target constraints in silico include approaches based on QSAR analysis[70,76] and/or protein modelling.[70,77] Thus, any effective drug design algorithm needs to be successful across multiple different criteria. While the proof of concept presented in this paper focuses on binding affinity only, our approach can be easily adapted to multiple objectives. More specifically, instead of oracles with binary feedback we could employ an oracle providing binary vectors as feedback. Such vectors could, for example, have a component for each of the ADMET properties and the kernel function on such a space of properties can be given by a simple dot product between the binary vectors. The desired property class from which the algorithm would aim to sample is given by the vector of all ones. We consider this to be a promising avenue for future work.
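The proposed property kernel is straightforward to write down. The sketch below assumes that each component of a property vector is encoded as 1 (property satisfied) or −1 (not satisfied), mirroring the binary property space used for the docking label; this encoding and the names are assumptions made for illustration.

```python
def property_kernel(y, c):
    """Dot product between two binary property vectors with entries in {-1, 1}."""
    return sum(a * b for a, b in zip(y, c))


# Desired class: all five ADMET components satisfied.
TARGET = (1, 1, 1, 1, 1)

# Hypothetical feedback vector for a compound failing two of the five criteria.
feedback = (1, -1, 1, 1, -1)
similarity_to_target = property_kernel(TARGET, feedback)
```

With this encoding, a one-component vector recovers the binary label kernel used for the docking oracle, since the product of two labels is 1 when they agree and −1 otherwise.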

Wider application of de novo design algorithms has been hindered by two main shortcomings:[70] synthetic accessibility of the designed compounds and the insufficient reliability of the affinity approximations. In particular, proposal generators that only consider valence rules while proposing candidate compounds are not sufficient to ensure generation of stable and synthetically accessible molecules.[8,13] Several different approaches have been developed for tackling the problem of synthetic accessibility of compounds.[13,70] In this paper, we have taken one such approach by substituting fragments consisting of functional groups in place of hydrogen atoms at specific attachment points of the parent compound. As the parent compound is synthetically accessible, it is expected that the substitutions mimicking chemical reactions would yield synthetically accessible designs, as well. In addition to this, we have incorporated filters into our proposal generator to increase the drug-likeness of the proposals and their synthetic accessibility. The filters consist of some of Lipinski's rules[37] and a constraint preventing undesirable (from the perspective of synthetic accessibility) placements of hydroxyl groups. Moreover, as lead compounds with large molecular mass are likely to reduce the chance of a drug reaching the market,[70] we enhance Lipinski's constraint on molecular mass by incorporating a stochastic stopping criterion into our proposal generator that favors lighter compounds (e.g., see the mass-dependent stopping criterion in line 12 of Algorithm 3). Further improvements to structure sampling are possible with the addition of information from available sets of actual chemical reactions.[65,78–81] This type of additional information has the potential to generate a viable synthesis path together with a novel compound (i.e., recipe for derivation of any designed compound) and will be considered in future work.

5 Conclusions

The results presented in this study have been generated using simulations consuming approximately 28 hours of CPU time and running on 10 processors in parallel. This offers a significant speed up over an exhaustive exploration of the search space specified by the proposal generator that would take more than 8 months of CPU time using 10 parallel processors. Moreover, while the simulations were relatively short (with a budget of 500 oracle evaluations), the approach managed to discover a number of interesting compounds from the perspective of industrial medicinal chemists. Thus, the preliminary results of the algorithm augur well for future, more extensive work, which will include, for example, more extended searches and exploration of significantly larger spaces. Incorporation of more constraints relating to Lipinski's rules and synthetic accessibility will be investigated, as well. For instance, the tetra-benzenes proposed by the algorithm would be deprioritized in a synthetic effort, because they would be difficult to make. The logP values of some compounds are beyond the range usually considered drug-like. Another avenue of interest will be optimization of selectivity across different αvβ receptors; also of interest is pan-activity, i.e., high affinity for more than one αvβ receptor.

Conflict of Interest

SJFM is an employee and shareholder of GlaxoSmithKline; he is involved in unrelated series on an αv programme at GSK. For this work all SJFM's contributions are owned by the University of Nottingham.

Acknowledgements

We are grateful for access to the University of Nottingham High Performance Computing Facility. SAO is supported by EPSRC [EP/P510592/1] under a GlaxoSmithKline CASE award scheme. Part of this work was developed while DO, RG, and TG were at the University of Bonn and partially funded by the German Science Foundation [GA 1615/1-1].

References

[1] D. Oglic, R. Garnett, T. Gärtner, Proc. 31st AAAI Conf. Artif. Intell. 2017, 2449–2464.

[2] R. Garnett, Y. Krishnamurthy, X. Xiong, J. Schneider, R. P. Mann, In: Langford J (ed) Proc. 29th Intl. Conf. Mach. Learn. 2012, 1239–1246.

[3] R. Garnett, T. Gärtner, M. Vogt, J. Bajorath, J. Comput.-Aided Mol. Des. 2015, 29, 305–314.

[4] X. Wang, R. Garnett, J. Schneider, In: Proc. 19th ACM SIGKDD Intl. Conf. Knowledge Discov. and Data Min. 2013, 731–738.

[5] J. W. Scannell, A. Blanckley, H. Boldon, B. Warrington, Nat. Rev. Drug Discovery 2012, 11, 191–200.

[6] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, D. Hassabis, Nature 2017, 550, 354–359.

[7] K. Baumann, G. Schneider, Mol. Inf. 2017, 36, 1780132.

[8] G. Schneider, U. Fechner, Nat. Rev. Drug Discovery 2005, 4, 649–663.

[9] M. K. Warmuth, J. Liao, G. Rätsch, M. Mathieson, S. Putta, C. Lemmen, J. Chem. Inf. Comput. Sci. 2003, 43, 667–673.

[10] H. Altae-Tran, B. Ramsundar, A. S. Pappu, V. Pande, ACS Cent. Sci. 2017, 3, 283–293.



www.molinf.com

123456789

1011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556

[11] A. Mullard, Nature 2017, 549, 445–447.[12] S. Andersson, A. Armstrong, A. Bjçre, S. Bowker, S. Chapman, R.

Davies, C. Donald, B. Egner, T. Elebring, S. Holmqvist, T.Inghardt, P. Johannesson, M. Johansson, C. Johnstone, P.Kemmitt, J. Kihlberg, P. Korsgren, M. Lemurell, J. Moore, J. A.Pettersson, H. Pointon, F. Pont�n, P. Schofield, N. Selmi, P.Whittamore, Drug Discovery Today, 2009.

[13] G. Schneider, K. H. Baringhaus, In: De Novo Molecular Design,Weinheim: John Wiley & Sons, 2013, 1–55.

[14] Y. Nishibata, A. Itai, Tetrahedron 1991, 47, 8985–8990.[15] D. A. Pearlman, M. A. Murcko, J. Comput. Chem. 1993, 14, 1184–

1193.[16] E. Pellegrini, M. J. Field, J. Comput.-Aided Mol. Des. 2003, 17,

621–641.[17] A. Miranker, M. Karplus, Proteins 1995, 23, 472–490.[18] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller,

E. Teller, J. Chem. Phys. 1953, 21, 1087–1092.[19] C. Andrieu, N. de Freitas, A. Doucet, M. I. Jordan, Mach. Learn.

2003, 5–43.[20] Y.-M. Liu, K. Nepali, J.-P. Liou, J. Med. Chem. 2017, 60, 527–553.[21] K. Ley, J. Rivera-Nieves, W. J. Sandborn, S. Shattil, Nature Rev.

Drug Discov. 2016, 15, 173–183.[22] R. J. D. Hatley, S. J. F. Macdonald, R. J. Slack, P. T. Lukey, Angew.

Chem. Int. Ed. 2017, In press.[23] N. I. Reed, H. Jo, C. Chen, K. Tsujino, T. D. Arnold, W. F. DeGrado,

D. Sheppard, Sci. Transl. Med. 2015, 7, 288ra79.[24] X. Dong, Y. Yu, Q. Wang, Y. Xi, Y. Liu, Mol. Inf. 2017, 36, 1600069.[25] H. M. Sheldrake, L. H. Patterson, J. Med. Chem. 2014, 57, 6301–

6315.[26] J. Adams, E. C. Anderson, E. E. Blackham, Y. W. R. Chiu, T. Clarke,

N. Eccles, L. A. Gill, J. J. Haye, H. T. Haywood, C. R. Hoenig, M.Kausas, J. Le, H. L. Russell, C. Smedley, W. J. Tipping, T. Tongue,C. C. Wood, J. Yeung, J. E. Rowedder, M. J. Fray, T. McInally,S. J. F. Macdonald, ACS Med. Chem. Lett. 2014, 5, 1207–1212.

[27] O. V. Maltsev, U. K. Marelli, T. G. Kapp, F. Saverio Di Leva, S. DiMaro, M. Nieberler, U. Reuning, M. Schwaiger, E. Novellino, L.Marinelli, H. Kessler. Angew. Chem. Int. Ed. 2016, 55, 1535–1539.

[28] X. Dong, B. Zhao, R. E. Iacob, J. Zhu, A. C. Koksal, C. Lu, J. R.Engen, T. A. Springer, Nature 2017, 542, 55–59.

[29] X. Dong, N. E. Hudson, C. Lu, T. A. Springer, Nat. Struct. Mol.Biol. 2014, 21, 1091–1096.

[30] J. P. Xiong, T. Stehle, B. Diefenbach, R. Zhang, R. Dunker, D. L.Scott, A. Joachimiak, S. L. Goodman, M. A. Arnaout, Science2001, 294, 339–345.

[31] Y. Altun, A. J. Smola, In: Proc. 19th Ann. Conf. Learn. Th. 2006,139–153.

[32] E. T. Jaynes, Phys. Rev. 1957, 106, 620–630.[33] Y. Altun, A. J. Smola, T. Hofmann, In: Proc. 20th Conf. Uncertainty

in Artif. Intell. 2004, 2–9.[34] N. Shervashidze, P. Schweitzer, E. J. van Leeuwen, K. Mehlhorn,

K. M. Borgwardt, J. Mach. Learn. Res. 2011, 12, 2539–2561.[35] D. Rogers, M. Hahn, J. Chem. Inf. Model. 2010, 50, 742–754[36] G. Wahba, Spline models for observational data. SIAM, 1990.[37] C. A. Lipinski, F. Lombardo, B. W. Dominy, P. J. Feeney, Adv. Drug

Delivery Rev. 2001, 46, 3–26.[38] B. Cacciari, P. Crepaldi, S. Federico, G. Spalluto, Frontiers Med.

Chem., 2009, 4, 587–618.[39] M. McGann. J. Chem. Inf. Model. 2011, 51, 578–596.[40] M. Millard, S. Odde, N. Neamati, Theranostics 2011, 1, 154–188.[41] T. A. Halgren, J. Comput. Chem. 1999, 20, 720–729.[42] N. A. Anderson, I. B. Campbell, M. H. J. Campbell-Crawford, A. P.

Hancock, S. Lemma, S. J. F. Macdonald, J. M. Pritchard, P. A.Procopiou, Patent; WO 2016046230.

[43] N. A. Anderson, M. H. J. Campbell-Crawford, A. P. Hancock, S.Lemma, S. J. F. Macdonald, J. M. Pritchard, P. A. Procopiou, S.Swanson, Patent; WO 2016046241.

[44] B. Settles, Synthesis Lectures on Artificial Intelligence andMachine Learning 2012, 6, 1–114.

[45] D. Cohn, L. Atlas, R. Ladner, Mach. Learn. 1994, 15, 201–221.[46] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, N. de Freitas,

Proc. of the IEEE 2016, 104, 148–175.[47] M. Tesch, J. Schneider, H. Choset, In: Proc. 30th Intl. Conf. Mach.

Learn. 2013, JMLR: W&CP 28.[48] D. Reker, P. Schneider, G. Schneider, Chem. Sci. 2016, 7, 3919–

3927.[49] T. Lang, F. Flachsenberg, U. von Luxburg, M. Rarey, J. Chem. Inf.

Model. 2016, 56, 12�20.[50] D. Reker, G. Schneider, Drug Discovery Today 2015, 20, 458–465.[51] R. Gomez-Bombardelli, D. Duvenaud, J. M. Hernandez-Lobato,

J. Aguilera-Iparraguirre, T. D. Hirzel, R. P. Adams, A. Aspuru-Guzik, arXiv:1610.02415, 2016.

[52] D. Weininger, J. Chem. Inf. Comput. Sci. 1988, 28, 31–36.[53] M. J. Kusner, B. Paige, J. M. Hernandez-Lobato, In: Proc. 34th Intl.

Conf. Mach. Learn. 2017, PMLR 70.[54] J. Weston, B. Schçlkopf, G. H. Bakir, In: Adv. Neural Inf. Proc. Sys.

2004, 16.[55] C. Cortes, M. Mohri, J. Weston, In: Proc. 22nd Intl. Conf. Mach.

Learn. 2005.[56] S. Giguere, A. Rolland, F. Laviolette, M. Marchand, In: Proc. 32nd

Intl. Conf. Mach. Learn. 2015.[57] E. Parola, W. P. Walters, P. S. Charifson, Proteins 2003, 56, 235–

249.[58] Y. Nishibata, A. Itai, J. Med. Chem. 1993, 36, 2921–2928.[59] S. H. Rotstein, M. A. Murcko, J. Med. Chem. 1993, 36, 1700–

1710.[60] H. Liu, Z. Duan, Q. Luo, Y. Shi, Proteins 1999, 36, 462–470.[61] J. Zhu, H. Yu, H. Fan, H. Liu, Y. Shi, J. Comput.-Aided Mol. Des.

2001, 15, 447–463.[62] D. A. Pearlman, M. A. Murcko, J. Med. Chem. 1996, 39, 1651–

1663.[63] H. J. Bçhm, J. Comput.-Aided Mol. Des. 1992, 6, 61–78.[64] D. E. Clark, D. Frenkel, S. A. Levy, J. Li, C. W. Murray, B. Robson,

B. Waszkowycz, D. R. Westhead, J. Comput.-Aided Mol. Des.1995, 9, 13–32.

[65] C. W. Murray, D. E. Clark, T. R. Auton, M. A. Firth, J. Li, R. A.Sykes, B. Waszkowycz, D. R. Westhead, S. C. Young, J. Comput.-Aided Mol. Des. 1997, 11, 193–207.

[66] R. Wang, Y. Gao, L. Lai, J. Mol. Model. 2000, 6, 498–516.[67] R. S. DeWitte, E. I. Shakhnovich, J. Am. Chem. Soc. 1996, 118,

11733–11744.[68] A. V. Ishchenko, E. I. Shakhnovich, J. Med. Chem. 2002, 45,

2770–2780.[69] P. J. Goodford, J. Med. Chem. 1985, 28, 849–857.[70] E. Vangrevelinghe, S. R�disser, Curr. Comput.-Aided Drug Des.

2007, 3, 69–83[71] N. P. Todorov, P. M. Dean, J. Comput.-Aided Mol. Des. 1997, 47,

175–192.[72] N. Brown, B. McKay, F. Gilardoni, J. A. Gasteiger, J. Chem. Inf.

Comput. Sci. 2004, 44, 1079–1087.[73] A. C. Pierce, G. Rao, G. W. Bemis, J. Med. Chem. 2004, 47, 2768–

2775.[74] V. Gillet, A. P. Johnson, P. Mata, S. Sike, P. Williams, J. Comput.-

Aided Mol. Des. 1993, 7, 127–153.[75] H. van de Waterbeemd, E. Gilford, E. Nat. Rev. Drug Discov.

2003, 2, 192[76] H. van de Waterbeemd, S. Rose, S. Butler, Practice of Med.

Chem. 2003, 2 Ed., Adademic Press, 351–369.



www.molinf.com

123456789

1011121314151617181920212223242526272829303132333435363738394041424344454647484950515253545556

[77] O. F. Guener, IUL Biotechnol. Ser. 2 2000.[78] G. Schneider, M.-L. Lee, M. Stahl, P. Schneider, J. Comput.-Aided

Mol. Des. 2000, 14, 449–466.[79] X. O. Lewell, D. B. Budd, S. P. Watson, M. M. Hann, J. Chem. Inf.

Comput. Sci. 1998, 38, 511–522.[80] H. M. Vinkers, M. R. de Jonge, F. F. Daeyaert, J. Heeres, L. M.

Koymans, J. H. van Lenthe, P. J. Lewi, H. Timmerman, K.van Aken, P. A. Janssen, J. Med. Chem. 2003, 46, 2765–2773.

[81] M. Hartenfeller, M. Eberle, P. Meier, C. Nieto-Oberhuber, K.-H.Altmann, G. Schneider, E. Jacoby, S. Renner, J. Chem. Inf. Model.2011, 51, 3093–3098.

Received: November 1, 2017
Accepted: January 3, 2018

Published online on February 1, 2018


