8
The ChEMBL bioactivity database: an update A. Patrı´cia Bento, Anna Gaulton, Anne Hersey, Louisa J. Bellis, Jon Chambers, Mark Davies, Felix A. Kru ¨ ger, Yvonne Light, Lora Mak, Shaun McGlinchey, Michal Nowotka, George Papadatos, Rita Santos and John P. Overington* European Molecular Biology Laboratory European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK Received September 30, 2013; Accepted October 7, 2013 ABSTRACT ChEMBL is an open large-scale bioactivity database (https://www.ebi.ac.uk/chembl), previously described in the 2012 Nucleic Acids Research Database Issue. Since then, a variety of new data sources and im- provements in functionality have contributed to the growth and utility of the resource. In particular, more comprehensive tracking of compounds from research stages through clinical development to market is provided through the inclusion of data from United States Adopted Name applications; a new richer data model for representing drug targets has been developed; and a number of methods have been put in place to allow users to more easily identify reliable data. Finally, access to ChEMBL is now available via a new Resource Description Framework format, in addition to the web-based interface, data downloads and web services. INTRODUCTION ChEMBL is an open large-scale bioactivity database con- taining information largely manually extracted from the medicinal chemistry literature. Information regarding the compounds tested (including their structures), the biolo- gical or physicochemical assays performed on these and the targets of these assays are recorded in a structured form, allowing users to address a broad range of drug discovery questions. Applications of the data include the identification of suitable chemical tools for a target; inves- tigation of the selectivity and off-targets effects of drugs; large-scale data mining, such as the construction of pre- dictive models for targets and identification of bioisostere replacements or activity cliffs (1–4); and as a key compo- nent of integrated drug discovery platforms (5–7). In addition to literature-extracted information, ChEMBL also integrates deposited screening results and bioactivity data from other key public databases [e.g. PubChem BioAssay (8)], and information about approved drugs from resources such as the U.S. Food and Drug Administration (FDA) Orange Book (9) and DailyMed (http://dailymed.nlm.nih.gov/dailymed). Details of the data extraction process, curation and data model have been published previously (10); therefore, the current article focuses on recent enhancements to ChEMBL. DATA CONTENT Release 17 of the ChEMBL database contains informa- tion extracted from >51 000 publications, together with bioactivity data sets from 18 other sources (depositors and databases). In total, there are now >1.3 million distinct compound structures and 12 million bioactivity data points. The data are mapped to >9000 targets, of which 2827 are human protein targets. Data sets added over the past 2 years include the following: neglected disease screening results from projects funded by Medicines for Malaria Venture (11), Drugs for Neglected Diseases initiative (http://www.dndi.org), World Health Organization TDR programme (WHO- TDR) (12), Open Source Malaria (http://opensource malaria.org), Harvard University (13) and Glaxo- SmithKline (14); kinase screening results from Millipore (15), and several groups using the Protein Kinase Inhibitor Set compound collection (16); supplementary bioactivity data associated with publications from GlaxoSmithKline (17–19); and information from several other databases including DrugMatrix (https://ntp.niehs.nih.gov/ drugmatrix/index.html), TP-search (20) and Open TG- GATEs (21). NEW DEVELOPMENTS Tracking compound progression Although the extraction of structure–activity relationship data from medicinal chemistry literature provides a good overview of drug discovery research, a fuller picture of drugs in development and marketed products is obtained only by combining literature data with other information *To whom correspondence should be addressed. Tel: +44 1223 492666; Fax:+44 1223 494468; Email: [email protected] Published online 7 November 2013 Nucleic Acids Research, 2014, Vol. 42, Database issue D1083–D1090 doi:10.1093/nar/gkt1031 ß The Author(s) 2013. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

The ChEMBL bioactivity database: an update

  • Upload
    others

  • View
    4

  • Download
    0

Embed Size (px)

Citation preview

Page 1: The ChEMBL bioactivity database: an update

The ChEMBL bioactivity database an updateA Patrıcia Bento Anna Gaulton Anne Hersey Louisa J Bellis Jon Chambers

Mark Davies Felix A Kruger Yvonne Light Lora Mak Shaun McGlinchey

Michal Nowotka George Papadatos Rita Santos and John P Overington

European Molecular Biology Laboratory European Bioinformatics Institute Wellcome Trust Genome CampusHinxton Cambridgeshire CB10 1SD UK

Received September 30 2013 Accepted October 7 2013

ABSTRACT

ChEMBL is an open large-scale bioactivity database(httpswwwebiacukchembl) previously describedin the 2012 Nucleic Acids Research Database IssueSince then a variety of new data sources and im-provements in functionality have contributed to thegrowth and utility of the resource In particularmore comprehensive tracking of compounds fromresearch stages through clinical development tomarket is provided through the inclusion of datafrom United States Adopted Name applications anew richer data model for representing drug targetshas been developed and a number of methods havebeen put in place to allow users to more easilyidentify reliable data Finally access to ChEMBL isnow available via a new Resource DescriptionFramework format in addition to the web-basedinterface data downloads and web services

INTRODUCTION

ChEMBL is an open large-scale bioactivity database con-taining information largely manually extracted from themedicinal chemistry literature Information regarding thecompounds tested (including their structures) the biolo-gical or physicochemical assays performed on these andthe targets of these assays are recorded in a structuredform allowing users to address a broad range of drugdiscovery questions Applications of the data include theidentification of suitable chemical tools for a target inves-tigation of the selectivity and off-targets effects of drugslarge-scale data mining such as the construction of pre-dictive models for targets and identification of bioisosterereplacements or activity cliffs (1ndash4) and as a key compo-nent of integrated drug discovery platforms (5ndash7) Inaddition to literature-extracted information ChEMBLalso integrates deposited screening results and bioactivitydata from other key public databases [eg PubChemBioAssay (8)] and information about approved drugs

from resources such as the US Food and DrugAdministration (FDA) Orange Book (9) and DailyMed(httpdailymednlmnihgovdailymed) Details of thedata extraction process curation and data model havebeen published previously (10) therefore the currentarticle focuses on recent enhancements to ChEMBL

DATA CONTENT

Release 17 of the ChEMBL database contains informa-tion extracted from gt51 000 publications together withbioactivity data sets from 18 other sources (depositorsand databases) In total there are now gt13 milliondistinct compound structures and 12 million bioactivitydata points The data are mapped to gt9000 targets ofwhich 2827 are human protein targets Data sets addedover the past 2 years include the following neglecteddisease screening results from projects funded byMedicines for Malaria Venture (11) Drugs forNeglected Diseases initiative (httpwwwdndiorg)World Health Organization TDR programme (WHO-TDR) (12) Open Source Malaria (httpopensourcemalariaorg) Harvard University (13) and Glaxo-SmithKline (14) kinase screening results from Millipore(15) and several groups using the Protein Kinase InhibitorSet compound collection (16) supplementary bioactivitydata associated with publications from GlaxoSmithKline(17ndash19) and information from several other databasesincluding DrugMatrix (httpsntpniehsnihgovdrugmatrixindexhtml) TP-search (20) and Open TG-GATEs (21)

NEW DEVELOPMENTS

Tracking compound progression

Although the extraction of structurendashactivity relationshipdata from medicinal chemistry literature provides a goodoverview of drug discovery research a fuller picture ofdrugs in development and marketed products is obtainedonly by combining literature data with other information

To whom correspondence should be addressed Tel +44 1223 492666 Fax +44 1223 494468 Email jpoebiacuk

Published online 7 November 2013 Nucleic Acids Research 2014 Vol 42 Database issue D1083ndashD1090doi101093nargkt1031

The Author(s) 2013 Published by Oxford University PressThis is an Open Access article distributed under the terms of the Creative Commons Attribution License (httpcreativecommonsorglicensesby30) whichpermits unrestricted reuse distribution and reproduction in any medium provided the original work is properly cited

sources To increase the coverage of drugs in development(to complement the set of approved drugs alreadyincluded in ChEMBL from the FDA Orange Book) wehave now added structures and annotation for gt10 000compounds and biotherapeutics for which United StatesAdopted Name (USAN) or International NonproprietaryName (INN) applications have been filed This informa-tion has been obtained from the public list of adoptednames provided by the USANs Council (httpwwwama-assnorgamapubphysician-resourcesmedical-scienceunited-states-adopted-names-counciladopted-namespage) and the USP dictionary of USAN andInternational Drug Names (22) The application for aUSAN or INN is typically made when a compound is inearlymid-stage development and therefore serves as arobust general overview of clinical candidate spaceStructures for novel candidates are manually assignedand for protein therapeutics amino acid sequencesmay be annotated where available For each parentcompound information regarding its synonyms researchcodes applicants year of USAN assignment and the in-dication class for which the USAN has been initially filedwhere available is also included in the database Thesynonyms consist of the non-proprietary names for thecompounds containing that parent molecule and respect-ive type (or source) of that name such as the FDA nameUSAN INN British Approved Name (BAN) JapaneseAccepted Name (JAN) and French approved non-propri-etary name (Denomination Commune Francaise DCF)The inclusion of research codes and synonyms frommultiple sources maximizes the chance of finding acompound of interest based on text searches and allowsadaptive searching across the literature reflecting thechanging names of compounds as they are cross-licensedandor progress to later clinical stages The year of USANassignment can be used to roughly infer the likelihood of acompound being approved Typically an approved druggets its USAN assigned between 1ndash3 years beforeapproval and only a small fraction of drugs is approvedwhen the USAN is 10 years or older (see Figure 1)For each compound ChEMBL also provides the

USAN or INN stems assigned by the USANs councilor the WHO respectively These are prefixes suffixesand infixes in the non-proprietary name which emphasizea specific chemical structure type a pharmacologicalpropertymechanism of action or a combination ofthese In addition to the USAN-derived informationeach of the compounds is also annotated with its respect-ive Anatomical Therapeutic Chemical (ATC) code whereavailable The ATC classification is assigned by the WHOCollaborating Center for Drug Statistics Methodology(23) and can be used as a tool for comparing data ondrugs regarding the organ or anatomical system onwhich they act and their therapeutic pharmacologicaland chemical profile ChEMBL also provides users witha rapid assessment of the important compoundingredientfeatures such as drug type (synthetic small moleculenatural product-derived inorganic polymer antibodypeptideprotein oligonucleotide or oligosaccharide)whether the compound violates any of the Rule-of-Fivecriteria whether it exerts its pharmacological action by a

novel mechanism whether it is dosed as a defined singlestereoisomer or racemic mixture and whether it is a knownprodrug (see Figure 2)

The annotation of the compounds in preclinical researchand development serves as a platform for the annotation ofthe FDA-approved drugs It is likely that when a noveldrug is finally marketed much the information regardingthe active molecule can already be found in ChEMBL Foreach launched drug the ingredient-derived data are thencomplemented with information regarding its marketedproduct This includes information regarding its tradenames dosage information approval dates administrationroutes whether there are lsquoblack boxrsquo safety warningsassociated with the product whether it has a therapeuticapplication (as opposed to imagingdiagnostic agents addi-tives etc) and finally whether the product is available onprescription over-the-counter or if eventually it has beendiscontinued This information allows users of the bioactiv-ity data to assess whether a compound of interest is anapproved drug and is therefore likely to have an advan-tageous safetypharmacokinetic profile or be orally bio-available for example

Finally FDA-approved and WHO anti-malarial drugshave now been annotated with mechanism of action andefficacy target information This information has beenmanually assigned using primary sources such as pub-lished literature and manufacturerrsquos prescribing informa-tion Targets have only been included for a drug if (i) thedrug is believed to interact directly with the target and (ii)there is evidence that this interaction contributes towardthe efficacy of that drug in the indication(s) for which it isapproved We do not currently list as drug targets proteinsthat are responsible for the pharmacokinetics or adverseeffects of a drug those for which there is pharmacol-ogy data but no known link to in vivo efficacy (thoughthese can obviously be queried from the bioactivitydata in ChEMBL) or those that may be relevant toother indications for which the drug is not currentlyapproved

Figure 1 Frequency distribution for approved drugs showing thenumber of years taken for a drug to be approved after a USAN wasassigned to it

D1084 Nucleic Acids Research 2014 Vol 42 Database issue

Modeling drug targets

The appropriate representation of drug targets in bioactiv-ity databases is a non-trivial issue In certain circum-stances it may be sufficient to consider a lsquotargetrsquo and alsquoproteinrsquo as synonymous However there are numerouscases where this oversimplification breaks down A largenumber of marketed drugs for example bind to proteincomplexes or bind non-selectively to all members of aprotein family Similarly for published assay data if ameasurement has been performed in a cell-based assay itis often not clear which of several related proteins areresponsible for the effect observed Formerly assays inChEMBL that described the interaction of a compoundwith multiple possible proteins were mapped to severaltargets However this representation was suboptimal fora number of reasons Firstly users might have retrievedmultiple rows of data for a compound of interest anderroneously believed that it has been tested in multipledifferent assays against each of the individual targets Inaddition a user querying the database with a nonndashcompound-binding subunit of a protein complex wouldstill retrieve activity data and again might incorrectlyinfer that compounds identified bind directly to thatsubunit A new target data model has therefore beendeveloped within ChEMBL to draw a clear distinctionbetween targets (the entities with which a compound

interacts to exert its effects) and the molecular compo-nents (usually proteins) that make up those targetsA key step in the development of the new data model was

the definition of new target types The original lsquoPROTEINrsquotarget type has now been subdivided into a number ofcategories In the simple case where a compound isbelieved to interact specifically with a monomeric proteinthe target type lsquoSINGLE PROTEINrsquo is now used In caseswhere either a compound is known to act non-specificallywith all members of a protein family or the assay condi-tions are such that it is not possible to determine whichmember(s) of a protein family the compound is acting on(eg a cell-based or tissue-based assay) a target type oflsquoPROTEIN FAMILYrsquo is used The target represents andis linked to the group of all proteins (components) withwhich the compound may interact Where the molecularentity with which the compound interacts is known to bea protein complex and can be precisely defined the targettype lsquoPROTEIN COMPLEXrsquo is used Again the targetrepresents the complex itself but is linked to componentsrepresenting each of the protein subunits However it is notalways possible to define protein complex targets preciselyFor example compounds are often measured for activity atgamma-aminobutyric acid (GABA-A) receptors in tissue-based assays Although GABA-A receptors are known tobe pentameric ligand-gated ion channels they can consist

Figure 2 For rapid visual comparison the ChEMBL interface displays a set of icons that summarize features of the chemical compoundingredient(green icons) as well as any marketed products (blue icons)

Nucleic Acids Research 2014 Vol 42 Database issue D1085

of various combinations of a b and g subunits (of whichthere are 12 in total) In a tissue-based format the exactsubunit combinations present are generally not known Insuch cases the target type of lsquoPROTEIN COMPLEXGROUPrsquo is used Other new target types have also beencreated for approved drugs whosemolecular targets are notproteins (eg metal chelating agents ribosome inhibitorsantisense RNA agents) Table 1 shows a full list of all mo-lecular target types in ChEMBL and the number of targetsin each categoryThe new data model also allows the annotation of

binding site information on protein targets Binding sitescan be defined at varying levels of granularity (subunit-level protein domain-level or residue-level) according tothe information available This facility has been used toannotate bioactivity results with the predicted Pfamdomain to which the compound is most likely to bind (24)and to annotate drug efficacy targets with known bindingsubunit information For example benzodiazepine drugsare known to bind to GABA-A receptors at the interfaceof a and g subunits but only in a-1 a-2 a-3 and a-5containing receptors Therefore while the ChEMBL targetfor these drugs contains all a b and g subunits thebenzodiazepine binding site definition for this targetconsists of only the a-1 a-2 a-3 a-5 and g subunitsUsers can therefore use this information to excludeprotein subunits that are not directly involved in thebinding of the drug to its target (in this case the bsubunits) Full details of the latest ChEMBL data modelcan be seen in Supplementary Figure S1

Allowing users to pinpoint high-quality data

As the volume of data in ChEMBL grows it becomesincreasingly important to empower users with the abilityto evaluate the quality and appropriateness of these datafor their particular use cases For example while a

researcher investigating a single target or compoundseries may prefer to retrieve as much data as possibleand validate this information by referring back to theoriginal publications for other applications such as thetraining of computational models it may be vital toexclude any data that could potentially be erroneousTherefore a number of different enhancements havebeen made to the database to allow users to more easilyassess the drug-likeness of compounds compare bioactiv-ity values from different assays and highlight possibleerrors or duplications in the data

To help users assess the drug-likeness of compounds a setof physicochemical properties is provided for the ChEMBLcompounds The calculations are made on the parent formof the molecule after any salts have been removed andwhere the molecular weight of the compound is lt1000and the structure is comprised only of standard commonatoms (C N O H F Cl Br I S P) The exception isthe full molecular weight (FULL_MWT) which where ap-plicable is the molecular weight of the salt plus any hydratespresent The properties are calculated either using algorithmsprovided in Pipeline Pilot (version 85 Accelrys Inc 2012) orusing the ACDlabs Physchem software (version 1201Advanced Chemistry Development Inc 2010) Somefurther descriptors have been derived from these propertiessuch as the well-used Lipinski Rule-of-Five (25) the Rule-of-Three passes used to identify compounds suitable forfragment screening (26) and the weighted QuantitativeEstimate of Drug likeness (QED_WEIGHTED) for whichvalues range from 0ndash1 [1 being the most drug-like and 0 theleast drug-like (27)]

Ligand efficiencies are also increasingly used to identifynot just compounds that have high affinity for a target butthose that give maximum binding for their size number ofatoms or lipophilicitypolar atoms There are a number ofdifferent measures now described in the literature and four

Table 1 List of molecular target types included in release 17 of the ChEMBL database with a description and example of each type and the

total number of targets of that type

Target type Description Example Numberof targets

Single protein Single protein chain Phosphodiesterase 5A (CHEMBL1827) 5518Protein family Group of closely related proteins Muscarinic receptors

(CHEMBL2094109)188

Protein complex Defined protein complex consisting of multiplesubunits

GABA-A receptor alpha-3beta-3gamma-2 (CHEMBL2094120)

159

Protein complex group Poorly defined protein complex where subunitcomposition is unclear

GABA-A receptor (CHEMBL2093872) 43

Proteinndashprotein interaction Disruption of a proteinndashprotein interaction p53Mdm2 (CHEMBL1907611) 12Chimeric protein Fusion of two different proteins either a

synthetic construct or naturally occurringBcrAbl fusion protein

(CHEMBL2096618)2

Selectivity group Pair of proteins for which selectivity has beenassessed

Muscarinic receptors M2 and M3(CHEMBL2095187)

96

Proteinndashnucleic acidcomplex

Complex consisting of both protein andnucleic acid components

70S ribosome (CHEMBL2363965) 6

Nucleic acid DNA RNA or PNA Apo-B 100 mRNA (CHEMBL2364185) 28Oligosaccharide Oligosaccharide Heparin (CHEMBL2364712) 4Small molecule Small molecule such as amino acid sugar or

metaboliteGlutamine (CHEMBL2366039) 20

Macromolecule Large biological molecule other than proteincomplex

Hemozoin (CHEMBL613898) 4

Metal Metal or ion Iron (CHEMBL2363058) 8

D1086 Nucleic Acids Research 2014 Vol 42 Database issue

of the more common have now been made available in thedatabase Ligand Efficiency (LE) (28) Binding EfficiencyIndex (BEI) Surface Efficiency Index (SEI) (29) andLipophilic Ligand Efficiency (LLE) (30) The ligandefficiencies are calculated on the standardizedpChEMBL values (see later in the text) for binding datato protein targets For the BEI and SEI it is also possibleto see a plot of these for a specific target on the TargetReport Card on the ChEMBL Web site and interactivelylook at the structures and select sets of the most ligandefficient molecules that bind to a target of interest

When identifying compounds that bind to a particularprotein target for structurendashactivity relationship or leadidentification studies it is important to be using compar-able data We have therefore standardized many of theactivity types and their corresponding units For exampleIC50 IC50_mean IC50_mM and mean IC50 have all beengiven the standard activity type of IC50 and units of mMmM nmoll and 104moll and so forth have beenstandardized to nM Additionally a number of activitiesare reported in articles as the -log values (eg pKi pIC50-logIC50) we have anti-logged these values so that all ofthe IC50 values (whether reported in the original article asIC50 or pIC50) are seen as IC50 and hence all similaractivity types can be readily identified and their valuesmore easily compared

In addition to the conversion of published activitytypesvaluesunits to standard activity typesvaluesunitsan additional field called pCHEMBL_VALUE has beenadded to the activities table This value allows a numberof roughly comparable measures of half-maximal responseconcentrationpotencyaffinity to be compared on anegative logarithmic scale (eg an IC50 measurement of1 nM has a pChEMBL value of 9) The pChEMBL valueis currently defined as follows log10 (molar IC50 XC50EC50 AC50 Ki Kd or Potency)

Having standardized the activity types units andvalues it is also then possible to identify data that arepotentially erroneous and require further checking withreference to the original article These data are flaggedin the DATA_VALIDITY_COMMENT column ofthe activities table the values of which should be self-explanatory The key ones are lsquoOutside typical rangersquowhere the value is what would normally be consideredtoo high or low a value for the activity type (eg anIC50 value of 109nM) and lsquoNon standard unit for typersquowhere the unit is inappropriate for the activity type(eg an IC50 with units of or mg) A table is availablein the database (ACTIVITY_STDS_LOOKUP) thatcontains details of the activity types that have beenstandardized their permitted standardized units andtheir acceptable value ranges

Lastly the standardization of activity allows the identi-fication of potential duplicate activity values using therules outlined by Kramer et al (31) These are valueswhere an activity measurement reported in an article islikely to be a repeat citation of an earlier measurementrather than an independent measurement A particularexample would be a value reported on a compoundused as a standard in an assay We flag all data wherethe pChEMBL value between the earliest reference

and later references is lt002 (for the same compoundand target pair) with a value of 1 in thePOTENTIAL_DUPLICATE column of the activitiestable Similarly wherever the two pChEMBL valuesdiffer by exactly 3 or 6 log units the activity recordfrom the most recent publication is flagged as alsquoPotential transcription errorrsquo

DATA ACCESS

The ChEMBL interface

The ChEMBL database is accessible via a simple web-based interface at httpswwwebiacukchembl Thisinterface allows users to search the database in anumber of ways A simple key word search box at thetop of the page allows users to find compounds targetsassays or documents containing a search term of interest(by searching various name description and synonymfields) For users wishing to search for compounds bystructure rather than name the lsquoLigand Searchrsquo tabprovides a simple sketcher allowing users to draw a struc-ture or substructure of interest (or import a molfile) andretrieve related compounds (32) Alternatively com-pounds can also be retrieved by CHEMBL_ID smilesor a sequence search against biotherapeutic drugsSimilarly a lsquoTarget Searchrsquo tab allows searching oftargets by CHEMBL_ID or sequence Targets can alsobe browsed either by organism or protein family via thelsquoBrowse Targetsrsquo tab Enhancements have recently beenmade to this tree browser allowing users to search bykey word and identify relevant nodes of the tree and toexpand collapse or select multiple nodes simultaneouslyHaving identified compounds or targets of interest bio-

activity data can be retrieved either from a drop-downmenu on the search results pages or via the ReportCards provided for each ChEMBL compound targetassay or document which contain a series of clickablegraphical widgets for this purposeCompound and Target report card pages have been

improved to incorporate more extensive cross-referencesto other resources For compounds these are nowprovided by the UniChem service (33) whereas forprotein targets cross-references are derived either fromUniProt (34) or through manual annotation A newsection on the Compound and Target Report Cards alsoprovides mechanism of action information where avail-able for approved drugs and their efficacy targetsFollowing the modification of the ChEMBL target datamodel Target Report Cards now provide details of theprotein components of each target (eg in the case of aprotein complex or protein family) in the lsquoTargetComponentsrsquo section and a list of related targets (basedon overlap of protein components) in the lsquoTargetRelationsrsquo section This information allows users torapidly understand the composition of the target theyare viewing and identify any other similar targets thatmay have relevant bioactivity data (see Figure 3)A new feature of the Document Report Card is a

section that lists other publications in ChEMBL that aredeemed similar to the one featured in the Report Card

Nucleic Acids Research 2014 Vol 42 Database issue D1087

Figure 3 Screen capture showing enhancements to the ChEMBL Target Report Card The Target Components section shows which proteins arecomponents of this target (in this case members of the protein family) whereas the Target Relations section shows other targets that are related tothis one because they share one or more of those components The Approved Drugs section shows that there are approved products that are believedto exert at least part of their efficacy through inhibition of phosphodiesterase 4

D1088 Nucleic Acids Research 2014 Vol 42 Database issue

Several methods may be used to assess pairwise documentsimilarity eg overlap of Medical Subject Headings(MeSH) terms (httpwwwnlmnihgovmesh) ordocument clustering based on a term vector approach(35) In our case however the similarity between twodocuments consists of components the first one isdefined by whether a document cites or is referenced bythe other using information retrieved from EuropePMC(36) via the available web services Having establishedpairs of related documents the second component isdefined by the amount of overlap between the compoundsand biological targets reported in those documentsquantified by the Tanimoto coefficient (37) Forexample two articles reporting assay results for thesame set of compounds will have a Tanimoto score of 1as will two documents that report assay results for thesame set of targets The documents with the highestcompound and target Tanimoto similarity scores to thequery document are listed in the Related Documentssection (see Supplementary Figure S2)

In addition to the search tabs and report card pages anumber of tabs show different views of drug data withinthe database The lsquoBrowse Drugsrsquo tab lists FDA-approveddrugs and compounds from the USP DictionaryUSANdocuments together with their various properties andicons (as described in Figure 2) The new lsquoBrowse DrugTargetsrsquo tab lists mechanism of action information for allFDA-approved drugs and WHO anti-malarial drugs withlinks to the relevant Compound and Target Report Cardpages and references Finally the lsquoDrug Approvalsrsquo tabshows the most recently approved FDA drugs with linksto more detailed drug monographs

Downloads and web services

Although the ChEMBL interface provides the basic func-tionality required for many common queries some usersmay prefer to either download the entire database and useit locally (particularly for use in large-scale data miningor integration with other data sources) or retrieve dataprogrammatically via web services

The semantic web is becoming an increasingly popularplatform for large-scale data integration with many nowchoosing to use triple stores and storing data in ResourceDescription Framework (RDF) format in preference tobuilding traditional data warehouses In response todemand from our users an RDF version of ChEMBLhas now been developed and is available for downloadfrom our FTP site (ftpftpebiacukpubdatabaseschembl) The RDF model follows fairly closely the rela-tional model and uses a basic internal ontology known asthe ChEMBL Core Ontology to describe the core conceptsand relationships between them In addition multipleexternal ontologies are used including BioAssay Ontology(38) Unit Ontology (39) Quantities Units Dimensionsand Data Types Ontology (QUDT httpqudtorg) andChemical Information Ontology (CHEMINF) (40) ASPARQL endpoint and Linked Data browser (httpwwwebiacukfgptswlodestar) are also providedallowing users to query and navigate the data httpswwwebiacukrdfserviceschemblsparql

Each release of ChEMBL is also freely available fromour FTP site in a variety of other formats including OracleMySQL PostGRES a structure-data file (SDF) ofcompound structures and a FASTA format file of thetarget sequences under a Creative Commons Attribution-ShareAlike 30 Unported license (httpcreativecommonsorglicensesby-sa30)Finally a set of Representational State Transfer

(REST) based web services is also provided (togetherwith sample Java Perl and Python clients) to allow pro-grammatic retrieval of ChEMBL data in extensible mark-up language or JavaScript Object Notation (JSON)formats (see httpswwwebiacukchemblws for moredetails)

SUPPLEMENTARY DATA

Supplementary Data are available at NAR online

FUNDING

Strategic Award for Chemogenomics from the WellcomeTrust [WT086151Z08Z] European Molecular BiologyLaboratory the Innovative Medicines Initiative JointUndertaking [115191 115002] Medicines for MalariaVenture Funding for open access charge EuropeanMolecular Biology Laboratory

Conflict of interest statement None declared

REFERENCES

1 BesnardJ RudaGF SetolaV AbecassisK RodriguizRMHuangXP NorvalS SassanoMF ShinAI WebsterLAet al (2012) Automated design of ligands to polypharmacologicalprofiles Nature 492 215ndash220

2 HuY and BajorathJ (2012) Extending the activity cliff conceptstructural categorization of activity cliffs and systematicidentification of different types of cliffs in the ChEMBL databaseJ Chem Inf Model 52 1806ndash1811

3 LounkineE KeiserMJ WhitebreadS MikhailovDHamonJ JenkinsJL LavanP WeberE DoakAK CoteSet al (2012) Large-scale prediction and testing of drug activity onside-effect targets Nature 486 361ndash367

4 WirthM ZoeteV MichielinO and SauerWH (2013)SwissBioisostere a database of molecular replacements for liganddesign Nucleic Acids Res 41 D1137ndashD1143

5 Halling-BrownMD BulusuKC PatelM TymJE andAl-LazikaniB (2012) canSAR an integrated cancer publictranslational research and drug discovery resource Nucleic AcidsRes 40 D947ndashD956

6 MagarinosMP CarmonaSJ CrowtherGJ RalphSARoosDS ShanmugamD Van VoorhisWC and AgueroF(2012) TDR Targets a chemogenomics resource for neglecteddiseases Nucleic Acids Res 40 D1118ndashD1127

7 WilliamsAJ HarlandL GrothP PettiferS ChichesterCWillighagenEL EveloCT BlombergN EckerG GobleCet al (2012) Open PHACTS semantic interoperability for drugdiscovery Drug Discov Today 17 1188ndash1198

8 WangY XiaoJ SuzekTO ZhangJ WangJ ZhouZHanL KarapetyanK DrachevaS ShoemakerBA et al(2012) PubChemrsquos BioAssay Database Nucleic Acids Res 40D400ndashD412

9 US Department of Health and Human Services (2013) ApprovedDrug Products with Therapeutic Equivalence Evaluations 33rd ednUS Department of Health and Human Services WashingtonDC USA

Nucleic Acids Research 2014 Vol 42 Database issue D1089

10 GaultonA BellisLJ BentoAP ChambersJ DaviesMHerseyA LightY McGlincheyS MichalovichD Al-LazikaniB et al (2012) ChEMBL a large-scale bioactivitydatabase for drug discovery Nucleic Acids Res 40D1100ndashD1107

11 SpangenbergT BurrowsJN KowalczykP McDonaldSWellsTN and WillisP (2013) The open access malaria box adrug discovery catalyst for neglected diseases PLoS One 8e62906

12 NwakaS BessonD RamirezB MaesL MatheeussenABickleQ MansourNR YousifF TownsonS GokoolS et al(2011) Integrated dataset of screening hits against multipleneglected disease pathogens PLoS Negl Trop Dis 5 e1412

13 DerbyshireER PrudencioM MotaMM and ClardyJ (2012)Liver-stage malaria parasites vulnerable to diverse chemicalscaffolds Proc Natl Acad Sci USA 109 8511ndash8516

14 BallellL BatesRH YoungRJ Alvarez-GomezD Alvarez-RuizE BarrosoV BlancoD CrespoB EscribanoJGonzalezR et al (2013) Fueling open-source drug discovery 177small-molecule leads against tuberculosis ChemMedChem 8313ndash321

15 GaoY DaviesSP AugustinM WoodwardA PatelUAKovelmanR and HarveyKJ (2013) A broad activity screen insupport of a chemogenomic map for kinase signalling researchand drug discovery Biochem J 451 313ndash328

16 DranchakP MacArthurR GuhaR ZuercherWJDrewryDH AuldDS and IngleseJ (2013) Profile of the GSKpublished protein kinase inhibitor set across ATP-dependentand-independent luciferases implications for reporter-gene assaysPLoS One 8 e57888

17 HeightmanTD ConwayE CorbettDF MacdonaldGJStempG WestawaySM CelestiniP GagliardiSRiccaboniM RonzoniS et al (2008) Identification of smallmolecule agonists of the motilin receptor Bioorg Med ChemLett 18 6423ndash6428

18 HeightmanTD ScottJS LongleyM BordasV DeanDKElliottR HutleyG WitheringtonJ AbberleyL PassinghamBet al (2007) Potent achiral agonists of the ghrelin (growthhormone secretagogue) receptor Bioorg Med Chem Lett 176584ndash6587

19 MiuraT KuriharaK FuruuchiT YoshidaT and AjitoK(2008) Novel 16-membered macrolides modified at C-12 andC-13 positions of midecamycin A1 and miokamycin Part 1Synthesis and evaluation of 1213-carbamate and 12-arylalkylamino-13-hydroxy analogues Bioorg Med Chem 163985ndash4002

20 OzawaN ShimizuT MoritaR YokonoY OchiaiTMunesadaK OhashiA AidaY HamaY TakiK et al (2004)Transporter database TP-Search a web-accessible comprehensivedatabase for research in pharmacokinetics of drugs Pharm Res21 2133ndash2134

21 UeharaT OnoA MaruyamaT KatoI YamadaH OhnoYand UrushidaniT (2010) The Japanese toxicogenomics projectapplication of toxicogenomics Mol Nutr Food Res 54 218ndash227

22 The United States Pharmacopeial Convention (2010) USPDictionary of USAN and International Drug Names 46th edn TheUnited States Pharmacopeial Convention Rockville MD USA

23 WHO Collaborating Centre for Drug Statistics Methodology(2012) Guidelines for ATC Classification and DDD Assignment2013 16th edn WHO Collaborating Centre for Drug StatisticsMethodology Oslo

24 KrugerFA RostomR and OveringtonJP (2012) Mappingsmall molecule binding data to structural domains BMCBioinformatics 13(Suppl 17) S11

25 LipinskiCA LombardoF DominyBW and FeeneyPJ (2001)Experimental and computational approaches to estimate solubilityand permeability in drug discovery and development settingsAdv Drug Deliv Rev 46 3ndash26

26 CongreveM CarrR MurrayC and JhotiH (2003) A lsquorule ofthreersquo for fragment-based lead discovery Drug Discov Today 8876ndash877

27 BickertonGR PaoliniGV BesnardJ MuresanS andHopkinsAL (2012) Quantifying the chemical beauty of drugsNat Chem 4 90ndash98

28 HopkinsAL GroomCR and AlexA (2004) Ligand efficiencya useful metric for lead selection Drug Discov Today 9430ndash431

29 Abad-ZapateroC and MetzJT (2005) Ligand efficiency indicesas guideposts for drug discovery Drug Discov Today 10464ndash469

30 LeesonPD and SpringthorpeB (2007) The influence of drug-likeconcepts on decision-making in medicinal chemistry Nat RevDrug Discov 6 881ndash890

31 KramerC KalliokoskiT GedeckP and VulpettiA (2012) Theexperimental uncertainty of heterogeneous public K(i) dataJ Med Chem 55 5165ndash5173

32 BienfaitB and ErtlP (2013) JSME a free molecule editor inJavaScript J Cheminform 5 24

33 ChambersJ DaviesM GaultonA HerseyA VelankarSPetryszakR HastingsJ BellisL McGlincheyS andOveringtonJP (2013) UniChem a unified chemical structure cross-referencing and identifier tracking system J Cheminform 5 3

34 The Uniprot Consortium (2013) Update on activities at theUniversal Protein Resource (UniProt) in 2013 Nucleic Acids Res41 D43ndashD47

35 AljaberBS StokesN BaileyJ and PeiJ (2010) Documentclustering of scientific texts using citation contexts Inf Retr 13101ndash131

36 McEntyreJR AnaniadouS AndrewsS BlackWJBoulderstoneR ButteryP ChaplinD ChevuruS CobleyNColemanLA et al (2011) UKPMC a full text article resourcefor the life sciences Nucleic Acids Res 39 D58ndashD65

37 WillettP (1998) Chemical similarity searching J Chem InfComput Sci 38 983ndash996

38 VisserU AbeyruwanS VempatiU SmithRP LemmonVand SchurerSC (2011) BioAssay Ontology (BAO) a semanticdescription of bioassays and high-throughput screening resultsBMC Bioinformatics 12 257

39 GkoutosGV SchofieldPN and HoehndorfR (2012) The UnitsOntology a tool for integrating units of measurement in scienceDatabase 2012 bas033

40 HastingsJ ChepelevL WillighagenE AdamsN SteinbeckCand DumontierM (2011) The chemical information ontologyprovenance and disambiguation for chemical data on thebiological semantic web PLoS One 6 e25513

D1090 Nucleic Acids Research 2014 Vol 42 Database issue

Page 2: The ChEMBL bioactivity database: an update

sources To increase the coverage of drugs in development(to complement the set of approved drugs alreadyincluded in ChEMBL from the FDA Orange Book) wehave now added structures and annotation for gt10 000compounds and biotherapeutics for which United StatesAdopted Name (USAN) or International NonproprietaryName (INN) applications have been filed This informa-tion has been obtained from the public list of adoptednames provided by the USANs Council (httpwwwama-assnorgamapubphysician-resourcesmedical-scienceunited-states-adopted-names-counciladopted-namespage) and the USP dictionary of USAN andInternational Drug Names (22) The application for aUSAN or INN is typically made when a compound is inearlymid-stage development and therefore serves as arobust general overview of clinical candidate spaceStructures for novel candidates are manually assignedand for protein therapeutics amino acid sequencesmay be annotated where available For each parentcompound information regarding its synonyms researchcodes applicants year of USAN assignment and the in-dication class for which the USAN has been initially filedwhere available is also included in the database Thesynonyms consist of the non-proprietary names for thecompounds containing that parent molecule and respect-ive type (or source) of that name such as the FDA nameUSAN INN British Approved Name (BAN) JapaneseAccepted Name (JAN) and French approved non-propri-etary name (Denomination Commune Francaise DCF)The inclusion of research codes and synonyms frommultiple sources maximizes the chance of finding acompound of interest based on text searches and allowsadaptive searching across the literature reflecting thechanging names of compounds as they are cross-licensedandor progress to later clinical stages The year of USANassignment can be used to roughly infer the likelihood of acompound being approved Typically an approved druggets its USAN assigned between 1ndash3 years beforeapproval and only a small fraction of drugs is approvedwhen the USAN is 10 years or older (see Figure 1)For each compound ChEMBL also provides the

USAN or INN stems assigned by the USANs councilor the WHO respectively These are prefixes suffixesand infixes in the non-proprietary name which emphasizea specific chemical structure type a pharmacologicalpropertymechanism of action or a combination ofthese In addition to the USAN-derived informationeach of the compounds is also annotated with its respect-ive Anatomical Therapeutic Chemical (ATC) code whereavailable The ATC classification is assigned by the WHOCollaborating Center for Drug Statistics Methodology(23) and can be used as a tool for comparing data ondrugs regarding the organ or anatomical system onwhich they act and their therapeutic pharmacologicaland chemical profile ChEMBL also provides users witha rapid assessment of the important compoundingredientfeatures such as drug type (synthetic small moleculenatural product-derived inorganic polymer antibodypeptideprotein oligonucleotide or oligosaccharide)whether the compound violates any of the Rule-of-Fivecriteria whether it exerts its pharmacological action by a

novel mechanism whether it is dosed as a defined singlestereoisomer or racemic mixture and whether it is a knownprodrug (see Figure 2)

The annotation of the compounds in preclinical researchand development serves as a platform for the annotation ofthe FDA-approved drugs It is likely that when a noveldrug is finally marketed much the information regardingthe active molecule can already be found in ChEMBL Foreach launched drug the ingredient-derived data are thencomplemented with information regarding its marketedproduct This includes information regarding its tradenames dosage information approval dates administrationroutes whether there are lsquoblack boxrsquo safety warningsassociated with the product whether it has a therapeuticapplication (as opposed to imagingdiagnostic agents addi-tives etc) and finally whether the product is available onprescription over-the-counter or if eventually it has beendiscontinued This information allows users of the bioactiv-ity data to assess whether a compound of interest is anapproved drug and is therefore likely to have an advan-tageous safetypharmacokinetic profile or be orally bio-available for example

Finally FDA-approved and WHO anti-malarial drugshave now been annotated with mechanism of action andefficacy target information This information has beenmanually assigned using primary sources such as pub-lished literature and manufacturerrsquos prescribing informa-tion Targets have only been included for a drug if (i) thedrug is believed to interact directly with the target and (ii)there is evidence that this interaction contributes towardthe efficacy of that drug in the indication(s) for which it isapproved We do not currently list as drug targets proteinsthat are responsible for the pharmacokinetics or adverseeffects of a drug those for which there is pharmacol-ogy data but no known link to in vivo efficacy (thoughthese can obviously be queried from the bioactivitydata in ChEMBL) or those that may be relevant toother indications for which the drug is not currentlyapproved

Figure 1 Frequency distribution for approved drugs showing thenumber of years taken for a drug to be approved after a USAN wasassigned to it

D1084 Nucleic Acids Research 2014 Vol 42 Database issue

Modeling drug targets

The appropriate representation of drug targets in bioactiv-ity databases is a non-trivial issue In certain circum-stances it may be sufficient to consider a lsquotargetrsquo and alsquoproteinrsquo as synonymous However there are numerouscases where this oversimplification breaks down A largenumber of marketed drugs for example bind to proteincomplexes or bind non-selectively to all members of aprotein family Similarly for published assay data if ameasurement has been performed in a cell-based assay itis often not clear which of several related proteins areresponsible for the effect observed Formerly assays inChEMBL that described the interaction of a compoundwith multiple possible proteins were mapped to severaltargets However this representation was suboptimal fora number of reasons Firstly users might have retrievedmultiple rows of data for a compound of interest anderroneously believed that it has been tested in multipledifferent assays against each of the individual targets Inaddition a user querying the database with a nonndashcompound-binding subunit of a protein complex wouldstill retrieve activity data and again might incorrectlyinfer that compounds identified bind directly to thatsubunit A new target data model has therefore beendeveloped within ChEMBL to draw a clear distinctionbetween targets (the entities with which a compound

interacts to exert its effects) and the molecular compo-nents (usually proteins) that make up those targetsA key step in the development of the new data model was

the definition of new target types The original lsquoPROTEINrsquotarget type has now been subdivided into a number ofcategories In the simple case where a compound isbelieved to interact specifically with a monomeric proteinthe target type lsquoSINGLE PROTEINrsquo is now used In caseswhere either a compound is known to act non-specificallywith all members of a protein family or the assay condi-tions are such that it is not possible to determine whichmember(s) of a protein family the compound is acting on(eg a cell-based or tissue-based assay) a target type oflsquoPROTEIN FAMILYrsquo is used The target represents andis linked to the group of all proteins (components) withwhich the compound may interact Where the molecularentity with which the compound interacts is known to bea protein complex and can be precisely defined the targettype lsquoPROTEIN COMPLEXrsquo is used Again the targetrepresents the complex itself but is linked to componentsrepresenting each of the protein subunits However it is notalways possible to define protein complex targets preciselyFor example compounds are often measured for activity atgamma-aminobutyric acid (GABA-A) receptors in tissue-based assays Although GABA-A receptors are known tobe pentameric ligand-gated ion channels they can consist

Figure 2 For rapid visual comparison the ChEMBL interface displays a set of icons that summarize features of the chemical compoundingredient(green icons) as well as any marketed products (blue icons)

Nucleic Acids Research 2014 Vol 42 Database issue D1085

of various combinations of a b and g subunits (of whichthere are 12 in total) In a tissue-based format the exactsubunit combinations present are generally not known Insuch cases the target type of lsquoPROTEIN COMPLEXGROUPrsquo is used Other new target types have also beencreated for approved drugs whosemolecular targets are notproteins (eg metal chelating agents ribosome inhibitorsantisense RNA agents) Table 1 shows a full list of all mo-lecular target types in ChEMBL and the number of targetsin each categoryThe new data model also allows the annotation of

binding site information on protein targets Binding sitescan be defined at varying levels of granularity (subunit-level protein domain-level or residue-level) according tothe information available This facility has been used toannotate bioactivity results with the predicted Pfamdomain to which the compound is most likely to bind (24)and to annotate drug efficacy targets with known bindingsubunit information For example benzodiazepine drugsare known to bind to GABA-A receptors at the interfaceof a and g subunits but only in a-1 a-2 a-3 and a-5containing receptors Therefore while the ChEMBL targetfor these drugs contains all a b and g subunits thebenzodiazepine binding site definition for this targetconsists of only the a-1 a-2 a-3 a-5 and g subunitsUsers can therefore use this information to excludeprotein subunits that are not directly involved in thebinding of the drug to its target (in this case the bsubunits) Full details of the latest ChEMBL data modelcan be seen in Supplementary Figure S1

Allowing users to pinpoint high-quality data

As the volume of data in ChEMBL grows it becomesincreasingly important to empower users with the abilityto evaluate the quality and appropriateness of these datafor their particular use cases For example while a

researcher investigating a single target or compoundseries may prefer to retrieve as much data as possibleand validate this information by referring back to theoriginal publications for other applications such as thetraining of computational models it may be vital toexclude any data that could potentially be erroneousTherefore a number of different enhancements havebeen made to the database to allow users to more easilyassess the drug-likeness of compounds compare bioactiv-ity values from different assays and highlight possibleerrors or duplications in the data

To help users assess the drug-likeness of compounds a setof physicochemical properties is provided for the ChEMBLcompounds The calculations are made on the parent formof the molecule after any salts have been removed andwhere the molecular weight of the compound is lt1000and the structure is comprised only of standard commonatoms (C N O H F Cl Br I S P) The exception isthe full molecular weight (FULL_MWT) which where ap-plicable is the molecular weight of the salt plus any hydratespresent The properties are calculated either using algorithmsprovided in Pipeline Pilot (version 85 Accelrys Inc 2012) orusing the ACDlabs Physchem software (version 1201Advanced Chemistry Development Inc 2010) Somefurther descriptors have been derived from these propertiessuch as the well-used Lipinski Rule-of-Five (25) the Rule-of-Three passes used to identify compounds suitable forfragment screening (26) and the weighted QuantitativeEstimate of Drug likeness (QED_WEIGHTED) for whichvalues range from 0ndash1 [1 being the most drug-like and 0 theleast drug-like (27)]

Ligand efficiencies are also increasingly used to identifynot just compounds that have high affinity for a target butthose that give maximum binding for their size number ofatoms or lipophilicitypolar atoms There are a number ofdifferent measures now described in the literature and four

Table 1 List of molecular target types included in release 17 of the ChEMBL database with a description and example of each type and the

total number of targets of that type

Target type Description Example Numberof targets

Single protein Single protein chain Phosphodiesterase 5A (CHEMBL1827) 5518Protein family Group of closely related proteins Muscarinic receptors

(CHEMBL2094109)188

Protein complex Defined protein complex consisting of multiplesubunits

GABA-A receptor alpha-3beta-3gamma-2 (CHEMBL2094120)

159

Protein complex group Poorly defined protein complex where subunitcomposition is unclear

GABA-A receptor (CHEMBL2093872) 43

Proteinndashprotein interaction Disruption of a proteinndashprotein interaction p53Mdm2 (CHEMBL1907611) 12Chimeric protein Fusion of two different proteins either a

synthetic construct or naturally occurringBcrAbl fusion protein

(CHEMBL2096618)2

Selectivity group Pair of proteins for which selectivity has beenassessed

Muscarinic receptors M2 and M3(CHEMBL2095187)

96

Proteinndashnucleic acidcomplex

Complex consisting of both protein andnucleic acid components

70S ribosome (CHEMBL2363965) 6

Nucleic acid DNA RNA or PNA Apo-B 100 mRNA (CHEMBL2364185) 28Oligosaccharide Oligosaccharide Heparin (CHEMBL2364712) 4Small molecule Small molecule such as amino acid sugar or

metaboliteGlutamine (CHEMBL2366039) 20

Macromolecule Large biological molecule other than proteincomplex

Hemozoin (CHEMBL613898) 4

Metal Metal or ion Iron (CHEMBL2363058) 8

D1086 Nucleic Acids Research 2014 Vol 42 Database issue

of the more common have now been made available in thedatabase Ligand Efficiency (LE) (28) Binding EfficiencyIndex (BEI) Surface Efficiency Index (SEI) (29) andLipophilic Ligand Efficiency (LLE) (30) The ligandefficiencies are calculated on the standardizedpChEMBL values (see later in the text) for binding datato protein targets For the BEI and SEI it is also possibleto see a plot of these for a specific target on the TargetReport Card on the ChEMBL Web site and interactivelylook at the structures and select sets of the most ligandefficient molecules that bind to a target of interest

When identifying compounds that bind to a particularprotein target for structurendashactivity relationship or leadidentification studies it is important to be using compar-able data We have therefore standardized many of theactivity types and their corresponding units For exampleIC50 IC50_mean IC50_mM and mean IC50 have all beengiven the standard activity type of IC50 and units of mMmM nmoll and 104moll and so forth have beenstandardized to nM Additionally a number of activitiesare reported in articles as the -log values (eg pKi pIC50-logIC50) we have anti-logged these values so that all ofthe IC50 values (whether reported in the original article asIC50 or pIC50) are seen as IC50 and hence all similaractivity types can be readily identified and their valuesmore easily compared

In addition to the conversion of published activitytypesvaluesunits to standard activity typesvaluesunitsan additional field called pCHEMBL_VALUE has beenadded to the activities table This value allows a numberof roughly comparable measures of half-maximal responseconcentrationpotencyaffinity to be compared on anegative logarithmic scale (eg an IC50 measurement of1 nM has a pChEMBL value of 9) The pChEMBL valueis currently defined as follows log10 (molar IC50 XC50EC50 AC50 Ki Kd or Potency)

Having standardized the activity types units andvalues it is also then possible to identify data that arepotentially erroneous and require further checking withreference to the original article These data are flaggedin the DATA_VALIDITY_COMMENT column ofthe activities table the values of which should be self-explanatory The key ones are lsquoOutside typical rangersquowhere the value is what would normally be consideredtoo high or low a value for the activity type (eg anIC50 value of 109nM) and lsquoNon standard unit for typersquowhere the unit is inappropriate for the activity type(eg an IC50 with units of or mg) A table is availablein the database (ACTIVITY_STDS_LOOKUP) thatcontains details of the activity types that have beenstandardized their permitted standardized units andtheir acceptable value ranges

Lastly the standardization of activity allows the identi-fication of potential duplicate activity values using therules outlined by Kramer et al (31) These are valueswhere an activity measurement reported in an article islikely to be a repeat citation of an earlier measurementrather than an independent measurement A particularexample would be a value reported on a compoundused as a standard in an assay We flag all data wherethe pChEMBL value between the earliest reference

and later references is lt002 (for the same compoundand target pair) with a value of 1 in thePOTENTIAL_DUPLICATE column of the activitiestable Similarly wherever the two pChEMBL valuesdiffer by exactly 3 or 6 log units the activity recordfrom the most recent publication is flagged as alsquoPotential transcription errorrsquo

DATA ACCESS

The ChEMBL interface

The ChEMBL database is accessible via a simple web-based interface at httpswwwebiacukchembl Thisinterface allows users to search the database in anumber of ways A simple key word search box at thetop of the page allows users to find compounds targetsassays or documents containing a search term of interest(by searching various name description and synonymfields) For users wishing to search for compounds bystructure rather than name the lsquoLigand Searchrsquo tabprovides a simple sketcher allowing users to draw a struc-ture or substructure of interest (or import a molfile) andretrieve related compounds (32) Alternatively com-pounds can also be retrieved by CHEMBL_ID smilesor a sequence search against biotherapeutic drugsSimilarly a lsquoTarget Searchrsquo tab allows searching oftargets by CHEMBL_ID or sequence Targets can alsobe browsed either by organism or protein family via thelsquoBrowse Targetsrsquo tab Enhancements have recently beenmade to this tree browser allowing users to search bykey word and identify relevant nodes of the tree and toexpand collapse or select multiple nodes simultaneouslyHaving identified compounds or targets of interest bio-

activity data can be retrieved either from a drop-downmenu on the search results pages or via the ReportCards provided for each ChEMBL compound targetassay or document which contain a series of clickablegraphical widgets for this purposeCompound and Target report card pages have been

improved to incorporate more extensive cross-referencesto other resources For compounds these are nowprovided by the UniChem service (33) whereas forprotein targets cross-references are derived either fromUniProt (34) or through manual annotation A newsection on the Compound and Target Report Cards alsoprovides mechanism of action information where avail-able for approved drugs and their efficacy targetsFollowing the modification of the ChEMBL target datamodel Target Report Cards now provide details of theprotein components of each target (eg in the case of aprotein complex or protein family) in the lsquoTargetComponentsrsquo section and a list of related targets (basedon overlap of protein components) in the lsquoTargetRelationsrsquo section This information allows users torapidly understand the composition of the target theyare viewing and identify any other similar targets thatmay have relevant bioactivity data (see Figure 3)A new feature of the Document Report Card is a

section that lists other publications in ChEMBL that aredeemed similar to the one featured in the Report Card

Nucleic Acids Research 2014 Vol 42 Database issue D1087

Figure 3 Screen capture showing enhancements to the ChEMBL Target Report Card The Target Components section shows which proteins arecomponents of this target (in this case members of the protein family) whereas the Target Relations section shows other targets that are related tothis one because they share one or more of those components The Approved Drugs section shows that there are approved products that are believedto exert at least part of their efficacy through inhibition of phosphodiesterase 4

D1088 Nucleic Acids Research 2014 Vol 42 Database issue

Several methods may be used to assess pairwise documentsimilarity eg overlap of Medical Subject Headings(MeSH) terms (httpwwwnlmnihgovmesh) ordocument clustering based on a term vector approach(35) In our case however the similarity between twodocuments consists of components the first one isdefined by whether a document cites or is referenced bythe other using information retrieved from EuropePMC(36) via the available web services Having establishedpairs of related documents the second component isdefined by the amount of overlap between the compoundsand biological targets reported in those documentsquantified by the Tanimoto coefficient (37) Forexample two articles reporting assay results for thesame set of compounds will have a Tanimoto score of 1as will two documents that report assay results for thesame set of targets The documents with the highestcompound and target Tanimoto similarity scores to thequery document are listed in the Related Documentssection (see Supplementary Figure S2)

In addition to the search tabs and report card pages anumber of tabs show different views of drug data withinthe database The lsquoBrowse Drugsrsquo tab lists FDA-approveddrugs and compounds from the USP DictionaryUSANdocuments together with their various properties andicons (as described in Figure 2) The new lsquoBrowse DrugTargetsrsquo tab lists mechanism of action information for allFDA-approved drugs and WHO anti-malarial drugs withlinks to the relevant Compound and Target Report Cardpages and references Finally the lsquoDrug Approvalsrsquo tabshows the most recently approved FDA drugs with linksto more detailed drug monographs

Downloads and web services

Although the ChEMBL interface provides the basic func-tionality required for many common queries some usersmay prefer to either download the entire database and useit locally (particularly for use in large-scale data miningor integration with other data sources) or retrieve dataprogrammatically via web services

The semantic web is becoming an increasingly popularplatform for large-scale data integration with many nowchoosing to use triple stores and storing data in ResourceDescription Framework (RDF) format in preference tobuilding traditional data warehouses In response todemand from our users an RDF version of ChEMBLhas now been developed and is available for downloadfrom our FTP site (ftpftpebiacukpubdatabaseschembl) The RDF model follows fairly closely the rela-tional model and uses a basic internal ontology known asthe ChEMBL Core Ontology to describe the core conceptsand relationships between them In addition multipleexternal ontologies are used including BioAssay Ontology(38) Unit Ontology (39) Quantities Units Dimensionsand Data Types Ontology (QUDT httpqudtorg) andChemical Information Ontology (CHEMINF) (40) ASPARQL endpoint and Linked Data browser (httpwwwebiacukfgptswlodestar) are also providedallowing users to query and navigate the data httpswwwebiacukrdfserviceschemblsparql

Each release of ChEMBL is also freely available fromour FTP site in a variety of other formats including OracleMySQL PostGRES a structure-data file (SDF) ofcompound structures and a FASTA format file of thetarget sequences under a Creative Commons Attribution-ShareAlike 30 Unported license (httpcreativecommonsorglicensesby-sa30)Finally a set of Representational State Transfer

(REST) based web services is also provided (togetherwith sample Java Perl and Python clients) to allow pro-grammatic retrieval of ChEMBL data in extensible mark-up language or JavaScript Object Notation (JSON)formats (see httpswwwebiacukchemblws for moredetails)

SUPPLEMENTARY DATA

Supplementary Data are available at NAR online

FUNDING

Strategic Award for Chemogenomics from the WellcomeTrust [WT086151Z08Z] European Molecular BiologyLaboratory the Innovative Medicines Initiative JointUndertaking [115191 115002] Medicines for MalariaVenture Funding for open access charge EuropeanMolecular Biology Laboratory

Conflict of interest statement None declared

REFERENCES

1 BesnardJ RudaGF SetolaV AbecassisK RodriguizRMHuangXP NorvalS SassanoMF ShinAI WebsterLAet al (2012) Automated design of ligands to polypharmacologicalprofiles Nature 492 215ndash220

2 HuY and BajorathJ (2012) Extending the activity cliff conceptstructural categorization of activity cliffs and systematicidentification of different types of cliffs in the ChEMBL databaseJ Chem Inf Model 52 1806ndash1811

3 LounkineE KeiserMJ WhitebreadS MikhailovDHamonJ JenkinsJL LavanP WeberE DoakAK CoteSet al (2012) Large-scale prediction and testing of drug activity onside-effect targets Nature 486 361ndash367

4 WirthM ZoeteV MichielinO and SauerWH (2013)SwissBioisostere a database of molecular replacements for liganddesign Nucleic Acids Res 41 D1137ndashD1143

5 Halling-BrownMD BulusuKC PatelM TymJE andAl-LazikaniB (2012) canSAR an integrated cancer publictranslational research and drug discovery resource Nucleic AcidsRes 40 D947ndashD956

6 MagarinosMP CarmonaSJ CrowtherGJ RalphSARoosDS ShanmugamD Van VoorhisWC and AgueroF(2012) TDR Targets a chemogenomics resource for neglecteddiseases Nucleic Acids Res 40 D1118ndashD1127

7 WilliamsAJ HarlandL GrothP PettiferS ChichesterCWillighagenEL EveloCT BlombergN EckerG GobleCet al (2012) Open PHACTS semantic interoperability for drugdiscovery Drug Discov Today 17 1188ndash1198

8 WangY XiaoJ SuzekTO ZhangJ WangJ ZhouZHanL KarapetyanK DrachevaS ShoemakerBA et al(2012) PubChemrsquos BioAssay Database Nucleic Acids Res 40D400ndashD412

9 US Department of Health and Human Services (2013) ApprovedDrug Products with Therapeutic Equivalence Evaluations 33rd ednUS Department of Health and Human Services WashingtonDC USA

Nucleic Acids Research 2014 Vol 42 Database issue D1089

10 GaultonA BellisLJ BentoAP ChambersJ DaviesMHerseyA LightY McGlincheyS MichalovichD Al-LazikaniB et al (2012) ChEMBL a large-scale bioactivitydatabase for drug discovery Nucleic Acids Res 40D1100ndashD1107

11 SpangenbergT BurrowsJN KowalczykP McDonaldSWellsTN and WillisP (2013) The open access malaria box adrug discovery catalyst for neglected diseases PLoS One 8e62906

12 NwakaS BessonD RamirezB MaesL MatheeussenABickleQ MansourNR YousifF TownsonS GokoolS et al(2011) Integrated dataset of screening hits against multipleneglected disease pathogens PLoS Negl Trop Dis 5 e1412

13 DerbyshireER PrudencioM MotaMM and ClardyJ (2012)Liver-stage malaria parasites vulnerable to diverse chemicalscaffolds Proc Natl Acad Sci USA 109 8511ndash8516

14 BallellL BatesRH YoungRJ Alvarez-GomezD Alvarez-RuizE BarrosoV BlancoD CrespoB EscribanoJGonzalezR et al (2013) Fueling open-source drug discovery 177small-molecule leads against tuberculosis ChemMedChem 8313ndash321

15 GaoY DaviesSP AugustinM WoodwardA PatelUAKovelmanR and HarveyKJ (2013) A broad activity screen insupport of a chemogenomic map for kinase signalling researchand drug discovery Biochem J 451 313ndash328

16 DranchakP MacArthurR GuhaR ZuercherWJDrewryDH AuldDS and IngleseJ (2013) Profile of the GSKpublished protein kinase inhibitor set across ATP-dependentand-independent luciferases implications for reporter-gene assaysPLoS One 8 e57888

17 HeightmanTD ConwayE CorbettDF MacdonaldGJStempG WestawaySM CelestiniP GagliardiSRiccaboniM RonzoniS et al (2008) Identification of smallmolecule agonists of the motilin receptor Bioorg Med ChemLett 18 6423ndash6428

18 HeightmanTD ScottJS LongleyM BordasV DeanDKElliottR HutleyG WitheringtonJ AbberleyL PassinghamBet al (2007) Potent achiral agonists of the ghrelin (growthhormone secretagogue) receptor Bioorg Med Chem Lett 176584ndash6587

19 MiuraT KuriharaK FuruuchiT YoshidaT and AjitoK(2008) Novel 16-membered macrolides modified at C-12 andC-13 positions of midecamycin A1 and miokamycin Part 1Synthesis and evaluation of 1213-carbamate and 12-arylalkylamino-13-hydroxy analogues Bioorg Med Chem 163985ndash4002

20 OzawaN ShimizuT MoritaR YokonoY OchiaiTMunesadaK OhashiA AidaY HamaY TakiK et al (2004)Transporter database TP-Search a web-accessible comprehensivedatabase for research in pharmacokinetics of drugs Pharm Res21 2133ndash2134

21 UeharaT OnoA MaruyamaT KatoI YamadaH OhnoYand UrushidaniT (2010) The Japanese toxicogenomics projectapplication of toxicogenomics Mol Nutr Food Res 54 218ndash227

22 The United States Pharmacopeial Convention (2010) USPDictionary of USAN and International Drug Names 46th edn TheUnited States Pharmacopeial Convention Rockville MD USA

23 WHO Collaborating Centre for Drug Statistics Methodology(2012) Guidelines for ATC Classification and DDD Assignment2013 16th edn WHO Collaborating Centre for Drug StatisticsMethodology Oslo

24 KrugerFA RostomR and OveringtonJP (2012) Mappingsmall molecule binding data to structural domains BMCBioinformatics 13(Suppl 17) S11

25 LipinskiCA LombardoF DominyBW and FeeneyPJ (2001)Experimental and computational approaches to estimate solubilityand permeability in drug discovery and development settingsAdv Drug Deliv Rev 46 3ndash26

26 CongreveM CarrR MurrayC and JhotiH (2003) A lsquorule ofthreersquo for fragment-based lead discovery Drug Discov Today 8876ndash877

27 BickertonGR PaoliniGV BesnardJ MuresanS andHopkinsAL (2012) Quantifying the chemical beauty of drugsNat Chem 4 90ndash98

28 HopkinsAL GroomCR and AlexA (2004) Ligand efficiencya useful metric for lead selection Drug Discov Today 9430ndash431

29 Abad-ZapateroC and MetzJT (2005) Ligand efficiency indicesas guideposts for drug discovery Drug Discov Today 10464ndash469

30 LeesonPD and SpringthorpeB (2007) The influence of drug-likeconcepts on decision-making in medicinal chemistry Nat RevDrug Discov 6 881ndash890

31 KramerC KalliokoskiT GedeckP and VulpettiA (2012) Theexperimental uncertainty of heterogeneous public K(i) dataJ Med Chem 55 5165ndash5173

32 BienfaitB and ErtlP (2013) JSME a free molecule editor inJavaScript J Cheminform 5 24

33 ChambersJ DaviesM GaultonA HerseyA VelankarSPetryszakR HastingsJ BellisL McGlincheyS andOveringtonJP (2013) UniChem a unified chemical structure cross-referencing and identifier tracking system J Cheminform 5 3

34 The Uniprot Consortium (2013) Update on activities at theUniversal Protein Resource (UniProt) in 2013 Nucleic Acids Res41 D43ndashD47

35 AljaberBS StokesN BaileyJ and PeiJ (2010) Documentclustering of scientific texts using citation contexts Inf Retr 13101ndash131

36 McEntyreJR AnaniadouS AndrewsS BlackWJBoulderstoneR ButteryP ChaplinD ChevuruS CobleyNColemanLA et al (2011) UKPMC a full text article resourcefor the life sciences Nucleic Acids Res 39 D58ndashD65

37 WillettP (1998) Chemical similarity searching J Chem InfComput Sci 38 983ndash996

38 VisserU AbeyruwanS VempatiU SmithRP LemmonVand SchurerSC (2011) BioAssay Ontology (BAO) a semanticdescription of bioassays and high-throughput screening resultsBMC Bioinformatics 12 257

39 GkoutosGV SchofieldPN and HoehndorfR (2012) The UnitsOntology a tool for integrating units of measurement in scienceDatabase 2012 bas033

40 HastingsJ ChepelevL WillighagenE AdamsN SteinbeckCand DumontierM (2011) The chemical information ontologyprovenance and disambiguation for chemical data on thebiological semantic web PLoS One 6 e25513

D1090 Nucleic Acids Research 2014 Vol 42 Database issue

Page 3: The ChEMBL bioactivity database: an update

Modeling drug targets

The appropriate representation of drug targets in bioactiv-ity databases is a non-trivial issue In certain circum-stances it may be sufficient to consider a lsquotargetrsquo and alsquoproteinrsquo as synonymous However there are numerouscases where this oversimplification breaks down A largenumber of marketed drugs for example bind to proteincomplexes or bind non-selectively to all members of aprotein family Similarly for published assay data if ameasurement has been performed in a cell-based assay itis often not clear which of several related proteins areresponsible for the effect observed Formerly assays inChEMBL that described the interaction of a compoundwith multiple possible proteins were mapped to severaltargets However this representation was suboptimal fora number of reasons Firstly users might have retrievedmultiple rows of data for a compound of interest anderroneously believed that it has been tested in multipledifferent assays against each of the individual targets Inaddition a user querying the database with a nonndashcompound-binding subunit of a protein complex wouldstill retrieve activity data and again might incorrectlyinfer that compounds identified bind directly to thatsubunit A new target data model has therefore beendeveloped within ChEMBL to draw a clear distinctionbetween targets (the entities with which a compound

interacts to exert its effects) and the molecular compo-nents (usually proteins) that make up those targetsA key step in the development of the new data model was

the definition of new target types The original lsquoPROTEINrsquotarget type has now been subdivided into a number ofcategories In the simple case where a compound isbelieved to interact specifically with a monomeric proteinthe target type lsquoSINGLE PROTEINrsquo is now used In caseswhere either a compound is known to act non-specificallywith all members of a protein family or the assay condi-tions are such that it is not possible to determine whichmember(s) of a protein family the compound is acting on(eg a cell-based or tissue-based assay) a target type oflsquoPROTEIN FAMILYrsquo is used The target represents andis linked to the group of all proteins (components) withwhich the compound may interact Where the molecularentity with which the compound interacts is known to bea protein complex and can be precisely defined the targettype lsquoPROTEIN COMPLEXrsquo is used Again the targetrepresents the complex itself but is linked to componentsrepresenting each of the protein subunits However it is notalways possible to define protein complex targets preciselyFor example compounds are often measured for activity atgamma-aminobutyric acid (GABA-A) receptors in tissue-based assays Although GABA-A receptors are known tobe pentameric ligand-gated ion channels they can consist

Figure 2 For rapid visual comparison the ChEMBL interface displays a set of icons that summarize features of the chemical compoundingredient(green icons) as well as any marketed products (blue icons)

Nucleic Acids Research 2014 Vol 42 Database issue D1085

of various combinations of a b and g subunits (of whichthere are 12 in total) In a tissue-based format the exactsubunit combinations present are generally not known Insuch cases the target type of lsquoPROTEIN COMPLEXGROUPrsquo is used Other new target types have also beencreated for approved drugs whosemolecular targets are notproteins (eg metal chelating agents ribosome inhibitorsantisense RNA agents) Table 1 shows a full list of all mo-lecular target types in ChEMBL and the number of targetsin each categoryThe new data model also allows the annotation of

binding site information on protein targets Binding sitescan be defined at varying levels of granularity (subunit-level protein domain-level or residue-level) according tothe information available This facility has been used toannotate bioactivity results with the predicted Pfamdomain to which the compound is most likely to bind (24)and to annotate drug efficacy targets with known bindingsubunit information For example benzodiazepine drugsare known to bind to GABA-A receptors at the interfaceof a and g subunits but only in a-1 a-2 a-3 and a-5containing receptors Therefore while the ChEMBL targetfor these drugs contains all a b and g subunits thebenzodiazepine binding site definition for this targetconsists of only the a-1 a-2 a-3 a-5 and g subunitsUsers can therefore use this information to excludeprotein subunits that are not directly involved in thebinding of the drug to its target (in this case the bsubunits) Full details of the latest ChEMBL data modelcan be seen in Supplementary Figure S1

Allowing users to pinpoint high-quality data

As the volume of data in ChEMBL grows it becomesincreasingly important to empower users with the abilityto evaluate the quality and appropriateness of these datafor their particular use cases For example while a

researcher investigating a single target or compoundseries may prefer to retrieve as much data as possibleand validate this information by referring back to theoriginal publications for other applications such as thetraining of computational models it may be vital toexclude any data that could potentially be erroneousTherefore a number of different enhancements havebeen made to the database to allow users to more easilyassess the drug-likeness of compounds compare bioactiv-ity values from different assays and highlight possibleerrors or duplications in the data

To help users assess the drug-likeness of compounds a setof physicochemical properties is provided for the ChEMBLcompounds The calculations are made on the parent formof the molecule after any salts have been removed andwhere the molecular weight of the compound is lt1000and the structure is comprised only of standard commonatoms (C N O H F Cl Br I S P) The exception isthe full molecular weight (FULL_MWT) which where ap-plicable is the molecular weight of the salt plus any hydratespresent The properties are calculated either using algorithmsprovided in Pipeline Pilot (version 85 Accelrys Inc 2012) orusing the ACDlabs Physchem software (version 1201Advanced Chemistry Development Inc 2010) Somefurther descriptors have been derived from these propertiessuch as the well-used Lipinski Rule-of-Five (25) the Rule-of-Three passes used to identify compounds suitable forfragment screening (26) and the weighted QuantitativeEstimate of Drug likeness (QED_WEIGHTED) for whichvalues range from 0ndash1 [1 being the most drug-like and 0 theleast drug-like (27)]

Ligand efficiencies are also increasingly used to identifynot just compounds that have high affinity for a target butthose that give maximum binding for their size number ofatoms or lipophilicitypolar atoms There are a number ofdifferent measures now described in the literature and four

Table 1 List of molecular target types included in release 17 of the ChEMBL database with a description and example of each type and the

total number of targets of that type

Target type Description Example Numberof targets

Single protein Single protein chain Phosphodiesterase 5A (CHEMBL1827) 5518Protein family Group of closely related proteins Muscarinic receptors

(CHEMBL2094109)188

Protein complex Defined protein complex consisting of multiplesubunits

GABA-A receptor alpha-3beta-3gamma-2 (CHEMBL2094120)

159

Protein complex group Poorly defined protein complex where subunitcomposition is unclear

GABA-A receptor (CHEMBL2093872) 43

Proteinndashprotein interaction Disruption of a proteinndashprotein interaction p53Mdm2 (CHEMBL1907611) 12Chimeric protein Fusion of two different proteins either a

synthetic construct or naturally occurringBcrAbl fusion protein

(CHEMBL2096618)2

Selectivity group Pair of proteins for which selectivity has beenassessed

Muscarinic receptors M2 and M3(CHEMBL2095187)

96

Proteinndashnucleic acidcomplex

Complex consisting of both protein andnucleic acid components

70S ribosome (CHEMBL2363965) 6

Nucleic acid DNA RNA or PNA Apo-B 100 mRNA (CHEMBL2364185) 28Oligosaccharide Oligosaccharide Heparin (CHEMBL2364712) 4Small molecule Small molecule such as amino acid sugar or

metaboliteGlutamine (CHEMBL2366039) 20

Macromolecule Large biological molecule other than proteincomplex

Hemozoin (CHEMBL613898) 4

Metal Metal or ion Iron (CHEMBL2363058) 8

D1086 Nucleic Acids Research 2014 Vol 42 Database issue

of the more common have now been made available in thedatabase Ligand Efficiency (LE) (28) Binding EfficiencyIndex (BEI) Surface Efficiency Index (SEI) (29) andLipophilic Ligand Efficiency (LLE) (30) The ligandefficiencies are calculated on the standardizedpChEMBL values (see later in the text) for binding datato protein targets For the BEI and SEI it is also possibleto see a plot of these for a specific target on the TargetReport Card on the ChEMBL Web site and interactivelylook at the structures and select sets of the most ligandefficient molecules that bind to a target of interest

When identifying compounds that bind to a particularprotein target for structurendashactivity relationship or leadidentification studies it is important to be using compar-able data We have therefore standardized many of theactivity types and their corresponding units For exampleIC50 IC50_mean IC50_mM and mean IC50 have all beengiven the standard activity type of IC50 and units of mMmM nmoll and 104moll and so forth have beenstandardized to nM Additionally a number of activitiesare reported in articles as the -log values (eg pKi pIC50-logIC50) we have anti-logged these values so that all ofthe IC50 values (whether reported in the original article asIC50 or pIC50) are seen as IC50 and hence all similaractivity types can be readily identified and their valuesmore easily compared

In addition to the conversion of published activitytypesvaluesunits to standard activity typesvaluesunitsan additional field called pCHEMBL_VALUE has beenadded to the activities table This value allows a numberof roughly comparable measures of half-maximal responseconcentrationpotencyaffinity to be compared on anegative logarithmic scale (eg an IC50 measurement of1 nM has a pChEMBL value of 9) The pChEMBL valueis currently defined as follows log10 (molar IC50 XC50EC50 AC50 Ki Kd or Potency)

Having standardized the activity types units andvalues it is also then possible to identify data that arepotentially erroneous and require further checking withreference to the original article These data are flaggedin the DATA_VALIDITY_COMMENT column ofthe activities table the values of which should be self-explanatory The key ones are lsquoOutside typical rangersquowhere the value is what would normally be consideredtoo high or low a value for the activity type (eg anIC50 value of 109nM) and lsquoNon standard unit for typersquowhere the unit is inappropriate for the activity type(eg an IC50 with units of or mg) A table is availablein the database (ACTIVITY_STDS_LOOKUP) thatcontains details of the activity types that have beenstandardized their permitted standardized units andtheir acceptable value ranges

Lastly the standardization of activity allows the identi-fication of potential duplicate activity values using therules outlined by Kramer et al (31) These are valueswhere an activity measurement reported in an article islikely to be a repeat citation of an earlier measurementrather than an independent measurement A particularexample would be a value reported on a compoundused as a standard in an assay We flag all data wherethe pChEMBL value between the earliest reference

and later references is lt002 (for the same compoundand target pair) with a value of 1 in thePOTENTIAL_DUPLICATE column of the activitiestable Similarly wherever the two pChEMBL valuesdiffer by exactly 3 or 6 log units the activity recordfrom the most recent publication is flagged as alsquoPotential transcription errorrsquo

DATA ACCESS

The ChEMBL interface

The ChEMBL database is accessible via a simple web-based interface at httpswwwebiacukchembl Thisinterface allows users to search the database in anumber of ways A simple key word search box at thetop of the page allows users to find compounds targetsassays or documents containing a search term of interest(by searching various name description and synonymfields) For users wishing to search for compounds bystructure rather than name the lsquoLigand Searchrsquo tabprovides a simple sketcher allowing users to draw a struc-ture or substructure of interest (or import a molfile) andretrieve related compounds (32) Alternatively com-pounds can also be retrieved by CHEMBL_ID smilesor a sequence search against biotherapeutic drugsSimilarly a lsquoTarget Searchrsquo tab allows searching oftargets by CHEMBL_ID or sequence Targets can alsobe browsed either by organism or protein family via thelsquoBrowse Targetsrsquo tab Enhancements have recently beenmade to this tree browser allowing users to search bykey word and identify relevant nodes of the tree and toexpand collapse or select multiple nodes simultaneouslyHaving identified compounds or targets of interest bio-

activity data can be retrieved either from a drop-downmenu on the search results pages or via the ReportCards provided for each ChEMBL compound targetassay or document which contain a series of clickablegraphical widgets for this purposeCompound and Target report card pages have been

improved to incorporate more extensive cross-referencesto other resources For compounds these are nowprovided by the UniChem service (33) whereas forprotein targets cross-references are derived either fromUniProt (34) or through manual annotation A newsection on the Compound and Target Report Cards alsoprovides mechanism of action information where avail-able for approved drugs and their efficacy targetsFollowing the modification of the ChEMBL target datamodel Target Report Cards now provide details of theprotein components of each target (eg in the case of aprotein complex or protein family) in the lsquoTargetComponentsrsquo section and a list of related targets (basedon overlap of protein components) in the lsquoTargetRelationsrsquo section This information allows users torapidly understand the composition of the target theyare viewing and identify any other similar targets thatmay have relevant bioactivity data (see Figure 3)A new feature of the Document Report Card is a

section that lists other publications in ChEMBL that aredeemed similar to the one featured in the Report Card

Nucleic Acids Research 2014 Vol 42 Database issue D1087

Figure 3 Screen capture showing enhancements to the ChEMBL Target Report Card The Target Components section shows which proteins arecomponents of this target (in this case members of the protein family) whereas the Target Relations section shows other targets that are related tothis one because they share one or more of those components The Approved Drugs section shows that there are approved products that are believedto exert at least part of their efficacy through inhibition of phosphodiesterase 4

D1088 Nucleic Acids Research 2014 Vol 42 Database issue

Several methods may be used to assess pairwise documentsimilarity eg overlap of Medical Subject Headings(MeSH) terms (httpwwwnlmnihgovmesh) ordocument clustering based on a term vector approach(35) In our case however the similarity between twodocuments consists of components the first one isdefined by whether a document cites or is referenced bythe other using information retrieved from EuropePMC(36) via the available web services Having establishedpairs of related documents the second component isdefined by the amount of overlap between the compoundsand biological targets reported in those documentsquantified by the Tanimoto coefficient (37) Forexample two articles reporting assay results for thesame set of compounds will have a Tanimoto score of 1as will two documents that report assay results for thesame set of targets The documents with the highestcompound and target Tanimoto similarity scores to thequery document are listed in the Related Documentssection (see Supplementary Figure S2)

In addition to the search tabs and report card pages anumber of tabs show different views of drug data withinthe database The lsquoBrowse Drugsrsquo tab lists FDA-approveddrugs and compounds from the USP DictionaryUSANdocuments together with their various properties andicons (as described in Figure 2) The new lsquoBrowse DrugTargetsrsquo tab lists mechanism of action information for allFDA-approved drugs and WHO anti-malarial drugs withlinks to the relevant Compound and Target Report Cardpages and references Finally the lsquoDrug Approvalsrsquo tabshows the most recently approved FDA drugs with linksto more detailed drug monographs

Downloads and web services

Although the ChEMBL interface provides the basic func-tionality required for many common queries some usersmay prefer to either download the entire database and useit locally (particularly for use in large-scale data miningor integration with other data sources) or retrieve dataprogrammatically via web services

The semantic web is becoming an increasingly popularplatform for large-scale data integration with many nowchoosing to use triple stores and storing data in ResourceDescription Framework (RDF) format in preference tobuilding traditional data warehouses In response todemand from our users an RDF version of ChEMBLhas now been developed and is available for downloadfrom our FTP site (ftpftpebiacukpubdatabaseschembl) The RDF model follows fairly closely the rela-tional model and uses a basic internal ontology known asthe ChEMBL Core Ontology to describe the core conceptsand relationships between them In addition multipleexternal ontologies are used including BioAssay Ontology(38) Unit Ontology (39) Quantities Units Dimensionsand Data Types Ontology (QUDT httpqudtorg) andChemical Information Ontology (CHEMINF) (40) ASPARQL endpoint and Linked Data browser (httpwwwebiacukfgptswlodestar) are also providedallowing users to query and navigate the data httpswwwebiacukrdfserviceschemblsparql

Each release of ChEMBL is also freely available fromour FTP site in a variety of other formats including OracleMySQL PostGRES a structure-data file (SDF) ofcompound structures and a FASTA format file of thetarget sequences under a Creative Commons Attribution-ShareAlike 30 Unported license (httpcreativecommonsorglicensesby-sa30)Finally a set of Representational State Transfer

(REST) based web services is also provided (togetherwith sample Java Perl and Python clients) to allow pro-grammatic retrieval of ChEMBL data in extensible mark-up language or JavaScript Object Notation (JSON)formats (see httpswwwebiacukchemblws for moredetails)

SUPPLEMENTARY DATA

Supplementary Data are available at NAR online

FUNDING

Strategic Award for Chemogenomics from the WellcomeTrust [WT086151Z08Z] European Molecular BiologyLaboratory the Innovative Medicines Initiative JointUndertaking [115191 115002] Medicines for MalariaVenture Funding for open access charge EuropeanMolecular Biology Laboratory

Conflict of interest statement None declared

REFERENCES

1 BesnardJ RudaGF SetolaV AbecassisK RodriguizRMHuangXP NorvalS SassanoMF ShinAI WebsterLAet al (2012) Automated design of ligands to polypharmacologicalprofiles Nature 492 215ndash220

2 HuY and BajorathJ (2012) Extending the activity cliff conceptstructural categorization of activity cliffs and systematicidentification of different types of cliffs in the ChEMBL databaseJ Chem Inf Model 52 1806ndash1811

3 LounkineE KeiserMJ WhitebreadS MikhailovDHamonJ JenkinsJL LavanP WeberE DoakAK CoteSet al (2012) Large-scale prediction and testing of drug activity onside-effect targets Nature 486 361ndash367

4 WirthM ZoeteV MichielinO and SauerWH (2013)SwissBioisostere a database of molecular replacements for liganddesign Nucleic Acids Res 41 D1137ndashD1143

5 Halling-BrownMD BulusuKC PatelM TymJE andAl-LazikaniB (2012) canSAR an integrated cancer publictranslational research and drug discovery resource Nucleic AcidsRes 40 D947ndashD956

6 MagarinosMP CarmonaSJ CrowtherGJ RalphSARoosDS ShanmugamD Van VoorhisWC and AgueroF(2012) TDR Targets a chemogenomics resource for neglecteddiseases Nucleic Acids Res 40 D1118ndashD1127

7 WilliamsAJ HarlandL GrothP PettiferS ChichesterCWillighagenEL EveloCT BlombergN EckerG GobleCet al (2012) Open PHACTS semantic interoperability for drugdiscovery Drug Discov Today 17 1188ndash1198

8 WangY XiaoJ SuzekTO ZhangJ WangJ ZhouZHanL KarapetyanK DrachevaS ShoemakerBA et al(2012) PubChemrsquos BioAssay Database Nucleic Acids Res 40D400ndashD412

9 US Department of Health and Human Services (2013) ApprovedDrug Products with Therapeutic Equivalence Evaluations 33rd ednUS Department of Health and Human Services WashingtonDC USA

Nucleic Acids Research 2014 Vol 42 Database issue D1089

10 GaultonA BellisLJ BentoAP ChambersJ DaviesMHerseyA LightY McGlincheyS MichalovichD Al-LazikaniB et al (2012) ChEMBL a large-scale bioactivitydatabase for drug discovery Nucleic Acids Res 40D1100ndashD1107

11 SpangenbergT BurrowsJN KowalczykP McDonaldSWellsTN and WillisP (2013) The open access malaria box adrug discovery catalyst for neglected diseases PLoS One 8e62906

12 NwakaS BessonD RamirezB MaesL MatheeussenABickleQ MansourNR YousifF TownsonS GokoolS et al(2011) Integrated dataset of screening hits against multipleneglected disease pathogens PLoS Negl Trop Dis 5 e1412

13 DerbyshireER PrudencioM MotaMM and ClardyJ (2012)Liver-stage malaria parasites vulnerable to diverse chemicalscaffolds Proc Natl Acad Sci USA 109 8511ndash8516

14 BallellL BatesRH YoungRJ Alvarez-GomezD Alvarez-RuizE BarrosoV BlancoD CrespoB EscribanoJGonzalezR et al (2013) Fueling open-source drug discovery 177small-molecule leads against tuberculosis ChemMedChem 8313ndash321

15 GaoY DaviesSP AugustinM WoodwardA PatelUAKovelmanR and HarveyKJ (2013) A broad activity screen insupport of a chemogenomic map for kinase signalling researchand drug discovery Biochem J 451 313ndash328

16 DranchakP MacArthurR GuhaR ZuercherWJDrewryDH AuldDS and IngleseJ (2013) Profile of the GSKpublished protein kinase inhibitor set across ATP-dependentand-independent luciferases implications for reporter-gene assaysPLoS One 8 e57888

17 HeightmanTD ConwayE CorbettDF MacdonaldGJStempG WestawaySM CelestiniP GagliardiSRiccaboniM RonzoniS et al (2008) Identification of smallmolecule agonists of the motilin receptor Bioorg Med ChemLett 18 6423ndash6428

18 HeightmanTD ScottJS LongleyM BordasV DeanDKElliottR HutleyG WitheringtonJ AbberleyL PassinghamBet al (2007) Potent achiral agonists of the ghrelin (growthhormone secretagogue) receptor Bioorg Med Chem Lett 176584ndash6587

19 MiuraT KuriharaK FuruuchiT YoshidaT and AjitoK(2008) Novel 16-membered macrolides modified at C-12 andC-13 positions of midecamycin A1 and miokamycin Part 1Synthesis and evaluation of 1213-carbamate and 12-arylalkylamino-13-hydroxy analogues Bioorg Med Chem 163985ndash4002

20 OzawaN ShimizuT MoritaR YokonoY OchiaiTMunesadaK OhashiA AidaY HamaY TakiK et al (2004)Transporter database TP-Search a web-accessible comprehensivedatabase for research in pharmacokinetics of drugs Pharm Res21 2133ndash2134

21 UeharaT OnoA MaruyamaT KatoI YamadaH OhnoYand UrushidaniT (2010) The Japanese toxicogenomics projectapplication of toxicogenomics Mol Nutr Food Res 54 218ndash227

22 The United States Pharmacopeial Convention (2010) USPDictionary of USAN and International Drug Names 46th edn TheUnited States Pharmacopeial Convention Rockville MD USA

23 WHO Collaborating Centre for Drug Statistics Methodology(2012) Guidelines for ATC Classification and DDD Assignment2013 16th edn WHO Collaborating Centre for Drug StatisticsMethodology Oslo

24 KrugerFA RostomR and OveringtonJP (2012) Mappingsmall molecule binding data to structural domains BMCBioinformatics 13(Suppl 17) S11

25 LipinskiCA LombardoF DominyBW and FeeneyPJ (2001)Experimental and computational approaches to estimate solubilityand permeability in drug discovery and development settingsAdv Drug Deliv Rev 46 3ndash26

26 CongreveM CarrR MurrayC and JhotiH (2003) A lsquorule ofthreersquo for fragment-based lead discovery Drug Discov Today 8876ndash877

27 BickertonGR PaoliniGV BesnardJ MuresanS andHopkinsAL (2012) Quantifying the chemical beauty of drugsNat Chem 4 90ndash98

28 HopkinsAL GroomCR and AlexA (2004) Ligand efficiencya useful metric for lead selection Drug Discov Today 9430ndash431

29 Abad-ZapateroC and MetzJT (2005) Ligand efficiency indicesas guideposts for drug discovery Drug Discov Today 10464ndash469

30 LeesonPD and SpringthorpeB (2007) The influence of drug-likeconcepts on decision-making in medicinal chemistry Nat RevDrug Discov 6 881ndash890

31 KramerC KalliokoskiT GedeckP and VulpettiA (2012) Theexperimental uncertainty of heterogeneous public K(i) dataJ Med Chem 55 5165ndash5173

32 BienfaitB and ErtlP (2013) JSME a free molecule editor inJavaScript J Cheminform 5 24

33 ChambersJ DaviesM GaultonA HerseyA VelankarSPetryszakR HastingsJ BellisL McGlincheyS andOveringtonJP (2013) UniChem a unified chemical structure cross-referencing and identifier tracking system J Cheminform 5 3

34 The Uniprot Consortium (2013) Update on activities at theUniversal Protein Resource (UniProt) in 2013 Nucleic Acids Res41 D43ndashD47

35 AljaberBS StokesN BaileyJ and PeiJ (2010) Documentclustering of scientific texts using citation contexts Inf Retr 13101ndash131

36 McEntyreJR AnaniadouS AndrewsS BlackWJBoulderstoneR ButteryP ChaplinD ChevuruS CobleyNColemanLA et al (2011) UKPMC a full text article resourcefor the life sciences Nucleic Acids Res 39 D58ndashD65

37 WillettP (1998) Chemical similarity searching J Chem InfComput Sci 38 983ndash996

38 VisserU AbeyruwanS VempatiU SmithRP LemmonVand SchurerSC (2011) BioAssay Ontology (BAO) a semanticdescription of bioassays and high-throughput screening resultsBMC Bioinformatics 12 257

39 GkoutosGV SchofieldPN and HoehndorfR (2012) The UnitsOntology a tool for integrating units of measurement in scienceDatabase 2012 bas033

40 HastingsJ ChepelevL WillighagenE AdamsN SteinbeckCand DumontierM (2011) The chemical information ontologyprovenance and disambiguation for chemical data on thebiological semantic web PLoS One 6 e25513

D1090 Nucleic Acids Research 2014 Vol 42 Database issue

Page 4: The ChEMBL bioactivity database: an update

of various combinations of a b and g subunits (of whichthere are 12 in total) In a tissue-based format the exactsubunit combinations present are generally not known Insuch cases the target type of lsquoPROTEIN COMPLEXGROUPrsquo is used Other new target types have also beencreated for approved drugs whosemolecular targets are notproteins (eg metal chelating agents ribosome inhibitorsantisense RNA agents) Table 1 shows a full list of all mo-lecular target types in ChEMBL and the number of targetsin each categoryThe new data model also allows the annotation of

binding site information on protein targets Binding sitescan be defined at varying levels of granularity (subunit-level protein domain-level or residue-level) according tothe information available This facility has been used toannotate bioactivity results with the predicted Pfamdomain to which the compound is most likely to bind (24)and to annotate drug efficacy targets with known bindingsubunit information For example benzodiazepine drugsare known to bind to GABA-A receptors at the interfaceof a and g subunits but only in a-1 a-2 a-3 and a-5containing receptors Therefore while the ChEMBL targetfor these drugs contains all a b and g subunits thebenzodiazepine binding site definition for this targetconsists of only the a-1 a-2 a-3 a-5 and g subunitsUsers can therefore use this information to excludeprotein subunits that are not directly involved in thebinding of the drug to its target (in this case the bsubunits) Full details of the latest ChEMBL data modelcan be seen in Supplementary Figure S1

Allowing users to pinpoint high-quality data

As the volume of data in ChEMBL grows it becomesincreasingly important to empower users with the abilityto evaluate the quality and appropriateness of these datafor their particular use cases For example while a

researcher investigating a single target or compoundseries may prefer to retrieve as much data as possibleand validate this information by referring back to theoriginal publications for other applications such as thetraining of computational models it may be vital toexclude any data that could potentially be erroneousTherefore a number of different enhancements havebeen made to the database to allow users to more easilyassess the drug-likeness of compounds compare bioactiv-ity values from different assays and highlight possibleerrors or duplications in the data

To help users assess the drug-likeness of compounds a setof physicochemical properties is provided for the ChEMBLcompounds The calculations are made on the parent formof the molecule after any salts have been removed andwhere the molecular weight of the compound is lt1000and the structure is comprised only of standard commonatoms (C N O H F Cl Br I S P) The exception isthe full molecular weight (FULL_MWT) which where ap-plicable is the molecular weight of the salt plus any hydratespresent The properties are calculated either using algorithmsprovided in Pipeline Pilot (version 85 Accelrys Inc 2012) orusing the ACDlabs Physchem software (version 1201Advanced Chemistry Development Inc 2010) Somefurther descriptors have been derived from these propertiessuch as the well-used Lipinski Rule-of-Five (25) the Rule-of-Three passes used to identify compounds suitable forfragment screening (26) and the weighted QuantitativeEstimate of Drug likeness (QED_WEIGHTED) for whichvalues range from 0ndash1 [1 being the most drug-like and 0 theleast drug-like (27)]

Ligand efficiencies are also increasingly used to identifynot just compounds that have high affinity for a target butthose that give maximum binding for their size number ofatoms or lipophilicitypolar atoms There are a number ofdifferent measures now described in the literature and four

Table 1 List of molecular target types included in release 17 of the ChEMBL database with a description and example of each type and the

total number of targets of that type

Target type Description Example Numberof targets

Single protein Single protein chain Phosphodiesterase 5A (CHEMBL1827) 5518Protein family Group of closely related proteins Muscarinic receptors

(CHEMBL2094109)188

Protein complex Defined protein complex consisting of multiplesubunits

GABA-A receptor alpha-3beta-3gamma-2 (CHEMBL2094120)

159

Protein complex group Poorly defined protein complex where subunitcomposition is unclear

GABA-A receptor (CHEMBL2093872) 43

Proteinndashprotein interaction Disruption of a proteinndashprotein interaction p53Mdm2 (CHEMBL1907611) 12Chimeric protein Fusion of two different proteins either a

synthetic construct or naturally occurringBcrAbl fusion protein

(CHEMBL2096618)2

Selectivity group Pair of proteins for which selectivity has beenassessed

Muscarinic receptors M2 and M3(CHEMBL2095187)

96

Proteinndashnucleic acidcomplex

Complex consisting of both protein andnucleic acid components

70S ribosome (CHEMBL2363965) 6

Nucleic acid DNA RNA or PNA Apo-B 100 mRNA (CHEMBL2364185) 28Oligosaccharide Oligosaccharide Heparin (CHEMBL2364712) 4Small molecule Small molecule such as amino acid sugar or

metaboliteGlutamine (CHEMBL2366039) 20

Macromolecule Large biological molecule other than proteincomplex

Hemozoin (CHEMBL613898) 4

Metal Metal or ion Iron (CHEMBL2363058) 8

D1086 Nucleic Acids Research 2014 Vol 42 Database issue

of the more common have now been made available in thedatabase Ligand Efficiency (LE) (28) Binding EfficiencyIndex (BEI) Surface Efficiency Index (SEI) (29) andLipophilic Ligand Efficiency (LLE) (30) The ligandefficiencies are calculated on the standardizedpChEMBL values (see later in the text) for binding datato protein targets For the BEI and SEI it is also possibleto see a plot of these for a specific target on the TargetReport Card on the ChEMBL Web site and interactivelylook at the structures and select sets of the most ligandefficient molecules that bind to a target of interest

When identifying compounds that bind to a particularprotein target for structurendashactivity relationship or leadidentification studies it is important to be using compar-able data We have therefore standardized many of theactivity types and their corresponding units For exampleIC50 IC50_mean IC50_mM and mean IC50 have all beengiven the standard activity type of IC50 and units of mMmM nmoll and 104moll and so forth have beenstandardized to nM Additionally a number of activitiesare reported in articles as the -log values (eg pKi pIC50-logIC50) we have anti-logged these values so that all ofthe IC50 values (whether reported in the original article asIC50 or pIC50) are seen as IC50 and hence all similaractivity types can be readily identified and their valuesmore easily compared

In addition to the conversion of published activitytypesvaluesunits to standard activity typesvaluesunitsan additional field called pCHEMBL_VALUE has beenadded to the activities table This value allows a numberof roughly comparable measures of half-maximal responseconcentrationpotencyaffinity to be compared on anegative logarithmic scale (eg an IC50 measurement of1 nM has a pChEMBL value of 9) The pChEMBL valueis currently defined as follows log10 (molar IC50 XC50EC50 AC50 Ki Kd or Potency)

Having standardized the activity types units andvalues it is also then possible to identify data that arepotentially erroneous and require further checking withreference to the original article These data are flaggedin the DATA_VALIDITY_COMMENT column ofthe activities table the values of which should be self-explanatory The key ones are lsquoOutside typical rangersquowhere the value is what would normally be consideredtoo high or low a value for the activity type (eg anIC50 value of 109nM) and lsquoNon standard unit for typersquowhere the unit is inappropriate for the activity type(eg an IC50 with units of or mg) A table is availablein the database (ACTIVITY_STDS_LOOKUP) thatcontains details of the activity types that have beenstandardized their permitted standardized units andtheir acceptable value ranges

Lastly the standardization of activity allows the identi-fication of potential duplicate activity values using therules outlined by Kramer et al (31) These are valueswhere an activity measurement reported in an article islikely to be a repeat citation of an earlier measurementrather than an independent measurement A particularexample would be a value reported on a compoundused as a standard in an assay We flag all data wherethe pChEMBL value between the earliest reference

and later references is lt002 (for the same compoundand target pair) with a value of 1 in thePOTENTIAL_DUPLICATE column of the activitiestable Similarly wherever the two pChEMBL valuesdiffer by exactly 3 or 6 log units the activity recordfrom the most recent publication is flagged as alsquoPotential transcription errorrsquo

DATA ACCESS

The ChEMBL interface

The ChEMBL database is accessible via a simple web-based interface at httpswwwebiacukchembl Thisinterface allows users to search the database in anumber of ways A simple key word search box at thetop of the page allows users to find compounds targetsassays or documents containing a search term of interest(by searching various name description and synonymfields) For users wishing to search for compounds bystructure rather than name the lsquoLigand Searchrsquo tabprovides a simple sketcher allowing users to draw a struc-ture or substructure of interest (or import a molfile) andretrieve related compounds (32) Alternatively com-pounds can also be retrieved by CHEMBL_ID smilesor a sequence search against biotherapeutic drugsSimilarly a lsquoTarget Searchrsquo tab allows searching oftargets by CHEMBL_ID or sequence Targets can alsobe browsed either by organism or protein family via thelsquoBrowse Targetsrsquo tab Enhancements have recently beenmade to this tree browser allowing users to search bykey word and identify relevant nodes of the tree and toexpand collapse or select multiple nodes simultaneouslyHaving identified compounds or targets of interest bio-

activity data can be retrieved either from a drop-downmenu on the search results pages or via the ReportCards provided for each ChEMBL compound targetassay or document which contain a series of clickablegraphical widgets for this purposeCompound and Target report card pages have been

improved to incorporate more extensive cross-referencesto other resources For compounds these are nowprovided by the UniChem service (33) whereas forprotein targets cross-references are derived either fromUniProt (34) or through manual annotation A newsection on the Compound and Target Report Cards alsoprovides mechanism of action information where avail-able for approved drugs and their efficacy targetsFollowing the modification of the ChEMBL target datamodel Target Report Cards now provide details of theprotein components of each target (eg in the case of aprotein complex or protein family) in the lsquoTargetComponentsrsquo section and a list of related targets (basedon overlap of protein components) in the lsquoTargetRelationsrsquo section This information allows users torapidly understand the composition of the target theyare viewing and identify any other similar targets thatmay have relevant bioactivity data (see Figure 3)A new feature of the Document Report Card is a

section that lists other publications in ChEMBL that aredeemed similar to the one featured in the Report Card

Nucleic Acids Research 2014 Vol 42 Database issue D1087

Figure 3 Screen capture showing enhancements to the ChEMBL Target Report Card The Target Components section shows which proteins arecomponents of this target (in this case members of the protein family) whereas the Target Relations section shows other targets that are related tothis one because they share one or more of those components The Approved Drugs section shows that there are approved products that are believedto exert at least part of their efficacy through inhibition of phosphodiesterase 4

D1088 Nucleic Acids Research 2014 Vol 42 Database issue

Several methods may be used to assess pairwise documentsimilarity eg overlap of Medical Subject Headings(MeSH) terms (httpwwwnlmnihgovmesh) ordocument clustering based on a term vector approach(35) In our case however the similarity between twodocuments consists of components the first one isdefined by whether a document cites or is referenced bythe other using information retrieved from EuropePMC(36) via the available web services Having establishedpairs of related documents the second component isdefined by the amount of overlap between the compoundsand biological targets reported in those documentsquantified by the Tanimoto coefficient (37) Forexample two articles reporting assay results for thesame set of compounds will have a Tanimoto score of 1as will two documents that report assay results for thesame set of targets The documents with the highestcompound and target Tanimoto similarity scores to thequery document are listed in the Related Documentssection (see Supplementary Figure S2)

In addition to the search tabs and report card pages anumber of tabs show different views of drug data withinthe database The lsquoBrowse Drugsrsquo tab lists FDA-approveddrugs and compounds from the USP DictionaryUSANdocuments together with their various properties andicons (as described in Figure 2) The new lsquoBrowse DrugTargetsrsquo tab lists mechanism of action information for allFDA-approved drugs and WHO anti-malarial drugs withlinks to the relevant Compound and Target Report Cardpages and references Finally the lsquoDrug Approvalsrsquo tabshows the most recently approved FDA drugs with linksto more detailed drug monographs

Downloads and web services

Although the ChEMBL interface provides the basic func-tionality required for many common queries some usersmay prefer to either download the entire database and useit locally (particularly for use in large-scale data miningor integration with other data sources) or retrieve dataprogrammatically via web services

The semantic web is becoming an increasingly popularplatform for large-scale data integration with many nowchoosing to use triple stores and storing data in ResourceDescription Framework (RDF) format in preference tobuilding traditional data warehouses In response todemand from our users an RDF version of ChEMBLhas now been developed and is available for downloadfrom our FTP site (ftpftpebiacukpubdatabaseschembl) The RDF model follows fairly closely the rela-tional model and uses a basic internal ontology known asthe ChEMBL Core Ontology to describe the core conceptsand relationships between them In addition multipleexternal ontologies are used including BioAssay Ontology(38) Unit Ontology (39) Quantities Units Dimensionsand Data Types Ontology (QUDT httpqudtorg) andChemical Information Ontology (CHEMINF) (40) ASPARQL endpoint and Linked Data browser (httpwwwebiacukfgptswlodestar) are also providedallowing users to query and navigate the data httpswwwebiacukrdfserviceschemblsparql

Each release of ChEMBL is also freely available fromour FTP site in a variety of other formats including OracleMySQL PostGRES a structure-data file (SDF) ofcompound structures and a FASTA format file of thetarget sequences under a Creative Commons Attribution-ShareAlike 30 Unported license (httpcreativecommonsorglicensesby-sa30)Finally a set of Representational State Transfer

(REST) based web services is also provided (togetherwith sample Java Perl and Python clients) to allow pro-grammatic retrieval of ChEMBL data in extensible mark-up language or JavaScript Object Notation (JSON)formats (see httpswwwebiacukchemblws for moredetails)

SUPPLEMENTARY DATA

Supplementary Data are available at NAR online

FUNDING

Strategic Award for Chemogenomics from the WellcomeTrust [WT086151Z08Z] European Molecular BiologyLaboratory the Innovative Medicines Initiative JointUndertaking [115191 115002] Medicines for MalariaVenture Funding for open access charge EuropeanMolecular Biology Laboratory

Conflict of interest statement None declared

REFERENCES

1 BesnardJ RudaGF SetolaV AbecassisK RodriguizRMHuangXP NorvalS SassanoMF ShinAI WebsterLAet al (2012) Automated design of ligands to polypharmacologicalprofiles Nature 492 215ndash220

2 HuY and BajorathJ (2012) Extending the activity cliff conceptstructural categorization of activity cliffs and systematicidentification of different types of cliffs in the ChEMBL databaseJ Chem Inf Model 52 1806ndash1811

3 LounkineE KeiserMJ WhitebreadS MikhailovDHamonJ JenkinsJL LavanP WeberE DoakAK CoteSet al (2012) Large-scale prediction and testing of drug activity onside-effect targets Nature 486 361ndash367

4 WirthM ZoeteV MichielinO and SauerWH (2013)SwissBioisostere a database of molecular replacements for liganddesign Nucleic Acids Res 41 D1137ndashD1143

5 Halling-BrownMD BulusuKC PatelM TymJE andAl-LazikaniB (2012) canSAR an integrated cancer publictranslational research and drug discovery resource Nucleic AcidsRes 40 D947ndashD956

6 MagarinosMP CarmonaSJ CrowtherGJ RalphSARoosDS ShanmugamD Van VoorhisWC and AgueroF(2012) TDR Targets a chemogenomics resource for neglecteddiseases Nucleic Acids Res 40 D1118ndashD1127

7 WilliamsAJ HarlandL GrothP PettiferS ChichesterCWillighagenEL EveloCT BlombergN EckerG GobleCet al (2012) Open PHACTS semantic interoperability for drugdiscovery Drug Discov Today 17 1188ndash1198

8 WangY XiaoJ SuzekTO ZhangJ WangJ ZhouZHanL KarapetyanK DrachevaS ShoemakerBA et al(2012) PubChemrsquos BioAssay Database Nucleic Acids Res 40D400ndashD412

9 US Department of Health and Human Services (2013) ApprovedDrug Products with Therapeutic Equivalence Evaluations 33rd ednUS Department of Health and Human Services WashingtonDC USA

Nucleic Acids Research 2014 Vol 42 Database issue D1089

10 GaultonA BellisLJ BentoAP ChambersJ DaviesMHerseyA LightY McGlincheyS MichalovichD Al-LazikaniB et al (2012) ChEMBL a large-scale bioactivitydatabase for drug discovery Nucleic Acids Res 40D1100ndashD1107

11 SpangenbergT BurrowsJN KowalczykP McDonaldSWellsTN and WillisP (2013) The open access malaria box adrug discovery catalyst for neglected diseases PLoS One 8e62906

12 NwakaS BessonD RamirezB MaesL MatheeussenABickleQ MansourNR YousifF TownsonS GokoolS et al(2011) Integrated dataset of screening hits against multipleneglected disease pathogens PLoS Negl Trop Dis 5 e1412

13 DerbyshireER PrudencioM MotaMM and ClardyJ (2012)Liver-stage malaria parasites vulnerable to diverse chemicalscaffolds Proc Natl Acad Sci USA 109 8511ndash8516

14 BallellL BatesRH YoungRJ Alvarez-GomezD Alvarez-RuizE BarrosoV BlancoD CrespoB EscribanoJGonzalezR et al (2013) Fueling open-source drug discovery 177small-molecule leads against tuberculosis ChemMedChem 8313ndash321

15 GaoY DaviesSP AugustinM WoodwardA PatelUAKovelmanR and HarveyKJ (2013) A broad activity screen insupport of a chemogenomic map for kinase signalling researchand drug discovery Biochem J 451 313ndash328

16 DranchakP MacArthurR GuhaR ZuercherWJDrewryDH AuldDS and IngleseJ (2013) Profile of the GSKpublished protein kinase inhibitor set across ATP-dependentand-independent luciferases implications for reporter-gene assaysPLoS One 8 e57888

17 HeightmanTD ConwayE CorbettDF MacdonaldGJStempG WestawaySM CelestiniP GagliardiSRiccaboniM RonzoniS et al (2008) Identification of smallmolecule agonists of the motilin receptor Bioorg Med ChemLett 18 6423ndash6428

18 HeightmanTD ScottJS LongleyM BordasV DeanDKElliottR HutleyG WitheringtonJ AbberleyL PassinghamBet al (2007) Potent achiral agonists of the ghrelin (growthhormone secretagogue) receptor Bioorg Med Chem Lett 176584ndash6587

19 MiuraT KuriharaK FuruuchiT YoshidaT and AjitoK(2008) Novel 16-membered macrolides modified at C-12 andC-13 positions of midecamycin A1 and miokamycin Part 1Synthesis and evaluation of 1213-carbamate and 12-arylalkylamino-13-hydroxy analogues Bioorg Med Chem 163985ndash4002

20 OzawaN ShimizuT MoritaR YokonoY OchiaiTMunesadaK OhashiA AidaY HamaY TakiK et al (2004)Transporter database TP-Search a web-accessible comprehensivedatabase for research in pharmacokinetics of drugs Pharm Res21 2133ndash2134

21 UeharaT OnoA MaruyamaT KatoI YamadaH OhnoYand UrushidaniT (2010) The Japanese toxicogenomics projectapplication of toxicogenomics Mol Nutr Food Res 54 218ndash227

22 The United States Pharmacopeial Convention (2010) USPDictionary of USAN and International Drug Names 46th edn TheUnited States Pharmacopeial Convention Rockville MD USA

23 WHO Collaborating Centre for Drug Statistics Methodology(2012) Guidelines for ATC Classification and DDD Assignment2013 16th edn WHO Collaborating Centre for Drug StatisticsMethodology Oslo

24 KrugerFA RostomR and OveringtonJP (2012) Mappingsmall molecule binding data to structural domains BMCBioinformatics 13(Suppl 17) S11

25 LipinskiCA LombardoF DominyBW and FeeneyPJ (2001)Experimental and computational approaches to estimate solubilityand permeability in drug discovery and development settingsAdv Drug Deliv Rev 46 3ndash26

26 CongreveM CarrR MurrayC and JhotiH (2003) A lsquorule ofthreersquo for fragment-based lead discovery Drug Discov Today 8876ndash877

27 BickertonGR PaoliniGV BesnardJ MuresanS andHopkinsAL (2012) Quantifying the chemical beauty of drugsNat Chem 4 90ndash98

28 HopkinsAL GroomCR and AlexA (2004) Ligand efficiencya useful metric for lead selection Drug Discov Today 9430ndash431

29 Abad-ZapateroC and MetzJT (2005) Ligand efficiency indicesas guideposts for drug discovery Drug Discov Today 10464ndash469

30 LeesonPD and SpringthorpeB (2007) The influence of drug-likeconcepts on decision-making in medicinal chemistry Nat RevDrug Discov 6 881ndash890

31 KramerC KalliokoskiT GedeckP and VulpettiA (2012) Theexperimental uncertainty of heterogeneous public K(i) dataJ Med Chem 55 5165ndash5173

32 BienfaitB and ErtlP (2013) JSME a free molecule editor inJavaScript J Cheminform 5 24

33 ChambersJ DaviesM GaultonA HerseyA VelankarSPetryszakR HastingsJ BellisL McGlincheyS andOveringtonJP (2013) UniChem a unified chemical structure cross-referencing and identifier tracking system J Cheminform 5 3

34 The Uniprot Consortium (2013) Update on activities at theUniversal Protein Resource (UniProt) in 2013 Nucleic Acids Res41 D43ndashD47

35 AljaberBS StokesN BaileyJ and PeiJ (2010) Documentclustering of scientific texts using citation contexts Inf Retr 13101ndash131

36 McEntyreJR AnaniadouS AndrewsS BlackWJBoulderstoneR ButteryP ChaplinD ChevuruS CobleyNColemanLA et al (2011) UKPMC a full text article resourcefor the life sciences Nucleic Acids Res 39 D58ndashD65

37 WillettP (1998) Chemical similarity searching J Chem InfComput Sci 38 983ndash996

38 VisserU AbeyruwanS VempatiU SmithRP LemmonVand SchurerSC (2011) BioAssay Ontology (BAO) a semanticdescription of bioassays and high-throughput screening resultsBMC Bioinformatics 12 257

39 GkoutosGV SchofieldPN and HoehndorfR (2012) The UnitsOntology a tool for integrating units of measurement in scienceDatabase 2012 bas033

40 HastingsJ ChepelevL WillighagenE AdamsN SteinbeckCand DumontierM (2011) The chemical information ontologyprovenance and disambiguation for chemical data on thebiological semantic web PLoS One 6 e25513

D1090 Nucleic Acids Research 2014 Vol 42 Database issue

Page 5: The ChEMBL bioactivity database: an update

of the more common have now been made available in thedatabase Ligand Efficiency (LE) (28) Binding EfficiencyIndex (BEI) Surface Efficiency Index (SEI) (29) andLipophilic Ligand Efficiency (LLE) (30) The ligandefficiencies are calculated on the standardizedpChEMBL values (see later in the text) for binding datato protein targets For the BEI and SEI it is also possibleto see a plot of these for a specific target on the TargetReport Card on the ChEMBL Web site and interactivelylook at the structures and select sets of the most ligandefficient molecules that bind to a target of interest

When identifying compounds that bind to a particularprotein target for structurendashactivity relationship or leadidentification studies it is important to be using compar-able data We have therefore standardized many of theactivity types and their corresponding units For exampleIC50 IC50_mean IC50_mM and mean IC50 have all beengiven the standard activity type of IC50 and units of mMmM nmoll and 104moll and so forth have beenstandardized to nM Additionally a number of activitiesare reported in articles as the -log values (eg pKi pIC50-logIC50) we have anti-logged these values so that all ofthe IC50 values (whether reported in the original article asIC50 or pIC50) are seen as IC50 and hence all similaractivity types can be readily identified and their valuesmore easily compared

In addition to the conversion of published activitytypesvaluesunits to standard activity typesvaluesunitsan additional field called pCHEMBL_VALUE has beenadded to the activities table This value allows a numberof roughly comparable measures of half-maximal responseconcentrationpotencyaffinity to be compared on anegative logarithmic scale (eg an IC50 measurement of1 nM has a pChEMBL value of 9) The pChEMBL valueis currently defined as follows log10 (molar IC50 XC50EC50 AC50 Ki Kd or Potency)

Having standardized the activity types units andvalues it is also then possible to identify data that arepotentially erroneous and require further checking withreference to the original article These data are flaggedin the DATA_VALIDITY_COMMENT column ofthe activities table the values of which should be self-explanatory The key ones are lsquoOutside typical rangersquowhere the value is what would normally be consideredtoo high or low a value for the activity type (eg anIC50 value of 109nM) and lsquoNon standard unit for typersquowhere the unit is inappropriate for the activity type(eg an IC50 with units of or mg) A table is availablein the database (ACTIVITY_STDS_LOOKUP) thatcontains details of the activity types that have beenstandardized their permitted standardized units andtheir acceptable value ranges

Lastly the standardization of activity allows the identi-fication of potential duplicate activity values using therules outlined by Kramer et al (31) These are valueswhere an activity measurement reported in an article islikely to be a repeat citation of an earlier measurementrather than an independent measurement A particularexample would be a value reported on a compoundused as a standard in an assay We flag all data wherethe pChEMBL value between the earliest reference

and later references is lt002 (for the same compoundand target pair) with a value of 1 in thePOTENTIAL_DUPLICATE column of the activitiestable Similarly wherever the two pChEMBL valuesdiffer by exactly 3 or 6 log units the activity recordfrom the most recent publication is flagged as alsquoPotential transcription errorrsquo

DATA ACCESS

The ChEMBL interface

The ChEMBL database is accessible via a simple web-based interface at httpswwwebiacukchembl Thisinterface allows users to search the database in anumber of ways A simple key word search box at thetop of the page allows users to find compounds targetsassays or documents containing a search term of interest(by searching various name description and synonymfields) For users wishing to search for compounds bystructure rather than name the lsquoLigand Searchrsquo tabprovides a simple sketcher allowing users to draw a struc-ture or substructure of interest (or import a molfile) andretrieve related compounds (32) Alternatively com-pounds can also be retrieved by CHEMBL_ID smilesor a sequence search against biotherapeutic drugsSimilarly a lsquoTarget Searchrsquo tab allows searching oftargets by CHEMBL_ID or sequence Targets can alsobe browsed either by organism or protein family via thelsquoBrowse Targetsrsquo tab Enhancements have recently beenmade to this tree browser allowing users to search bykey word and identify relevant nodes of the tree and toexpand collapse or select multiple nodes simultaneouslyHaving identified compounds or targets of interest bio-

activity data can be retrieved either from a drop-downmenu on the search results pages or via the ReportCards provided for each ChEMBL compound targetassay or document which contain a series of clickablegraphical widgets for this purposeCompound and Target report card pages have been

improved to incorporate more extensive cross-referencesto other resources For compounds these are nowprovided by the UniChem service (33) whereas forprotein targets cross-references are derived either fromUniProt (34) or through manual annotation A newsection on the Compound and Target Report Cards alsoprovides mechanism of action information where avail-able for approved drugs and their efficacy targetsFollowing the modification of the ChEMBL target datamodel Target Report Cards now provide details of theprotein components of each target (eg in the case of aprotein complex or protein family) in the lsquoTargetComponentsrsquo section and a list of related targets (basedon overlap of protein components) in the lsquoTargetRelationsrsquo section This information allows users torapidly understand the composition of the target theyare viewing and identify any other similar targets thatmay have relevant bioactivity data (see Figure 3)A new feature of the Document Report Card is a

section that lists other publications in ChEMBL that aredeemed similar to the one featured in the Report Card

Nucleic Acids Research 2014 Vol 42 Database issue D1087

Figure 3 Screen capture showing enhancements to the ChEMBL Target Report Card The Target Components section shows which proteins arecomponents of this target (in this case members of the protein family) whereas the Target Relations section shows other targets that are related tothis one because they share one or more of those components The Approved Drugs section shows that there are approved products that are believedto exert at least part of their efficacy through inhibition of phosphodiesterase 4

D1088 Nucleic Acids Research 2014 Vol 42 Database issue

Several methods may be used to assess pairwise documentsimilarity eg overlap of Medical Subject Headings(MeSH) terms (httpwwwnlmnihgovmesh) ordocument clustering based on a term vector approach(35) In our case however the similarity between twodocuments consists of components the first one isdefined by whether a document cites or is referenced bythe other using information retrieved from EuropePMC(36) via the available web services Having establishedpairs of related documents the second component isdefined by the amount of overlap between the compoundsand biological targets reported in those documentsquantified by the Tanimoto coefficient (37) Forexample two articles reporting assay results for thesame set of compounds will have a Tanimoto score of 1as will two documents that report assay results for thesame set of targets The documents with the highestcompound and target Tanimoto similarity scores to thequery document are listed in the Related Documentssection (see Supplementary Figure S2)

In addition to the search tabs and report card pages anumber of tabs show different views of drug data withinthe database The lsquoBrowse Drugsrsquo tab lists FDA-approveddrugs and compounds from the USP DictionaryUSANdocuments together with their various properties andicons (as described in Figure 2) The new lsquoBrowse DrugTargetsrsquo tab lists mechanism of action information for allFDA-approved drugs and WHO anti-malarial drugs withlinks to the relevant Compound and Target Report Cardpages and references Finally the lsquoDrug Approvalsrsquo tabshows the most recently approved FDA drugs with linksto more detailed drug monographs

Downloads and web services

Although the ChEMBL interface provides the basic func-tionality required for many common queries some usersmay prefer to either download the entire database and useit locally (particularly for use in large-scale data miningor integration with other data sources) or retrieve dataprogrammatically via web services

The semantic web is becoming an increasingly popularplatform for large-scale data integration with many nowchoosing to use triple stores and storing data in ResourceDescription Framework (RDF) format in preference tobuilding traditional data warehouses In response todemand from our users an RDF version of ChEMBLhas now been developed and is available for downloadfrom our FTP site (ftpftpebiacukpubdatabaseschembl) The RDF model follows fairly closely the rela-tional model and uses a basic internal ontology known asthe ChEMBL Core Ontology to describe the core conceptsand relationships between them In addition multipleexternal ontologies are used including BioAssay Ontology(38) Unit Ontology (39) Quantities Units Dimensionsand Data Types Ontology (QUDT httpqudtorg) andChemical Information Ontology (CHEMINF) (40) ASPARQL endpoint and Linked Data browser (httpwwwebiacukfgptswlodestar) are also providedallowing users to query and navigate the data httpswwwebiacukrdfserviceschemblsparql

Each release of ChEMBL is also freely available fromour FTP site in a variety of other formats including OracleMySQL PostGRES a structure-data file (SDF) ofcompound structures and a FASTA format file of thetarget sequences under a Creative Commons Attribution-ShareAlike 30 Unported license (httpcreativecommonsorglicensesby-sa30)Finally a set of Representational State Transfer

(REST) based web services is also provided (togetherwith sample Java Perl and Python clients) to allow pro-grammatic retrieval of ChEMBL data in extensible mark-up language or JavaScript Object Notation (JSON)formats (see httpswwwebiacukchemblws for moredetails)

SUPPLEMENTARY DATA

Supplementary Data are available at NAR online

FUNDING

Strategic Award for Chemogenomics from the WellcomeTrust [WT086151Z08Z] European Molecular BiologyLaboratory the Innovative Medicines Initiative JointUndertaking [115191 115002] Medicines for MalariaVenture Funding for open access charge EuropeanMolecular Biology Laboratory

Conflict of interest statement None declared

REFERENCES

1 BesnardJ RudaGF SetolaV AbecassisK RodriguizRMHuangXP NorvalS SassanoMF ShinAI WebsterLAet al (2012) Automated design of ligands to polypharmacologicalprofiles Nature 492 215ndash220

2 HuY and BajorathJ (2012) Extending the activity cliff conceptstructural categorization of activity cliffs and systematicidentification of different types of cliffs in the ChEMBL databaseJ Chem Inf Model 52 1806ndash1811

3 LounkineE KeiserMJ WhitebreadS MikhailovDHamonJ JenkinsJL LavanP WeberE DoakAK CoteSet al (2012) Large-scale prediction and testing of drug activity onside-effect targets Nature 486 361ndash367

4 WirthM ZoeteV MichielinO and SauerWH (2013)SwissBioisostere a database of molecular replacements for liganddesign Nucleic Acids Res 41 D1137ndashD1143

5 Halling-BrownMD BulusuKC PatelM TymJE andAl-LazikaniB (2012) canSAR an integrated cancer publictranslational research and drug discovery resource Nucleic AcidsRes 40 D947ndashD956

6 MagarinosMP CarmonaSJ CrowtherGJ RalphSARoosDS ShanmugamD Van VoorhisWC and AgueroF(2012) TDR Targets a chemogenomics resource for neglecteddiseases Nucleic Acids Res 40 D1118ndashD1127

7 WilliamsAJ HarlandL GrothP PettiferS ChichesterCWillighagenEL EveloCT BlombergN EckerG GobleCet al (2012) Open PHACTS semantic interoperability for drugdiscovery Drug Discov Today 17 1188ndash1198

8 WangY XiaoJ SuzekTO ZhangJ WangJ ZhouZHanL KarapetyanK DrachevaS ShoemakerBA et al(2012) PubChemrsquos BioAssay Database Nucleic Acids Res 40D400ndashD412

9 US Department of Health and Human Services (2013) ApprovedDrug Products with Therapeutic Equivalence Evaluations 33rd ednUS Department of Health and Human Services WashingtonDC USA

Nucleic Acids Research 2014 Vol 42 Database issue D1089

10 GaultonA BellisLJ BentoAP ChambersJ DaviesMHerseyA LightY McGlincheyS MichalovichD Al-LazikaniB et al (2012) ChEMBL a large-scale bioactivitydatabase for drug discovery Nucleic Acids Res 40D1100ndashD1107

11 SpangenbergT BurrowsJN KowalczykP McDonaldSWellsTN and WillisP (2013) The open access malaria box adrug discovery catalyst for neglected diseases PLoS One 8e62906

12 NwakaS BessonD RamirezB MaesL MatheeussenABickleQ MansourNR YousifF TownsonS GokoolS et al(2011) Integrated dataset of screening hits against multipleneglected disease pathogens PLoS Negl Trop Dis 5 e1412

13 DerbyshireER PrudencioM MotaMM and ClardyJ (2012)Liver-stage malaria parasites vulnerable to diverse chemicalscaffolds Proc Natl Acad Sci USA 109 8511ndash8516

14 BallellL BatesRH YoungRJ Alvarez-GomezD Alvarez-RuizE BarrosoV BlancoD CrespoB EscribanoJGonzalezR et al (2013) Fueling open-source drug discovery 177small-molecule leads against tuberculosis ChemMedChem 8313ndash321

15 GaoY DaviesSP AugustinM WoodwardA PatelUAKovelmanR and HarveyKJ (2013) A broad activity screen insupport of a chemogenomic map for kinase signalling researchand drug discovery Biochem J 451 313ndash328

16 DranchakP MacArthurR GuhaR ZuercherWJDrewryDH AuldDS and IngleseJ (2013) Profile of the GSKpublished protein kinase inhibitor set across ATP-dependentand-independent luciferases implications for reporter-gene assaysPLoS One 8 e57888

17 HeightmanTD ConwayE CorbettDF MacdonaldGJStempG WestawaySM CelestiniP GagliardiSRiccaboniM RonzoniS et al (2008) Identification of smallmolecule agonists of the motilin receptor Bioorg Med ChemLett 18 6423ndash6428

18 HeightmanTD ScottJS LongleyM BordasV DeanDKElliottR HutleyG WitheringtonJ AbberleyL PassinghamBet al (2007) Potent achiral agonists of the ghrelin (growthhormone secretagogue) receptor Bioorg Med Chem Lett 176584ndash6587

19 MiuraT KuriharaK FuruuchiT YoshidaT and AjitoK(2008) Novel 16-membered macrolides modified at C-12 andC-13 positions of midecamycin A1 and miokamycin Part 1Synthesis and evaluation of 1213-carbamate and 12-arylalkylamino-13-hydroxy analogues Bioorg Med Chem 163985ndash4002

20 OzawaN ShimizuT MoritaR YokonoY OchiaiTMunesadaK OhashiA AidaY HamaY TakiK et al (2004)Transporter database TP-Search a web-accessible comprehensivedatabase for research in pharmacokinetics of drugs Pharm Res21 2133ndash2134

21 UeharaT OnoA MaruyamaT KatoI YamadaH OhnoYand UrushidaniT (2010) The Japanese toxicogenomics projectapplication of toxicogenomics Mol Nutr Food Res 54 218ndash227

22 The United States Pharmacopeial Convention (2010) USPDictionary of USAN and International Drug Names 46th edn TheUnited States Pharmacopeial Convention Rockville MD USA

23 WHO Collaborating Centre for Drug Statistics Methodology(2012) Guidelines for ATC Classification and DDD Assignment2013 16th edn WHO Collaborating Centre for Drug StatisticsMethodology Oslo

24 KrugerFA RostomR and OveringtonJP (2012) Mappingsmall molecule binding data to structural domains BMCBioinformatics 13(Suppl 17) S11

25 LipinskiCA LombardoF DominyBW and FeeneyPJ (2001)Experimental and computational approaches to estimate solubilityand permeability in drug discovery and development settingsAdv Drug Deliv Rev 46 3ndash26

26 CongreveM CarrR MurrayC and JhotiH (2003) A lsquorule ofthreersquo for fragment-based lead discovery Drug Discov Today 8876ndash877

27 BickertonGR PaoliniGV BesnardJ MuresanS andHopkinsAL (2012) Quantifying the chemical beauty of drugsNat Chem 4 90ndash98

28 HopkinsAL GroomCR and AlexA (2004) Ligand efficiencya useful metric for lead selection Drug Discov Today 9430ndash431

29 Abad-ZapateroC and MetzJT (2005) Ligand efficiency indicesas guideposts for drug discovery Drug Discov Today 10464ndash469

30 LeesonPD and SpringthorpeB (2007) The influence of drug-likeconcepts on decision-making in medicinal chemistry Nat RevDrug Discov 6 881ndash890

31 KramerC KalliokoskiT GedeckP and VulpettiA (2012) Theexperimental uncertainty of heterogeneous public K(i) dataJ Med Chem 55 5165ndash5173

32 BienfaitB and ErtlP (2013) JSME a free molecule editor inJavaScript J Cheminform 5 24

33 ChambersJ DaviesM GaultonA HerseyA VelankarSPetryszakR HastingsJ BellisL McGlincheyS andOveringtonJP (2013) UniChem a unified chemical structure cross-referencing and identifier tracking system J Cheminform 5 3

34 The Uniprot Consortium (2013) Update on activities at theUniversal Protein Resource (UniProt) in 2013 Nucleic Acids Res41 D43ndashD47

35 AljaberBS StokesN BaileyJ and PeiJ (2010) Documentclustering of scientific texts using citation contexts Inf Retr 13101ndash131

36 McEntyreJR AnaniadouS AndrewsS BlackWJBoulderstoneR ButteryP ChaplinD ChevuruS CobleyNColemanLA et al (2011) UKPMC a full text article resourcefor the life sciences Nucleic Acids Res 39 D58ndashD65

37 WillettP (1998) Chemical similarity searching J Chem InfComput Sci 38 983ndash996

38 VisserU AbeyruwanS VempatiU SmithRP LemmonVand SchurerSC (2011) BioAssay Ontology (BAO) a semanticdescription of bioassays and high-throughput screening resultsBMC Bioinformatics 12 257

39 GkoutosGV SchofieldPN and HoehndorfR (2012) The UnitsOntology a tool for integrating units of measurement in scienceDatabase 2012 bas033

40 HastingsJ ChepelevL WillighagenE AdamsN SteinbeckCand DumontierM (2011) The chemical information ontologyprovenance and disambiguation for chemical data on thebiological semantic web PLoS One 6 e25513

D1090 Nucleic Acids Research 2014 Vol 42 Database issue

Page 6: The ChEMBL bioactivity database: an update

Figure 3 Screen capture showing enhancements to the ChEMBL Target Report Card The Target Components section shows which proteins arecomponents of this target (in this case members of the protein family) whereas the Target Relations section shows other targets that are related tothis one because they share one or more of those components The Approved Drugs section shows that there are approved products that are believedto exert at least part of their efficacy through inhibition of phosphodiesterase 4

D1088 Nucleic Acids Research 2014 Vol 42 Database issue

Several methods may be used to assess pairwise documentsimilarity eg overlap of Medical Subject Headings(MeSH) terms (httpwwwnlmnihgovmesh) ordocument clustering based on a term vector approach(35) In our case however the similarity between twodocuments consists of components the first one isdefined by whether a document cites or is referenced bythe other using information retrieved from EuropePMC(36) via the available web services Having establishedpairs of related documents the second component isdefined by the amount of overlap between the compoundsand biological targets reported in those documentsquantified by the Tanimoto coefficient (37) Forexample two articles reporting assay results for thesame set of compounds will have a Tanimoto score of 1as will two documents that report assay results for thesame set of targets The documents with the highestcompound and target Tanimoto similarity scores to thequery document are listed in the Related Documentssection (see Supplementary Figure S2)

In addition to the search tabs and report card pages anumber of tabs show different views of drug data withinthe database The lsquoBrowse Drugsrsquo tab lists FDA-approveddrugs and compounds from the USP DictionaryUSANdocuments together with their various properties andicons (as described in Figure 2) The new lsquoBrowse DrugTargetsrsquo tab lists mechanism of action information for allFDA-approved drugs and WHO anti-malarial drugs withlinks to the relevant Compound and Target Report Cardpages and references Finally the lsquoDrug Approvalsrsquo tabshows the most recently approved FDA drugs with linksto more detailed drug monographs

Downloads and web services

Although the ChEMBL interface provides the basic func-tionality required for many common queries some usersmay prefer to either download the entire database and useit locally (particularly for use in large-scale data miningor integration with other data sources) or retrieve dataprogrammatically via web services

The semantic web is becoming an increasingly popularplatform for large-scale data integration with many nowchoosing to use triple stores and storing data in ResourceDescription Framework (RDF) format in preference tobuilding traditional data warehouses In response todemand from our users an RDF version of ChEMBLhas now been developed and is available for downloadfrom our FTP site (ftpftpebiacukpubdatabaseschembl) The RDF model follows fairly closely the rela-tional model and uses a basic internal ontology known asthe ChEMBL Core Ontology to describe the core conceptsand relationships between them In addition multipleexternal ontologies are used including BioAssay Ontology(38) Unit Ontology (39) Quantities Units Dimensionsand Data Types Ontology (QUDT httpqudtorg) andChemical Information Ontology (CHEMINF) (40) ASPARQL endpoint and Linked Data browser (httpwwwebiacukfgptswlodestar) are also providedallowing users to query and navigate the data httpswwwebiacukrdfserviceschemblsparql

Each release of ChEMBL is also freely available fromour FTP site in a variety of other formats including OracleMySQL PostGRES a structure-data file (SDF) ofcompound structures and a FASTA format file of thetarget sequences under a Creative Commons Attribution-ShareAlike 30 Unported license (httpcreativecommonsorglicensesby-sa30)Finally a set of Representational State Transfer

(REST) based web services is also provided (togetherwith sample Java Perl and Python clients) to allow pro-grammatic retrieval of ChEMBL data in extensible mark-up language or JavaScript Object Notation (JSON)formats (see httpswwwebiacukchemblws for moredetails)

SUPPLEMENTARY DATA

Supplementary Data are available at NAR online

FUNDING

Strategic Award for Chemogenomics from the WellcomeTrust [WT086151Z08Z] European Molecular BiologyLaboratory the Innovative Medicines Initiative JointUndertaking [115191 115002] Medicines for MalariaVenture Funding for open access charge EuropeanMolecular Biology Laboratory

Conflict of interest statement None declared

REFERENCES

1 BesnardJ RudaGF SetolaV AbecassisK RodriguizRMHuangXP NorvalS SassanoMF ShinAI WebsterLAet al (2012) Automated design of ligands to polypharmacologicalprofiles Nature 492 215ndash220

2 HuY and BajorathJ (2012) Extending the activity cliff conceptstructural categorization of activity cliffs and systematicidentification of different types of cliffs in the ChEMBL databaseJ Chem Inf Model 52 1806ndash1811

3 LounkineE KeiserMJ WhitebreadS MikhailovDHamonJ JenkinsJL LavanP WeberE DoakAK CoteSet al (2012) Large-scale prediction and testing of drug activity onside-effect targets Nature 486 361ndash367

4 WirthM ZoeteV MichielinO and SauerWH (2013)SwissBioisostere a database of molecular replacements for liganddesign Nucleic Acids Res 41 D1137ndashD1143

5 Halling-BrownMD BulusuKC PatelM TymJE andAl-LazikaniB (2012) canSAR an integrated cancer publictranslational research and drug discovery resource Nucleic AcidsRes 40 D947ndashD956

6 MagarinosMP CarmonaSJ CrowtherGJ RalphSARoosDS ShanmugamD Van VoorhisWC and AgueroF(2012) TDR Targets a chemogenomics resource for neglecteddiseases Nucleic Acids Res 40 D1118ndashD1127

7 WilliamsAJ HarlandL GrothP PettiferS ChichesterCWillighagenEL EveloCT BlombergN EckerG GobleCet al (2012) Open PHACTS semantic interoperability for drugdiscovery Drug Discov Today 17 1188ndash1198

8 WangY XiaoJ SuzekTO ZhangJ WangJ ZhouZHanL KarapetyanK DrachevaS ShoemakerBA et al(2012) PubChemrsquos BioAssay Database Nucleic Acids Res 40D400ndashD412

9 US Department of Health and Human Services (2013) ApprovedDrug Products with Therapeutic Equivalence Evaluations 33rd ednUS Department of Health and Human Services WashingtonDC USA

Nucleic Acids Research 2014 Vol 42 Database issue D1089

10 GaultonA BellisLJ BentoAP ChambersJ DaviesMHerseyA LightY McGlincheyS MichalovichD Al-LazikaniB et al (2012) ChEMBL a large-scale bioactivitydatabase for drug discovery Nucleic Acids Res 40D1100ndashD1107

11 SpangenbergT BurrowsJN KowalczykP McDonaldSWellsTN and WillisP (2013) The open access malaria box adrug discovery catalyst for neglected diseases PLoS One 8e62906

12 NwakaS BessonD RamirezB MaesL MatheeussenABickleQ MansourNR YousifF TownsonS GokoolS et al(2011) Integrated dataset of screening hits against multipleneglected disease pathogens PLoS Negl Trop Dis 5 e1412

13 DerbyshireER PrudencioM MotaMM and ClardyJ (2012)Liver-stage malaria parasites vulnerable to diverse chemicalscaffolds Proc Natl Acad Sci USA 109 8511ndash8516

14 BallellL BatesRH YoungRJ Alvarez-GomezD Alvarez-RuizE BarrosoV BlancoD CrespoB EscribanoJGonzalezR et al (2013) Fueling open-source drug discovery 177small-molecule leads against tuberculosis ChemMedChem 8313ndash321

15 GaoY DaviesSP AugustinM WoodwardA PatelUAKovelmanR and HarveyKJ (2013) A broad activity screen insupport of a chemogenomic map for kinase signalling researchand drug discovery Biochem J 451 313ndash328

16 DranchakP MacArthurR GuhaR ZuercherWJDrewryDH AuldDS and IngleseJ (2013) Profile of the GSKpublished protein kinase inhibitor set across ATP-dependentand-independent luciferases implications for reporter-gene assaysPLoS One 8 e57888

17 HeightmanTD ConwayE CorbettDF MacdonaldGJStempG WestawaySM CelestiniP GagliardiSRiccaboniM RonzoniS et al (2008) Identification of smallmolecule agonists of the motilin receptor Bioorg Med ChemLett 18 6423ndash6428

18 HeightmanTD ScottJS LongleyM BordasV DeanDKElliottR HutleyG WitheringtonJ AbberleyL PassinghamBet al (2007) Potent achiral agonists of the ghrelin (growthhormone secretagogue) receptor Bioorg Med Chem Lett 176584ndash6587

19 MiuraT KuriharaK FuruuchiT YoshidaT and AjitoK(2008) Novel 16-membered macrolides modified at C-12 andC-13 positions of midecamycin A1 and miokamycin Part 1Synthesis and evaluation of 1213-carbamate and 12-arylalkylamino-13-hydroxy analogues Bioorg Med Chem 163985ndash4002

20 OzawaN ShimizuT MoritaR YokonoY OchiaiTMunesadaK OhashiA AidaY HamaY TakiK et al (2004)Transporter database TP-Search a web-accessible comprehensivedatabase for research in pharmacokinetics of drugs Pharm Res21 2133ndash2134

21 UeharaT OnoA MaruyamaT KatoI YamadaH OhnoYand UrushidaniT (2010) The Japanese toxicogenomics projectapplication of toxicogenomics Mol Nutr Food Res 54 218ndash227

22 The United States Pharmacopeial Convention (2010) USPDictionary of USAN and International Drug Names 46th edn TheUnited States Pharmacopeial Convention Rockville MD USA

23 WHO Collaborating Centre for Drug Statistics Methodology(2012) Guidelines for ATC Classification and DDD Assignment2013 16th edn WHO Collaborating Centre for Drug StatisticsMethodology Oslo

24 KrugerFA RostomR and OveringtonJP (2012) Mappingsmall molecule binding data to structural domains BMCBioinformatics 13(Suppl 17) S11

25 LipinskiCA LombardoF DominyBW and FeeneyPJ (2001)Experimental and computational approaches to estimate solubilityand permeability in drug discovery and development settingsAdv Drug Deliv Rev 46 3ndash26

26 CongreveM CarrR MurrayC and JhotiH (2003) A lsquorule ofthreersquo for fragment-based lead discovery Drug Discov Today 8876ndash877

27 BickertonGR PaoliniGV BesnardJ MuresanS andHopkinsAL (2012) Quantifying the chemical beauty of drugsNat Chem 4 90ndash98

28 HopkinsAL GroomCR and AlexA (2004) Ligand efficiencya useful metric for lead selection Drug Discov Today 9430ndash431

29 Abad-ZapateroC and MetzJT (2005) Ligand efficiency indicesas guideposts for drug discovery Drug Discov Today 10464ndash469

30 LeesonPD and SpringthorpeB (2007) The influence of drug-likeconcepts on decision-making in medicinal chemistry Nat RevDrug Discov 6 881ndash890

31 KramerC KalliokoskiT GedeckP and VulpettiA (2012) Theexperimental uncertainty of heterogeneous public K(i) dataJ Med Chem 55 5165ndash5173

32 BienfaitB and ErtlP (2013) JSME a free molecule editor inJavaScript J Cheminform 5 24

33 ChambersJ DaviesM GaultonA HerseyA VelankarSPetryszakR HastingsJ BellisL McGlincheyS andOveringtonJP (2013) UniChem a unified chemical structure cross-referencing and identifier tracking system J Cheminform 5 3

34 The Uniprot Consortium (2013) Update on activities at theUniversal Protein Resource (UniProt) in 2013 Nucleic Acids Res41 D43ndashD47

35 AljaberBS StokesN BaileyJ and PeiJ (2010) Documentclustering of scientific texts using citation contexts Inf Retr 13101ndash131

36 McEntyreJR AnaniadouS AndrewsS BlackWJBoulderstoneR ButteryP ChaplinD ChevuruS CobleyNColemanLA et al (2011) UKPMC a full text article resourcefor the life sciences Nucleic Acids Res 39 D58ndashD65

37 WillettP (1998) Chemical similarity searching J Chem InfComput Sci 38 983ndash996

38 VisserU AbeyruwanS VempatiU SmithRP LemmonVand SchurerSC (2011) BioAssay Ontology (BAO) a semanticdescription of bioassays and high-throughput screening resultsBMC Bioinformatics 12 257

39 GkoutosGV SchofieldPN and HoehndorfR (2012) The UnitsOntology a tool for integrating units of measurement in scienceDatabase 2012 bas033

40 HastingsJ ChepelevL WillighagenE AdamsN SteinbeckCand DumontierM (2011) The chemical information ontologyprovenance and disambiguation for chemical data on thebiological semantic web PLoS One 6 e25513

D1090 Nucleic Acids Research 2014 Vol 42 Database issue

Page 7: The ChEMBL bioactivity database: an update

Several methods may be used to assess pairwise documentsimilarity eg overlap of Medical Subject Headings(MeSH) terms (httpwwwnlmnihgovmesh) ordocument clustering based on a term vector approach(35) In our case however the similarity between twodocuments consists of components the first one isdefined by whether a document cites or is referenced bythe other using information retrieved from EuropePMC(36) via the available web services Having establishedpairs of related documents the second component isdefined by the amount of overlap between the compoundsand biological targets reported in those documentsquantified by the Tanimoto coefficient (37) Forexample two articles reporting assay results for thesame set of compounds will have a Tanimoto score of 1as will two documents that report assay results for thesame set of targets The documents with the highestcompound and target Tanimoto similarity scores to thequery document are listed in the Related Documentssection (see Supplementary Figure S2)

In addition to the search tabs and report card pages anumber of tabs show different views of drug data withinthe database The lsquoBrowse Drugsrsquo tab lists FDA-approveddrugs and compounds from the USP DictionaryUSANdocuments together with their various properties andicons (as described in Figure 2) The new lsquoBrowse DrugTargetsrsquo tab lists mechanism of action information for allFDA-approved drugs and WHO anti-malarial drugs withlinks to the relevant Compound and Target Report Cardpages and references Finally the lsquoDrug Approvalsrsquo tabshows the most recently approved FDA drugs with linksto more detailed drug monographs

Downloads and web services

Although the ChEMBL interface provides the basic func-tionality required for many common queries some usersmay prefer to either download the entire database and useit locally (particularly for use in large-scale data miningor integration with other data sources) or retrieve dataprogrammatically via web services

The semantic web is becoming an increasingly popularplatform for large-scale data integration with many nowchoosing to use triple stores and storing data in ResourceDescription Framework (RDF) format in preference tobuilding traditional data warehouses In response todemand from our users an RDF version of ChEMBLhas now been developed and is available for downloadfrom our FTP site (ftpftpebiacukpubdatabaseschembl) The RDF model follows fairly closely the rela-tional model and uses a basic internal ontology known asthe ChEMBL Core Ontology to describe the core conceptsand relationships between them In addition multipleexternal ontologies are used including BioAssay Ontology(38) Unit Ontology (39) Quantities Units Dimensionsand Data Types Ontology (QUDT httpqudtorg) andChemical Information Ontology (CHEMINF) (40) ASPARQL endpoint and Linked Data browser (httpwwwebiacukfgptswlodestar) are also providedallowing users to query and navigate the data httpswwwebiacukrdfserviceschemblsparql

Each release of ChEMBL is also freely available fromour FTP site in a variety of other formats including OracleMySQL PostGRES a structure-data file (SDF) ofcompound structures and a FASTA format file of thetarget sequences under a Creative Commons Attribution-ShareAlike 30 Unported license (httpcreativecommonsorglicensesby-sa30)Finally a set of Representational State Transfer

(REST) based web services is also provided (togetherwith sample Java Perl and Python clients) to allow pro-grammatic retrieval of ChEMBL data in extensible mark-up language or JavaScript Object Notation (JSON)formats (see httpswwwebiacukchemblws for moredetails)

SUPPLEMENTARY DATA

Supplementary Data are available at NAR online

FUNDING

Strategic Award for Chemogenomics from the WellcomeTrust [WT086151Z08Z] European Molecular BiologyLaboratory the Innovative Medicines Initiative JointUndertaking [115191 115002] Medicines for MalariaVenture Funding for open access charge EuropeanMolecular Biology Laboratory

Conflict of interest statement None declared

REFERENCES

1 BesnardJ RudaGF SetolaV AbecassisK RodriguizRMHuangXP NorvalS SassanoMF ShinAI WebsterLAet al (2012) Automated design of ligands to polypharmacologicalprofiles Nature 492 215ndash220

2 HuY and BajorathJ (2012) Extending the activity cliff conceptstructural categorization of activity cliffs and systematicidentification of different types of cliffs in the ChEMBL databaseJ Chem Inf Model 52 1806ndash1811

3 LounkineE KeiserMJ WhitebreadS MikhailovDHamonJ JenkinsJL LavanP WeberE DoakAK CoteSet al (2012) Large-scale prediction and testing of drug activity onside-effect targets Nature 486 361ndash367

4 WirthM ZoeteV MichielinO and SauerWH (2013)SwissBioisostere a database of molecular replacements for liganddesign Nucleic Acids Res 41 D1137ndashD1143

5 Halling-BrownMD BulusuKC PatelM TymJE andAl-LazikaniB (2012) canSAR an integrated cancer publictranslational research and drug discovery resource Nucleic AcidsRes 40 D947ndashD956

6 MagarinosMP CarmonaSJ CrowtherGJ RalphSARoosDS ShanmugamD Van VoorhisWC and AgueroF(2012) TDR Targets a chemogenomics resource for neglecteddiseases Nucleic Acids Res 40 D1118ndashD1127

7 WilliamsAJ HarlandL GrothP PettiferS ChichesterCWillighagenEL EveloCT BlombergN EckerG GobleCet al (2012) Open PHACTS semantic interoperability for drugdiscovery Drug Discov Today 17 1188ndash1198

8 WangY XiaoJ SuzekTO ZhangJ WangJ ZhouZHanL KarapetyanK DrachevaS ShoemakerBA et al(2012) PubChemrsquos BioAssay Database Nucleic Acids Res 40D400ndashD412

9 US Department of Health and Human Services (2013) ApprovedDrug Products with Therapeutic Equivalence Evaluations 33rd ednUS Department of Health and Human Services WashingtonDC USA

Nucleic Acids Research 2014 Vol 42 Database issue D1089

10 GaultonA BellisLJ BentoAP ChambersJ DaviesMHerseyA LightY McGlincheyS MichalovichD Al-LazikaniB et al (2012) ChEMBL a large-scale bioactivitydatabase for drug discovery Nucleic Acids Res 40D1100ndashD1107

11 SpangenbergT BurrowsJN KowalczykP McDonaldSWellsTN and WillisP (2013) The open access malaria box adrug discovery catalyst for neglected diseases PLoS One 8e62906

12 NwakaS BessonD RamirezB MaesL MatheeussenABickleQ MansourNR YousifF TownsonS GokoolS et al(2011) Integrated dataset of screening hits against multipleneglected disease pathogens PLoS Negl Trop Dis 5 e1412

13 DerbyshireER PrudencioM MotaMM and ClardyJ (2012)Liver-stage malaria parasites vulnerable to diverse chemicalscaffolds Proc Natl Acad Sci USA 109 8511ndash8516

14 BallellL BatesRH YoungRJ Alvarez-GomezD Alvarez-RuizE BarrosoV BlancoD CrespoB EscribanoJGonzalezR et al (2013) Fueling open-source drug discovery 177small-molecule leads against tuberculosis ChemMedChem 8313ndash321

15 GaoY DaviesSP AugustinM WoodwardA PatelUAKovelmanR and HarveyKJ (2013) A broad activity screen insupport of a chemogenomic map for kinase signalling researchand drug discovery Biochem J 451 313ndash328

16 DranchakP MacArthurR GuhaR ZuercherWJDrewryDH AuldDS and IngleseJ (2013) Profile of the GSKpublished protein kinase inhibitor set across ATP-dependentand-independent luciferases implications for reporter-gene assaysPLoS One 8 e57888

17 HeightmanTD ConwayE CorbettDF MacdonaldGJStempG WestawaySM CelestiniP GagliardiSRiccaboniM RonzoniS et al (2008) Identification of smallmolecule agonists of the motilin receptor Bioorg Med ChemLett 18 6423ndash6428

18 HeightmanTD ScottJS LongleyM BordasV DeanDKElliottR HutleyG WitheringtonJ AbberleyL PassinghamBet al (2007) Potent achiral agonists of the ghrelin (growthhormone secretagogue) receptor Bioorg Med Chem Lett 176584ndash6587

19 MiuraT KuriharaK FuruuchiT YoshidaT and AjitoK(2008) Novel 16-membered macrolides modified at C-12 andC-13 positions of midecamycin A1 and miokamycin Part 1Synthesis and evaluation of 1213-carbamate and 12-arylalkylamino-13-hydroxy analogues Bioorg Med Chem 163985ndash4002

20 OzawaN ShimizuT MoritaR YokonoY OchiaiTMunesadaK OhashiA AidaY HamaY TakiK et al (2004)Transporter database TP-Search a web-accessible comprehensivedatabase for research in pharmacokinetics of drugs Pharm Res21 2133ndash2134

21 UeharaT OnoA MaruyamaT KatoI YamadaH OhnoYand UrushidaniT (2010) The Japanese toxicogenomics projectapplication of toxicogenomics Mol Nutr Food Res 54 218ndash227

22 The United States Pharmacopeial Convention (2010) USPDictionary of USAN and International Drug Names 46th edn TheUnited States Pharmacopeial Convention Rockville MD USA

23 WHO Collaborating Centre for Drug Statistics Methodology(2012) Guidelines for ATC Classification and DDD Assignment2013 16th edn WHO Collaborating Centre for Drug StatisticsMethodology Oslo

24 KrugerFA RostomR and OveringtonJP (2012) Mappingsmall molecule binding data to structural domains BMCBioinformatics 13(Suppl 17) S11

25 LipinskiCA LombardoF DominyBW and FeeneyPJ (2001)Experimental and computational approaches to estimate solubilityand permeability in drug discovery and development settingsAdv Drug Deliv Rev 46 3ndash26

26 CongreveM CarrR MurrayC and JhotiH (2003) A lsquorule ofthreersquo for fragment-based lead discovery Drug Discov Today 8876ndash877

27 BickertonGR PaoliniGV BesnardJ MuresanS andHopkinsAL (2012) Quantifying the chemical beauty of drugsNat Chem 4 90ndash98

28 HopkinsAL GroomCR and AlexA (2004) Ligand efficiencya useful metric for lead selection Drug Discov Today 9430ndash431

29 Abad-ZapateroC and MetzJT (2005) Ligand efficiency indicesas guideposts for drug discovery Drug Discov Today 10464ndash469

30 LeesonPD and SpringthorpeB (2007) The influence of drug-likeconcepts on decision-making in medicinal chemistry Nat RevDrug Discov 6 881ndash890

31 KramerC KalliokoskiT GedeckP and VulpettiA (2012) Theexperimental uncertainty of heterogeneous public K(i) dataJ Med Chem 55 5165ndash5173

32 BienfaitB and ErtlP (2013) JSME a free molecule editor inJavaScript J Cheminform 5 24

33 ChambersJ DaviesM GaultonA HerseyA VelankarSPetryszakR HastingsJ BellisL McGlincheyS andOveringtonJP (2013) UniChem a unified chemical structure cross-referencing and identifier tracking system J Cheminform 5 3

34 The Uniprot Consortium (2013) Update on activities at theUniversal Protein Resource (UniProt) in 2013 Nucleic Acids Res41 D43ndashD47

35 AljaberBS StokesN BaileyJ and PeiJ (2010) Documentclustering of scientific texts using citation contexts Inf Retr 13101ndash131

36 McEntyreJR AnaniadouS AndrewsS BlackWJBoulderstoneR ButteryP ChaplinD ChevuruS CobleyNColemanLA et al (2011) UKPMC a full text article resourcefor the life sciences Nucleic Acids Res 39 D58ndashD65

37 WillettP (1998) Chemical similarity searching J Chem InfComput Sci 38 983ndash996

38 VisserU AbeyruwanS VempatiU SmithRP LemmonVand SchurerSC (2011) BioAssay Ontology (BAO) a semanticdescription of bioassays and high-throughput screening resultsBMC Bioinformatics 12 257

39 GkoutosGV SchofieldPN and HoehndorfR (2012) The UnitsOntology a tool for integrating units of measurement in scienceDatabase 2012 bas033

40 HastingsJ ChepelevL WillighagenE AdamsN SteinbeckCand DumontierM (2011) The chemical information ontologyprovenance and disambiguation for chemical data on thebiological semantic web PLoS One 6 e25513

D1090 Nucleic Acids Research 2014 Vol 42 Database issue

Page 8: The ChEMBL bioactivity database: an update

10 GaultonA BellisLJ BentoAP ChambersJ DaviesMHerseyA LightY McGlincheyS MichalovichD Al-LazikaniB et al (2012) ChEMBL a large-scale bioactivitydatabase for drug discovery Nucleic Acids Res 40D1100ndashD1107

11 SpangenbergT BurrowsJN KowalczykP McDonaldSWellsTN and WillisP (2013) The open access malaria box adrug discovery catalyst for neglected diseases PLoS One 8e62906

12 NwakaS BessonD RamirezB MaesL MatheeussenABickleQ MansourNR YousifF TownsonS GokoolS et al(2011) Integrated dataset of screening hits against multipleneglected disease pathogens PLoS Negl Trop Dis 5 e1412

13 DerbyshireER PrudencioM MotaMM and ClardyJ (2012)Liver-stage malaria parasites vulnerable to diverse chemicalscaffolds Proc Natl Acad Sci USA 109 8511ndash8516

14 BallellL BatesRH YoungRJ Alvarez-GomezD Alvarez-RuizE BarrosoV BlancoD CrespoB EscribanoJGonzalezR et al (2013) Fueling open-source drug discovery 177small-molecule leads against tuberculosis ChemMedChem 8313ndash321

15 GaoY DaviesSP AugustinM WoodwardA PatelUAKovelmanR and HarveyKJ (2013) A broad activity screen insupport of a chemogenomic map for kinase signalling researchand drug discovery Biochem J 451 313ndash328

16 DranchakP MacArthurR GuhaR ZuercherWJDrewryDH AuldDS and IngleseJ (2013) Profile of the GSKpublished protein kinase inhibitor set across ATP-dependentand-independent luciferases implications for reporter-gene assaysPLoS One 8 e57888

17 HeightmanTD ConwayE CorbettDF MacdonaldGJStempG WestawaySM CelestiniP GagliardiSRiccaboniM RonzoniS et al (2008) Identification of smallmolecule agonists of the motilin receptor Bioorg Med ChemLett 18 6423ndash6428

18 HeightmanTD ScottJS LongleyM BordasV DeanDKElliottR HutleyG WitheringtonJ AbberleyL PassinghamBet al (2007) Potent achiral agonists of the ghrelin (growthhormone secretagogue) receptor Bioorg Med Chem Lett 176584ndash6587

19 MiuraT KuriharaK FuruuchiT YoshidaT and AjitoK(2008) Novel 16-membered macrolides modified at C-12 andC-13 positions of midecamycin A1 and miokamycin Part 1Synthesis and evaluation of 1213-carbamate and 12-arylalkylamino-13-hydroxy analogues Bioorg Med Chem 163985ndash4002

20 OzawaN ShimizuT MoritaR YokonoY OchiaiTMunesadaK OhashiA AidaY HamaY TakiK et al (2004)Transporter database TP-Search a web-accessible comprehensivedatabase for research in pharmacokinetics of drugs Pharm Res21 2133ndash2134

21 UeharaT OnoA MaruyamaT KatoI YamadaH OhnoYand UrushidaniT (2010) The Japanese toxicogenomics projectapplication of toxicogenomics Mol Nutr Food Res 54 218ndash227

22 The United States Pharmacopeial Convention (2010) USPDictionary of USAN and International Drug Names 46th edn TheUnited States Pharmacopeial Convention Rockville MD USA

23 WHO Collaborating Centre for Drug Statistics Methodology(2012) Guidelines for ATC Classification and DDD Assignment2013 16th edn WHO Collaborating Centre for Drug StatisticsMethodology Oslo

24 KrugerFA RostomR and OveringtonJP (2012) Mappingsmall molecule binding data to structural domains BMCBioinformatics 13(Suppl 17) S11

25 LipinskiCA LombardoF DominyBW and FeeneyPJ (2001)Experimental and computational approaches to estimate solubilityand permeability in drug discovery and development settingsAdv Drug Deliv Rev 46 3ndash26

26 CongreveM CarrR MurrayC and JhotiH (2003) A lsquorule ofthreersquo for fragment-based lead discovery Drug Discov Today 8876ndash877

27 BickertonGR PaoliniGV BesnardJ MuresanS andHopkinsAL (2012) Quantifying the chemical beauty of drugsNat Chem 4 90ndash98

28 HopkinsAL GroomCR and AlexA (2004) Ligand efficiencya useful metric for lead selection Drug Discov Today 9430ndash431

29 Abad-ZapateroC and MetzJT (2005) Ligand efficiency indicesas guideposts for drug discovery Drug Discov Today 10464ndash469

30 LeesonPD and SpringthorpeB (2007) The influence of drug-likeconcepts on decision-making in medicinal chemistry Nat RevDrug Discov 6 881ndash890

31 KramerC KalliokoskiT GedeckP and VulpettiA (2012) Theexperimental uncertainty of heterogeneous public K(i) dataJ Med Chem 55 5165ndash5173

32 BienfaitB and ErtlP (2013) JSME a free molecule editor inJavaScript J Cheminform 5 24

33 ChambersJ DaviesM GaultonA HerseyA VelankarSPetryszakR HastingsJ BellisL McGlincheyS andOveringtonJP (2013) UniChem a unified chemical structure cross-referencing and identifier tracking system J Cheminform 5 3

34 The Uniprot Consortium (2013) Update on activities at theUniversal Protein Resource (UniProt) in 2013 Nucleic Acids Res41 D43ndashD47

35 AljaberBS StokesN BaileyJ and PeiJ (2010) Documentclustering of scientific texts using citation contexts Inf Retr 13101ndash131

36 McEntyreJR AnaniadouS AndrewsS BlackWJBoulderstoneR ButteryP ChaplinD ChevuruS CobleyNColemanLA et al (2011) UKPMC a full text article resourcefor the life sciences Nucleic Acids Res 39 D58ndashD65

37 WillettP (1998) Chemical similarity searching J Chem InfComput Sci 38 983ndash996

38 VisserU AbeyruwanS VempatiU SmithRP LemmonVand SchurerSC (2011) BioAssay Ontology (BAO) a semanticdescription of bioassays and high-throughput screening resultsBMC Bioinformatics 12 257

39 GkoutosGV SchofieldPN and HoehndorfR (2012) The UnitsOntology a tool for integrating units of measurement in scienceDatabase 2012 bas033

40 HastingsJ ChepelevL WillighagenE AdamsN SteinbeckCand DumontierM (2011) The chemical information ontologyprovenance and disambiguation for chemical data on thebiological semantic web PLoS One 6 e25513

D1090 Nucleic Acids Research 2014 Vol 42 Database issue